Contest: Master Data Management
Problem: MasterDataManagement

Problem Statement


Prize Distribution

  • 1st - $10,000
  • 2nd - $5,000
  • 3rd - $2,500
  • 4th - $1,500
  • 5th - $1,000
Bonus Prize - $100 per winning submission

We will ask the 5 winners to submit a Docker file for their solution and will provide guidance in that regard. We will pay each winner who submits a Docker file an additional $100 for their efforts.

If you choose to not submit a Docker file, you will need to document precisely the steps that need to be taken to recreate your submission.

Requirements to Win a Prize

In order to receive a prize, you must do all the following:
  1. Achieve a score in the top 5, according to system test results. See the "Scoring" section below.
  2. Create a legitimate algorithm that runs successfully on a different data set with the same fields. Hard-coded solutions are unacceptable.
  3. Within 7 days from the end of the challenge, submit a complete report at least 2 pages long outlining your final algorithm, explaining the logic behind and steps to its approach. The required content and format appear in the "Report" section below.
  4. Within 7 days of the end of the challenge, submit all code used in your final algorithm in 1 appropriately named file (or tar or zip archive). We will contact the winners via email and ask for the file. The naming convention should be memberHandle-masterDataManagementContest. For example, handle "johndoe" would name his submission "johndoe-masterDataManagementContest."

If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.


As healthcare systems evolve through mergers, acquisitions, and partnerships, it becomes difficult to identify duplicate and erroneous information about entities such as doctors, practices, and clinics when data from various sources is combined. As the number and frequency of these mergers increase, there is a growing need to establish a single "source of truth," using data from key business domains. Examples of these domains include healthcare provider, patient, payer, location, item, medication, procedures, information management and reporting, facility, diagnoses, guidelines, and codes. The processes for creating, managing, and governing a single "source of truth" are broadly classified under Master Data Management (MDM).

In MDM, it is necessary to reconcile inconsistencies and eliminate redundancies. Duplicate records are one such redundancy, and they tend to occur because of one or more of the following:

  1. Data Entry Errors: Mistakes during data entry can create duplicate records for the same person. For example, a certain physician, Josef Weininger, may have been recorded erroneously as Joseph Weininger.
  2. Absent Data Fields: Records sourced from multiple systems can differ because the systems' data fields are inconsistent. For example, "Dr. Josef Weininger" in one system may be registered as "Josef M. Weininger, MD" in another.
  3. Synonyms, Abbreviations, and Contractions: Such issues arise when data models vary between different systems. Examples include Suite vs. Ste., DC vs. District of Columbia, NYC vs. New York City, and 5-digit vs. 9-digit zip codes.
  4. Equivalent Medical Terms: Dr. Weininger may be called a "Skin Doctor" in one practice and a "Dermatologist" in the other practice. Both refer to the same specialization and are likely the same doctor.
  5. Related Terms: Dr. Weininger may be called a nephrologist (i.e., a kidney specialist) in one practice and a urologist (i.e., a urinary specialist) in another practice. This makes sense, given the close relationship between the kidneys and the urinary system. However, it is less likely that Dr. Weininger the nephrologist (i.e., kidney doctor) is the same as Dr. Weininger the neurologist (i.e., brain specialist).
  6. Name Changes: Some providers may change their last names when they get married or divorced. When different data systems are merged, if the same name changes have not been made in both systems, then multiple records in the merged system may refer to the same provider.
In the Resources section below, we list thesauri and other tools that may be useful. You are also highly encouraged to use additional open source tools and data sources.
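As a simple illustration of catching data-entry variants such as "Josef" vs. "Joseph", fuzzy string matching from the Python standard library can be used. This is only a sketch, not the required algorithm, and the normalization rules (stripping "Dr." and "MD") are assumptions, not part of the problem statement:

```python
# Sketch: flag likely data-entry duplicates with fuzzy string matching.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after light normalization."""
    def normalize(s: str) -> str:
        # Lowercase, strip punctuation, and drop common titles/suffixes.
        # These rules are assumptions about the data, for illustration only.
        tokens = [t.strip(".,").lower() for t in s.split()]
        return " ".join(t for t in tokens if t not in {"dr", "md"})
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# "Josef" vs. "Joseph" scores high; unrelated names score low.
print(name_similarity("Dr. Josef Weininger", "Joseph Weininger, MD"))
print(name_similarity("Josef Weininger", "Ann Smith"))
```

A threshold on such a score could feed into the duplicate probability, though more robust distances (e.g., Jaro-Winkler from open-source libraries) are worth considering.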


Create an algorithm that:
  • Receives a "dirty" data set containing information about people (e.g., physicians, allied health providers) and organizations (e.g., health service organizations, accountable care organizations). Each of the n rows of data has an index ranging from 0 to (n - 1).
  • Returns an n x n upper triangular matrix M with entries Mij and having the following structure.
    • For all i > j, Mij = 0.0
    • For all i = j, Mij = 1.0, since every row is a duplicate of itself.
    • For all i < j, 0.0 <= Mij <= 1.0 indicates the probability that the ith row is a duplicate of the jth row. For example, if the ith row has an 80% chance of being a duplicate of the jth row, Mij = 0.8.
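The required matrix shape can be sketched in a few lines of Python (the probability below is made up for illustration):

```python
# Build an n x n upper triangular matrix with the required structure.
n = 3
M = [[0.0] * n for _ in range(n)]   # all entries start at 0.0 (covers i > j)
for i in range(n):
    M[i][i] = 1.0                   # every row is a duplicate of itself
# Hypothetical prediction: row 0 has an 80% chance of duplicating row 2.
M[0][2] = 0.8
```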

You may download a data set, which includes ground truth, to develop your algorithm here. You will need to test your code against this data set and submit results as a csv file on our site. The csv file will be used to score your submission.

Data Description

The data is in csv format with the following columns:

  1. ID (Integer) - Index from 0 to n-1
  2. Name (String) - Provider name
  3. Address (String) - Provider address
  4. Taxonomies (String) - Comma separated list of healthcare taxonomies

N.B.: Multiple provider taxonomies may be given for a single provider.

Ground truth for the training set is provided in a .csv file as a list of duplicate pairs. Each row contains two comma-separated ID numbers that form a duplicate pair.

Data Format Requirements

A single .csv file must be created which contains the answers for the given test. The .csv file should not contain any header. Each line of the .csv file will correspond to a single duplicate pair. Each line contains the following values, comma separated: the id of the first record (i), the id of the second record (j), and the probability of these records being duplicates (Mij). The index of the first record (i) must be less than the index of the second record (j), and the probability must be less than or equal to 1 and greater than 0. For example, the row "45,300,0.75" means that there is a 75% chance that records 45 and 300 are duplicates.

All pairs which are not specified in your .csv file will be assumed to have a duplicate probability of zero for scoring. Identical pair indices, where i=j, should not be specified as they are guaranteed to be equal.
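The format above can be produced with a short Python script; the `pairs` dictionary below holds hypothetical predictions, not real output:

```python
# Sketch: write predicted duplicate pairs in the required .csv format.
# Pairs with probability 0 are simply omitted (they default to 0 in scoring).
import csv

pairs = {(45, 300): 0.75, (12, 99): 0.4}   # hypothetical predictions

with open("answer.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for (i, j), p in sorted(pairs.items()):
        if i < j and 0.0 < p <= 1.0:       # format constraints from above
            writer.writerow([i, j, p])
```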


During the contest, only your results will be submitted. You will submit code which implements only one function, getAnswerURL(). Your function will return a String corresponding to the URL of your answer .csv file. You may upload your .csv file to a cloud hosting service such as Dropbox which can provide a direct link to your .csv file.

To create a direct sharing link in Dropbox, right click on the uploaded file and select share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getAnswerURL() function.
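A minimal sketch of the submitted function, assuming a Python submission and using a placeholder URL (substitute your own direct link):

```python
# Sketch: the only function your submission needs to implement.
def getAnswerURL() -> str:
    # Note the "?dl=1" suffix, which makes a Dropbox share link download
    # the file directly instead of showing a preview page.
    # The URL below is a placeholder, not a real file.
    return "https://www.dropbox.com/s/exampleid/answer.csv?dl=1"
```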

Your complete code that generates these results will be tested at the end of the contest.


In addition to supervised learning algorithms that predict duplicates, we suggest trying approaches that compare pairs of records directly and intelligently, so that only a subset of all record pairs is examined fully for duplicates.
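One common way to avoid examining all n*(n-1)/2 pairs is "blocking": group records by a cheap key and compare only within groups. The sketch below uses a hypothetical key (first letter of the name plus a zip-code prefix) on made-up records; it is an illustration, not a recommended final key:

```python
# Sketch of blocking: only records sharing a cheap key become candidates.
from collections import defaultdict
from itertools import combinations

records = [
    (0, "Josef Weininger", "20001"),   # hypothetical rows
    (1, "Joseph Weininger", "20001"),
    (2, "Ann Smith", "10012"),
]

blocks = defaultdict(list)
for rec_id, name, zip_code in records:
    blocks[(name[0].lower(), zip_code[:3])].append(rec_id)

candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)
]
print(candidate_pairs)   # only (0, 1) survives the blocking step
```

Only the surviving candidate pairs would then be scored with an expensive comparison; all other pairs default to probability 0.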


Scoring

We will use the mean squared error of prediction to evaluate your submissions.

Define: dij = 1 if the ith row is a duplicate of the jth row, and dij = 0 otherwise.

Brier Score = sum of (Mij - dij)^2 over all pairs (i,j) where i < j.

Score = 1,000,000,000 / (Brier Score + 1,000)

NOTE: For provisional tests, only a portion of the (i,j) pairs (approximately half) will count toward the scoring. The remainder will be used for the system test results.
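The scoring formula above can be sketched as follows (toy data, not the real test set):

```python
# Sketch of the Brier-score-based scoring defined above.
def score(predictions, truth_pairs, n):
    """predictions: {(i, j): Mij} with i < j; truth_pairs: set of true (i, j)."""
    brier = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            m = predictions.get((i, j), 0.0)   # unlisted pairs count as 0
            d = 1.0 if (i, j) in truth_pairs else 0.0
            brier += (m - d) ** 2
    return 1_000_000_000 / (brier + 1_000)

# Perfect predictions give Brier Score 0 and hence the maximum score:
print(score({(0, 1): 1.0}, {(0, 1)}, n=3))   # -> 1000000.0
```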


Resources

English Names and Nickname Corpus


Unified Medical Language System (UMLS)

UMLS Semantic Network

UMLS Metathesaurus


Report

Your report must be at least 2 pages long, contain at least the following sections, and use the section and bullet names below.

Your Information

This section must contain at least the following:

  • First Name
  • Last Name
  • Topcoder handle (if you have one)
  • Email address
  • Final code submission file name

Approach Used

Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

This section must contain at least the following:

  • Approaches considered
  • Approach ultimately chosen
  • Steps to approach ultimately chosen, including references to specific lines of your code
  • Open source resources and tools used (e.g., WordNet), including URLs and references to specific lines of your code
  • Advantages and disadvantages of the approach chosen
  • Comments on libraries
  • Comments on open source resources used
  • Special guidance given to algorithm based on training
  • Potential improvements to your algorithm

Notes on Time Limits

Your final code will be executed on an Amazon m4.xlarge machine running Linux. The time limit for code execution is 1 hour.

Submissions will be limited to once every 2 hours, so plan your submissions near the end of the contest carefully.



Method signature: String getAnswerURL()
(Be sure your method is public.)


Seed: 0

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.