JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: MMRF
Problem: MMRF

Problem Statement

    

Prize Distribution

  • 1st - $9000
  • 2nd - $5000
  • 3rd - $3000
  • 4th - $2000
  • 5th - $1000

Requirements to Win a Prize

In order to receive a prize, you must do all the following:

  1. Achieve a score in the top 5, according to system test results. See the "Scoring" section below.
  2. Create a legitimate algorithm that runs successfully on a different data set with the same fields. Hard-coded solutions are unacceptable.
  3. Within 7 days from the end of the challenge, submit a complete report at least 2 pages long outlining your final algorithm, explaining the logic behind and steps to its approach. The required content and format appear in the "Report" section below.
  4. Within 7 days of the end of the challenge, submit all code used in your final algorithm in 1 appropriately named file (or tar or zip archive). We will contact the winners via email and ask for the file. The naming convention should be memberHandle-MMRF. For example, handle "johndoe" would name his submission "johndoe-MMRF."

If you place in the top 5 but fail to do any of the above, then the corresponding prize money will be given to the next ranking contestant who completes the above. Anyone from Genospace & Compass is not eligible for any prize, but they are not banned from competing.

Background

Multiple myeloma, a cancer of plasma cells, affects hundreds of thousands of people each year. The goal of this contest is to devise an algorithm capable of predicting progression outcomes for multiple myeloma patients, based upon their genetic data.

The results of the contest will provide researchers with new understanding and insights into the disease, and ultimately help researchers develop better treatments for myeloma patients.

Objective

The objective of this challenge is to rank order a given set of patients based on the predicted time that will elapse before disease progression based on the measured expression of 18,898 genes for each patient.

Functions

You must implement two functions in your submission: trainingData and testingData.

trainingData receives input arguments corresponding to the training set for the given test case. The function receives four String arrays. Each of these String arrays contains a single entry for each patient. Each entry in these arrays consists of a String representing a list of comma separated values. For the first three parameters (expr_avg, expr_diff, and mutation), each entry has 18,898 comma separated values (one for each gene). Each entry of the ground truth parameter (prog_obs_time) has 2 comma separated values: progression time and observation time. Input parameters:

  • int test_type : Type of test 0 for example, 1 for provisional, or 2 for system
  • String[] expr_avg : A header row, followed by N lines each with 18,898 comma separated real numbers
  • String[] expr_diff : A header row, followed by N lines each with 18,898 comma separated real numbers
  • String[] mutation : A header row, followed by N lines each with 18,898 comma separated booleans (0 or 1)
  • String[] prog_obs_time (ground truth) : A header row, followed by N lines each with 2 comma separated integers

Some values in the comma separated lists may be missing in each of these arrays. Missing values will be left blank.

testingData receives the same input arguments as trainingData except for the absence of the test type (test_type) and the progression and observation times (prog_obs_time) which serves as the ground truth. Your implementation of testingData should return an integer array of length N containing a single value for each patient indicating the rank-order from 1 to N before disease progression. These returned values are used in conjunction with the ground truth in order to score your solution. Your rank-order prediction array of length N should contain one of each value from 1 to N inclusive where 1 indicates the patient most likely to progress and N indicates the patient least likely to progress.

Data Description

Example data is only available through MMRF for download here. You will need credentials, which must be requested (response will take ~24hr), to download the example data. Credentials received before the start of the contest will be activated at the start of the contest.

For each patient, 3 values are specified for each of the 18,898 genes:

  • expr_avg : Number indicating average expression of this gene for this patient
  • expr_diff : Number indicating the difference in expression when measured inside a tumor in the patient relative to the average (expr_avg)
  • mutation : Boolean value indicating whether or not the gene is associated with a gene mutation in the patient

Two ground truth values are also specified for each patient (prog_obs_time):

  • Progression time
  • Observation time

The example data is contained in 4 csv files: expressions_example.csv, copynumber_example.csv, mutations_example.csv, and groundtruth_example.csv, corresponding to expr_avg, expr_diff, mutation, and prog_obs_time, respectively.

Details about file formats: what the rows and columns represent in all files

  • Each data set contains a header, which describes the data columns: for gene data, the columns are labeled by Gene IDs, for ground truth, the columns are labeled by either a progression time (PT) or an observation time (OT)
  • For each data set, rows correspond to different patient IDs

Patient data may be missing in some data sets, including the ground truth. Missing data will remain blank.

Data Partitioning

A total of 640 patients in the full data set are partitioned randomly into 3 subsets: approximately 50% for example, 25% for provisional, and 25% for system. Each subset is partitioned randomly into training and testing groups for each test case.

For each provisional test, 65% of the provisional data and all example data is used for training while the remaining 35% of the provisional data is used for testing and scoring.

For each system test, 65% of the system data and all provisional and example data is used for training while the remaining 35% of the system data is used for testing and final scoring.

The minimum number of provisional and system test cases are 50 and 100 respectively. Additional test cases will be added if deemed necessary.

Breakdown of missing data in the ground truth sets:

  • All patients in the provisional and systems sets will have observations times
  • Some patients in the provisional and systems sets will have progression times
  • Some patients in the example set will have neither progression nor observation times

Scoring

Assume an algorithm that returns a rank-ordered list of patients in decreasing order of likelihood of poor prognosis.

  • Let r_i denote the rank of patient i, as predicted by the algorithm
  • Let t_i represent the progression time of patient i, according to the ground truth
  • Let u_i represent the observation time of patient i, according to the ground truth

The score is based on the relative progression likelihood for all possible patient pairs. For a given test case, the test case score is given by:

s = N * sum_{ij} R_{ij} w_{ij}

where

N = 1 / sum_{ij} |w_{ij}|

and i, j run over all patient IDs in the test data set. The matrix R has matrix elements given by

R_{ij} = sgn(r_i - r_j)

which take on the values {+1,0,-1} depending on the relative ranking of the patient pairs, as returned by the algorithm. For each patient pair, the weight matrix elements are given by w_{ij} = 2*P_{ij}-1 , where P_{ij} represents the likelihood of the given ground truth patient pair ordering (noting that the patient progression times are drawn from an a priori unknown patient progression distribution, and thus their ordering is probabilistic). The probability matrix P_{ij} can be modeled based on reasonable assumptions about the patient progression distribution. Assuming patient risk is described by an exponential distribution function with a decay time set by the patient progression time, the weights will take a simple form, based on four possible cases:

  1. If both patients i and j progress:

    w_{ij} = (t_i - t_j)/(t_i + t_j)

  2. If patient j progresses and i does not:

    w_{ij} = (u_i - t_j)/(u_i + t_j) if u_i > t_j

    w_{ij} = 0 otherwise

  3. If patient i progresses and j does not:

    w_{ij} = (t_i - u_j)/(t_i + u_j) if t_i < u_j

    w_{ij} = 0 otherwise

  4. If neither patients progress:

    w_{ij} = 0

The algorithm score is given by

S = 1,000,000 * max(0, s_avg)

where s_avg is the average of s over all test cases considered (assume that test cases contain at least two progressing patients).

The score, as defined above, ranges between 0 (i.e., random guess solutions) and 1,000,000 (i.e., perfect prediction of ground truth test data).

General Notes

  • Time limits are 2 minutes, 3 minutes, and 4 minutes respectively for example, provisional, and system tests.
  • A visualizer tool can be downloaded here which can interface with your code locally through standard input/output.
  • This match is not rated.
  • The allowed programming languages are Java, Python, C++, C# and VB.
  • You can include open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.
  • The usage of external data and pre-trained models are allowed, as long they meet the license requirements.
  • The test servers have only the default installs of all languages, so no additional libraries will be available.
  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself, possible solution techniques or related to data analysis.
  • If you request the data download from MMRF (after the match starts) then there will be ~24hr window before the account access is granted and credentials are emailed to the member.

Useful gene references

The following links may be useful for this challenge:

Report

Your report must be at least 2 pages long, contain at least the following sections, and use the section and bullet names below.

Your Information

This section must contain at least the following:

  • First Name
  • Last Name
  • Topcoder handle
  • Email address
  • Final code submission file name

Approach Used

Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

This section must contain at least the following:

  • Approaches considered
  • Approach ultimately chosen
  • Steps to approach ultimately chosen, including references to specific lines of your code
  • Open source resources and tools used, including URLs and references to specific lines of your code
  • Advantages and disadvantages of the approach chosen
  • Comments on libraries
  • Comments on open source resources used
  • Special guidance given to algorithm based on training
  • Potential improvements to your algorithm
 

Definition

    
Class:MMRF
Method:trainingData
Parameters:int, String[], String[], String[], String[]
Returns:int
Method signature:int trainingData(int test_type, String[] expr_avg, String[] expr_diff, String[] mutation, String[] prog_obs_time)
 
Method:testingData
Parameters:String[], String[], String[]
Returns:int[]
Method signature:int[] testingData(String[] expr_avg, String[] expr_diff, String[] mutation)
(be sure your methods are public)
    
 

Examples

0)
    
Seed: 1
1)
    
Seed: 2
2)
    
Seed: 3
3)
    
Seed: 4
4)
    
Seed: 5
5)
    
Seed: 6
6)
    
Seed: 7
7)
    
Seed: 8
8)
    
Seed: 9
9)
    
Seed: 10

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.