$20,000 in prizes, divided among the top 5 finishers.
The Connectivity Map (CMap) is a Broad Institute project that enables the discovery of functional
connections between drugs, genes and diseases through the generation and analysis of gene expression
signatures (see the papers in Science and Nature Reviews Cancer).
Recent technological advances and
laboratory automation have enabled data generation scale-up resulting in a massive catalog of drug
and gene signatures. When coupled with powerful pattern recognition algorithms, these data represent
a rich reference library for the research community.
CMap utilizes a novel, high-throughput gene expression profiling technology to generate gene
expression profiles at scale. The crux of this approach is that instead of measuring all ~20,000
genes in the human genome, CMap uses an assay called L1000 to measure a select subset of
approximately 1,000 genes and uses these “landmark” gene measurements to computationally infer a
large portion of the remainder, taking advantage of the fact that some genes’ expression levels are
correlated via functional redundancy and/or shared regulatory mechanisms. The current algorithm is
effective but imperfect, and improving the imputation methods will have an immediate impact on the
quality of data and the biologically meaningful connections that can be discovered. With this in
mind, we have designed our first contest to stimulate the exploration of new and improved inference
methodologies.
The existing CMap gene inference approach uses an ordinary least squares regression model. That
is, each non-landmark, or "inferred," gene is estimated as a linear combination of all landmarks.
Linear models are popular across a variety of domains and can be built easily using even small
training datasets. The CMap inference model was trained using a compendium of gene expression data
generated across a diversity of human tissue types and cell lines. This approach makes the
assumption that the relationships between the landmark and inferred genes are linear and that they
are generally conserved across tissue types and cell lines. In theory, alternative
computational approaches that model nontrivial relationships and leverage larger compendia of data
could further improve the predictions. This contest seeks to test that hypothesis.
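To make the baseline concrete, here is a toy-sized sketch of ordinary least squares mapping landmark expression to every non-landmark gene at once. All dimensions and variable names below are illustrative assumptions; in the contest the design matrix would be 100,000 samples by 970 landmarks.

```python
import numpy as np

# Toy dimensions; the real data is far larger.
rng = np.random.default_rng(0)
n_samples, n_landmark, n_target = 200, 10, 4

X = rng.normal(size=(n_samples, n_landmark))   # landmark expression (samples x landmarks)
W_true = rng.normal(size=(n_landmark, n_target))
Y = X @ W_true                                 # non-landmark expression, exactly linear here

# Append an intercept column and solve all target genes in one lstsq call.
X1 = np.hstack([X, np.ones((n_samples, 1))])
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# Inference on new samples reuses the learned weights.
X_new = rng.normal(size=(5, n_landmark))
Y_pred = np.hstack([X_new, np.ones((5, 1))]) @ W
```

Because each inferred gene is just a column of the weight matrix, the whole model fits in a single linear solve, which is what makes this baseline cheap even at CMap scale.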
You are given a large training dataset (100,000 samples) of 12,320 gene expression levels that
include 970 landmark and 11,350 non-landmark genes. (The existing CMap model was trained on a very
similar dataset.) You are also provided with a smaller offline-testing dataset of 1,000 samples
where the landmark and non-landmark genes are in separate files (for compatibility with the scoring
scripts). The challenge is to build a model that uses the 970 landmark genes to predict the
expression levels of the inferred genes for a distinct set of 1,650 separately measured samples.
Additional details about the scoring are provided below.
Data is provided as CSV files where the rows correspond to genes and the columns correspond to
samples. The matrix values are floating point numbers in the range 0.00 to 15.00. The first few
rows and columns of one such file are shown in Figure 1 below.
Figure 1. Example CSV matrix format. These matrices are headerless CSV files with genes along
the rows and samples along the columns.
- training.csv: 100,000 samples of 12,320 genes. The first 970 rows are the landmark genes and
the last 11,350 are the non-landmark genes.
- landmarks.csv: 1,000 samples of the 970 landmark genes. The landmark genes are in the
same order as the first 970 rows of training.csv.
- truth.csv: The same 1,000 samples of the 11,350 non-landmark genes. The non-landmark
genes are in the same order as the last 11,350 genes in training.csv.
- refScores.csv: The baseline performance scores on all 11,350 non-landmark genes, obtained by
applying the existing CMap algorithm to the samples in landmarks.csv and scoring against
truth.csv. This is a single-column CSV file where the score at row N corresponds to the Nth
non-landmark gene in training.csv or truth.csv.
- cmap_scoring_function.R: The scoring function implemented in R.
- ConnectivityMap.java: The scoring function implemented in Java.
- testData.csv: 1,650 samples with the landmark genes measured by L1000.
(Note that this file contains both the provisional and system testing data, though it is not
disclosed which samples belong to which set.)
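A minimal (assumed) loader for these headerless genes-by-samples matrices may help make the layout concrete; the helper name and paths are illustrative, not part of the contest kit.

```python
import numpy as np

def load_training(path, n_landmark=970):
    """Load a headerless genes-x-samples CSV and split it into
    landmark rows (the first n_landmark) and non-landmark rows (the rest)."""
    mat = np.loadtxt(path, delimiter=",")
    return mat[:n_landmark, :], mat[n_landmark:, :]

# Usage (paths are assumptions):
# landmarks, targets = load_training("training.csv")  # (970, 100000), (11350, 100000)
```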
The objective is to predict the non-landmark genes’ expression values for the 1,650 samples in
testData.csv. Your output should be formatted as a CSV file and should contain exactly 11,350 lines
corresponding to each non-landmark gene. Each line should contain 1,650 numerical values
representing your predicted expression levels for each sample. The rows and columns of the data
should be in the same order as they appear in the provided test data.
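A sketch of emitting predictions in the required shape (one CSV row per non-landmark gene, one comma-separated value per test sample); the numeric formatting precision is an assumption.

```python
import numpy as np

def write_predictions(path, preds):
    """preds has one row per non-landmark gene (11,350 in the contest)
    and one column per test sample (1,650), in the provided order."""
    np.savetxt(path, preds, delimiter=",", fmt="%.4f")
```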
To submit your entry, your code need only implement a single method, getURL(), which
takes no parameters and returns the URL at which your predictions CSV file can be downloaded.
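In Python terms (the submission interface is language-dependent), the required method is trivially small; the URL below is a placeholder, not a real endpoint.

```python
def getURL():
    # Placeholder -- replace with the public location of your predictions CSV.
    return "https://example.com/my_predictions.csv"
```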
All example testing for this contest should be done offline, using the provided data. Note,
however, that you may do a “Test Examples” submission using your predictions file for the provided
test data. This will not provide any provisional scoring, but will confirm for you that the
predictions file works correctly (URL is accessible, correct number of rows and columns, and
numerical values parse correctly). This is not required, but can be used as a basic sanity check
before making a full submission.
The scoring function is designed to reward improved performance relative to the existing CMap
inference model. For each predicted gene, the scoring function takes into account two metrics of
accuracy:
- The Spearman correlation of the predicted expression levels with their corresponding ground
truth equivalents across samples.
- The relative rank of that correlation when compared to the correlations with all other ground
truth genes.
Those two metrics are averaged to arrive at a single score for each inferred gene. The
gene-level scores for your algorithm will be compared to the equivalent scores that have been
pre-computed using the existing CMap inference model.
Figure 2. Inference scores on the existing CMap model.
A) A scatter plot of the fraction rank vs. Spearman correlation for the 11,350 inferred genes,
as predicted by the CMap inference model, compared to their ground truth equivalents. Note the wide
range of correlation values that contribute to high rank and the long tail of low-ranking genes,
which both indicate potential for improvement.
B) A histogram of the averaged fraction rank and correlation scores. The theoretical best score is
1.0. The existing CMap algorithm produces an average score of approximately 0.65.
Submissions will be scored based on the algorithm’s ability to generate the expected expression
patterns for each inferred gene across the samples in question. You are provided with an R script
and a java executable that both contain an implementation of this scoring algorithm that you can
use for offline evaluation of your model.
Algorithms performing on par with the current model will receive scores of approximately 1,000,000.
Scores above 1,000,000 indicate algorithms doing better than the existing model and scores below
1,000,000 indicate algorithms doing worse.
Running the Scoring Scripts
The R scoring script can be invoked as follows:
Rscript cmap_scoring_function.R --inf_ds [/path/to/inference/result.csv] \
    --truth_ds [/path/to/truth/matrix.csv] \
    --reference_scores [/path/to/reference/scores.csv]
You can also run `Rscript cmap_scoring_function.R -h` to view the script’s help message.
Running the Java executable:
java ConnectivityMap [prediction_file] [truth_file] [ref_score_file]
Note that the Java scorer will simply output the computed score total, without rescaling against
the maximum possible score. However, it is possible to calculate ScoreMax for any set by scoring
the ground truth against itself. (See details below.)
Scoring Algorithm Details
The scoring will be computed as a Spearman correlation of the gene expression values, ranked
across the samples, with the ground truth.
Let S be an N * M matrix of gene expressions, output by your algorithm, where
N = totalCases is the number of samples in the testcase and M = 11350 is the number of
inferred genes. Let G be the corresponding N * M matrix of gene expression
measures (i.e. ground truth).
Let S(j) be the list of expressions of gene j across the samples, for
j in (1..M). Let G(j) be the corresponding list from the ground truth. Then, a raw
single-gene score is computed as the mean of the gene's correlation across the samples and its
scaled rank across the genes:
ScoreRaw(j) = 1/2 * ( CorrelationScore(j,j) + (1/M) * Ranked[CorrelationScore(j,All)](j) )
Where the correlation matrix is
CorrelationScore(j,i) = Correlation [ Scaled [ Ranked [ S(j) ] ], Scaled [ Ranked [ G(i) ] ] ]
and CorrelationScore(j,All) is the vector of correlations between inferred gene j and all
ground truth genes.
and the functions Correlation, Scaled and Ranked are:
Correlation [ X, Y ] = Total [ X * Y ] / (N - 1) ;
Correlation [ X, Y ] -> 0, if Correlation [ X, Y ] < 0
Scaled [ X ] = (X - Mean [ X ]) / StandardDeviation [ X ]
Scaled [ X ] -> 0 * X, if StandardDeviation [ X ] = 0
Ranked returns the position each element of the list would have if the list were sorted in
increasing order. If several elements of the list have the same value, the averaged position is
used for those elements.
To illustrate: Ranked [ (2, 1, 0, 4.1, 9.5, 5, 5) ] = (3, 2, 1, 4, 7, 5.5, 5.5)
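The tie-averaging rank can be sketched in a few lines; the helper name is ours, not from the scoring kit, and the printed call reproduces the worked example above.

```python
def ranked(values):
    """1-based ranks; tied elements receive the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j across the run of equal values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

print(ranked([2, 1, 0, 4.1, 9.5, 5, 5]))
```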
For the final score, the raw single-gene scores are normalized using the benchmark score
(BenchmarkRaw(j) for j in (1..M)) for each of the genes, produced by a
linear-regression model, averaged across the genes, and rescaled with the use of maximal score
achievable on the test (i.e. the score one would obtain with perfect predictions):
ScoreFinal = Max (0, ScoreTotal * (ScoreMax - 1,000,000) / (ScoreMax - ScoreTotal))
ScoreTotal = 1,000,000 * Mean [ (2 - BenchmarkRaw) / (2 - ScoreRaw) ]
And ScoreMax is computed as ScoreTotal when the ground truth is substituted as the
answer. ScoreMax has been computed separately for the provisional and system tests and is
kept as a built-in constant for each of those.
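The raw-score pipeline can be sketched end-to-end as follows. This is our own reading of the formulas above on toy data: the shapes, the random data, the all-ones benchmark, and the helper names are assumptions, and the provided R/Java scorers remain authoritative.

```python
import numpy as np

def ranked(x):
    """1-based ranks along a vector; ties get the average of their positions."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j + 2) / 2.0  # mean of 1-based positions i+1..j+1
        i = j + 1
    return ranks

def scaled(x):
    sd = x.std(ddof=1)
    return np.zeros_like(x) if sd == 0 else (x - x.mean()) / sd

def correlation(x, y):
    # Total[X*Y]/(N-1) on scaled ranks, clipped at zero as in the spec.
    return max(float((x * y).sum() / (len(x) - 1)), 0.0)

def score_total(S, G, benchmark):
    n, m = S.shape  # n samples, m inferred genes
    zs = np.column_stack([scaled(ranked(S[:, j])) for j in range(m)])
    zg = np.column_stack([scaled(ranked(G[:, j])) for j in range(m)])
    raw = np.empty(m)
    for j in range(m):
        corr_all = np.array([correlation(zs[:, j], zg[:, i]) for i in range(m)])
        raw[j] = 0.5 * (corr_all[j] + ranked(corr_all)[j] / m)  # ScoreRaw(j)
    return 1_000_000.0 * np.mean((2.0 - benchmark) / (2.0 - raw))

rng = np.random.default_rng(1)
G = rng.normal(size=(30, 5))  # toy ground truth: 30 samples, 5 genes
# With perfect predictions every ScoreRaw is 1; a benchmark of all-ones then
# yields ScoreTotal = 1,000,000, matching the "on par" anchor point.
perfect = score_total(G, G, benchmark=np.ones(5))
```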
Requirements to Win a Prize
- Place in TOP 5 according to system test results. See the “Scoring” section above.
- Only scores of 1,000,000 and above will be eligible for prizes.
- Within 7 days from the announcement of the challenge winners, submit a complete report at least 2 pages long containing at least the following sections:
- Your Information
- First Name
- Last Name
- Topcoder handle
- Email address
- Approach Used
- Please describe your algorithm so that we know what you did even before seeing your code.
Use line references to refer to specific portions of your code.
- This section must contain at least the following:
- Approaches considered
- Approach ultimately chosen
- Steps to approach ultimately chosen, including references to specific lines of your code
- Advantages and disadvantages of the approach chosen
- Comments on open source resources used
- Special guidance given to algorithm based on training
- Potential improvements to your algorithm
- Local Programs Used
- If any parameters were obtained from the training data set, you will also need to provide
the program(s) used to generate these parameters.
- There is no restriction on the programming languages/libraries used to generate these
training parameters. Submissions can be in any open-source language, or alternatively MATLAB.
- Actual programming code used
- Each of the top 5 competitors will be given access to an AWS VM instance, where they will
need to load their code, along with two basic scripts for running it.
- ./compile.sh should perform any necessary compilation of the code so that it can be run.
- ./run.sh [training_file] [test_landmarks] [output_file] should run the code and generate
the output CSV that was used during the contest.