JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Connectivity Map Round 2
Problem: CMAP2Updated

Problem Statement

    

Introduction

The Connectivity Map (CMap) is a Broad Institute project that enables the discovery of functional connections between drugs, genes and diseases through the generation and analysis of gene expression signatures, where each signature represents the transcriptional responses of human cells to chemical or genetic perturbations. To identify such connections, a CMap user poses a biological question in the form of a “query” comprising a list of genes of interest. The query algorithm then searches the CMap database to identify the signatures that are most similar to the user's input. The biomedical community has used CMap to discover relationships between drugs, diseases, and genes and to generate new scientific hypotheses that point the way to therapeutic solutions. An example use case is as follows: A researcher discovers a set of genes that are collectively dysregulated in a certain disease. She then queries CMap with this gene set and discovers a drug that inversely matches her query and therefore hypothesizes that the drug may be of therapeutic relevance for the disease.

The CMap assay quantifies the responses of 10,174 genes to an experimental perturbation, such as a drug. Due to recent technological improvements, it has become possible to massively scale-up data generation and as a result the CMap signature database has grown to 476,251 signatures and this number is expected to increase rapidly. Hence improving the speed of the query algorithm is of great practical importance to those seeking to use CMap for therapeutic discovery. The goal of this contest is to reduce the time the query algorithm takes to execute.

Support for the Connectivity Map project comes from the NIH Library of Network-Based Cellular Signatures (LINCS) and NIH Big Data to Knowledge (BD2K) Programs. The aim of the LINCS program is to understand biological systems by profiling gene expression and other cellular processes when exposed to perturbagens. The goal of the BD2K program is to support research aimed at accelerating the use of Big Data applications and tools in biology. The Connectivity Map project is grateful for their support.

If you would like to access datasets for past CMap contests, or would like to learn more about CMap, please visit their website.

Prize Distribution

  • 1st: $9,000
  • 2nd: $5,000
  • 3rd: $3,000
  • 4th: $2,000
  • 5th: $1,000

Note: As part of the conditions of winning a prize, contestants will be required to submit a written report documenting the approach of how their code works. Competitors will optionally be allowed to provide their final submission for system testing in the form of an AWS VM (or AMI) with a standalone version of their solution running on it. While this is not required, it is encouraged, particularly for those competitors wishing to use multi-threaded solutions and/or take advantage of low-level configuration of how their code runs. (e.g. anything outside of the capability of testing when on the typical marathon platform)

Problem Description

The query algorithm starts with an a priori defined pair of gene sets. Typically, these are user-defined sets of genes that are up- and down-regulated under a biological condition of interest. The goal of the query process is to rank order all signatures within the CMap signature matrix by similarity to the pattern of up- and down-regulated genes specified in the query gene set.

Similarity between query gene set q (which can range in size from 10 to a few hundred genes) and a signature s (which is a vector of 10,174 values, one per gene, where each value is a measure of differential expression and therefore indicates the magnitude of the given gene’s response to the perturbation) in the CMap signature matrix (M genes x S signatures) and is computed using the weighted Kolmogorov-Smirnov (wtks) statistic, as described in Subramanian, et al., which reflects the degree to which genes in q are overrepresented at the top or bottom of the signature s. The wtks is computed separately for the up and down gene sets in the query q and then integrated into a combined wtks (wtkscomb). For a query q, this procedure is iteratively applied to each of the 476,251 signatures S (columns) of the CMap signature matrix, resulting in a rank ordered wtkscomb list. This process is easily extended to multiple query gene set pairs Q, in which case the query output is a matrix of 476,251 signatures S as rows by Q columns.

Description of the algorithm

For a given query gene set q containing N genes, and a signature of differential expression scores s for M genes, the single-sided wtks (wtksss) value is computed as follows:

  • Sort s descending.
  • Compute ssum as the sum of the absolute scores at all positions of s that correspond to genes in q.
  • Initialize a running sum statistic RS to zero.
  • Walk down s and and at each position sg (corresponding to a gene g) update RS as follows:
    • If g is a member of q, increment RS by sg / ssum.
    • Else, decrement RS by 1/(M - N).
  • wtksss is the maximum deviation of RS from zero.

The “mountain” plot below depicts the RS for an example query gene set q and an example signature s. The vertical ticks on the horizontal line at y = 0 denote positions in s that correspond to genes in q. The red dot indicates wtksss, the maximum deviation from zero.

CMap queries are comprised of a paired up and down gene set (qup and qdown). We leverage this to compute a combined wtks (wtkscomb) that integrates the single-sided wtks for both the up and down gene sets. wtkscomb is computed as follows:

  • Compute wtksup using qup
  • Compute wtksdown using qdown
  • Integrate wtksup and wtksdown according to the table below:
+---------------------------+-------------------------+
| wtksup and wtksdown signs | wtkscomb                |
+---------------------------+-------------------------+
| same                      | 0                       |
+---------------------------+-------------------------+
| different                 | (wtksup - wtksdown) / 2 |
+---------------------------+-------------------------+

Example Code

Available for download here.

Contestants are provided with example Matlab code that implements the wtkscomb algorithm as described above. Along with this code, we also provide a toy dataset, on which the code can be run, to illustrate the mechanics of the algorithm and the desired output format. The specific files provided are:

  • Core Scripts
    • compute_wtks.m: A basic implementation of wtksss.
    • run_queries.m: A functional demo that utilized the above methods to run queries against a toy dataset. As part of that, this code also computes wtkscomb.
  • Utilities
    • parse_query.m: Parses a query gene set CSV file.
    • parse_score.m: Parses a score data matrix CSV file.
    • plot_es.m: Code to generate an illustrative plot of the KS statistic.
  • Toy Dataset
    • query_up_n10.csv & query_down_n10.csv: 10 query gene set pairs
    • score_n10x10174.csv: A matrix of 10 signatures (columns), where the values are differential expression scores for the 10,174 genes reported by the CMap assay.
    • wtks_result_n10x10.csv: A matrix of ground truth wtkscomb values for the 10 queries above, generated by the existing CMap implementation
    • example_enrichment_plot.png: A plot of RS for one of the gene sets and one signature.

Note that this code is also compatible with Octave, an open-source statistical environment similar to Matlab.

Current optimization strategies

The CMap group has adopted a few optimizations to speed up queries. These are enumerated here so as to describe past work and provided to contestants as potential avenues for further improvements:

  1. Algorithm Optimization
    • As the algorithm uses ranks derived from the CMap Signature Matrix (which contains scores), the ranks can be pre-computed and stored for quicker lookup.
    • Rather than walk down the entire rank list of 10,174 genes, the running sum statistic is only evaluated at the member positions using the scores and ranks extracted only for members of the input query list.
    • When multiple gene sets are given as input, the algorithm first determines the union of all genes present and extracts the scores and ranks for those genes all at once, rather than making a separate call to disk for each query.
    • Since the computation is embarrassingly parallel, an efficient implementation can leverage all available CPU cores.
  2. File i/o Optimization With large signature matrices that can’t be assumed to fit in available memory, optimization of file i/o is beneficial:
    • In order to enable random access to a subset of the data matrix, we employ a binary format, called GCTX, based on HDF5 to store the score and rank matrices. The l1ktools repository linked below contains utilities to operate on GCTX files.
    • Efficient encoding of the score and ranks also provides performance improvements. Given that ranks are unsigned integers with limited range (1-10174) and scores are signed floating-point values, a rank and score pair (R, S) can be encoded as a 64-bit float F with minimal loss of precision as follows:

        F = sign(S) * R * 1e2 + abs(round(S * 1e8)/1e8))

      decoding F is done by

        R = fix(abs(F)/1e2)

        S = sign(F) * (abs(F) - R*1e2)

Contestants are allowed to improve upon the above approaches—however simply implementing the above is unlikely to yield sufficient improvement over baseline performance as they are already in the current implementation. Contestants might also explore other approaches including more efficient code (restricted to languages Java, C++), more performant file i/o (note that the test environment loads data off local SSD-based storage), using CPU-level multiprocessing etc. All VM-based tests will be run on an Amazon EC2 instance of the following specification and contestants are allowed to use all available resources on the instance:

+---------------+--------------+-------+------------------+
| Instance Type | Memory (GiB) | # CPU | SSD Storage (GB) |
+---------------+--------------+-------+------------------+
| c3.4xlarge    | 30 GB        | 16    | 2 x 160          |
+---------------+--------------+-------+------------------+

* The instances used for testing on the platform during the match may not match the above, but all submissions will be scored on identical instances. The instance type above will be used for all post-contest validation.

File types

Gene sets are provided as CSV files. These are comma-delimited text files where each row contains a single gene set. The first and second columns contain gene set IDs and descriptions, respectively, and the remaining columns contain the genes in the gene set. For this contest, the description column is empty for all gene sets. Each row can contain values for an arbitrary number of columns. For a given gene set collection, there are two CSV files, one each for the up and down gene sets. It is assumed that the sets are in the same order in both the up and down files.

* When your getWTKScomb() method is called during the contest, there will be no heading rows or columns. Each line of the input will be a comma delimited list of genes for each query.

Matrices are provided as CSV files with row and column identifiers. Contestants' code should output CSV files of 476,251 rows by N columns, where N is the number of query gene set pairs for a given collection.

* During the contest your code should return a double[] with 476251 * N values, where the first N elements represent the wtkscomb for each query against the first signature, and so on.

Useful links

l1ktools: Github repository containing code to parse and interact with various CMap file types, including GCTX.

Method Implementation

You will need to write several(*) methods:

int init(int[] geneList)

This method will provide your code with the list of 10174 gene IDs used during the contest, in the order in which they appear in all of the data. This method is free to pre-load and/or pre-index data, including the saving of scratch files. However, the genes which will be part of the queries are not yet provided. This method is not timed for scoring purposes (although there is a limit to the total allowable runtime).

double[] getWTKScomb(string[] up_collection, string[] down_collection)

Given a collection of pairs of up and down gene sets, compute the wtkscomb for all signatures (columns) in the CMap signature matrix. The execution speed of this method is timed, and the results judged according to their accuracy, as explained below.

Available Library Methods (Class CMAPLib)

Several methods are provided in the CMAPLib class for your code, which you can call to save and load scratch files during the test process. They each function in the obvious way.

int saveIntFile(String filename, int[] data)

int saveDoubleFile(String filename, double[] data)

These methods will save out scratch files for later use during the test run. The return values are 0 in the case of success of -1 in case of error.

int[] loadFromIntFile(String filename, long start, int length)

double[] loadFromDoubleFile(String filename, long start, int length)

  • These methods will load a previously saved (or default-provided) binary file.
  • loadFromIntFile() treats the file as though it's a binary list of 32-bit integers, whereas loadFromDoubleFile() is like a binary list of FP64 values.
  • Both start and length are treated like an array index (not a byte position).
  • The maximum length returned at once must be less than 2GB of data.
  • Calling with start = 0, length = -1 will return the entire file (intended only for user-saved scratch files, not the provided data files)

Data Files Provided

Four data files are provided by default, and are available on the test machine to be loaded from your code:

  • "scoresByGene": A file of FP64 scores, sorted by gene.
    • The first 476,251 values are the scores for the first gene in each signature, and so on.
    • loadFromDoubleFile("scoresByGene", idx * 476251, 476251) gets all the scores for gene idx.
  • "ranksByGene": A file of INT32 ranks, sorted by gene.
    • The first 476,251 values are the ranks for the first gene in each signature, and so on.
    • loadFromIntFile("ranksByGene", idx * 476251, 476251) gets all the ranks for gene idx.
  • "scoresBySig": A file of FP64 scores, sorted by signature.
    • The first 10,174 values are the scores for each gene in first signature, and so on.
    • loadFromDoubleFile("scoresBySig", idx * 10174, 10174) gets all the scores for signature idx.
  • "ranksBySig": A file of INT32 ranks, sorted by signature.
    • The first 10,174 values are the ranks for each gene in first signature, and so on.
    • loadFromIntFile("ranksBySig", idx * 10174, 10174) gets all the ranks for signature idx.

TYPES OF TESTING

Offline testing (does not contribute to scores, for development purpose only). Contestants are provided with query gene set pairs and the CMap signature score and rank matrices. These files are intended to be used for building and testing wtkscomb implementations. These files include:

Copies of the binary files used on the tester are available here (note that the values are identical to what is in the CSV files, just converted to a binary format to facilitate loading):

Provisional testing (populates the leaderboard). During the match, contestants have access to the full CMap Signature Matrix (as in binary format above, which contains the same data as CSV format sig_score_n476251x10174.csv and sig_rank_n476251x10174.csv) along with the Provisional Query Sets (500 pairs of up and down gene sets, each set has a variable number of genes).

System testing (used for holdout analysis and final selection of winners). At the conclusion of the match period a contestant's submission will be evaluated against a collection of 1,000 holdout Query sets as a measure to ensure the Provisional result was not overfit. Importantly, the CMap signature matrix used in holdout testing will be identical to that used in Provisional, but the Query sets will be entirely different. As mentioned earlier in the problem statement, competitors will have the option to have their code tested "bare-metal" (well, bare-VM) on an image they setup to run standalone. Though this is not required, it is highly recommended for those wishing to implemented multi-threaded solutions or take advantage of low-level tweaking.

Notes on standalone submissions: Your code should be setup so it can be run from a single command: "./run.sh <upfile> <downfile> <outputfile>"

The output file format can be CSV or binary (the latter is almost certainly faster).

Timing will be based upon the entire run of the above.

Scoring

Accuracy: As this contest is focussed on improving speed, we seek solutions that are exactly the same numerically as the reference algorithm. We compute the absolute differences between a matrix of test and ground truth wtkscomb values according to the formula:

D = abs(WTKScont - WTKSref)

Where WTKScont is a matrix of contestant-generated wtkscomb values, WTKSref is the equivalent reference matrix, and D is a matrix of the absolute element-wise differences between WTKScont and WTKSref. All elements of D must be less than 0.001 to be considered sufficiently accurate and eligible for timing-based scoring.

However, we recognize that some measures to improve speed, such as choice of data types and optimizing data storage by encoding score and rank in the same matrix, may result in small losses in precision. This in turn may cause stochastic changes in the sign of wtksup or wtksdn in cases where the gene ranks are randomly distributed and max and min of RS are very close in magnitude. As described above, if wtksup and wtksdn are both of the same sign, wtkscomb will be set to zero. This can lead to test values of wtkscomb differing from ground truth in a small percentage (<0.001%) of cases. In all such cases, the wtkscomb value in either the test submission or ground truth will be zero. To account for this, we allow for up to 1,000 such differences of greater than 0.001 between the test submission and ground truth where either the test or ground truth wtkscomb value is zero. Submissions that exceed this number will be disqualified.

Timing: All algorithms that pass the accuracy constraints will be scored based on their time to execute on the entire database of gene sets provided. The goal is for the algorithm to execute as fast as possible, so faster executions will receive higher scores. The algorithm score will be based on the scoring function:

Score = 1000000 * b / (b + 4 * t)

Where t is the runtime of the submission and b is the baseline runtime of the existing algorithm. For the provisional test, b = 60 mins, and for the system test, b = 120 mins, respectively. Note that algorithm scores exceeding 200,000 indicate an improvement in speed over the existing algorithm, whereas scores less than 200,000 indicate no improvement.

Notes on time calculation: The sandbox testing framework on the platform has a fair amount of overhead from serialization of passing data back and forth between the test harness and the competitor submission. The "runtime" reported when making example submission tests only includes the literal time spent in the competitor's submitted code. As disk I/O is part of this problem, the test harness internally keeps track of time spent processing file load/save functions once the query has been called (as above, time spent during the init() function is not tracked). These two values -- the submission runtime and the I/O time -- are summed together, and that sum is what is used as "runtime" for purposes of score calculation. Note specifically that time spent due to the data passing between the tester and the submission, is *not* counted in this time.

 

Definition

    
Class:CMAP2Updated
Method:init
Parameters:int[]
Returns:int
Method signature:int init(int[] geneList)
 
Method:getWTKScomb
Parameters:String[], String[]
Returns:double[]
Method signature:double[] getWTKScomb(String[] up, String[] down)
(be sure your methods are public)
 

Available Libraries

    
Class:CMAPLib
Method:loadFromDoubleFile
Parameters:String, long, int
Returns:double[]
Sample Call:val = CMAPLib.loadFromDoubleFile(fileName, start, length);
Method:loadFromIntFile
Parameters:String, long, int
Returns:int[]
Sample Call:val = CMAPLib.loadFromIntFile(fileName, start, length);
Method:saveDoubleFile
Parameters:String, double[]
Returns:int
Sample Call:val = CMAPLib.saveDoubleFile(fileName, data);
Method:saveIntFile
Parameters:String, int[]
Returns:int
Sample Call:val = CMAPLib.saveIntFile(fileName, data);
    
 

Examples

0)
    

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.