JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Antibody Marathon Match
Problem: ABCSpeedup

Problem Statement

    

Overview

The human immune system produces a vast variety of antibodies in order to respond to external stimuli. Next-generation sequencing technology allows researchers to obtain the sequences of all antibodies from a single person. Clustering these antibody sequences allows us to understand the lineage structure of all human antibodies. Researchers can then further monitor the global changes of the human antibody repertoire in response to the different medical conditions, and identify the true acting antibodies with the potential to cure the disease.

However, the number of antibody sequences from a single sample can go up to millions. Clustering at such a large scale poses a major computational challenge.

The goal of this contest is to optimize an existing algorithm for clustering antibody sequences, which computes a pairwise distance matrix and then performs hierarchical clustering to group sequences into clusters. This algorithm is implemented in Python as the provided clonify_contest.py script (will be posted in the MM forum)

Furthermore, as the goal of this contest is speedup, we have also provided a C++ implementation (will be posted in the MM forum). The ultimate goal of the contest is to make the current algorithm tractable even on a data set of 1 million antibodies while maintaining similar accuracy.

The existing algorithm and its converted C++ version could be downloaded here.

Prize Distribution:

  • 1st - $5,000
  • 2nd - $2,000
  • 3rd - $1,000
  • 4th - $500

Details

Algorithm

There are some important information of the existing clustering algorithm:

  • Pairwise distances are computed between all pairs of antibodies. The main component of the distance computation is a Levenshtein distance term (see the get_LD function in the Python code), which is combined with a few other terms that modify the final distance score, e.g., by penalizing mismatches of antibody genes (see the get_score function).
  • Clustering is then performed using based on the pairwise distance matrix. The Python modules fastcluster and scipy.cluster.hierarchy.fcluster are used to create flat clusters with a maximum distance cutoff set to 0.32.
  • The best way to understand the details of the algorithm is to look at the Python reference code, which is quite short and readable.

Data

We provide this sample input file (will be posted in the MM forum) for algorithm development, which is also used for the example test:

  • 1K.json this file contains 1,000 sequence objects

And its sample output_file as the reference:

  • 1K.out (1K.json as the input_file)

Meanwhile, we are also providing a large input file 100K.json (will be posted in the MM forum), which contains 100K antibodies for you to test your method on large dataset. Note that this test case will also be available in example test.

The test data for the contest consist of 2 input sets in the example test, 5 sets in provisional tests, and 10 sets in system tests. In the example test, which are same as 1K.json and 100K.json we released, the number of objects n will be 1K. In the provisional test, n will be at the level of 5K and 10K, while n will be at least 10K and at most 100K in system tests.

Example test cases are available here.

Implementation

The antibody contains the descriptions of all antibodies which need to be clustered. The format is the same as the json file, i.e. by adding the strings together, you will get the json file. See more details in the released example json files.

Your method should return the cluster IDs of these antibodies in the same order.

Scoring

Submissions will be scored by running the solution against the original outputs of clonify_test.py.

Your score will be determined by (i) clustering accuracy compared to the original output and (ii) running time. That is, you need to efficiently compute accurate clustering results. The scoring metric is defined as follows. Suppose the final clustering result is A[1..n] for n antibodies in the original output, i.e. A[i] is the cluster ID associated to the i-th antibody. Similarly, B is your output. Then, the accuracy will be computed based on the proportion of correctly clustered pairs. Meanwhile, the base accuracy by assigning each antibody a separate cluster (i.e. B[i] = i) will be subtracted. See the following pseudocode for details.

    correct = 0
    baseCorrect = 0
    for i := 1 to n do {
	for j := i + 1 to n do { // considering all pairs
            if (A[i] == A[j]) == (B[i] == B[j]) {
                correct += 1
            }
            if ((A[i] == A[j]) == (i == j)) {
                baseCorrect += 1
            }
        }
    }
    accuracy = max(correct - baseCorrect, 0) / (n * (n - 1) / 2 - baseCorrect)
    timeCost := time cost in milliseconds
    rawScore := pow(max(accuracy - 0.99, 0) / (1 - 0.99), 4) / max(100, timeCost)

If your solution exceeds time or memory limit, crashes or returns invalid result, your score for that test case is 0.0. Your result is considered valid if it contains the same number of antibodies as the original output and each is assigned a valid cluster ID (numbered from 1 to n). Your final score will be normalized by the best rawScore from all contestants. That is, the maximum rawScore on each test case will be normalized to 10,000,000.

 

Definition

    
Class:ABCSpeedup
Method:cluster
Parameters:String[]
Returns:String[]
Method signature:String[] cluster(String[] antibody)
(be sure your method is public)
    
 

Notes

-You can include open source code in your submission. Open Source libraries under the BSD, GPL, or GPL2 license will be accepted. Other open source licenses could be accepted too. Just be sure to ask us.
-In order to receive the prize money, you will need to fully document your code and explain your algorithm. There is no language preference, but your code must be wrappable in a Python API. If any parameters were obtained from the training data set, you will also need to provide the program used to generate these parameters. There is no restriction on the programming language used to generate these training parameters. Note that all this documentation should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect this data.
-You may not use any external (outside of this competition) source of data to train your solution.
-Please check the match forum regularly because important clarifications and/or updates may be posted there.
-You can train your solution offline based on the given files and you can hardcode data into your solution--just remember that you can't use data from other sources than this contest.
-Time limit is 10 minutes per test set and memory limit is 3GB. We know that the task is memory-demanding and intentionally ask for memory management.
-There is no explicit code size limit. The implicit source code size limit is around 10 MB (it is not advisable to submit codes of size close to that or larger).
-The compilation time limit is 600 seconds. You can find information about compilers that we use and compilation options here.
-The implementation will be evaluated under Linux environment (we use the latest Ubuntu Server edition).
 

Examples

0)
    
Seed: 1
1)
    
Seed: 2

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.