Get Time
long_comps_topcoder  Problem Statement
Contest: Marathon Match 73
Problem: OrdinalTraitAssociationMapping

Problem Statement

    One major goal of plant genetics and breeding is to identify DNA causal variants underlying complex traits. This goal is completed through association mapping.

In this problem, you will be provided with a DNA marker genotype data for a large number of individuals. The set of all individuals is called association panel. All traits considered in this problem are measured in discrete manner so that the value of a trait for a specific individual is always an integer number in {1, 2, 3, 4, 5}. In each test case, you will be given the values of one specific trait for each of the individuals. It is known that these values are generated from some non-empty subset of DNA markers using a specific unknown procedure (here and later "unknown" doesn't mean that something is not known to everybody; it only means that this information is hidden from problem solvers). In other words, the values of a trait are not naturally observed or measured, but are rather artificially simulated. The generation procedure is similar for all test cases, but the subset of DNA markers used for generation and some inner generation parameters are different for each test case. The procedure can be quite complex, in particular, there can be some generation parameters specific to each individual and the trait's value for an individual may depend on genotypes of other individuals.

Your task is, given the genotype data and the values of a trait, to determine which DNA markers are significantly associated with this trait. You will find the definition for "significantly associated" below.

Genotype data

In association mapping, DNA marker genotype data is produced using a large number of Single Nucleotide Polymorphism (SNP) DNA markers (thousands or even millions). Generally, one SNP marker has two values (two of four nucleotides A, T, G and C) and the genotype at an SNP marker of one individual can be one of BB, Bb or bb, where B is the nucleotide that occurs more often within this SNP marker among all individual and b is the nucleotide that occurs less often. In order to represent all SNP markers uniformly, we will use 1 to denote BB and -1 to denote bb. The Bb genotypes are denoted as NA and we will assume that the data for these genotypes is missing. Note that the proportion of Bb genotype is generally small for each SNP marker.

All test cases in this problem will be generated using the same genotype data which can be downloaded here. The ".tsv" extension stands for "tab-separated values". This is a plain-text file that describes a 2-dimensional table, where rows correspond to individuals and columns correspond to SNP markers. The data is listed in row-major order, with consecutive rows separated with new-line characters and consecutive cells in a row separated with tab characters. The first row contains the column names, where each name is an ID of an SNP marker. The first cell in each row contains the name for this row, which is an ID of an individual. The cell in row X, column Y contains the genotype of individual X at SNP marker Y, which can be one of 1, -1 or NA. Overall, there are 281 individuals and 2913 SNP markers.

For each test case, exactly the same genotype data will be provided to your code as String[] genotype. It will contain exactly 281 elements and each element will contain exactly 2913 characters. The j-th character in the i-th element represents the genotype at SNP marker #j of individual #i. We will use characters '+', '-' and 'X' to denote 1, -1 and NA, correspondingly. We assume that both SNP markers and individuals are numbered starting from 0 in the same order as they appear in genotype data file.

The generation procedure uses a certain unknown way to fill each NA cell with one of values 1 or -1 before doing the actual generation. This way is the same for all test cases. We will use the term "augmented data" to refer to the genotype data where each NA is replaced with 1 or -1. In contrast to this, the original data will also be called "non-augmented data". Please note that data provided in String[] genotype is non-augmented data.

Linkage disequilibrium

Linkage disequilibrium (LD) is an important concept that characterizes the level of dependency between two SNP markers.

Let X and Y be two SNP markers. Let's number the individuals 0, 1, ..., 280 and let (X[0], X[1], ..., X[280]), (Y[0], Y[1], ..., Y[280]) be the genotypes of all individuals at two considered SNP markers. Among pairs (X[i], Y[i]), 0 <= i <= 280, let's choose only those where both X[i] and Y[i] are not NA and throw away all the rest. After renumbering of remaining pairs by consecutive integers starting from 0, we'll get vectors (X[0], ..., X[N-1]), (Y[0], ..., Y[N-1]), where N <= 281. In order to calculate the value of LD between SNP markers X and Y, the following formulas should be applied:

mean(X) := sum(X[i], 0 <= i < N) / N
mean(Y) := sum(Y[i], 0 <= i < N) / N
stdev(X) := square root (sum((X[i] - mean(X))^2, 0 <= i < N) / N)
stdev(Y) := square root (sum((Y[i] - mean(Y))^2, 0 <= i < N) / N)
cor(X, Y) := sum((X[i] - mean(X)) * (Y[i] - mean(Y)), 0 <= i < N) / (N * stdev(X) * stdev(Y))
LD(X, Y) = cor(X, Y) * cor(X, Y)

Note that if all values in X or Y are the same, then corresponding stdev value will be 0 and we'll get division by 0 when calculating cor(X, Y). In order to avoid this, we assume that LD in this case is equal to 0.

The LD value always ranges from 0 to 1. The value of 1 is reached if and only if X[i] = Y[i] for all i or X[i] = -Y[i] for all i, and generally the values of LD close to 1 means that X[i] = Y[i] occurs very often or X[i] = -Y[i] occurs very often. In fact, LD(X, Y) is the square of the Pearson correlation coefficient between X and Y.

Significant association with a trait

In order to define what it means for an SNP marker to be significantly associated with a trait, let's first introduce the concept of similar SNP markers.

Consider the undirected graph where each vertex corresponds to a single SNP marker and two vertices are connected with an edge if and only if the LD value for the corresponding two SNP markers is greater than or equal to 0.8. In this graph we can find connected components. Two SNP markers are called similar if their corresponding vertices belong to the same connected component.

Let S be a set of SNP markers. An SNP marker is called to be similar to S if it is similar to at least one SNP marker from S. The similarity closure of S is defined as the set of all SNP markers similar to S. Since each SNP marker is similar to itself, S is always a subset of its similarity closure.

Consider a certain trait T. As it was mentioned before, each trait is generated from a certain subset of DNA markers. An SNP marker is called to be significantly associated with T if it belongs to similarity closure of the set of DNA markers from which T was generated.

Sample trait data

In order for you to have a notion about the trait generation procedure, we provide the information about 10 sample traits generated through this procedure along with all SNP markers significantly associated with these traits.

You can download the trait data here. It is in TSV format. The first row contains column names, these are IDs of individuals, in the same order (from left to right) as in genotype data file (from top to bottom). Each of the next 10 rows describes one sample trait. It contains tab-separated list of values of this trait for each of the individuals. Each value is an integer 1, 2, 3, 4 or 5.

For each test case, you will be provided with similar description of a trait in int[] traitData. It will contain exactly 281 elements. For each i, traitData[i] will be an integer 1 to 5, inclusive, equal to the trait value for individual #i. The individuals are numbered 0 to 280 in the same order as they appear in the genotype data file.

The data about significantly associated SNP markers for sample traits can be downloaded here. It is also in TSV format. The first row contains column names, these are IDs of SNP markers, in the same order as in genotype data file. Each of the next 10 rows corresponds to one sample trait. It contains tab-separated list of values, where each value is 1 if corresponding SNP marker is significantly associated with this trait and 0 if it is not significantly associated. The traits follow in the same order in both trait data file and significantly associated SNP markers file.

These 10 sample traits will be reused as example test cases, but none of them will be included into submission and system test cases.

Return value and scoring

For each test case, your code should return a int[] containing exactly 2913 elements, each being 0 or 1. If you think that SNP marker #i is significantly associated with the trait in traitData, then the i-th element of your return should be 1, otherwise it should be 0. The SNP markers are assumed to be numbered 0 to 2912, in the same order as they appear in genotype data file.

Recall the graph we introduced when defining the term "similar" for SNP markers. Let's call the connected components of this graph combined SNP markers. For a set of SNP markers S we define the set comb(S) as the set of those combined SNP markers that contain at least one SNP marker from S.

The score for a test case is calculated as follows. Suppose that the correct set of significantly associated markers for this test case is corS and the set of SNP markers returned by your solution is retS. The following 3 values are then calculated:
  1. X -- the number of combined SNP markers that belong to both comb(corS) and comb(retS).
  2. Y -- the number of combined SNP markers that belong to comb(corS), but don't belong to comb(retS).
  3. Z -- the number of combined SNP markers that belong to comb(retS), but don't belong to comb(corS).

Your score for a test case is displayed as 1,000,000,000,000 * X + 1,000,000 * Y + Z. Note that score for a single test case is no more than a way to store three variables X, Y, Z in one value. Larger score doesn't necessarily mean a better answer, so you should not aim to maximize the score for each single test case. Instead, you should maximize the overall score, which is defined below.

If your solution exceeded time limit or memory limit, if it crashed or if your return value has wrong format, then it is assumed that you returned the empty set of SNP markers for this test case and the values of X, Y and Z are calculated accordingly.

The recall value for a test case is defined as X / (X + Y) (exact division). Note that X + Y is always greater than 0 because it is equal to the size of comb(corS). The precision value for a test case is X / (X + Z) (exact division). In case when X + Z = 0, we assume that precision is equal to 0. The average recall (precision) for a set of test cases is the arithmetic average of recall (precision) values for single test cases from the set.

Your overall score will be equal to 1,000,000 * aveRec * f(avePrec), where aveRec and avePrec is average recall and precision values, correspondingly, and f is the function defined as follows:

    f(x) = (1.0 / 9.0) * x, 0 <= x <= 0.9;
    f(x) = 0.1 + 16 * (x - 0.9), 0.9 <= x <= 0.95;
    f(x) = 0.9 + 2 * (x - 0.95), 0.95 <= x <= 1.00.


A tester is provided for offline testing. You can check its source code for a precise implementation of scoring calculation. Note that the tester is not capable of generating traits because the generation algorithm is secret. It's only possible to execute your solution on a trait data supplied in an external file.

Algorithm descriptions competition

Immediately following the end of the coding phase of this match, another competition will be launched. In this competition, you will be able to provide a description of the solution that you submitted in the marathon match, along with any additional information about your solution that you consider to be relevant. The goal of the competition is to identify solutions that perform well not only for the particular type of traits used for this problem, but also for other types of traits. You can find more details here.

Prizes and conditions

In order to receive the prize money, you will need to provide a description of how your algorithm works (detailed enough that we can understand it). This is not required if you participate in algorithm descriptions competition. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect it.


Parameters:String[], int[]
Method signature:int[] identifyGenes(String[] genotype, int[] traitData)
(be sure your method is public)


-The time limit is 5 minutes per single test case (which includes only the time spent in your code), the memory limit is 1024 MB.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-The processing server specifications can be found here.
-There are 10 example and 100 full submission test cases. We reserve the right to reduce the number of example and/or submission test cases if the load on testing servers will be too huge and large queue of submissions waiting for test results will arise.


-genotype will contain exactly 281 elements.
-Each element of genotype will contain exactly 2913 characters.
-Each character in each element of genotype will be '+', '-' or 'X'.
-genotype will contain the genotype data which can be downloaded here.
-traitData will contain exactly 281 elements.
-Each element of traitData will be between 1 and 5, inclusive.



This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.