JOIN Problem Statement
Contest: HMS Challenge #3
Problem: FitnessEstimator

### Problem Statement

Large, defined populations ("libraries") of DNA molecules can now be rapidly characterized by massively parallel DNA sequencing. If the total number of sequencing reads is sufficiently large, one can estimate each library member's relative abundance in the population by counting the number of times each individual sequence is observed (similar to estimating a multinomial distribution by sampling counts from it). In the presence of a selective pressure, specific sequences may become preferentially enriched or depleted depending on their "fitness" relative to the rest of the population. In addition, a library member's measured abundance can change due to stochastic (sampling) fluctuation. As an example, one can imagine making thousands of individual DNA sequence variants that encode a bacterial enzyme that provides resistance to an antibiotic. In the absence of the antibiotic, the relative abundance of individual library members will fluctuate stochastically. In the presence of antibiotics, however, certain DNA sequences become enriched or depleted because they differentially alter the enzyme's activity, thereby providing a survival advantage or disadvantage to the host bacteria. Those cells with a survival advantage will replicate and increase the abundance of their sequences in the population, whereas those cells with a disadvantage will die off and their sequences will become relatively depleted from the population. The library can be sampled at multiple time points ("phases") to reveal an evolutionary timecourse.

We seek a general method for the analysis of count data that reflects library member abundance before and after application of a selective pressure at possibly multiple time points. The input to the analysis is a set of count vectors that contain the counts from each member of the population at multiple time points (at least an initial count vector with a post-selection output vector). The output of the analysis should be an estimate of the fitness coefficients ("fitnesses") that describes each population member's competitiveness in the presence of a given selective pressure. Importantly, the starting abundance of population is not uniform, but rather varies according to some initial distribution. The result is that some fraction of the population will be relatively undersampled, making inferences on these library members more challenging.

Model definition and test data generation

A Bayesian network of our model is shown in the figure below. All variables are vectors of length N. The only observed variables are the count vectors Xi, where Xi ~ Multinomiali, ni). The Θi vectors are distributed as Dirichlet(w * Xi-1), where w is the vector of "true" fitnesses. Each test case can be described by the following parameters:
• N -- the number of different library members considered in this test case (i.e., the number of rows in the vector). 5,000 <= N <= 400,000.
• P -- the number of phases used to generate the test case. 1 <= P <= 3.
• n0, n1, ..., nP -- average sampling depths for initial counts (n0) and for counts at each phase (n1, ..., nP). 2 <= ni <= 10,000.
• distribution -- probability distribution used to generate "true" fitnesses for each library member. Larger fitness value means better fitness of a library member in the presence of a given selective pressure.
In order for you to test your algorithm locally, we provide 60 training test cases. You can download them here. Additionally, the download page provides the Python code used to generate those test cases and describes how this code works. Please make sure to read and understand this description before reading the rest of the problem statement.

The submission and system test cases will be generated using almost the same algorithm. The key difference is that fitnesses can be generated using some other distributions than those which were used for training test cases. We will not reveal these distributions until the end of the match because we are seeking general solutions capable of inferring fitnesses from many different potential distributions.

Another difference between training and submission/system test cases is their size (value of N). As you can see, some training cases are very large (N = 400,000). On the contrary, submission tests are much smaller (N <= 5,000). System tests are larger, but still not as large as training tests (N <= 100,000).

Finally, the training tests never use more than 2 generation phases. However, some of provisional or system tests can use 3 phases.

Testing and scoring

You will need to implement one method estimateFitnesses. It will take the following parameters:
• int[] X0 -- the i-th element gives the value of X0[i].
• int[] X1 -- the i-th element gives the value of X1[i].
• int[] X2 -- the i-th element gives the value of X2[i]. If there are less than 2 phases in a test case, then X2 will be an empty int[].
• int[] X3 -- the i-th element gives the value of X3[i]. If there are less than 3 phases in a test case, then X3 will be an empty int[].
The return value must be a double[] where the i-th element must give an estimate of fitness for the i-th library member.

Our eventual goal is to identify library members who display relatively extreme behaviors (being either highly competitive or easily outcompeted). In a typical genetic screen, an investigator might have resources to further investigate ~2% of the library population as "candidate" fit or unfit members. Successful algorithms will thus correctly identify the most differentially fit members of the population.

The scoring works as follows. In order to get a positive score, your return value must contain exactly N elements, none of them can be NaN and all of them must be within [-1e100, 1e100] range. Assuming this is the case, consider the following 4 vectors of length n = N/50 (rounded down):
• A: contains n largest "true" fitness values;
• B: B[i] is the fitness estimate that your solution returned for a library member for which the "true" fitness value is A[i];
• C: contains n smallest "true" fitness values;
• D: D[i] is the fitness estimate that your solution returned for a library member for which the "true" fitness value is C[i].
Now, let S be the sum of the following 4 values:
Your score for a single test case will be equal to 1,000 * (S + 4). The overall score for a group of test cases will be the arithmetic average of scores on all test cases from the group.

The scoring method will have no special processing for handling equal fitnesses. For example, if there's a tie for inclusion into A or C, it is not defined which elements will be included. Similarly, when rank correlation coefficient is computed, if there is a tie for ranks i, i+1, ..., j, each of these ranks will be assigned to exactly one element and it is not defined which element gets each of the ranks.

Special condition

In order to receive the prize money, you will need to provide a description of how your solution works. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect the description. Once the results of the marathon match are available on the website, the description needs to be submitted within the next 7 calendar days. This description may or may not be published in a scientific article, in part or in full and may be substantially altered, at the discretion of TopCoder. If the description is submitted for publication, you may or may not be offered an authorship position on the manuscript.

### Definition

 Class: FitnessEstimator Method: estimateFitnesses Parameters: int[], int[], int[], int[] Returns: double[] Method signature: double[] estimateFitnesses(int[] X0, int[] X1, int[] X2, int[] X3) (be sure your method is public)

### Notes

-When all elements in vector A or vector B are exactly the same, the correlation coefficient between these two vectors is assumed to be equal -1.
-There are 10 example, 50 full submission and 80 system test cases. The example test cases are randomly chosen among the training data.
-The memory limit is 1024 MB and the time limit is 2 minutes per test case (which includes only time spent in your code).
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and processing server specifications here.

### Examples

0)

`Input file: train06.csv`
1)

`Input file: train04.csv`
2)

`Input file: train54.csv`
3)

`Input file: train25.csv`
4)

`Input file: train58.csv`
5)

`Input file: train07.csv`
6)

`Input file: train32.csv`
7)

`Input file: train12.csv`
8)

`Input file: train49.csv`
9)

`Input file: train10.csv`

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.