Get Time
long_comps_topcoder  Problem Statement
Contest: HMS Challenge #1
Problem: MinorityVariants

Problem Statement

    We have a mixture of genomes. Each genome can be treated as a sequence of nucleotides denoted using characters 'A', 'C', 'G', 'T'. Consider a certain fixed position within each genome and nucleotides at that position. It is known that one of two possible cases holds:
  1. Each genome has the same nucleotide at this position. In this case, we call this position constant.
  2. There are two different nucleotides at this position. Let's call them X and Y. Suppose that X occurs more often than Y. In this case, nucleotide Y is called minority variant at this position. The fraction of nucleotides that contain Y at this position is called the frequency of this minority variant.
Given a lot of sequencing calls for this mixture, we would like to identify which positions are constant and which have minority variants. Also, we are interested in frequencies of minority variants. This seems to be trivial to do, however, some sequencing calls have low quality and we can't trust to their results. The goal of this problem is to develop an approach to separate low quality sequencing calls from real calls, and thus improve the detection ability of minority variants.

Training data

We provide some data in order to help you develop the solution. It consists of 3 files, all in TSV format.

The data about sequencing calls is contained in files "input1.tsv" and input2.tsv. Each line in these files corresponds to a single read and contains the following items (separated by single tab characters):
  • Position called within a genome. You can assume it's an integer in 1..10000. However, positions are not necessarily numbered by consecutive integers.
  • Nucleotide obtained at this position. This will be one of 'A', 'C', 'G', 'T'.
  • Read quality. A quality score given to the individual base called by a sequencer. Higher values indicate increased confidence the call is correct. The qualities are given in Phred scale.
  • K-mer abundance. A metric describing the uniqueness of the 13bp region surrounding the read. Rare k-mers with low frequencies are more likely to be sequencing errors. This blog post comment describes the logic behind using k-mers for removing sequencing artifacts.
  • Mapping score. The score assigned to a sequencing read positioned in the genome. Higher values indicate more confidence in a correctly aligned read. Novalign, the read aligner, generates the mapping scores. Section 4.3.2 of the manual describes them in details.
For each position in "input1.tsv" and "input2.tsv", it is known whether it is constant or has a minority variant. This information is provided in file "locations.tsv". Note that both "input1.tsv" and "input2.tsv" refer to the same mixture of genomes, so we provide just one "locations.tsv" file that refers to both these files. Each line of "locations.tsv" contains 3 items: position, nucleotide and frequency. For constant positions the file will contain contains nucleotides at them and 100.0 as frequency. For positions with minority variants, you will be given nucleotides of minority variants and their frequencies. For example, consider the following 2 lines:
1000	C	100.0
1010	A	13.5

It would mean that 1000 is a constant position and it contains nucleotide C. Position 1010 has a minority variant of frequency 13.5% being nucleotide A.

Implementation details

You will need to implement one method classifyReads. Each element of its input parameter, String[] reads, will describe a single read in the following format. It will contain the values of 3 metrics for this read separated by single space characters: read quality, k-mer abundance and mapping score, in this exact order. The values will be formatted exactly as in "input1.tsv" and "input2.tsv" files, so the first and third values will be integers and the second one will be a floating-point value. Note that we do not provide any information about positions called and nucleotides obtained.

The return value needs to be an int[] containing the same number of elements as reads. Its i-th element must be an integer in 0..1,000,000,000, indicating the confidence that the i-th read in input is real (higher confidence values indicate more certainty in them being real).

Testing methodology

The whole data we have is similar to the provided "input1.tsv", "input2.tsv" and "locations.tsv", but bigger. The set of all positions present in the data was randomly split into 3 subsets A, B, C according to 50%, 15%, 35% proportion, respectively. All data corresponding to positions from A was provided to you for training. The only example test case will correspond to testing on "input1.tsv". The data that corresponds to positions from B and C will be used for provisional and system tests, correspondingly. Thus, there will be 2 provisional test cases and 2 system test cases (one for each input file).

Only a small part of all reads will be used for tests. More exactly, only the following two kinds of reads will be used:
  • A read at a constant position such that obtained nucleotide is not the same as real nucleotide at this position. We can deduce for sure that all these reads are fake (have low quality).
  • A read at a position with minority variant of low frequency (less than or equal to 6.5%) such that obtained nucleotide is the same as real minority variant at this position. We know that such read is most likely to be real (has high quality). While we can't tell for sure that all these reads are real, the grader program will assume that all of them are real.


Your return value will be scored using the area under the ROC curve.

If your solution exceeds time or memory limit, crashes or if its return is improperly formatted, your will receive a score of 0. Otherwise, your score is calculated using the following pseudocode:
score := 0

For each read R1 that is real:
   For each read R2 that is fake:
      X := confidence that R1 is real (according to your return)
      Y := confidence that R2 is real (according to your return)
      score += {1.0, if X > Y; 0.5, if X = Y; 0.0, if X < Y}

score /= (number of real reads) * (number of fake reads)
score *= 1000000.0		// just for scaling

The overall score will be calculated as the sum of scores on the 2 test cases used.

Additional information

In order to receive the prize money, you will need to fully document the derivation of all parameters internal to your algorithm. If these parameters were obtained from the training data set, you will also need to provide the program used to generate these training parameters. There is no restriction on the programming language used to generate these training parameters. Note that all this data should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect this data.


Method signature:int[] classifyReads(String[] reads)
(be sure your method is public)


-The time limit is 10 minutes per test case (this includes only the time spent in your code). The memory limit is 2 gigabytes.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-The processing server specifications can be found here.



This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.