JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: DNAS 1
Problem: DNASequencing

Problem Statement

    

Broad Institute and Crowd Innovation Lab at Harvard University present Challenge DNAS1

Prize distribution

The total prize pool for the contest is set to $20K.

Competitors with the best 5 submissions in this contest according to system test results will receive the following prizes:

Prize    USD
-------  ------
1st      $8,000
2nd      $4,000
3rd      $2,500
4th      $1,500
5th      $1,000

In order to receive a full amount of the prize, the final score of your algorithm on the system test has to be ≥1,000,000. Otherwise, 50% of the corresponding prize amount is to be paid.

Additionally, we have set bonus prizes for the first submissions, reaching the 1,000,000 threshold on provisional test:

  • 1st solution to reach 1,000,000 threshold - $2000
  • 2nd solution to reach 1,000,000 threshold - $1000

In the case a tie, the corresponding prizes will be split among all tied competitors.

Introduction

DNA is a molecule that carries most of the genetic instructions used in the development, functioning and reproduction of all known living organisms and many viruses. For humans, it represents a string of a bit more than 3 billion characters, with the vast majority of them shared by all human beings and the remaining ~0.5% encapsulating all the individual differences between them. A full DNA molecule is too long to be read in one piece. In order to analyze it, we must fragment it into millions of short 150 character strings (calls reads), which can be directly read. Like assembling a jigsaw puzzle, these are then reassembled to obtain the original 3 billion character string. This reassembly can then be used as a reference to aid subsequent sequencing efforts — it is much easier to assemble a jigsaw puzzle if you can use the picture on the box as a guide.

The current challenge is the first in a series of challenges targeting this problem. It focuses on the easiest but most prevalent case of DNA sequencing: a reference is available, and all of the reads are expected to have well-determined positions on the reference DNA which they can be associ ated with (aligned to). We seek an algorithm that not only aligns each read quickly and correctly, but also accurately estimates the ambiguity of each match if alternative matching positions are available.

Background (Biology 101)

A molecule of double stranded DNA can be thought of as two strings of identical length each consisting of four characters — A, C, G, or T. These two strings are reverse complementary: substitute A for T, C for G, G for C, and T for A in one of the strings, reverse the result, and you obtain the other string. The two strings are bound to each other such that each character is always adjacent to its complement:

Thus, double stranded DNA sequences are discussed with respect to only one of the two strings, since it is trivial to obtain the other via reverse complement.

Every somatic human cell contains two copies of the individual’s genome, inherited from the mother and father. A copy comprises 23 separate molecules of double stranded DNA called chromatids. The length of a single string from each chromatid ranges from ~250 million to ~45 million characters, summing to a total length of approximately 3.2 billion characters:

The two copies differ only by approximately 0.5%, so it is often not necessary to consider each copy as a totally separate entity. Indeed, this 0.5% difference holds on the population level as well — two unrelated people can expect their genomes will be 99.5% identical. This lets us assemble a reference genome: a universal set of chromatid sequences constructed from several people, designed to closely approximate the human species as a whole. An individual’s genome can therefore be characterized strictly by its differences from this reference.

To elucidate these differences, however, we still must sequence every character in an individual, since we do not know a priori where the differences will arise. Unfortunately, current technology does not allow us to directly read each chromatid in a single pass. Instead, we must extract the individual chromatids from several million cells in a tissue sample, and fragment these long double stranded DNA molecules into shorter double stranded fragments, each about 600 characters long (normally distributed). The 150 characters from both ends of each fragment can be directly sequenced, forming paired end reads. The strands can be read only in one direction, reversely oriented for two strands. Thus the reads for a read pair always come from opposite strands:

We then do one of two things:

  1. Reassemble these reads into a complete chromatid, a process called de novo assembly. This is akin to reconstructing a shredded document, or assembling a jigsaw puzzle without knowing anything about what it will look like. This process is extremely time-intensive, since there can be billions of reads, and each must be compared to every other. This is how reference genomes are first constructed for a species.

  2. Align these reads with respect to a reference genome. Since no two human genomes differ by more than 0.5%, we can expect most reads will be exact or near-exact matches to corresponding substrings in the reference. This process is called alignment:

    and is orders of magnitude faster: if we construct a suffix tree on the reference, exact substring matching of a read of length n can be done in O(n) time. This is akin to assembling a jigsaw puzzle using the picture on the box as an aid.

This challenge will focus on the second approach: alignment, since it is currently the most widely used approach to sequence individual genomes.

Objective

There are two major challenges within the alignment problem that we hope you will address in this competition:

  1. A Read May Not Perfectly Match the Reference. A fraction of the reads will have no exact substring match in the reference due to genetic variation and sequencing errors. Although it is possible to obtain the exact alignment of two strings via the Smith-Waterman or Needleman-Wunsch algorithms, this procedure runs in time proportional to the product of the strings’ lengths; in theory, this could be done for every read, but would be computationally infeasible.

    There is no known efficient solution to inexact substring matching.

  2. Matching May Be Ambiguous. Portions of the genome are highly repetitive — in extreme cases, thousands of characters may be repeated thousands of times. It is impossible for 150 character reads to unambiguously map to these regions, but it is very important to know that the reported match may not be accurate.

    The aligner must accurately measure the ambiguity of each mapping and report it via a mapping confidence score.

Existing Benchmarks

Currently, there are four algorithms that are widely used for DNA alignment task. We have selected one of them as a target for improvement, and we based our scoring system on its performance. Here are the characteristics of those four algorithms in comparison:

                      Accuracy          Speed          Contest Score
                    -----------        -------        ---------------
 **Algorithm 1**        1                1                 952,381    
   Algorithm 2          0.27             0.22              457,714
   Algorithm 3          0.15             1.12              125,714
   Algorithm 4          1.85             2.84             Too Slow

Our ultimate objective here is to outperform Algorithm 1 by at least 5%.

Data Description

For training, contestants will be provided with:

  1. Reference genome. This consists of 24 text files (in FASTA format) containing one string per chromatid sequence, totalling ~3.2 billion characters. As noted previously, this does not include the genomic reverse complement; however, the simulated reads will come from both strands. The reference contains some unknown sequences, which are masked by the character N; we will not be simulating reads from these regions.

  2. Simulated reads. Sequences will be extracted randomly from the genome and will contain deviations from the reference according to empirically observed genomic variants and sequencing error rates. These will be also be provided in the FASTA format, which provides the 150 characters for each read. The read pairs will be split across two matched order FASTA files, i.e. all the first-of-pair reads will be in the first file (.fa1) and all the second-of-pair reads will be in the second file (.fa2).

  3. Ground truth. The correct mapping of each read will be given in a ground truth table. The format of this table is as follows:

    ReadName    ChromatidSequenceId     Start position      End position    strand (reference [+] or reverse complement [-])    read string i.e.
    sim1/1      1                       10000               10149           +                                                   ACGTACGGATATACGATAGGACAGT...
    sim1/2      1                       10500               10652           -                                                   CGATACGAGTTAGACACAGATTATA...
            

Read names will be identical to their counterparts in the FASTA files.

For the competition, submissions will be provided with the same reference genome and new sets of simulated reads. Each submission will undergo up to three competition tests with increasing difficulty, with access to more difficult tests contingent on adequate performance on easier tests. The tests differ in complexity by the number of reads to be aligned and the number of chromatids the reads are to be aligned to.

  1. Small test: Chromosome 20 (~64 million characters in human reference hg38), up to 10 thousand read pairs
  2. Medium test: Chromosomes 1, 11, 20 (~450 million characters), up to 1 million read pairs
  3. Large test: Entire genome — 24 chromatids (~3.2 billion characters), up to 10 million read pairs

Data available for downloading:

Functions

The example and provisional test case will contain two tests with different difficulty (small, medium) while the system test case will contain three tests (small, medium, large). The DNASequencing class won’t be reinstantiated between the small and medium tests. For the large system test, solutions will be run stand-alone outside of the normal submission system on VMs with similar characteristics to the test processors.

This set of functions will be called at most 2 times from smaller test to bigger ones. For large system tests, it will only be called once.Functions call order for a test within a testcase as follows:

    1. -- Preprocessing time limit starts counting here
    
    2. int initTest(int testDifficulty)
       testDifficulty can be 0 - small, 1 - medium, 2 - large. Return can be any valid integer.
    
    3. int passReferenceGenome(int chromatidSequenceId, string[] chromatidSequence)
       Will be called 1, 3, or 24 times according to the test difficulty. Return can be any valid integer.
    
    4. int preProcessing()
       Return can be any valid integer.
    
    5. -- Preprocessing time limit ends here
       -- Time Cutoff starts counting here (the time used for scoring)
    
    6. string[] getAlignment(int N, double normA, double normS, string[] readName, string[] readSequence)
       Will be called once per test size.
       
       N is the number of reads in the test.
       Both array will contain exactly N elements as well as the return array. Elements at (2*i) and (2*i+1) are representing a read pair.
    
       Format of the return string:
    
       ReadName, ChromatidSequenceId, Start position (1-based), End position (1-based), strand (reference [+] or reverse complement [-]), mapping confidence score
       “sim1/1,1,10000,10149,+,0.89”
       “sim1/2,1,10500,10652,-,0.82”
       “sim2/1,1,11500,11649,-,0.75”
       “sim2/2,1,11300,11449,+,0.63”

    7. -- Time Cutoff ends here

Scoring

  1. The total scoring on a test is a product of 2 parts - Accuracy and Speed:

    Score =TestNorm * Accuracy * Speed

    where TestNorm = 1 000/1.05, 1 000 000/1.05, and 1 000 000/1.05 for the tests on the Small, Medium, and Large datasets, respectively.

    The provisional score is computed as the maximum of the scores from the Small and Medium tests.

    The system score, if the algorithm passes to the Large test, is computed as the maximum of the score on the Large test and the value 750,000. Otherwise, the system score is the score achieved on medium test. Thus, if your algorithm passes to the largest test, you are guaranteed a system score higher than the one for any algorithm that does not pass regardless of the outcome.

  2. The Accuracy scoring is based on the verified output of the algorithm that has a form of a list of matching outcomes for each of the sequences: R = {1 or 0}, (1 — matched / |aligned read start - true read start| <= 300/, 0 — unmatched / |aligned read start - true read start| > 300/), sorted by the confidence in descending order. The accuracy value is to be computed as follows:

    where the NormA is introduced to make normalize Accuracy = 1 for Algorithm 1, and the logarithmic area under the curve is computed as follows:

    where the cumulative sensitivity is computed as:

    and the log false detection rate is computed as:

    with the following addition:

  3. The Speed scoring is measured by the following coefficient:

    Here the TimeCutOff is the maximal time we want to allow on the test, which is set to a 2x of the alignment time of Algorithm 1, and the NormS makes Speed = 1 for the Algorithm 1 benchmarking code. Also, in order to pass the test, your DNA preprocessing time has to be lower than 3x of such time for Algorithm 1 for the test.

    The mean values of the limits on preprocessing and alignment times, as well as scoring norms and thresholds are shown in the table below for each test size:

                Preprocessing       NormA       NormS       TimeCutOff              Threshold to go on
    	       time limit (sec)                               (sec)
              ------------------   -------     -------     ------------    ------------------------------------
    Small             201          -3.392        0.5              16.1      Score on small test > 100
    Medium          1,669          -3.962        0.5           1,102        Score on medium test > 750,000
    Large          13,239          -2.710        0.5          13,730        N/A
    
               EXAMPLE TEST DATA     PROVISIONAL TEST DATA     SYSTEM TEST DATA
               NormA       NormS       NormA       NormS       NormA       NormS  
              -------     -------     -------     -------     -------     ------- 
    Small     -3.3697      0.4964     -3.3544      0.4803     -3.4591      0.4995   
    Medium    -3.9728      0.4981     -3.9765      0.5071     -3.9679      0.5246   
    Large     -2.4089      0.5011        N/A         N/A      -2.4156        TBD   
    

    Note: Large test is only evaluated in post-contest system testing.

Data generation

  1. Data Overview: All the reads in the data have their positions mapped to the reference. The absolute majority of the reads will also have very good to perfect alignment with the reference. Some overlaps with large deletions may occur and have to be taken into account.

  2. Simulated reads: We simulated true genomic variants and sequencing errors according to their actual observed frequencies:

    1. True variants: Single character substitutions occur every ~1/800 bases. Insertions and deletions occur with cumulative frequency of ~1/5000 bases; relative probability of insertion or deletion is equal to inverse length of insertion or deletion. The insertions are limited to a length of 40 and deletions to a length of 200.

    2. Sequencing errors: Single character substitutions occur every ~1/500 bases. Erroneous insertions and deletions occur every ~1/1000 bases. It is rare to see an erroneous insertion/deletion of more than two characters; the vast majority are one character.

Limitations

  • Memory limit is 16GB
  • Test processors are AWS m3.2xlarge instances
  • There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). The compilation time limit is 60 seconds.
  • You will be allowed to make example submissions each 30 minutes and provisional submissions each 3 hours.

General notes

  • This match is rated.
  • The allowed programming languages are C++, and Java.
  • You can include open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the FreeBSD, Boost.org or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.
  • The test servers have only the default installs of all languages, so no additional libraries will be available.
  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

Requirements to Win a Prize

  1. Place in TOP 5 according to system test results. See the “Scoring” section above.
  2. Within 7 days from the announcement of the challenge winners, submit a complete report at least 2 pages long containing at least the following sections:
    1. Your Information
      • First Name
      • Last Name
      • Topcoder handle
      • Email address
    2. Approach Used
      • Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

        This section must contain at least the following:

        • Approaches considered
        • Approach ultimately chosen
        • Steps to approach ultimately chosen, including references to specific lines of your code
        • Advantages and disadvantages of the approach chosen
        • Comments on open source resources used
        • Special guidance given to algorithm based on training
        • Potential improvements to your algorithm
    3. Local Programs Used
      • If any parameters were obtained from the training data set, you will also need to provide the program(s) used to generate these parameters.
      • There is no restriction on the programming language/libraries used to generate these training parameters, provided they are a free, open-source solution for you to use and would be for the client as well.
 

Definition

    
Class:DNASequencing
Method:passReferenceGenome
Parameters:int, String[]
Returns:int
Method signature:int passReferenceGenome(int chromatidSequenceId, String[] chromatidSequence)
 
Method:initTest
Parameters:int
Returns:int
Method signature:int initTest(int testDifficulty)
 
Method:preProcessing
Parameters:
Returns:int
Method signature:int preProcessing()
 
Method:getAlignment
Parameters:int, double, double, String[], String[]
Returns:String[]
Method signature:String[] getAlignment(int N, double NormA, double NormS, String[] readName, String[] readSequence)
(be sure your methods are public)
    
 

Examples

0)
    
Seed: 1

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.