JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: AffineSWOptimization
Problem: AffineSWOptimization

Problem Statement

    In this problem you will need to implement the Smith-Waterman routine for local DNA sequence alignment using affine gap model. Your implementation should be optimized for speed while maintaining correctness.



Problem definition



Suppose that we have two DNA sequences, a query sequence a of length m and a target sequence b of length n. Each of these sequences can be treated as a string containing characters 'A', 'C', 'G', 'T'. We are also given 4 integers match_score, mismatch_score, gap_open and gap_extension. The value of match_score is positive, other 3 values are negative, moreover gap_open is strictly less than gap_extension.



This paragraph explains what does it mean to align the entire sequence a to entire sequence b. Suppose that we can insert arbitrary amount of characters '-' (gap characters) into arbitrary places of a and b. Let's do it so that a and b have the same length after all insertions are done. For clarity, let's denote the resulted strings as a' and b'. For example, the result can look like this:
        position        0123456789
	a'		A---CTTCAC
	b'		-CTGATCT-C
Each possible pair (a', b') is called an alignment of a to b. Each alignment has an associated alignment score that is a sum of 3 parts (first part is positive and the other two parts are negative):
  1. Match score. For each pair of equal non-gap characters that stand on the same position in a' and b', we get a score of match_score. For the example above, we get this score twice, for a pair of 'T's (position 5) and for a pair of 'C's (position 9).
  2. Mismatch score. For each pair of distinct non-gap characters that stand on the same position in a' and b', we get a score of mismatch_score. For the example above, this score is received 3 times, for positions 4, 6 and 7.
  3. Gap score. Each maximal consecutive sequence (i.e., a sequence that can't be extended to the left or to the right) of L gap characters in either a' or b' is called a gap of length L. For each such gap we get a score of gap_open + L * gap_extension. For the example above, there are 3 gaps of lengths 3, 1 and 1. Thus, the total received score will be (gap_open + 3 * gap_extension) + (gap_open + 1 * gap_extension) + (gap_open + 1 * gap_extension) = 3 * gap_open + 5 * gap_extension.
The optimal alignment of a to b is a pair (a', b') that has the maximum possible alignment score.



Smith-Waterman algorithm



Smith-Waterman algorithm is a well-known alignment algorithm. It is a basic solution that you will need to optimize. The algorithm is based on dynamic programming. It computes 3 two-dimensional tables M(i, j), V(i, j) and H(i, j), 0 <= i <= m, 0 <= j <= n. In this problem, we consider 4 variations of the algorithm. The variation that needs to be executed for each particular input is controlled by parameters query_start_clip and query_end_clip.



Variation 1. Align any consecutive subsequence of a to any consecutive subsequence of b (query_start_clip = 1, query_end_clip = 1).



Use the following equations:
M(i, 0) = 0, 0 <= i <= m
M(0, j) = 0, 0 <= j <= n
H(i, 0) = H(0, j) = V(i, 0) = V(0, j) = -infinity, 0 <= i <= m, 0 <= j <= n

For 1 <= i <= m, 1 <= j <= n:
  M(i, j) = max{0, max{M(i-1, j-1), V(i-1, j-1), H(i-1, j-1)} + w(a[i], b[j])}
  V(i, j) = max{M(i-1, j) + o + e, V(i-1, j) + e}
  H(i, j) = max{M(i, j-1) + o + e, H(i, j-1) + e}


Here infinity is a large enough integer (2^30 can be used for these purposes), w(a[i], b[j]) is match_score if i-th (1-based) character in a is the same as j-th (1-based) character in b and mismatch_score otherwise, o is gap_open and e is gap_extension.



The value of the optimal alignment is the maximum value among all M(i, j), H(i, j), V(i, j), 1 <= i <= m, 1 <= j <= n.



Variation 2. Align any prefix of a to any consecutive subsequence of b (query_start_clip = 0, query_end_clip = 1).



M(0, j) = 0, 0 <= j <= n
M(i, 0) = -infinity, 1 <= i <= m
H(0, j) = -infinity, 0 <= j <= n
H(i, 0) = o + e*i, 1 <= i <= m
V(i, 0) = V(0, j) = -infinity, 0 <= i <= m, 0 <= j <= n

For 1 <= i <= m, 1 <= j <= n:
  M(i, j) = max{M(i-1, j-1), V(i-1, j-1), H(i-1, j-1)} + w(a[i], b[j])
  V(i, j) = max{M(i-1, j) + o + e, V(i-1, j) + e}
  H(i, j) = max{M(i, j-1) + o + e, H(i, j-1) + e}


The value of the optimal alignment is the maximum value among all M(i, j), H(i, j), V(i, j), 1 <= i <= m, 1 <= j <= n.



Variation 3. Align any suffix of a to any consecutive subsequence of b (query_start_clip = 1, query_end_clip = 0).



Perform the same steps as in variation 1. The value of the optimal alignment is the maximum value among all M(m, j), H(m, j), V(m, j), 1 <= j <= n.



Variation 4. Align the entire sequence a to any consecutive subsequence of b (query_start_clip = 0, query_end_clip = 0).



Perform the same steps as in variation 2. The value of the optimal alignment is the maximum value among all M(m, j), H(m, j), V(m, j), 1 <= j <= n.



Implementation details



You will need to implement one method process that takes the following input parameters:
  • target -- the value of b. It is a string containing between 1 and 1024 characters. Each character is one of 'A', 'C', 'G', 'T'.
  • query -- the value of a. It is a string containing between 1 and 512 characters. Each character is one of 'A', 'C', 'G', 'T'.
  • query_start_clip and query_end_clip -- the values of query_start_clip and query_end_clip. Each of them is equal to 0 or 1.
  • match_score -- the value of match_score. It is an integer in 1..10.
  • mismatch_score -- the value of mismatch_score. It is an integer in -10..-1.
  • gap_open -- the value of gap_open. It is an integer in -10..-2.
  • gap_extension -- the value of gap_extension. It is an integer in (gap_open+1)..-1.
  • direction -- an integer equal to 0 or 1. The meaning is explained below.


The return value needs to be formatted as "opt query_end target_end n_best", where:
  • opt is the value of the optimal alignment.
  • n_best is the number of different pairs (i, j), 1 <= i <= m, 1 <= j <= n, such that max{M(i, j), V(i, j), H(i, j)} = opt. For variations 2 and 3 only cells where i = m need to be considered.
  • (query_end, target_end) is the location of the cell corresponding to the optimal alignment. In other words, the cell (query_end, target_end) in one of tables M, V or H must be equal to opt. For variations 2 and 3 query_end is always equal to m.
  • direction explains how to break ties. If direction = 0, then query_end needs to be as small as possible (if there are still several possibilities, choose the one with the smallest value of target_end among them). If direction = 1, then query_end needs to be as large as possible (if there are still several possibilities, choose the one with the largest value of target_end among them).


Use single spaces to separate integers in the return value. Do not use leading/trailing spaces. Format all integers without unnecessary leading zeros.



Testing and scoring



The data in this problem consists of 16 files. Each of them contains approximately 1 million inputs, where input is a combination of parameters for single process call. All inputs within the same file are guaranteed to have the same values of match_score, mismatch_score, gap_open and gap_extension.



Each of the files is split into 3 parts (called A, B and C) as follows:
  • The set of all possible (target, query) pairs that occur in at least one input of at least one of the 16 files is built.
  • This set is randomly split into 3 subsets A, B, C so that 40% of (target, query) pairs go to subset A, 30% go to B and 30% go to C.
  • For each file, its parts A, B, C will contain only the inputs where corresponding (target, query) pair belongs to subset A, B and C, respectively.


Parts A are used as training data. You can download these files here and use them to test your solution locally. Each input is in a single line and consists of 13 values separated by single tab characters: target, query, query_start_clip, query_end_clip, match_score, mismatch_score, gap_open, gap_extension, direction, opt, query_end, target_end, n_best. 5 of these files are also reused as example test cases. Parts B and C are used as submission and system test cases, correspondingly (so there are 16 submission and 16 system test cases, one per file).



For each test case, the grader will consecutively supply the inputs from corresponding file to your process method. It will stop as soon as it receives an incorrect return value, the execution time limit of 30 seconds is exceeded or all inputs from the file are processed, whatever happens earlier. If you were not able to process all inputs, your score for the test case will be equal to the number of times your process method was executed minus 1 (so the last call that you were not able to process is not counted). Otherwise, your score is calculated as n_inputs + 300,000 - 10 * time, where n_inputs is the total number of inputs in this test case and time is the time that your solution spent to process all of them (in milliseconds).



Your overall score on a set of test cases is the arithmetic average of scores obtained on separate test cases.



Tools



We provide a sample implementation that is guaranteed to process all inputs correctly (including the ones in submission/system tests). The speed of this implementation, however, is fairly low.



Special conditions



Only C++ submissions are allowed in this contest.



It is allowed to use MMX, SSE, SSE2 and SSE3 instructions.



Multithreading in your solution is not allowed.



No usage of assembly is allowed.



In order to receive the prize money, you may need to provide a description of how your solution works. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize and we need this description from you, a TopCoder representative will contact you directly.
 

Definition

    
Class:AffineSWOptimization
Method:process
Parameters:String, String, int, int, int, int, int, int, int
Returns:String
Method signature:String process(String target, String query, int query_start_clip, int query_end_clip, int match_score, int mismatch_score, int gap_open, int gap_extension, int direction)
(be sure your method is public)
    
 

Notes

-The match forum is located here. Please check it regularly because some important clarifications and/or updates may be posted there. You can click "Watch forum" if you would like to receive automatic notifications about all posted messages to your email.
-A great set of links on Smith-Waterman implementations and accelerated versions can be found in Wikipedia.
-Your implementation of process does not need to work exactly as described in the statement. As long as your return value is correct, it doesn't matter what algorithm you use internally to calculate it.
-The time limit is 30 seconds and the memory limit is 2 gigabytes.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
 

Examples

0)
    
1.txt
1)
    
4.txt
2)
    
13.txt
3)
    
16.txt

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.