| In this problem you will need to implement the Smith-Waterman routine for local DNA sequence alignment using affine gap model. Your implementation should be optimized for speed while maintaining correctness.
**Problem definition**
Suppose that we have two DNA sequences, a query sequence *a* of length *m* and a target sequence *b* of length *n*. Each of these sequences can be treated as a string containing characters 'A', 'C', 'G', 'T'. We are also given 4 integers match_score, mismatch_score, gap_open and gap_extension. The value of match_score is positive, other 3 values are negative, moreover gap_open is strictly less than gap_extension.
This paragraph explains what does it mean to align the *entire* sequence *a* to *entire* sequence *b*. Suppose that we can insert arbitrary amount of characters '-' (gap characters) into arbitrary places of *a* and *b*. Let's do it so that *a* and *b* have the same length after all insertions are done. For clarity, let's denote the resulted strings as a' and b'. For example, the result can look like this:
position 0123456789
a' A---CTTCAC
b' -CTGATCT-C
Each possible pair (a', b') is called an *alignment* of *a* to *b*. Each alignment has an associated alignment score that is a sum of 3 parts (first part is positive and the other two parts are negative):
*Match score*. For each pair of equal non-gap characters that stand on the same position in a' and b', we get a score of match_score. For the example above, we get this score twice, for a pair of 'T's (position 5) and for a pair of 'C's (position 9).
*Mismatch score*. For each pair of distinct non-gap characters that stand on the same position in a' and b', we get a score of mismatch_score. For the example above, this score is received 3 times, for positions 4, 6 and 7.
*Gap score*. Each maximal consecutive sequence (i.e., a sequence that can't be extended to the left or to the right) of L gap characters in either a' or b' is called a gap of length L. For each such gap we get a score of gap_open + L * gap_extension. For the example above, there are 3 gaps of lengths 3, 1 and 1. Thus, the total received score will be (gap_open + 3 * gap_extension) + (gap_open + 1 * gap_extension) + (gap_open + 1 * gap_extension) = 3 * gap_open + 5 * gap_extension.
The optimal alignment of *a* to *b* is a pair (a', b') that has the maximum possible alignment score.
**Smith-Waterman algorithm**
Smith-Waterman algorithm is a well-known alignment algorithm. It is a basic solution that you will need to optimize. The algorithm is based on dynamic programming. It computes 3 two-dimensional tables M(i, j), V(i, j) and H(i, j), 0 <= i <= *m*, 0 <= j <= *n*. In this problem, we consider 4 variations of the algorithm. The variation that needs to be executed for each particular input is controlled by parameters query_start_clip and query_end_clip.
*Variation 1*. Align any consecutive subsequence of *a* to any consecutive subsequence of *b* (query_start_clip = 1, query_end_clip = 1).
Use the following equations:
M(i, 0) = 0, 0 <= i <= m
M(0, j) = 0, 0 <= j <= n
H(i, 0) = H(0, j) = V(i, 0) = V(0, j) = -infinity, 0 <= i <= m, 0 <= j <= n
For 1 <= i <= m, 1 <= j <= n:
M(i, j) = max{0, max{M(i-1, j-1), V(i-1, j-1), H(i-1, j-1)} + w(a[i], b[j])}
V(i, j) = max{M(i-1, j) + o + e, V(i-1, j) + e}
H(i, j) = max{M(i, j-1) + o + e, H(i, j-1) + e}
Here *infinity* is a large enough integer (2^30 can be used for these purposes), w(a[i], b[j]) is match_score if i-th (1-based) character in *a* is the same as j-th (1-based) character in *b* and mismatch_score otherwise, *o* is gap_open and *e* is gap_extension.
The value of the optimal alignment is the maximum value among all M(i, j), H(i, j), V(i, j), 1 <= i <= m, 1 <= j <= n.
*Variation 2*. Align any prefix of *a* to any consecutive subsequence of *b* (query_start_clip = 0, query_end_clip = 1).
M(0, j) = 0, 0 <= j <= n
M(i, 0) = -infinity, 1 <= i <= m
H(0, j) = -infinity, 0 <= j <= n
H(i, 0) = o + e*i, 1 <= i <= m
V(i, 0) = V(0, j) = -infinity, 0 <= i <= m, 0 <= j <= n
For 1 <= i <= m, 1 <= j <= n:
M(i, j) = max{M(i-1, j-1), V(i-1, j-1), H(i-1, j-1)} + w(a[i], b[j])
V(i, j) = max{M(i-1, j) + o + e, V(i-1, j) + e}
H(i, j) = max{M(i, j-1) + o + e, H(i, j-1) + e}
The value of the optimal alignment is the maximum value among all M(i, j), H(i, j), V(i, j), 1 <= i <= m, 1 <= j <= n.
*Variation 3*. Align any suffix of *a* to any consecutive subsequence of *b* (query_start_clip = 1, query_end_clip = 0).
Perform the same steps as in variation 1. The value of the optimal alignment is the maximum value among all M(m, j), H(m, j), V(m, j), 1 <= j <= n.
*Variation 4*. Align the entire sequence *a* to any consecutive subsequence of *b* (query_start_clip = 0, query_end_clip = 0).
Perform the same steps as in variation 2. The value of the optimal alignment is the maximum value among all M(m, j), H(m, j), V(m, j), 1 <= j <= n.
**Implementation details**
You will need to implement one method `process` that takes the following input parameters:
**target** -- the value of *b*. It is a string containing between 1 and 1024 characters. Each character is one of 'A', 'C', 'G', 'T'.
**query** -- the value of *a*. It is a string containing between 1 and 512 characters. Each character is one of 'A', 'C', 'G', 'T'.
**query_start_clip** and **query_end_clip** -- the values of query_start_clip and query_end_clip. Each of them is equal to 0 or 1.
**match_score** -- the value of match_score. It is an integer in 1..10.
**mismatch_score** -- the value of mismatch_score. It is an integer in -10..-1.
**gap_open** -- the value of gap_open. It is an integer in -10..-2.
**gap_extension** -- the value of gap_extension. It is an integer in (gap_open+1)..-1.
**direction** -- an integer equal to 0 or 1. The meaning is explained below.
The return value needs to be formatted as "opt query_end target_end n_best", where:
`opt` is the value of the optimal alignment.
`n_best` is the number of different pairs (i, j), 1 <= i <= m, 1 <= j <= n, such that max{M(i, j), V(i, j), H(i, j)} = `opt`. For variations 2 and 3 only cells where i = m need to be considered.
- (
`query_end`, `target_end`) is the location of the cell corresponding to the optimal alignment. In other words, the cell (`query_end`, `target_end`) in one of tables M, V or H must be equal to `opt`. For variations 2 and 3 `query_end` is always equal to m.
**direction** explains how to break ties. If **direction** = 0, then `query_end` needs to be as small as possible (if there are still several possibilities, choose the one with the smallest value of `target_end` among them). If **direction** = 1, then `query_end` needs to be as large as possible (if there are still several possibilities, choose the one with the largest value of `target_end` among them).
Use single spaces to separate integers in the return value. Do not use leading/trailing spaces. Format all integers without unnecessary leading zeros.
**Testing and scoring**
The data in this problem consists of 16 files. Each of them contains approximately 1 million inputs, where input is a combination of parameters for single `process` call. All inputs within the same file are guaranteed to have the same values of **match_score**, **mismatch_score**, **gap_open** and **gap_extension**.
Each of the files is split into 3 parts (called A, B and C) as follows:
- The set of all possible (
**target**, **query**) pairs that occur in at least one input of at least one of the 16 files is built.
- This set is randomly split into 3 subsets A, B, C so that 40% of (
**target**, **query**) pairs go to subset A, 30% go to B and 30% go to C.
- For each file, its parts A, B, C will contain only the inputs where corresponding (
**target**, **query**) pair belongs to subset A, B and C, respectively.
Parts A are used as training data. You can download these files here and use them to test your solution locally. Each input is in a single line and consists of 13 values separated by single tab characters: **target**, **query**, **query_start_clip**, **query_end_clip**, **match_score**, **mismatch_score**, **gap_open**, **gap_extension**, **direction**, `opt`, `query_end`, `target_end`, `n_best`. 5 of these files are also reused as example test cases. Parts B and C are used as submission and system test cases, correspondingly (so there are 16 submission and 16 system test cases, one per file).
For each test case, the grader will consecutively supply the inputs from corresponding file to your `process` method. It will stop as soon as it receives an incorrect return value, the execution time limit of 30 seconds is exceeded or all inputs from the file are processed, whatever happens earlier. If you were not able to process all inputs, your score for the test case will be equal to the number of times your `process` method was executed minus 1 (so the last call that you were not able to process is not counted). Otherwise, your score is calculated as `n_inputs` + 300,000 - 10 * `time`, where `n_inputs` is the total number of inputs in this test case and `time` is the time that your solution spent to process all of them (in milliseconds).
Your overall score on a set of test cases is the arithmetic average of scores obtained on separate test cases.
**Tools**
We provide a sample implementation that is guaranteed to process all inputs correctly (including the ones in submission/system tests). The speed of this implementation, however, is fairly low.
**Special conditions**
Only C++ submissions are allowed in this contest.
It is allowed to use MMX, SSE, SSE2 and SSE3 instructions.
Multithreading in your solution is not allowed.
*No usage of assembly is allowed.*
In order to receive the prize money, you may need to provide a description of how your solution works. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize and we need this description from you, a TopCoder representative will contact you directly. |