JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Treephaser
Problem: Treephaser

Problem Statement

    In this problem you will need to optimize a specific algorithm. Your implementation should be optimized for speed while maintaining correctness.



Problem definition

You are provided with a source code of the algorithm that you need to optimize. The code is almost completely taken from the implementation of a certain client's tool. Minor code modifications are made (see the bottom of the file) so that it can be directly submitted as a solution for this marathon match.



You will need to implement a class Treephaser that contains one method process. The algorithm inside the method doesn't matter, but it should produce (almost) the same results as process method in the implementation provided above. Since the code deals with floating point calculations, precision issues are possible, so we allow your code to have a small number of differences in comparison to model answer. See the "Scoring" section for more details.



Algorithm description

Below you can find a high level description of the algorithm which you will need to optimize. Please note that it is not a formal comprehensive specification, but just a high level description. To understand the code in deeper details is assumed to be a part of problem solving process.



Treephaser algorithm performs two tasks: measurement normalization and dephasing. Both tasks are performed iteratively, in a sliding window fashion, with each iteration working on a window slightly further downstream along the "flow" dimension.



Dephasing has a generative model of phasing (progressive, sequence and flow-order dependent blurring). It uses this model to build partial base sequences and match model predictions to the normalized measurements. The space of partial sequences is traversed by heuristic tree search that keeps a fixed number of open promising paths (eight) and expands the best path at each step, until a stopping criterion is met.



Normalization algorithm performs its first normalization using its knowledge of the initial "expected" measurements, the key sequence. After an iteration of dephasing is complete, normalization compares phasing model's predictions to measurements and adjusts the latter for a better fit in the subsequent iteration.



Test data

The data in this problem consists of 10 files. Each file contains the following data:
  • double CF -- a value between 0.001 and 0.01, inclusive.
  • double IE -- a value between 0.001 and 0.01, inclusive.
  • int numFlows -- a value between 100 and 1000, inclusive.
  • String flowOrder -- a String that contains exactly numFlows characters, each being one of 'A', 'C', 'G' or 'T'.
  • int numKeyFlows -- it is always equal to 7.
  • int keyVec -- it contains exactly numFlows elements. The first 7 of them are 1, 0, 1, 0, 0, 1, 0. The rest are all zeros.
It also contains many inputs, where each input is a value of double[] flow. Each input will contain exactly numFlows elements. Each element can be between 0.0 and 200.0, inclusive.



All inputs from the same file are separated into several test cases. The total number of elements in all inputs of the same test case can be between 100,000 and 1,000,000.



All test cases from 3 of these 10 data files are provided to you as training data. You can download these files here and use them to test your solution locally. Each file corresponds to one test case. The first six parameters are CF, IE, flowOrder, numFlows, numKeyFlows and keyVec. The rest of the file describes inputs. Each line corresponds to one input and contains the elements of double[] flow.



Testing and scoring

For each test case, the grader will consecutively supply its inputs to your process method, making a single separate call to process for each input.



If your solution exceeds a time limit of 60 seconds (per all inputs of a test case together, not per each input), a memory limit of 1 gigabytes or crashes, your score for a test case will be 0. If your return value for any input will not contain exactly numFlows elements, your score for this test case will also be 0.



Otherwise, each your return value will be compared to a model answer. Let Tot be the total number of elements in all inputs of this test case and Err be the number of positions where your answer disagrees with the model one. Your penalty for incorrect positions is calculated as follows:

  int allowedErrors = Tot / 1000; // 0.1%, integer division
  int errorsThreshold = Tot / 100; // 1%, integer division
  double penaltyForEachValue = 60000.0 / Tot; // 60sec for each 0.1%, exact division

  double penalty = 0;
  if (Err <= errorsThreshold) {
    int t = max(Err - allowedErrors, 0);
    penalty = t * penaltyForEachValue;
  } else {
    penalty = 600;
  }
Now, let P be the penalty and T the execution time of your solution (in milliseconds). Your score for a test case will be equal to 660000.0 - 1000 * P - T. Your overall score on a group of test cases will be an arithmetic average of scores on test cases from the group.



Important note 1: the model answers were produced at a different environment than TopCoder testing servers, so Err is not guaranteed to always be 0 if you submit the basic implementation that we provided above. However, it is guaranteed to be below Tot / 2000 on any test case.



Important note 2: you can assume that provided basic implementation work at most 40 seconds for any test case on TopCoder servers.



Extra submission phase

The match will be followed with an extra phase where the goal is to produce a solution that compiles and works efficiently in the client's environment. This phase will feature a prize purse of $2,500. Only the best 10 competitors from the marathon match will be invited to participate in this extra phase. Participation in the extra phase is not required in order to receive prizes from the marathon match. More information on this phase will be posted later on the match forum.



Special conditions

Only C++ submissions are allowed in this contest.



It is allowed to use MMX, SSE, SSE2 and SSE3 instructions.



Multithreading in your solution is not allowed.



No usage of assembly is allowed. It is fine to use intrinsic functions though.



In order to receive the prize money, you may need to provide a description of how your solution works. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize and we need this description from you, a TopCoder representative will contact you directly. Once the results of the marathon match are available, the description needs to be submitted within the next 7 calendar days.
 

Definition

    
Class:Treephaser
Method:process
Parameters:double, double, String, int, int, int[], double[]
Returns:int[]
Method signature:int[] process(double CF, double IE, String flowOrder, int numFlows, int numKeyFlows, int[] keyVec, double[] flow)
(be sure your method is public)
    
 

Notes

-The match forum is located here. Please check it regularly because some important clarifications and/or updates may be posted there. You can click "Watch forum" if you would like to receive automatic notifications about all posted messages to your email.
-The memory limit is 1024 MB and the time limit is 1 minute per test case (which includes only time spent in your code).
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and processing server specifications here.
-vector<int> keyVec and vector<double> flow are allowed to be reference parameters. This way you can avoid extra data copying during calls of process and thus reduce the runtime of your solution.
-The compiler is expected to optimize the return value to be returned without extra data copying (without any involvement from your side). However, we do not provide any guarantees that it will happen this way in absolutely all cases.
 

Examples

0)
    
test943.txt
1)
    
test923.txt
2)
    
test954.txt
3)
    
test972.txt
4)
    
test889.txt
5)
    
test962.txt
6)
    
test935.txt
7)
    
test916.txt
8)
    
test890.txt
9)
    
test947.txt

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.