JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: DATCompression2
Problem: DATCompression2

Problem Statement

    DNA sequencing has revolutionized biological research and is set to transform medicine as well as this technology scales. Machines like Ion Torrent's Proton sequencer will deliver a $1000 genome in a day. In order to deliver these data at scale the data need to be as compact and fast as possible. Please be a part of history delivering the first $1000 genome and help bring this technology to the clinic to save lives by making the data storage as efficient as possible in both time and space.



Each well on the Ion Torrent chip is a pH sensor producing a number of counts proportional to the concentration in hydrogen ions (protons) in the wells. Each well occupies a unique x, y position on the chip and produces a vector of 14-bit integers measuring the concentration of hydrogen ions in the well over different time points for a single flow. The concentration of protons observed (drawn in purple) is primarily driven by the concentration of the nucleotide solution (drawn in blue) and the concentration of protons produced by the DNA polymerase incorporating (drawn in red) and the observed is a combination of these processes (red + blue = purple).







In practice the proton concentration of the well can be modeled as a combination of at least 3 different processes:
  1. The proton concentration in the solution washing over the chip shown as a blue dashed line in the figure. First wash solution is flowing then a nucleotide solution which causes the rise in counts at frame 20 or so and then a wash solution again (not depicted in the plot but present in the sample data).
  2. The protons produced by the incorporation of the current nucleotide {A,C,T,G} into the growing DNA nucleotide strands by the DNA polymerases drawn as the red curve above. These protons are produced quickly when the nucleotide solution diffuses into the well and nucleotides become available to the polymerase. The protons also diffuse out and these competing processes produce the incorporation signal in red. Note that not every flow causes an incorporation in every well and that only wells with a bead containing DNA will have incorporations.
  3. The proton buffering capability of the chip and the well including the DNA and polymerase in the well if the well contains a bead. This buffering impacts the shape of both the red and the blue lines above preventing the change in concentration due to the nucleotide solution from being a simple step curve.


Input data



The test data in this problem consists of 70 files. 35 of them (randomly selected) are available for download. The other 35 files are randomly split between submission (10 files) and system (25 files) test cases.



Each line of the files consists several tab-separated integers. The first two integers are x and y -- the coordinates of a position on the chip. Each next integer is the result of a single measurement. The measurements are listed in the same order as they are done. Each line contains the same number of measurements (we will denote it as L). The measurements represent the concentration of hydrogen ions in the well (x, y). It should be changing relatively smoothly over time (t = 0, 1, ..., L-1), however huge leaps of concentration are also possible. The coordinates are always listed in increasing order of x, breaking ties by increasing order of y. In other words, if X and Y are the amounts of different x-coordinates and y-coordinates, respectively, then the order of coordinates in the file is (0, 0), (0, 1), ..., (0, Y-1), (1, 0), (1, 1), ..., (1, Y-1), ..., (X-1, 0), (X-1, 1), (X-1, Y-1). In each file (including the ones used for submission and system tests), you can assume that the following holds: 890 <= X <= 910, 890 <= Y <= 910, 55 <= L <= 60, each measurement value is in 0..16383 (inclusive).



Implementation details



You will need to implement 2 methods: compress and decompress. compress will take the data to be compressed as int[] data. This int[] is produced as follows:
  • Create a list of integers in the following order: X, Y, L, L measurements for the 1st line, L measurements for the 2nd line, ..., L measurements for the (X*Y)-th line. (The first two integers in each line, i.e., x and y coordinates, are omitted because of being redundant.)
  • If the number of integers in the list above is odd, append an extra 0 in the end.
  • Break the resulted list into pairs of consecutive integers. For example, a list (A, B, C, D, E, F) would be broken into (A, B), (C, D), (E, F).
  • For each pair (V, W) an integer (W << 16) + V is calculated. These integers (following in the same order as pairs) are elements of data.


For example, consider the following small sample file:
0 0 10000 10005


In this case X=1, Y=1 and L=2. The list for this file will look as (1, 1, 2, 10000, 10005). It has an odd number of integers, so an extra 0 will be added. This will result in the following pairs: (1, 1), (2, 10000), (10005, 0), so data will be {65537, 655360002, 10005}.



The returned value of compress must be a compressed representation of data. Each element can use the whole range of 32-bit signed integer data type.



This returned value of compress gets passed as int[] data to your decompress implementation. It needs to decompress the data back and return a int[] with exactly the same number of elements as int[] that was provided to your compress method.



Lossy compression



The compression does not have to be lossless, however only limited losses are allowed.



The grader will transform the returned value from decompress back to the list of integers (X, Y, L, L measurements for the 1st line, L measurements for the 2nd line, ..., L measurements for the (X*Y)-th line) by reversing the steps made to produce the contents of data supplied to compress method. We will use INP and OUT to denote the input (supplied to compress) and the output (obtained from decompress) lists, respectively. Let's DIFF = (d[0], d[1], ..., d[n-1]) be a vector of differences between corresponding measurements in INP and OUT, where n = X*Y*L. The following conditions need to be fulfilled in order for your return from decompress to be considered valid:
  • The values of X, Y and L in INP and OUT need to be exactly the same.
  • Each measurement value in OUT needs to be in 0..16383.
  • -2 * n < sum(d[i], 0 <= i < n) < 2 * n.
  • sum(d[i] * d[i], 0 <= i < n) < 36 * n.


Scoring



If your solution violates time or memory limit, crashes or violates any of restrictions specified above, the score for a test case will be 0. Otherwise, the score is calculated as 100,000 * A / B, where A is the number of elements in data provided to compress method and B is the number of element in the return value from compress method.



Your overall score on a set of test cases is just the arithmetic average of scores obtained on separate test cases.



Extra submission phase



The match will be followed with an extra phase where the goal is to produce a solution that compiles and works efficiently on the client's environment. This phase will feature a prize purse of $2,500. Only best 10 competitors from the marathon match will be invited to participate in this extra phase. Participation in the extra phase is not required in order to receive prizes from the marathon match. More information on this phase will be posted shortly on the match forum.



Special conditions



Only C++ submissions are allowed in this contest.



Multithreading in your solution is not allowed.



No usage of assembly is allowed.



In order to receive the prize money, you may need to provide a description of how your solution works. Note that this description should not be submitted anywhere during the coding phase. Instead, if you win a prize and we need this description from you, a TopCoder representative will contact you directly.



The TopCoder Marathon Framework is organized so that you are able to pass data between different calls of your methods. Passing any data between compress and decompress (except compressed version of input) contradicts to the nature of the problem and therefore is not allowed. In order to make easier to check that your solution does not break this rule, the following restrictions must hold for your solution:
  • You are allowed to have global/static data. compress and decompress can only read this data, but can't modify it.
  • All fields in DATCompression2 class can only be read-only within compress and decompress. The non-static fields in other classes, if you have them, are allowed to be modified.
  • The return value from compress must have a well interpretable general meaning not tied to the current run of your solution. In particular, you are not allowed to pass pointers to data through the return value.


Note the restrictions above that both compress and decompress need to allocate all memory necessary for computations and return value and no such memory can be shared between different calls.



It is not allowed to use mallopt to speed up memory allocation.
 

Definition

    
Class:DATCompression2
Method:compress
Parameters:int[]
Returns:int[]
Method signature:int[] compress(int[] data)
 
Method:decompress
Parameters:int[]
Returns:int[]
Method signature:int[] decompress(int[] data)
(be sure your methods are public)
    
 

Notes

-The match forum is located here. Please check it regularly because some important clarifications and/or updates may be posted there. You can click "Watch forum" if you would like to receive automatic notifications about all posted messages to your email.
-The memory limit is 2 gigabytes. The time limit is 10 seconds (for all time spent in compress and decompress methods).
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-The processing server specifications can be found here.
-vector<int> data is allowed to be a reference parameter. This way you can avoid extra data copying during calls of compress/decompress and thus reduce the runtime of your solution.
-The compiler is expected to optimize the return value to be returned without extra data copying (without any involvement from your side). However, we do not provide any guarantees that it will happen this way in absolutely all cases.
 

Examples

0)
    
5716a414b41586e046887486ceb0f13d.tab
1)
    
2bf3c74ae472ef9b268d6bb89e23db33.tab
2)
    
32ea14604c00b6ee878442be5e0d4915.tab
3)
    
52909913e796610ffbb8bad2b252c4cd.tab
4)
    
62b9d5cff05423698cca681220205134.tab

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.