JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Marathon Match 18
Problem: BlockEditDistance2

Problem Statement

    

Changes from previous version

  1. For each contiguous region of like characters in your return (I's, D's, and M's) S will be added to your overall cost, where S is between 1 and 4. For instance, IIIMMIII has 3 regions of contiguous characters -- one of I's, followed by M's, followed by I's
  2. The time formula has been changed so that your code must be faster to achieve the same modifier as before.
  3. You may run-length encode your return. For instance, you could replace IIII with 4I.

Problem

A detailed changelog is maintained for each entry in Wikipedia. Each line in this changelog represents one version of the page, made by some user. In this problem, your task is to figure out how each version transitioned to the final version, and to do this as quickly as possible.



To do this, we will use the idea of block edit distance. When considering a past version and the final version, we want to know how cheaply we can transform the past version into the final version using the following procedure. First, we take a sequence of blocks from the past version, and arrange them however we want, where a block is a continuous subsequence, defined by its start and end index in the past version. Once we've done this, we insert and delete single characters (just like regular edit distance). Each block that we use has a cost of B, while each insertion or deletion costs 1. The goal is to tranform each past version into the final version as cheaply as possible. For instance, consider the following example, where B = 2 and S = 1.
past:  ABCDEFGHIJKL
final: GGHIJMACDEFGZ
One way to do this is to pick two blocks from the past version (characters 6 to 9 and characters 0 to 6), and arrange them as:
  GHIJABCDEFG
Next we insert and delete some characters (illustrated below, we add 3 and delete 1)
  -GHIJ-ABCDEFG-
  GGHIJMA-CDEFGZ
The total cost of these operations is 4 for the two blocks, and 4 for the inserts and deletes.



In addition to the cost mentioned above, you will also be charged a cost of S for each contiguous region of inserts, deletes, or matches.



The input will be given to you as a String[], where each element represents one version of the page, in the order of their revisions, so the final element is the final version. You should return a String[] where each element corresponds to one version of the page (excluding the final version). Each element should first give the blocks from the previous version to arrange, as inclusive ranges separated by spaces. Following these should be a string describing how to change the string formed from the sequence of blocks into the final version. This string should be a sequence of 'M', 'I', and 'D' characters, standing for match, insert and delete. A good way to think about this is to imagine a pointer starting at the beginning of each string: the string formed from blocks, and the final version. A 'M' character advances both pointers, an 'I' advances only the final version pointer, while a 'D' advances only the block pointer. At the end of the sequence of 'M's, 'I's and 'D's, the two pointers should both be one character past the last character of the strings. Furthermore, an 'M' character can only be used if the characters at the two pointer positions match. In the example above, the sequence IMMMMIMDMMMMMI would be correct (note that 'I' corresponds to a '-' in the first sequence, and 'D' to a '-' in the second). This sequence has 7 contiguous regions of characters. Thus, the whole return for the example above could be "6-9 0-6 IMMMMIMDMMMMMI". Note that the cost will be B*numBlocks+Ds+Is+S*contiguousRegions where Ds is the number of 'D' characters, and Is is the number of 'I' characters.



Your score will be based on your total block edit distance, summed over all versions, and on the time it takes you. For a particular test case, a baseline score is calculated by multiplying (number of versions minus one) times the length of the final version (this is the cost of a solution that used only 'I's). Your percent improvement will be calculated from this as: (baseline-cost)/baseline. Your percent improvement will then be multiplied by a time modifier. This modifier is based on the rate at which you process the data in KB/s. So, if all the versions have 500,000 characters in them combined, and you process this case in 1 second, your rate would be 500. The modifier function is 1.0/(1+exp(4-rate/500)). This is a sigmoidal curve (try plotting it!) which goes to 1 as rate increases. To give you an idea of what to shoot for, a rate of 2500 would give you a modifier of about 0.73, while a rate of 1000 would give you a modifier of about 0.12 and a rate of 3750 would give you a modifier of 0.97. To allow a bit more time for small articles, articles with less than 250K characters will be treated as if they had 250K characters.



Your final score will be the sum over all test cases of your score for each test case, multiplied by 100.
 

Definition

    
Class:BlockEditDistance2
Method:transform
Parameters:String[], int, int
Returns:String[]
Method signature:String[] transform(String[] versions, int B, int S)
(be sure your method is public)
    
 

Notes

-You are translating each version into the final version not into the next version.
-Matching is case sensitive.
-The time limit per test case is max(5,min(size/5e5,60)) seconds, where size is the total number of characters in the test. Thus, you have at least 5 seconds for each test case, and for big test cases you have 1 second per 500K characters, up to a max of 60 seconds.
-The memory limit is 1024M.
-Test cases are chosen by using the random page feature on Wikipedia, and discarding pages with less than 20 versions or more than 200 million total characters or a baseline score over 500 million. Some of them (e.g. Martial_arts) are quite large.
-Download the examples.
-If you want to download more data, you can use this Java program. Simply install Java and run "java -Xmx512M WikiDownload <article>" to download a specific article (which will be saved as a .txt file). If you run it with no parameters it will download random pages until you kill it.
-A negative score on a test case will be increased to 0 when computing the final score.
 

Constraints

-B will be chosen uniformly between 10 and 40, inclusive.
-The total number of characters will be less than 200 million (this excludes a few of the biggest pages)
 

Examples

0)
    
Article = Suzuki_DL650_V-Strom
B = 30, S = 2 
1)
    
Article = Dayenu
B = 40, S = 1 
2)
    
Article = Technology_of_the_Song_Dynasty
B = 28, S = 3 
3)
    
Article = Porcia_Catonis
B = 36, S = 3 
4)
    
Article = Eugene_H%C3%BCtz
B = 24, S = 2 
5)
    
Article = Kung_Faux
B = 23, S = 2 
6)
    
Article = Alaska_Game_Management_Units
B = 12, S = 3 
7)
    
Article = The_Kresge_Foundation
B = 38, S = 2 
8)
    
Article = Cassius_Dionysius_Longinus
B = 12, S = 3 
9)
    
Article = Johnston%2C_Iowa
B = 25, S = 3 
10)
    
Article = Prince_Beltran_of_Bulgaria
B = 32, S = 3 
11)
    
Article = Robert_Fortune
B = 18, S = 4 
12)
    
Article = Jabal_Amel
B = 29, S = 1 
13)
    
Article = Harry_Gant
B = 13, S = 2 
14)
    
Article = List_of_characters_in_Saint_Tail
B = 12, S = 2 
15)
    
Article = Lawrence_Chimezie_Akandu
B = 10, S = 1 
16)
    
Article = Spy_vs._Spy
B = 23, S = 1 
17)
    
Article = Kingshurst_CTC
B = 22, S = 1 
18)
    
Article = Lionel_Barrymore
B = 31, S = 3 
19)
    
Article = Marvan_Atapattu
B = 17, S = 1 

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.