OCR Error Reduction Marathon Challenge
1st place - $5,000
2nd place - $2,500
3rd place - $1,500
4th place - $1,000
We are looking for solutions that can reduce OCR errors in an unsupervised manner. The modified texts will be validated against a keyword/keyphrase vocabulary and against the original texts. We look forward to your solutions. Good luck!
Requirements to Win a Prize
In order to receive a prize, you must do the following:
- Achieve a score in the top 4, according to system test results calculated using the test dataset. The score must be higher than 300,000.0. See the “Data Description and Scoring” section below.
- Within 7 days of the announcement of the contest winners, submit a complete report (minimum 2 pages) that: (1) outlines your final algorithm and (2) explains the logic behind and the steps of your approach.

If you place in the top 4 but fail to do all of the above, you will not receive a prize; it will be awarded to the contestant with the next best performance who completed the submission requirements above.
Optical character recognition (OCR) software turns images of text into text, but it is not perfect. Some errors stem from the nature of the document image itself (sizing, rotation, etc.) and should be handled in preprocessing, while others must be handled in post-processing, after the image is converted to text. Among the more common error types are segmentation errors (`too m uc h wh itespac e` or `toolittle whitespace`), `character rnisrecognitioii`, and misrecognized `pun,ctuation` caused by smudges or other artifacts.
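As a toy illustration of the over-segmentation case (not part of the required solution, and `merge_split_tokens` is a hypothetical helper), one simple post-processing heuristic is to greedily merge adjacent tokens whenever their concatenation is a known vocabulary word:

```python
def merge_split_tokens(text, vocab):
    # Toy heuristic: merge a token into its predecessor when the
    # predecessor is not itself a known word but the concatenation is,
    # repairing two-fragment splits like "whites pace" -> "whitespace".
    # Splits into three or more fragments are not handled here.
    out = []
    for tok in text.split():
        if out and out[-1] not in vocab and (out[-1] + tok) in vocab:
            out[-1] = out[-1] + tok
        else:
            out.append(tok)
    return " ".join(out)
```

A real solution would need to weigh such merges against the edit-distance penalty in the score, since every merge changes the cleaned string relative to the original.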
The cleaner we can make the images before the OCR step, and the cleaner we can make the output of the OCR, the more confidently we will be able to use this text as input to other models. Your challenge is to find these needles in the haystack, and train a model to find and correct the types of common issues described above -- as well as ones we might not anticipate.
There are 10,076 OCR documents in total, containing 8,241,011 words but only 15,497 unique words. All words have been anonymized.
We have randomly split these documents into 3 sets and processed them into paragraphs:
- Train Set: 260,007 paragraphs. The longest one has 844 words, while the average number of words is 10.63.
- Provisional Set: 261,294 paragraphs. The longest one has 1508 words, while the average number of words is 10.58.
- Final Test Set: 254,461 paragraphs. The longest one has 1806 words, while the average number of words is 10.67.
We provide the "Train Set" for your local development and testing. In addition, we have a lexicon of 373,837 keyphrases. These phrases are used in the evaluation. You can download them from the following link.
You are asked to implement a class OCRErrorReduction with two functions: initialize and clean. The initialize function will be called exactly once at the beginning of the test. It receives a list of single-space-separated strings, where each string represents a keyphrase. The clean function will be called multiple times during the test. Each time, it receives an original OCR paragraph and must return a cleaned paragraph.
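A minimal Python skeleton of this interface might look as follows (the method bodies are placeholders; storing the dictionary words in a set is just one possible starting point):

```python
class OCRErrorReduction:
    def initialize(self, keyphrases):
        # Called exactly once; keyphrases is a list of
        # single-space-separated strings, each one keyphrase.
        # Here we also collect the individual words for fast lookup.
        self.keyphrases = list(keyphrases)
        self.words = set()
        for phrase in keyphrases:
            self.words.update(phrase.split(" "))

    def clean(self, paragraph):
        # Called once per OCR paragraph; must return the cleaned text.
        # Placeholder: returning the input unchanged is a valid (baseline)
        # submission, scored on the original text's match ratio.
        return paragraph
```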
For each OCR document, we want to maximize the ratio of words that can be tagged as keyphrases to the total number of words, while keeping the cleaned string close to the original string.
Therefore, given the dictionary D and two strings, i.e., cleaned string (CS) and original string (OS), we define the raw score as
raw(CS, OS | D) = Sqrt( (Match(CS, D) / |CS|) * (1 - EditDistance(CS, OS) / max(|CS|, |OS|)) )
where Match(CS, D) is the maximum number of words in the string CS that can be matched to the dictionary D, and |CS| and |OS| denote the number of words in the two strings, respectively. To calculate Match(CS, D), we simply split the string by whitespace and run a dynamic programming algorithm. |CS| must be between 1 and max(2 * |OS|, |OS| + 100); otherwise, the raw score is 0.0. More details can be found in the released test harness. If you have any questions or find any errors in the test harness, please post them in the forum.
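The scoring above can be sketched in Python as follows. This is a hypothetical re-implementation for local experimentation, not the released harness; in particular it assumes EditDistance is a word-level Levenshtein distance and that keyphrases must match as contiguous word sequences:

```python
import math

def raw_score(cs, os_, phrases):
    # Sketch of raw(CS, OS | D); the released test harness is authoritative.
    phrase_set = set(phrases)
    c, o = cs.split(), os_.split()
    n_cs, n_os = len(c), len(o)
    # Length constraint: 1 <= |CS| <= max(2 * |OS|, |OS| + 100).
    if not (1 <= n_cs <= max(2 * n_os, n_os + 100)):
        return 0.0
    # Match(CS, D): DP maximizing the number of words covered by keyphrases.
    max_len = max((p.count(" ") + 1 for p in phrase_set), default=1)
    dp = [0] * (n_cs + 1)
    for i in range(1, n_cs + 1):
        dp[i] = dp[i - 1]  # word i-1 left unmatched
        for j in range(max(0, i - max_len), i):
            if " ".join(c[j:i]) in phrase_set:
                dp[i] = max(dp[i], dp[j] + (i - j))
    match = dp[n_cs]
    # Word-level Levenshtein distance between CS and OS (assumption).
    prev = list(range(n_os + 1))
    for i in range(1, n_cs + 1):
        cur = [i] + [0] * n_os
        for j in range(1, n_os + 1):
            cur[j] = min(prev[j] + 1,            # delete
                         cur[j - 1] + 1,         # insert
                         prev[j - 1] + (c[i - 1] != o[j - 1]))  # substitute
        prev = cur
    dist = prev[n_os]
    return math.sqrt((match / n_cs) * (1 - dist / max(n_cs, n_os)))
```

For example, a cleaned string identical to the original and fully covered by keyphrases scores 1.0, while uncovered words and edits away from the original both pull the score down.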
The final score will be the average of the raw scores on all tested paragraphs multiplied by 1,000,000.
In the Example Test, we will use the first 10,000 paragraphs of the released "Train Set" for evaluation. The time limit is 20 minutes. In the Provisional Test and System Test, we will use the whole "Provisional Set" and "Final Test Set", respectively. The time limit for each is 2 hours.