JOIN Problem Statement
Contest: USPTO Algorithm Challenge
Problem: PatentLabeling

### Problem Statement

In this problem, you will need to extract certain useful information from a patent drawing page.

You will be provided with an image of a patent drawing page. It will contain one or several figures. There can also be additional data that does not belong to any of the figures. Each figure has a title and consists of many parts. Each part is labeled with a certain text (typically a number). Some parts may have multiple labels.

For example, consider the following image of a patent drawing page. It contains 3 figures named "2A", "2B" and "2C". They have 14, 8 and 8 part labels, respectively, a total of 30 part labels for the whole drawing page. All figures (blue color) and part labels (red color) are shown in the following image.

Your task is to extract the following data from the given image:
• for each figure, its location and title;
• for each part label, its location and text.
In order to help with the task, you will also be provided with patent text ("Claim" and "Description" sections) in HTML format. In some rare cases (for example, when a patent is old), the text is not available and then you will only be provided with an image.

Implementation details

You will need to implement 2 methods.

1. getFigures. The input data for this method is an image and a patent text. You are given ints H and W -- the height and the width of the image in pixels. The int[] image contains H*W elements and provides the contents of the image. Each element image[r * W + c] is an integer in 0..255 representing the 8-bit grayscale value of a pixel at row r, column c (both 0-based). text describes the patent text in HTML format. Each element corresponds to a single line in an HTML file. If the text for this patent is not available, then text will be an empty String[].

For the given image, you need to find all figures. The return value must be a String[] where each element corresponds to a single figure and is formatted "N x_1 y_1 x_2 y_2 ... x_N y_N title" (quotes for clarity), where (x_1, y_1) - (x_2, y_2) - ... - (x_N, y_N) - (x_1, y_1) is a polygon surrounding the figure (we will call it "bounding box") and title is the figure's title. In each coordinate (x_i, y_i), x_i is a 0-based column number (where 0 is the leftmost column) and y_i is a 0-based row number (0 is the topmost row). N (the amount of polygon's vertices) is allowed to be at most 15, the polygon needs to have a positive area and can't have self-intersections or self-touchings, each vertex (x_i, y_i) must be a pixel within the image bounds. The returned title can have at most 20 characters. The order of figures can be arbitrary. The amount of figures must not exceed 1000.

2. getPartLabels. The input data is an image and a patent text in the same format as for getFigures. For this image, you need to find all part labels. The return value must contain bounding boxes and texts for each part label. It must be formatted in the same way as return value for getFigures and needs to adhere to the same restrictions. I.e., it needs to be a String[] where each element corresponds to a single part label and is formatted "N x_1 y_1 x_2 y_2 ... x_N y_N text" (quotes for clarity). Note that you need to find and return part labels for all figures in the image simultaneously. However, you do not need to indicate which figure each of the part labels belongs to.

Training set

For the purposes of this match, we have prepared a corpus of 306 patent drawing pages. For some patents there is just 1 drawing page in the set and for other patents there can be multiple drawing pages.

The whole corpus was divided into 3 subsets A, B and C containing 178, 35 and 93 drawing pages, respectively. The division was made in a completely random way, but with one restriction: all drawing pages belonging to the same patent were always placed into the same subset.

All drawing pages in subset A along with expected correct answers for them are provided to you in advance. You can use them in order to understand the nature of the problem better and to test/train your solution. Images in subsets B and C will be used for submission and system tests, correspondingly.

You can download the subset A data here. Folders "images" and "texts" contains images in JPEG format and patent texts in HTML format, correspondingly. For each image "images/NAME.jpg" the expected correct list of figures and part labels can be found at "figures/NAME.ans" and "parts/NAME.ans", correspondingly. Both files are in similar format. The first line contains an integer -- the number of figures (part labels) in the file. Each of the next lines describes one figure (part label) in the following format: "N x_1 y_1 x_2 y_2 ... x_N y_N title (text)", where (x_1, y_1) - (x_2, y_2) - ... - (x_N, y_N) is the bounding box for the figure (part label) and title (text) is the figure's title (part label's text). Note that it is the same format that needs to be used for return values of your solution.

Scoring

Your solution will be tested on a set of images. For each image, there are two test cases. In one test case the grader program calls getFigures and scores the return, in the other test case it does the same for getPartLabels method. The scoring mechanism is very similar in both cases, so we describe it for figures and point the differences for part labels.

First of all, you get 0 points for a test case if we were unable to get a return from your solution (due to time limit, memory limit or crash), if the return is not formatted properly or violates the restrictions enforced in the "Implementation details" section. Assuming your return is proper, it is scored as 1,000,000 * correctness score * performance score.

The performance score is calculated as 0.9 + 0.1 * ((1/max{T, 1})^0.75) (exact division, ^ denotes power operator), where T is the runtime of your solution in seconds. So for example, the runtime of 5 seconds results in ~0.9299 for performance score.

Calculation of the correctness score is a bit more complicated. First of all, let's introduce the concept of matching bounding boxes. Let B1, B2 be two bounding boxes and B3 their intersection. The two boxes are matching if area(B3) >= threshold * max{area(B1), area(B2)}, where area(Bi) is the area of box Bi and threshold is set to 0.8 for figures and to 0.3 for part labels.

Now, let A be the return value of your solution and B the expected correct return value. The grader program will calculate the intersection measure between A and B according to the following pseudocode:

```  result := 0
matched := array of length(B) boolean elements, initialized to false values
For i = 0, ..., length(A) - 1:
For j = 0, ..., length(B) - 1:
if (not matched[j]) and (bounding boxes in A[i] and B[j] are matching):
matched[j] := true
result += 0.25
if (titles in A[i] and B[j] are similar):
result += 0.75
break
```

The intersection measure is equal to the final value of variable result. When determining whether two titles are similar:

1. All characters in both titles except letters ('a'-'z', 'A'-'Z'), digits ('0'-'9'), '(' (ASCII 40), ')' (ASCII 41), '-' (ASCII 45), ''' (ASCII 39), '<' (ASCII 60), '>' (ASCII 62), '.' (ASCII 46) and '/' (ASCII 47) are removed.
2. All uppercase letters ('A'-'Z') are converted to lowercase.
3. If the resulting strings are the same, then the titles are considered to be similar, otherwise they are not similar.

Intersection measure is used to calculate two other values: precision measure = (intersection measure) / size_of(A), and recall measure = (intersection measure) / size_of(B). In cases when the size of A or B is 0, the corresponding measure is assumed to be equal to 1. Finally, your correctness score is calculated as harmonic mean of precision and recall measures.

Your overall score on a set of test cases is defined as plain sum of scores on individual test cases.

Agreements on figure titles and part label texts.

There can be many similar way to name the same figure. For example, for figure 2 its title can be printed "FIGURE 2", "Fig. 2", "Figure 2 (prior art)" and so on. For the purposes of this problem all these ways are considered to be the same and your solution is required to return only the actual title of the figure, without any accompanying words/sentences. For example, you would need to return "1a" instead of "Fig. 1a" or "2-2" instead of "Figure 2-2". You can assume that figure titles do not contain any characters except letters, digits and '-'.

If it is absolutely impossible to determine the title of figure, you need to return the empty title for it. (This is a very exceptional case and is not going to occur often in test data.)

In most cases part label texts will contain only letters and digits. However, there are relatively rare trickier cases. The advices below attempt to handle them. We do not guarantee that they cover all tricky cases or that test data is absolutely always labeled exactly according to these advices, but we made our best to cover all or almost all tricky cases correctly.
• Roman numerals. Sometimes a part label can be named using a Roman numeral. In this case, you should return the label's text using Roman notation as well. For example, "III" or "IX". Use usual Latin letters 'I', 'V', 'X', 'L', and so on to construct the return value.
• Characters '-', '/'. Some part labels can contain characters '-' or '/' in their titles. These are important characters and they need to be preserved in the return value.
• Single quotes. Some part label names can contain single quote characters. These characters are only considered important when they stand in the end of the name, for example, "16'" or "123''". However, in other cases, for example, "'Z'", they are not important and must be omitted from the return value (so, for "'Z'" you would need to return "Z"). Always use only the character ''' (ASCII 39) for single quotes.
• Brackets. Part label names can contain brackets '(', ')', for example "102(a)" or "93(1)". Brackets are important and must be preserved in the return value. However, sometimes brackets are used to distinguish different part labels. For example, "102(103)" can refer to two part labels located at the same place, "102" and "103". In these cases, a correct solution should identify two part labels and omit brackets.
• Dot character. Dot character is important only if it stands in the middle of a part label text, for example, "7.5". In this case it must be preserved. However, if it stands in the end, like "2.", it is not important and must be omitted.
• Comma character. Comma character is typically used to separate two part labels located at the same place. For example, "102,103" would usually mean two part labels with texts "102" and "103". In this case a correct solution identifies two part labels and omits the comma.
• Subscripts. Sometimes part label texts contain subscripts. You should use characters '<' and '>' to designate them. For example, "A<7>'" would mean a part label with text A7'.
• Superscripts. Part label text can also contain superscripts. They should not be designated in any special way. For example, 123b should be returned just as "123b".
Tools and additional information

An offline tester/vizualizer is provided. You can check its source code for a precise implementation of scoring calculation.

In order to receive the prize money, you will need to fully document the derivation of all parameters internal to your algorithm. If these parameters were obtained from the training data set, you will also need to provide the program used to generate these training parameters. There is no restriction on the programming language used to generate these training parameters. Note that all this data should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect this data.

### Definition

 Class: PatentLabeling Method: getFigures Parameters: int, int, int[], String[] Returns: String[] Method signature: String[] getFigures(int H, int W, int[] image, String[] text) Method: getPartLabels Parameters: int, int, int[], String[] Returns: String[] Method signature: String[] getPartLabels(int H, int W, int[] image, String[] text) (be sure your methods are public)

### Notes

-We do not provide definitions of terms "figure" and "part label" (since it is impossible to derive strict definitions). You can check the provided training set of images and the expected correct answers for them to get a better notion of these terms.
-The patent texts are not complete HTML files. Rather, each text is an excerpt from an HTML file. They are also not guaranteed to be well-formed. Please check the patent texts in the training set for a better notion of what to expect in the test data.
-The time limit is 1 minutes per test case and the memory limit is 1024 MB. We reserve the right to reduce the time limit if the computation power of our systems will prove to be not high enough and a very big testing queue will arise.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-The processing server specifications can be found here.

### Examples

0)

`getFigures call for image 6346114-10.`
 labeled image (both figures and part labels)
1)

`getPartLabels call for image 6346114-10.`
 labeled image (both figures and part labels)
2)

`getFigures call for image 7370123-2.`
 labeled image (both figures and part labels)
3)

`getPartLabels call for image 7370123-2.`
 labeled image (both figures and part labels)
4)

`getFigures call for image 11679536-2.`
 labeled image (both figures and part labels)
5)

`getPartLabels call for image 11679536-2.`
 labeled image (both figures and part labels)
6)

`getFigures call for image 7849538-2.`
 labeled image (both figures and part labels)
7)

`getPartLabels call for image 7849538-2.`
 labeled image (both figures and part labels)
8)

`getFigures call for image 7153436-3.`
 labeled image (both figures and part labels)
9)

`getPartLabels call for image 7153436-3.`
 labeled image (both figures and part labels)
10)

`getFigures call for image 11006228-3.`
 labeled image (both figures and part labels)
11)

`getPartLabels call for image 11006228-3.`
 labeled image (both figures and part labels)
12)

`getFigures call for image 7537740-3.`
 labeled image (both figures and part labels)
13)

`getPartLabels call for image 7537740-3.`
 labeled image (both figures and part labels)
14)

`getFigures call for image 10662505-2.`
 labeled image (both figures and part labels)
15)

`getPartLabels call for image 10662505-2.`
 labeled image (both figures and part labels)
16)

`getFigures call for image 7506830-2.`
 labeled image (both figures and part labels)
17)

`getPartLabels call for image 7506830-2.`
 labeled image (both figures and part labels)
18)

`getFigures call for image 7562025-2.`
 labeled image (both figures and part labels)
19)

`getPartLabels call for image 7562025-2.`
 labeled image (both figures and part labels)

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.