 Problem Statement
Contest: AutoTops Marathon
Problem: AutoTops

## Prize Distribution

• 1st - \$7,000
• 2nd - \$3,500
• 3rd - \$2,000
• 4th - \$1,500
• 5th - \$1,000
• Total - \$15,000

## Background

An aquifer is an underground layer of water-bearing, permeable rock from which groundwater can be extracted using a water well. The composition of underground rock layers affects an aquifer's quality. Changes in underground rock layer composition can be measured using instruments that record natural radiation (gamma ray, or "GR") and electrical properties (resistivity, or "RESD"). Fluctuations in underground layer "GR" and "RESD" values are plotted against depth and are referred to as "well logs", where a well log is comprised of multiple measured logs (e.g. gamma ray logs and resistivity logs). Well logs begin at the ground surface (depth = 0) with depth values increasing as the well penetrates deeper underground. Gamma ray and resistivity measurements may not be recorded over the entire length of the well. Changes in gamma ray and deep resistivity are used to identify the boundaries of different underground layers which allow for the quantification of aquifer properties.

Experts use changes in measured logs, along with their knowledge of the subsurface, to identify important aquifer boundaries. The process of identifying aquifer boundaries across hundreds or even thousands of wells is time-consuming and prone to human error. We are seeking a method to rapidly identify aquifer boundaries in unidentified wells using several expert-identified aquifer boundaries as a reference.

## Objective

Your task is to identify the depth of aquifer boundaries for eight different underground rock layers, or strata, in a well at a given location based on gamma ray logs and resistivity logs measured at a range of depths within the well.

For each well, the eight depths that your algorithm returns will be compared with the ground truth, and the quality of your solution will be judged according to how well your solution matches the ground truth. See the "Scoring" section below for details.

## Input Data Files

The complete data set contains 1,218 wells. This data has been partitioned into three data sets: training, provisional, and system. The training data set which can be used to train and improve your algorithm contains 609 wells, and the ground truth strata depths are included for these wells. The provisional data set which is used for scoring your submissions during the contest contains 183 wells. The system data set which is used for scoring your final submission at the end of the contest contains 426 wells.

The complete data set was partitioned into these subsets randomly. The data set further referred to as the testing data set is the union of the provisional and system data sets. Contestants will not be privy to the partitioning of the testing data set into provisional and system data sets.

Your algorithm will receive a well summary CSV file containing the location of all wells along with the eight strata depths for the wells contained in the ground truth. This CSV file contains a header and one row for each well in both the training and testing data sets. This CSV file will contain the following 12 columns:

• WELL_INDEX - well index number; each well has an associated CSV file of depth, GR, and RESD data
• DATASET - Training or Testing
• SURF_X - x location of well
• SURF_Y - y location of well
• ARKOSE - 1st stratum depth in feet
• BROOK - 2nd stratum depth in feet
• CONGLOMERATE - 3rd stratum depth in feet
• DEVONIAN - 4th stratum depth in feet
• RAINBOW - 5th stratum depth in feet
• SAPPHIRE - 6th stratum depth in feet
• TRIASSIC - 7th stratum depth in feet
• ZIRCON - 8th stratum depth in feet

For wells in the testing data set, all of the strata depths will be left empty. For wells in the training data set, some wells will have one or more missing strata depths. If a stratum depth is missing from the training ground truth, this means that this stratum does not physically exist at this location.
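The handling of empty strata fields can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `WellSummaryRow`, the `parse` helper, and the use of `OptionalDouble` are not part of the contest materials, the sample row values are made up, and the sketch assumes plain comma-separated fields with no quoting.

```java
import java.util.OptionalDouble;

// Hypothetical holder for one row of the well summary CSV (12 columns).
final class WellSummaryRow {
    static final int STRATA = 8;

    final int wellIndex;
    final String dataset;          // "Training" or "Testing"
    final double surfX;
    final double surfY;
    final OptionalDouble[] depths; // ARKOSE .. ZIRCON; empty when the field is blank

    private WellSummaryRow(int wellIndex, String dataset, double surfX, double surfY,
                           OptionalDouble[] depths) {
        this.wellIndex = wellIndex;
        this.dataset = dataset;
        this.surfX = surfX;
        this.surfY = surfY;
        this.depths = depths;
    }

    // Parses one data row (not the header). Assumes no quoted fields.
    static WellSummaryRow parse(String line) {
        // A limit of -1 keeps trailing empty fields, which matter for testing wells.
        String[] f = line.split(",", -1);
        if (f.length != 4 + STRATA) {
            throw new IllegalArgumentException("expected 12 columns, got " + f.length);
        }
        OptionalDouble[] depths = new OptionalDouble[STRATA];
        for (int i = 0; i < STRATA; i++) {
            String s = f[4 + i].trim();
            depths[i] = s.isEmpty()
                    ? OptionalDouble.empty()
                    : OptionalDouble.of(Double.parseDouble(s));
        }
        return new WellSummaryRow(Integer.parseInt(f[0].trim()), f[1].trim(),
                Double.parseDouble(f[2].trim()), Double.parseDouble(f[3].trim()), depths);
    }

    public static void main(String[] args) {
        // Made-up training row: BROOK, DEVONIAN, RAINBOW, SAPPHIRE, TRIASSIC are missing.
        WellSummaryRow r = parse("17,Training,1500.0,2300.5,410.0,,655.5,,,,,1200.0");
        System.out.println("BROOK present: " + r.depths[1].isPresent());
    }
}
```

Keeping "missing" distinct from any numeric value (rather than using 0 or -1) makes it harder to confuse an absent stratum with a real depth later on.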

Your algorithm will also receive one depth data csv file for each well in both the training and testing data sets. Each csv file contains a header and one row for each measurement of gamma radiation and resistivity. Each row will contain the following columns:

• DEPT - Depth in feet
• GR - Gamma radiation in API units
• RESD - Deep resistivity in ohm-meters

The zipped csv files with the instrument data for each well can be downloaded here.

## Output File

This contest uses the result submission style. For the duration of the contest, you will run your solution locally and produce a CSV file which contains your results.

Your output file must be a CSV file, without a header, which contains one row for each well in the testing data set. The CSV file should have only the nine following columns, in this order:

• WELL_INDEX - well index number
• ARKOSE - 1st stratum depth in feet
• BROOK - 2nd stratum depth in feet
• CONGLOMERATE - 3rd stratum depth in feet
• DEVONIAN - 4th stratum depth in feet
• RAINBOW - 5th stratum depth in feet
• SAPPHIRE - 6th stratum depth in feet
• TRIASSIC - 7th stratum depth in feet
• ZIRCON - 8th stratum depth in feet

You must include predictions for all wells in the testing data set (609 wells). Do not include any wells from the training data set in this file.
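As a rough illustration of the row layout, the Java sketch below formats one output line. The class `OutputRow` and its `format` method are hypothetical names. The statement does not say how to express "no prediction" for a stratum, so leaving the field empty is an assumption here, as is the choice of two decimal places.

```java
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical helper that formats one header-less output row:
// WELL_INDEX followed by the eight strata depths in the required order.
final class OutputRow {
    static String format(int wellIndex, OptionalDouble[] depths) {
        if (depths.length != 8) {
            throw new IllegalArgumentException("expected 8 strata depths");
        }
        StringBuilder sb = new StringBuilder().append(wellIndex);
        for (OptionalDouble d : depths) {
            sb.append(',');
            if (d.isPresent()) {
                // Locale.ROOT forces '.' as the decimal separator on every machine.
                sb.append(String.format(Locale.ROOT, "%.2f", d.getAsDouble()));
            }
            // Assumption: an absent prediction is written as an empty field.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OptionalDouble[] depths = {
            OptionalDouble.of(410.0), OptionalDouble.empty(), OptionalDouble.of(655.5),
            OptionalDouble.empty(), OptionalDouble.empty(), OptionalDouble.empty(),
            OptionalDouble.empty(), OptionalDouble.of(1200.0)
        };
        System.out.println(format(17, depths)); // prints 17,410.00,,655.50,,,,,1200.00
    }
}
```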

## Functions

During the contest, only your output CSV file, containing the provisional results, will be submitted. In order for your solution to be evaluated by Topcoder's marathon match system, you must implement a class named AutoTops, which implements a single function: getAnswerURL(). Your function will return a String corresponding to the URL of your submission file. You may upload your output file to a cloud hosting service such as Dropbox or Google Drive, which can provide a direct link to the file.

To create a direct sharing link in Dropbox, right click on the uploaded file and select share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getAnswerURL() function.
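The tag change described above is a simple suffix swap. A minimal Java sketch, where the class `DropboxLink`, the method `toDirect`, and the sample URL are all hypothetical:

```java
// Hypothetical helper: turns a Dropbox share link ending in "?dl=0"
// into a direct-download link ending in "?dl=1".
final class DropboxLink {
    private static final String SHARE_TAG = "?dl=0";
    private static final String DIRECT_TAG = "?dl=1";

    static String toDirect(String shareUrl) {
        if (shareUrl.endsWith(SHARE_TAG)) {
            String base = shareUrl.substring(0, shareUrl.length() - SHARE_TAG.length());
            return base + DIRECT_TAG;
        }
        // Links without the expected tag are returned unchanged.
        return shareUrl;
    }

    public static void main(String[] args) {
        // Made-up URL for illustration only.
        System.out.println(toDirect("https://www.dropbox.com/s/abc123/solution.csv?dl=0"));
        // prints https://www.dropbox.com/s/abc123/solution.csv?dl=1
    }
}
```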

Note that Google Drive has a file size limit of 25MB and can't provide direct links to files larger than this. (For larger files the link opens a warning message saying that automatic virus checking of the file is not done.)

You can use any other way to share your result file, but make sure the link you provide opens the filestream directly, and is available for anyone with the link (not only the file owner), to allow the automated tester to download and evaluate it.

An example of the code you have to submit, using Java:

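```java
public class AutoTops
{
    public String getAnswerURL()
    {
        // Replace the returned String with your submission file's URL
        return "https://example.com/your-submission.csv"; // placeholder, not a real file
    }
}
```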

Keep in mind that the complete code that generates these results will be verified at the end of the contest if you achieve a score in the top 10, as described later in the "Requirements to Win a Prize" section. You will be required to provide fully automated, executable software to allow independent verification of your algorithm's performance and the quality of its output data.

## Scoring

Example submissions will be scored against the entire training data set. Provisional submissions should include predictions for all wells in the testing data set. Provisional submissions will be scored against the provisional data set during the contest. Your final provisional submission will be scored against the system data set at the end of the contest.

When a stratum exists in the ground truth and a prediction is made, the penalty for each stratum depth prediction is:

P_stratum = abs(prediction - ground_truth)

When a stratum exists in the ground truth but no prediction is made, this constitutes a false negative, and the penalty is 250:

P_stratum = 250

When a stratum does not exist in the ground truth but a prediction is made, this constitutes a false positive, and the penalty is 500:

P_stratum = 500

The penalty for each well is the sum of penalties for each stratum which contains a ground truth entry:

P_well = sum(P_stratum)

The total penalty for all wells in the relevant data set is:

P_total = sum(P_well)

The maximum possible total penalty is computed as follows:

P_stratum_max = max(max_depth - ground_truth, ground_truth - min_depth)

where min_depth and max_depth are the minimum and maximum depth of the depth data CSV file for the well. Note that in rare cases, the ground truth might lie outside this range.

The maximum penalty for each well is the sum of maximum penalties for each stratum which contains a ground truth entry:

P_well_max = sum(P_stratum_max)

The maximum total penalty for all wells in the relevant data set is:

P_total_max = sum(P_well_max)

Your final score is calculated as follows:

Score = 1,000,000 * (1 - P_total / P_total_max)^10
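The penalty rules can be put together in a short Java sketch. It is not the official scorer: the class `AutoTopsScore` and its method names are hypothetical, the sample depths are made up, and it rests on three assumptions: the bracketed term in the score is raised to the 10th power, false-positive penalties are added to the well total, and a stratum absent from both the ground truth and the prediction costs nothing.

```java
import java.util.OptionalDouble;

// Hypothetical local scorer following the penalty rules in this section.
final class AutoTopsScore {
    static final double FALSE_NEGATIVE = 250.0;
    static final double FALSE_POSITIVE = 500.0;

    // P_stratum for one stratum of one well.
    static double stratumPenalty(OptionalDouble truth, OptionalDouble pred) {
        if (truth.isPresent() && pred.isPresent()) {
            return Math.abs(pred.getAsDouble() - truth.getAsDouble());
        }
        if (truth.isPresent()) return FALSE_NEGATIVE; // stratum exists, no prediction
        if (pred.isPresent()) return FALSE_POSITIVE;  // prediction for a missing stratum
        return 0.0;                                   // assumption: absent in both
    }

    // P_stratum_max for a stratum that has a ground truth entry.
    static double stratumMaxPenalty(double truth, double minDepth, double maxDepth) {
        return Math.max(maxDepth - truth, truth - minDepth);
    }

    // Score from the two totals; assumes the exponent is 10.
    static double score(double pTotal, double pTotalMax) {
        return 1_000_000.0 * Math.pow(1.0 - pTotal / pTotalMax, 10);
    }

    public static void main(String[] args) {
        // One made-up well logged from 100 ft to 1100 ft with two strata in the truth.
        double pTotal = stratumPenalty(OptionalDouble.of(400), OptionalDouble.of(410))
                      + stratumPenalty(OptionalDouble.of(900), OptionalDouble.of(880));
        double pTotalMax = stratumMaxPenalty(400, 100, 1100)
                         + stratumMaxPenalty(900, 100, 1100);
        System.out.println("P_total = " + pTotal + ", P_total_max = " + pTotalMax);
        System.out.println("Score = " + score(pTotal, pTotalMax));
    }
}
```

In this made-up case the penalties are 10 and 20 against maxima of 700 and 800, so the ratio is 30 / 1500 = 0.02. Under the 10th-power assumption, even a 2% relative error costs roughly 18% of the maximum score, so small depth errors matter.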

## Final Scoring

The top 10 competitors with non-zero provisional scores are asked to participate in a two-phase final verification process. Participation is optional but necessary for receiving prizes.

Phase 1. Code Review

Within 2 days of the end of the submission phase, you must package the source code you used to generate your latest submission and send it to jsculley@copilots.topcoder.com and tim@copilots.topcoder.com so that we can verify that your submission was generated algorithmically. We won't try to run your code at this point, so you don't have to package data files, model files, or external libraries; this is just a superficial check to see whether your system looks convincingly automated. If you pass this screening, you'll be invited to Phase 2.

Phase 2. Online Testing

You will be given access to an AWS VM instance. You will need to load your code to your assigned VM, along with three scripts for running it:

• compile.sh should perform any necessary compilation of the code so that it can be run. It is possible to upload precompiled code, so running compile.sh is optional. But even if you use precompiled code your source code must be uploaded as well.
• train.sh <training_data_folder_path> <model_folder_path> should create any data files that your algorithm needs for running test.sh later. All data files created must be placed in the supplied model folder. The allowed time limit for the train.sh script is 12 hours. The training data folder contains the same data as is available to you during the contest. It should be possible to skip the training step on the server and upload a model file that you created previously.
• test.sh <testing_data_folder_path> <model_folder_path> <output_file> should run the code using new, unlabeled data and should generate the output CSV file. The allowed time limit for the test.sh script is 6 hours. The testing data folder contains similar data in the same structure as in the downloaded provisional testing data. The model folder is the same as was used in the call of train.sh, so you can use any data files that were created during training.

Your solution will be validated. We will check whether it produces the same output file as your last provisional submission, using the same testing input files used in this contest. We are aware that it is not always possible to reproduce the exact same results. For example, if you do online training, then the difference in the training environments may result in a different number of iterations, meaning different models. Also, you may have no control over random number generation in certain third-party libraries. In any case, the results must be statistically similar, and in case of differences you must have a convincing explanation of why the same result cannot be reproduced.

Competitors who fail to provide their solution as expected will receive a zero score in this final scoring phase, and will not be eligible to win prizes.

## General Notes

• This match is not rated.
• You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client.
• If your solution includes licensed software (e.g. commercial software, open source software, etc), you must include the full license agreements with your submission. Include your licenses in a folder labeled "Licenses". Within the same folder, include a text file labeled "README" that explains the purpose of each licensed software package as it is used in your solution.
• The usage of external resources is allowed as long as they are freely available.
• Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

## Requirements to Win a Prize

In order to receive a prize, you must do all the following:

1. Achieve a non-zero score in the top 5, according to system test results, and send your source code for code review. See the "Final Scoring" section above for details on these steps.
2. Within three (3) days of the end of the code review phase, provide the copilot and administrator with the VM requirements for your solution. Within five (5) days of the date you receive the VM, the VM must be set up with your solution so that Topcoder can easily validate and run it. Once the final scores are posted and winners are announced, the top five (5) winners have seven (7) days to submit a report that outlines the final algorithm and explains its logic and the steps of its approach. You will receive a template to help you create your final report.

### Definition

• Class: AutoTops
• Method: getAnswerURL
• Parameters: (none)
• Returns: String
• Method signature: String getAnswerURL() (be sure your method is public)

### Examples

0)

`Seed: 1`

1)

`Seed: 100`

2)

`Seed: 100`

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.