Bonus Prize - $100 per winning submission
We will ask the 5 winners to submit a Docker file for their solution. We will pay the winners who submit a Docker file an additional $100 each for their efforts.
Requirements to Win a Prize
In order to receive a prize, you must do all of the following:
- Achieve a score in the top 5, according to system test results. See the "Scoring" section below.
- Within 7 days from the announcement of the challenge winners, submit:
- A complete report at least 2 pages long outlining your final algorithm and explaining the logic behind it and the steps in your approach. The required content appears in the "Report" section below.
- All code used in your final algorithm in 1 appropriately named file (or tar or zip archive). The naming convention is memberHandle-fooContest. For example, the member with handle "johndoe" would name the submission "johndoe-fooContest". The file should include your final model, saved so that it can make predictions without being retrained. We will contact prospective winners by email to request this file.
If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.
In view of the format of the match and the nature of the data, we expect to need at least 1 week to confirm the winning solutions run successfully and to replicate the system test results. We thank you in advance for your patience.
Prostate cancer is the second most common cancer in US men and the 4th most common tumor diagnosed worldwide. In 2015 alone, it is estimated that over 220,000 US men will be diagnosed with prostate cancer, and more than 27,000 will die from the disease.
Prostate cancer is diagnosed by tissue biopsy. Pathologists examine biopsies and classify them with respect to certain characteristics (e.g., cancerous, benign, cancer type, cancer stage). Physicians make clinical decisions on the basis of those classifications.
Micrograph of prostate adenocarcinoma, acinar type, the most common type of prostate cancer. Needle biopsy, H&E stain (Source: Wikipedia)
The genomes of cancer cells are very different from those of healthy cells. One very common change in tumor cells is the loss (deletion) or gain (duplication) of specific pieces of DNA, a process called chromosomal instability. Losses and gains can affect entire chromosomes, parts of chromosomes, or sometimes just individual genes. Certain losses or gains can give the tumor cells a growth advantage, converting a normal cell into a cancerous cell. The prostate cancers of different patients can have a wide range of changes, ranging from almost no DNA changes to heavily altered genomes.
The extent of these gains and losses, also called copy-number changes, in an individual tumor can be measured experimentally from DNA extracted from the tissue and quantified as the “fraction of genome altered” (FGA). This is the fraction of DNA in tumor cells that has gains or losses. As genomic data has become easier and less expensive to obtain, such information is now beginning to inform prognosis and treatment decisions, and the FGA of prostate cancer tissue could become a predictor of survival.
This gene duplication has created a copy-number alteration. The chromosome now has two copies of this section of DNA, rather than one. (Source: Wikipedia)
Currently, physicians use an expensive, 2-week-long genetic test to determine the FGA of a tumor sample. The purpose of this challenge is to develop a quicker and less expensive way to infer the FGA of a tumor sample from images of a stained tumor section (these are routinely available for all tumors). This new method could then help identify the patients for whom genetic testing might be useful.
In the customary training and test framework, create an algorithm that uses images of cancerous prostate tissue biopsies to predict the FGA values of those tissues.
For each patient, there is a pair of images and 1 FGA measurement. Both images in the pair relate to the same FGA measurement, and you can use either member of the pair individually or both members simultaneously to predict FGA. The differences between the 2 types of images are explained in the “Data Description” section immediately below. We do not know which type of image is best to use, so we suggest you experiment and use whatever works best.
A training data set of 131 pairs of biopsy images, one pair for each of 131 cases (i.e., patients), can be found here. Images with "DX" in the file name were taken of biopsies preserved in paraffin wax, and those with "TS" in the file name were taken of biopsies that were frozen for preservation. Test images (140 pairs) are available in the same folder. There are 542 image files in total, about 28 GB of data.
The DX images have higher resolution than the TS images and capture more detail of the tissue itself (the freezing process can destroy certain cellular structures). DNA from a region close to the frozen tissue section in the TS images was processed and used to measure the FGAs. As a result, the FGA values possibly have a closer connection to the TS images than to the DX images.
A file containing the FGA of each training case can be found here.
Data Format Requirements
A single .csv file must be created which contains your algorithm’s FGA predictions for the given data set. The .csv file should not contain any header, must contain one line for each case in the given data set, and must contain the following columns:
Column   Contents   Format
1        Case ID    Character
2        FGA        Float, between 0 and 1 inclusive
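As an illustration of the format above, here is a minimal Python sketch that produces a headerless two-column file. The function names, the case IDs in the usage note, and the six-decimal formatting are assumptions made for the example. Clipping predictions into [0, 1] is likewise an assumption about how to satisfy the stated range, not a rule from this statement.

```python
import csv
import io


def format_predictions(predictions):
    """Return CSV text with one 'case_id,fga' line per case and no header.

    `predictions` maps a case ID (string) to a predicted FGA (number).
    Values are clipped into [0, 1]; this is an assumption made here so
    that every row satisfies the stated range.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for case_id, fga in predictions.items():
        clipped = min(1.0, max(0.0, float(fga)))
        writer.writerow([case_id, f"{clipped:.6f}"])
    return buffer.getvalue()


def write_predictions(path, predictions):
    """Write the formatted predictions to `path`."""
    with open(path, "w", newline="") as handle:
        handle.write(format_predictions(predictions))
```

For example, write_predictions("predictions.csv", {"case-A": 0.25, "case-B": 0.4}) writes two lines, one per case, where "case-A" and "case-B" are placeholder IDs.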
During the contest, you will submit a single .csv file containing your algorithm’s predicted FGA measurements. You will submit the .csv using code which implements only one function, getAnswerURL(). Your function will return a String corresponding to the URL of your .csv file. You may upload your .csv file to a cloud hosting service such as Dropbox or Google Drive, which can provide a direct link to the file.
To create a direct sharing link in Dropbox, right click on the uploaded file and select share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getAnswerURL() function.
If you use Google Drive to share the link, then please use the following format: "https://drive.google.com/uc?export=download&id=" + id
You can use any other way to share your result file, but make sure the link you provide opens the filestream directly.
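The link rules above can be sketched in Python as follows. The helper names and the example share URL are hypothetical, and any wrapper the submission system expects around getAnswerURL() (for instance a class) is not specified here, so this shows only the function itself.

```python
def dropbox_direct_link(share_url):
    """Turn a Dropbox share link ending in '?dl=0' into a direct link."""
    suffix = "?dl=0"
    if not share_url.endswith(suffix):
        raise ValueError("expected a Dropbox share link ending in '?dl=0'")
    return share_url[: -len(suffix)] + "?dl=1"


def google_drive_direct_link(file_id):
    """Build a Google Drive download link from a file id."""
    return "https://drive.google.com/uc?export=download&id=" + file_id


def getAnswerURL():
    # Placeholder share link; replace it with the link to your own .csv file.
    return dropbox_direct_link(
        "https://www.dropbox.com/s/EXAMPLE/predictions.csv?dl=0"
    )
```

Before submitting, it is worth opening the returned URL in a private browser window to confirm that it downloads the file directly rather than showing a preview page.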
Your complete code that generates these results will be tested at the end of the contest if you achieve a score in the top 5, as described in the “Requirements to Win a Prize” section.
We will run both provisional and system tests using your .csv file submission, but you will not know which cases are used for which test. We will score your submission as described below.
Define the following:
MSE = [Sum of (Ai - Pi)^2, i from 1 to N] / N
N = number of test cases
Ai = Actual FGA
Pi = Your algorithm prediction
Score = 1,000,000,000 / (100,000 * MSE + 1,000)
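To estimate this score locally, for example on a held-out part of the training data, the definitions above can be written as a short Python sketch. The function names are chosen for the example; the official scorer itself is not shown in this statement.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted FGA values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return total / len(actual)


def score(actual, predicted):
    """Score = 1,000,000,000 / (100,000 * MSE + 1,000)."""
    return 1_000_000_000 / (100_000 * mse(actual, predicted) + 1_000)
```

With this formula, perfect predictions (MSE = 0) score 1,000,000, and an MSE of 0.01 gives a score of 500,000.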
If your answer URL is not accessible or if it does not contain the expected .csv format, your score will be zero.
Your report must be at least 2 pages long and contain at least the following sections.
- First Name
- Last Name
- Topcoder handle
- Email address
- Final code submission file name
Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.
This section must contain at least the following:
- Approaches considered
- Approach ultimately chosen
- How TS and DX images were used
- Steps to approach ultimately chosen, including references to specific lines of your code
- Advantages and disadvantages of the approach chosen
- Comments on libraries
- Comments on open source resources used
- Special guidance given to algorithm based on training
- Potential improvements to your algorithm
Installation and Usage:
Please give us everything we need to run your algorithm successfully.
- Operating system
- Program(s) and version(s)
- External libraries
- Version numbers
- URLs for download
- Step by step installation and usage instructions
Notes on Data Sets
The data set used in this contest contains 271 cases:
- 131 will be used as training data
- 40 for provisional tests
- 100 for system tests
Your answer .csv file should contain FGA predictions for all 140 test cases (i.e., for both provisional and system test cases).
There are no example test cases, but you can use example submissions to check if your .csv file is accessible and has the expected format.
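A local check before submitting can catch format problems early. The sketch below is a hypothetical validator, not the official one: it assumes the rules stated in this document (no header, two columns, FGA between 0 and 1 inclusive, 140 rows) and additionally treats duplicate case IDs as an error, which is an assumption.

```python
import csv
import io


def validate_submission(csv_text, expected_rows=140):
    """Return a list of problems found in the CSV text (empty if none)."""
    problems = []
    seen = set()
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if len(row) != 2:
            problems.append(f"row {number}: expected 2 columns, found {len(row)}")
            continue
        case_id, raw_value = row
        if case_id in seen:
            problems.append(f"row {number}: duplicate case ID {case_id!r}")
        seen.add(case_id)
        try:
            value = float(raw_value)
        except ValueError:
            # A header line such as 'id,fga' ends up here.
            problems.append(f"row {number}: FGA {raw_value!r} is not a number")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"row {number}: FGA {value} is outside [0, 1]")
    return problems
```

An empty list means the file passed these checks; it does not guarantee that the case IDs match the ones the scorer expects.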
Notes on Time Limits
If you are eligible to win a prize, your final code will be executed on an Amazon m4.xlarge or g2.2xlarge machine running Linux and must run in 5 minutes for a single test case (i.e. time to predict the FGA of one case, not including a previously executed training phase, which may have taken longer).
This match is rated.
You may use any programming language and libraries, provided they are free and open source, both for you and for the client.
Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.
If you want to check integrity of downloaded images, their MD5 is available here.
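A minimal Python sketch of such an integrity check follows. The layout of the published MD5 list is not described here, so the expected digest is simply passed in by the caller, and the function names are chosen for the example. Files are read in chunks because individual images can be large.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def file_matches(path, expected_md5):
    """Compare a file's MD5 with the expected hex digest, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch usually means the download was truncated or corrupted, in which case the image should be downloaded again.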