IMPORTANT: This contest has a two-phase format: a quantitative round (this marathon match) followed by a qualitative round, in which you will have a chance to support your submissions qualitatively. Please read the end of this statement carefully for more details. The qualitative round has a very exciting prize purse in addition to the MM prizes, and every member who submits in this match is encouraged and welcome to submit in that round.
The best 4 performers of this contest (according to system test results) will receive the following prizes:
- 1st place: $4,000
- 2nd place: $3,000
- 3rd place: $2,000
- 4th place: $1,000
Please note: The above prizes will be awarded to the top marathon match finishers. The prizes of the follow-up qualitative round mentioned earlier are in addition to the above prizes. Please see the end of this statement for details on that prize structure.
ToxCast Project Background
In 2005, the federal government launched Tox21, an initiative to use in vitro high-throughput screening (HTS) to identify what proteins, pathways, and cellular processes chemicals interact with and at what concentration they interact. The goal is to use the screening data to more cost-effectively and efficiently prioritize the thousands of chemicals that need toxicity testing and, in the future, predict the potential human health effects of chemicals. Tox21 currently pools the resources and expertise of the EPA, the National Institute of Environmental Health Sciences/National Toxicology Program, the National Institutes of Health/National Center for Advancing Translational Sciences, and the Food and Drug Administration to screen almost 10,000 chemicals.
One of EPA’s main contributions to Tox21 is the Toxicity Forecaster, or ToxCast for short. The first phase of ToxCast was designed as a "proof-of-concept" and was completed in 2009. It evaluated approximately 300 chemicals in over 500 biochemical and cell-based in vitro HTS assays. The 300 chemicals selected for the first phase were primarily data-rich pesticides that have a large battery of in vivo toxicity studies performed on them. Data collection for the second phase of ToxCast was completed in 2013. The second phase evaluated approximately 1,800 chemicals in an expanded set of over 700 biochemical and cell-based in vitro HTS assays. The 1,800 chemicals came from a broad range of sources, including industrial and consumer products, food additives, and potentially "green" substances that could be safer alternatives to existing chemicals. These chemicals were not as data-rich as those selected for the first phase, and many do not have in vivo toxicity studies. The in vitro data are accessible through the interactive Chemical Safety for Sustainability Dashboard (iCSS), and raw data files are also posted to the Dashboard web page.
In addition to the in vitro HTS data, the EPA has created a complementary Toxicity Reference Database (ToxRefDB), which comprehensively captures results from in vivo animal toxicity studies. ToxRefDB provides detailed chemical toxicity data from over 30 years and $2 billion in animal testing, in a publicly accessible and computable format.
Motivation and Definition of LEL
People are exposed to many man-made chemicals throughout their lives. These include food ingredients and additives, pesticides, cosmetics, medicines, cleaners, solvents, etc. Historically, a series of standard animal studies have been used as a means to evaluate whether a chemical can cause a range of different adverse effects and at what dose these effects occur. The term "systemic toxicity" is often used because the effects can occur in different organ systems such as the liver, kidney, lungs, etc.
The systemic Lowest Effect Level or LEL is the lowest dose that shows adverse effects in these animal toxicity tests. The LEL is then conservatively adjusted in different ways by regulators to derive a value that can be used by the Agency to set exposure limits that are expected to be tolerated by the majority of the population.
Ideally, every chemical to which we are exposed would have a well-defined LEL. However, the full battery of animal studies required to estimate the LEL costs millions of dollars and takes many months to complete. As a result, thousands of chemicals lack the required data needed to estimate an LEL. To help fill this gap, the EPA has screened nearly 2,000 chemicals across a battery of more than 700 biochemical and cell-based in vitro assays to identify what proteins, pathways, and cellular processes these chemicals interact with and at what concentration they interact.
The goal of this challenge is to build a prediction model (algorithm) using data from high-throughput in vitro assays, chemical properties, and chemical structural descriptors to quantitatively predict a chemical’s systemic LEL.
We have provided the complete data set, consisting of chemical information and assay data, bundled into an archive – ToxCast_MM_data.zip. There are a total of 1854 chemicals for which the complete input data is available, and these are the chemicals for which the LEL is to be predicted. The ground truth values for some of these chemicals have been provided in the training file, while the other chemicals will be used for testing, as explained in the "Testing and scoring" section.
Following is a brief explanation of the contents of the data archive:
a) ToxRefDB_Challenge_Training.csv: This file contains the list of all 1854 chemicals for which the LEL value is to be predicted. Each row contains one chemical. Each chemical has a CASRN_ID, which can be used to map the chemical across all other files in the data set. The file also contains the name of each chemical. Finally, it contains the ground truth LEL value for the set of chemicals to be used for training. The LEL values of the other chemicals (to be used in testing) are marked with “?”, and you are required to predict those values.
IMPORTANT NOTE: Due to the change described here, the ToxRefDB_Challenge_Training.csv within the data archive is outdated. Please download the updated version of the file here.
Please Note: The LEL values in this file are in negative log units.
b) Chemical Lists and Annotations: Contains the most current, official public listing of all 1,854 ToxCast chemicals. It also contains files used to analyze chemical structure (fingerprinting & SMILES). These files are to be used in combination with the libraries listed in the accompanying document to fetch the structural features of the chemicals and incorporate that knowledge into the model. This information should also be used in conjunction with the ToxCast Assay Data (described below) to improve the model.
c) ToxCast Assay Data: Contains high-throughput in vitro data on 1,854 chemicals from more than 700 high-throughput screening assays. It also contains concentration response files and curve fits as well as detailed descriptions of all the assays. These data are to be used in combination with the provided libraries and the structural features of the chemicals to develop the model to predict the LEL.
d) EPA ToxCast - Chemical Structural Descriptors (pdf version).pdf: Provides information about various libraries and tools that can be used to capture the structural properties of chemicals and use them as features in the prediction model. In addition to the libraries and tools listed in this pdf, we would also like to draw your attention to a tool developed by EPA called DSSTox (http://www.epa.gov/ncct/dsstox/). It provides a public forum for publishing downloadable, structure-searchable, standardized chemical structure files associated with chemical inventories or toxicity data sets of environmental relevance. If you want to use it, you can find various help topics at http://www.epa.gov/ncct/dsstox/Help.html, http://www.epa.gov/ncct/dsstox/FAQ.html and http://www.epa.gov/ncct/dsstox/About.html.
e) ToxCast_Data_guide.pdf: A comprehensive guide to understanding all of the above data. As the data is very domain-specific, we want you to feel comfortable with it, and we hope this guide will help you understand the data set in a very simple manner. The guide itself refers to a ToxCast publication that was used to create it. We will also provide that publication in the forums.
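As a starting point, the training file can be split into the chemicals with known LELs and those marked "?". The sketch below assumes the column names CASRN_ID and LEL; check the actual header of ToxRefDB_Challenge_Training.csv before relying on them.

```python
import csv

def split_training_rows(csv_lines):
    """Split the chemical list into rows with a known LEL (training)
    and rows whose LEL is marked '?' (to be predicted).
    The column names used here (CASRN_ID, LEL) are assumptions."""
    train, to_predict = [], []
    for row in csv.DictReader(csv_lines):
        if row["LEL"].strip() == "?":
            to_predict.append(row)
        else:
            # store the ground-truth LEL as a float (negative log units)
            row["LEL"] = float(row["LEL"])
            train.append(row)
    return train, to_predict
```

The function accepts any iterable of lines, so it works equally well with an open file handle or a list of strings.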
#1. All the data was initially received as .txt and/or .xlsx files and has been converted to .csv format by topcoder. If you find any issues using any of the .csv files, please ask immediately in the forums and we will help resolve the issue as quickly as possible. For reference, the direct links to the original data set provided by EPA are:
- Chemical list & Annotation Files (DSS Tox Information): http://epa.gov/ncct/toxcast/files/Chemical%20list%20&%20Annotation%20Files%2012-11-2013.zip
- High-Throughput Chemical Screening Data from ToxCast: http://epa.gov/ncct/toxcast/files/ToxCast_AssayData_2013_12_11.zip
- ToxCast Phase II Data Release README File: http://epa.gov/ncct/toxcast/files/ToxCast%20Release%202013-12-10%20README.pdf
Please make sure to ask in forums if you plan to use any part of this original data.
#2. Finally, you can visit http://en.wikipedia.org/wiki/Cheminformatics for more information on the use of computational and informational techniques in the field of chemistry.
This match will be different from many other topcoder marathon matches. Usually, in a marathon match competitors submit implementations of their models, which are executed on topcoder servers. In this match, you will need to execute your model in your local environment, obtain predictions for all chemicals and then submit the results for evaluation. The code that you execute locally can be written in any language; it is fine if this language is not supported by the topcoder marathon match engine. There is only one restriction: both topcoder and EPA must be able to compile (if necessary) and execute your solution using freely available software only.
As topcoder currently does not support evaluation of data files, you will still be required to submit source code (in one of the supported languages: C++, Java, C#, Python, VB). This code will need to contain a definition of a class LELPredictor, and this class should contain a single method named predictLEL. This method will not have any parameters, as it is only expected to return the results of your predictions, not to perform any actual computations. The return value must be an array of doubles containing exactly 1854 elements. The i-th element should be the LEL value that your model predicts for the i-th chemical in the ToxRefDB_Challenge_Training.csv file.
#1. It is extremely important to keep the order of chemicals in this array the same as in the training file. Your scoring will be done based on the values in this array, and if you do not maintain the same order of chemicals as in the training file, the final score will come out wrong. Please double-check that you do not lose the order of chemicals in the process of prediction.
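One simple way to guard against ordering mistakes is to key your predictions by chemical ID and emit them in the training file's row order. The sketch below assumes you already have the IDs in file order and a dictionary of predictions:

```python
def order_predictions(ordered_ids, predictions_by_id):
    """Emit predictions in exactly the order the chemical IDs appear
    in the training file. A KeyError here means a chemical was
    dropped somewhere in the pipeline, which is worth catching early."""
    return [predictions_by_id[chem_id] for chem_id in ordered_ids]
```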
#2. The unit of the LEL values in the training file is negative log, and you should use the same unit when submitting your results. If you submit raw values without this unit conversion, it may lead to large RMSEs and thus negative scores.
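The conversion itself is a one-liner; the sketch below assumes the negative log is base 10 and that the raw LEL is a positive dose value. Confirm the exact convention (log base and dose units) in the forums before relying on it.

```python
import math

def to_negative_log(dose):
    """Convert a raw LEL dose to negative log units.
    Assumes a base-10 logarithm; the exact convention is an assumption."""
    if dose <= 0:
        raise ValueError("dose must be positive")
    return -math.log10(dose)
```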
#3. Although you are still allowed to perform any kind of computations in this method (within time and memory limit), we recommend that you just hardcode your predictions into the method and return a constant array/vector/tuple (depending on the language used). Please note that the language that you use to submit your predictions into TopCoder system does not need to match the language (or languages) that you use to compute these predictions in your local environment. For example, it is perfectly valid to compute predictions locally using R or MATLAB and then submit them using C++ or Java.
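In Python, one of the supported submission languages, a minimal hardcoded stub of the required class could look like this (the placeholder values are illustrative only; replace them with the output of your locally run model):

```python
class LELPredictor:
    """Submission stub that simply returns precomputed predictions."""

    def predictLEL(self):
        # Replace these placeholders with the 1854 values produced by
        # your locally run model, in the same order as the chemicals
        # in ToxRefDB_Challenge_Training.csv (negative log units).
        predictions = [0.0] * 1854
        return predictions
```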
Testing and scoring
As already specified, the data in this match consists of 1854 chemicals. Internally, these chemicals have been split into 3 groups:
- Group A contains 483 chemicals,
- Group B contains 143 chemicals, and
- Group C contains the remaining 1228 chemicals.
We will not disclose the information on which group each chemical belongs to. Please note that the split between groups is not guaranteed to be random, so different groups can have different characteristics. We only have ground truth available for groups A and B.
This problem has 1 example, 1 provisional and 1 system test case. The score for each test case will be evaluated using the RMSE metric. However, this metric will be applied only to a certain subset of the data:
- For the example test case, this subset is the training data (the data for which we have shared LEL values with you). It contains all chemicals from group A.
- For the provisional test case, this subset contains 63 chemicals randomly selected from group B.
- For the system test case, this subset contains the remaining 80 chemicals from group B.
For n chemicals, if x_i is the ground truth LEL and y_i is the predicted LEL value for chemical i, then your score will be calculated as:
Score = 1,000,000.0 * (2 - SquareRoot(((x_0 - y_0)^2 + (x_1 - y_1)^2 + ... + (x_{n-1} - y_{n-1})^2) / n))
where n = 483 for the example test case, n = 63 for the provisional test case and n = 80 for the system test case. Your objective here is to maximize the score.
Note: if any of the y_i values is outside the [-50, 50] range (and thus is obviously a very bad prediction), it will be changed to -50 (if it is negative) or 50 (if it is positive) before the score is calculated. If we are unable to get any return value from your solution (time limit, memory limit, runtime error, etc.), or if it contains an improper number of elements, your score will be set to -100,000,000.
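Putting the formula and the clipping rule together, the scoring can be sketched as follows. This is an illustration of the stated rules, not the official scorer:

```python
import math

def score(truth, predictions):
    """Sketch of the match scoring: clip out-of-range predictions to
    [-50, 50], compute the RMSE against the ground truth, then map it
    to the final score with Score = 1,000,000 * (2 - RMSE)."""
    clipped = [max(-50.0, min(50.0, y)) for y in predictions]
    n = len(truth)
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(truth, clipped)) / n)
    return 1_000_000.0 * (2.0 - rmse)
```

A perfect prediction gives an RMSE of 0 and thus a score of 2,000,000; any RMSE above 2 drives the score negative, which is why the unit conversion noted above matters.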
The purpose of example, provisional and system test cases is as follows:
- Your solution is tested on the example test case when you make an example submission. You receive a complete test log which includes, in particular, your score, your runtime and everything you output to the standard output and error streams.
- Your solution is tested on the provisional test case when you make a full submission. You receive only your score. Your score and the up-to-date scores of all your competitors are shown in the match standings.
- Your last full submission is tested on the system test case to determine your final score. The competitors with the highest final scores win the match.
- You are allowed to use any programming language when you implement your model (to be executed at your local environment). You are also allowed to use multiple languages. The only real restriction is that topcoder and EPA should be able to compile (if necessary) and execute your code using freely available software. The client’s preferred programming languages are Java, R and MATLAB, so please consider using one of them if it is among the languages you’re most comfortable with.
- You are allowed to use any libraries (open source or not), provided that topcoder and EPA are able to use these libraries for free. If you doubt whether a particular language/library can be used, you can always ask us.
- You are allowed and encouraged to use any of the libraries listed in the Chemical Descriptor.pdf document for fetching structural information about chemicals. No other libraries may be used for this purpose.
- You are strictly forbidden to use any data other than the data that we provide to you in this competition.
- In order to receive the prize money, you will need to provide the code that you used to obtain the submitted predictions and to show that this code produces exactly the same (or at least statistically similar) predictions. topcoder will provide a Linux or a Windows VM to you for this purpose. You will also need to fully document your code and explain your algorithm. Note that all of this will need to be done after the match (if you win a prize, a topcoder representative will contact you with details). During the match you do not need to provide either the code you execute locally or any documentation.
Follow-up Round (Qualitative Contest)
After the end of the Marathon Match, we will conduct a follow-up round in which all members who submitted to this match are welcome to participate and provide a scientific explanation of and/or support for their solutions. The exact start time and the place of submission will be announced in the forums during the marathon match.
The top three submissions in this contest will be awarded the following prizes, based on how much scientific relevance the solutions have.
- 1st place: $2,200
- 2nd place: $1,300
- 3rd place: $900
Please Note: The prizes in this contest will be awarded independently of your position in the match. Hence, even if you did not achieve a top-4 score in the MM, if your submission is deemed scientifically more relevant, it has a great chance of winning a prize in this contest.
The purpose of this contest is to encourage you to build your model and algorithm in such a way that they capture and use the biological and scientific aspects of the information provided by the data. We want to understand how the scientific aspects of the data contributed to the predictions of the toxicity level. This will help us answer larger questions, such as how well these machine learning models would serve and how reliable their predictions can be considered in real-world scenarios dealing with biological topics like toxicity.
Hence, in this contest, we would like you to submit a document explaining various aspects of your solution and thereby addressing questions like:
- Was your model driven purely by the numbers? Or did it consciously use some scientifically relevant information?
- What kinds of feature combinations did you use? Can you explain the reasons for your choices and how they fared better than other features?
- What kinds of correlations were observed between the data available in different files?
- If your model was driven purely by the numbers (that is, it just gives good predictions based on the values), why was it difficult to find correlations with the semantics of the various columns?
These are just a few examples; we will convey more details about this contest in the forums in the next few days.