Contest: Tree Based Models
Problem: TimeSeriesLearning

Problem Statement

    

Overall Prizes

  1. $6,000
  2. $4,000
  3. $2,500
  4. $1,500
  5. $1,000

Sub-Contest Prizes (Per each Sub-contest - 4 total)

  1. $2,000
  2. $1,250
  3. $750
  • $500 (Bonus top scorer end of week #1)
  • $500 (Bonus top scorer end of week #2)

Background

The Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery. The HBGD Open Innovation platform was developed to mobilize the global “unusual suspects” data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children. The solutions developed by this challenge will support the efforts of the HBGD Open Innovation initiative.

Objective

The goal of this contest is to develop flexible methods that are able to adaptively fill in, backfill, and predict time-series using a large number of heterogeneous training datasets. The data is a set of thousands of aggressively obfuscated, multi-variate time-series measurements. There are multiple output variables and multiple input variables.

Each time-series has parts missing: either individual measurements or entire sections. Each time-series has a different number of known and missing measurements. The goal is to fill in the missing output variables as accurately as possible. How the missing input variables are treated is an open question, and is one of the key challenges to solve.

This problem, unlike many data science contest problems, is not easy to fit into the standard machine learning framework. Some reasons that this is the case:

  • There are multiple time-series outputs.
  • There are multiple time-series inputs.
  • The time-series are sampled irregularly, and at different time points for each subject.
  • There is a huge amount of missing data, and it is not missing at random.
  • Many of the variables are nominal/categorical, and some of these are very high cardinality. The most important variable, subject id, is the primary example. A good solution should not ignore the subject id.

The goal of this contest is two-fold: we wish not only to obtain a very good, flexible solution, but we would also like to encourage competitors to try a diverse set of approaches, and have documentation of which ones worked and why. Ideally, we would like competitors to try methodologies outside of the standard scope of contest algorithms. While many contests use Random Forests and Gradient Boosted Regression trees, we would like competitors to branch out and try recurrent neural networks, Gaussian Process Regression, polynomial regression, VARMAX processes, and other approaches.

In facilitating this goal, there will be several sub-contests running parallel to the main contest. Each of these sub-contests will focus on a particular type of solution approach, and will have additional prizes (in addition to the primary prize pool for the overall contest). Competitors should include a brief explanation of how their solution fits into the framework of one of the sub-contests, where applicable:

Mixed Effects Models

Linear and non-linear mixed effects models are appropriate for our data as we have discrete subjects for which we'd like to have discernible models. Further, it would be great to have interpretable estimates for the effects of each nominal variable. For solid implementations, see the lme4 and nlme packages in R.

Key challenge: linear models are insufficiently expressive for human growth data, but it is tricky to extend non-linear mixed effects models to the multivariate case.

Neural Networks

Deep neural networks and recurrent neural networks (RNNs) are of great interest because of the amount of empirically successful research that has recently emerged, suggesting that these types of models have potential to revolutionize many other computational fields. RNNs are of particular interest as they have the capacity to model variable length inputs, and deal natively with multiple outputs. For interesting recent work see: https://arxiv.org/abs/1606.04130

There are many good RNN implementations for Python in Theano, TensorFlow, Keras, and other packages.

Key challenge: RNNs implicitly assume that the inputs are regularly sampled, but our data is both sparsely and irregularly sampled.

Tree-Based Models

Random Forests and Gradient Boosted Decision Trees are some of the most popular machine learning models. Even though our data is not a precise fit to the independent and identically distributed vectors-of-features model underlying classic supervised learning, predictive models can still be fit and evaluated to good effect. The xgboost package and scikit-learn have great implementations of these types of models.

Key challenge: tree-based models do not perform well when dealing with high-cardinality nominal variables, but these variables (such as subject id) provide key information that is necessary for good predictions.

Matrix Completion Models

Although not an obvious approach for time-series data, we can use matrix completion methods to address the sparsely sampled nature of our data. For example, instead of users and items for rows and columns, consider using subjects and time (days) for rows and columns. LightFM and libFM are two good packages to consider here.

Key challenge: integrating the side-information for each row and each column.
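
The subjects-by-days layout described above can be sketched in plain Python as follows. This is only an illustration of how the observations might be arranged before being handed to a matrix completion package; the helper names `build_subject_day_matrix` and `density` are hypothetical, and the input is assumed to be already reduced to (subject, day, value) triples with `None` marking a missing value.

```python
from collections import defaultdict


def build_subject_day_matrix(observations):
    """Arrange (subject, day, value) triples as a sparse subject-by-day matrix.

    Returns (matrix, subjects, days). matrix[subject][day] exists only for
    observed cells; every other cell of the grid is a candidate for completion.
    Subjects with no observed value at all do not appear in the result.
    """
    matrix = defaultdict(dict)
    for subject, day, value in observations:
        if value is not None:  # skip missing measurements
            matrix[subject][day] = value
    subjects = sorted(matrix)
    days = sorted({d for row in matrix.values() for d in row})
    return dict(matrix), subjects, days


def density(matrix, subjects, days):
    """Fraction of cells in the subject-by-day grid that are observed."""
    total = len(subjects) * len(days)
    observed = sum(len(matrix[s]) for s in subjects)
    return observed / total if total else 0.0
```

The side-information mentioned in the key challenge (per-subject and per-day features) is not handled here; it would have to be supplied separately to whichever package is used.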

Note that competitors are free to submit different solutions in multiple sub-contests and/or for the main contest as well, and can win prizes in more than one.

Data Description

The training and test data contain several columns:

----------+--------------------+------------+-------------------------------------------------------
 Column#s | Column Name(s)     | Data Type  | Description
----------+--------------------+------------+-------------------------------------------------------
 1-3      | y1, y2, y3         | Float      | The three dependent variables to be predicted in test
----------+--------------------+------------+-------------------------------------------------------
 4        | STUDYID            | Integer    | 
----------+--------------------+------------+-------------------------------------------------------
 5        | SITEID             | Integer    | 
----------+--------------------+------------+-------------------------------------------------------
 6        | COUNTRY            | Integer    | 
----------+--------------------+------------+-------------------------------------------------------
 7        | SUBJID             | Integer    | 
----------+--------------------+------------+-------------------------------------------------------
 8        | TIMEVAR1           | Float      | 
----------+--------------------+------------+-------------------------------------------------------
 9        | TIMEVAR2           | Float      | 
----------+--------------------+------------+-------------------------------------------------------
 10-39    | COVAR_CONTINUOUS_n | Float      | (30 fields) 
----------+--------------------+------------+-------------------------------------------------------
 40-47    | COVAR_ORDINAL_n    | Integer    | (8 fields) 
----------+--------------------+------------+-------------------------------------------------------
 48-55    | COVAR_NOMINAL_n    | Char       | (8 fields) 
----------+--------------------+------------+-------------------------------------------------------
 56-58    | y1, y2, y3 missing | True/False | (3 fields) True if the value is missing from ground truth
----------+--------------------+------------+-------------------------------------------------------

The combination of STUDYID and SUBJID is sufficient to uniquely identify a specific individual. Adding TIMEVAR1 is sufficient to uniquely identify each row.
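
The identification rule above can be sketched as a small grouping helper. This is an illustrative sketch only: the function name `group_by_subject` is hypothetical, each row is assumed to be a dictionary keyed by the column names in the table, and TIMEVAR1 is assumed to be present and numeric on every row.

```python
from collections import defaultdict


def group_by_subject(rows):
    """Group rows by (STUDYID, SUBJID) and order each group by TIMEVAR1.

    Raises ValueError if the key (STUDYID, SUBJID, TIMEVAR1) repeats,
    since that key is expected to identify each row uniquely.
    """
    groups = defaultdict(list)
    seen = set()
    for row in rows:
        subject = (row["STUDYID"], row["SUBJID"])
        key = subject + (float(row["TIMEVAR1"]),)
        if key in seen:
            raise ValueError(f"duplicate row key: {key}")
        seen.add(key)
        groups[subject].append(row)
    for series in groups.values():
        # One time-series per individual, in time order.
        series.sort(key=lambda r: float(r["TIMEVAR1"]))
    return dict(groups)
```

Note that SUBJID alone is not used as the key, because the statement only guarantees uniqueness for the STUDYID and SUBJID pair.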

The validation and test data file contains the same fields as the training data, with one primary difference: y1, y2, and y3 are not given and are left empty. The last three columns contain the values “True” or “False”, indicating whether y1, y2, or y3 is missing from the ground truth data. This test data file contains the tests for both provisional and system testing; however, you will not know which set each row belongs to. Any given subject/study pair belongs entirely to one or the other.

The submitted predictions should contain one row for each row in the test data set. Each row should contain three comma-separated values: the predicted values for y1, y2, and y3. The rows should be in the same order as given in the test data. Note that wherever the test data indicates there is no ground truth for one or more of the three values, you may use 0 as the prediction for that value; it will be ignored and will not contribute to scoring.
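
A minimal sketch of producing rows in this format is shown below. The function name `format_predictions` is hypothetical, and the flags are read as "True means the value is missing from the ground truth", following the description of the last three columns.

```python
import math


def format_predictions(predictions, missing_flags):
    """Return one 'y1,y2,y3' line per test row, in the order given.

    predictions   : list of (y1, y2, y3), aligned with the test rows
    missing_flags : list of (m1, m2, m3) booleans from the last three test
                    columns; True means no ground truth exists for that value
    """
    if len(predictions) != len(missing_flags):
        raise ValueError("need exactly one prediction row per test row")
    lines = []
    for preds, flags in zip(predictions, missing_flags):
        values = []
        for p, missing in zip(preds, flags):
            # Unscored values are written as 0; they are ignored by the scorer.
            v = 0.0 if missing else float(p)
            if not math.isfinite(v):
                raise ValueError("predictions must be finite numbers")
            values.append(repr(v))
        lines.append(",".join(values))
    return lines
```

Writing the returned lines to a file, one per line, gives a predictions file with the same number of rows as the test data.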

Scoring

Your score for each individual prediction p, compared against actual ground-truth value t, will be |p - t|. The score for each row, r, will then be the mean of the scores for the individual predictions on that row (possibly 1, 2, or 3 values).

Over the full n rows, your final score will be calculated as 10 * (1 - Sum(r) / n). Thus a score of 10.00 represents perfect predictions with no error at all. All scores will be rounded down to two decimal places to help prevent overfitting. (Note that we may internally evaluate at higher precision in the event tie-breaking is needed for prize awards.)
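
For offline evaluation, the formula above can be sketched as follows. This is not the official scorer: the function names are hypothetical, `None` is used to mark a value with no ground truth, rows with no ground truth at all are assumed to be excluded from n, and the final rounding step is left out.

```python
def row_score(preds, truths):
    """Mean of |p - t| over the values in one row that have ground truth.

    Returns None when the row has no ground-truth value at all.
    """
    errors = [abs(p - t) for p, t in zip(preds, truths) if t is not None]
    return sum(errors) / len(errors) if errors else None


def final_score(all_preds, all_truths):
    """Compute 10 * (1 - Sum(r) / n) over all scored rows, without rounding."""
    if len(all_preds) != len(all_truths):
        raise ValueError("predictions and ground truth must have the same length")
    scores = [row_score(p, t) for p, t in zip(all_preds, all_truths)]
    scores = [r for r in scores if r is not None]  # assumption: skip unscored rows
    if not scores:
        raise ValueError("no row has any ground truth to score against")
    return 10 * (1 - sum(scores) / len(scores))
```

For example, a row with errors 0 and 0.5 scores 0.25, a row with a single error of 0.1 scores 0.1, and together they give 10 * (1 - 0.35 / 2) = 8.25.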

Submissions that are malformed in any way, such as having the wrong number of rows, values that do not parse as numeric, etc., will score a 0.

To submit your entry, your code will only need to implement a single method, getUrl(), which takes no parameters and returns the URL at which your predictions CSV file can be downloaded.
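
A submission can therefore be very small. The sketch below uses the class and method names given in the Definition section (TimeSeriesLearning and getUrl); the URL is a placeholder, not a real location, and must be replaced with wherever your predictions file is actually hosted.

```python
class TimeSeriesLearning:
    def getUrl(self):
        # Placeholder URL (assumption): replace with the publicly
        # downloadable location of your predictions CSV file.
        return "https://example.com/predictions.csv"
```

The file at that URL must be downloadable at the time the submission is tested.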

Example Testing

All example testing for this contest should be done offline, using the provided data. Note, however, that you may do a “Test Examples” submission using your predictions file for the provided test data. This will not provide any provisional scoring, but will confirm for you that the predictions file works correctly (URL is accessible, correct number of rows and columns, and numerical values parse correctly). This is not required, but can be used as a basic sanity check before making a full submission.

Baseline

A baseline score of 9.88 has been achieved using a combination of methodologies.

Competitors in the main challenge will need to reach a score of 9.90 to be eligible for a prize, and competitors in the sub-contests will need a score of at least 9.80.

General Notes

  • This match is RATED for the overall TOP 5 portion of the contest, but not for the 4 sub-contests.
  • The winners of the $500 end-of-week bonus for each of the 4 sub-contests will not be required to submit a full report, but will be asked for a brief explanation of how their code met the requirements of the room.
  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.
  • It is not permitted to use external data sources as part of this challenge, and attempting to reverse engineer any obfuscation of the data in the contest is likewise forbidden.
  • In this match you may use any programming language and libraries, provided Topcoder is able to run it free of any charge. You may also use open source languages and libraries, with the restrictions listed in the next section below. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM (see “Requirements to Win a Prize” section). Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.
  • You may use open source languages and libraries if they are licensed under the following licenses: Apache2, FreeBSD and MIT. Other licenses may be acceptable, so please request authorization in the forums or via email if you’d like to use another. Please note that GPL and LGPL licenses cannot be accepted on this challenge.

Requirements to Win a Prize

  1. Place in TOP 5 according to system test results on the main contest, or in the top 3 on any of the sub-contests. See the “Scoring” section above.
  2. Report: Within 7 days from the announcement of the challenge winners, submit a complete report outlining your final algorithm and explaining the logic behind it and the steps in its approach. The report should contain:
    • Your Information: first and last name, Topcoder handle and email address.
    • Approach Used: a detailed description of your solution, which should include: approaches considered and ultimately chosen; advantages, disadvantages, and potential improvements of your approach; detailed comments on libraries used; and the precise steps taken to run your submission.
  3. Local Programs Used: If any parameters were obtained from the training data set, you will also need to provide the program(s) used to generate these parameters. (The same licensing restrictions as above shall apply here.)
  4. Actual programming code used: each of the top 5 competitors will be given access to an AWS VM instance, where they will need to load their code, along with two basic scripts for running it:
    • compile.sh should perform any necessary compilation of the code so that it can be run.
    • run.sh [training file] [testing file] [output file] should run the code and generate the output that was submitted during the contest.

If you place in the overall top 5, or in the top 3 of one of the four sub-contests, but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.

 

Definition

    
Class:TimeSeriesLearning
Method:getUrl
Parameters:
Returns:String
Method signature:String getUrl()
(be sure your method is public)
    
 

Examples

0)
    
Test: E

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.