Contest: Child Stuntedness 4
Problem: ChildStuntedness4

Problem Statement

    

Prizes

  • 1st Place - $10,000
  • 2nd Place - $5,000
  • 3rd Place - $2,500
  • 4th Place - $1,500
  • 5th Place - $1,000

Stunting (shortness for age) affects more than one in four children worldwide. Wasting (being underweight) and stunting in early childhood are associated with lethargy, reduced levels of play, an increased risk of early death, a higher burden of disease, compromised physical capacities, and diminished cognitive development. Stunting and wasting in the first two years of life have been shown to be associated with lower school attainment and reduced economic productivity, which can reduce the productivity of an entire generation. Furthermore, stunting between 12 and 36 months is linked to poor cognitive performance and/or lower school grades in middle childhood, and both height and head circumference at 2 years were shown to be inversely associated with educational attainment.

The ability to predict, at birth and early in childhood, whether a child is on an appropriate growth trajectory will help initiate preventive or therapeutic interventions leading to good cognitive growth and development outcomes as determined by school performance and a thriving child- and adulthood.

Our goal is to determine a combination of early measures that would be a good predictor for recumbent length (length of the child measured while the child is lying down, cm), weight (kg), and head circumference (cm). In pursuit of this goal, we have collected time series measurements of child growth and family trait data (mother’s age, mother’s height, number of previous pregnancies, breast-feeding practices, and father’s height). We would like you to use this data to predict a child’s weight, recumbent length, and head circumference in the attached dataset, where some values have been censored.

You may download the learning data set from here. The data set is a CSV file; its columns are described below:

Data Description

Col Variable Label                                   Notes
 1  SUBJID   Subject ID                              1 to N
 2  AGEDAYS  Age since birth at examination (days)   Day 1 = day of birth
 3  MUACCM   Mid upper-arm circumference (cm)
 4  TSFTMM   Triceps skinfold thickness (mm)
 5  MUAZ     MUAC for age z-score                    Per WHO algorithm
 6  TSFTAZ   Triceps SFT for age z-score             Per WHO algorithm
 7  BFEDFL   Child breast fed on this day            1=Yes, 0=No
 8  WEANFL   Child being weaned on this day          1=Yes, 0=No
 9  SITEID   Investigational Site ID                 1, 2, 3 or 4
10  SEXN     Sex                                     1 = Male, 2 = Female
11  GAGEBRTH Gestational age at birth in days
12  BRTHWEEK Week of Birth                           Jan 1st-7th = 1, Jan 8th–14th = 2, etc.
13  BWTREPT  Reported birth weight (gm)
14  BIRTHLEN Birth length (cm)
15  BIRTHHC  Birth head circumference (cm)
16  MAGE     Maternal age at birth of child (yrs)
17  MHTCM    Maternal height (cm)
18  FHTCM    Father’s height (cm)
19  PARITY   Maternal parity                         # of previous live births at time of this child’s birth.
20  WTKG     Weight (kg)
21  HTCM     Standing height (cm)
22  HCIRCM   Head circumference (cm)

Each child is designated

  • a unique id [column 1],
  • sex [column 10],
  • observed mid upper-arm circumference [column 3],
  • skinfold thickness [column 4],
  • mother’s age [column 16],
  • mother’s height [column 17],
  • number of mother’s previous pregnancies [column 19],
  • mother’s breast-feeding practices [columns 7 and 8],
  • father’s height [column 18], and
  • multiple growth measurements:
    • weight [column 20],
    • recumbent length [column 21],
    • head circumference [column 22]

during early childhood growth, with the time variable provided as age since birth in days [column 2] and gestational age at birth in days [column 11]. The value “.” in any cell means that the value was not measured and is therefore not available.

An example of measurements for a single child is given below:

SUBJID AGEDAYS MUACCM      TSFTMM      MUAZ TSFTAZ BFEDFL WEANFL SITEID SEXN GAGEBRTH BRTHWEEK BWTREPT      BIRTHLEN    BIRTHHC      MAGE        MHTCM        FHTCM PARITY WTKG         HTCM         HCIRCM
   100	    45                                          1      0      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4 -1.423777038 -1.425738199
   100	   114 1.469656593 0.807642227 1.6    0.32      1      0      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4 -0.357199356 -0.667208544 -0.892947184
   100	   130                                          1      0      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4 -0.114118861 -0.656525027
   100	   149                                          1      1      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4  0.014862626 -0.239867893
   100	   170 1.763972954 2.487053578 1.53   2.18                    4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4  0.054549237 -0.186450312 -0.474161119
   100	   310 1.911131134 2.929003933 1.33   2.87      1      1      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4  0.659770061  0.443877148  0.18393127
   100	   359                                          1      1      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4  0.441493698  0.636180441
   100	   366 1.322498413 2.487053578 0.65   2.65      1      1      4    2      291        4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909            4  0.560553533  0.657547473  0.423237593

For each prediction (wi, li and ci) on a row where at least one of the DV (dependent variable) values is missing, the error ei relative to the true weight, recumbent length and head circumference will be measured as the squared Mahalanobis distance,

    ei = (pi - ti)^T S^-1 (pi - ti)

where pi = (wi, li, ci) is the predicted vector, ti is the vector of true values, and S^-1 is the inverse of the sample covariance matrix calculated on all data points in the complete dataset for the current month of the prediction. Current month = AGEDAYS / 30 (rounded down), so days 0-29 = Month 0, days 30-59 = Month 1, etc.
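The error described above can be sketched in Java as follows. This is a minimal illustration, not the official scoring code: the class and method names are hypothetical, and it assumes that all three true values are available for the row and that the 3x3 inverse covariance matrix for the row's month has already been loaded.

```java
/** Illustrative sketch of the per-prediction error; names are hypothetical. */
final class MahalanobisSketch {

    /** Month bucket used to select the covariance matrix: AGEDAYS / 30, rounded down. */
    static int monthOf(int ageDays) {
        return ageDays / 30;
    }

    /**
     * Squared Mahalanobis distance d^T * invS * d, where d = predicted - actual.
     * Both vectors hold (weight, recumbent length, head circumference);
     * invS is the inverse covariance matrix for the month of the prediction.
     */
    static double squaredMahalanobis(double[] predicted, double[] actual, double[][] invS) {
        int n = predicted.length;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) {
            d[i] = predicted[i] - actual[i];
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += d[i] * invS[i][j] * d[j];
            }
        }
        return sum;
    }
}
```

With an identity matrix in place of the inverse covariance, the result reduces to the ordinary squared Euclidean distance, which is a convenient sanity check.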

Scores will be calculated as a generalized R^2 measure of fit, as follows. The total error for the submission is the sum of the per-prediction errors: SSE = SUM(ei).

A baseline sum of squared errors will be calculated by predicting the sample means for each measurement where at least one of the DV values is missing; that is, the mean values of w, l and c in the current training set for the corresponding month (as explained above):

SSE0 = SUM(e0i)

Then the submission score will be Score = 1000000 * MAX(1 - SSE/SSE0, 0).
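As a small worked illustration of the score formula, the Java sketch below sums the per-prediction errors and the baseline errors and applies Score = 1000000 * MAX(1 - SSE/SSE0, 0). The class and method names are hypothetical, and the sketch assumes SSE0 is strictly positive.

```java
/** Illustrative sketch of the generalized R^2 score; names are hypothetical. */
final class ScoreSketch {

    /**
     * errors[i] is the squared Mahalanobis error of the submission on scored row i;
     * baselineErrors[i] is the error of the per-month mean prediction on the same row.
     * Assumes the baseline errors do not sum to zero.
     */
    static double score(double[] errors, double[] baselineErrors) {
        double sse = 0.0;
        for (double e : errors) {
            sse += e;
        }
        double sse0 = 0.0;
        for (double e0 : baselineErrors) {
            sse0 += e0;
        }
        return 1000000.0 * Math.max(1.0 - sse / sse0, 0.0);
    }
}
```

For example, a submission whose total error is half the baseline's scores 500000, and any submission that is no better than the baseline scores 0.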

Each element of the String[] trainingData is one measurement record with 22 comma-separated tokens, in the same order as in the table above. As before, missing values for non-DV variables are presented as “.” strings. In trainingData, not all DV values are present. The format of testingData is almost the same as that of trainingData. The only difference is that some of the DV values are also replaced by “.” strings, and your task is to predict them. The replacement works as follows:

N = number of time points for an ID
X = random between 0 and N/2 inclusive
Y = random between X and N inclusive
foreach time point W(1..N) for an ID
  if W <= X or AGEDAYS < 90 then all three DV values are present
  else if W <= Y then 'c' is replaced by "."
  else all three DV values are replaced by "."
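
The replacement procedure above can be simulated locally to build validation sets that resemble the test data. The Java sketch below is one reading of the pseudocode, under stated assumptions: N/2 uses integer division, both random draws are uniform, and the class, method, and constant names are hypothetical.

```java
import java.util.Random;

/** Illustrative simulation of the DV replacement procedure; names are hypothetical. */
final class CensoringSketch {
    static final int KEEP_ALL = 0;   // all three DV values stay present
    static final int HIDE_HEAD = 1;  // only head circumference is replaced by "."
    static final int HIDE_ALL = 2;   // all three DV values are replaced by "."

    /**
     * ageDays holds AGEDAYS for one child's time points, in order.
     * Returns, for each time point, which DV values would be hidden.
     */
    static int[] censorPlan(int[] ageDays, Random rng) {
        int n = ageDays.length;
        int x = rng.nextInt(n / 2 + 1);      // assumed uniform in [0, N/2]
        int y = x + rng.nextInt(n - x + 1);  // assumed uniform in [X, N]
        int[] plan = new int[n];
        for (int w = 1; w <= n; w++) {       // W is 1-based, as in the pseudocode
            if (w <= x || ageDays[w - 1] < 90) {
                plan[w - 1] = KEEP_ALL;
            } else if (w <= y) {
                plan[w - 1] = HIDE_HEAD;
            } else {
                plan[w - 1] = HIDE_ALL;
            }
        }
        return plan;
    }
}
```

Because records are ordered by AGEDAYS, the plan for one child never becomes less restrictive at later time points: fully observed rows come first, then rows missing only head circumference, then rows missing everything.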

Records with the same ID are consecutive and ordered by AGEDAYS (time point). The returned String[] should contain the predictions for weight, recumbent length and head circumference of the child, in that order and comma-separated, one element per time point, in the same order as in testingData. The length of the returned array equals the number of measurements in testingData.
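
As a rough illustration of the input and output formats, the Java sketch below parses each record, treats “.” (or an empty token) as missing, echoes DV values that are present, and fills missing ones with the overall training mean of that column. The mean-filling rule is a placeholder assumption chosen only to produce well-formed output; it ignores age, month, and the child's own history and is not a suggested solution. The helper `parse` is hypothetical.

```java
/** Naive baseline sketch showing only the record parsing and the output format. */
final class ChildStuntedness4 {
    private static final int WT = 19; // zero-based index of WTKG; HTCM and HCIRCM follow

    /** Parses one token; "." or an empty token marks a missing value and yields NaN. */
    static double parse(String token) {
        String t = token.trim();
        return (t.isEmpty() || t.equals(".")) ? Double.NaN : Double.parseDouble(t);
    }

    public String[] predict(int testType, String[] training, String[] testing) {
        // Mean of each DV over the training rows where it is present.
        double[] sum = new double[3];
        int[] count = new int[3];
        for (String row : training) {
            String[] tok = row.split(",", -1); // -1 keeps trailing empty tokens
            for (int k = 0; k < 3; k++) {
                double v = parse(tok[WT + k]);
                if (!Double.isNaN(v)) {
                    sum[k] += v;
                    count[k]++;
                }
            }
        }
        String[] out = new String[testing.length];
        for (int i = 0; i < testing.length; i++) {
            String[] tok = testing[i].split(",", -1);
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < 3; k++) {
                double v = parse(tok[WT + k]);
                if (Double.isNaN(v)) {
                    // Placeholder fill: training mean, or 0 if the column was never observed.
                    v = count[k] > 0 ? sum[k] / count[k] : 0.0;
                }
                if (k > 0) {
                    sb.append(',');
                }
                sb.append(v);
            }
            out[i] = sb.toString(); // "weight,length,headCircumference"
        }
        return out;
    }
}
```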

NOTE: All data values are normalized as part of data obfuscation requirements.

Notes on Data Set Generation

  • The full data set contains approximately 40,000 lines, covering almost 3000 ID values.
  • The full data set is divided into 40% for example tests, 20% for provisional tests, and 40% for system tests. All data belonging to the same ID is placed in the same data set.
  • For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing.
  • For provisional tests, all example data is also added to the training set.
  • For system tests, all example data and approximately 50% of the provisional data are also added to the training set.
  • There are 10 example cases and 50 provisional test cases.
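
To give a feel for how much training data each test type sees, the Java sketch below works through the percentages above using the approximate figure of 40,000 lines. The results are rough estimates only: the real split is made by ID rather than by line, and every figure in the notes is approximate. The class and method names are hypothetical.

```java
/** Back-of-the-envelope training-set sizes per test type; names are hypothetical. */
final class SplitSizeSketch {
    static final long TOTAL_LINES = 40_000; // approximate size of the full data set

    /** testType: 0 = example, 1 = provisional, 2 = system. */
    static long approxTrainingLines(int testType) {
        long example = TOTAL_LINES * 40 / 100;     // 40% of the data
        long provisional = TOTAL_LINES * 20 / 100; // 20% of the data
        long system = TOTAL_LINES * 40 / 100;      // 40% of the data
        switch (testType) {
            case 0:
                return example * 66 / 100;
            case 1:
                return provisional * 66 / 100 + example;
            case 2:
                return system * 66 / 100 + example + provisional * 50 / 100;
            default:
                throw new IllegalArgumentException("unknown testType: " + testType);
        }
    }
}
```

Under these assumptions the training set grows from roughly 10,560 lines for example tests to about 21,280 for provisional tests and about 30,560 for system tests.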

Notes on Time Limits

Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited to 360s (6 minutes), provisional tests to 540s (9 minutes) and system tests to 900s (15 minutes). The testType parameter will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account.
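
One way to take timing into account is to map testType to its limit and stop iterating before a self-imposed deadline. The Java sketch below shows this under stated assumptions: the class and method names are hypothetical, and the safety fraction is an arbitrary choice left to the caller.

```java
/** Illustrative time-budget helper keyed on testType; names are hypothetical. */
final class TimeBudgetSketch {

    /** Time limit in seconds for testType 0 (example), 1 (provisional) or 2 (system). */
    static int timeLimitSeconds(int testType) {
        switch (testType) {
            case 0:
                return 360;
            case 1:
                return 540;
            case 2:
                return 900;
            default:
                throw new IllegalArgumentException("unknown testType: " + testType);
        }
    }

    /**
     * Deadline in System.nanoTime() units. The fraction (for example 0.8) leaves
     * headroom below the hard limit; its value is an assumption, not part of the rules.
     */
    static long deadlineNanos(long startNanos, int testType, double fraction) {
        return startNanos + (long) (timeLimitSeconds(testType) * fraction * 1e9);
    }
}
```

A training loop could record `System.nanoTime()` on entry to predict and exit once the current time passes the computed deadline.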

Scoring and Inverse Covariance

Example scoring code (with comments) here.

Full list of inverseS values for each month here.

 

Definition

    
Class: ChildStuntedness4
Method: predict
Parameters: int, String[], String[]
Returns: String[]
Method signature: String[] predict(int testType, String[] training, String[] testing)
(be sure your method is public)
    
 

Examples

0)
    
Seed: 1
1)
    
Seed: 2
2)
    
Seed: 3
3)
    
Seed: 4
4)
    
Seed: 5
5)
    
Seed: 6
6)
    
Seed: 7
7)
    
Seed: 8
8)
    
Seed: 9
9)
    
Seed: 10

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.