 Prizes
 1st Place  $10,000
 2nd Place  $5,000
 3rd Place  $2,500
 4th Place  $1,500
 5th Place  $1,000
Stunting (shortness for age) affects more than one in four children worldwide. Wasting
(low weight for height) and stunting in early childhood are associated with lethargy, reduced levels of
play, an increased risk of early death, higher burden of disease, compromised physical capacities,
and diminished cognitive development. Stunting and wasting in the first two years of life have been
shown to be associated with lower school attainment and reduced economic productivity, which can
reduce the productivity of an entire generation. Furthermore, stunting between 12 and 36 months is
also linked to poor cognitive performance and/or lower school grades in middle childhood, and both
height and head circumference at 2 years were shown to be inversely associated with educational
attainment.
The ability to predict, at birth and early in childhood, whether a child is on an appropriate
growth trajectory will help initiate preventive or therapeutic interventions leading to good
cognitive growth and development outcomes, as determined by school performance, and to a thriving
childhood and adulthood.
Our goal is to determine a combination of early measures that would be a good predictor for
recumbent length (length of child measured while child is lying down, cm), weight (kg), and head
circumference (cm). In pursuit of this goal, we have collected time series measurements of child
growth, and family trait data (mother’s age, mother’s height, number of previous pregnancies,
breastfeeding practices, and father’s height). We would like you to use this data to predict a
child’s weight, recumbent length, and head circumference in the attached dataset where
values have been censored.
You may download the learning data set from here.
The data set is a CSV file in the format described below:
Data Description
Col  Variable  Label                                  Notes
1    SUBJID    Subject ID                             1 to N
2    AGEDAYS   Age since birth at examination (days)  Day 1 = day of birth
3    MUACCM    Mid upper-arm circumference (cm)
4    TSFTMM    Triceps skinfold thickness (mm)
5    MUAZ      MUAC-for-age z-score                   Per WHO algorithm
6    TSFTAZ    Triceps SFT-for-age z-score            Per WHO algorithm
7    BFEDFL    Child breast fed on this day           1=Yes, 0=No
8    WEANFL    Child being weaned on this day         1=Yes, 0=No
9    SITEID    Investigational Site ID                1, 2, 3 or 4
10   SEXN      Sex                                    1 = Male, 2 = Female
11   GAGEBRTH  Gestational age at birth (days)
12   BRTHWEEK  Week of birth                          Jan 1st–7th = 1, Jan 8th–14th = 2, etc.
13   BWTREPT   Reported birth weight (gm)
14   BIRTHLEN  Birth length (cm)
15   BIRTHHC   Birth head circumference (cm)
16   MAGE      Maternal age at birth of child (yrs)
17   MHTCM     Maternal height (cm)
18   FHTCM     Father’s height (cm)
19   PARITY    Maternal parity                        # of previous live births at time of this child’s birth
20   WTKG      Weight (kg)
21   HTCM      Standing height (cm)
22   HCIRCM    Head circumference (cm)
Each child is designated
 a unique id [column 1],
 sex [column 10],
 observed mid upper-arm circumference [column 3],
 skinfold thickness [column 4],
 mother’s age [column 16],
 mother’s height [column 17],
 number of mother’s previous pregnancies [column 19],
 mother’s breastfeeding practices [columns 7 and 8],
 father’s height [column 18], and
 multiple growth measurements:
 weight [column 20],
 recumbent length [column 21],
 head circumference [column 22]
during early childhood growth (with the time variable provided as age since birth in days
[column 2] and gestational age at birth in days [column 11]). The value “.” in any cell implies
that the value has not been measured and is therefore not available.
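As an illustration of the record layout described above, the following is a minimal Python sketch that turns one comma-separated record into a dictionary keyed by the variable names in the Data Description table, mapping "." to None. The function name parse_record and the choice of None for missing values are assumptions made for this example, not part of the contest interface.

```python
from typing import Dict, Optional

# Column names in the order given in the Data Description table.
COLUMNS = [
    "SUBJID", "AGEDAYS", "MUACCM", "TSFTMM", "MUAZ", "TSFTAZ", "BFEDFL",
    "WEANFL", "SITEID", "SEXN", "GAGEBRTH", "BRTHWEEK", "BWTREPT",
    "BIRTHLEN", "BIRTHHC", "MAGE", "MHTCM", "FHTCM", "PARITY",
    "WTKG", "HTCM", "HCIRCM",
]


def parse_record(line: str) -> Dict[str, Optional[float]]:
    """Parse one CSV record; a "." token becomes None (not measured).

    Hypothetical helper: the name and the None convention are assumptions.
    """
    tokens = [t.strip() for t in line.split(",")]
    if len(tokens) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} tokens, got {len(tokens)}")
    return {
        name: (None if tok == "." else float(tok))
        for name, tok in zip(COLUMNS, tokens)
    }
```

Keeping missing values as None, rather than a numeric placeholder, makes it harder to confuse "not measured" with a real normalized value.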
An example of measurements for a single child is given below:
SUBJID AGEDAYS MUACCM TSFTMM MUAZ TSFTAZ BFEDFL WEANFL SITEID SEXN GAGEBRTH BRTHWEEK BWTREPT BIRTHLEN BIRTHHC MAGE MHTCM FHTCM PARITY WTKG HTCM HCIRCM
100 45 1 0 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 1.423777038 1.425738199
100 114 1.469656593 0.807642227 1.6 0.32 1 0 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.357199356 0.667208544 0.892947184
100 130 1 0 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.114118861 0.656525027
100 149 1 1 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.014862626 0.239867893
100 170 1.763972954 2.487053578 1.53 2.18 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.054549237 0.186450312 0.474161119
100 310 1.911131134 2.929003933 1.33 2.87 1 1 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.659770061 0.443877148 0.18393127
100 359 1 1 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.441493698 0.636180441
100 366 1.322498413 2.487053578 0.65 2.65 1 1 4 2 291 4 0.308707977 0.498588401 0.509778147 0.287446485 0.069090909 4 0.560553533 0.657547473 0.423237593
For each prediction p_{i} = (w_{i}, l_{i}, c_{i}) where
at least one of the DV values is missing, the error from the true weight, recumbent length and head
circumference t_{i} will be measured as the squared
Mahalanobis distance,
e_{i} = (p_{i} - t_{i})^{T} S^{-1} (p_{i} - t_{i}),
where S^{-1} is the inverse of the sample covariance matrix calculated on all data
points in the complete dataset for the current month of the prediction.
Current month = AGEDAYS / 30 (rounded down), so 0–29 = Month 0, 30–59 = Month 1, etc.
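The error measure above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, inv_s stands for the 3x3 inverse covariance matrix of the relevant month (supplied by the organizers, not computed here), and the vectors are ordered as weight, recumbent length, head circumference.

```python
from typing import Sequence


def current_month(agedays: float) -> int:
    """Month index as described above: AGEDAYS / 30, rounded down."""
    return int(agedays) // 30


def squared_mahalanobis(
    pred: Sequence[float],
    true: Sequence[float],
    inv_s: Sequence[Sequence[float]],
) -> float:
    """Squared Mahalanobis distance d^T * inv_s * d with d = pred - true.

    inv_s is assumed to be the inverse sample covariance matrix for the
    month of the prediction (hypothetical argument name).
    """
    d = [p - t for p, t in zip(pred, true)]
    n = len(d)
    return sum(d[i] * inv_s[i][j] * d[j] for i in range(n) for j in range(n))
```

With the identity matrix as inv_s, the result reduces to the ordinary squared Euclidean distance, which is a convenient sanity check.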
Scores will be calculated as a generalized R^{2}
measure of fit. This is calculated as follows. The total sum of errors for the submission will be calculated as
SSE = SUM(e_{i}).
A baseline sum of squared errors will be calculated by predicting, for each measurement
where at least one of the DV values is missing, the sample means of w, l and c
in the current training set for the corresponding month (as explained above),
SSE_{0} = SUM(e_{0i})
Then the submission score will be
Score = 1000000 * MAX(1 - SSE/SSE_{0}, 0).
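The score formula can be written as a short Python function. This is a sketch only: the function name is hypothetical, the inputs are assumed to be the per-prediction errors e_{i} and baseline errors e_{0i} already computed, and the handling of a zero baseline is an assumption, since the statement does not cover that case.

```python
from typing import Iterable


def submission_score(errors: Iterable[float], baseline_errors: Iterable[float]) -> float:
    """Score = 1000000 * max(1 - SSE / SSE_0, 0)."""
    sse = sum(errors)
    sse0 = sum(baseline_errors)
    if sse0 == 0:
        # Assumption: the statement does not define this case; return 0.
        return 0.0
    return 1_000_000 * max(1 - sse / sse0, 0)
```

For example, a submission whose total error is half the baseline scores 500000, and any submission that is no better than predicting the monthly means scores 0.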
In the string[] trainingData, each string is one measurement record and has 22
comma-separated tokens, in the same order as in the table above. As before, missing values
for non-DV variables are presented as “.” strings. In trainingData, not all DV values are
present. The format of testingData is almost the same as that of trainingData. The only
difference is that some of the DV values are also replaced by “.” strings; your task is
to predict them. The replacement proceeds as follows:
N = number of time points for an ID
X = random between 0 and N/2 inclusive
Y = random between X and N inclusive
foreach time point W (1..N) for an ID
    if W <= X or AGEDAYS < 90, then all three DV values are present
    else if W <= Y, then 'c' is replaced by "."
    else all three DV values are replaced by "."
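To reproduce this censoring locally, for example when building a validation set, the procedure can be sketched in Python as below. The function name and the label strings are hypothetical, N/2 is assumed to mean integer division, and the random draws are assumed to be uniform; the statement does not specify either point.

```python
import random
from typing import List, Sequence


def mask_child(agedays: Sequence[float], rng: random.Random) -> List[str]:
    """Return one label per time point of a single ID.

    "none": all three DV values stay present
    "c":    head circumference is replaced by "."
    "all":  all three DV values are replaced by "."
    """
    n = len(agedays)
    x = rng.randint(0, n // 2)  # assumption: N/2 is rounded down
    y = rng.randint(x, n)
    labels = []
    for w, age in enumerate(agedays, start=1):  # time points are 1-based
        if w <= x or age < 90:
            labels.append("none")
        elif w <= y:
            labels.append("c")
        else:
            labels.append("all")
    return labels
```

Because the records of one ID are ordered by AGEDAYS, the labels always run from "none" through "c" to "all", so the fully censored time points are the latest ones.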
Records with the same ID are consecutive and ordered by AGEDAYS (time point). The returned
string[] should contain the corresponding predictions for weight, recumbent length and head
circumference of the child, in this order, comma-separated, one string for each time point,
in the same order as in testingData. The length of the returned array equals the
number of measurements.
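As a starting point, the baseline described in the scoring section (predicting the monthly training means) can be sketched in Python as follows. The function names are hypothetical and do not reflect the actual contest method signature. Two further assumptions are made: observed DV values in testingData are echoed back unchanged, and 0.0 is used as an arbitrary fallback when a month has no training data.

```python
from collections import defaultdict
from typing import Dict, List

DV_COLS = (19, 20, 21)  # zero-based positions of WTKG, HTCM, HCIRCM


def month_means(training_data: List[str]) -> Dict[int, List[float]]:
    """Mean of each DV per month (AGEDAYS // 30), ignoring "." values."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(lambda: [0, 0, 0])
    for line in training_data:
        tok = line.split(",")
        month = int(float(tok[1])) // 30
        for k, col in enumerate(DV_COLS):
            if tok[col] != ".":
                sums[month][k] += float(tok[col])
                counts[month][k] += 1
    return {
        m: [s / c if c else 0.0 for s, c in zip(sums[m], counts[m])]
        for m in sums
    }


def baseline_predict(training_data: List[str], testing_data: List[str]) -> List[str]:
    """One "w,l,c" string per test record, in the order of testing_data."""
    means = month_means(training_data)
    out = []
    for line in testing_data:
        tok = line.split(",")
        month = int(float(tok[1])) // 30
        fallback = means.get(month, [0.0, 0.0, 0.0])  # arbitrary placeholder
        vals = [
            float(tok[col]) if tok[col] != "." else fallback[k]
            for k, col in enumerate(DV_COLS)
        ]
        out.append(",".join(repr(v) for v in vals))
    return out
```

By construction this baseline should score close to 0 under the formula above, so it is mainly useful for checking the input and output plumbing before trying a real model.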
NOTE: All data values are normalized as part of data obfuscation requirements.
Notes on Data Set Generation
 The full data set contains approximately 40,000 lines, covering almost 3,000 ID values.
 The full data set is divided into 40% for example tests, 20% for provisional tests, and 40% for system tests.
All data belonging to the same ID is placed in the same data set.
 For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing.
 For provisional tests, all example data is also added to the training set.
 For system tests, all example data and approximately 50% of the provisional data are also added to the training set.
 There are 10 example cases and 50 provisional test cases.
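For local validation, a split in the spirit of the notes above can be sketched as follows. It keeps all records of one ID on the same side and sends roughly 66% of the IDs to training. This is an assumption-laden illustration: the statement does not say whether the 66% selection within a test is made per ID or per record, and the function name, seed and rounding rule are hypothetical.

```python
import random
from typing import List, Tuple


def split_by_id(
    lines: List[str], train_fraction: float = 0.66, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split records into (train, test) so that no ID appears in both."""
    ids = sorted({line.split(",")[0] for line in lines})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    train_ids = set(ids[:n_train])
    # Preserve the original record order within each side.
    train = [ln for ln in lines if ln.split(",")[0] in train_ids]
    test = [ln for ln in lines if ln.split(",")[0] not in train_ids]
    return train, test
```

Splitting by ID rather than by record avoids leaking later measurements of a child into the training side of a local experiment.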
Notes on Time Limits
Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited
to 360s (6 minutes), provisional tests to 540s (9 minutes) and system tests to 900s (15 minutes). The testType parameter
will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account.
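One way to take the testType parameter into account is to map it to a time budget, as in the minimal sketch below. The limits come from the paragraph above; the function name and the safety margin are assumptions for illustration.

```python
# Time limits in seconds for testType 0 (Example), 1 (Provisional), 2 (System).
TIME_LIMITS_S = {0: 360, 1: 540, 2: 900}


def time_budget(test_type: int, safety: float = 0.8) -> float:
    """Seconds the solution allows itself, leaving a safety margin.

    The 0.8 default is an arbitrary assumption, not a contest rule.
    """
    return TIME_LIMITS_S[test_type] * safety
```

A training loop can then compare elapsed wall-clock time against this budget and stop early, since system tests also carry more training data than example tests.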
Scoring and Inverse Covariance
Example scoring code (with comments) here.
Full list of inverseS values for each month here.
