Stunting (shortness for age) affects more than one in four children worldwide. Wasting (low weight for height) and stunting in early childhood are associated with lethargy, reduced levels of play, an increased risk of early death, a higher burden of disease, compromised physical capacities, and diminished cognitive development. Stunting and wasting in the first two years of life have been shown to be associated with lower school attainment and reduced economic productivity, which can reduce the productivity of an entire generation. Furthermore, stunting between 12 and 36 months is linked to poor cognitive performance and/or lower school grades in middle childhood, and both height and head circumference at 2 years were shown to be inversely associated with educational attainment.
The ability to predict, at birth and in early childhood, whether a child is on an appropriate growth trajectory will help initiate preventive or therapeutic interventions that lead to good cognitive growth and development outcomes, as reflected in school performance and a thriving childhood and adulthood.
Our goal is to determine a combination of early measures that would be a good predictor of recumbent length (length of the child measured while the child is lying down, cm), weight (kg), and head circumference (cm). In pursuit of this goal, we have collected time series measurements of child growth and family trait data (mother’s age, mother’s height, number of previous pregnancies, breast-feeding practices, and father’s height). We would like you to use this data to predict a child’s weight, recumbent length, and head circumference in the attached dataset where values have been censored.
You may download the learning data set from here. The data set is a CSV file with the columns described below:
| Column | Variable | Type | Label/Description | DV for prediction |
|---|---|---|---|---|
| 1 | UID | int | Unique child ID | No |
| 2 | AGEDAYS | float | Age since birth at examination (days) | No |
| 3 | GAGEDAYS | float | Gestational age at examination (days) | No |
| 4 | SEX | int | Sex, 1 = Male, 2 = Female | No |
| 5 | MUACCM | float | Mid upper-arm circumference (cm) | No |
| 6 | SFTMM | float | Skinfold thickness (mm) | No |
| 7 | BFED | int | Child breast fed at time of visit | No |
| 8 | WEAN | int | Child being weaned at time of visit | No |
| 9 | GAGEBRTH | float | Gestational age at birth (days) | No |
| 10 | MAGE | float | Maternal age at examination (years) | No |
| 11 | MHTCM | float | Maternal height (cm) | No |
| 12 | MPARITY | int | Maternal parity | No |
| 13 | FHTCM | float | Father’s height (cm) | No |
| 14 | WTKG | float | Weight (kg) | Yes |
| 15 | LENCM | float | Recumbent length (cm) | Yes |
| 16 | HCIRCM | float | Head circumference (cm) | Yes |
Each child has
- a unique ID [column 1],
- sex [column 4],
- observed mid upper-arm circumference [column 5] and skinfold thickness [column 6],
- mother’s age [column 10],
- mother’s height [column 11],
- number of mother’s previous pregnancies [column 12],
- mother’s breast-feeding practices [columns 7 and 8],
- father’s height [column 13], and
- multiple growth measurements taken during early childhood:
  - weight [column 14],
  - recumbent length [column 15],
  - head circumference [column 16].
The time variable is provided as age since birth in days [column 2] and age since conception in days [column 3]. The value “.” in any cell means that the value was not measured and is therefore not available.
An example of measurements for a single child is given below:
UID AGEDAYS GAGEDAYS SEX MUACCM SFTMM BFED WEAN GAGEBRTH MAGE MHTCM MPARITY FHTCM WTKG LENCM HCIRCM
550 -1.356576074 -1.274154148 2 . . . . 1.614604045 0.112627355 -0.127853527 4 . -1.983209108 -1.682735629 -2.248066477
550 -0.865922259 -0.783913419 2 1.497903433 0.894389325 1 0 1.614604045 0.138831421 -0.127853527 4 . -0.179345397 -0.495300141 -0.56945552
550 -0.622766386 -0.540962261 2 1.776302755 2.593698371 . . 1.614604045 0.159518841 -0.127853527 4 . 0.18942477 -0.050011833 -0.223859146
550 -0.014876703 0.066415633 2 1.915502416 3.040884962 1 1 1.614604045 0.211926973 -0.127853527 4 . 0.731472486 0.533810615 0.31922087
550 0.22827917 0.309366791 2 1.358703772 2.593698371 1 1 1.614604045 0.233993555 -0.127853527 4 . 0.642612205 0.73171653 0.516704512
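As an illustration of the record layout, here is a minimal Python sketch that parses one comma-separated record into a dictionary, mapping the “.” marker to `None`. The names `COLUMNS` and `parse_record` are hypothetical helpers introduced here, not part of the problem statement, and the sketch assumes the 16 tokens appear in the table's column order.

```python
from typing import Dict, Optional, Union

# Column names in the order given in the table (columns 1..16).
COLUMNS = [
    "UID", "AGEDAYS", "GAGEDAYS", "SEX", "MUACCM", "SFTMM", "BFED", "WEAN",
    "GAGEBRTH", "MAGE", "MHTCM", "MPARITY", "FHTCM", "WTKG", "LENCM", "HCIRCM",
]

Value = Optional[Union[int, float]]


def parse_record(line: str) -> Dict[str, Value]:
    """Parse one comma-separated record; a '.' token becomes None."""
    tokens = [token.strip() for token in line.split(",")]
    if len(tokens) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} tokens, got {len(tokens)}")
    record: Dict[str, Value] = {}
    for name, token in zip(COLUMNS, tokens):
        record[name] = None if token == "." else float(token)
    # The child ID is an integer key, so convert it back from float.
    record["UID"] = int(record["UID"])
    return record
```

Applied to the second example row above (joined with commas), this would give `FHTCM` as `None` and the three DV columns as floats.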
For each prediction $p_i = (w_i, l_i, c_i)$ where at least one of the DV values is missing, the error from the true weight, recumbent length and head circumference will be measured as the squared Mahalanobis distance,

$$e_i = (p_i - t_i)^{T} S^{-1} (p_i - t_i),$$

where $t_i$ is the vector of true values in the same order as $p_i$, and $S^{-1}$ is the inverse of the sample covariance matrix calculated on the complete dataset:

$$S^{-1} = \begin{pmatrix} 11.90869495 & -7.523165469 & -4.11222794 \\ -7.523165469 & 13.5665806 & -4.742982596 \\ -4.11222794 & -4.742982596 & 8.669060303 \end{pmatrix}$$
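The error metric can be sketched in Python as follows. This is only an illustration: the names `INVERSE_S` and `squared_mahalanobis` are hypothetical, and the sketch assumes that the rows and columns of the inverse covariance matrix follow the order weight, recumbent length, head circumference.

```python
from typing import Sequence

# Inverse sample covariance matrix from the problem statement.
# Assumption: rows/columns are ordered (weight, length, head circumference).
INVERSE_S = [
    [11.90869495, -7.523165469, -4.11222794],
    [-7.523165469, 13.5665806, -4.742982596],
    [-4.11222794, -4.742982596, 8.669060303],
]


def squared_mahalanobis(
    predicted: Sequence[float],
    actual: Sequence[float],
    inverse_s: Sequence[Sequence[float]] = INVERSE_S,
) -> float:
    """Return d^T * S^-1 * d, where d = predicted - actual."""
    diff = [p - a for p, a in zip(predicted, actual)]
    size = len(diff)
    return sum(
        diff[i] * inverse_s[i][j] * diff[j]
        for i in range(size)
        for j in range(size)
    )
```

Because the off-diagonal entries are negative, errors that share a sign across the three measurements are penalised less than errors that point in opposite directions.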
Scores will be calculated as a generalized R^2 measure of fit, as follows. The total error for the submission will be calculated as SSE = SUM(e_i), summed over the predictions where at least one of the DV values is missing.
A baseline error will be calculated over the same predictions by predicting the sample mean of each measurement, that is, the mean values of w, l and c in the current training set:
SSE0 = SUM(e0_i)
Then the submission score will be
Score = 1000000 * MAX(1 - SSE/SSE0, 0).
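For illustration, the scoring formula can be written as a short Python function. The name `submission_score` is hypothetical, and the function assumes the per-prediction errors for the submission and for the mean-value baseline have already been computed.

```python
from typing import Iterable


def submission_score(errors: Iterable[float], baseline_errors: Iterable[float]) -> float:
    """Score = 1000000 * MAX(1 - SSE/SSE0, 0)."""
    sse = sum(errors)
    sse0 = sum(baseline_errors)
    if sse0 <= 0:
        # Assumption: a non-positive baseline is treated as invalid input here.
        raise ValueError("baseline error must be positive")
    return 1000000 * max(1 - sse / sse0, 0)
```

A submission that does no better than predicting the training means scores 0, and a perfect submission scores 1000000.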
In trainingData, each string is one measurement record and has 16 comma-separated tokens, in the same order as in the table above. As before, missing values for non-DV variables are given as “.” strings. You can assume that all DV values are present in trainingData. The format of testingData is almost the same; the only difference is that some of the DV values are also replaced by “.” strings, and your task is to predict them. The replacement works as follows:
N = number of time points for an ID
X = random between 0 and N/2 inclusive
Y = random between X and N inclusive
foreach time point W(1..N) for an ID
    if W <= X then all three DV values present
    else if W <= Y then 'c' is replaced by "."
    else all three DV values are replaced by "."
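To reproduce this masking locally, for example when building a validation set from trainingData, the procedure can be sketched in Python. The function name `mask_child` and the status labels are hypothetical, and the sketch assumes that N/2 means integer division and that both random draws are uniform over integers.

```python
import random
from typing import List


def mask_child(n: int, rng: random.Random) -> List[str]:
    """Return one status label per time point 1..n for a single child ID."""
    x = rng.randint(0, n // 2)  # assumption: N/2 is rounded down
    y = rng.randint(x, n)       # randint includes both endpoints
    statuses = []
    for w in range(1, n + 1):
        if w <= x:
            statuses.append("all_present")   # all three DV values kept
        elif w <= y:
            statuses.append("c_missing")     # head circumference replaced by "."
        else:
            statuses.append("all_missing")   # all three DV values replaced by "."
    return statuses
```

Note that each child always follows the same pattern: a fully observed prefix, then a stretch with only head circumference hidden, then a tail with all three DV values hidden; any of the three segments may be empty.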
Records with the same ID are consecutive and ordered by AGEDAYS (time point). Each returned string should contain the predictions for weight, recumbent length and head circumference of the child, in this order, comma-separated, for the corresponding time point; the strings must be in the same order as in testingData. The length of the returned array equals the number of measurements.
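The return format can be illustrated with a deliberately naive Python sketch that carries each child's last observed DV values forward. This is a placeholder to show the expected output shape, not a suggested prediction method; the name `carry_forward_predictions` and the fallback value of 0.0 (used when a child has no observed value yet) are assumptions made for this example.

```python
from typing import List, Sequence


def carry_forward_predictions(
    testing_data: Sequence[str],
    fallback: Sequence[float] = (0.0, 0.0, 0.0),
) -> List[str]:
    """Return one 'weight,length,head' string per record in testing_data."""
    results = []
    current_uid = None
    last_seen = list(fallback)
    for line in testing_data:
        tokens = [token.strip() for token in line.split(",")]
        if tokens[0] != current_uid:
            # New child: records with the same ID are consecutive.
            current_uid = tokens[0]
            last_seen = list(fallback)
        # Columns 14..16 (indices 13..15) hold WTKG, LENCM, HCIRCM.
        for k, token in enumerate(tokens[13:16]):
            if token != ".":
                last_seen[k] = float(token)
        results.append(",".join(repr(value) for value in last_seen))
    return results
```

The output has exactly one string per input record, including records whose DV values are all present.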
NOTE: All data values are normalized between -6 and 6 as part of data obfuscation requirements.
Notes on Data Set Generation
- The full data set contains approximately 20,000 lines, covering almost 2000 ID values.
- The full data set is divided into 35% for example tests, 20% for provisional tests, and 45% for system tests. All data belonging to the same ID is placed in the same data set.
- For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing. For system tests only, approximately 20% of the IDs are then dropped from the testing set.
- For provisional tests, all example data is also added to the training set.
- For system tests, all example and approximately 80% of provisional data is also added to the training set.
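For local validation, a similar split can be approximated in Python. The sketch below is an assumption-laden illustration rather than the organisers' actual procedure: the function name `split_by_id` is hypothetical, and it assumes the 66% selection is made per ID so that all records of a child stay on the same side of the split.

```python
import random
from typing import List, Sequence, Tuple


def split_by_id(
    records: Sequence[str],
    train_fraction: float = 0.66,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split comma-separated records into (train, test), grouped by the ID token."""
    ids = sorted({record.split(",")[0] for record in records})
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    train_ids = set(ids[:n_train])
    # Preserve the original record order within each part.
    train = [r for r in records if r.split(",")[0] in train_ids]
    test = [r for r in records if r.split(",")[0] not in train_ids]
    return train, test
```

Splitting by ID rather than by row avoids leaking a child's later measurements into the training side while its earlier ones are being predicted.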