Contest: Child Stuntedness 5
Problem: ChildStuntedness5

Problem Statement



  • 1st Place - $10,000
  • 2nd Place - $5,000
  • 3rd Place - $2,500
  • 4th Place - $1,500
  • 5th Place - $1,000

Stunting (i.e., shortness for age) affects more than one in four children worldwide. Wasting (i.e., being underweight for age) and stunting in early childhood are associated with lethargy, reduced levels of play, an increased risk of early death, higher burden of disease, compromised physical capacities, and diminished cognitive development. Stunting and wasting in the first two years of life have been shown to be associated with lower school attainment and reduced economic productivity. This can reduce the productivity of an entire generation. Furthermore, stunting between 12 and 36 months has also been linked to poor cognitive performance and/or lower school grades in middle childhood, and both height and head circumference at 2 years have been shown to be inversely associated with educational attainment.

In this challenge, we explore the link between cognitive development and child stuntedness. While stuntedness is known to correlate with poor cognitive development, we are interested in finding out if this is reversible. Are there children who were born stunted but nevertheless are able to successfully overcome their slow cognitive development rates later in life? Furthermore, do children born stunted who overcome their small size (perhaps by adequate nutrition) also show increased cognitive ability? Are there external factors (either inherited from the parents or environmental) which can contribute positively to this recovery?

In pursuit of the above, we have collected time series measurements of child growth and family trait data (e.g., mother’s age, mother’s height, number of previous pregnancies, breast-feeding practices, and father’s height). You will need to use this data to predict a child’s IQ measured at 7 years of age in the attached dataset, which contains censored values. To test our hypotheses, we would like to predict IQ in 3 scenarios:

  1. Just using:
    • Columns 1, 12, 14, 15, 16, 17, 18
    • weight at birth
    • length at birth
    • gestational age at birth
    • APGAR scores at birth
    • sex at birth
  2. Everything in Scenario 1 above and:
    • Columns 2, 3, 4, 5, 6, 7, 8, 9, 10
    • weight measurements at various points in the child’s growth
    • length (i.e., height) measurements at various points in the child’s growth
    • combinations of weight and length at various points in the child’s growth (e.g., weight/length^2, etc.)
    • velocity of change in weight and height (also called growth tempo) between measurement periods
  3. Everything in Scenario 2 above and:
    • Columns 11, 13, 19, 20, 21, 22, 23, 24, 25, 26
    • Investigation site ID
    • Characteristics of the parents
    • The extent to which the child was breastfed
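
The three scenarios are cumulative, so they can be read as growing sets of permitted column numbers. The sketch below encodes the lists above; the class and method names are hypothetical and not part of the contest API. It takes the scenario number as listed here (1, 2, or 3), so a caller holding the 0-based scenario parameter would presumably pass scenario + 1.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the contest API.
final class ScenarioColumns {
    // 1-based column numbers copied from the scenario lists in the statement.
    private static final int[] SCENARIO_1 = {1, 12, 14, 15, 16, 17, 18};
    private static final int[] SCENARIO_2_EXTRA = {2, 3, 4, 5, 6, 7, 8, 9, 10};
    private static final int[] SCENARIO_3_EXTRA = {11, 13, 19, 20, 21, 22, 23, 24, 25, 26};

    /** Returns the 1-based column numbers usable in scenario 1, 2, or 3. */
    static Set<Integer> allowedColumns(int scenarioNumber) {
        if (scenarioNumber < 1 || scenarioNumber > 3) {
            throw new IllegalArgumentException("scenario must be 1, 2, or 3");
        }
        Set<Integer> cols = new TreeSet<>();
        for (int c : SCENARIO_1) cols.add(c);
        if (scenarioNumber >= 2) {
            for (int c : SCENARIO_2_EXTRA) cols.add(c);
        }
        if (scenarioNumber >= 3) {
            for (int c : SCENARIO_3_EXTRA) cols.add(c);
        }
        return cols;
    }
}
```

Under this reading, Scenario 1 permits 7 columns, Scenario 2 permits 16, and Scenario 3 permits all 26 input columns.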

You may download the learning data set here.

Data Set Description

Col# Name     Type   Description/Notes
1    subjid   int    Subject ID (In ascending order, not all values necessarily exist)
2    agedays  int    Age of child in days (Day 1 = day of birth)
3    wtkg     float  Weight (kg)
4    htcm     float  Standing height (cm)
5    lencm    int    Recumbent length (cm)
6    bmi      float  Body Mass Index (kg/m^2)
7    waz      float  Weight for age Z-score (Per WHO algorithm)
8    haz      float  Height for age Z-score (Per WHO algorithm)
9    whz      float  Weight for height Z-score (Per WHO algorithm)
10   baz      float  BMI for age Z-score (Per WHO algorithm)
11   siteid   int    Investigation site ID (several values in the range 5-82)
12   sexn     int    Sex of the child (1 = Male, 2 = Female)
13   feedingn int    Breast feeding category 
                        1 = Exclusively breast fed
                        2 = Exclusively formula fed
                        3 = Mixture breast/formula fed
                       90 = Unknown  
14   gagebrth int    Gestational age at birth in days
15   birthwt  int    Birth weight (grams)
16   birthlen int    Birth length (cm)
17   apgar1   int    APGAR score at 1 minute post birth
18   apgar5   int    APGAR score at 5 minutes post birth
19   mage     int    Maternal age at birth of child (years)
20   demo1n   int    Maternal demographic variable 1 (Nominal value 1 or 2)
21   mmaritn  int    Mother’s marital status
                        1 = Married
                        2 = Common law
                        3 = Separated
                        4 = Divorced
                        5 = Widowed
                        6 = Single
22   mcignum  int    Mother's # of cigarettes per day during pregnancy
23   parity   int    Maternal parity (# of previous live births at the time of this child’s birth)
24   gravida  int    Maternal gravidity (# of previous times pregnant)
25   meducyrs int    Mother's education level (years)
26   demo2n   int    Maternal demographic variable 2 (Nominal value 1, 2, 3, 4 or 5)

27   geniq    int    IQ measured at age 7
                     * Variable to predict
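
As an illustration of how the growth columns might be combined, the following sketch computes a weight/length^2 ratio and a per-day growth velocity from agedays and a measurement series such as wtkg or lencm. The class and method names are hypothetical. It assumes the values have already been parsed into numbers, that a child's measurements are sorted by age, and that missing or censored entries have been filtered out beforehand (how such entries are marked is not stated here).

```java
// Hypothetical helper, not part of the contest API.
final class GrowthFeatures {
    /** Weight (kg) divided by squared length (m); length is supplied in cm as in the table. */
    static double weightOverLengthSquared(double wtkg, double lencm) {
        double meters = lencm / 100.0;
        return wtkg / (meters * meters);
    }

    /**
     * Change per day between consecutive measurements of one child.
     * ASSUMPTION: inputs are sorted by agedays and contain no missing values.
     * The result has one entry fewer than the input.
     */
    static double[] velocity(int[] agedays, double[] values) {
        if (agedays.length != values.length) {
            throw new IllegalArgumentException("agedays and values must have equal length");
        }
        int n = values.length;
        double[] v = new double[Math.max(0, n - 1)];
        for (int i = 1; i < n; i++) {
            int dt = agedays[i] - agedays[i - 1];
            // Two measurements on the same day give no usable velocity.
            v[i - 1] = dt > 0 ? (values[i] - values[i - 1]) / dt : Double.NaN;
        }
        return v;
    }
}
```

For example, weights of 3.0, 4.2, and 6.0 kg at days 1, 31, and 91 give velocities of 0.04 and 0.03 kg per day.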

Code and Scoring

Your code will be given String[] training and String[] testing, and will need to return a double[], containing one value (the predicted IQ) for each ID present in the test data. The test data will be provided in order by ID, and your return values should be in that same order.
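
The return contract can be illustrated with a trivial baseline that predicts the training mean for every test ID. This is only a sketch under assumptions: the class name is hypothetical, and the row format is assumed to be comma-separated with subjid as the first field and geniq as the last field of each training row, none of which is specified in this statement.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class name; the required class name is not stated here.
final class BaselinePredictor {
    public double[] predict(int testType, int scenario, String[] training, String[] testing) {
        // ASSUMPTION: comma-separated fields, subjid first, geniq last in training rows.
        Map<String, Double> iqById = new LinkedHashMap<>();
        for (String row : training) {
            String[] f = row.split(",", -1);
            try {
                double iq = Double.parseDouble(f[f.length - 1].trim());
                iqById.putIfAbsent(f[0].trim(), iq); // keep one IQ value per child
            } catch (NumberFormatException e) {
                // Skip rows whose IQ field is not numeric.
            }
        }
        if (iqById.isEmpty()) {
            throw new IllegalArgumentException("no usable IQ values in training data");
        }
        double sum = 0.0;
        for (double v : iqById.values()) sum += v;
        double mean = sum / iqById.size();

        // One prediction per distinct test ID, in the order the IDs first appear.
        Set<String> testIds = new LinkedHashSet<>();
        for (String row : testing) testIds.add(row.split(",", -1)[0].trim());
        double[] out = new double[testIds.size()];
        Arrays.fill(out, mean);
        return out;
    }
}
```

Because the test rows arrive ordered by ID, collecting distinct IDs in encounter order yields the required output order. A constant prediction like this carries no information about individual children, so it serves only as a starting point.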

Your code will also be given ints testType and scenario, indicating the type of test and which of the three scenarios is being tested.

Your code will be scored by calculating the SSE (sum of squared errors) of your predictions. A baseline, SSE0, will also be calculated by using the average of all IQ values in the training set as the prediction. Your score for a test case will then be given by:

Score = 1000000 * MAX(0, 1 - SSE/SSE0)

Your overall score will be the average score across all test cases.
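
For local evaluation, the per-test score can be reproduced roughly as follows. The class and method names are hypothetical, and the training mean is passed in by the caller because the exact way the tester averages the training IQ values (per row or per child) is not spelled out here.

```java
import java.util.Arrays;

// Hypothetical local scorer, not the official tester.
final class Scorer {
    /** Sum of squared differences between predictions and true values. */
    static double sse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = predicted[i] - actual[i];
            total += d * d;
        }
        return total;
    }

    /** Score = 1000000 * MAX(0, 1 - SSE/SSE0), with SSE0 from a constant prediction. */
    static double score(double[] predicted, double[] actual, double trainingMean) {
        double[] baseline = new double[actual.length];
        Arrays.fill(baseline, trainingMean);
        double sse0 = sse(baseline, actual);
        // If sse0 is 0 the ratio is undefined; the statement does not cover that case.
        return 1_000_000.0 * Math.max(0.0, 1.0 - sse(predicted, actual) / sse0);
    }
}
```

With true values 90, 100, 110 and a training mean of 100, SSE0 is 200; predictions of 95, 100, 105 give an SSE of 50 and therefore a score of 750000. Any prediction no better than the training mean scores 0.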

Notes on Data Set Generation

  • The full data set contains approximately 138,000 lines, covering just over 30,000 ID values.
  • The full data set is divided into 40% for example tests, 20% for provisional tests, and 40% for system tests. All data belonging to the same ID is placed in the same data set.
  • For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing.
  • For provisional tests, all example data is also added to the training set.
  • For system tests, all example data and approximately 50% of the provisional data are also added to the training set.
  • There are 10 example cases and 50 provisional test cases.
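
A local validation split that mimics these notes might keep all rows of a child on the same side and put roughly 66% of the IDs into training. The sketch below is an approximation under assumptions: the names are hypothetical, rows are assumed to be comma-separated with the ID in the first field, and the actual selection procedure used by the tester is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for local validation only.
final class IdSplit {
    /** Returns [trainRows, testRows]; every row of a given ID lands on the same side. */
    static List<List<String>> split(String[] rows, double trainFraction, long seed) {
        // ASSUMPTION: comma-separated rows with the subject ID in the first field.
        Map<String, List<String>> byId = new LinkedHashMap<>();
        for (String row : rows) {
            String id = row.split(",", -1)[0].trim();
            byId.computeIfAbsent(id, k -> new ArrayList<>()).add(row);
        }
        List<String> ids = new ArrayList<>(byId.keySet());
        Collections.shuffle(ids, new Random(seed)); // seeded for reproducibility
        int cut = (int) Math.round(ids.size() * trainFraction);

        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            (i < cut ? train : test).addAll(byId.get(ids.get(i)));
        }
        return Arrays.asList(train, test);
    }
}
```

The test rows produced this way follow the shuffled ID order, so they would need sorting by ID before being passed to a predictor that relies on the ordering described above.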

Notes on Time Limits

Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited to 360s (6 minutes), provisional tests to 540s (9 minutes), and system tests to 900s (15 minutes). The testType parameter will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account. Similarly, the scenario parameter is also 0, 1, or 2, referring to Scenarios 1, 2, and 3 listed above, respectively.
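
One way to take timing into account is to derive a deadline from testType at the start of the call and stop training once it passes. The sketch below is hypothetical: the names and the safety fraction are illustrative choices, and whether the limit also covers start-up or data loading is not stated here.

```java
// Hypothetical helper, not part of the contest API.
final class TimeBudget {
    /** Time limit in seconds for testType 0 (example), 1 (provisional), or 2 (system). */
    static int limitSeconds(int testType) {
        switch (testType) {
            case 0: return 360;
            case 1: return 540;
            case 2: return 900;
            default: throw new IllegalArgumentException("unknown testType: " + testType);
        }
    }

    /**
     * Deadline in System.nanoTime() units.
     * safetyFraction (for example 0.8) is an illustrative margin, not a contest rule.
     */
    static long deadlineNanos(long startNanos, int testType, double safetyFraction) {
        return startNanos + (long) (limitSeconds(testType) * 1e9 * safetyFraction);
    }
}
```

A training loop could then call deadlineNanos(System.nanoTime(), testType, 0.8) once at the start and check System.nanoTime() against the result between iterations.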



Parameters: int, int, String[], String[]
Method signature: double[] predict(int testType, int scenario, String[] training, String[] testing)
(be sure your method is public)


Seed = 1
Seed = 2
Seed = 3
Seed = 4
Seed = 5
Seed = 6
Seed = 7
Seed = 8
Seed = 9

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.