JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Marathon Match
Problem: DemographicMembership

Problem Statement

    

Prize Distribution

Prize      US$
1st     10,000
2nd      5,000
3rd      2,500
4th      1,500
5th      1,000

Background

A media company generates revenue by selling advertising space to firms in which to advertise their products and services. Advertisers pay for access to audiences of certain kinds of media consumers, and an important demographic into which some consumers fall is called Demographic X. Being able to identify individual media consumers in Demographic X so that the company can run targeted marketing campaigns is important to the firm’s business.

Objective

In the customary training and test framework, create an algorithm that uses demographic data on individual media consumers to predict whether or not they are in Demographic X.

Data Description

In the data available for this contest, each record contains 244 columns belonging to a single media consumer, described here. The first column (CONSUMER_ID) is the consumer unique identifier, and the last column (DEMO_X) is the response variable.

Data Sets

  • The full data set used in this contest contains a total of 20,072 media consumers.
  • The full data set is divided randomly into 40% for example tests, 20% for provisional tests, and 40% for system tests.
  • For each test, approximately 66% of the data (from that segment) is randomly selected for training, and the remainder for testing.
  • For provisional tests, all example data is also added to the training set.
  • For system tests, all example and approximately 50% of the provisional data, randomly selected, is also added to the training set.
  • There are 2 example cases, 10 provisional test cases and at least 50 system test cases.

The example data set can be downloaded here. It contains a header (first row), with the name of each column.

Functions

You must implement a class DemographicMembership with only one function: predict, which receives the testType (0, 1, or 2, to indicate Example, Provisional, or System test, respectively), trainingData, and testingData.

In the string[] trainingData, each string contains a single record, and has 244 tokens, comma-separated, in the same order as described in Data Description section above.

The format of testingData is the same as the trainingData, except for the absence of the last column, which is the variable to be predicted.

The returned int[] should contain the corresponding predictions, which can be 0 or 1 only, in the same order as in testingData. The length of the return array should be equal to the length of testingData.

Scoring

                    +------------------------------------+
                    |              Truth                 |
+-------------------+----------------+-------------------+---------------+
|    Prediction     |  Demographic X | Not Demographic X |     Total     |
+-------------------+----------------+-------------------+---------------+
|   Demographic X   |        a       |         b         |     a + b     |
+-------------------+----------------+-------------------+---------------+
| Not Demographic X |        c       |         d         |     c + d     |
+-------------------+----------------+-------------------+---------------+
|      Total        |      a + c     |       b + d       | a + b + c + d |
+-------------------+----------------+-------------------+---------------+

Using the table above in which a through d represent cell counts, define the following:

    Precision = a / (a + b)
    Recall = a / (a + c)

The score for a single test case will be calculated as:

    Score = 1,000,000 * MIN(Precision, Recall)

If your return is invalid (i.e., does not have the expected length or contains a value different from 0 and 1), then your score for that test case will be zero. Your overall score is the arithmetic mean of all cases.

Limits

Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited to 600s (10 minutes), provisional tests to 900s (15 minutes) and system tests to 1500s (25 minutes). The testType parameter will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account.

The memory limit is 2048 megabytes.

There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). The compilation time limit is 60 seconds.

You will be allowed to make example submissions each 30 minutes and provisional submissions each 3 hours.

General Notes

  • This match is rated.
  • The allowed programming languages are C++, Java, C#, Python and VB.
  • You can include open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.
  • The test servers have only the default installs of all languages, so no additional libraries will be available.
  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

Requirements to Win a Prize

In order to receive a prize, you must do all of the following:

  1. Achieve a score in the top 5, according to system test results. See the “Scoring” section above.
  2. Within 7 days from the announcement of the challenge winners, submit a complete report at least 2 pages long containing at least the following sections:
    • Your Information
      • First Name
      • Last Name
      • Topcoder handle
      • Email address
    • Approach Used

      Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

      This section must contain at least the following:

      • Approaches considered
      • Approach ultimately chosen
      • Steps to approach ultimately chosen, including references to specific lines of your code
      • Advantages and disadvantages of the approach chosen
      • Comments on open source resources used
      • Special guidance given to algorithm based on training
      • Potential improvements to your algorithm
    • Local Programs Used
      • If any parameters were obtained from the training data set, you will also need to provide the program(s) used to generate these parameters.
      • There is no restriction on the programming language/libraries used to generate these training parameters, provided they are a free, open-source solution for you to use and would be for the client as well.

If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.

 

Definition

    
Class:DemographicMembership
Method:predict
Parameters:int, String[], String[]
Returns:int[]
Method signature:int[] predict(int testType, String[] trainingData, String[] testingData)
(be sure your method is public)
    

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.