- 1st - $9,000
- 2nd - $4,500
- 3rd - $2,250
- 4th - $1,350
- 5th - $900
An electronics parts supplier ("Supplier") sells thousands of parts to businesses nationwide. The firm’s customers view a certain subset of parts differently from the rest. Parts in that subset are readily available, can be purchased from more firms than just the Supplier, and are integral to the business operations of the Supplier's customers. As a result, the firm refers to these items as "special parts" ("SPs").
SPs are believed to meet the following criteria:
A customer purchases an SP in similar quantities over time:
- Regardless of changes in "the market price" (i.e., PRODUCT_COST1 in the data set); and
- Regardless of change in price markup over "the market price" (i.e., PRODUCT_COST1 in the data set) for that part, up to a certain limit.; and
At regular intervals, Supplier designates certain parts as SPs, based on past experience and intuition. That designation helps inform business decisions.
While the two criteria above constitute Supplier’s working definition of an SP, the company suspects that more detailed and specific rules involving other dimensions may also affect customer perception of SPs, like season and geographic location, and customer segment. The firm also suspects that SPs may vary from one customer to another.
In the customary training and test framework:
- Create an algorithm that uses independent variables (e.g., order source), transaction history (e.g., price, quantity shipped, transaction date) and the 2 rules above to predict which products are SPs;
- Don’t use neural networks or other black box approaches; and
- On the basis of your algorithm, write additional rules, criteria, etc., documenting the variables and relationships between them that identify SPs.
The company believes SPs have the characteristics above, however, the extent to which the data supports that belief is unknown. Begin your work assuming the characteristics supposed to define SPs are correct, document evidence against those characteristics defining SPs, and use the variables and relationships between them that are most predictive of SPs in your algorithm.
There are 29 variables of data available for this contest, which are described here. The first column, PRODUCT_NUMBER, is the unique product identifier, and multiple transaction records will pertain to the same product number. The 9th column, CUSTOMER_SEGMENT1, is the main customer segment for this transaction. The last column, SPECIAL_PART, is the response variable. SPECIAL_PART will have the same value for all transactions which share the same values of both PRODUCT_NUMBER and CUSTOMER_SEGMENT1.
- The full data set used in this contest contains a total of 93,958 observations.
- The full data set is divided into 60% for example tests, 20% for provisional tests, and 20% for system tests. Each product’s transactions will be completely contained within one of these sets.
- For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing. Each product’s transactions will be completely within the training set or the testing set.
- For provisional tests, all example data is also added to the training set.
- For system tests, all example and approximately 50% of the provisional data is also added to the training set.
- There are 2 example cases, 10 provisional test cases and at least 50 system test cases.
The example data set can be downloaded here. It contains a header (first row), with the name of each column.
We created the SPECIAL_PART variable on the basis of the results of a survey of Supplier’s sales management. In the survey, respondents were asked the degree to which they agreed that each product was an SP for certain product segments in their territories. We combined their responses to create the SP indicator. Implicit in our creation and use of the SP variable is the assumption that, for a certain geography and territory, an SP is an SP for all customers at all points in time in the same segment (same value of CUSTOMER_SEGMENT1).
A supplementary data set of 365,205 observations is also available here. Note that SPECIAL_PART designation is not available for these observations, and, as a result, we can’t use them to evaluate your algorithms. We’ve supplied them in case you are able to find a way to make any use of them in your efforts.
You must implement a class ElectronicPartsClassification with only one function, classifyParts, which receives trainingData, and testingData.
In the string trainingData, each string contains a single record, and has 29 tokens, comma-separated, in the same order as described in the Data Description section above.
The format of testingData is the same as the trainingData, except for the absence of the last column, which is the variable to be predicted.
Each element in the returned string should correspond to a single product in the testingData. Each element should contain three comma separated values: PRODUCT_NUMBER, SPECIAL_PART for CUSTOMER_SEGMENT1 = A, and SPECIAL_PART for CUSTOMER_SEGMENT1 = B. The SPECIAL_PART predictions should take on one of four possible values: “Yes”, “No”, “Maybe”, or “NA”. “NA” must only be used if this product had no transactions in the corresponding segment. For example, the element "12345,Yes,NA" would indicate that the product number 12345 is a SP in segment A and has no transactions in segment B. A single prediction element must be returned for every unique product number in the testingData. If you fail to return a prediction for any product, return a prediction for a product not in testingData, or return “NA” for a product which does have transactions in the corresponding segment, you will receive a score of zero for this test.
The predictions from your algorithm will be tabled as shown below, where "a" through "i" represent the cell totals in the 3 x 3 table. Correct predictions of "NA" are not included in these tables.
Table 2 below represents the cost of the various kinds of misclassification errors.
Score = 1,000,000 * (1 - sum_ij(Table1(i,j) * Table2(i,j)) / sum_ij(Table1(i,j)))
Where sum_ij sums over all 9 elements in Tables 1 / 2. If there are ties in the final system test results, we will split the prize between the tied contestants.
- All test are limited to 600s (10 minutes) running time.
- The memory limit is 2048 megabytes.
- There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). The compilation time limit is 60 seconds.
- You will be allowed to make example submissions each 30 minutes and provisional submissions each 3 hours.
- This match is rated.
- The allowed programming languages are C++, Java, C#, Python and VB.
- You can include open source code in your submission, provided it is free for you to use and would be for the client as well. Code from open source libraries under the BSD or MIT licenses will be accepted. Other open source licenses may be accepted too, just be sure to ask us.
- The test servers have only the default installs of all languages, so no additional libraries will be available.
- Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.
Requirements to Win a Prize
In order to receive a prize, you must do all the following:
- Achieve a score in the top 5, according to system test results. See the Scoring section above.
- Within 7 days from the announcement of the challenge winners, submit a complete report at least 2 pages long containing at least the following sections:
- First name
- Last name
- Topcoder handle
- Email address
Please describe how your algorithm identifies SPs so that a non-technical reader can understand exactly what it does in the simplest and clearest terms possible.
- Rules, criteria, variables, and relationships between them that identify SPs
- Justification for the above on the basis of your algorithm, including references to specific lines of your code where applicable
Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.
- Approaches considered
- Approach ultimately chosen
- Justification for approach ultimately chosen
- Steps to approach ultimately chosen, including references to specific lines of your code
- Advantages and disadvantages of the approach chosen
- Comments on open source resources used
- Special guidance given to algorithm based on training
- Potential improvements to your algorithm
Local Programs Used
- If any parameters were obtained from the training data set, you will also need to provide the program(s) used to generate these parameters.
- There is no restriction on the programming language/libraries used to generate these training parameters, provided they are a free, open-source solution for you to use and would be for the client as well.
If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.