Contest: Soteria Serious Injury Prediction
Problem: SeriousInjuryPrediction

Problem Statement

    The goal of this competition is to create a solution to predict the number of serious injuries by employer unit that may occur in a given year, based on historical data about previous injuries, preventive actions, incidents, industry etc.

Problem Overview

In this competition you’ll work with real-world data, which is organized as follows:

  • Static data – information about employer units such as sector, subsector, employer type, registration date and classification of nature and source of injuries, accident types and occupations.
  • Dynamic data – historical data (up to 12 years back) for each employer unit, which contains:
    • Claims (short term, long term, fatal) and injuries.
    • Preventive actions, inspections, penalties, warning letters from authorities.
    • Number of employees, payroll, ‘performance’ relative to industry.

You’ll need to implement an algorithm that outputs a prediction about the ‘next’ year's number of serious injuries for a given employer unit.

Serious Injury Definition

All injuries are classified following the criteria set out by the "International Classification of Diseases 9th revision Clinical Modification".

In this match, we will call ‘serious’ an injury that satisfies one of the following conditions:

  1. Its ICD9 code is in the ‘Tier 1’ list.
  2. Its ICD9 code is in the ‘Tier 2’ list and there are at least 50 days of wage loss related to this injury.

The data contains the ICD9 code for each injury, as well as the associated wage loss. For your convenience, you’ll also have a field indicating whether the injury is classified as serious. You can also find the list here.
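
The two conditions above can be sketched as a small check. This is only an illustration: the sets tier1 and tier2 are hypothetical containers for the codes of the two lists, and representing wage loss as a whole number of days is an assumption. In practice the provided 'serious' field should make this check unnecessary.

```java
import java.util.Set;

class SeriousInjuryRule {
    // Returns true when an injury meets either condition of the definition.
    // tier1 and tier2 are hypothetical sets holding the ICD9 codes of each list.
    static boolean isSerious(String icd9, int wageLossDays,
                             Set<String> tier1, Set<String> tier2) {
        // Condition 1: any Tier 1 code is serious regardless of wage loss.
        if (tier1.contains(icd9)) {
            return true;
        }
        // Condition 2: a Tier 2 code is serious only with at least 50 days of wage loss.
        return tier2.contains(icd9) && wageLossDays >= 50;
    }
}
```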


The data provided to your algorithm in this competition is composed of several logical groups and is quite complex. Please be sure to take enough time to understand it and don't hesitate to ask questions in the forum.

The logical groups are:
  • Employer referential - this is composed of 5 arguments (see below for explanation):
    • Employers
    • Classification Units
    • EmployersCu
    • Locations
    • EmployersCuLo
  • Employer assessment - number of employees, payroll, etc. at the EmployersCu level:
    • EmployerAssesment
  • Claims and injuries. The 'Classification' contains descriptions of different aspects of claims. They usually have several different classifications (lvl1, lvl2, lvl3, cwis codes, internal codes). Note that the claims data is provided with EmployerCuLo granularity while your solution needs to deliver predictions on EmployerCu granularity:
    • Claims: the list of all claims for the selected employers, with details about the occupation type, injury type, source of injury etc.
    • OccupationType: classification of jobs, sufficiently similar in work performed to be grouped under a common title.
    • SourceOfInjuryType: classification of the object, substance, exposure or bodily motion that directly produced, transmitted or inflicted the injury or illness.
    • NatureOfInjury: classification of an accident or exposure that describes the nature and the characteristics of the injury or disease.
    • AccidentType: classification of an accident or exposure that describes the manner in which the injury or disease was produced or inflicted by the identified source of injury or disease.
    • BodyPart: classification of an accident or exposure that describes the general area or body system of the injury or disease.
  • Prevention Correction activities (with EmployerCuLo granularity):
    • PreventionCorrectionActions: this list contains all the actions (inspections, education presentations, recommendations for proceeding etc) taken by authorities.
    • HealthAndSafetyActions: details for actions that are of IR type (Safety and Health Inspection).
    • VsrpActions: details for actions that are of RFS type -- recommendations for a Variance and Sanction Review Proceeding.
    • ActionsRegulations: help file that contains the 1-n relations between preventive actions and regulations they deal with.
    • ActionsRelations: lists all relations between different actions (1 action can relate to n others).
    • InspectionOrders: details for orders (instructions and warnings).
    • InspectionOrdersRegulations: help file that contains the 1-n relations between inspection orders and regulations.
    • NoticeOfIncidents: details about incidents (not all injuries are linked to an incident).
    • Regulations: list of different regulations and their basic classification.

The static data provided is complete: all employer units you’ll be queried for in example / provisional / final testing are included in this file (though not all are necessarily included in queries).

For more detailed field definitions, please check the csv file.

Please note that this is a real-world competition, so some of the data may be erroneous and/or incomplete.

All the files provided to you in init method are available here:


Employers and employer units

There are three layers of classification for each employer.

At the highest level, we have an employer which is uniquely identified by the employerID. This represents the company as a whole.

On the second level, we have employer / classification unit (identified by the EmployerCuID): these are distinct business units in which an employer operates. An employer can have several units operating in different businesses.

On the third and most detailed level, we have employer / classification unit / location (identified by the EmployerCuLoID): in addition to the classification unit, this level specifies the location at which the unit is registered. A business unit can operate in several different locations.

You are also provided with a complete list of classification units (used in the second layer) and locations (used in the third layer) available with some additional descriptions.
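
Because claims and actions are keyed by EmployerCuLoID while predictions are made per EmployerCuID, one possible first step is to roll the third layer up to the second. The sketch below assumes each row is a semicolon-separated string and that the caller knows the column positions (loCol, cuCol); those positions and the class name are hypothetical and should be checked against the field definitions.

```java
import java.util.HashMap;
import java.util.Map;

class CuAggregation {
    // Builds a lookup EmployerCuLoID -> EmployerCuID from EmployersCuLo rows.
    // loCol and cuCol are assumed column positions, not taken from the data spec.
    static Map<String, String> buildLoToCu(String[] employersCuLo, int loCol, int cuCol) {
        Map<String, String> loToCu = new HashMap<>();
        for (String row : employersCuLo) {
            String[] f = row.split(";", -1); // -1 keeps trailing empty fields
            if (f.length <= Math.max(loCol, cuCol)) {
                continue; // skip malformed rows; the data may be incomplete
            }
            loToCu.put(f[loCol].trim(), f[cuCol].trim());
        }
        return loToCu;
    }

    // Counts rows (for example claims) per EmployerCuID, given rows keyed by EmployerCuLoID.
    static Map<String, Integer> countPerCu(String[] rows, int loCol, Map<String, String> loToCu) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            String[] f = row.split(";", -1);
            if (f.length <= loCol) {
                continue;
            }
            String cu = loToCu.get(f[loCol].trim());
            if (cu == null) {
                continue; // location unit not present in the referential
            }
            counts.merge(cu, 1, Integer::sum);
        }
        return counts;
    }
}
```

If the arrays carry a header row, it would simply be stored as one extra, harmless mapping entry; whether a header is present is not stated here.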

Implementation, testing and scoring

You’ll need to implement two methods:

  • init: this method will be called first and takes as arguments:
    • Data previously described (all are String[]s, semicolon separated): this is your training data for the problem.
    • EmployersCuListTraining -- String[] containing the list of all EmployersCu included in the training data. It is provided because some employers have neither claims nor actions taken by authorities, so you won't find any entry for them in the Claims / Actions data.
  • predictSeriousInjuries which takes as arguments:
    • Queries: an array of semicolon-separated “EmployerCuId;Year” tuples. Each line describes one separate query.
    • Data previously described, without static data (all are String[]s, semicolon separated): only entries for the EmployerCuIds that are queried.
Your method needs to return an array of doubles where the i-th element is your best guess for the number of serious injuries of the i-th query.
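
As an illustration of the query format and the required output shape, the sketch below splits each “EmployerCuId;Year” tuple and returns one value per query, in query order. The map ratePerCu (a per-unit estimate computed elsewhere, for example from past years) and the class name are hypothetical; falling back to 0.0 for unknown units is just one possible choice.

```java
import java.util.Map;

class QueryBaseline {
    // Returns one prediction per query, in the same order as the queries.
    // ratePerCu is a hypothetical per-unit estimate prepared by the caller.
    static double[] predict(String[] queries, Map<String, Double> ratePerCu) {
        double[] out = new double[queries.length];
        for (int i = 0; i < queries.length; i++) {
            String[] f = queries[i].split(";", -1);
            String cuId = f[0].trim(); // first field is the EmployerCuId
            // The year (f[1]) is ignored in this sketch.
            out[i] = ratePerCu.getOrDefault(cuId, 0.0);
        }
        return out;
    }
}
```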

Your score for a particular test case will be max{0, 10000-S}, where S is the sum of absolute differences between your predictions and the real numbers of serious injuries for the given query tuples. Your overall score will be a simple average of your test case scores. If your solution fails to return the right number of predictions or crashes, your score for this case will be 0.
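
For local evaluation, the scoring rule can be reproduced in a few lines. This is a sketch of the formula as stated above, not the official tester; the class and method names are hypothetical.

```java
class LocalScore {
    // Score for one test case: max{0, 10000 - S}, with S the sum of absolute errors.
    // A wrong number of predictions scores 0, mirroring the rule for invalid returns.
    static double testCaseScore(double[] predicted, double[] actual) {
        if (predicted == null || predicted.length != actual.length) {
            return 0.0;
        }
        double s = 0.0;
        for (int i = 0; i < actual.length; i++) {
            s += Math.abs(predicted[i] - actual[i]);
        }
        return Math.max(0.0, 10000.0 - s);
    }

    // Overall score: simple average of the per-test-case scores.
    static double overallScore(double[] caseScores) {
        if (caseScores.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double c : caseScores) {
            sum += c;
        }
        return sum / caseScores.length;
    }
}
```

For example, predictions {1, 0, 2} against real values {0, 0, 4} give S = 1 + 0 + 2 = 3 and a test case score of 9997.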

The example test case does not contain actual ‘predictions’: its queries concern the year 2010 (so you can find the answers in the training data). You can download the data used for the example test case along with expected results here.

Provisional and final cases will all be about the year 2011.

Training data was randomly selected from the raw database.

Note that even for training data you won't know the real answers (number of serious injuries in 2011). It is provided so that you have more data to analyze and can find patterns in previous years.

Special conditions

  • You can submit C++, Java, C# and VB solutions. Python solutions are not accepted.
  • You can include open source code in your submission provided that the open source code is clearly identified in the code and is licensed under a license approved by the Open Source Initiative.
  • We reserve the right not to release the data used in provisional and final tests.
  • In order to receive the prize money, you will need to fully document the derivation of all parameters internal to your algorithm. If these parameters were obtained from the training data set, you will also need to provide the program used to generate these training parameters. There is no restriction on the programming language used to generate these training parameters. Note that all this data should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect this data. The description must be submitted within 7 days after the contest results are published.
  • You may not use any external (outside of this competition) source of data to train your solution.


Parameters: String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[]
Method signature: int init(String[] Locations, String[] ClassificationUnits, String[] Employers, String[] EmployersCu, String[] EmployersCuLo, String[] BodyPart, String[] AccidentType, String[] NatureOfInjury, String[] SourceOfInjury, String[] OccupationType, String[] Regulations, String[] EmployersCuListTraining, String[] InspectionOrders, String[] InspectionOrdersRegulations, String[] PreventionCorrectionActions, String[] ActionsRegulations, String[] ActionsRelations, String[] NoticeOfIncident, String[] VsrpActions, String[] HealthSafetyActions, String[] Claims, String[] EmployerAssesment)
Parameters: String[], String[], String[], String[], String[], String[], String[], String[], String[], String[], String[]
Method signature: double[] predictSeriousInjuries(String[] InspectionOrders, String[] InspectionOrdersRegulations, String[] PreventionCorrectionActions, String[] ActionsRegulations, String[] ActionsRelations, String[] NoticeOfIncident, String[] VsrpActions, String[] HealthSafetyActions, String[] Claims, String[] EmployerAssesment, String[] Queries)
(be sure your methods are public)
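
A minimal Java skeleton matching the two signatures above might look as follows. It is a placeholder under stated assumptions: it predicts 0 for every query, and the value returned by init (0 here) is a guess, since the statement does not say how it is used. The class name and the trainingUnits field are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SeriousInjuryPrediction {
    // Units listed in EmployersCuListTraining, kept for use at prediction time.
    private final Set<String> trainingUnits = new HashSet<>();

    public int init(String[] Locations, String[] ClassificationUnits, String[] Employers,
                    String[] EmployersCu, String[] EmployersCuLo, String[] BodyPart,
                    String[] AccidentType, String[] NatureOfInjury, String[] SourceOfInjury,
                    String[] OccupationType, String[] Regulations,
                    String[] EmployersCuListTraining, String[] InspectionOrders,
                    String[] InspectionOrdersRegulations, String[] PreventionCorrectionActions,
                    String[] ActionsRegulations, String[] ActionsRelations,
                    String[] NoticeOfIncident, String[] VsrpActions,
                    String[] HealthSafetyActions, String[] Claims, String[] EmployerAssesment) {
        // Training would happen here; this sketch only records the unit list.
        trainingUnits.addAll(Arrays.asList(EmployersCuListTraining));
        return 0; // assumed return value; its meaning is not specified
    }

    public double[] predictSeriousInjuries(String[] InspectionOrders,
                    String[] InspectionOrdersRegulations, String[] PreventionCorrectionActions,
                    String[] ActionsRegulations, String[] ActionsRelations,
                    String[] NoticeOfIncident, String[] VsrpActions,
                    String[] HealthSafetyActions, String[] Claims,
                    String[] EmployerAssesment, String[] Queries) {
        // One prediction per query, in query order; Java initializes each entry to 0.0.
        double[] predictions = new double[Queries.length];
        return predictions;
    }
}
```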


- You can train your solution online using the training data provided or you can train it offline based on the files included in the problem statement. However, you cannot use any other external data in this contest.
- The time limit is 10 minutes per test case (init + predictSeriousInjuries together) and the memory limit is 1536 MB.
- There is 1 example test case, 3 provisional test cases and at least 7 system test cases.
- The number of queries in provisional and final tests is approximately the same as in the example test case (around 8000).
- There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit code of a size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
- The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.



This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.