JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: Trip Safety
Problem: TripSafetyFactors

Problem Statement

    

Prizes

  1. $7,000
  2. $3,500
  3. $2,000
  4. $1,500
  5. $1,000

Background

Granfalloon Delivery Logistics (GDL) is responsible for scheduling on-going deliveries to customers. Continuously improving safety is always an important objective for GDL. Because safety has been the top priority at GDL, accidents during delivery are extremely rare. In the effort to continuously improve safety even where there are no accidents, GDL uses on board computers (OBC) to monitor properties like speed, acceleration, deceleration, and stability during deliveries and to identify events which might indicate increased risk. For example, a deceleration and/or stability event can indicate that pilots we required to take evasive action, perhaps to avoid a collision or other hazard.

GDL is searching for ways to use its information about past trips and events to help schedule ever safer trips by choosing routes that avoid these kinds of events.

GDL has compiled records of more than 50,000 trips including variables related to route, known route risks, date, time, driver, weather, and traffic. These trips have been correlated with information about events. Under 6% of the trips have safety events associated with them. Under 1% of trips have more than 1 safety event associated with it. Multi-event information for trips is included.

In addition to finding a way to detect trips likely to include safety events, GDL would like to collect relevant information on reasons for selecting various techniques as well as any insights the solution might reveal about the structure and correlations in the data. To complete eligible for the prizes, authors of winning solutions will be required to document their solution, evaluate strengths and weaknesses of their solution, and describe what the solution might reveal about the data.

Data Dictionary

Column name		Description
Id			A unique identifier given to each trip. An individual trip can include loading, traveling to one or more customer sites to deliver, and even returning to reload (see complexity).
			A trip has dates & times, at least a primary pilot, underlying route record, cargo information.
source			Identifier for the location at which the trip began.
dist			The total distance covered by this trip.
cycles			Cycle N means that this trip was the N’th trip scheduled in a batch of trips assigned to a pilot and optionally a copilot.
complexity		A proxy measure for the overall complexity of the trip. Higher is more complicated.
			Roughly, this is the number of “trip segments” where a segment might be related to actual route, or it might relate to loading, unloading, refueling or other procedures in the trip.
cargo			Identifier for the type of cargo being carried on this trip.
stops			Number of distinct delivery stops required for the trip.
start_day		An integer number representing the day on which this trip started.
start_month		An integer number representing the month on which this trip started.
start_day_of_month	The day of the month on which this trip started.
start_day_of_week	An integer 1 thru 7 representing the day of the week on which this trip started.
start_time		Time of day at which trip began. HH:MM, 24 hr clock.
days			Length of this trip in days to two decimal places.
pilot			Identifier for the primary pilot of the vehicle.
pilot2			Identifier for the co-pilot of the vehicle, if there is one.  0 indicates no co-pilot.
pilot_exp		Number indicating the level of experience for the primary pilot.
pilot_visits_prev	Number of time the primary pilot has recently made deliveries using this route; serves as a measure of pilot familiarity with the route.
pilot_hours_prev	The number of hours that the primary pilot has been working on trips in a time period immediately prior to this trip. Might be an indicator of a “cold” pilot low or zero hours or of pilot fatigue for high hours.
pilot_duty_hrs_prev	The number of hours the primary pilot has been actively piloting in a time period prior to this trip. Trips involve activities beyond piloting, such as loading, unloading, resting, etc.
pilot_dist_prev		The distance the primary pilot has accumulated in the past 3 days prior to the start of this trip.
route_risk_1		A proxy measurement for the relative danger of route.  Higher number is higher risk. The number is based on the vicinity of the route, not the exact route.
route_risk_2		A second proxy measurement for the relative danger of route.  Higher number is higher risk. The number is based on the vicinity of the route, not the exact route.
weather			An integer number indicating a worst case weather condition encountered on a trip.
			So, if weather conditions were icy during a small part of the trip and clear during the rest, then weather for this trip is classified as “icy” 
			The integer value represents only a type of weather and is not a ranking of severity. 
visibility		Scoring from 1 to 50 of the average overall visibility throughout the trip to one decimal place.
traf0			Number of travel segments from the trip with unknown congestion.
traf1			Number of very congested trip segments.
traf2			Number of congested trip segments.
traf3			Number of lightly congested trip segments.
traf4			Number of uncongested trip segments.

Each training record includes four event counters for each trip. Solutions are not required to rank by type, only whether

speed_events + decel_events + accel_events + stability_events > 0

Column name		Description
accel_cnt		The number of rapid acceleration events reported by the OBC for this trip.
decel_cnt		The number of rapid deceleration events reported by the OBC for this trip.
speed_cnt		The number of high speed events reported by the On-board Computer (OBC)  for this trip.
stability_cnt		The number of stability warning events reported by the OBC for this trip.
evt_cnt			Total number of events for this trip.  Sum of speed_cnt + decel_cnt + accel_cnt + stability_cnt

A Simplified Example Solution

When given (1) training data that includes information about trips and events and (2) a set of input data that does not include alarm frequencies, then the contestant algorithms should respond with a ranked list of trips from (2) that it determines are most likely to contain alarm events.

An over-simplified example with the silly assumption that “RED traffic” is the only factor correlating to events:

Input (1) -- Training Data

Training Data	Pilot	Traffic		Events
100		10	RED		1
101		10	GREEN		0
102		11	RED		1
103		12 	GREEN		0
104		12	RED		1

Input (2) -- To Be Ranked

Testing Data	Pilot	Traffic
500		10	RED
501		10	GREEN
502		11	GREEN
503		11 	GREEN
504		13	BLUE
505		13	RED

Output -- Sample Response from a Solution

Route	Return Data (Rank)
500	1
501	3
502	4
503	5 
504	6
505	2

Keying off the correlation with “RED” traffic, this solution to the simplified problem has ranked the two red trips before the other trips. The actual order of the RED trips 500 and 505 is arbitrary unless the solution has reason to suspect that one is more likely to have events than the other, in which case the solution should order them accordingly.

Since Pilots 10 and 13 both had “RED” events, a more complicated solution might place both trips 501 and 504 immediately after the RED trips, speculating that pilot might also be a factor in the likelihood of events.

Specification for Solution

Implementation

Your task is to implement a method to rank trips from most likely to contain events to least likely to contain events. The signature of this method is specified in the Definition section below.

Both string[] trainingData and string[] testingData will contain comma delimited strings. Each string describes the information from a particular trip. The order of values and the kind of data in each string is specified in the “Data Dictionary.”

Each string of trainingData will have 34 string values, including the event count for each trip. Each string in the element of testingData will have only 29 string values and will not not contain the number of events.

Your solution method should return a list of ranks assigned to each trip. the list should be the same length as testingData. You response should rank order all trips such that those with the highest safety concern are assigned with a lower rank in the list. All values in the returned array must be unique and the lowest rank should be 1. return[i] is the assigned rank for testingData[i]

Scoring by Relative Ranking:

  • Trips without events are assume to be safe -- detecting any is a false positive.
  • Trips with at least one event are a safety concern -- ranking these is a true positive and will earn points.
  • Trips with more than one event will be an elevated safety concern -- ranking these early enough in the results will earn a small bonus. Solutions that miss or are unable to identify elevated safety concerns but can still rank the trips as safety concerns will earn points, but will miss the bonus.

Let N be the number of trips with at least one event, and let M <= N be the number for trips with more than one event occurred during the trip. Each rank is assigned with a score and a bonus.

scores[i] := MAX( (2N - i) / 2N, 0 ) where scores[i] is the score of rank i - 1
bonuses[i] := MAX( (2M - i) / 2M * 0.3, 0 ) where bonuses[i] is the bonus for rank i - 1

score := 0
for i := 0, 1, ..., sizeOf(yourResult)-1
{
    truePositive := testingData[i] is true positive
    highSafetyConcern := testingData[i] is elevated safety concern

    score := score + 
             truePositive * scores[yourResult[i]-1] + 
             highSafetyConcern * bonuses[yourResult[i]-1] 
}
score := 1000000 * score / max_possible_score

The overall score on a set of test cases is the arithmetic average of scores on single test cases from the set. The match standings displays overall scores on provisional tests for all competitors who have made at least 1 full test submission. The winners are competitors with the highest overall scores on system tests.

Notes on Data Set Generation

  • The full data set contains approximately 57,000 lines
  • The full data set is divided into 32,967 for example tests, 4,856 for provisional tests, and 19,298 for system tests.
  • For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing.
  • For provisional tests, all example data is also added to the training set.
  • For system tests, all example and provisional data is also added to the training set.
 

Definition

    
Class:TripSafetyFactors
Method:predict
Parameters:String[], String[]
Returns:int[]
Method signature:int[] predict(String[] trainingData, String[] testingData)
(be sure your method is public)
    
 

Notes

-The time limit is 5 minutes. The memory limit is 2048 megabytes.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-There are 5 example test cases and 100 full submission (provisional) test cases.
-Example data is here.
-Scoring snippet is here.
 

Examples

0)
    
Seed: 1
1)
    
Seed: 2
2)
    
Seed: 3
3)
    
Seed: 4
4)
    
Seed: 5

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.