Granfalloon Delivery Logistics (GDL) is responsible for scheduling on-going deliveries to customers. Continuously improving safety is always an important objective for GDL. Because safety has been the top priority at GDL, accidents during delivery are extremely rare. In the effort to continuously improve safety even where there are no accidents, GDL uses on board computers (OBC) to monitor properties like speed, acceleration, deceleration, and stability during deliveries and to identify events which might indicate increased risk. For example, a deceleration and/or stability event can indicate that pilots we required to take evasive action, perhaps to avoid a collision or other hazard.
GDL is searching for ways to use its information about past trips and events to help schedule ever safer trips by choosing routes that avoid these kinds of events.
GDL has compiled records of more than 50,000 trips including variables related to route, known route risks, date, time, driver, weather, and traffic. These trips have been correlated with information about events. Under 6% of the trips have safety events associated with them. Under 1% of trips have more than 1 safety event associated with it. Multi-event information for trips is included.
In addition to finding a way to detect trips likely to include safety events, GDL would like to collect relevant information on reasons for selecting various techniques as well as any insights the solution might reveal about the structure and correlations in the data. To complete eligible for the prizes, authors of winning solutions will be required to document their solution, evaluate strengths and weaknesses of their solution, and describe what the solution might reveal about the data.
Column name Description
Id A unique identifier given to each trip. An individual trip can include loading, traveling to one or more customer sites to deliver, and even returning to reload (see complexity).
A trip has dates & times, at least a primary pilot, underlying route record, cargo information.
source Identifier for the location at which the trip began.
dist The total distance covered by this trip.
cycles Cycle N means that this trip was the N’th trip scheduled in a batch of trips assigned to a pilot and optionally a copilot.
complexity A proxy measure for the overall complexity of the trip. Higher is more complicated.
Roughly, this is the number of “trip segments” where a segment might be related to actual route, or it might relate to loading, unloading, refueling or other procedures in the trip.
cargo Identifier for the type of cargo being carried on this trip.
stops Number of distinct delivery stops required for the trip.
start_day An integer number representing the day on which this trip started.
start_month An integer number representing the month on which this trip started.
start_day_of_month The day of the month on which this trip started.
start_day_of_week An integer 1 thru 7 representing the day of the week on which this trip started.
start_time Time of day at which trip began. HH:MM, 24 hr clock.
days Length of this trip in days to two decimal places.
pilot Identifier for the primary pilot of the vehicle.
pilot2 Identifier for the co-pilot of the vehicle, if there is one. 0 indicates no co-pilot.
pilot_exp Number indicating the level of experience for the primary pilot.
pilot_visits_prev Number of time the primary pilot has recently made deliveries using this route; serves as a measure of pilot familiarity with the route.
pilot_hours_prev The number of hours that the primary pilot has been working on trips in a time period immediately prior to this trip. Might be an indicator of a “cold” pilot low or zero hours or of pilot fatigue for high hours.
pilot_duty_hrs_prev The number of hours the primary pilot has been actively piloting in a time period prior to this trip. Trips involve activities beyond piloting, such as loading, unloading, resting, etc.
pilot_dist_prev The distance the primary pilot has accumulated in the past 3 days prior to the start of this trip.
route_risk_1 A proxy measurement for the relative danger of route. Higher number is higher risk. The number is based on the vicinity of the route, not the exact route.
route_risk_2 A second proxy measurement for the relative danger of route. Higher number is higher risk. The number is based on the vicinity of the route, not the exact route.
weather An integer number indicating a worst case weather condition encountered on a trip.
So, if weather conditions were icy during a small part of the trip and clear during the rest, then weather for this trip is classified as “icy”
The integer value represents only a type of weather and is not a ranking of severity.
visibility Scoring from 1 to 50 of the average overall visibility throughout the trip to one decimal place.
traf0 Number of travel segments from the trip with unknown congestion.
traf1 Number of very congested trip segments.
traf2 Number of congested trip segments.
traf3 Number of lightly congested trip segments.
traf4 Number of uncongested trip segments.
Each training record includes four event counters for each trip. Solutions are not required to rank by type, only whether
speed_events + decel_events + accel_events + stability_events > 0
Column name Description
accel_cnt The number of rapid acceleration events reported by the OBC for this trip.
decel_cnt The number of rapid deceleration events reported by the OBC for this trip.
speed_cnt The number of high speed events reported by the On-board Computer (OBC) for this trip.
stability_cnt The number of stability warning events reported by the OBC for this trip.
evt_cnt Total number of events for this trip. Sum of speed_cnt + decel_cnt + accel_cnt + stability_cnt
A Simplified Example Solution
When given (1) training data that includes information about trips and events and (2) a set of input data that does not include alarm frequencies, then the contestant algorithms should respond with a ranked list of trips from (2) that it determines are most likely to contain alarm events.
An over-simplified example with the silly assumption that “RED traffic” is the only factor correlating to events:
Input (1) -- Training Data
Training Data Pilot Traffic Events
100 10 RED 1
101 10 GREEN 0
102 11 RED 1
103 12 GREEN 0
104 12 RED 1
Input (2) -- To Be Ranked
Testing Data Pilot Traffic
500 10 RED
501 10 GREEN
502 11 GREEN
503 11 GREEN
504 13 BLUE
505 13 RED
Output -- Sample Response from a Solution
Route Return Data (Rank)
Keying off the correlation with “RED” traffic, this solution to the simplified problem has ranked the two red trips before the other trips. The actual order of the RED trips 500 and 505 is arbitrary unless the solution has reason to suspect that one is more likely to have events than the other, in which case the solution should order them accordingly.
Since Pilots 10 and 13 both had “RED” events, a more complicated solution might place both trips 501 and 504 immediately after the RED trips, speculating that pilot might also be a factor in the likelihood of events.
Specification for Solution
Your task is to implement a method to rank trips from most likely to contain events to least likely to contain events. The signature of this method is specified in the Definition section below.
Both string trainingData and string testingData will contain comma delimited strings. Each string describes the information from a particular trip. The order of values and the kind of data in each string is specified in the “Data Dictionary.”
Each string of trainingData will have 34 string values, including the event count for each trip. Each string in the element of testingData will have only 29 string values and will not not contain the number of events.
Your solution method should return a list of ranks assigned to each trip. the list should be the same length as testingData. You response should rank order all trips such that those with the highest safety concern are assigned with a lower rank in the list. All values in the returned array must be unique and the lowest rank should be 1. return[i] is the assigned rank for testingData[i]
Scoring by Relative Ranking:
- Trips without events are assume to be safe -- detecting any is a false positive.
- Trips with at least one event are a safety concern -- ranking these is a true positive and will earn points.
- Trips with more than one event will be an elevated safety concern -- ranking these early enough in the results will earn a small bonus. Solutions that miss or are unable to identify elevated safety concerns but can still rank the trips as safety concerns will earn points, but will miss the bonus.
Let N be the number of trips with at least one event, and let M <= N be the number for trips with more than one event occurred during the trip. Each rank is assigned with a score and a bonus.
scores[i] := MAX( (2N - i) / 2N, 0 ) where scores[i] is the score of rank i - 1
bonuses[i] := MAX( (2M - i) / 2M * 0.3, 0 ) where bonuses[i] is the bonus for rank i - 1
score := 0
for i := 0, 1, ..., sizeOf(yourResult)-1
truePositive := testingData[i] is true positive
highSafetyConcern := testingData[i] is elevated safety concern
score := score +
truePositive * scores[yourResult[i]-1] +
highSafetyConcern * bonuses[yourResult[i]-1]
score := 1000000 * score / max_possible_score
The overall score on a set of test cases is the arithmetic average of scores on single test cases from the set. The match standings displays overall scores on provisional tests for all competitors who have made at least 1 full test submission. The winners are competitors with the highest overall scores on system tests.
Notes on Data Set Generation
- The full data set contains approximately 57,000 lines
- The full data set is divided into 32,967 for example tests, 4,856 for provisional tests, and 19,298 for system tests.
- For each test, approximately 66% of the data (from that segment) is selected for training, and the remainder for testing.
- For provisional tests, all example data is also added to the training set.
- For system tests, all example and provisional data is also added to the training set.