JOIN
Get Time
long_comps_topcoder  Problem Statement
Contest: HMS Challenge #4
Problem: BostonRiskPredictor

Problem Statement

    

Problem: Boston Risk Prediction

Overview

In this problem your task is to identify addresses, streets and neighborhoods in Boston that are - at a given time - at risk for events of major public concern, like health emergencies or violent crime. Your predictions will be based on other information contained in the City of Boston's administrative data streams. These data streams include those that update regularly, like citations from Inspection Services (e.g. housing code violations) or public complaints, and those that are more consistent, like geography data or local demographic characteristics. The predictions are evaluated on different spatial and temporal resolutions. It is not realistic to expect that an event's exact time, location and type can be predicted but the more accurate your predictions on lower resolutions are the more useful they become for the City.

Background and motivation

A number of approaches have been taken previously to attack this sort of question, but no best mechanism has been identified. Work done by the sponsors of this contest has used land use and neighborhood characteristics to determine whether a parcel receives a substantially high number of public complaints. Work in New York and Chicago has used reverse-engineering methods to identify the characteristics that are common to places where a major event has occurred.

A good solution to this problem could be a landmark from the perspective of research, policy and practice. For research, it would help the City to better understand the major processes that underlie neighborhood variation in health-related events. For policy, it will give a clearer guide to how the government might make proactive steps to promote public health. If such an approach is implemented at a policy level, it will help health practitioners decide how to focus their efforts, and what warning signs to be vigilant for when working in neighborhoods.

Neighborhood characteristics have long been noted to have direct, causal impacts on individual and community health. Violence, blight, degraded infrastructure and sub-standard housing all affect a neighborhood's ability to thrive and are therefore major public health concerns. Though the presence of such conditions strongly predicts a variety of negative health outcomes in residents, the algorithm solicited here would specify the spatial and temporal patterns that underlie these relationships. This could then inform policy measures that intend to make public services more responsive to risk factors, and more effective in preventing major health crises.

Training data

You will be provided with a number of data files that can be used to train your model. You should keep in mind that this is real data: although lots of effort has been made to improve the data quality you should be prepared to find inconsistencies, errors, missing elements, out of range values. Also you will find redundancy and traces of suboptimal database design. As an example, some of the data tables have two different unique identifier fields, some tables may use one of them as foreign key while different ones may use the other. Handling these problems gracefully is part of the task.

There are two main types of data files: geography data and event data. The former describes the geographic organization of Boston on multiple levels, the latter describes certain types of events happening at a given time and location. Further, there are a couple of small files that list possible enumerated values for certain fields of larger data files.

Each data file contains records one per line, data fields are separated by a single TAB (0x09) character. The first line contains the field names, also TAB separated, other non-empty lines contain data. The following data files are available.

Geography data

Each file describes a level of Boston's geographic organization. Note that there are references to other levels of geography that are not explicitly used in this contest, in particular pointers to Neighborhoods and Neighborhood Statistical Areas appear in several other data files (keyed Nbhd and NSA_NAME, resp). You may choose to ignore these fields, but feel free to use them if you find they add value to your predictions.

Parcels.csv

Parcels are subunits of addresses. Many addresses have only one parcel (e.g. single-family households), but many others have multiple. Note that a Parcel record has two unique identifiers, AddressID and ParcelNum, both are used as foreign keys from other tables.

  • Ref_ID : foreign key to Street_segments
  • X : X coordinate
  • Y : Y coordinate
  • AddressID : primary key. Note that the database contains a couple of rows with identical AddressIDs, such parcels are treated as equal when referenced from other tables.
  • ZipCd : zip code
  • ParcelNum : alternative primary key
  • LandUse : usage of the parcel, see possible values in Land_Use_Code_Table.csv
  • LotSizeSqF : area in square feet
  • ValuationL : land value
  • ValuationB : building value
  • StreetNbrS : street number part of address
  • Address : full address text
  • LocationID : foreign key to Addresses.LocationID

Addresses.csv

  • TotalSqFt : area in square feet
  • TotalLandV : land value (sum of parcel values)
  • TotalBldgV : building value (sum of parcel values)
  • REF_ID : foreign key to street
  • X : X coordinate
  • Y : Y coordinate
  • LandUse : see possible values in Land_Use_Code_Table.csv
  • Address : address text
  • LocationID : primary key
  • BG_ID : foreign key to Census_block_groups
  • Nbhd, NSA_NAME : foreign keys to higher level geographic units.
  • unusedparcels : number of unused parcels
  • usedparcels : number of used parcels
  • parcels : total number of parcels

Street_segments.csv

All street segments within Boston. Beyond the self descriptive fields there are three network connectivity measures available that need description: these are based on betweenness (the likelihood of walking a street when taking the shortest path from one randomly selected location to another) and closeness (the inverse distance of the shortest path between a street from all other streets in a defined region). Betweenness and closeness are based on an 800 m radius, a typical walking distance. These measures are locally correlated (i.e., in a very connected street grid with many short streets, all streets have higher betweenness scores than a suburban style street grid with cul-de-sacs), so there is also a centered measure (local_b) that compares a street's betweenness to the other street segments in the same Census block group. See here for more information on connectivity measures.

  • Ref_ID : primary key
  • LENGTH : length of segment in meters
  • CFCC : Census designation of road types, see possible values here
  • BG_ID : foreign key to Census_block_groups. Note that the nesting of streets into Census block groups is not perfect, addresses that belong to a street may belong to more than one Census block group. In this case the BG_ID field references one of these groups.
  • b : betweenness, see above
  • c : closeness, see above
  • local_b : see above
  • Main : "1" or "0" for Main vs. non-Main street
  • RoadType : predominant zoning type: one of "Res", "Com", "Ind", "Exe", "Emp" (for Residential, Commercial, Industrial, Exempt and Empty, respectively)
  • count : not in use
  • deadend : "1" or "0"

Intersections.csv

Intersections are points at which two or more street segments meet where at least two of the street segments have different names. Intersections are attributed to the street whose zoning will have the greatest impact on its character: Main > non-Main; Residential > Commercial > Industrial > Exempt > Empty. These characteristics are included in the Intersections database, but are in fact attributes of the street.

  • Description : address text
  • Main : same as for Street_segments
  • RoadType : same as for Street_segments
  • Ref_ID : foreign key to Street_segments
  • BG_ID : foreign key to Census_block_groups
  • X : X coordinate
  • Y : Y coordinate
  • propID : primary key

Census_Block_Groups.csv

Census data are based on Census 2000, as the data used is from the ACS 2005-2009 estimates. Census block groups are aggregations of blocks intended to have ~1000 residents (though this varies considerably).

  • BG_ID : primary key
  • Homeownership : the percentage of homeowners
  • MedIncomeHouse : median household income
  • propwhite : proportion of white people
  • propblack : proportion of black people
  • propasian : proportion of asian people
  • prophisp : proportion of hispanic people
  • medage : median age
  • propcollege : proportion of college students. Note that this is measured on a higher geography level and can be taken only as an estimate on this level.
  • TotalPop : total population
  • popden : population density

Aggregated locations files

The parcels and intersections that belong to a given Census Block Group are grouped by street segments and collected in files titled loc-[BG_ID].csv. This is for your convenience only as you are likely to need these location lists during local testing, the content of these files could be recreated by yourself from the geography files listed previously.

Event data

Event data files are provided either as a single file or as a batch of files, where each file contains events occurring in a given Census Block Group in a given month of 2011. In the latter case the file name contains the ID of the Census Block Group and the ordinal of the month (1 = January, 2 = February, etc), for example the CRM events from March originating from the Census Block Group of ID = 250250001001 are found in crm-250250001001-3.csv. Additionally for certain types of events data earlier than 2011 is available in a single file (not distributed by Census Block Group and date) titled [event-type]-PAST.csv.

crm-[BG_ID]-[month].csv

The Constituent Relationship Management (CRM) system accepts, organizes, and communicates all requests for city services. The nature of the request (Type field) has been categorized into several higher level issue types, see CRM_Case_Type_Code_Table.csv for the mapping that describes such classifications with 1/0-valued variables like "Public" issues (e.g. pothole) and, more importantly, human-generated "Problem"s (e.g. abandoned vehicle). This type mapping is for your convenience, you may want to ignore this or want to do your own classification differently.

The fields of the CRM table are self descriptive with the exception of the propID field.

  • propID : foreign key to either Parcels.AddressID or Intersections.propID. Address-based requests start with an 'A' and are followed by what is AddressID in the Parcels file, intersection-based requests start with an 'I' and refer to a propID in the Intersections file.

violations-[BG_ID]-[month].csv

This is the database of all inspections violations in the city. The DESCRIPT field is standardized and can be used to categorize the event types. Also, note that those cases whose CaseNumber begins with 'CE' are public violations (rather than food code violations, etc.).

  • AddressID : foreign key to Parcels.AddressID

Foreclosures and Related

Foreclosure deeds, petitions, Real Estate Owned homes, and "properties in distress" are documented in these files.

  1. Foreclosure_Deeds.csv
  2. Foreclosure_Distressed_Props.csv
  3. Foreclosure_Petitions.csv
  4. Foreclosure_REOs.csv
  • ParcelID : foreign key to Parcels.ParcelNum. Note that unlike foreign keys to Parcels from other data sets this does not link to the AddressID but to the ParcelNum.

permits-[BG_ID]-[month].csv

These are all permits received for building modifications, including electric, gas, plumbing, and other improvements. The presence of permits indicates that someone is actively involved in the building's upkeep.

  • Block : foreign key to Parcels.AddressID

emergencies-[BG_ID]-[month].csv

These are data are from Emergency (911) dispatch reports. The nature of the emergency (contained in the final_type field) has been categorized into 3 higher level emergency types: social disorders, medical emergencies and violent crimes, see Emergency_Type_Code_Table.csv for the mapping that describes these classifications with the following 1/0-valued variables 'Disorder', 'MedEmerg', 'Violent'. This categorization is important as your task will be to predict events belonging to these event categores. Priority also gives some indication as to the severity of an event.

  • propID : foreign key to either Parcels.AddressID or Intersections.propID, similarly as in CRM files.

Event_counts.csv

Daily emergency event counts in 2011 per event type. These data are provided to account for the absence of several years worth of data that could be used to calculate global seasonal trends. The values are calculated using an undisclosed smoothing algorithm.

Implementation, Testing and Scoring

Your program must implement 3 methods: setup, receiveEvents and predict. You'll be given the exact contents of most of the available data files as arguments for these methods. This means that the arguments 'parcels', 'addresses', 'streetSegments', 'emergencies', etc will represent the appropriate files. Each input argument is an array of Strings, each element of the array contains one line of a file in the same TAB-separated format as described previously. In case the file contains a header in the first line, it will be removed for your convenience.

The setup method provides static geography data plus the data described under "Foreclosure and related" above. All details of parcels, addresses, street segments, intersections, census block groups and the 4 kinds of foreclosures related data are given, however, you may chose to store only a subset of the records or of the fields in your program. The method returns an integer to indicate success, the exact return value is insignificant. The setup method will be called only once on an instance of your class.

The receiveEvents method provides event data for a given Census Block Group, one month at a time. It takes a String representing a Census Block Group ID, an integer representing a month within the year 2011 (January = 1, February = 2, etc) and several arrays of Strings representing event data. All details of CRM events, inspection violations, permits and emergencies are given, however, you may chose to store only a subset of the records or of the fields in your program. The method returns an integer to indicate success, the exact return value is insignificant. The receiveEvents method will be called multiple times on an instance of your class, see below for details on the calling sequence.

The predict method takes a Census Block Group ID and a month, similarly as specified for the receiveEvents method. Additionally it takes a String array of locations that contain the identifiers of all parcels and intersections within the given Census Block Group in the same format as used in the propID field of the CRM and Emergencies data sets. The method should return your predictions in an array of doubles, according to these rules:

  • Let L be the number of locations given to the method and let D be the number of days in the current month (30 or 31). The return array must contain L * D * 4 elements.
  • Each element is interpreted as the expected number of emergency events of a given type occurring on a given day at a given location. Note that in theory it is possible to see more than one events on the same day at the same place, so numbers higher than 1 are allowed. In practice however, this happens rarely, so the term "expected number" can be thought of as meaning "probability".
  • For a given location index (i = 0, 1, ..., L-1) and day of month (d = 0, 1, ..., D-1) let p = i*D*4 + d*4. The predictions for the location given as the i-th element of the locations array should be placed in the return array at positions p, p+1, p+2 and p+3, representing the expected number of social disorders, medical emergencies, violent crimes, and all other types, respectively.

The predict method will be called multiple times on an instance of your class.

An instance of your class interacts with events occurring only within a single Census Block Group. The full sequence of calls is outlined below:

  1. setup is called with all static geography data plus foreclosures and related data up to August 2011
  2. receiveEvents is called with all event data from January 2011 for the given Census Block Group
  3. receiveEvents is called with all event data from February
  4. ...
  5. receiveEvents is called with all event data from August
  6. predict is called to make predictions for September
  7. receiveEvents is called to update you with events from September
  8. predict is called to make predictions for October
  9. receiveEvents is called to update you with events from October
  10. ...
  11. predict is called to make predictions for December

The scoring works as follows. If the system is not able to get return from your solution or the return is formatted improperly, the score for this test case is 0. Otherwise, each predicted value is set to 0 if it is less than 0 and set to 10 if it is larger than 10. Then for each predicted month (month = 9,...,12), several partial scores are calculated on different resolutions of time, space and event types. These partial scores represent the difference between the predicted and actual distribution of events in the given month. These differences are calculated as weighted averages of RMSE values, where the weights used depend on the given time/space/event resolution. For scoring the inverse values of these weighted differences will be used. The score for a test case is calculated as the average of these partial scores on the 4 predicted months. Finally, the overall score on a group of tests is the arithmetic average of scores on all test cases from the group.

A more formal definition is given below. For the authoritative specification of the scoring algorithm see the scorer implementation.

Let TimeSpanSet be a set of these time spans: {1 day, 3 days, 1 week, 2 weeks, 1 month}
Let SpaceSet be a set of these location units: {'parcel', 'street', 'CensusBlockGroup'}
Let EventTypeSet contain these event types: {'Disorder', 'MedEmerg', 'Violent', 'Other'}

Let L_time(ts) be the lenght of timespan ts in days.
Let N_time(d) be the number of different consecutive time spans of length d within a month. 
For example in a 30-day month N_time(1 day) = 30, N_time(3 days) = 28, N_time(1 week) = 24.
Let wEvent(et) be 8 if et = 'Disorder'; 6 if et = 'MedEmerg'; 20 if et = 'Violent';  1 if et = 'Other'

Let getMseInCubicle(month, locationSet, predictions, expectedValues) be a function that calculates the 
average squared difference between predicted and actual values for a given month and a set of 
locations as such:
let L be the number of elements in locationSet
let sumError = 0
let sumWeight = 0
for each event type et in EventTypeSet
    for each timeSpan ts in TimeSpanSet
        for each possible time window tw of length ts within the month
            let P be the sum of predicted events within the time-space-event cubicle defined by tw, 
                locationSet and et
            let A be the sum of actual events within the same time-space-event cubicle
            let P = P / L / L_time(ts)
            let A = A / L / L_time(ts)
            let error = (P - A) * (P - A)
            let weight = L * wEvent(et) / N_time(ts)
            let sumError = sumError + error * weight
            let sumWeight = sumWeight + weight
let MSE = sumError / sumWeight
return MSE

Let errorToScore(error, eventCount) be a function that converts an RMSE value measured on a given 
    number of events to a score as such:
if (eventCount == 0) return 0,
otherwise return 1 / (1e-5 + error / eventCount).

With these definitions the score for a test case is calculated as such:
let sumScore = 0
for each month from 9 to 12
    let sumMonthlyScore = 0
    let eventCount be the number of actual events in this month
    for each location unit lu in SpaceSet
        let sumError = 0
        for each locationSet of granularity lu within the Census Block Group (*)
            let MSE = getMseInCubicle(month, locationSet, predictions, expectedValues)
            sumError = sumError + MSE
        let L be the number of locationSets of granularity lu within the Census Block Group 
        let RMSE = sqrt(sumError / L)
        let score = errorToScore(RMSE, eventCount)
        let sumMonthlyScore = sumMonthlyScore + score
    let averageMonthlyScore = sumMonthlyScore / 3 //number of location units
    let sumScore = sumScore + averageMonthlyScore
let scoreForTestCase = sumScore / 4 //number of months
return scoreForTestCase

Notes:

  • Parcel level predictions actually mean a combination of parcels and intersections.
  • The iteration marked with (*) above visits
    • each parcel and intersection one by one if lu is 'parcel'. In this case the locationSet within the loop contains a single parcel or intersection.
    • each street segment one by one if lu is 'street'. In this case the locationSet within the loop contains the parcels and intersections found on the given street segment.
    • alls parcels and intersections in a single step if lu is 'CensusBlockGroup'. In this case the locationSet within the loop contains all the parcels and intersections of the Census Block Group.
  • When aggregating parcel level predictions to the street level each intersection is assigned to one streetSegment as defined by the Intersections.Ref_ID field.

There will be 5 example, 95 provisional and 444 final test cases. Each test case corresponds to one Census Block Group. Boston's 544 Census Block Groups were distibuted randomly without any bias to example, provisional testing and final testing data sets.

Special conditions

You can include open source code in your submission provided that it is clearly identified in your code. Any OSI-approved licenses are allowed as listed here.

The solutions can be submitted in the following programming languages: C++, Java, C#.NET, or VB.NET. Python submissions are not allowed.

In order to receive the prize money, you will need to fully document the derivation of all parameters internal to your algorithm. If these parameters were obtained from the training data set, you will also need to provide the program used to generate these training parameters. There is no restriction on the programming language used to generate these training parameters. Note that all these data should NOT be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect these data. The description must be submitted within 7 days after the contest results are published.

You MUST NOT use any external (outside of this competition) source of data to train your solution.

 

Definition

    
Class:BostonRiskPredictor
Method:setup
Parameters:String[], String[], String[], String[], String[], String[], String[], String[], String[]
Returns:int
Method signature:int setup(String[] parcels, String[] addresses, String[] streetSegments, String[] intersections, String[] censusBlockGroups, String[] foreclosureDeeds, String[] foreclosureDistressedProps, String[] foreclosurePetitions, String[] foreclosureREOs)
 
Method:receiveEvents
Parameters:String, int, String[], String[], String[], String[]
Returns:int
Method signature:int receiveEvents(String BG_ID, int month, String[] CRMs, String[] violations, String[] permits, String[] emergencies)
 
Method:predict
Parameters:String, int, String[]
Returns:double[]
Method signature:double[] predict(String BG_ID, int month, String[] locations)
(be sure your methods are public)
    
 

Notes

-To facilitate the development of your model and allow you to test offline, you may download two sets of training data.
  • Set 1 contains all the static geography data and all event data from Boston's 544 Census Block Groups up to the end of August, 2011. Each line represents one record, in the same format as it will be given to the setup and receiveEvents methods. Note: the Event_counts.csv within the data package is deprecated, this file should be used instead.
  • Set 2 contains all event data for the 5 example test cases, including the time period that you have to predict (September - December, 2011).
-An instance of your class interacts with events occurring within a single Census Block Group.
-You can train your solution offline based on the given files and you can hardcode data into your solution - just remember that you are not allowed to use data from other sources than this contest.
-The time limit is 30 seconds per test case, the memory limit is 1024MB.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-You may find it useful to think about different ways of how emergencies are following each other or other types of events. We assume that a successful model will combine elements of different approaches. A couple of ideas to consider:
  • a specific profile of events at a place precede other events at this place in a predictable fashion (same-level approach)
  • characteristics of a region, interacting with a profile of events there, precede other events at locations within the region, and some locations are more at risk than others for said events (top-down approach)
  • a profile of events at a single location (say, a parcel or a street) predicting not only further events at that geographical unit, but in the surrounding area, captured by a larger level of geography (bottom-up approach)
-You may find it useful to think about different ways of how emergencies are following each other or other types of events. We assume that a successful model will combine elements of different approaches. A couple of ideas to consider:
  • a specific profile of events at a place precede other events at this place in a predictable fashion (same-level approach)
  • characteristics of a region, interacting with a profile of events there, precede other events at locations within the region, and some locations are more at risk than others for said events (top-down approach)
  • a profile of events at a single location (say, a parcel or a street) predicting not only further events at that geographical unit, but in the surrounding area, captured by a larger level of geography (bottom-up approach)
 

Examples

0)
    
Census block group ID = 250250201003
1)
    
Census block group ID = 250250906003
2)
    
Census block group ID = 250251205002
3)
    
Census block group ID = 250251008005
4)
    
Census block group ID = 250251001006

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.