This contest is brought to you by: NASA's Center of Excellence for Collaborative Innovation and Harvard Business School in association with the Institute of Quantitative Social Sciences.
Mass atrocities are large-scale attacks against innocent civilians, offending our morality and threatening global security. President Barack Obama has directed U.S. government agencies including the United States Agency for International Development (USAID) to develop methods to detect and prevent mass atrocities. As part of its response to the presidential directive, USAID is partnering with Humanity United and TopCoder to build a computational model for predicting atrocities. We are challenging you to participate in this important humanitarian initiative.
The task you will need to solve in this contest is the following: given data about various sociopolitical activities around the world and information about past atrocities, develop an algorithm to predict where and when atrocities will happen in the near future (within the next 30 days).
There is a total $25,000 USD cash prize purse for this contest. The contestants with 5 highest final scores will be awarded the following prizes:
1st place $12,000 USD
2nd place $6,000 USD
3rd place $3,000 USD
4th place $2,000 USD
5th place $1,000 USD
TopCoder may *offer* to purchase submissions that did not win any prize if the client is interested in using them.
Additionally, there is a $1,000 USD prize for an unusual and perspective idea. Only contestants with 6th and lower final scores are eligible for this prize. The idea must be described and submitted in a text document. You do not need to implement your idea (though if it is incorporated into your marathon match final submission, it can be a plus). The submission phase for ideas is separate from the marathon match and will open once the match is over. Additional details will be communicated at the match forum. Evaluation of ideas is subjective, but the main evaluation criteria for the winning idea is: can (and how much) the Top-5 marathon match algorithms benefit from this idea? TopCoder reserves the right not to award this prize if no ideas are considered to be useful enough.
Test case structure and implementation notes
This problem uses the concept of a day ID. This is the chronological number of a day counting from some fixed date in the past.
The data used in this contest starts at the day 11284. There is 1 example, 1 provisional, and 1 system test case. Each test case has a learning period and a testing period:
Test case start-end day IDs of the learning period start-end day IDs of the testing period
Example 11284-15301 (44,904,851 lines*) 15302-15636 ( 6,345,731 lines)
Provisional 11284-15666 (51,250,582 lines) 15667-16367 (25,574,119 lines)
System 11284-16397 (76,824,701 lines) 16398-17644 (92,738,537 lines)
* the total amount of lines of sociopolitical activities data
During the learning period, you are only receiving the data. During the testing period, you are receiving the data and predicting the atrocities.
In order to supply any data to your code, the testing program will call the method receiveData. Its parameters have the following meaning:
- int dataSourceId. This is the ID of the data source from which the data comes. There are three possibilities for this parameter. dataSourceId = 0 means information about atrocity events that happened in the current day. dataSourceId = 1 corresponds to information about sociopolitical activities. Finally, dataSourceId = 2 means geographical data (supplied only at the first day of the learning period). More details about all these kinds of the data can be found below.
- int dayID. This is the ID of the current day.
- String data. Contains the whole data coming from the given data source for the given day.
You can return any value from the receiveData method. It does not have any meaning and will not be analyzed.
In this problem we assume that the entire land of the Earth is divided into 254 countries. Each country is further subdivided into one or more regions (overall, there are 3671 regions). The data about countries and regions is taken from this website (countries are called "admin-0 units" and regions are called "admin-1 units"). For simplicity, we assign a unique integer ID number to each country and to each region (from 0 to 253 for countries and from 0 to 3670 for regions).
Each region is represented as one or more spherical polygons. Each polygon is either outer or inner. The region is defined as the set of the locations which lie within at least one of outer polygons and do not lie within any of inner polygons.
This file contains the entire geographical data used in this problem. It contains 3671 region descriptions where each description is formatted as follows:
[one or more polygon description]
The description of each polygon consists of two lines. The first line contains one of the words "outer" or "inner". The second line consists of space-separated coordinates of polygon's corners. Each corner is formatted as "LONGITUDE,LATITUDE" where LATITUDE and LONGITUDE are the real numbers being the geographical coordinates of a polygon's corner.
Exactly the same data will also be passed to your solution during the call to receiveData with dataSourceId = 2. Only one such a call will be made (at the first day of the learning period). Each element of String data will contain a single line from the file.
Past atrocities data
As mentioned above, the information about atrocities that happened on the current day will be passed into your program by calling the receiveData method with dataSourceId = 0. In this case, each element of data will correspond to a single atrocity occurrence. It will be formatted as "LATITUDE LONGITUDE COUNTRY_ID REGION_ID", where LATITUDE and LONGITUDE describe the latitude and longitude of the event's location, COUNTRY_ID and REGION_ID are the IDs of country and region of atrocity's location. The format of both latitude and longitude is the following: "[+/-]<Degrees>d<Minutes>m<Seconds>s", where Degrees, Minutes and Seconds are components of the coordinate. Minutes and Seconds are always printed with exactly 2 digits (a leading zero is used if necessary), while Degrees does not have leading zeros. In the rest of the problem statement, we will refer to this format as DMS format. Examples: "-123d05m46s", "+0d40m00s".
NOTE: everywhere in this problem, if a certain location does not belong to any of the regions/countries, it will be assigned to a region/country nearby (in vast majority of cases, it will be the region/country closest to the given location).
Predicting atrocity events
Each time the testing program wants you to predict atrocities that will happen in the near future, it will call your predictAtrocities method. Its parameter dayID describes the current day ID. Your task in this method is as follows: for each region, estimate how likely it is that at least one atrocity event will happen in this region in the period from day dayID+1 to day dayID+30, inclusive.
The return value must contain exactly 3671 elements, each being between 0.0 and 1.0, inclusive. The i-th element represents your confidence about having at least one event in i-th region in the period designated above. A confidence of 0.0 means that you're absolutely sure that no event will happen. A confidence of 1.0 means that you're absolutely sure that at least one event will happen. See the "Scoring" section for more information about how these confidence values are interpreted.
Overall, the interaction between the testing program and your algorithm can be described with the following pseudocode:
score := 0
For each day in learning period + testing period (in chronological order):
If the current day is the first day of the learning period:
Pass geographical data to the algorithm through the receiveData method with dataSourceId = 2.
Pass sociopolitical activities data to the algorithm through the receiveData method with dataSourceId = 1.
Pass the information about atrocities occured in the current day through the receiveData method with dataSourceId = 0.
If the current day is within the testing period:
Ask the algorithm to estimate confidence for each region via the call to predictAtrocities method.
Adjust score based on the returned confidences and real atrocity event occurrences (as described in the "Scoring" section).
Training data description
For training purposes you can use all sociopolitical activities data available for day IDs 11284 through 15666, inclusive. It can be downloaded here. The data for day D is located in the file named D.txt.
Each line of the file describes an action that took place somewhere in the world. For each action, up to two participants are identified, as well as the action location and several sociopolitical attributes. There are 18 fields in total, separated by single spaces. The fields are described below.
1. Participant 1 ID
A participant can be a person, corporation, government, or any kind of organization. Participants are identified by an arbitrary ID number. All instances of the given ID number refer to the same participant.
2. Participant 1 location precision
The location of a participant is identified at one of three precision levels, each denoted with an integer from 1 to 3. If the location is unknown, an underscore character ('_') appears here.
- 1 = action identified in a country
- 2 = action identified in a region
- 3 = action identified in a town
3. Participant 1 location description
If the location is unknown, an underscore character ('_') appears here. Otherwise, the meaning of this field depends on the location precision, as follows.
- 1: the country's centroid coordinates formatted as "LATITUDE,LONGITUDE", where both LATITUDE and LONGITUDE are in DMS format
- 2: the region's centroid coordinates (in the same format as above)
- 3: the town's centroid coordinates (in the same format as above)
4. Participant 1 country ID
The ID of a country where the place given by field 3 is located. If the location is unknown, an underscore character ('_') appears here.
5. Participant 1 region ID
The ID of a region where the place given by field 3 is located. If the location is unknown or if location precision is 1 (country), an underscore character ('_') appears here.
6. Participant 2 ID
In each action, the number of identified participants can be zero, one, or two. The underscore character ('_') appears when no participant is identified. If only one participant is identified, it can appear as either Participant 1 or Participant 2.
The next four fields describe the location of Participant 2 in the same manner as above.
7. Participant 2 location precision
8. Participant 2 location description
9. Participant 2 country ID
10. Participant 2 region ID
11. Action type
There are twenty action types, each identified by a lowercase letter from 'a' to 't'.
The next four fields describe the location of the action in the same manner as for the participants.
12. Action location precision
13. Action location description
14. Action country ID
15. Action region ID
The value of this field is 't' or 'f'. If it is 't', the action is considered to be important. 'f' means that it is not considered being important.
17. Media coverage
The amount of media coverage received by the action is rated on a scale of 1 (least covered) to 100 (most covered).
18. Media sentiment
The sentiment of the media coverage is assessed on a scale of 0 to 50. A value of 0 means that the coverage was neutral. A score of 50 means that the coverage was very positive.
Data about atrocities that happened in the same period is also available. Each line is formatted as "DAY LATITUDE LONGITUDE COUNTRY_ID REGION_ID" where DAY is the day ID corresponding to the date of the atrocity, LATITUDE and LONGITUDE are the coordinates (in DMS format) of the location of the atrocity, COUNTRY_ID and REGION_ID are the IDs of country and region of atrocity's location.
For both provisional and system testing, there is just 1 test case and your total score is equal to the score you receive on this test case. For each (dayID from the testing period, region) pair, let CONF be the confidence that you returned. Consider all atrocity events that occurred at the given region at days chronologically preceding dayID (including dayID). Let D be the latest of these days. We define WGH as tanh((dayID - D + 10) / 180) (if no atrocity events occurred at the given region so far, WGH is set to 1). If at least one atrocity event occurred at the given region in days dayID+1 to dayID+30 (inclusive), then your score is increased by WGH * (CONF - CONF * CONF / 2). Otherwise, if no atrocity events occurred, your score is decreased by WGH * CONF * CONF / 2.
Once all (dayID, region) pairs are considered, if your score is negative, it is reset to 0. You also get a score of 0 if there are any problems with your solution (time/memory limit exceeded, crash, invalid return value, etc.).
- It is strictly forbidden to use any data other than the data that we provide to you in this competition.
- You are required to use the same approach to estimate the confidence for each (dayID, region) pair. It is forbidden to have any kind of special judgment for any selected pairs.
- You are required to process all participants (in sociopolitical data) in the same fashion. It is forbidden to havy any kind of special processing for any selected participants.
- You can include open source code in your submission provided that the open source code must be licensed under an acceptable license, separated from and clearly identified in your code, and your submission in this competition must be in compliance with the license. An open source license is acceptable if:
The question of whether a license is acceptable, and particularly whether a license that is not OSI-approved meets the requirements of the Open Source definition, is complex and will be decided by TopCoder in TopCoder’s sole discretion, and so we strongly recommend that, if you use open source code, you use code that is clearly licensed under an OSI-approved license. Questions about specific open source licenses may be asked in the forums, but given the complexities of license evaluation, is it likely that TopCoder will not be able to respond during the contest.
- it is an OSI-approved open source license (listed at http://opensource.org/licenses), or
- it is a license that meets the requirements of the OSI Open Source definition (http://opensource.org/osd), or
- it is an explicit and clear dedication by the author(s) to the public domain.
- The solution can be implemented in C++, Java, C# or VB. Python submissions are not accepted.
- In order to receive the prize money, you will need to fully document how your algorithm works. Note that this description should NOT be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect these data. The description must be submitted within 7 days after the contest results are published.