Contest: Atrocity MM
Problem: MassAtrocityPredictor2016

Problem Statement

    

Mass atrocities are large-scale attacks against innocent civilians, offending our morality and threatening global security. We are challenging you to participate in this important humanitarian initiative.

The task you will need to solve in this contest is the following: given data about various sociopolitical activities around the world and information about past atrocities, develop an algorithm to predict where and when atrocities will happen in the near future (within the next 30 days).

 

Prizes

There is a total cash prize purse of $15,000 USD for this contest. The contestants with the 5 highest final scores will be awarded the following prizes:

1st place	 $7,000 USD
2nd place	 $3,000 USD
3rd place	 $2,500 USD
4th place	 $1,500 USD
5th place	 $1,000 USD

TopCoder may *offer* to purchase submissions that did not win any prize if the client is interested in using them.

 

Test case structure and implementation notes

This problem uses the concept of a day ID. This is the chronological number of a day counting from some fixed date in the past.

The data used in this contest starts at day 1. There is 1 example, 1 provisional, and 1 system test case. Each test case has a learning period and a testing period:

Test case	start-end day IDs of the learning period	start-end day IDs of the testing period
=======================================================================================================
Example		     1-4748   (7,779,532 rows)                    4749-5114    (2,313,735 rows)
Provisional	     1-5114  (10,093,267 rows)                    5115-5990   (12,713,272 rows)
System		     1-6025  (23,517,588 rows)                    6026-7052   (22,138,639 rows)

During the learning period, you only receive the data. During the testing period, you receive the data and predict the atrocities.

In order to supply any data to your code, the testing program will call the method receiveData. Its parameters have the following meaning:

  • int dataSourceId. This is the ID of the data source from which the data comes. There are four possibilities for this parameter. dataSourceId = 0 means information about atrocity events that happened on the current day. dataSourceId = 1 corresponds to information about sociopolitical activities. dataSourceId = 2 means geographic data (supplied only on the first day of the learning period). Finally, dataSourceId = 3 means the list of regions that are part of each country (also supplied only on the first day of the learning period). More details about all these kinds of data can be found below.
  • int dayID. This is the ID of the current day.
  • String[] data. Contains the whole data coming from the given data source for the given day.

You can return any value from the receiveData method. It does not have any meaning and will not be analyzed.
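As a rough illustration of the dispatch described above, the following sketch routes each call by dataSourceId and simply stores or counts the rows. The class name, constant names, and fields are hypothetical and chosen only for this example; a real solution would parse the rows rather than count them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical skeleton: routes each receiveData call by dataSourceId.
class ReceiveDataSketch {
    static final int SRC_ATROCITIES = 0;      // atrocities of the current day
    static final int SRC_ACTIVITIES = 1;      // sociopolitical activities
    static final int SRC_GEOGRAPHY = 2;       // region polygons (first day only)
    static final int SRC_COUNTRY_REGIONS = 3; // regions per country (first day only)

    final List<String> regionPolygonLines = new ArrayList<>();
    final List<String> countryRegionLines = new ArrayList<>();
    long atrocityRows = 0;
    long activityRows = 0;
    int lastDayId = 0;

    public int receiveData(int dataSourceId, int dayID, String[] data) {
        lastDayId = dayID;
        switch (dataSourceId) {
            case SRC_ATROCITIES:
                atrocityRows += data.length;
                break;
            case SRC_ACTIVITIES:
                activityRows += data.length;
                break;
            case SRC_GEOGRAPHY:
                regionPolygonLines.addAll(Arrays.asList(data));
                break;
            case SRC_COUNTRY_REGIONS:
                countryRegionLines.addAll(Arrays.asList(data));
                break;
            default:
                break; // unknown source: ignore
        }
        return 0; // the return value is not analyzed by the tester
    }
}
```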

 

Geographical data

In this problem we assume that the entire land of the Earth is divided into countries, with each country further subdivided into one or more regions (overall, there are 3466 regions). The data about countries and regions is taken from this website (countries are called "admin-0 units" and regions are called "admin-1 units"). For simplicity, we assign a unique integer ID number to each country and to each region (from 1 to 303 for countries and from 1 to 3466 for regions). A list of region IDs in each country is provided. (Note that some countries may have no regions, implying that for purposes of our contest, nothing of interest happened there.)

Exactly the same data will also be passed to your solution during the call to receiveData with dataSourceId = 3. Only one such call will be made (on the first day of the learning period). Each element of String[] data will contain a single line from the file.

Each region is represented as a spherical polygon, that is, a set of latitude and longitude coordinates bounding the area, in the format "LAT,LONG,LAT,LONG,LAT,LONG..."

This file contains the entire geographical data used in this problem. It contains the full list of region descriptions in order by ID.

Exactly the same data will also be passed to your solution during the call to receiveData with dataSourceId = 2. Only one such call will be made (on the first day of the learning period). Each element of String[] data will contain a single line from the file.
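A minimal sketch of parsing one region line follows. It assumes that each line holds nothing but the comma-separated "LAT,LONG,LAT,LONG,..." list described above, with no ID prefix; that assumption, and the class and method names, are illustrative only.

```java
// Hypothetical helper: turns "LAT,LONG,LAT,LONG,..." into an array of {lat, long} pairs.
class RegionPolygonParser {
    static double[][] parse(String line) {
        String[] tokens = line.trim().split(",");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("Odd number of coordinates: " + tokens.length);
        }
        double[][] vertices = new double[tokens.length / 2][2];
        for (int i = 0; i < vertices.length; i++) {
            vertices[i][0] = Double.parseDouble(tokens[2 * i].trim());     // latitude
            vertices[i][1] = Double.parseDouble(tokens[2 * i + 1].trim()); // longitude
        }
        return vertices;
    }
}
```

Since the regions are listed in order by ID, the line at 0-based index i would then describe region i+1.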

 

Past atrocities data

As mentioned above, the information about atrocities that happened on the current day will be passed into your program by calling the receiveData method with dataSourceId = 0. In this case, each element of data will correspond to a single atrocity occurrence. It will be formatted as "PRECISION,COUNTRY_ID,REGION_ID,LATITUDE,LONGITUDE,SCALE", where LATITUDE and LONGITUDE describe the latitude and longitude of the event's location, and COUNTRY_ID and REGION_ID are the IDs of the country and region of the atrocity's location. PRECISION is 1, 2 or 3, indicating whether the atrocity location identifies a country, a region, or a locality. SCALE is a value from 1 to 6 indicating the magnitude of the atrocity.

NOTE: everywhere in this problem, if a certain location does not belong to any of the regions/countries, it will be assigned to a region/country nearby (in the vast majority of cases, it will be the region/country closest to the given location).
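The record format above could be parsed along the following lines. This is only a sketch: the class name and the -1 sentinel are hypothetical, SCALE is read as a double to stay on the safe side, and an empty REGION_ID is tolerated because the training data description says that field may be empty for country-level events.

```java
// Hypothetical holder for one "PRECISION,COUNTRY_ID,REGION_ID,LATITUDE,LONGITUDE,SCALE" row.
class AtrocityRecord {
    final int precision;
    final int countryId;
    final int regionId; // -1 when REGION_ID is empty (country-level event)
    final double latitude;
    final double longitude;
    final double scale;

    private AtrocityRecord(int precision, int countryId, int regionId,
                           double latitude, double longitude, double scale) {
        this.precision = precision;
        this.countryId = countryId;
        this.regionId = regionId;
        this.latitude = latitude;
        this.longitude = longitude;
        this.scale = scale;
    }

    static AtrocityRecord parse(String line) {
        // The -1 limit keeps empty fields such as a missing REGION_ID.
        String[] f = line.trim().split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields, got " + f.length);
        }
        int region = f[2].trim().isEmpty() ? -1 : Integer.parseInt(f[2].trim());
        return new AtrocityRecord(
                Integer.parseInt(f[0].trim()),
                Integer.parseInt(f[1].trim()),
                region,
                Double.parseDouble(f[3].trim()),
                Double.parseDouble(f[4].trim()),
                Double.parseDouble(f[5].trim()));
    }
}
```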

 

Predicting atrocity events

Each time the testing program wants you to predict atrocities that will happen in the near future, it will call your predictAtrocities method. Its parameter dayID describes the current day ID. Your task in this method is as follows: for each region, estimate the total scale of the atrocity events that will happen in this region in the period from day dayID+1 to day dayID+30, inclusive.

The return value must contain exactly 3466 elements, each between 0.0 and 10.0, inclusive. The i-th element (0-based) represents your estimate of Sum(scale) for all events in the (i+1)-th region in the period designated above. A value of 0.0 means that you're absolutely sure that no event will happen. See the "Scoring" section for more information about how these values are interpreted.
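Because an invalid return value yields a score of 0, it may be worth guarding the output shape and range explicitly. The sketch below assumes a hypothetical per-region array named estimate that some model has filled in; only the length and the [0.0, 10.0] clamp come from the statement.

```java
// Hypothetical sketch: returns 3466 values clamped to the allowed range.
class PredictionSketch {
    static final int NUM_REGIONS = 3466;
    static final double MAX_PRED = 10.0;

    // estimate[i] is a model output for region ID i + 1 (filled elsewhere).
    final double[] estimate = new double[NUM_REGIONS];

    public double[] predictAtrocities(int dayID) {
        double[] out = new double[NUM_REGIONS];
        for (int i = 0; i < NUM_REGIONS; i++) {
            double v = estimate[i];
            if (Double.isNaN(v)) {
                v = 0.0; // treat an undefined estimate as "no event expected"
            }
            out[i] = Math.max(0.0, Math.min(MAX_PRED, v));
        }
        return out;
    }
}
```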

 

Summary

Overall, the interaction between the testing program and your algorithm can be described with the following pseudocode:

score := 0

For each day in learning period + testing period (in chronological order):
    If the current day is the first day of the learning period:
        Pass geographical data to the algorithm through the receiveData method with dataSourceId = 2.
        Pass region and country data to the algorithm through the receiveData method with dataSourceId = 3.

    Pass sociopolitical activities data to the algorithm through the receiveData method with dataSourceId = 1.
    Pass the information about atrocities that occurred on the current day through the receiveData method with dataSourceId = 0.

    If the current day is within the testing period:
        Ask the algorithm to estimate confidence for each region via the call to predictAtrocities method.
        Adjust score based on the returned confidences and real atrocity event occurrences (as described in the "Scoring" section).

 

Training data description

For training purposes you can use all sociopolitical activities data available for day IDs 1 through 5114, inclusive. It can be downloaded here. The data for day D is located in the file named data[D+9999].txt. Each line of the file describes an action that took place somewhere in the world. For each action, up to two participants are identified, as well as the action location and several sociopolitical attributes. There are 18 fields in total, separated by single spaces. The fields are described below.

1. Participant 1 ID

A participant can be a person, corporation, government, or any kind of organization. Participants are identified by an arbitrary ID number. All instances of the given ID number refer to the same participant.

2. Participant 1 location precision

The location of a participant is identified at one of four precision levels, denoted with an integer from 1 to 6. If the location is unknown, an underscore character ('_') appears here.

  • 1 = action identified in a country
  • 2 = action identified in a region
  • 3 = action identified in a town
  • 4-6 = action identified at a specific location

3. Participant 1 location description

If the location is unknown, an underscore character ('_') appears here. Otherwise, the meaning of this field depends on the location precision, as follows.

  • 1: the country's centroid coordinates formatted as "LATITUDE,LONGITUDE", where both LATITUDE and LONGITUDE are float values.
  • 2: the region's centroid coordinates (in the same format as above)
  • 3: the town's centroid coordinates (in the same format as above)

4. Participant 1 country ID

The ID of a country where the place given by field 3 is located. If the location is unknown, an underscore character ('_') appears here.

5. Participant 1 region ID

The ID of a region where the place given by field 3 is located. If the location is unknown or if location precision is 1 (country), an underscore character ('_') appears here.

6. Participant 2 ID

In each action, the number of identified participants can be zero, one, or two. The underscore character ('_') appears when no participant is identified. If only one participant is identified, it can appear as either Participant 1 or Participant 2.

The next four fields describe the location of Participant 2 in the same manner as above.

7. Participant 2 location precision

8. Participant 2 location description

9. Participant 2 country ID

10. Participant 2 region ID

11. Action type

There are twenty action types, each identified by a lowercase letter from 'a' to 't'.

The next four fields describe the location of the action in the same manner as for the participants.

12. Action location precision

13. Action location description

14. Action country ID

15. Action region ID

16. Importance

The value of this field is 't' or 'f'. If it is 't', the action is considered to be important. 'f' means that it is not considered important.

17. Media coverage

The amount of media coverage received by the action is rated on a scale of 1 (least covered) to 100 (most covered).

18. Media sentiment

The sentiment of the media coverage is assessed on a scale of -50 to 50. A value of 0 means that the coverage was neutral. A score of 50 means that the coverage was very positive, and -50 indicates very negative.
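The 18-field layout above might be read as in the following sketch. The class and accessor names are hypothetical, field numbers are the 1-based numbers used in this section, and the two media fields are parsed as doubles since the statement does not say whether they are always integers.

```java
// Hypothetical reader for one space-separated sociopolitical activity line.
class ActivityRecord {
    static final int FIELD_COUNT = 18;
    private final String[] f;

    private ActivityRecord(String[] f) {
        this.f = f;
    }

    static ActivityRecord parse(String line) {
        String[] f = line.trim().split(" ", -1);
        if (f.length != FIELD_COUNT) {
            throw new IllegalArgumentException("Expected 18 fields, got " + f.length);
        }
        return new ActivityRecord(f);
    }

    // fieldNumber is 1-based, matching the numbering in the statement.
    boolean isUnknown(int fieldNumber) {
        return "_".equals(f[fieldNumber - 1]);
    }

    char actionType() {                 // field 11, 'a' to 't'
        return f[10].charAt(0);
    }

    int actionRegionIdOrMinusOne() {    // field 15, -1 when '_'
        return isUnknown(15) ? -1 : Integer.parseInt(f[14]);
    }

    boolean isImportant() {             // field 16, 't' or 'f'
        return "t".equals(f[15]);
    }

    double mediaCoverage() {            // field 17
        return Double.parseDouble(f[16]);
    }

    double mediaSentiment() {           // field 18
        return Double.parseDouble(f[17]);
    }
}
```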

 

Data about atrocities that happened in the same period is also available. Each line is formatted as "DAY,PRECISION,COUNTRY_ID,REGION_ID,LATITUDE,LONGITUDE,SCALE" where DAY is the day ID corresponding to the date of the atrocity, LATITUDE and LONGITUDE are the coordinates of the location of the atrocity, COUNTRY_ID and REGION_ID are the IDs of country and region of atrocity's location. REGION_ID may be empty, in which case it means the atrocity happened at a country level, and is treated as an atrocity of size SCALE/N in each of the N regions in that country.
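The SCALE/N rule for country-level atrocities can be sketched as follows. The map from country ID to region IDs is assumed to have been built from the country-region list; its shape, the -1 sentinel for an empty REGION_ID, and all names here are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: accumulates per-region scale totals, index i = region ID i + 1.
class CountryLevelSpread {
    static void addEvent(double[] regionTotals, Map<Integer, int[]> regionsByCountry,
                         int countryId, int regionId, double scale) {
        if (regionId > 0) {
            // Region-level event: the whole scale goes to that region.
            regionTotals[regionId - 1] += scale;
            return;
        }
        int[] regions = regionsByCountry.get(countryId);
        if (regions == null || regions.length == 0) {
            return; // country without regions: nothing to attribute
        }
        // Country-level event: SCALE / N in each of the N regions.
        double share = scale / regions.length;
        for (int r : regions) {
            regionTotals[r - 1] += share;
        }
    }
}
```

For example, a country-level event of scale 6 in a country with 4 regions adds 1.5 to each of those regions.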

 

Scoring

For both provisional and system testing, there is just 1 test case and your total score is equal to the score you receive on this test case. For each (dayID from the testing period, region) pair, let PRED be the prediction that you returned. Consider all atrocity events that occurred in the given region on days up to and including dayID. Let D be the latest of these days. We define WGH as tanh((dayID - D + 10) / 180) (if no atrocity events have occurred in the given region so far, WGH is set to 1). If at least one atrocity event occurred in the given region on days dayID+1 to dayID+30 (inclusive), then your score is increased by WGH * (SUM[SCALE] * PRED - PRED * PRED / 2), where SUM[SCALE] is the sum of SCALE over those events. Otherwise, if no atrocity events occurred, your score is decreased by WGH * PRED * PRED / 2.

Note that your possible score for a given prediction of atrocities is maximized when you predict exactly the Sum(Scale) of atrocities over the next 30 days, and that it is possible to lose points for predicting a value more than 2 * Sum(Scale).
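For local evaluation, the per-pair score contribution can be written down directly from the formulas above. This sketch uses hypothetical names and treats a positive SUM[SCALE] as the signal that at least one event occurred, which assumes every SCALE value is positive.

```java
// Hypothetical local scorer for a single (dayID, region) pair.
class ScoreSketch {
    // lastAtrocityDay is null when the region has had no atrocity so far.
    static double weight(int dayID, Integer lastAtrocityDay) {
        if (lastAtrocityDay == null) {
            return 1.0;
        }
        return Math.tanh((dayID - lastAtrocityDay + 10) / 180.0);
    }

    // sumScale is SUM[SCALE] over days dayID+1..dayID+30 (0 if no events).
    static double contribution(double wgh, double sumScale, double pred) {
        if (sumScale > 0.0) {
            return wgh * (sumScale * pred - pred * pred / 2.0);
        }
        return -wgh * pred * pred / 2.0;
    }
}
```

With WGH = 1 and SUM[SCALE] = 3, predicting 3 yields 4.5, predicting 6 yields 0, and anything above 6 is negative, which matches the note about 2 * Sum(Scale).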

Note that some atrocities happen at a country level, rather than a regional level. In those cases, it is treated as though an event of size SCALE / N happened in each of the N regions of that country.

Once all (dayID, region) pairs are considered, if your score is negative, it is reset to 0. You also get a score of 0 if there are any problems with your solution (time/memory limit exceeded, crash, invalid return value, etc.).

 

Special conditions

  • It is strictly forbidden to use any data other than the data that we provide to you in this competition.
  • You are required to use the same approach to estimate the confidence for each (dayID, region) pair. It is forbidden to have any kind of special judgment for any selected pairs.
  • You are required to process all participants (in sociopolitical data) in the same fashion. It is forbidden to have any kind of special processing for any selected participants.
  • You can include open source code in your submission provided that the open source code is licensed under an acceptable license, is separated from and clearly identified in your code, and your submission in this competition complies with the license. An open source license is acceptable if:
    1. it is an OSI-approved open source license (listed at http://opensource.org/licenses), or
    2. it is a license that meets the requirements of the OSI Open Source definition (http://opensource.org/osd), or
    3. it is an explicit and clear dedication by the author(s) to the public domain.
    The question of whether a license is acceptable, and particularly whether a license that is not OSI-approved meets the requirements of the Open Source definition, is complex and will be decided by TopCoder in TopCoder’s sole discretion, and so we strongly recommend that, if you use open source code, you use code that is clearly licensed under an OSI-approved license. Questions about specific open source licenses may be asked in the forums, but given the complexities of license evaluation, it is possible that TopCoder will not be able to respond during the contest.
  • The solution can be implemented in C++, Java, C# or VB. Python submissions are not accepted.
  • In order to receive the prize money, you will need to fully document how your algorithm works. Note that this description should NOT be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect it. The description must be submitted within 7 days after the contest results are published.
 

Definition

    
Class: MassAtrocityPredictor2016
Method: receiveData
Parameters: int, int, String[]
Returns: int
Method signature: int receiveData(int dataSourceId, int dayID, String[] data)
 
Method: predictAtrocities
Parameters: int
Returns: double[]
Method signature: double[] predictAtrocities(int dayID)
(be sure your methods are public)
    
 

Notes

-The time limit is 2 hours per test case and the memory limit is 6144MB.
-The solutions are executed on VMs. Therefore the amount of available CPU resources may vary to some degree.
-The time limit is set to a large value in order to help you deal with that. Given this variability, we recommend that you design your solution so that its estimated runtime does not exceed 90 minutes. When estimating the runtime, please take into account that the system test case has a significantly larger amount of data to be processed than the provisional one.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit code of a size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 60 seconds. You can find information about compilers that we use and compilation options here.
-This contest was run previously. The data formats have changed slightly, but the core of the problem remains largely the same. You can view the previous problem, as well as winning submissions here. You are encouraged to review and use elements of the top 5 winning solutions to help on this problem.
-Documentation from previous contest winners is likewise available: Top-5 Descriptions and Post-contest ideation
 

Examples

0)
    
1)
    

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.