EPA Cyano Modeling Challenge
Problem Statement
The EPA is a U.S. federal government agency devoted to safeguarding the environment. One of the EPA's great concerns is the proliferation of cyanobacterial harmful blooms (cyanoHABs) in the nation's lakes. The following resources provide information on what cyanoHABs are and how they threaten the environment.
EPA: Cyanobacterial Harmful Blooms (CyanoHABs)
University of North Carolina: Cyanobacteria [video]
The TopCoder project on cyanoHABs aims to develop an algorithm that will be deployed in an Android app with mapping and data visualization capabilities. The app will inform local and federal policy makers about locations where bloom events are likely to occur, allowing them to concentrate their efforts in those areas. This Marathon competition focuses on predicting the cyanobacterial concentration of selected areas at fixed intervals into the future.
Your prediction algorithm will be based on records of past cyanobacteria concentrations, along with other data such as weather information, a survey of land usage, and records of which crops were cultivated in agricultural areas.
The best 5 performers of this contest (according to system test results) will receive the following prizes:
- 1st place: $6,000
- 2nd place: $4,000
- 3rd place: $2,750
- 4th place: $1,500
- 5th place: $750
(1) MERIS Data
MERIS data is a synthetic data set representing cyanobacterial concentration. Although it is actually an estimate, this competition uses it as the ground truth, i.e. your predictions will be scored against future MERIS data.
MERIS data covers three regions of the United States: Florida, Ohio and New England.
The values are derived from satellite photographs by applying an experimental formula that estimates the concentration of cyanobacteria in each 300-by-300-meter area of the covered region.
Cyanobacterial concentration is measured in cells per milliliter, and the EPA has defined four levels of cyanobacterial concentration:
- Low: 10,000 to 109,999 cells / mL
- Medium: 110,000 to 299,999 cells / mL
- High: 300,000 to 999,999 cells / mL
- Very High: 1,000,000 cells / mL or higher
Note that, according to the EPA's calculations, the Low and Very High readings in the MERIS data are accurate, while the readings at the intermediate levels (Medium and High) may be volatile. This means that any value between 110,000 and 999,999 is not guaranteed to be fully accurate. To address this, each region is divided into multiple areas, and for scoring the values within each area are averaged.
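The four levels can be expressed as a simple threshold lookup. The following is a minimal sketch, not part of the official tester: the class name `CyanoLevel` and the method `of` are hypothetical, and the choice to report values below 10,000 as "Low" is an assumption.

```java
/** Hypothetical helper mapping a concentration in cells/mL to the EPA level named in the statement. */
final class CyanoLevel {
    private CyanoLevel() {}

    /**
     * Returns "Low", "Medium", "High" or "Very High".
     * Assumption: values below 10,000 (which cannot be stored in MERIS data anyway)
     * are also reported as "Low".
     */
    static String of(double cellsPerMl) {
        if (cellsPerMl >= 1_000_000) return "Very High";
        if (cellsPerMl >= 300_000) return "High";
        if (cellsPerMl >= 110_000) return "Medium";
        return "Low";
    }
}
```

Because area averages are real numbers, the sketch compares against the lower bound of each level rather than the integer upper bounds listed above.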
The areas are rectangles in terms of geographical coordinates: each has a height of 0.01 degrees of latitude and a width of 0.02 degrees of longitude. The MERIS data spans the period from February 2009 to April 2012, but it is not available for every day, and the gap between available days is not fixed; the average gap is about 3 days. Further note that, due to internal formatting, the lowest value that can be stored is 10,232.9 and the highest is 3,090,295.4. These two values should therefore be read as "10,232.9 or less" and "3,090,295.4 or more" respectively.
This competition asks you to predict the average amount of cyanobacteria, in cells per milliliter, one, two and four weeks ahead for selected days and areas within each region. Each test case contains exactly 10000 (day, area) pairs. Furthermore, within each test case the region is the same for all pairs, and so is the level of cyanobacteria at the last measurement before the prediction.
(2) Weather and Climate data:
This data contains daily readings of meteorological measurements and is split into two sets. The first gives radiation and precipitation information for the whole globe. It is quite coarse, describing cells that each cover one degree of latitude by one degree of longitude. The second contains local measurements at certain weather stations within each region and gives a more precise view of different kinds of precipitation at these stations.
(3) Water Quality Data
This dataset contains readings of certain water-related quantities, such as temperature, pH and phosphorus concentration, for certain places within each region.
(4) National Land Cover 2006
This is a one-time survey of land usage (residential, industrial, and agricultural) with a resolution exceeding that of the MERIS data. Currently this data is not available as live data during the match. Instead you can download the data in order to train your solution offline. The data consists of colored satellite images in GeoTIFF format that encode the land use at each pixel as an 8-bit value. To help you read this data, we extracted the coordinates and the corresponding 8-bit values into a .csv file provided with the images. The interpretation of this data is part of this contest. Since the land cover data is very large, especially when unpacked, we decided to split the data set into four files. We recommend downloading the file for Ohio first, since it is the smallest. You can find the data for each region here:
- Ohio
- Florida
- New England (First Part)
- New England (Second Part)
(5) Lake Assessment 2007
The EPA and its state and tribal partners conducted a survey of the nation's lakes, ponds and reservoirs in 2007. This National Lakes Assessment is designed to provide statistically valid regional and national estimates of the condition of lakes. Currently this data is not available as live data during the match. Instead you can download the data in order to train your solution offline. The data can be found here.
(6) CropScape 2009-2012
Annual surveys of which crops were cultivated in agricultural areas, with a high resolution equal to that of the MERIS data. Agricultural data may be important because runoff from fertilizer use is the principal contributor to cyanobacterial growth. A new CropScape report covering the past year becomes available every year on January 31.
Note that most of the provided data other than the MERIS data was collected in two content creation contests and by the project's copilots. Feel free to ask for details about the data collection in the contest forum.
Test case structure and implementation notes
There are multiple test cases in each of the example, provisional, and system data sets. All test cases in the same data set share the same learning period and testing period:
Test case set | Learning period start day | Learning period end day | Testing period start day | Testing period end day
Training / Examples | 2009/02/01 | 2010/01/31 | 2010/02/01 | 2010/07/31
Provisional tests | 2009/02/01 | 2010/07/31 | 2010/08/01 | 2011/04/30
System tests | 2009/02/01 | 2011/04/30 | 2011/05/01 | 2012/04/30
Important note: Because the training and testing phases differ in length, the amount of data passed to your solution varies massively. For the example test cases there may be at most 59,593,318 lines, for the provisional test cases 83,539,911 lines, and for the system test cases 115,832,464 lines of data passed to your submission. Please take this into account, especially with regard to the time and memory limits.
The data used for the training phase can be downloaded here. You may also download the originals of the training data, before they were converted into the csv streams, here.
During the learning period you only receive data. During the testing period you receive data and give predictions. Each test case asks for predictions a fixed number of days ahead of the current day. Note that the start and end days of the testing period mark the first and the last day for which a prediction can be requested, so some prediction requests arrive before the first day of the testing period. For example, during provisional tests, if the time span is 28 days, the first request can come on 2010/07/04 (for a prediction for 2010/08/01).
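The date arithmetic in the example above can be checked with `java.time`. This is only an illustrative sketch; the class `RequestDay` and its method names are hypothetical helpers, not part of the required interface.

```java
import java.time.LocalDate;

/** Hypothetical helper for the day arithmetic described in the statement. */
final class RequestDay {
    private RequestDay() {}

    /** Converts the (year, dayOfYear) pair passed to receiveData/predict into a calendar date. */
    static LocalDate toDate(int year, int dayOfYear) {
        return LocalDate.ofYearDay(year, dayOfYear);
    }

    /** Day on which a prediction for targetDay is requested, timeSpan days earlier. */
    static LocalDate requestDayFor(LocalDate targetDay, int timeSpan) {
        return targetDay.minusDays(timeSpan);
    }

    /** Day a request made on requestDay refers to, timeSpan days later. */
    static LocalDate targetDayFor(LocalDate requestDay, int timeSpan) {
        return requestDay.plusDays(timeSpan);
    }
}
```

For the provisional example, `RequestDay.requestDayFor(LocalDate.of(2010, 8, 1), 28)` yields 2010-07-04, which is day 185 of 2010.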
For each test case the init method will first be called to pass the basic information about this test case. Its parameters have the following meaning:
- String region. Which region this case deals with. This will be one out of "Florida", "Ohio" or "New_England".
- double minVInterested and double maxVInterested. You only have to predict areas whose current cyanobacterial concentration is in [minVInterested, maxVInterested]. On each day within the testing period, the selected areas satisfying this condition will be passed in the requests parameter of the predict method. Note that not all areas satisfying this condition are selected. Each test case uses a subset of them in order to guarantee an almost fixed number of requests per test case.
- int timeSpan. How many days ahead of the current day the predictions are to be made. It is one of 7, 14 or 28.
- String areaInfo. It describes the areaIds contained in the region given by region. Each line is formatted as "areaId,A,B,C,D,Active" where A, B, C and D define the bounding box of the area, in the order minimal latitude, maximal latitude, minimal longitude and maximal longitude. Active is either 1 (true) or 0 (false). If it is 1, this area can be used for predictions in some test case (not necessarily the current one). If it is 0, this area will not be used for predictions. This data remains constant for each region throughout all phases of this match, e.g. all test cases using data from "Florida" will use exactly the same areaInfo. Note that this information comes with a header line, so the data begins at the 2nd row.
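The areaInfo format can be parsed along the following lines. This is a minimal sketch under assumptions: the class names `AreaInfoParser` and `Area` are hypothetical, and the input is taken as an array of lines (if areaInfo arrives as one string, it would first have to be split on line breaks).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for "areaId,A,B,C,D,Active" lines; the first line is the header. */
final class AreaInfoParser {
    private AreaInfoParser() {}

    /** One area with its bounding box in geographical coordinates. */
    static final class Area {
        final int id;
        final double minLat, maxLat, minLon, maxLon;
        final boolean active;

        Area(int id, double minLat, double maxLat, double minLon, double maxLon, boolean active) {
            this.id = id;
            this.minLat = minLat;
            this.maxLat = maxLat;
            this.minLon = minLon;
            this.maxLon = maxLon;
            this.active = active;
        }

        /** True if the point lies inside the bounding box (borders included). */
        boolean contains(double lat, double lon) {
            return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
        }
    }

    /** Parses all lines after the header line; blank lines are skipped. */
    static List<Area> parse(String[] lines) {
        List<Area> areas = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) continue;
            String[] f = line.split(",");
            areas.add(new Area(
                    Integer.parseInt(f[0].trim()),
                    Double.parseDouble(f[1]), Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4]),
                    "1".equals(f[5].trim())));
        }
        return areas;
    }
}
```

Since areaInfo is constant per region, the parsed list can be built once in init and reused, for instance to map the latitude and longitude of a merisData record to its area.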
For each day between the start of the training phase and the end of the testing phase the method receiveData will be called to send data about that day to your submission. The parameters of this method have the following order and meaning:
- dayOfYear is an integer holding the current day of the year for which data is to be passed to your submission. In this problem, days of each year are always numbered chronologically starting from 1.
- year is also an integer giving the current year.
- merisData, areaMerisData, globalClimate, localClimate, waterQuality and cropScape are arrays of strings giving all data records collected on the specified date. Note that each record occupies exactly one array element, so an array may be empty if no new data was collected on that day.
The return type of both init and receiveData is an integer, but its value will be ignored by the tester.
For each day from timeSpan days before the start of the testing period until timeSpan days before the end of the testing period, your predict method may be called. The parameters of this method have the following order and meaning:
- dayOfYear is an integer holding the current day of the year.
- year is an integer holding the current year.
- requests is an array of integers. Each array entry is equal to one areaId specified in areaInfo data. It is expected that a prediction is to be made for all areas within requests and returned as an array of doubles in the same order as the corresponding requests.
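The interplay of the three methods can be illustrated with a persistence baseline that simply repeats the last known area average. This is a sketch under assumptions, not a reference solution: the class name `PersistenceBaseline` is hypothetical (the required class name is not given here), the array parameter types follow the textual description above, and the fallback for areas without any reading is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical baseline: predicts the last average seen for each requested area. */
final class PersistenceBaseline {
    /** Lowest value that can be stored in MERIS data, per the statement. */
    private static final double MIN_STORED = 10232.9;

    private final Map<Integer, Double> lastAverage = new HashMap<>();
    private double fallback = MIN_STORED;

    public int init(String region, double minVInterested, double maxVInterested,
                    int timeSpan, String areaInfo) {
        lastAverage.clear();
        // Assumption: with no reading at all, guess the lower end of the range of interest.
        fallback = Math.max(MIN_STORED, minVInterested);
        return 0; // return value is ignored by the tester
    }

    public int receiveData(int dayOfYear, int year, String[] merisData, String[] areaMerisData,
                           String[] globalClimate, String[] localClimate,
                           String[] waterQuality, String[] cropScape) {
        for (String record : areaMerisData) {
            // areaMerisData columns: year, day of year, areaId, average concentration
            String[] f = record.split(",");
            lastAverage.put(Integer.parseInt(f[2].trim()), Double.parseDouble(f[3]));
        }
        return 0; // return value is ignored by the tester
    }

    public double[] predict(int dayOfYear, int year, int timeSpan, int[] requests) {
        double[] result = new double[requests.length];
        for (int i = 0; i < requests.length; i++) {
            double last = lastAverage.getOrDefault(requests[i], fallback);
            result[i] = Math.max(MIN_STORED, last); // same order as the requests
        }
        return result;
    }
}
```

A baseline like this ignores all climate, water quality and crop data; it mainly serves to check that data is received and that predictions are returned in request order.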
Format Details of the provided data
The header is removed from all data before it is passed to your submission. You can see the original headers in the downloadable training data.
All data streams are comma-separated (","), and all of them except areaMerisData share their first four columns:
- year: the year the data record was recorded.
- day of the year: the day the data record was recorded.
- latitude: the latitude (see geographical coordinates) at which the data record was recorded.
- longitude: the longitude at which the data record was recorded.
The areaMerisData shares only the first two of these common columns.
The further columns of each data group are described below.
merisData: The only additional column contains the concentration of cyanobacteria at the given location. The unit for this value is cells/mL.
Note that this data is only passed to your submission when it was possible to detect a concentration for these coordinates. Thus coordinates belonging to land or covered by clouds are left out.
areaMerisData: This record has two data columns. The first is an integer giving the areaId for which the average cyanobacteria concentration was calculated. The second is a double value giving the calculated average.
The same restrictions as for merisData also apply to this data.
globalClimate: This record has three data columns.
- precipitation: The mean daily precipitation rate at the surface for the given location. Unit is kg/m^2/s.
- nswrs: The mean net short wave radiation (mostly visible light) flux at the surface. Unit is W/m^2.
- nlwrs: The mean net long wave radiation (mostly invisible rays) flux at the surface. Unit is W/m^2.
Note that the entries can be negative. A negative value indicates a flux from earth to space, a positive value a flux from space to earth.
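A precipitation rate in kg/m^2/s is easier to reason about as millimeters of water per day: one kilogram of water spread over one square meter is one millimeter deep, and a day has 86,400 seconds. The sketch below parses one record and applies that conversion. The class `GlobalClimateRecord` is hypothetical, and it assumes the three data columns follow the four common columns in the order listed above.

```java
/** Hypothetical holder for one global climate record. */
final class GlobalClimateRecord {
    final int year, dayOfYear;
    final double latitude, longitude;
    final double precipitation; // kg/m^2/s
    final double nswrs;         // W/m^2
    final double nlwrs;         // W/m^2

    private GlobalClimateRecord(int year, int dayOfYear, double latitude, double longitude,
                                double precipitation, double nswrs, double nlwrs) {
        this.year = year;
        this.dayOfYear = dayOfYear;
        this.latitude = latitude;
        this.longitude = longitude;
        this.precipitation = precipitation;
        this.nswrs = nswrs;
        this.nlwrs = nlwrs;
    }

    /** Assumed column order: year, day, latitude, longitude, precipitation, nswrs, nlwrs. */
    static GlobalClimateRecord parse(String line) {
        String[] f = line.split(",");
        return new GlobalClimateRecord(
                Integer.parseInt(f[0].trim()), Integer.parseInt(f[1].trim()),
                Double.parseDouble(f[2]), Double.parseDouble(f[3]),
                Double.parseDouble(f[4]), Double.parseDouble(f[5]), Double.parseDouble(f[6]));
    }

    /** 1 kg of water per m^2 is a 1 mm layer; a day has 86,400 seconds. */
    double precipitationMmPerDay() {
        return precipitation * 86_400.0;
    }
}
```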
localClimate: This record contains an elaborate set of information about precipitation at the selected location. The exact column order is:
elevation, PRCP, SNWD, SNOW, WESD, WESF, WDMV, WT09, WT14, WT07, WT01, WT15, WT17, WT06, WT21, WT05, WT02, WT11, WT22, WT04, WT13, WT16, WT08, WT18, WT03, WT10, WT19, station, station_name
While the first entry and the last ones are self-explanatory, the meaning of the other values can be looked up here.
Each line of water quality data has the following columns, in order of occurrence:
- station: A string giving a unique identifier of the station where the given data was collected.
- dataGroup: Gives the group the following data belongs to. This is one of:
"Temperature", "PH", "Phosphorus", "Dissolved oxygen", "Dissolved oxygen saturation", "Chlorophyll a", "Flow", "Nitrate", "Turbidity"
- dataSubGroup: Gives more detailed information on the data, since not every station uses exactly the same method of collecting data.
- data: Double value holding the measured data.
- unit: The unit of the given data.
- status: The status of the given data line, e.g. whether it is "Final" or "Rejected".
- collection: How the data was obtained, e.g. whether it was "actual" (a measurement) or "calculated".
Note that this data was hand-collected at different water quality measurement stations within the USA. Therefore the data is rather sparse, and multiple different units occur even for the same data group. The most common units are:
- mg/m3: milligrams per cubic meter
- ug/l: micrograms per liter
- mg/l: milligrams per liter
- mg / kg: milligrams per kilogram (of water)
- %: percentage (of full saturation)
- cfs or ft3/sec: cubic feet per second
- mmol/L: millimoles per liter
- NTU, FTU: see here for an example on how to measure turbidity.
Furthermore, be aware that, as with any hand-collected data, this dataset may contain errors, typos and so on. Filtering out the useful information contained in this data is part of this competition.
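Because the same data group can arrive in several units, a normalization step is a natural first filter. The sketch below converts the three mass-per-volume units from the list to mg/L, using the fact that 1 mg/m3 equals 1 ug/L. The class `WaterQualityUnits` is hypothetical, and returning an empty result for every other unit is a design choice of this sketch, not something prescribed by the contest.

```java
import java.util.Locale;
import java.util.OptionalDouble;

/** Hypothetical unit normalizer for water quality concentrations. */
final class WaterQualityUnits {
    private WaterQualityUnits() {}

    /**
     * Converts a mass-per-volume reading to mg/L.
     * Returns an empty result for units this sketch does not handle
     * (e.g. %, cfs, NTU, mmol/L, mg/kg).
     */
    static OptionalDouble toMgPerLitre(double value, String unit) {
        // Tolerate spelling variants such as "mg / l" or "UG/L".
        String u = unit.replace(" ", "").toLowerCase(Locale.ROOT);
        switch (u) {
            case "mg/l":
                return OptionalDouble.of(value);
            case "ug/l":   // 1000 micrograms = 1 milligram
            case "mg/m3":  // 1 m^3 = 1000 L, so 1 mg/m3 = 1 ug/L
                return OptionalDouble.of(value / 1000.0);
            default:
                return OptionalDouble.empty();
        }
    }
}
```

Readings whose status column marks them as rejected could be dropped before this step.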
cropScape: Crop scape data is only provided on January 31 and gives a detailed report about which crop was grown at which location in the previous year. The only value is an integer standing for one crop or land use at that location. A complete table for resolving the integers to crop names can be found here.
For each prediction, its score will be calculated using the formula
Score = 1 - (log10(predictedValue) - log10(actualValue))^2.
The score is positive if and only if the prediction is within one order of magnitude of the actual value. The scoring function tries to identify bad data, e.g. an average concentration below 10,000. In such cases the score for this request will always be 0.
Your score for each test case will be 1,000,000 times the average score of the predictions within this test case. The lowest possible score for each test case is 0. The test case score will also be 0 in case of memory or time limit errors or if your return value is misformatted.
Your total score is the average of all test case scores.
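For offline evaluation, the scoring rules can be reproduced roughly as follows. This is a sketch based on one reading of the rules, not the official scorer: the class `LocalScorer` is hypothetical, negative per-prediction scores are assumed to count in the average, and only the test case score is assumed to be clamped at 0.

```java
/** Hypothetical local re-implementation of the scoring rules. */
final class LocalScorer {
    private LocalScorer() {}

    /** Score of a single prediction; 0 when the actual value looks like bad data. */
    static double predictionScore(double predicted, double actual) {
        if (actual < 10_000) return 0.0; // bad-data rule from the statement
        double diff = Math.log10(predicted) - Math.log10(actual);
        return 1.0 - diff * diff;
    }

    /** 1,000,000 times the average prediction score, assumed to be clamped at 0. */
    static double testCaseScore(double[] predicted, double[] actual) {
        if (predicted.length == 0 || predicted.length != actual.length) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += predictionScore(predicted[i], actual[i]);
        }
        return Math.max(0.0, 1_000_000.0 * sum / predicted.length);
    }
}
```

A prediction that is off by exactly a factor of 10 scores 0, and one that is off by a factor of 100 scores -3, so a few wild predictions can outweigh many good ones in the average.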
Rules and Restrictions
- It is strictly forbidden to use any data other than the data that we provide to you in this competition.
- The solution can be implemented in C++, Java, C# and VB. Python solutions are not allowed.
- You can include open source code in your submission. All third party code/libraries must be open source and you must include the license information in your submission. The license must allow us to modify/re-package the code as necessary. If you have any questions regarding this, please ask us via the forums or firstname.lastname@example.org email. Submissions that include third party code without the proper license information will be disqualified if the third party code is found to be non-usable due to license restrictions.
- In order to receive the prize money, you will need to fully document how your algorithm works. Note that this description should NOT be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect these data. The description must be submitted within 7 days after the contest results are published.
- If you use any script or program to derive hard-coded values for your submission, that script must also be submitted and explained along with the documentation mentioned above.
- Match forum is located here. Please check it regularly as important clarifications and updates may be posted there. All clarifications, updates, restrictions and any other information posted at the forum by [topcoder] admins (orange handles) or copilots of this match (handles WTrei and rst9288) should be treated as part of the problem statement. You can click “Watch Forum” if you would like to receive automatic notifications about all posted messages to your email.
- If your solution fails to comply with any restriction stated in the problem statement or posted in the match forum during the match, it may be disqualified from the competition.
|Parameters:||String, double, double, int, String|
|Method signature:||int init(String region, double minVInterested, double maxVInterested, int timeSpan, String areaInfo)|
|Parameters:||int, int, String[], String[], String[], String[], String[], String[]|
|Method signature:||int receiveData(int dayOfYear, int year, String[] merisData, String[] areaMerisData, String[] globalClimate, String[] localClimate, String[] waterQuality, String[] cropScape)|
|Parameters:||int, int, int, int[]|
|Method signature:||double[] predict(int dayOfYear, int year, int timeSpan, int[] requests)|
|(be sure your methods are public)|
- The memory limit is 4 GB and the time limit is 80 minutes (which includes only time spent within your code).
- The solutions are executed on Amazon m3.large VMs. Therefore the amount of available CPU resources may vary to some degree. The time limit is set to a larger value in order to help you deal with that. Given this variability, we recommend designing your solution so that its estimated runtime does not exceed 60 minutes. When estimating the runtime, please take into account that the system test cases have a significantly larger amount of data to process than the provisional ones.
- There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit code of a size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
- The compilation time limit is 60 seconds. You can find information about the compilers that we use and the compilation options here.
- If you are using C++, we recommend accepting all input vectors by reference (e.g. vector<string>& merisData). This will reduce the memory usage of your program.
- Each test case in provisional and system tests will contain between 5,000 and 10,000 requested predictions. For the example test cases there may be fewer requests due to the shorter period of time tested.
- There are 3 example test cases, 15 provisional and 45 system test cases.
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.