Get Time
long_comps_topcoder  Problem Statement
Contest: Asteroid Data Hunter MM 1
Problem: AsteroidRejector

Problem Statement



The best 5 performers of this contest (according to system test results) will receive the following prizes:

  • 1st place: $5,000
  • 2nd place: $2,000
  • 3rd place: $1,500
  • 4th place: $1,000
  • 5th place: $500


Life as we know it may very well change at any moment for ever without us even realizing it... but don't despair, NASA and Planetary Resources have spun up their engines and they have a plan. When we think of asteroids we immediately think of extinction. It also doesn't help if every once in a while we are reminded of how great that danger really is. Recently the 20m Chelyabinsk asteroid deposited energy equivalent to half a megaton of TNT in the atmosphere over a sparsely populated area, yet still managing to hurt over 1500 people. But aside their destructive force, they also have a brighter side to the story, they may very well be an indispensable mining resource in helping mankind in it's future universe exploration missions! The Catalina Sky Survey has an algorithm for detecting asteroids and Near Earth Objects (NEOs), but it has an estimated accuracy of 90%, where most of the time it also produces false detections along with the good ones. Increasing the detection efficiency would also increase the number of false detections. Although a trained astronomer could screen out these false detections, this is becoming highly inefficient considering we live in a world with ever increasing data flow from modern instruments. As such, they are looking to employ the minds of ordinary citizens who, although may or may not be trained in astronomy, have the potential of bringing out-of-the-box ideas for developing a post-processing routine that would help reduce false detections, thus significantly improving the detection precision of NEOs.



As part of a joint effort for improving Near Earth Objects (NEOs) detection you are tasked with increasing the accuracy of an algorithm by means of a post-processing step in which your own implementation should properly and automatically screen false detections from the former step thus effectively removing the human interaction factor.


Currently used algorithm

The following steps are used in detecting an asteroid:

  • each image is individually scanned by a program (Source Extractor) which finds all the discrete sources above some threshold over the noise floor (the threshold is variable because of weather and other undocumented reasons)
  • another program (con4) looks for all the objects that move in linear paths amongst the catalog of objects from all four images and its output is a detections file (.dets)
  • the detections file is checked for known objects that do not need reporting
  • bulk removals are performed around known sources (such as bright stars) without recording the deletions
  • at the end the remaining objects are presented to the operator for validation

We would like to automate this last step. The operator validates the remaining objects by accepting or rejecting each. For a list of possible rejection reasons please see this document.



The post-processing algorithm shall receive the following input:

  • raw image data extracted from 4 (FITS) images of the sky, taken roughly 10 minutes apart. The resolution of these extracted patches is fixed (64 x 64 pixels) and the data contains 16 bit values. Each extracted image patch is centered on a detected object. The resolution of the original FITS images are 4110 pixels wide and 4096 pixels in height. If one of the pixels in the 64 x 64 image patch falls outside the bounds of the original FITS image, the pixel value was set to 65535.
  • a detection list associated with the set of images which contains a list of information for the detected objects.

The training data can be downloaded here. The image file is in raw 16 bit format and can be read in and displayed by the visualizer that can be downloaded here. The detection list is in space delimited format with 16 columns and each row representing a single detection in one of the 4 original FITS images. The columns are:

  1. Unique ID -- An identifier for what detected object a row belongs to
  2. Detection Number -- sequential numbering of detection output of the currently used detection software
  3. Frame Number -- which observation is this row relevant to (1, 2, 3 or 4)
  4. Sexnum -- Source extractor number of the object
  5. Time -- Julian date
  6. RA -- right ascension of object in decimal hours
  7. DEC -- declination in decimal degrees
  8. X -- location in pixels of the object in the original FITS image
  9. Y -- location in pixels of the object in the original FITS image
  10. Magnitude -- brightness of the object in magnitudes
  11. FWHM -- full width at half maximum of Gaussian fit in pixels
  12. Elong -- ratio of long axis to short axis
  13. Theta -- position angle of the long axis
  14. RMSE -- error in fit to straight line
  15. Deltamu -- from Source Extractor, peak value minus threshold over background
  16. Rejected -- this value will be 1 if the operator rejected the detection, 0 otherwise. This column will only be available during the training phase. You need to predict this column

The following image shows an example detection, the 4 image patches are shown from left to right with the detected object at the center.



Your task is to implement a trainingData, testingData and getAnswer methods, whose signatures are detailed in the Definition section below.

int[] imageData contains image data about multiple detections at 4 different observation times. The array contains image patches of size 64 x 64. Every int will contain a 16 bit value. Let i be the i-th image patch and (x,y) be the coordinate within an image patch. (0,0) will be the top left corner in the patch and (63,63) the bottom right corner. The pixel data can then be found at index [i*64*64 + x + y*64] of the imageData array.

String[] detections contains information about the detected objects. It will be in the same format as in the provided training data files, except that the Unique ID is added by the tester software and can not be seen when viewing the training data files directly. Each row of detections will correspond to a single line of the file. However, the Rejected column will be eliminated during the testingData method call, so then there will be only 15 columns instead of 16.

Each row of detections can be linked with an image patch in imageData. The image patch of the j-th row in detections can be found starting at imageData[j*64*64] and will be 64*64 integers long.

Firstly, your trainingData method will be called with detection data extracted from a FITS image set. Your method will be called 1000 times, one time for every case in the training data. Each image set contains multiple detections. You can use this method to train your algorithm on the provided data if you want to.

Secondly, your testingData method will be called 200 times with different image and detection data than provided during training. The trainingData and testingData methods can return any value, it does not matter what you return.

Finally, your getAnswer method will be called. This method should return a list of all the Unique IDs (and each of them exactly once) provided in the detections rows. The goal is to order those elements (detections) in such a way, that those that you believe are the least probable to be rejected objects at the front of the array, and those the most probable at the back.


Testing and scoring

There are 1 example, 10 provisional tests and at least 20 system tests. They are generated as follows:

  • For a single example test: Data from 500 cases will be used. 300 cases from this set will be randomly selected for training and your trainingData method will be called 1000 times with data from a randomly selected cases from the 300 subset. The remaining 200 cases will be used for testing. The exact contents of example test cases can be downloaded here
  • For a single provisional test: Training data will include the 500 cases used in the example tests and another 500 cases will be randomly selected from 700 provisional cases. In total your traininData method will be called a 1000 times. The remaining 200 cases from the 700 will be used for testing.
  • For a single system test: Training data will include the 500 cases used in the example tests and another 500 cases will be randomly selected from the 700 provisional cases and another 500 system cases. In total your trainingData method will be called a 1000 times. A random set of 200 cases will be selected for testing from cases not supplied to you for training.

Note the difference in case selection between the example test (random cases with replacement) and provisional/system tests (random cases without replacement).


In this Marathon Match we will use Average Precision scoring metric and it's going to be scaled in such way, that maximum possible score is set fixed at 1000000. This exact pseudo code will be used to calculate your score (assuming array A describes your returned array):

score := 0.0
total := 0
correct := 0
for i := 0 .. A.length - 1
    total := total + 1
    if (isNotRejected(Ai)) {
    	correct := correct + 1
    	score := score + (1000000.0 / totalNotRejected) * (correct / total)    

If your solution exceeds time or memory limit, crashes or returns invalid result, your score for this test case is 0. Your result is considered invalid if it doesn't contain every detection provided to your algorithm via the testingData method.

The overall score on a set of test cases is the arithmetic average of scores on single test cases from the set. The match standings displays overall scores on provisional tests for all competitors who have made at least 1 full test submission. The winners are competitors with the highest overall scores on system tests.


An offline tester/visualizer tool is available.


Special rules and conditions

  • The allowed programming languages are C++ and Java.
  • You can include open source code in your submission. Open Source libraries under the BSD, or GPL or GPL2 license will be accepted. Other open source licenses could be accepted too. Just be sure to ask us.
  • In order to receive the prize money, you will need to fully document your code and explain your algorithm. If any parameters were obtained from the training data set, you will also need to provide the program used to generate these parameters. There is no restriction on the programming language used to generate these training parameters. Note that all this documentation should not be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect this data.
  • You may not use any external (outside of this competition) source of data to train your solution.


Parameters:int[], String[]
Method signature:int trainingData(int[] imageData, String[] detections)
Parameters:int[], String[]
Method signature:int testingData(int[] imageData, String[] detections)
Method signature:int[] getAnswer()
(be sure your methods are public)


-The match forum is located here. Please check it regularly because some important clarifications and/or updates may be posted there. You can click "Watch Forum" if you would like to receive automatic notifications about all posted messages to your email.
-You can train your solution offline based on the given training data file and you can hardcode data into your solution. However remember that you can't use data from other sources than this contest.
-Time limit is 40 minutes which include only time spend within your code. Solutions are executed at VMs. Therefore the amount of available CPU resources may vary to some degree. The time limit is set to a large value in order to help you deal with that. Given this variability, we recommend you to design your solution so that its estimated runtime does not exceed 30 minutes.
-Your solutions are guaranteed to be able to use 5 GB of memory. It is possible that more memory will be available, but we do not provide any guarantees for that.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger).
-The compilation time limit is 60 seconds. You can find information about compilers that we use, compilation options and processing server specifications here.
-There is 1 example test case and 10 full submission (provisional) test cases.
-You may submit an example submission every 30 minutes and a full submission every 2 hours. We may change these limits at any time during the contest.


Example Seed

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.