Topcoder Challenge | Topcoder Community

Challenge Overview

This project aims to build a Proof of Concept application that lets users categorize their personal activities and important moments and aggregate them by category and by time. Categorization is done automatically by analyzing information extracted from the user's personal data: emails, documents the user provides, social profiles (such as Facebook or Twitter), and so on. The current contest focuses on the categorization part of this overall concept.

We want to build a component that provides a prototype of the above capability. Specifically, the high-level requirements for the component are as follows:

1.) Allow the users to create categories.

2.) Read the data from source files.

3.) Parse the data and categorize it using natural language processing and a classification algorithm.

4.) Allow the users to see summaries of their stored data aggregated by category or by time.

Example: Suppose a user creates two categories, A and B. The component then reads the source data, which in this case consists of text files containing a set of the user's emails. Each email is a data sample. The component parses the text and runs the classification algorithm to place each email into the appropriate category. The component also lets the user select a category and see the information under it: on selecting category A or category B, the user should see the emails belonging to that category. The user can also choose to see the information aggregated by time.
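The example above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a small hand-written multinomial naive Bayes classifier in Python as a stand-in for whatever algorithm the design document prescribes, and the class name `NaiveBayesCategorizer`, the `tokenize` helper, and the training sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCategorizer:
    """Hypothetical stand-in classifier: multinomial naive Bayes, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training samples
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category in sorted(self.doc_counts):
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training samples for the two user-created categories A and B.
categorizer = NaiveBayesCategorizer()
categorizer.train("family wedding birthday celebration party", "A")
categorizer.train("cake and gifts at the birthday party", "A")
categorizer.train("quarterly budget meeting with the client", "B")
categorizer.train("project deadline and status meeting", "B")

first = categorizer.classify("We celebrated my birthday with the family")
second = categorizer.classify("Client meeting about the project budget")
print(first, second)
```

With this toy training data the first email lands in category A and the second in category B. A real submission would need far more training samples per category to reach useful accuracy.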

Some points to be considered here are as follows:

1.) When the user selects a category, we only want to show a summary of the information under that category. For example, if the user has 50 emails in that category, only a short summary of each email should be shown, with a link to the full data. This may require planning for two separate stores: one for summaries and one for the actual data.

2.) The time granularity should be month-wise.

3.) Currently, we do not want the component to actually connect to social profiles or email accounts. The data source will be simple text files stored locally. Each data sample in the text files will have a timestamp associated with it.

4.) The UI needs to be very simple. Allow the user to create and delete categories, and to select which category to view and the time frame to view it for. The time frame can have a granularity of year and month.
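Points 1, 2 and 4 above can be sketched together as two in-memory stores. This is a rough illustration only, not the layout from the design document: the class `ActivityStore`, its method names, and the 60-character summary cut-off are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

SUMMARY_LENGTH = 60  # assumed cut-off; the spec only asks for "a small summary"


class ActivityStore:
    """Hypothetical store that keeps summaries apart from the full text."""

    def __init__(self):
        self.full_text = {}                 # sample id -> complete text
        self.summaries = defaultdict(list)  # (category, year, month) -> [(id, summary)]

    def add(self, sample_id, category, posted, text):
        self.full_text[sample_id] = text
        if len(text) <= SUMMARY_LENGTH:
            summary = text
        else:
            summary = text[:SUMMARY_LENGTH].rstrip() + "..."
        self.summaries[(category, posted.year, posted.month)].append((sample_id, summary))

    def browse(self, category, year, month=None):
        """Return (id, summary) pairs for a category in a year, optionally one month."""
        results = []
        for (cat, y, m), entries in sorted(self.summaries.items()):
            if cat == category and y == year and (month is None or m == month):
                results.extend(entries)
        return results

    def read_full(self, sample_id):
        """Follow the 'link to full data' for one summary."""
        return self.full_text[sample_id]


store = ActivityStore()
store.add(1, "Family", date(2013, 8, 25),
          "It was the celebration of Joe's wedding today. All our family members had gathered.")
store.add(2, "Family", date(2013, 9, 2), "Dinner with parents.")
store.add(3, "Work", date(2013, 8, 26), "Budget review.")

august_family = store.browse("Family", 2013, 8)
all_family = store.browse("Family", 2013)
print(august_family)
print(store.read_full(1))
```

Keying the summaries by (category, year, month) keeps both views cheap: a month view reads one bucket, and a year view concatenates that year's buckets in calendar order.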

Data Source:

The training and test data sources need to be created by the members. Members can also use readily available data sets. The data set must be a text file containing data samples. Each data sample must be text representing an email, a blog post, a Twitter post, etc., and each must be tagged with a time.

Here is an example of a data sample:

It was the celebration of Joe's wedding today. All our family members had gathered. It was my birthday too. We all got together and had a blast while celebrating these double occasions. This is one of the most joyous day of my life. I will never want to forget this day.

Posted: Sunday, August 25, 2013
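A sample in this shape could be read along the following lines. The sketch assumes that every sample ends with a `Posted:` line in exactly the date style shown above and that month and weekday names are in English; the function name `parse_samples` is hypothetical, and the spec does not fix any of this.

```python
from datetime import datetime

POSTED_PREFIX = "Posted:"
DATE_FORMAT = "%A, %B %d, %Y"  # e.g. "Sunday, August 25, 2013" (assumes an English locale)


def parse_samples(raw):
    """Split raw file content into (text, date) pairs, one per "Posted:" line."""
    samples, body = [], []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith(POSTED_PREFIX):
            # The timestamp line closes the current sample.
            stamp = stripped[len(POSTED_PREFIX):].strip()
            posted = datetime.strptime(stamp, DATE_FORMAT).date()
            samples.append((" ".join(body), posted))
            body = []
        elif stripped:
            body.append(stripped)
    return samples


raw = """It was the celebration of Joe's wedding today. All our family members had gathered.

Posted: Sunday, August 25, 2013
"""

samples = parse_samples(raw)
text, posted = samples[0]
print(posted.year, posted.month)
```

The year and month of the parsed date are all the component needs for month-wise aggregation.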

This is just an example of a single data sample. You can have any number of data samples (the more the better for classification accuracy). Each sample's text can be as long as is available.

Documentation Provided:

A component design contest was held to create an implementation design using the above ideas. The final design is attached in the forums.

The idea for this application, along with potential algorithm choices, was developed through a recent idea generation contest. We have provided the documentation from that contest. (This document is for reference only and can be found under the design phase documents. Members are required to follow the design document whenever there is a choice to be made.)

Final Submission Guidelines

The component will work as a standalone POC application. Both the back end and the front end are covered in this component.

Please Note:

- The current design uses the WEKA library. It is a well-known ML library with many good features. If you want to use another library or method for classification, you are welcome to do so. Please ask in the forum as soon as you have any doubt or need clarification.

- The classification result is NOT the output of this application. Classification must be done in the background. The end user first provides the text file as the input source. The data samples are then classified in the background and stored under categories, as described in the spec and the design document. When the user then selects a particular category and time frame, a short text summary of the personal moments from that period is shown.

- Your classification accuracy will determine how good the results appear when the user selects a category. For example, if the user selects vacation as the category and sees details of business meetings instead, it is clear that the background classification was not good.
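One simple way to check background classification quality before the user ever sees a category is to score predictions against a hand-labeled test set. The sketch below is an assumption about how that check might look, not part of the spec or the design document; the function name `evaluate` and the sample labels are made up.

```python
from collections import Counter


def evaluate(predicted, actual):
    """Return overall accuracy and a Counter of (actual, predicted) mistakes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labeled sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    mistakes = Counter((a, p) for p, a in zip(predicted, actual) if p != a)
    return correct / len(actual), mistakes


# Made-up labels: one vacation sample was wrongly filed under business.
actual = ["vacation", "vacation", "business", "business"]
predicted = ["vacation", "business", "business", "business"]

accuracy, mistakes = evaluate(predicted, actual)
print(accuracy)
print(mistakes.most_common())
```

The mistake counts show which categories get confused with each other, which is exactly the vacation versus business-meeting problem described above.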

Personal Activities Aggregation Component

ID: 30035856