Topcoder Challenge | Topcoder Community

Challenge Overview

1. Project Overview

IBM wants to build a utility that takes the Watson answer metrics JSON files, extracts terms/entities from the questions and answers in each file, identifies relationships between the extracted entities (a person related to a place related to another person, etc.), and populates a graph database. This process will run incrementally: as questions and answers come in, they are added to the graph database.

2. Contest Overview

In this contest, we are looking for you to design the architecture for the Watson Question-Answer Metrics Graph system described below. The high-level application details are as follows:

Input Details:

- We have provided a sample JSON file containing a question and its answers (attached in the forums).

- Each file represents a question asked to Watson and the answers it provided, with their associated confidence and evidence.

- The answer metrics will reside in a folder within the Watson system. There will be sub-folders, each named by date stamp in YYYYMMDD format.

- Within each sub-folder there will be several of these answer metrics files.

- The final deliverable must be able to iterate through all the sub-folders and files.

- Within the file, "QUESTION>uima.tcas.DocumentAnnotation.targets.SOFA" is the question that was asked to Watson.

- Within the file, the array "QUESTION>com.ibm.bluej.model.CandidateAnswerVariant.targets.text" holds the answers Watson came back with. For this effort, consider only the top-5 answers, that is, array indices 0-4.

- Please note: The question and its answers are the only items you need to extract from the answer metrics file. You can ignore the rest.
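The folder layout and the top-5 rule above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the required design: the class name `AnswerMetricsScanner`, its method names, and the assumption that answer metrics files end in `.json` are all illustrative. JSON parsing is deliberately left out, so `topAnswers` works on a list of answer texts that has already been extracted.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: walks <root>/<YYYYMMDD>/*.json in date order.
class AnswerMetricsScanner {

    // True only for 8-digit names that form a real calendar date (YYYYMMDD).
    static boolean isDateFolder(String name) {
        if (!name.matches("\\d{8}")) {
            return false;
        }
        try {
            LocalDate.parse(name, DateTimeFormatter.BASIC_ISO_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Lists every *.json file under every date-named sub-folder of root.
    // Assumption: answer metrics files carry a .json extension.
    static List<Path> listAnswerFiles(Path root) throws IOException {
        List<Path> folders = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root)) {
            for (Path p : dirs) {
                if (Files.isDirectory(p) && isDateFolder(p.getFileName().toString())) {
                    folders.add(p);
                }
            }
        }
        Collections.sort(folders); // YYYYMMDD names sort chronologically

        List<Path> files = new ArrayList<>();
        for (Path folder : folders) {
            List<Path> inFolder = new ArrayList<>();
            try (DirectoryStream<Path> jsons = Files.newDirectoryStream(folder, "*.json")) {
                for (Path f : jsons) {
                    if (Files.isRegularFile(f)) {
                        inFolder.add(f);
                    }
                }
            }
            Collections.sort(inFolder);
            files.addAll(inFolder);
        }
        return files;
    }

    // Keeps at most `limit` leading answers (indices 0..limit-1).
    static List<String> topAnswers(List<String> answers, int limit) {
        return new ArrayList<>(answers.subList(0, Math.min(limit, answers.size())));
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : listAnswerFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Sorting the folders by name keeps processing in chronological order, which suits the incremental loading described in the project overview. A production version would also need to remember which files were already ingested.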

Expected Output Details:

- For the combined set of extracted questions and associated answers, construct a graph. This graph should be populated in a graph database.

- Nodes in the graph represent the various concepts / terms extracted.

- Concepts / terms can include, for example: PEOPLE, DATE, DURATION, PERSON, ORGANIZATION, MEDICINE, DRUG, SUBSTANCE, DISEASE, CITY, LANGUAGE, etc.

- The graph will provide a view of the instances of the concepts, e.g. Woman, Julie Smith, June 10th, 2 hours, ACME Hospital, Tylenol, Alcohol, Depression, Detroit, English, etc.

- Edges represent the links between terms / concepts. (An example visual representation of a graph is provided in the forums. It is only an example; the actual visual representation may depend on UI capabilities.)

- You will need to consider the entirety of all the questions and top-5 answers for each of those questions.

- You can choose to render one graph per question with its answers, and also one graph covering all questions and all answers.
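To make the node and edge model above concrete, here is a minimal in-memory sketch in Java. It is an assumption-laden illustration, not the target design and not the graph database's API: the class `ConceptGraph`, its key scheme (`TYPE:value`, case-insensitive), and the use of undirected edges are all hypothetical choices. It shows one property the incremental requirement implies: re-adding a concept instance that is already in the graph must not create a duplicate node.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the graph store.
class ConceptGraph {
    // node key -> display text as first seen
    private final Map<String, String> labels = new LinkedHashMap<>();
    // node key -> keys of linked nodes (undirected)
    private final Map<String, Set<String>> adjacency = new LinkedHashMap<>();
    private int edges = 0;

    // Assumed identity rule: same concept type + same text (ignoring case and
    // surrounding whitespace) means the same node.
    static String key(String type, String value) {
        return type.trim().toUpperCase(Locale.ROOT) + ":" + value.trim().toLowerCase(Locale.ROOT);
    }

    // Get-or-create: safe to call repeatedly as new files arrive.
    String addNode(String type, String value) {
        String k = key(type, value);
        labels.putIfAbsent(k, value.trim());
        adjacency.putIfAbsent(k, new LinkedHashSet<>());
        return k;
    }

    // Links two existing nodes; returns false for self-loops, unknown nodes,
    // or an edge that is already present.
    boolean addEdge(String a, String b) {
        if (a.equals(b) || !adjacency.containsKey(a) || !adjacency.containsKey(b)) {
            return false;
        }
        boolean added = adjacency.get(a).add(b);
        adjacency.get(b).add(a);
        if (added) {
            edges++;
        }
        return added;
    }

    int nodeCount() { return adjacency.size(); }

    int edgeCount() { return edges; }

    Set<String> neighbors(String k) {
        return Collections.unmodifiableSet(adjacency.getOrDefault(k, Collections.emptySet()));
    }

    public static void main(String[] args) {
        ConceptGraph g = new ConceptGraph();
        String person = g.addNode("PERSON", "Julie Smith");
        String org = g.addNode("ORGANIZATION", "ACME Hospital");
        String city = g.addNode("CITY", "Detroit");
        g.addEdge(person, org); // illustrative links only
        g.addEdge(org, city);
        g.addNode("person", "julie smith"); // same node, no duplicate
        System.out.println(g.nodeCount() + " nodes, " + g.edgeCount() + " edges");
        System.out.println(g.neighbors(org));
    }
}
```

In a real deployment the same get-or-create behavior would typically come from a uniqueness index in the graph database rather than from an in-memory map.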

Please note: The graph DB and parser potentially need to handle a very large volume of data: millions of questions and answers. Hence, performance is the key issue to take into consideration when choosing the graph database.

Tips for Success:

Asking questions early and getting feedback is very important for the success of this competition.

3. UI capabilities

We expect very simple UI support that allows us to query the graph and retrieve the results. We see two options for this:

1.) Use the out-of-the-box UI capabilities provided by the graph database that stores the graph. In that case, provide details on how the end user can use those capabilities.

2.) Let the end user query the graph and retrieve the results through APIs, in which case well-defined API operations (with examples) must be exposed.
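For the API option, the kind of "well-defined operations" meant here might look like the sketch below. Everything in it is hypothetical: the interface name `GraphQuery`, the two operations, and the in-memory implementation are illustrative assumptions, and a real implementation would delegate to the chosen graph database's query facilities instead of a `Map`. The `path` operation uses a plain breadth-first search to answer "how is term A related to term B" with the fewest hops.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GraphQueryDemo {

    // Hypothetical query operations an end user could call.
    interface GraphQuery {
        // Terms directly linked to the given term.
        Set<String> neighbors(String term);

        // Shortest chain of terms from `from` to `to`, inclusive; empty if none.
        List<String> path(String from, String to);
    }

    // In-memory implementation used only to make the sketch runnable.
    static class InMemoryGraphQuery implements GraphQuery {
        private final Map<String, Set<String>> adjacency = new HashMap<>();

        void link(String a, String b) {
            adjacency.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
            adjacency.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
        }

        @Override
        public Set<String> neighbors(String term) {
            return Collections.unmodifiableSet(
                    adjacency.getOrDefault(term, Collections.emptySet()));
        }

        @Override
        public List<String> path(String from, String to) {
            if (!adjacency.containsKey(from) || !adjacency.containsKey(to)) {
                return Collections.emptyList();
            }
            Map<String, String> parent = new HashMap<>(); // visited -> predecessor
            Deque<String> queue = new ArrayDeque<>();
            parent.put(from, null);
            queue.add(from);
            while (!queue.isEmpty()) {
                String current = queue.poll();
                if (current.equals(to)) {
                    LinkedList<String> result = new LinkedList<>();
                    for (String n = to; n != null; n = parent.get(n)) {
                        result.addFirst(n); // walk predecessors back to the start
                    }
                    return result;
                }
                for (String next : adjacency.get(current)) {
                    if (!parent.containsKey(next)) {
                        parent.put(next, current);
                        queue.add(next);
                    }
                }
            }
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        InMemoryGraphQuery q = new InMemoryGraphQuery();
        q.link("Julie Smith", "ACME Hospital"); // illustrative links only
        q.link("ACME Hospital", "Detroit");
        System.out.println(q.neighbors("ACME Hospital"));
        System.out.println(q.path("Julie Smith", "Detroit"));
    }
}
```

Keeping the operations behind an interface lets the architecture swap the in-memory stand-in for a database-backed implementation without changing callers.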

4. Technology Overview

1. The programming language to be used is Java.

2. We want you to use Apache Titan as the graph database store. This is also preferred by clients. That said, if you find that a required feature cannot be implemented using Apache Titan, please open a discussion in the forum immediately and suggest another graph DB that you would recommend to cover the missing features (if any). Please make sure that any such tool is open source and freely usable. As mentioned earlier, performance is a key issue to consider because of the large amount of data.

3. As mentioned earlier, if the graph database does not provide out-of-the-box capabilities to query the constructed graph, you are allowed to use APIs to query the graph and retrieve the results. In this case, well-defined API operations (with examples) must be exposed.

5. Resources Provided

The following resources have been provided in the forums. You will be able to access them after registration:

- Input answer metrics file: 1089909798_03072014_03_12_01.json

- Example visual representation of graph.

Final Submission Guidelines

We want you to submit the following deliverables:

Application Design Specification
Details about Graph Database and corresponding UI usage
Sequence Diagrams
Interface Diagrams
ER Diagram or Object Model representation of the Graph Database
Assembly Specifications
A sample JSON format (such as GraphSON, used by popular graph databases) showing how the sample question-answer format can be converted into a graph-ready import format.
Requirements Implementation Mapping (Treat this as an informal requirement mapping: take the requirement points from the spec and show that they are satisfied.)
Implementation artifacts with testable components such as JUnit test cases (UI, JARs, etc.) [This one is for the downstream assembly challenge and is not a requirement for this architecture.]
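
As an illustration of what the graph-ready import format deliverable could look like, the Java sketch below emits a small GraphSON-style document. It is a sketch under assumptions, not a definitive format: the layout (a `graph` object with `vertices` and `edges` arrays and the reserved keys `_id`, `_type`, `_outV`, `_inV`, `_label`) follows the TinkerPop 2.x GraphSON convention as best understood here and should be verified against the graph database version actually chosen. The class `GraphSonSketch`, the property names `conceptType` and `text`, and the edge label `RELATED_TO` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder for a GraphSON-style import document.
class GraphSonSketch {
    private final List<String> vertices = new ArrayList<>();
    private final List<String> edges = new ArrayList<>();

    // Minimal JSON string escaping for quotes, backslashes, and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Adds a vertex and returns its id (ids start at 1).
    int addVertex(String conceptType, String text) {
        int id = vertices.size() + 1;
        vertices.add("{\"_id\":" + id + ",\"_type\":\"vertex\",\"conceptType\":"
                + quote(conceptType) + ",\"text\":" + quote(text) + "}");
        return id;
    }

    // Adds a directed edge from vertex outV to vertex inV.
    void addEdge(int outV, int inV, String label) {
        int id = edges.size() + 1;
        edges.add("{\"_id\":" + id + ",\"_type\":\"edge\",\"_outV\":" + outV
                + ",\"_inV\":" + inV + ",\"_label\":" + quote(label) + "}");
    }

    String toJson() {
        return "{\"graph\":{\"mode\":\"NORMAL\",\"vertices\":["
                + String.join(",", vertices) + "],\"edges\":["
                + String.join(",", edges) + "]}}";
    }

    public static void main(String[] args) {
        GraphSonSketch g = new GraphSonSketch();
        int person = g.addVertex("PERSON", "Julie Smith");
        int city = g.addVertex("CITY", "Detroit");
        g.addEdge(person, city, "RELATED_TO"); // illustrative link only
        System.out.println(g.toJson());
    }
}
```

A real submission would more likely produce this document with a JSON library; string concatenation is used here only to keep the sketch free of dependencies.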

You can refer here for more details on the templates and other information related to the required documents listed above.

IBM Watson Question-Answer Metrics Graph Architecture

ID: 30044395