Challenge Overview

In this challenge we want to implement a utility Java class to be called by other Java code to set up and load data into Google BigQuery (https://developers.google.com/bigquery/).

The utility Java class will have methods to perform the following tasks:

  • Create a dataset in the specified Google Developers project. Information about datasets can be found here: https://developers.google.com/bigquery/what-is-bigquery#datasets and https://developers.google.com/bigquery/docs/reference/v2/datasets
  • Create a table in the specified dataset and project. Information about tables can be found here: https://developers.google.com/bigquery/what-is-bigquery#tables and https://developers.google.com/bigquery/docs/reference/v2/tables
  • Load the specified JSON data file into the specified table under a dataset of a project. Loading the data should be done via a Google BigQuery job. Refer to https://developers.google.com/bigquery/what-is-bigquery#jobs and https://developers.google.com/bigquery/loading-data-into-bigquery.

Detailed Requirements

The Google BigQuery API should be used for all these tasks; the API documentation is here: https://developers.google.com/bigquery/docs/reference/v2/. The Java client library shall be used: https://developers.google.com/bigquery/client-libraries and https://developers.google.com/api-client-library/java/apis/bigquery/v2

  1. The class should act as a service wrapper which is easy to use from other Java code.
    • The authorization and other configurable parameters should be put into a properties file.
    • The class should provide error-handling and logging.
    • Use the package com.topcoder.utilities.bigqueryload
  2. Create Dataset: Provide a wrapper method in this class for handling the create dataset task; the method should call the corresponding API using the Google Java API client libraries (a sketch of these wrapper methods follows this list).
  3. Create Table: Provide a wrapper method in this class for handling the create table task; the method should call the corresponding API using the Google Java API client libraries.
    • Create table should support an optional table schema at creation time. The table schema should be read from a specified JSON file.
  4. Load the JSON data file into the destination table as an upload job. The destination table is specified by the PROJECT_ID, DATASET_ID and TABLE_ID.
    • The JSON data file can be very large (up to gigabytes), so please check the API to see whether we can gzip it before posting the data.
    • An optional table schema can be specified, as in create table, in case the destination table does not have any schema defined (i.e. an empty table).
    • The job execution can take time, so please provide a way to poll the status of the job periodically.
    • The result of the loading should be returned and logged so the code that uses this utility can handle the error and success cases properly.
  5. Write a demo class which calls the utility and tests all the tasks above. Both success and error cases should be tested.
  6. An Ant build file should be used to compile, build and run the demo.
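
A minimal sketch of how such a wrapper could look with the v2 Java client library (com.google.api.services.bigquery) is shown below. The class and method names, the 5-second polling interval, and the assumption that the schema file contains a TableSchema JSON object are illustrative only, not part of the specification; error handling, logging and the properties-driven configuration are omitted for brevity. Whether the uploaded payload may be gzip-compressed still needs to be verified against the API, as required above.

package com.topcoder.utilities.bigqueryload;

import com.google.api.client.http.FileContent;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Illustrative sketch of the utility wrapper. The Bigquery client is assumed to be
 * built elsewhere (see the authorization sketch in the next section).
 */
public class BigQueryLoadUtility {

    private static final JsonFactory JSON_FACTORY = new JacksonFactory();

    private final Bigquery bigquery;
    private final String projectId;

    public BigQueryLoadUtility(Bigquery bigquery, String projectId) {
        this.bigquery = bigquery;
        this.projectId = projectId;
    }

    /** Creates a dataset in the configured project. */
    public Dataset createDataset(String datasetId) throws IOException {
        Dataset dataset = new Dataset().setDatasetReference(
                new DatasetReference().setProjectId(projectId).setDatasetId(datasetId));
        return bigquery.datasets().insert(projectId, dataset).execute();
    }

    /** Creates a table, optionally with a schema read from a JSON file. */
    public Table createTable(String datasetId, String tableId, File schemaFile) throws IOException {
        Table table = new Table().setTableReference(new TableReference()
                .setProjectId(projectId).setDatasetId(datasetId).setTableId(tableId));
        if (schemaFile != null) {
            table.setSchema(readSchema(schemaFile));
        }
        return bigquery.tables().insert(projectId, datasetId, table).execute();
    }

    /** Loads a newline-delimited JSON file into the destination table via a load job. */
    public Job loadJson(String datasetId, String tableId, File dataFile, File schemaFile)
            throws IOException, InterruptedException {
        JobConfigurationLoad load = new JobConfigurationLoad()
                .setDestinationTable(new TableReference()
                        .setProjectId(projectId).setDatasetId(datasetId).setTableId(tableId))
                .setSourceFormat("NEWLINE_DELIMITED_JSON");
        if (schemaFile != null) {
            load.setSchema(readSchema(schemaFile));
        }
        Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));
        // Gzip of the uploaded payload is a point to research, per the requirements.
        FileContent content = new FileContent("application/octet-stream", dataFile);
        Job submitted = bigquery.jobs().insert(projectId, job, content).execute();

        // Poll the job status periodically until it is done.
        String jobId = submitted.getJobReference().getJobId();
        while (true) {
            Job polled = bigquery.jobs().get(projectId, jobId).execute();
            if ("DONE".equals(polled.getStatus().getState())) {
                return polled; // caller inspects getStatus().getErrorResult()
            }
            Thread.sleep(5000);
        }
    }

    /** Parses a TableSchema ({"fields": [...]}) from a JSON file. */
    private TableSchema readSchema(File schemaFile) throws IOException {
        return JSON_FACTORY.fromInputStream(new FileInputStream(schemaFile), TableSchema.class);
    }
}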

Google Big Query Access

Google BigQuery requires billing to be set up in order to use the API, so we created a test project (PROJECT_ID: tc-data-accessibility-30044673) under our account. You need to request access in the challenge forum.

After that, you need to create an OAuth 2.0 client ID (service account) to be able to access the project using the API. Please follow https://developers.google.com/bigquery/authorization#service-accounts
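
As one possible approach, a service-account credential with a P12 key downloaded from the Google Developers Console can be used to build the authorized Bigquery client that the utility wraps. This is a sketch only; the service account email, key file path and application name are placeholders that would normally come from the properties file.

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;

import java.io.File;
import java.util.Collections;

public class BigQueryClientFactory {

    /** Builds an authorized Bigquery client from a service-account email and P12 key file. */
    public static Bigquery create(String serviceAccountEmail, String p12KeyPath) throws Exception {
        HttpTransport transport = new NetHttpTransport();
        JsonFactory jsonFactory = new JacksonFactory();
        GoogleCredential credential = new GoogleCredential.Builder()
                .setTransport(transport)
                .setJsonFactory(jsonFactory)
                .setServiceAccountId(serviceAccountEmail)
                .setServiceAccountScopes(Collections.singleton(BigqueryScopes.BIGQUERY))
                .setServiceAccountPrivateKeyFromP12File(new File(p12KeyPath))
                .build();
        return new Bigquery.Builder(transport, jsonFactory, credential)
                .setApplicationName("tc-bigquery-load")
                .build();
    }
}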

Testing of the Utility

You need to define a comprehensive table schema in JSON format for testing. Then write a small utility class to generate a large enough JSON data file (NOT exceeding 20M) for testing (a sketch of such a generator appears after the issues below).

For table schema definition, refer to https://developers.google.com/bigquery/docs/tables and https://developers.google.com/bigquery/docs/reference/v2/tables#schema.fields
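
For illustration only, a schema file could look like the following; the field names and types here are invented, and your schema should be comprehensive enough to exercise the issues below:

{
  "fields": [
    {"name": "id",         "type": "INTEGER",   "mode": "REQUIRED"},
    {"name": "first_name", "type": "STRING",    "mode": "NULLABLE"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "score",      "type": "FLOAT",     "mode": "NULLABLE"}
  ]
}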

For your test data, we want you to research how to handle the following issues we found during the manual load:

Issue #1: Date and time field values that are null are not accepted by BigQuery, so we need to see how those records can be loaded into BigQuery.
Currently, the data load is done manually via a Ruby script; it checks whether the date and time value is null and, if so, skips loading that record.

Issue #2: Delimiter issue. If a field value contains the delimiter (say the delimiter is a quote and the actual field value contains a quote), it needs to be handled properly.
Currently, those field values have been updated so that we can get rid of the errors and have a safe load. We can also specify the number of errors that are acceptable during a data load, so that count has been raised to a higher value (100) so the load can complete.

Issue #3: Encoding issue. Certain fields, such as the user first name, contain characters like \u255. These are not allowed by BigQuery and result in errors, e.g. Gamal Mateo[\u221]rdenas.
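
One way the test-data generator and the handling of these issues could look is sketched below. The file name, record count and field names (matching the illustrative schema above) are assumptions; skipping null-date records mirrors the current Ruby behaviour, and the escaping choices are suggestions to research rather than requirements.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;

/** Hypothetical generator for newline-delimited JSON test data. */
public class TestDataGenerator {

    public static void main(String[] args) throws IOException {
        Random random = new Random();
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("test-data.json"), StandardCharsets.UTF_8)) {
            for (int i = 0; i < 100000; i++) {
                // Issue #1: mirror the current Ruby behaviour and skip records whose
                // date/time value is null (alternatively, research emitting JSON null).
                String createdAt = random.nextInt(10) == 0 ? null : "2014-05-01 12:00:00";
                if (createdAt == null) {
                    continue;
                }
                // Issues #2 and #3: escape quotes and non-ASCII characters so the JSON
                // stays well-formed and the load avoids delimiter/encoding errors.
                String firstName = escape("Gamal \"Mateo\" C\u00e1rdenas " + i);
                out.write(String.format(Locale.US,
                        "{\"id\": %d, \"first_name\": \"%s\", \"created_at\": \"%s\", \"score\": %.2f}%n",
                        i, firstName, createdAt, random.nextDouble() * 100));
            }
        }
    }

    /** Escapes quotes, backslashes and non-ASCII characters for a JSON string value. */
    private static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c > 127) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}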

Final Submission Guidelines

  • The whole directory archive with all the files, including the library files, testing data, etc. Please use 7zip if the archive is too large.
  • A detailed deployment guide with verification steps.

ELIGIBLE EVENTS:

2015 topcoder Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30044673