Contest: Spoken Languages 2
Problem: SpokenLanguages2

Problem Statement

    

Competitors with the best 5 submissions in this contest according to system test results will receive the following prizes:

Prize   USD
1st     $10,000
2nd      $5,000
3rd      $2,500
4th      $1,500
5th      $1,000

Bonus Prize - $100 per winning submission

We will ask the 5 winners to submit a Docker file for their solution and will provide guidance in that regard. We will pay the winners who submit a Docker file an additional $100 each for their efforts.

Introduction

This contest builds upon one we ran recently. It uses the same format that allows submissions to employ any language or open source libraries that meet the licensing requirements specified below in the "Final Code Submission" section.

Requirements to Win a Prize

In order to receive a prize, you must do all the following:

  1. Achieve a score in the top 5, according to system test results calculated using the Contestant Test Data. See the "Data Description" and "Scoring" sections below.
  2. Create an algorithm that reads in a single 10-second audio recording and outputs its results in under 60 seconds when running on an Amazon Web Services m4.xlarge virtual machine.
  3. Within 7 days from the announcement of the contest winners, submit:
    • A complete report at least 2 pages long, outlining your final algorithm and explaining the logic behind it and the steps in its approach. More details appear in the "Report" section below.
    • All code used in your final algorithm in 1 appropriately named file (or tar or zip archive). The file should include your final model, saved so that it can make predictions without being retrained. Prospective winners will be contacted for this.

If you place in the top 5 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all the above.
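
The 60-second limit in requirement 2 can be checked informally before submitting. The following is a minimal sketch, assuming your model is exposed as a callable named `predict` (a hypothetical name, not part of the contest specification). Timings on your own machine only approximate those on an m4.xlarge instance.

```python
import time

TIME_LIMIT_SECONDS = 60.0  # per-recording limit stated in requirement 2


def timed_prediction(predict, audio_path):
    """Run predict(audio_path) and report how long it took.

    `predict` is a hypothetical callable standing in for your own model.
    Returns (result, elapsed_seconds, within_limit).
    """
    start = time.perf_counter()
    result = predict(audio_path)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < TIME_LIMIT_SECONDS


# Placeholder predictor: ignores the audio and returns fixed guesses.
def dummy_predict(path):
    return ["LanguageA", "LanguageB", "LanguageC"]


guesses, seconds, ok = timed_prediction(dummy_predict, "fileName1.mp3")
print(guesses, f"{seconds:.4f}s", "within limit" if ok else "too slow")
```

Measure the whole path from reading the .mp3 file to producing the ranked output, since the requirement covers both steps.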

Background

Faith Comes by Hearing ("FCBH") is dedicated to spreading the message of the Bible across the globe. Over the years, FCBH has observed that many people cannot read, or live in oral communities. The organization wishes to allow as many people as possible to hear the Bible in their native languages.

Objective

To enable this, FCBH seeks algorithms that can correctly identify the languages spoken in audio recordings. Given speech data from multiple languages, your algorithm must identify which languages are spoken.

Data Description

You will receive both a training data set ("Contestant Training Data") and a test data set ("Contestant Test Data"). Both data sets contain recorded speech in 176 languages ("Possible Languages Spoken"). A separate .mp3 file stores each speech recording, and only 1 language is spoken in each file.

The Contestant Training Data contains:

  • 66,176 files, each of which contains approximately 10 seconds of speech recorded in 1 of the 176 Possible Languages Spoken ("Speech Files").
  • The identity of the actual language spoken in each Speech File ("Actual Language Spoken").

The Contestant Test Data contains:

  • 12,320 different Speech Files, each containing approximately 10 seconds of speech recorded in 1 of the 176 Possible Languages Spoken.

There is no indication of which Possible Language Spoken in each Contestant Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled).

Algorithm Output Format Requirement

Your algorithm must output 3 records for each Speech File. Each record must contain 3 comma-separated fields in the following order:

  1. The Speech File name ("speechFile")
  2. The Possible Language Spoken in the Speech File ("possibleLang")
  3. A rank from 1 to 3 indicating how likely it is that this Possible Language Spoken is the Actual Language Spoken in that Speech File, with 1 being most likely and 3 being least likely ("langRank"). You must use each rank exactly once for a given Speech File. Tied ranks are not allowed.
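
The field and rank rules above can be checked mechanically before uploading a results file. The following is a minimal sketch using only the Python standard library; the function name `validate_output` is illustrative, and the official checker may apply additional rules not covered here.

```python
import csv
import io
from collections import defaultdict


def validate_output(csv_text):
    """Return a list of problems found in the prediction CSV text (empty if none)."""
    ranks_by_file = defaultdict(list)
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 fields, got {len(row)}")
            continue
        speech_file, _possible_lang, lang_rank = (field.strip() for field in row)
        ranks_by_file[speech_file].append(lang_rank)
    # Each Speech File needs exactly 3 records using ranks 1, 2 and 3 once each.
    for speech_file, ranks in ranks_by_file.items():
        if sorted(ranks) != ["1", "2", "3"]:
            problems.append(f"{speech_file}: ranks must be 1, 2, 3 once each, got {ranks}")
    return problems
```

An empty list means every Speech File in the text has exactly 3 well-formed records with no repeated or missing rank.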

An example output appears below for 2 Speech Files:

   
    fileName1.mp3,LanguageA,1
    fileName1.mp3,LanguageB,3
    fileName1.mp3,LanguageC,2
    fileName2.mp3,LanguageG,3
    fileName2.mp3,LanguageT,1
    fileName2.mp3,LanguageB,2
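
Records in this format can be produced from per-language model scores. The following is a minimal sketch, assuming your model yields a dictionary of scores per Speech File in which a higher score means more likely, and that at least 3 languages are scored. The names `top3_records` and `write_predictions` are illustrative and not part of the contest specification.

```python
import csv
import io


def top3_records(speech_file, scores):
    """Build the 3 output records for one Speech File.

    `scores` maps each language name to a model score, where a higher score
    means more likely (an assumption about your model, not part of the spec).
    """
    best = sorted(scores, key=scores.get, reverse=True)[:3]
    return [(speech_file, lang, rank) for rank, lang in enumerate(best, start=1)]


def write_predictions(all_scores, out):
    """Write records for every file; `all_scores` maps file name -> scores dict."""
    writer = csv.writer(out, lineterminator="\n")
    for speech_file, scores in all_scores.items():
        writer.writerows(top3_records(speech_file, scores))


# Usage with made-up scores for a single file.
buffer = io.StringIO()
write_predictions(
    {"fileName1.mp3": {"LanguageA": 0.7, "LanguageB": 0.05,
                       "LanguageC": 0.2, "LanguageD": 0.01}},
    buffer,
)
print(buffer.getvalue())
```

Because ranks come from a sorted position, each rank is used exactly once per file and ties cannot appear in the output.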

Functions

During the contest, only your results will be submitted. You will submit code which implements only one function, getURL(). Your function will return a String corresponding to the URL of your answer .csv file. You may upload your .csv file to a cloud hosting service such as Dropbox which can provide a direct link to your .csv file.

To create a direct sharing link in Dropbox, right click on the uploaded file and select share. You should be able to copy a link to this specific file which ends with the tag "?dl=0". This URL will point directly to your file if you change this tag to "?dl=1". You can then use this link in your getURL() function.

Another common option is Google Drive. If you choose it, use the following format to create a direct sharing link: "https://drive.google.com/uc?export=download&id=" + id, where id is the Google Drive ID of your file.

You may share your result file in any other way, but make sure the link you provide opens the file stream directly.
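
The link conversions described above can be sketched in a few lines. This is a minimal Python illustration, assuming a Python submission exposes `getURL` as a method of a class named `SpokenLanguages2`, mirroring the Definition section. The helper names and the Dropbox URL are placeholders, not real links.

```python
def dropbox_direct_link(share_url):
    """Turn a Dropbox sharing link ending in ?dl=0 into a direct download link."""
    suffix = "?dl=0"
    if share_url.endswith(suffix):
        return share_url[: -len(suffix)] + "?dl=1"
    return share_url  # leave other links unchanged


def google_drive_direct_link(file_id):
    """Build the direct link format given in the problem statement."""
    return "https://drive.google.com/uc?export=download&id=" + file_id


class SpokenLanguages2:
    def getURL(self):
        # Placeholder URL: replace it with the sharing link to your own .csv file.
        return dropbox_direct_link(
            "https://www.dropbox.com/s/EXAMPLE_ID/answers.csv?dl=0"
        )
```

Open the returned URL in a private browser window before submitting, to confirm that it downloads the .csv file directly rather than showing a preview page.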

Your complete code that generates these results will be tested at the end of the contest.

Hints

The following links may provide helpful background information on previous, similar work. Sections "IV. Acoustic-Phonetic Approaches" and "V. Topics in System Developments" of Spoken Language Recognition: From Fundamentals to Practice by H. Li et al. (2013) could be particularly useful.

National Institute of Standards and Technology (NIST) Language Recognition Evaluation 2009

NIST Language Recognition Evaluation 2011

You may find the techniques, tools, and research papers in the following section useful.

Technique: Language Identification (Weka)

Type: Research Paper

Reference: https://www.sussex.ac.uk/webteam/gateway/file.php?name=hosford-proj.pdf&site=20

Notes: Uses mel-frequency cepstral coefficients (MFCCs), pitch contour, and normalized pairwise variability index to generate coefficients that can be used in a feature space and algorithms implemented in Weka

Technique: Spoken Language Classification

Type: Research Paper

Reference: http://cs229.stanford.edu/proj2012/DeMoriFaizullahKhanHoltPruisken-SpokenLanguageClassification.pdf

Notes: Uses MFCCs, variance, and standard deviation to generate features for analysis, and support vector machines, neural networks, and Gaussian mixture models for language identification

Technique: Voice Pattern Designs

Type: Research Paper

Reference: http://arxiv.org/pdf/1502.06811v1.pdf

Notes: Uses MFCC, Spectral Energy Peak (SEP), Spectral Band Energy (SBE), Spectral Flatness Measure (SFM), and Spectral Centroid (SC)

Technique: Audio Processing

Type: Library

Reference: https://www.idiap.ch/software/bob/docs/releases/last/sphinx/html/TutorialsAP.html

Notes: Bob is a Python-based signal processing and machine learning toolbox developed at the Idiap Research Institute; the linked tutorial covers audio processing

Technique: Machine Learning through Audio Analysis

Type: Library

Reference: https://github.com/tyiannak/pyAudioAnalysis

Notes: pyAudioAnalysis is a Python library covering a wide range of audio analysis tasks, including feature extraction, classification, segmentation, and visualization

Technique: Various

Type: Links to National Institute of Standards and Technology (NIST) language recognition evaluation tools

Reference: http://www.itl.nist.gov/iad/mig/tools/

Notes: NIST ran language recognition evaluations in 2009 and 2011 that are relevant

Scoring

We will run both provisional and system tests using the same submission, but you will not know which Speech Files are used for which test.

Your algorithm's performance will be quantified as follows.

For each Speech File,
If possibleLang = Actual Language Spoken and langRank = 1, then 1.00 points
Else if possibleLang = Actual Language Spoken and langRank = 2, then 0.40 points
Else if possibleLang = Actual Language Spoken and langRank = 3, then 0.16 points
Else 0.00 points

Score = 1,000 * Total points for all Speech Files in the Contestant Test Data

The maximum possible total scores for example, provisional, and system testing are 0, 3,520,000, and 8,800,000, respectively.
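
The scoring rule can be reproduced locally on a labeled hold-out set carved from the training data. The following is a minimal sketch, assuming predictions are available as (speechFile, possibleLang, langRank) tuples that already satisfy the output format rules. The function name `score_predictions` is illustrative, and this is not the official scorer.

```python
POINTS_BY_RANK = {1: 1.0, 2: 0.40, 3: 0.16}


def score_predictions(records, actual):
    """Compute the contest score for a set of predictions.

    records: iterable of (speechFile, possibleLang, langRank) tuples.
    actual:  dict mapping speechFile -> Actual Language Spoken.
    """
    total = 0.0
    for speech_file, possible_lang, lang_rank in records:
        # Only the record naming the actual language earns points.
        if actual.get(speech_file) == possible_lang:
            total += POINTS_BY_RANK.get(int(lang_rank), 0.0)
    return 1000 * total


# Usage with the example output and made-up labels.
records = [
    ("fileName1.mp3", "LanguageA", 1),
    ("fileName1.mp3", "LanguageB", 3),
    ("fileName1.mp3", "LanguageC", 2),
    ("fileName2.mp3", "LanguageG", 3),
    ("fileName2.mp3", "LanguageT", 1),
    ("fileName2.mp3", "LanguageB", 2),
]
actual = {"fileName1.mp3": "LanguageA", "fileName2.mp3": "LanguageB"}
print(round(score_predictions(records, actual)))
```

With these labels, the first file earns 1.0 points (correct at rank 1) and the second earns 0.40 points (correct at rank 2), for a score of about 1,400.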

If there are ties in the final system test results, then we will break them using algorithm run time on the Amazon Web Services m4.xlarge virtual machine described above. However, we will not measure algorithm run time or break any ties during the contest.

Report

Your report must be at least 2 pages long, contain at least the following sections, and use the section names below.

Contact Information

  • First Name
  • Last Name
  • Topcoder handle
  • Email address
  • Final code submission file name

Approach Used

Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

  • Approaches considered
  • Approach ultimately chosen
  • Steps to approach ultimately chosen, including references to specific lines of code
  • Advantages and disadvantages of the approach chosen
  • Comments on libraries
  • Special guidance given to algorithm based on training
  • Potential Algorithm Improvements

Final Code Submission

You must submit all code used in your algorithm. You may use any programming language you like, provided it is free and open source for both you and the client to use. If you use anything other than the standard Python/gcc/Java install that our testers usually use, please include installation instructions for that language, and likewise for any additional libraries and packages.

You must submit evidence that your code runs successfully to completion. The code will be run on CentOS 6.5 x86_64 HVM.

Data File Downloads

 

Definition

    
Class: SpokenLanguages2
Method: getURL
Parameters:
Returns: String
Method signature: String getURL()
(be sure your method is public)
    
 

Notes

-In the evaluation after the contest, the submitted code must actually return results by processing the data files, without manually hard-coded results. Specifically, we should be able to execute the submitted code using the original contest data and obtain results similar to those submitted during the contest.
-Usage of additional data is acceptable, provided it is freely available in the same manner as code/libraries that are used. Also, any additional data should likewise be submitted with the code after the contest (in the case of a top 5 finish). The submitted code with additional data will be subject to the same aforementioned requirement of being able to produce the results submitted during the contest.
-2/7 of the Speech Files in the Contestant Test Data are used for provisional testing, and 5/7 are used for system testing. This works out to *approximately* 20 and 50 files per language (but not exactly, since the split was chosen randomly).
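
The figures in the note above, and the maximum scores quoted in the "Scoring" section, follow from simple arithmetic. The following sketch works through it, assuming the 2/7 and 5/7 split applies to the 12,320 Speech Files in the Contestant Test Data and that every file earns the full 1.0 points.

```python
from fractions import Fraction

TEST_FILES = 12_320       # Speech Files in the Contestant Test Data
LANGUAGES = 176           # Possible Languages Spoken
SCORE_MULTIPLIER = 1_000  # Score = 1,000 * total points

provisional_files = int(TEST_FILES * Fraction(2, 7))  # 3520
system_files = int(TEST_FILES * Fraction(5, 7))       # 8800
files_per_language = TEST_FILES // LANGUAGES          # 70

# Maximum scores if every file is identified correctly at rank 1.
print(provisional_files * SCORE_MULTIPLIER)  # 3520000
print(system_files * SCORE_MULTIPLIER)       # 8800000

# Expected files per language in each split (exact counts vary randomly).
print(files_per_language * 2 // 7, files_per_language * 5 // 7)  # 20 50
```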
 

Examples

0)
    
Seed: 1
1)
    
Seed: 1
2)
    
Seed: 1

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.