Contest: Spoken Languages
Problem: SpokenLanguages

Problem Statement

    

Prize Distribution

Competitors with the best 3 submissions in this contest according to system test results will receive the following prizes:

Prize   USD
1st     $2,500
2nd     $1,500
3rd     $1,000

Introduction

This contest uses a new format that allows submissions to employ any language or open source libraries that meet the licensing requirements specified below in the “Final Code Submission” section. Furthermore, we have included some helpful hints on potentially viable approaches. We look forward to seeing your results in this new format.

Background

Faith Comes by Hearing (“FCBH”) is dedicated to spreading the message of the Bible across the globe. Over the years, FCBH has observed that many people can’t read or live in oral communities. The organization wishes to allow as many people as possible to hear the Bible in their native languages.

To enable this, FCBH seeks algorithms that can correctly and quickly identify the languages spoken in audio recordings. Given speech data from multiple languages, your algorithm must learn how to identify which languages are spoken in new speech recordings. This learning requirement is most important to FCBH, as reflected in the “Algorithm Learning Evaluation” section below.

Requirements to Win a Prize

In order to receive a prize, you must do all the following:

  1. Achieve a score in the top 3, according to system test results calculated using the Contestant Test Data. See the “Data Description” and “Scoring” sections below.
  2. Create an algorithm that outputs the results for a single 10 second language audio recording in < 10 seconds, running on an Amazon Web Services c3.large virtual machine. See the “Algorithm Learning Evaluation” section below.
  3. Within 7 days from the end of the challenge, submit a complete report at least 2 pages long, outlining your final algorithm, and explaining the logic behind and steps to its approach. More details appear in the “Report” section below.
  4. Submit all code used in your final algorithm in 1 appropriately named file (or tar or zip archive) within 7 days of the end of the challenge. Prospective winners will be contacted for this.

If you place in the top 3 but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all the above.

Data Description

You will receive both a training data set (“Contestant Training Data”) and a test data set (“Contestant Test Data”). Both data sets contain recorded speech in 5 languages: English, French, Italian, German, and Spanish (“Possible Languages Spoken”). A separate .mp3 file stores each speech recording, and only 1 language is spoken in each file.

The Contestant Training Data contains:

  • 600 files, each of which contains approximately 10 seconds of speech recorded in 1 of the 5 Possible Languages Spoken (“Speech Files”).
  • The identity of the actual language spoken in each Speech File (“Actual Language Spoken”).

The Contestant Test Data contains:

  • 600 different Speech Files, each of which contains approximately 10 seconds of speech recorded in 1 of the 5 Possible Languages Spoken.

There is no indication of which Possible Language Spoken in each Contestant Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled).

Algorithm Output Format Requirement

Your algorithm must output 5 records for each Speech File. Each record must contain 3 comma-separated fields in the following order:

  1. The Speech File name (“speechFile”)
  2. The Possible Language Spoken in the Speech File (“possibleLang”)
  3. A rank from 1 to 5 of the likelihood that the Possible Language Spoken named in the record was the Actual Language Spoken in that Speech File, with 1 being most likely and 5 being least likely (“langRank”). You must use each rank only once for a given Speech File. Tied ranks are not allowed.

Your algorithm must output the results of testing each Speech File against every Possible Language Spoken in the Contestant Test Data. Thus, the number of rows in a complete set of output will be Total Speech Files * Possible Languages Spoken = 600 * 5 = 3,000. The "code" that is submitted during the contest should merely be a hardcoded set of return values. An example Java submission appears below for 2 Speech Files:

   
public class SpokenLanguages {

    public String[] identify() {
        return new String[]{
            "fileName1.mp3,English,1",
            "fileName1.mp3,French,3",
            "fileName1.mp3,Italian,4",
            "fileName1.mp3,German,2",
            "fileName1.mp3,Spanish,5",
            "fileName2.mp3,English,5",
            "fileName2.mp3,French,1",
            "fileName2.mp3,Italian,2",
            "fileName2.mp3,German,4",
            "fileName2.mp3,Spanish,3"
        };
    }

}
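
The records themselves would normally be derived from per-language scores produced by a classifier. The following is a minimal sketch of that conversion, not part of the official statement: the class name RankedRecords, the method toRecords, and the score values are hypothetical, and it assumes a higher score means a more likely language. Equal scores are broken by insertion order so that no rank is tied.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns per-language scores into ranked output records.
final class RankedRecords {

    // Returns one "speechFile,possibleLang,langRank" record per language.
    // The highest score gets rank 1. Each rank is used exactly once.
    static List<String> toRecords(String speechFile, Map<String, Double> scores) {
        List<String> languages = new ArrayList<>(scores.keySet());
        // List.sort is stable, so languages with equal scores keep their
        // original order and still receive distinct ranks.
        languages.sort(
            Comparator.comparingDouble((String lang) -> scores.get(lang)).reversed());

        Map<String, Integer> rankOf = new LinkedHashMap<>();
        for (int i = 0; i < languages.size(); i++) {
            rankOf.put(languages.get(i), i + 1);
        }

        // Emit records in the map's own iteration order, as in the example.
        List<String> records = new ArrayList<>();
        for (String lang : scores.keySet()) {
            records.add(speechFile + "," + lang + "," + rankOf.get(lang));
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed scores for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("English", 0.50);
        scores.put("French", 0.15);
        scores.put("Italian", 0.10);
        scores.put("German", 0.20);
        scores.put("Spanish", 0.05);
        toRecords("fileName1.mp3", scores).forEach(System.out::println);
    }
}
```

For these assumed scores, the printed records match the fileName1.mp3 rows in the example submission above.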

Hints

Different voices are used in both the Contestant Training Data and the Contestant Test Data, and sometimes multiple speakers may be talking simultaneously. Focus on correctly identifying the language spoken, not on the number of speakers or distinguishing between them.

The following links may provide helpful background information on previous, similar work. Sections “IV. Acoustic-Phonetic Approaches” and “V. Topics in System Developments” of Spoken Language Recognition: From Fundamentals to Practice by H. Li et al. (2013) could be particularly useful.

National Institute of Standards and Technology (NIST) Language Recognition Evaluation 2009

NIST Language Recognition Evaluation 2011

You may find the techniques, tools, and research papers in the following section useful.

Technique: Language Identification (Weka)

Type: Research Paper

Reference: https://www.sussex.ac.uk/webteam/gateway/file.php?name=hosford-proj.pdf&site=20

Notes: Uses mel-frequency cepstral coefficients (MFCCs), pitch contour, and the normalized pairwise variability index as features, with classification algorithms implemented in Weka

Technique: Spoken Language Classification

Type: Research Paper

Reference: http://cs229.stanford.edu/proj2012/DeMoriFaizullahKhanHoltPruisken-SpokenLanguageClassification.pdf

Notes: Uses MFCCs, variance, and standard deviation to generate features for analysis, and support vector machines, neural networks, and Gaussian mixture models for language identification

Technique: Voice Pattern Designs

Type: Research Paper

Reference: http://arxiv.org/pdf/1502.06811v1.pdf

Notes: Uses MFCCs, Spectral Energy Peak (SEP), Spectral Band Energy (SBE), Spectral Flatness Measure (SFM), and Spectral Centroid (SC)

Technique: Audio Processing

Type: Library

Reference: https://www.idiap.ch/software/bob/docs/releases/last/sphinx/html/TutorialsAP.html

Notes: Bob is a Python-based signal processing and machine learning toolbox developed at Idiap; the linked tutorial covers its use for audio analysis

Technique: Machine Learning through Audio Analysis

Type: Library

Reference: https://github.com/tyiannak/pyAudioAnalysis

Notes: pyAudioAnalysis is a Python library covering a wide range of audio analysis tasks, including feature extraction, classification, segmentation, and visualization

Technique: Various

Type: Links to National Institute of Standards and Technology (NIST) language recognition evaluation tools

Reference: http://www.itl.nist.gov/iad/mig//tools/

Notes: NIST ran relevant language recognition evaluations in 2009 and 2011

Scoring

We will run both provisional and system tests using the same submission, but you will not know which Speech Files are used for which test.

Your algorithm’s performance will be quantified as follows.

For each Speech File,
If possibleLang = Actual Language Spoken and langRank = 1, then 1.00 points
Else if possibleLang = Actual Language Spoken and langRank = 2, then 0.40 points
Else if possibleLang = Actual Language Spoken and langRank = 3, then 0.16 points
Else 0.00 points

Score = 10,000 * Total points for all Speech Files in the Contestant Test Data

The maximum possible total scores for example, provisional, and system testing are 100,000, 1,900,000, and 4,000,000, respectively.
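
The scoring rule can be sketched in a few lines. This is an illustration under stated assumptions, not the official scorer: the class name ScoreSketch, the method score, and the ground-truth labels in main are hypothetical, and records are assumed to follow the “Algorithm Output Format Requirement” section exactly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical re-implementation of the scoring rule for local checks.
final class ScoreSketch {

    // Points when the Actual Language Spoken is ranked 1st, 2nd, or 3rd.
    private static final double[] POINTS_BY_RANK = {1.00, 0.40, 0.16};

    // records: "speechFile,possibleLang,langRank" strings.
    // actualLanguage: maps each Speech File name to its Actual Language Spoken.
    static double score(String[] records, Map<String, String> actualLanguage) {
        double totalPoints = 0.0;
        for (String record : records) {
            String[] fields = record.split(",");
            String speechFile = fields[0];
            String possibleLang = fields[1];
            int langRank = Integer.parseInt(fields[2].trim());
            boolean isActual = possibleLang.equals(actualLanguage.get(speechFile));
            // Ranks 4 and 5, and all records for other languages, earn 0 points.
            if (isActual && langRank >= 1 && langRank <= POINTS_BY_RANK.length) {
                totalPoints += POINTS_BY_RANK[langRank - 1];
            }
        }
        return 10_000 * totalPoints;
    }

    public static void main(String[] args) {
        String[] records = {
            "fileName1.mp3,English,1",
            "fileName1.mp3,German,2",
            "fileName2.mp3,French,1",
            "fileName2.mp3,Italian,2"
        };
        // Assumed labels for illustration only.
        Map<String, String> actual = new HashMap<>();
        actual.put("fileName1.mp3", "English"); // ranked 1st: 1.00 points
        actual.put("fileName2.mp3", "Italian"); // ranked 2nd: 0.40 points
        System.out.printf("%.0f%n", score(records, actual)); // prints 14000
    }
}
```

Under this rule a Speech File contributes at most 10,000 to the score, so the three maxima above sum to 6,000,000 for the 600 files in the Contestant Test Data.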

If there are ties in the final system test results, then we will break them using algorithm run time on the Evaluation Test Data, as described below. However, we will not measure algorithm run time or break any ties during the contest.

Algorithm Learning Evaluation

As mentioned in the “Background” section above, FCBH needs algorithms that can learn to identify new languages from new audio data. We will take the top 3 algorithms according to system test results on the Contestant Test Data and evaluate their ability to learn as follows, using a new training data set (“Evaluation Training Data”) and a new test data set (“Evaluation Test Data”) described below.

The Evaluation Training Data contains:

  • 600 new Speech Files, each of which contains approximately 10 seconds of speech recorded in 1 of multiple new Possible Languages Spoken.
  • The Actual Language Spoken in each new Speech File.

The Evaluation Test Data contains:

  • 600 different Speech Files, each of which contains approximately 10 seconds of speech recorded in 1 of the multiple Possible Languages Spoken.

There is no indication of which Possible Language Spoken in each Evaluation Test Data file is the Actual Language Spoken in that file (i.e., the data are unlabeled).

For this evaluation, you must write a make/build file and execution file (possibly a .sh script or similar) for your algorithm that do all the following:

  • Receive the Evaluation Training Data as input.
  • Learn to classify the multiple Possible Languages Spoken in it.
  • Produce 1 .csv file which conforms exactly to the “Algorithm Output Format Requirement” section above when classifying the languages spoken in the Evaluation Test Data. Note that the number of rows in the .csv will be Total Speech Files * Possible Languages Spoken, which will depend on the number of Possible Languages Spoken and may be different from the output for the Contestant Test Data.
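
Because the number of languages may differ in this evaluation, it can help to check the generated rows before writing the .csv file. The sketch below is one possible check, not an official validator: the class name OutputValidator and the method isValid are hypothetical, and it assumes that with N Possible Languages Spoken the ranks run from 1 to N.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-submission check of the output rows.
final class OutputValidator {

    // Returns true when rows cover every (file, language) pair exactly once
    // and each file uses every rank from 1 to languages.size() exactly once.
    static boolean isValid(List<String> rows, Set<String> files, Set<String> languages) {
        // Total Speech Files * Possible Languages Spoken
        if (rows.size() != files.size() * languages.size()) {
            return false;
        }
        Map<String, Set<String>> languagesSeen = new HashMap<>();
        Map<String, Set<Integer>> ranksSeen = new HashMap<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            if (fields.length != 3) {
                return false;
            }
            String speechFile = fields[0];
            String possibleLang = fields[1];
            if (!files.contains(speechFile) || !languages.contains(possibleLang)) {
                return false;
            }
            int langRank;
            try {
                langRank = Integer.parseInt(fields[2]);
            } catch (NumberFormatException e) {
                return false;
            }
            if (langRank < 1 || langRank > languages.size()) {
                return false;
            }
            // Set.add returns false on a repeat: duplicate language or tied rank.
            if (!languagesSeen.computeIfAbsent(speechFile, k -> new HashSet<>()).add(possibleLang)
                    || !ranksSeen.computeIfAbsent(speechFile, k -> new HashSet<>()).add(langRank)) {
                return false;
            }
        }
        return true;
    }
}
```

Since the row count is checked first, uniqueness of languages and ranks per file is enough to guarantee that every file receives a complete ranking.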

On a common Amazon Web Services c3.large virtual machine, we will do all the following:

  • Run the make/build file for your algorithm.
  • Allow your algorithm to use the Evaluation Training Data to learn how to identify the multiple languages in it.
  • As described in the “Scoring” section above, evaluate the performance of the algorithm on identifying the multiple Actual Languages Spoken in the Evaluation Test Data.
  • After training has completed (untimed), ensure that the algorithm returns speechFile, possibleLang, and langRank for a single Speech File in the Evaluation Test Data in < 10 seconds.
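
The 10 second limit can be checked locally before submission. The following is a rough sketch rather than the organizers' timing procedure: the class name LatencyCheck, its methods, and the stub classifier are hypothetical, and timings on your own machine will differ from those on a c3.large virtual machine.

```java
import java.util.function.Function;

// Hypothetical local check of the single-file time limit.
final class LatencyCheck {

    // 10 seconds expressed in nanoseconds.
    static final long LIMIT_NANOS = 10_000_000_000L;

    // Runs the classifier once on one Speech File and returns the
    // elapsed wall-clock time in nanoseconds. Training is not included.
    static long timeSingleFile(Function<String, String[]> classifier, String speechFile) {
        long start = System.nanoTime();
        classifier.apply(speechFile);
        return System.nanoTime() - start;
    }

    // The requirement is strictly less than 10 seconds.
    static boolean withinLimit(long elapsedNanos) {
        return elapsedNanos < LIMIT_NANOS;
    }

    public static void main(String[] args) {
        // Stub standing in for a trained model; replace with a real classifier.
        Function<String, String[]> stub = file -> new String[] {file + ",English,1"};
        long elapsed = timeSingleFile(stub, "fileName1.mp3");
        System.out.println("elapsed ms: " + elapsed / 1_000_000);
        System.out.println("within limit: " + withinLimit(elapsed));
    }
}
```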

Report

Your report must be at least 2 pages long and must contain at least the following sections.

Your Information

This section must contain the following:

  • First Name
  • Last Name
  • Topcoder handle
  • Email address
  • Final code submission file name

Approach Used

Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code.

This section must contain the following:

  • Approaches considered
  • Approach ultimately chosen
  • Justification for approach ultimately chosen
  • Steps to approach ultimately chosen, including references to specific lines of your code
  • Advantages and disadvantages of the approach chosen
  • Comments on libraries
  • Special guidance given to algorithm based on training

Final Code Submission

You must submit all code used in your algorithm. You may use any programming language you like, provided it is free and open source, both for you and for the client. If you use anything other than the standard Python/gcc/Java installation that our testers normally use, please include installation instructions for that language and for any additional libraries and packages.

You must submit evidence that your code runs successfully to completion. The code will be run on CentOS 6.5 x86_64 HVM.

Data File Downloads

 

Definition

    
Class:SpokenLanguages
Method:identify
Parameters:
Returns:String[]
Method signature:String[] identify()
(be sure your method is public)
    
 

Notes

-In the evaluation after the contest, the submitted code must actually return results by processing the data files, without any manually hard-coded results. Specifically, we should be able to execute the submitted code using the original contest data and obtain results similar to those submitted during the contest.
-Usage of additional data is acceptable, provided it is freely available in the same manner as code/libraries that are used. Also, any additional data should likewise be submitted with the code after the contest (in the case of a top 3 finish). The submitted code with additional data will be subject to the same aforementioned requirement of being able to produce the results submitted during the contest.
 

Examples

0)
    
Seed: 1

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.