Register Now
Member Count: 627,603 - April 23, 2014  [Get Time]
Login
long_comps_topcoder  Problem Statement
Contest: GWASSpeedup
Problem: GWASSpeedup

Problem Statement

    

Genome-wide association studies (GWAS) analyze large sets of genetic markers across large cohorts of individuals to locate genetic variants contributing to the heritability of phenotypes (i.e., traits) of interest. GWAS analysis is computationally challenging because of the scale of the data involved and the modeling algorithms required. Typical GWAS analyses have marker sets on the scale of millions of single-nucleotide polymorphisms (SNPs) and cohort sets of thousands of individuals. Modeling associations between markers and phenotypes can be complex because of the need to take into account confounding covariates and model non-quantitative traits (e.g., case vs. control status).

This contest will focus on speeding up the logistic regression modeling that is the most computationally demanding component of many GWAS analyses. The basic setup is as follows. Say we have a cohort of N individuals. For each individual, we have phenotype data for P phenotypes, genotype data at M markers, and covariate data for C covariates. GWAS analyses ask which markers help explain which phenotypes, with the slight subtlety that we are only interested in phenotypic variance not already explained by covariates.

Input:

  • Phenotype matrix: N x P {0,1}-matrix of traits. Rows correspond to individuals in this and the following matrices. The matrix is binary because the traits are non-quantitative, e.g., whether or not the individual has a particular disease. Each row is represented in the input as a string of length P with each character equal to either '0' or '1'.
  • Genotype matrix: N x M {0,1,2,9}-matrix of genetic data at M markers. Each individual has either 0, 1, or 2 copies of the variant allele at each marker. A value of 9 indicates missing data. The data is formatted as above with each row represented as a string of length M.
  • Covariate matrix: N x C real-valued matrix of other data, possibly derived from genetic data or from physical measurements. Intuitively, the covariates are confounding variables that may have some correlation to the phenotypes. Any part of the phenotypic variance that is explained by the covariates should not be attributed to genetic markers when testing for association. This matrix is represented as an array in row-major order, i.e., element (n, c) has index n*C + c (using 0-based indexing).

Logistic regression model:

For each marker and each phenotype, we fit a logistic regression model of the phenotype vector against the marker vector and all the covariates (for a total of C+1 independent variables plus a constant term). Briefly, logistic regression models binary outcomes with probabilities 1/(1+e-t)), where t is a linear function of the independent variables. The best-fit coefficients of the linear function are obtained by maximizing likelihood. The maximum likelihood estimate does not have a closed form, but it can be computed using iterative approaches such as Newton-Raphson ("Newton's method"). Finally, the statistical significance of the association between the marker and the phenotype is assessed by computing a Wald statistic, which equals the square of the ratio of the regression coefficient to its standard error. The squared Z-score, Z^2, is asymptotically distributed as a chi-square distribution.

As a minor additional note, when testing marker m for association to each phenotype p, individuals with a missing value at marker m are discarded. Thus, the logistic regression may use slightly fewer than N measurements.

In the local tester package, we provide a reference implementation of logistic regression extracted from the PLINK software suite (see "Additional background" for more information about PLINK). Submissions will be scored against this reference implementation, and if you wish to use any code from the reference, you may do so. For further information about logistic regression, many resources are available online. For example, see:

  1. http://czep.net/stat/mlelr.pdf
  2. http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

Output:

For each of the M x P (marker, phenotype) pairs, we wish to compute the value of the chi-square statistic for association of the marker with the phenotype according to the above logistic regression model. Element (m, p) of the matrix corresponds to index m*P + p of the return vector.

Not all of the M x P values are of equal interest, however. Generally, the vast majority of markers in a data set are unrelated to any given phenotype. In some GWAS analyses, the precise values of association statistics may only be required for the most significant associations between markers and phenotypes. Therefore, if runtime is a major limitation, it may be desirable to precisely compute association statistics for only the most significant pairs.

Scoring:

Submissions will be scored based on accuracy and runtime, reflecting the considerations described above. The accuracy score component is calculated using the following procedure:

  • Compute the actual values of the M x P chi-square statistics using the reference implementation.
  • Rank the M x P pairs in decreasing order of significance (i.e., decreasing order of Z2).
  • Going down the ranked list, check whether the returned value for each pair matches the actual statistic to within 0.1% (1e-3) absolute or relative error.
  • Stop at the first mistake. The accuracy score for this test case is the number of correct values computed before the first mistake.

To obtain a raw score for the test case, the accuracy score is then divided by a time penalty between 1 and 2:

  • RAW_SCORE = ACCURACY_SCORE / (1 + max(TIME_SPENT, 100ms) / TIME_LIMIT)

Finally, a scaled score for the test case is defined relative to the scores of other competitors:

  • SCALED_SCORE = RAW_SCORE / max(P, BEST)
where BEST is the best raw score achieved for that test case by any competitor and P is the number of phenotypes as before. (The max(P, BEST) is intended to reduce score variance in the event of very difficult cases.)

The total score for a submission is the sum of the scaled scores over all test cases.

Test generation:

For each seed, values of N, M, P, and C are uniformly sampled from the parameter ranges given in the “Constraints” section below. Then, the input matrices are created as follows:

  • Genotype matrix: The genotype matrix is created by randomly subsampling real genotype data from the 1000 Genomes Project. We subsample N individuals and M markers from a provided data set (containing 1624 individuals and 100,000 markers) to use as the genotype matrix.
  • Phenotype matrix: Given a genotype matrix, we first create a normalized matrix X (also N x M) by setting missing values to their column means and normalizing each column to have mean 0 and variance 1. Explicitly, given an input genotype matrix G with entries in {0,1,2,9}, for each marker m (i.e., column of G):
    • Compute the mean of the non-'9' entries of column m.
    • Replace '9' entries with the mean and subtract the mean from all entries.
    • Compute the standard deviation of column m (with 9s replaced).
    • Divide each entry of column m by the standard deviation.
    We then randomly generate phenotypes as follows. First, sample the approximate proportion of variance hg2 to be collectively explained by the causal markers (uniformly between 0 and 1) and the proportion of variance hpop2 to be explained by population stratification (uniformly between 0 and (1-hg2)/10). Also sample the approximate fraction q of markers with nonzero effects (uniformly between 0 and 0.05). These parameters are used for all P phenotypes in the test case. Then, for each phenotype:
    • Assign each marker m an effect size βm that is 0 with probability 1-q and is drawn from a normal distribution N(0,(hg2)/(M*q)) otherwise.
    • Assign each population k a population effect μk drawn from a normal distribution N(0,hpop2).
    • Generate the liability of each individual i as yi=Xi*β+μ+ϵ, where:
      • Xi*β is the dot product of the i-th row of the normalized genotype matrix with the effect vector (giving the total genetic effect from causal markers)
      • μ is the population effect for the population to which individual i belongs
      • ϵ is a noise variable sampled from N(0,1-hg2-hpop2).
    • Finally, set the phenotype to 1 if the liability yi > 0 and 0 otherwise.
    Note: the population of an individual can be found in the last column of genotype annotation data file. There 8 population types overall.
  • Covariate matrix: Compute the top C principal components of the normalized genotype matrix X to use as covariates. (This approach is a common technique used to correcting for population stratification.)

There are 10 example cases and 100 full submission cases. Example cases use seeds from 1 to 10, inclusive. The seed of 1 is hardcoded to have the minimum possible values of N, M, P and C.

Tools:

An offline tester tool is provided. It can run your solution locally and calculate its raw score. We also provide the Java source code of the offline tester. You are allowed to introduce any modifications into the code in order to make it suit your needs better. You are also allowed to use code from the tester in your submission.

Special conditions:

  • Submissions can be implemented in C++, Java, C# or VB.NET. Python submissions are not accepted.
  • For C++ submissions, the -llapack compiler flag will be enabled so that you can make calls to the LAPACK linear algebra library via its C++ interface.
  • Use of open source code is permitted as long as the license is compatible with re-releasing winning submissions under GPLv2.
  • No use of assembly is allowed.
  • In order to receive the prize money, you will need to fully document how your algorithm works. Note that this description should NOT be submitted anywhere during the coding phase. Instead, if you win a prize, a TopCoder representative will contact you directly in order to collect these data. The description must be submitted within 7 days after the contest results are published. Questions sent by email from TopCoder requesting clarification to this document must be answered within 3 days. If the document or a response to a clarification request is not received, TopCoder reserves the right to disqualify the submission.

Additional background

Many software tools have been developed for GWAS analyses. Of these, one of the most widely used is PLINK, an open source tool developed by Shaun Purcell at the Center for Human Genetic Research of the Massachusetts General Hospital (MGH). PLINK is a well-documented, mature software suite that implements an extensive range of analyses and data processing tools, contributing to its popularity. PLINK has not been highly optimized, however, and we believe there is substantial room for speedup in some of its analysis algorithms.

This challenge highlights an initial area for improvement that we have identified based on profiling PLINK on typical GWAS runs. Specifically, we have found that in logistic regression analyses of case-control phenotypes taking into account covariates -- a common use case -- PLINK spends most of its time in its LogisticModel::fitLM() function. The goal of this contest is to determine how much speedup is possible and ultimately incorporate the optimizations into the PLINK codebase to be made available for scientists to use. Follow-up contests may investigate optimizing algorithms for other PLINK functionality or bottlenecks in the PLINK data-handling (e.g., Model::buildDesignMatrix() also appears to take a substantial amount of setup time).

 

Definition

    
Class:GWASSpeedup
Method:computeAssociations
Parameters:String[], String[], double[]
Returns:double[]
Method signature:double[] computeAssociations(String[] phenotypes, String[] genotypes, double[] covariates)
(be sure your method is public)
    
 

Notes

-The memory limit is 1 GB.
-The time limit is 10 seconds.
-There is no explicit code size limit. The implicit source code size limit is around 1 MB (it is not advisable to submit codes of size close to that or larger). Once your code is compiled, the binary size should not exceed 1 MB.
-The compilation time limit is 30 seconds. You can find information about compilers that we use and compilation options here.
-No use of assembly is allowed.
-The match forum is located here. Please visit it regularly during the match, since we may post updates and/or clarifications there.
 

Constraints

-N will be between 500 and 1500 (inclusive).
-M will be between 1000 and 5000 (inclusive).
-P will be between 3 and 50 (inclusive).
-C will be between 5 and 10 (inclusive).
 

Examples

0)
    
Seed = 1
N = 500
M = 1000
P = 3
C = 5
1)
    
Seed = 2
N = 513
M = 3885
P = 15
C = 10
2)
    
Seed = 3
N = 1408
M = 1115
P = 20
C = 8
3)
    
Seed = 4
N = 1179
M = 2065
P = 31
C = 6
4)
    
Seed = 5
N = 1483
M = 1855
P = 43
C = 10
5)
    
Seed = 6
N = 1070
M = 3655
P = 46
C = 9
6)
    
Seed = 7
N = 679
M = 3174
P = 35
C = 7
7)
    
Seed = 8
N = 903
M = 2332
P = 18
C = 6
8)
    
Seed = 9
N = 818
M = 1133
P = 9
C = 5
9)
    
Seed = 10
N = 665
M = 1838
P = 19
C = 8

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.