Tuesday, October 15, 2013

Application of compressed sensing to genome wide association studies and genomic selection

As we were talking about phase transition, the next paper has already been covered by at least two of its authors on their respective blogs:

Go check them out, I'll wait.

Roughly speaking, the authors put GWAS findings in a regression setting and use the Donoho-Tanner (DT) phase transition to get a rule of thumb for the number of people needed in a study to detect a connection between SNP results and a phenotype with a given number of loci with nonzero effects. This is outstanding! I note the use of additive noise and how it affects the DT phase transition to get a realistic rule of thumb (much like what was done in [1]).

The paper is: Application of compressed sensing to genome wide association studies and genomic selection by Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
If you are already aware of CS but not much into GWAS, here are some interesting tidbits from the paper (my emphasis underlined):

More importantly, we provide an independent quantitative criterion for when the method will work. In addition, the choice of optimization parameters has led some researchers to adopt computationally intensive procedures for integrating over the continuum of possible values as well as effectively reducing the statistical power by reserving data for cross-validation (Park and Casella 2008; Makowsky et al. 2011; Zhou et al. 2013). We show how the lasso penalization parameter can be determined theoretically rather than empirically through cross-validation (Candès and Wakin 2008; Candès and Plan 2009; Candès and Plan 2011; Candès 2011), preserving all the data for training.
what the Donoho-Tanner phase transition is really used for in this context:

Using more than 12,000 subjects from the ARIC European-American and GENEVA cohorts and nearly 700,000 single-nucleotide polymorphisms (SNPs) we show that the matrix of genotypes acquired in GWAS obeys properties suitable for the application of CS theory. In particular, a given sample size determines the maximum number of nonzero loci that will be fully selected using a technique such as lasso
One notes the use of an additive noise that is connected to a heritability trait:

...The transition between poor and complete selection is sharp in the noiseless case (heritability equal to one). It is smoothed in the presence of noise (heritability less than one) but fully detectable. Consistent with CS theory, we find in cases with realistic residual noise that the minimal sample size is primarily determined by the number of nonzero loci s and depends very weakly on the number of genotyped markers p (Candès et al. 2006; Donoho et al. 2011; Candès and Plan 2011).
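To make the link between heritability and noise concrete, here is a minimal sketch. It assumes the usual additive model y = Ax + e with h2 taken as var(Ax) / (var(Ax) + var(e)); that definition is my reading, not something spelled out in the excerpt, and the function name is hypothetical.

```python
def noise_variance(genetic_variance, h2):
    """Residual variance var(e) implied by h2 = var(Ax) / (var(Ax) + var(e)).

    Assumption: additive model y = Ax + e with e independent of Ax.
    """
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    # Solve h2 = g / (g + v) for v, where g is the genetic variance.
    return genetic_variance * (1.0 - h2) / h2
```

With h2 = 1 the residual variance is zero (the noiseless case), and with h2 = 0.5 the noise carries as much variance as the genetic signal.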
The detail of the measurement matrix:
The SNP genotype matrix (A) consisted of 12,464 subjects and 693,385 SNPs. SNPs were coded by their minor allele and alleles were combined resulting in values of 0, 1, or 2. SNP vectors were standardized across subjects. Missing genotypes were replaced with 0’s after standardization.
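For readers coming from the CS side, the preprocessing above can be sketched roughly as follows for a single SNP column. This is only an illustration under my own assumptions: missing genotypes are represented as None, the mean and standard deviation are computed over the observed genotypes only, the population standard deviation is used, and a column with no variation is mapped to zeros. The excerpt does not specify any of these choices, and the function name is hypothetical.

```python
import statistics

def standardize_snp(column):
    """Standardize one SNP column of minor-allele counts (0, 1, 2; None = missing).

    Assumptions (not from the paper): statistics are taken over observed
    genotypes only, and the population standard deviation is used.
    """
    observed = [g for g in column if g is not None]
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    if sd == 0.0:
        # A SNP with no variation carries no information; return zeros.
        return [0.0] * len(column)
    # Missing entries become 0 after standardization, i.e. the column mean.
    return [0.0 if g is None else (g - mean) / sd for g in column]
```

Filling missing entries with 0 after standardization amounts to imputing the column mean, so the filled column still has zero mean.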
with another Rosetta stone moment: 
Very roughly, we can say that the goal of GWAS [genome-wide association studies] per se is to identify the s nonzero elements of x, whereas the goal of GS [genomic selection] is to determine Ax.
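To see the two goals side by side, here is a minimal pure-Python sketch of the lasso solved by cyclic coordinate descent. It is not the authors' solver: the objective scaling (1/(2n)), the function names, and the fixed number of sweeps are my assumptions, and lam is simply passed in rather than derived theoretically as the paper does.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (proximal map of the l_1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(A, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - A x||^2 + lam * ||x||_1 by coordinate descent.

    A is a list of n rows with p entries each; y has n entries.
    """
    n, p = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(a * a for a in col) / n for col in cols]
    x = [0.0] * p
    residual = list(y)  # equals y - A x, since x starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(c * r for c, r in zip(cols[j], residual)) / n + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            if new_xj != x[j]:
                delta = new_xj - x[j]
                residual = [r - c * delta for c, r in zip(cols[j], residual)]
                x[j] = new_xj
    return x

def support(x, tol=1e-12):
    """GWAS-style output: indices of the loci selected as nonzero."""
    return [j for j, v in enumerate(x) if abs(v) > tol]

def predict(A, x):
    """GS-style output: predicted phenotypes A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]
```

In this picture, support(x) is what a GWAS is after and predict(A, x) is what GS is after. Note that the lasso shrinks the selected coefficients toward zero, so the selected support can be right even when the predicted values are biased.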
How does additive noise influence the number of samples? Quite typically, it is the main reason one ought to use these Donoho-Tanner phase transition diagrams. Here we have (the lower h2, the higher the noise):

Given the assumptions of h2 = 0.5 for height (Yang et al. 2010; Vattikuti et al. 2012) and a critical ρ = 0.03 per the simulations above, this suggests that the number of height-associated SNPs is greater than four hundred. This lower bound agrees with GWAS findings to date that have identified hundreds of height-associated SNPs while accounting for only a fraction of the genetic variance (Yang et al. 2012; Turchin et al. 2012).
and from the additive noise study, we get a rule of thumb, much like we do when designing hardware sensors:
For example, if h2 = 0.5, which is roughly the narrow-sense heritability of height and a number of other quantitative traits (Yang et al. 2010; Davies et al. 2011; Vattikuti et al. 2012), we find that irrespective of δ, ρ should be less than 0.03 for recovery. There is no hope of recovering x above this threshold. For example, if we have prior knowledge that s = 1,200, then this means that the sample size should be no less than 40,000 subjects. As a rough guide, for h2 ∼ 0.5 we expect that n ∼ 30s is sufficient for good recovery of the loci with nonzero effects.
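The arithmetic behind that rule of thumb is short enough to write down. The sketch below assumes the standard Donoho-Tanner coordinates ρ = s/n and δ = n/p, which is consistent with the numbers in the quote (1,200 / 0.03 = 40,000); the function names are hypothetical, and the critical ρ has to come from a phase diagram such as the paper's simulations.

```python
from fractions import Fraction
import math

def min_sample_size(s, rho_critical):
    """Smallest n such that s / n <= rho_critical."""
    # Fraction(str(...)) avoids binary floating-point round-off for inputs like 0.03.
    return math.ceil(Fraction(s) / Fraction(str(rho_critical)))

def max_recoverable_loci(n, rho_critical):
    """Largest s such that s / n <= rho_critical."""
    return math.floor(n * Fraction(str(rho_critical)))

print(min_sample_size(1200, 0.03))  # 40000
```

Since 1 / 0.03 is about 33, this is the same ballpark as the n ∼ 30s guide in the quote.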

What's next?

Phil Schniter just let us know about An Empirical-Bayes Approach to Recovering Linearly Constrained Non-Negative Sparse Signals that uses an AMP approach to LASSO with better phase diagrams than the ones allowed by l_1 minimization. That ought to be looked into as we know that the DT phase transition is also dependent on the actual algorithm used.

A second take is whether one could also include multiplicative noise in the study, as SNP measurements might have some errors. In that case, one wonders what an MMV approach might mean, or whether it would even be feasible.

A third take on this: if one allowed for some sort of block structure to be discovered in the regression, one would probably decrease the number of samples needed and begin to provide some explanation as to which loci are really the important ones. In particular, the mapping of the regression coefficients onto either a tree-like or block-like structure might give a clue as to which biochemical network is really at play and which ones are affected as a side effect.
