Nuit Blanche: AMP: Assembly Matching Pursuit, Metagenomic units (MGUs) discovery through sequence-based dictionary learning

Saturday, January 05, 2013

AMP: Assembly Matching Pursuit, Metagenomic units (MGUs) discovery through sequence-based dictionary learning - implementation -

A while back, we saw that for each individual, the microbiome did not seem to change much over time. But at a given time t, how different is the microbiome among the seven billion individuals currently on Earth? Thanks to Twitter and Jason Moore, I came across a paper (and attendant code) that sets out to answer that question with a dictionary learning approach.

AMP: Assembly Matching Pursuit by Surojit Biswas, Vladimir Jojic. The abstract reads:

Metagenomics, the study of the total genetic material isolated from a biological host, promises to reveal host-microbe or microbe-microbe interactions that may help to personalize medicine or improve agronomic practice. We introduce a method that discovers metagenomic units (MGUs) relevant for phenotype prediction through sequence-based dictionary learning. The method aggregates patient-specific dictionaries and estimates MGU abundances in order to summarize a whole population and yield universally predictive biomarkers. We analyze the impact of Gaussian, Poisson, and Negative Binomial read count models in guiding dictionary construction by examining classification efficiency on a number of synthetic datasets and a real dataset from Ref. 1. Each outperforms standard methods of dictionary composition, such as random projection and orthogonal matching pursuit. Additionally, the predictive MGUs they recover are biologically relevant.

The code can be found here.

I wonder how these greedy algorithms scale to very large databases, and how different the output would be if one were to use other dictionary learning techniques (especially those that promote structured sparsity). The synthetic data were derived from this human microbiome dataset.
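For readers wondering what "greedy" means here, the following is a minimal sketch of plain orthogonal matching pursuit, the standard baseline named in the abstract. It is not the authors' AMP code: the function name `omp`, the random Gaussian dictionary, and the toy 3-sparse signal are all assumptions made for illustration. Each iteration picks the dictionary atom most correlated with the current residual, then refits all selected atoms by least squares.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Plain orthogonal matching pursuit (illustrative, not the AMP method).

    D: (m, k) dictionary with unit-norm columns; y: (m,) signal;
    n_nonzero: number of atoms to select (assumed >= 1).
    """
    residual = y.astype(float).copy()
    support = []
    for _ in range(n_nonzero):
        # Greedy step: score every atom by its correlation with the residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Orthogonal step: refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef

# Toy data (hypothetical): a random dictionary and a 3-sparse code.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 200))
D /= np.linalg.norm(D, axis=0)          # normalize atoms to unit norm
true_code = np.zeros(200)
true_code[[3, 70, 150]] = [1.5, -2.0, 1.0]
y = D @ true_code

est = omp(D, y, n_nonzero=3)
print("selected atoms:", np.flatnonzero(est))
print("residual norm:", np.linalg.norm(y - D @ est))
```

The scaling question is visible in the loop: every iteration computes `D.T @ residual` over the whole dictionary and solves a growing least-squares problem, so the cost increases with both the number of atoms and the sparsity level.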