Sunday, January 25, 2015

Sunday Morning Insight: What Happens When You Cross into P Territory ?

Today's insight is a follow up to other previous Sunday Morning Insights. The first of those insights cast genome sequencing as a formerly NP-Hard problem (Sunday Morning Insight: Escaping Feynman's NP-Hard "Map of a Cat": Genomic Sequencing Edition), the second one focused on how Advanced Matrix Factorization could now help speed up this technology ( Improving Pacific Biosciences' Single Molecule Real Time Sequencing Technology through Advanced Matrix Factorization ? ). Then, we conjectured, based on our previous experience with compressive sensing ( a formerly NP-hard problem) how the field of genome sequencing would grow (Sunday Morning Insight: The Stuff of Discovery and Sunday Morning Insight: Crossing into P territory ). Today's insight will follow through on all those previous  insights. In a recent Twitter exchange, one could read:
Lex Nederbragt responded with this preprint that I had wanted to mention much earlier. The use of LSH to perform alignement.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing by Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
Emphasis mine. From the paper, one can read:

For example, using Amazon Web Services (AWS), the estimated cost to generate the D. melanogaster PBcR-BLASR assembly is over $100,000 at current rates, an order of magnitude higher than the sequencing cost. With MHAP, this cost is drastically reduced to under $300
MHAP is available here:

In summary, if LSH can be used for reconstruction, similar tools ought to be good enough to perform Compressive Genomics for even larger scale problems ...In effect, when you cross into P territory, you look for harder problems.


Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

1 comment:

Aaron said...

Just got my MinION and can't to try out some of these long read methods!