Sunday, September 07, 2014

Sunday Morning Insight: Escaping Feynman's NP-Hard "Map of a Cat": Genomic Sequencing Edition

If you have read Nuit Blanche long enough, you might have seen this figure several times. While compelling in comparison to Moore's law, something immensely more important has happened recently.

With long read technology, DNA Sequencing has crossed over from being an NP-complete enterprise to being in P. How ? What does it mean ? Why should you care ?

NP-complete fields yield Cognitive Overload

I read Feynman's "Surely You're Joking, Mr. Feynman!" a long time ago but one of the little story that stuck with me was his adventures in Biology. Here is what he had to say:

I began to read the paper. It kept talking about extensors and flexors, the gastrocnemius muscle, and so on. This and that muscle were named, but I hadn’t the foggiest idea of where they were located in relation to the nerves or to the cat. So I went to the librarian in the biology section and asked her if she could find me a map of the cat.
“A map of the cat, sir?” she asked, horrified. “You mean a zoological chart!” From then on there were rumors about some dumb biology graduate student who is looking for a “map of the cat.”
When it came time for me to give my talk on the subject, I started off by drawing an outline of the cat and began to name the various muscles.
The other students in the class interrupt me: “We know all that!”
“Oh,” I say, “you do? Then no wonder I can catch up with you so fast after you’ve had four years of biology.” They had wasted all their time memorizing stuff like that, when it could be looked up in fifteen minutes.- 

Clearly, even though "the map of a cat" had a very intuitive meaning, the complexity of the field meant that the term was totally inapropriate. Worse, the story shows that in order to map an extraordinary complexity, a cognitive price to pay is the rote memorization of a collection of empirical facts and tricks of the trades ( such as knowing every muscles by name ) to, sometimes, the exclusion of real insight.

It's the alignment, stupid!

Anybody who has ever read sequencing related articles know that a large part of the uncertainty in the field comes from our collective inability to do well in the alignment reconstruction process after DNA molecules have been broken up chemically in sequencers. In particular, we have known for a little while based on information theory. that unless one can read at least 2 to 4 000 base pairs (2KBp to 4KBp) continuously on DNA molecules, we are condemned to using many empirical techniques and combinatorial solvers to fit those pieces back together.

In second generation technology (also known as Next Generation Sequencing NGS ), long reads meant 200 Bp (0.2 KBp) while with the third generation sequencing, long reads now mean 20000 Bp (20KBp) or more.

In other words, by enabling a 100 times improvement, the sensor technology made us cross the information theoretic limit into territories where even greedy alignment solvers could do well [1].

In all, third generation sequencers have enabled us to collectively cross into the less complex P territory. If you have read this blog long enough, you'll notice that this is exactly what happened in compressive sensing back in 2004. Before 2004, sparse reconstruction solvers were known to exist in very specific fragmented fields [4] who would use different vocabularies (cognitive overload again) but we had no idea why they worked as the analysis plainly showed we had to be in NP territory. In 2004, two papers showed us why we really were in P territory all along [2].

In both cases, technology development or a new mathematical insight, transported us collectively into a land of low hanging fruits. Which leads us to the next question.

How do you like t'em apples ?

Acknowledgment: Thank you M.R. Yousefi for the discussion [3]


[1] As the technology gets better with fewer error rates, we might get into fewer dna sampling requirements even though currently some alignment issues are still being worked out (see this week's with the Oxford Nanopore Minion story).

[2] In 2011, these territories were pushed further.

[3] A discussion between M.R. Yousefi and me that led to this blog entry:

My initial e-mail:

Howdy M.R.,

In light of David Tse's work ( ), is it fair to say that when the average read length went above 2000 (really from 200 bp max with NGS to 20KBp with PacBio 3rd gen), we went from a combinatorial world to a greedy one ? (please correct me on the numbers)


MR's response:
Howdy Igor,

According to the slides and [1], it seems that reconstructing Human Chr 19 using the greedy method requires read-length L of more than 4.5 kb, which is much higher than the information theoretic lower bound assuming no repeats of length L. This gap is due to repeats of longer length, as well as the read error. The MULTIBRIDGING algorithm, on the other hand, takes into account long repeats is near optimal (in most cases). But MULTIBRIDGING is not greedy. It uses graph theory and has computational complexity O(G^2) where G is the length of original sequence to be reconstructed. 
The authors in [1] also state that:
"The MULTIBRIDGING algorithm is near optimal in the sense that, for a wide range of repeat statistics, it requires the minimum read length and minimum coverage depth to achieve complete reconstruction. However, since the repeat statistics of a genome to be sequenced are usually not known in advance, this minimum required read length and minimum required coverage depth may also not be known in advance. In this context, it would be useful for the Multi-Bridging algorithm to validate whether its assembly is correct. More generally, an interesting question is to seek algorithms which are not only optimal in their data requirements but also provide a measure of confidence in their assemblies.
How realistic is the goal of complete reconstruction given current-day sequencing technologies? The minimum read lengths L_crit required for complete reconstruction on the datasets we examined are typically on the order of 500-3000 base pairs (bp). This is substantially longer than the reads produced by Illumina, the current dominant sequencing technology, which produces reads of lengths 100-200bp; however, other technologies produce longer reads. PacBio reads can be as long as several thousand base pairs, and as demonstrated by [27], the noise can be cleaned by Illumina reads to enable near complete reconstruction. Thus our framework is already relevant to some of the current cutting edge technologies."
I believe you are right that PacBio's promise of long reads (more than 8.5 kb on average) makes it possible to use the greedy algorithm and get nearly-perfect reconstruction.  
M. R. Yousefi

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

No comments: