Nuit Blanche: Improving Pacific Biosciences' Single Molecule Real Time Sequencing Technology through Advanced Matrix Factorization ?

Wednesday, August 20, 2014

If you watched Elaine Mardis' Videos and Slides: Next-Generation Sequencing Technologies, you noted that in order to produce complete genomic data, sequencing needs long read technology (also called 3rd generation sequencing technology) like the Pacific Biosciences SMRT or the nanopore technology. In fact, advances in medical technology will come from pushing the price of these instruments even lower than what we currently seem to reach. One of the issues with long read technology is the error rate. That error rate is generally offset by the large number of sequenced strands, so that in effect the error rate becomes very small compared to previous technology, where one has to put together (align together, really) very short read sequences.
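To make the "many strands offset the error rate" argument concrete, here is a minimal sketch of a toy model. It is not PacBio's pipeline: it assumes the reads are already aligned, that errors are independent substitutions only (no insertions or deletions), and that the consensus is a plain per-position majority vote. The function name and all parameter values are made up for illustration.

```python
import numpy as np

def consensus_error_demo(genome_len=2000, n_reads=31, per_read_error=0.15, seed=0):
    """Toy model: compare the raw per-read error rate with the error rate
    of a majority-vote consensus over many aligned reads."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(0, 4, size=genome_len)       # bases coded 0..3
    reads = np.tile(truth, (n_reads, 1))              # every read covers the full strand

    # Corrupt each base independently; a corrupted base becomes one of the 3 others.
    errors = rng.random(reads.shape) < per_read_error
    offsets = rng.integers(1, 4, size=reads.shape)    # 1..3, so the base always changes
    reads = np.where(errors, (reads + offsets) % 4, reads)

    # Majority vote at each position.
    counts = np.stack([(reads == b).sum(axis=0) for b in range(4)])
    consensus = counts.argmax(axis=0)

    raw_rate = float((reads != truth).mean())
    consensus_rate = float((consensus != truth).mean())
    return raw_rate, consensus_rate

raw, cons = consensus_error_demo()
print(f"per-read error rate: {raw:.3f}, consensus error rate: {cons:.5f}")
```

Under these assumptions the per-read error rate stays near the chosen 15%, while the consensus is wrong only when a wrong base outvotes the correct one across the reads, which becomes very unlikely as the number of reads grows.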

Let us focus on PacBio's sequencing technology for a moment. Here is a nice introduction:

So a polymerase does its job at the bottom of a chamber called a Zero Mode Waveguide (ZMW). The chamber is filled with fluorescent elements, and the polymerase takes them one by one as it reproduces the DNA strand in the chamber. The ZMW chamber is designed so that the light, which comes from below, only illuminates the polymerase and its immediate surroundings. Once a fluorescent element is no longer in the vicinity of the polymerase (after it has been used), it no longer receives any exciting light, so the only light seen outside is that of the fluorescent material currently being used by the polymerase. Very clever.

If you watch that video past the 3-minute mark, you'll see that the number of chambers is large in order to provide some oversampling, because:

not all polymerases start at the same point in the DNA
each polymerase seems to do only a certain amount of work and then stop (because of a shortage of fluorescent molecules ?)

What we see are dots of four different colors blinking as the polymerase incorporates specific elements into the DNA.

Why go through this description ? Because, quite simply, the alignment work to be done afterwards is an Advanced Matrix Factorization problem [1]. But if there are algorithms that are already doing the work, why should we care about this problem ? It is solved, right ? Let me make three arguments to answer this very valid point of view. With the new Matrix Factorization algorithms come:

many different implementations (more implementations is better)
bounds or phase transitions which can have a clear link to sampling strategies (how many chambers ?) as was done for the recent GWAS work (which included noise)
robustness to grossly corrupting noise, which could probably change the nature of the hardware (is the ZMW really necessary ?) and could even provide improvements to previous and current technology.

[1] The factorization has the following shape: Y = P X, where Y is a matrix representing the timed sequence coming out of all the cells, X is an unknown rank-1 matrix whose columns represent the full aligned DNA being sequenced, and P is an unknown (group) sparse matrix (with +1 elements) representing the sampling done by the polymerase and the subsequent detection by the camera of the sequencing instrument. It looks like a blind deconvolution with a low-rank dataset.
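The footnote leaves the dimensions open, so here is one possible reading of Y = P X as a forward model, written as a hedged sketch. The assumptions are mine: bases are coded 1 to 4 with 0 meaning "no signal", each cell k reads a contiguous window of the strand starting at a random position, and the sampling is applied column by column, Y[:, k] = P_k X[:, k], which amounts to a block-diagonal P acting on the stacked columns of X. The function name and all sizes are hypothetical.

```python
import numpy as np

def simulate_cells(n_bases=60, n_cells=8, max_steps=25, seed=1):
    """Forward model only: build a rank-1 X, one sparse 0/1 sampling
    matrix P_k per cell, and the timed observations Y."""
    rng = np.random.default_rng(seed)
    x = rng.integers(1, 5, size=n_bases)              # the strand, bases coded 1..4
    X = np.outer(x, np.ones(n_cells, dtype=int))      # rank-1: every column is the same strand

    starts = rng.integers(0, n_bases, size=n_cells)   # polymerases start at different points
    lengths = rng.integers(5, max_steps + 1, size=n_cells)  # and stop after a while

    Y = np.zeros((max_steps, n_cells), dtype=int)     # row t = time step t, column k = cell k
    P_blocks = []
    for k in range(n_cells):
        P_k = np.zeros((max_steps, n_bases), dtype=int)
        # At time step t, cell k sees base number starts[k] + t (clipped at the strand end).
        steps = np.arange(min(lengths[k], n_bases - starts[k]))
        P_k[steps, starts[k] + steps] = 1
        Y[:, k] = P_k @ X[:, k]
        P_blocks.append(P_k)
    return x, X, P_blocks, Y

x, X, P_blocks, Y = simulate_cells()
print("rank of X:", np.linalg.matrix_rank(X))
print("Y shape (time steps, cells):", Y.shape)
```

This only generates data. The actual problem discussed above runs the other way: given Y alone, recover both the sampling pattern and the strand, which is where the blind deconvolution flavor comes from.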

(UPDATE: more on this factorization in "DNA Sequencing, Information Theory, Matrix Factorization and all that")
