Friday, August 22, 2014

DNA Sequencing, Information Theory, Advanced Matrix Factorization and all that...

[1] The factorization has the following shape: Y = P X with Y a matrix representing the timed sequence coming out of all the cells, a rank-1 matrix X that is unknown with columns representing the full aligned DNA being sequenced and P represents an unknown (group) sparse matrix (with +1 elements) representing the sampling done by the polymerase and the subsequent detection by the camera of the sequencing instrument. It looks like a blind deconvolution with a low rank dataset.

Following that entry, M.R. Yousefi and I had a discussion where I eventually had to provide a more in-depth and accurate explanation of the Matrix Factorization that ought to be looked into for this technology.

For starters, here are some numbers and variables: Each chamber takes as input a DNA molecule of size ~20 kilobases. The input molecules can be different from chamber to chamber since these molecules are actually segments of a much much longer DNA template (finding this template is the goal of sequence alignment and it is represented in X). Therefore, X is constructed from the inputs of each chamber.

We have:

d = size of the DNA template being sought.
n = number of ZMW chambers (there are as many as 
150,000 ZMW chambers on a single SMRT Cell)
m = size of the maximum length read  (using the PacBio technology, 
they are ~8.5 kilobases long, but it could applied to
shorter reads ~200b - aka 2nd second generation -)

we are solving for:

Y = P X
Y is a matrix with m rows and n columns
P is a matrix with m rows and dxn columns
X is a matrix with dxn rows and n columns

To make the idea firmer, let us suppose our imaginary DNA template is actually composed of only three bases:


Now let us use PacBio tech that has only two ZMW chambers and we get these reads:


So in our Y = P X setting we have (using Matlab's notation):

d = 3
n = 2
m = 2

Y = [ a b;
        b c ]

X = [ a 0;
         b 0;
         c 0;
         0 a;
         0 b;
         0 c ]

with P = [1 0 0 0 1 0;
               0 1 0 0 0 1]

The first three columns of P have grouped 1s near the beginning and the last three columns have their 1s at the end of the rows, hence the reference to "group" sparsity.

X is not low rank in this instance but it has a very repetitive structure that can surely can be expressed as a low rank constraint. For instance, we could write it as X = I x T where x is the kronecker product, I the identity and T the template we are looking for. We could also make the statement that for any random sampling Pc, we have the matrix P1= [Pc Pc] with P1 X a rank-1 matrix. In the previous example, for instance we had two Pcs that took the form of either

[1 0 0 ;
0 1 0 ]


[ 0 1 0;
0 0 1]

hence P1= [Pc Pc] with the first Pc example would look like:

P1 = [1 0 0 1 0 0 ;
         0 1 0 0 1 0]

Actually nothing stops us from using Pcs that have real numbers (instead of integers) when constraining the rank of P1 X. In fact by putting several random samplings together, we could have something like

PN = [Pc2 Pc2;
          Pc3 Pc3;
          Pc4 Pc4;
         Pc56 Pc56]

And seek X such that PN X is rank-1.

In summary, we are searching for a matrix factorization of the following kind:

Find P and X s.t.
Y = P X 
PN X is rank 1 (or low rank) 
P is sparse and made up of 0/1s.

Let us note two or three things:

The size of these matrices, P, X and PN is large. A small bacterial genome has millions of nucleotides, while mammalians have billions. Also, the number of chambers (outputting reads as the random-length block sample of the template DNA) as well as the read length will increase. So these two combined will increase the size of those matrices. A good question is whether our computational/storage power will increase at the same pace, or whether we can use random projections or exploit the structure of this problem even more. Compressive sensing and related fields show that the larger the dimension the better random projections work, so counterintuitively, the fact that the problem is intractable because of its size may become interesting especially since there is so much structure to use in the first place.

One might argue that for example with d being the length of the template DNA, and m being the average read length, we need at least n reads to have, with high probability, perfect reconstruction (see the work of David Tse and his post-doc below). However, we could probably do better than that as it has been shown in other problems with group sparsity or a good dictionary. It so happens that the sequencing already provides elements of the dictionary. Hence exploiting that structure could make the problem require less ZMW cells or provide more certainty on the final template. 

The second obstacle is related to the length of the template DNA. In many de novo sequence alignments, the original length is unknown and the size of P, PN and X in this case. This is indeed a good issue to look into. 

M.R. then pointed me to the work of David Tse, which include those papers/preprints:

From the last three presentations (the ISIT tutorial is pretty complete), one can read the following slides of interest: 

First there is this figure showing the current phase transition of the problem. Indeed if the read lengths are too small then any amount of them will make it difficult or impossible to piece a template together. I note that one of the early known way to tackle this problem was through a greedy algorithm:

and that some of these algorithms are not even optimal with regards to the lowest bounds (a situation we have seen in compressive sensing)

except for a deBruijn based algorithm (improved further by the MULTIBRIDGING algorithm described here)

Let us note from the ISIT slides, the issue of assembly from short reads (one of them uses a LASSO algorithm)

From all this, here is what we could say:

  • the long read technology enabled by PacBio or nanopore means that for most sequencing we are already above the threshold where even a greedy algorithm can do well. This is probably the most noteworthy aspect of this research. In effect, there is no need for "informed" combinatorial search to get to some results.
  • a matrix factorization may be optimal if it can do better than MULTIBRIDGE or the greedy algorithms out there. And for that, one would have to use additional information on the DNA sequence by using a group sparsity argument. It certainly could provide some sense of the real phase transition of some of these problems and provide a more direct way of figuring out the noise issue like it does in plain vanilla compressive sensing. Because the factorization is set up in a general fashion, it could also allow one to investigate new instrumentations (see the randomization mentioned above) as well as investigate whether an AMP style algorithm could push the phase transition bounds further.
  • I note the use of sparsity (see later slides of High Throughput Sequencing: Microscope in a Big Data Era )
  • I wonder if a connection could be made with the partial digest problem (with its connection to phase retrieval).

Thank you M.R. for this very interesting and enlightning discussion !

For further reading and implementations:

Information Theory of DNA Shotgun Sequencing by Abolfazl MotahariGuy BreslerDavid Tse

DNA sequencing is the basic workhorse of modern day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit to the performance of {\em any} assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomena in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient to reconstruct. The threshold is computed in terms of the Renyi entropy rate of the DNA sequence. We also study the impact of noise in the read process on the performance.

No comments: