Nuit Blanche: CS: Compressed Genotyping, DNA Sudoku - Harnessing high throughput sequencing for multiplexed specimen analysis

Tuesday, September 08, 2009

CS: Compressed Genotyping, DNA Sudoku - Harnessing high throughput sequencing for multiplexed specimen analysis

Yaniv Erlich just sent me the following:

We came up with a framework, named compressed genotyping, of using compressed sensing for detecting carriers of rare genetic diseases, which somehow resembles the interesting work of Noam Shomron et al. (http://nuit-blanche.blogspot.com/2009/09/cs-rare-allele-detection-using.html).

In addition to the theoretical framework (found here), we have already tested large parts of it in a real experiment in which we genotyped more than 40,000 bacterial colonies in one run of DNA sequencing platform.
For comparison, regular (non-compressed) genotyping techniques can only process few hundreds samples in one run, which emphasizes the rigor of compressed sensing! The experimental work was published in Genome Research (July 2009), and a pre-print is found here.

In addition, I am going to present some aspects of compressed genotpying in Allerton 2009.

We would appreciate comments on our work. It seems that some aspects of DNA Sequencing has intriguing connections to compressed sensing.

Thanks Yaniv !

The first paper is: Compressed Genotyping by Yaniv Erlich, Assaf Gordon, Michael Brand, Gregory J. Hannon and Partha P. Mitra. The abstract reads:

Significant volumes of knowledge have been accumulated in recent years linking subtle genetic variations to a wide variety of medical disorders from Cystic Fibrosis to mental retardation. Nevertheless, there are still great challenges in applying this knowledge routinely in the clinic, largely due to the relatively tedious and expensive process of DNA sequencing. Since the genetic polymorphisms that underlie these disorders are relatively rare in the human population, the presence or absence of a disease-linked polymorphism can be thought of as a sparse signal. Using methods and ideas from compressed sensing and group testing, we have developed a cost-effective genotyping protocol. In particular, we have adapted our scheme to a recently developed class of high throughput DNA sequencing technologies, and assembled a mathematical framework that has some important distinctions from ’traditional’ compressed sensing ideas in order to address different biological and technicalconstraints.

This paper uses a Belief Propagation Reconstruction algorithm and makes the point that fewer measurement is not the only cost function.

The second paper is: DNA Sudoku - Harnessing high throughput sequencing for multiplexed specimen analysis by Yaniv Erlich, Kenneth Chang, Assaf Gordon, Roy Ronen, Oron Navon, Michelle Rooks, and Gregory J. Hannon. The abstract reads:

Next-generation sequencers have sufficient power to analyze simultaneously DNAs from many different specimens, a practice known as multiplexing. Such schemes rely on the ability to associate each sequence read with the specimen from which it was derived. The current practice of appending molecular barcodes prior to pooling is practical for parallel analysis of up to many dozen samples. Here, we report a strategy that permits simultaneous analysis of tens of thousands of specimens. Our approach relies on the use of combinatorial pooling strategies in which pools rather than individual specimens are assigned barcodes. Thus, the identity of each specimen is encoded within the pooling pattern rather than by its association with a particular sequence tag. Decoding the pattern allows the sequence of an original specimen to be inferred with high confidence. We verified the ability of our encoding and decoding strategies to accurately report the sequence of individual samples within a large number of mixed specimens in two ways. First, we simulated data both from a clone library and from a human population in which a sequence variant associated with cystic fibrosis was present. Second, we actually pooled, sequenced, and decoded identities within two sets of 40,000 bacterial clones comprising ~20,000 different artificial MicroRNAs targeting Arabidopsis or human genes. We achieved greater than 97% accuracy in these trials. The strategies reported here can be applied to a wide variety of biological problems, including the determination of genotypic variation within large populations of individuals.

Here are some items that pop-up my mind but I am sure there are others:

The article has no mention of any CS papers or CS pooling papers, it looks like a huge opportunity to bring some knowledge that has built up in the CS domain recently.
In an offline conversation, Nicolas Thierry-Mieg pointed out to me that this Sudoku approach does not seem closely related to the sudocode approach of Dror Baron, Rich Baraniuk and Shriram Sarvotham in Sudocodes: Fast Measurement and Reconstruction of Sparse Signals. Dror tells me that they are improving on this Sudocode code.
Page 42 makes a tangent note to quantization. A similar issue was mentioned in the Rare-Allele Detection Using Compressed Se(que)nsing paper. At some point, unless reconstruction is exact and algebraic, we are going to have to face this issue of quantization. Some work has been performed in this arena, check the blog with that keyword.
Maybe, at some point we need a similar table to that of Piotr Indyk released his Tutorial on Compressed Sensing (or Compressive Sampling or Linear Sketching), specific for pooling arguments in the bio world.