Improved Constructions for Non-adaptive Threshold Group Testing
February 15, 2012 3:45 pm - 4:15 pm
Keywords of the presentation: Group testing, Explicit constructions
The basic goal in combinatorial group testing is to identify a set of up to d defective items within a large population of size n >> d using a pooling strategy. Namely, the items can be grouped together in pools, and a single measurement would reveal whether there are one or more defectives in the pool. The threshold model is a generalization of this idea where a measurement returns positive if the number of defectives in the pool reaches a fixed upper threshold u, negative if this number is at most a fixed lower threshold ℓ < u, and may behave arbitrarily otherwise.
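To make the measurement model concrete, here is a minimal simulation sketch in Python; the thresholds u and ℓ behave as defined above, while the pool layout and the coin-flip modelling of the arbitrary gap behaviour are illustrative assumptions.

```python
import random

def threshold_test(pool, defectives, u, lo):
    """Outcome of one threshold group test.

    Returns 1 (positive) if the pool contains at least u defectives,
    0 (negative) if it contains at most lo defectives, and an arbitrary
    value (modelled here as a coin flip) when lo < count < u.
    """
    count = len(set(pool) & set(defectives))
    if count >= u:
        return 1
    if count <= lo:
        return 0
    return random.randint(0, 1)  # arbitrary behaviour in the gap

# Tiny example: defectives {2, 5, 7}, thresholds u=2, lo=0.
print(threshold_test(pool=[1, 2, 3, 4], defectives=[2, 5, 7], u=2, lo=0))
```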
Nicolas Thierry-Mieg - Centre National de la Recherche Scientifique (CNRS) http://membres-timc.imag.fr/Nicolas.Thierry-Mieg/
Shifted Transversal Design Smart-pooling: increasing sensitivity, specificity and efficiency in high-throughput biology
February 15, 2012 9:00 am - 10:00 am
Keywords of the presentation: combinatorial group testing, smart-pooling, interactome mapping
Group testing, also known as smart-pooling, is a promising strategy for achieving high efficiency, sensitivity, and specificity in systems-level projects. It consists of assaying well-chosen pools of probes, such that each probe is present in several pools and hence tested several times. The goal is to construct the pools so that the positive probes can usually be identified directly from the pattern of positive pools, despite the occurrence of false positives and false negatives. While striving for this goal, two interesting mathematical or computational problems emerge: the pooling problem (how should the pools be designed?) and the decoding problem (how should the outcomes be interpreted?). In this talk I will discuss these questions and the solutions we have proposed: a flexible and powerful combinatorial construction for designing smart-pools (the Shifted Transversal Design, STD), and an efficient exact algorithm for interpreting results (interpool). I will then present the results of validation experiments that we have performed in the context of yeast two-hybrid interactome mapping.
The data stream model has emerged as a way of analyzing algorithmic efficiency in the presence of massive data sets. Typically the algorithm is allowed a few (usually one) passes over the input, must use limited memory, and must have very fast per-item processing time. I will give a survey of algorithms and lower bounds in this area, with an emphasis on problems such as norm estimation, numerical linear algebra, empirical entropy, l_p-sampling, and heavy hitters. Time permitting, I'll also discuss the extension to functional monitoring, in which there are multiple sites, each with a data stream, and a central coordinator should continuously maintain a statistic over the union of the data streams.
Reconstruction of bacterial communities using sparse representation
February 17, 2012 11:30 am - 12:30 pm
Keywords of the presentation: metagenomics, sparse representation, compressed sequencing
Determining the identities and frequencies of species present in a sample is a central problem in metagenomics, with scientific, environmental and clinical implications.
A popular approach to the problem is sequencing the ribosomal 16S RNA gene in the sample using universal primers, and using variation in the gene's sequence between different species to identify the species present in the sample. We present a novel framework for community reconstruction based on sparse representation: while millions of microorganisms are present on earth, with known 16S sequences stored in a database, only a small minority (typically a few hundred) are likely to be present in any given sample, so the vector of species frequencies to be reconstructed is sparse.
We discuss the statistical framework, algorithms used and results in terms of accuracy and species resolution.
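The following sketch illustrates the general sparse-representation idea described above; the synthetic database matrix, the mixture model, and the use of non-negative least squares with thresholding are illustrative assumptions, not the speakers' actual algorithm.

```python
import numpy as np
from scipy.optimize import nnls

# Columns of A: reference "signatures" of known 16S sequences (e.g., variant
# or k-mer profiles); y: the corresponding profile measured on the mixed
# sample.  Only a few species are truly present, so the frequency vector x
# is expected to be sparse.
rng = np.random.default_rng(0)
n_species, n_features, n_present = 200, 300, 5
A = rng.random((n_features, n_species))
x_true = np.zeros(n_species)
support = rng.choice(n_species, n_present, replace=False)
x_true[support] = 0.2 + 0.8 * rng.random(n_present)
y = A @ x_true + 0.01 * rng.standard_normal(n_features)

# Non-negative least squares estimates frequencies; small entries are then
# thresholded away, a simple stand-in for a proper sparsity-promoting
# estimator.  With low noise the surviving support typically matches.
x_hat, _ = nnls(A, y)
x_hat[x_hat < 0.05] = 0.0
print("recovered support:", sorted(np.flatnonzero(x_hat)))
print("true support:     ", sorted(support))
```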
From screening clone libraries to detecting biological agents
February 17, 2012 10:15 am - 11:15 am
Keywords of the presentation: (generalized) group testing, screening clone libraries, DNA microarrays
Group testing has made many contributions to modern molecular biology. In the Human Genome Project, large clone libraries were created to amplify DNA. Widely used group testing schemes vastly accelerated the experimental detection of overlaps between the individual clones in these libraries, realizing savings in both effort and materials.
Modern molecular biology has also contributed to group testing. The problem of generalized group testing (in the combinatorial sense) arises naturally when one uses oligonucleotide probes to identify biological agents present in a sample. In this setting a group testing design cannot be chosen arbitrarily: the possible columns of a group testing design matrix are prescribed by the biology, namely by the hybridization reactions between target DNA and probes.
Genomic Privacy and the Limits of Individual Detection in a Pool
February 15, 2012 10:15 am - 11:15 am
Keywords of the presentation: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality
Statistical power to detect associations in genome-wide association studies can be enhanced by combining data across studies in meta-analysis or replication studies. Such methods require data to flow freely in the scientific community, however, and this raises privacy concerns.
Until recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However, a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data. To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.
To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.
Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.
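As a schematic illustration of the hypothesis-testing viewpoint (not necessarily the authors' exact statistic), a log-likelihood-ratio detector for "is this individual in the pool?" might compare, at each exposed SNP, the pool allele frequency against a reference population frequency; the Bernoulli-per-allele model below is an assumption made for the sketch.

```python
import numpy as np

def log_likelihood_ratio(genotype, pool_freq, pop_freq):
    """Schematic per-individual LR statistic for pool-presence detection.

    genotype : allele counts in {0, 1, 2} for the candidate individual
    pool_freq: allele frequencies reported for the pool
    pop_freq : allele frequencies in a reference population

    Larger values favour H1 ("individual is in the pool") over H0.  This is
    an illustrative model, not the exact statistic analyzed in the talk.
    """
    g = np.asarray(genotype, dtype=float)
    p = np.clip(np.asarray(pool_freq), 1e-6, 1 - 1e-6)
    q = np.clip(np.asarray(pop_freq), 1e-6, 1 - 1e-6)
    return np.sum(g * np.log(p / q) + (2 - g) * np.log((1 - p) / (1 - q)))

# Thresholding this statistic at a value chosen for a target false-positive
# rate gives the Neyman-Pearson-optimal test under this model; its power
# upper-bounds what any detection method could achieve in the same setting.
```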
Weighted Pooling - Simple and Effective Techniques for Pooled High Throughput Sequencing Design
February 15, 2012 3:15 pm - 3:45 pm
Despite the rapid decrease in sequencing costs, sequencing a large group of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of lanes, thus dramatically reducing the cost of such large scale genotyping projects. These approaches all use combinatorial pooling designs where each individual is either present in a pool or absent from it. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers.
We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool indicates not only the number of carriers but also their identity.
We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.
We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.
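As a toy illustration of the weighting intuition above (a caricature under strong assumptions, not the design actually studied in the talk): if the individuals in a single pool donate DNA in distinct power-of-two proportions and carry at most one mutant allele each, the total mutant signal alone identifies exactly which individuals are carriers.

```python
def decode_single_pool(carrier_signal, n_individuals):
    """Toy decoder for one weighted, non-overlapping pool.

    Assume individual i donates DNA in proportion 2**i and carries at most
    one mutant allele, so the noise-free mutant signal is the sum of 2**i
    over all carriers i.  The binary expansion of that sum then reveals the
    carrier identities directly -- a caricature of how weighted designs let
    read counts encode identity.
    """
    return [i for i in range(n_individuals) if (carrier_signal >> i) & 1]

# Carriers {0, 3}: signal = 2**0 + 2**3 = 9
print(decode_single_pool(9, n_individuals=8))  # -> [0, 3]
```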
Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space
February 14, 2012 3:15 pm - 3:45 pm
Keywords of the presentation: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing
We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and that the BAC assemblies have good quality (the total length of the assembled contigs is, on average, about 88% of the estimated BAC length).
Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino), M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google), G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR), J. Resnik (UCR), and T. J. Close (UCR).
Preprint available at http://arxiv.org/abs/1112.4438
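A minimal sketch of the read-deconvolution step described above, under simplifying assumptions (each BAC has a known pool signature from the combinatorial design, and each read is observed in some subset of pools): the read is assigned to the BAC whose signature best matches the pools in which it appears. The exact signatures, scoring, and error handling in the actual protocol differ.

```python
def assign_read(read_pools, bac_signatures):
    """Assign a read to the BAC whose pool signature best explains it.

    read_pools     : set of pool IDs in which the read (or its k-mers) occurs
    bac_signatures : dict mapping BAC ID -> set of pool IDs holding that BAC

    A simplistic score counts the overlap between the read's pools and each
    BAC's signature; ties or weak matches would be discarded in practice.
    """
    best_bac, best_score = None, -1
    for bac, sig in bac_signatures.items():
        score = len(read_pools & sig)
        if score > best_score:
            best_bac, best_score = bac, score
    return best_bac

signatures = {"BAC_A": {0, 3, 7}, "BAC_B": {1, 3, 8}, "BAC_C": {2, 5, 7}}
print(assign_read({0, 3, 7}, signatures))  # -> "BAC_A"
```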
Poster - Upgraded Separate Testing of Inputs in Compressive Sensing
February 14, 2012 4:15 pm - 5:30 pm
Screening experiments (SE) deal with finding a small number s of Active Inputs (AIs) out of a vast total amount t of inputs in a regression model. Of special interest in SE theory is finding the so-called maximal rate (capacity). For a set of t elements, denote the set of its s-subsets by (s, t). Introduce a noisy system with t binary inputs and one output y, and suppose that only s of the inputs are active; let N_f(s, t, ε) denote the minimal number of observations an analysis method f needs to recover the active inputs with error probability at most ε > 0. The f-capacity C_f(s) = lim_{t→∞} (log t)/N_f(s, t, ε), for any ε > 0, is the ‘limit for the performance of the f-analysis’. We obtained tight capacity bounds [4], [5] by formalizing CSS as a special case of the Multi-Access Communication Channel (MAC) capacity region construction for information transmission developed by R. Ahlswede in [1], and by comparing the CSS maximal rate (capacity) with small error for two practical methods of output analysis under the optimal CSS design, motivated by applications like [2]. Recovering Active Inputs with small error probability and accurate parameter estimation are both possible with rates less than capacity and impossible with larger rates. A staggering amount of attention was recently devoted to the study of compressive sensing and related areas using sparse priors in overparameterized linear models, which may be viewed as a special case of our models with continuous input levels. The threshold phenomenon was empirically observed in early papers [3], [6]: as the dimension of a random instance of a problem grows, there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. Finding this threshold is closely related to our capacity evaluation. Some threshold bounds for compressive sensing were obtained using standard information-theoretic tools, e.g. in [15].
Greedy Separate Testing Sparse Inputs
February 16, 2012 3:15 pm - 3:45 pm
A staggering amount of attention was recently devoted to the study of compressed sensing and related areas using sparse priors in overparameterized linear models under versions of linear programming (LP) analysis. More recently, the sparsity approach has also gained popularity for the classical model of group testing. The threshold phenomenon was empirically observed in early papers of Donoho et al.: as the dimension of a random instance of a problem grows, there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. There are strong parallels between this threshold behavior and Shannon's earlier explanation of a similar phenomenon in terms of the capacity of information transmission. We outline sharp asymptotic bounds for general models, obtained by 1979 and inspired by the Multi-Access Capacity Region construction, that can greatly enhance the current understanding of recovery of active inputs in sparse compressive sensing. These bounds were obtained for asymptotically optimal designs, which include random designs, and thus imply upper bounds on the performance of recovery under arbitrary designs. These optimal designs are a natural choice in settings where the experiment can be predesigned, and in addition they provide competitive performance for a simplified analysis method: separate testing of inputs (STI). STI originated in 1959 and can be further traced back to Fisher's idea of randomization in estimation. The STI capacity exceeds that of LP relaxation for randomized designs in a wide range of models and admits several straightforward generalizations, including a significantly more powerful and still computationally feasible greedy version. This will be demonstrated in our simulations.
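A hedged sketch of what separate testing of inputs can look like in a sparse linear model (the scoring rule and the greedy refinement below are illustrative assumptions; the talk's actual STI analysis and its greedy generalization are more refined):

```python
import numpy as np

def separate_testing(X, y, s):
    """Score each input separately, then keep the s best.

    Each column of the design X is tested on its own by correlating it with
    the output y -- no joint optimization over inputs, which is what makes
    separate testing so cheap.
    """
    scores = np.abs(X.T @ y)
    return np.argsort(scores)[-s:]

def greedy_sti(X, y, s):
    """Greedy variant: select one input at a time against the residual."""
    selected, residual = [], y.copy()
    for _ in range(s):
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf
        j = int(np.argmax(scores))
        selected.append(j)
        # Crude residual update: remove the least-squares fit of column j.
        coef = (X[:, j] @ residual) / (X[:, j] @ X[:, j])
        residual = residual - coef * X[:, j]
    return selected
```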
Olgica Milenkovic - University of Illinois at Urbana-Champaign http://www.ece.illinois.edu/directory/profile.asp?milenkov
Probabilistic and combinatorial models for quantized group testing
February 15, 2012 11:30 am - 12:00 pm
Keywords of the presentation: group testing, MAC channel, quantization, graphical models
We consider a novel group testing framework where defectives obey a probabilistic model in which the number and set of defectives are governed by a graphical model. We furthermore assume that the defectives have importance factors that influence how strongly they affect the test outcomes and the accuracy with which the outcomes are read. This kind of scenario arises, for example, in MAC channels with networked users using different communication powers, or in DNA pooling schemes that involve large families.
Sparser Johnson-Lindenstrauss Transforms
February 16, 2012 10:15 am - 11:15 am
Keywords of the presentation: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data
The Johnson-Lindenstrauss (JL) lemma (1984) states that any n points in d-dimensional Euclidean space can be embedded into k = O((log n)/eps^2) dimensions so that all pairwise distances are preserved up to 1+/-eps. Furthermore, this embedding can be achieved via a linear mapping. The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, reduces the amount of storage required to store a dataset, and can be used to reduce memory required for numerical linear algebra problems such as linear regression and low rank approximation.
The original proofs of the JL lemma let the linear mapping be specified by a random dense k x d matrix (e.g. i.i.d. Gaussian entries). Thus, performing an embedding requires dense matrix-vector multiplication. We give the first construction of linear mappings for JL in which only a subconstant fraction of the embedding matrix is non-zero, regardless of how eps and n are related, thus always speeding up the embedding time. Previous constructions only achieved sparse embedding matrices for 1/eps >> log n.
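For intuition, here is a hedged sketch of a sparse JL-style embedding with s nonzero entries per column (random signs scaled by 1/sqrt(s)); the parameters below are illustrative assumptions and do not reproduce the exact construction or its analysis from the talk.

```python
import numpy as np

def sparse_jl_matrix(k, d, s, rng):
    """Random k x d embedding matrix with exactly s nonzeros per column.

    Each column gets s rows chosen uniformly at random; the nonzero entries
    are random signs scaled by 1/sqrt(s), so the expected squared norm of an
    embedded vector equals its original squared norm.
    """
    A = np.zeros((k, d))
    for j in range(d):
        rows = rng.choice(k, size=s, replace=False)
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

rng = np.random.default_rng(1)
k, d, s = 256, 10_000, 8
A = sparse_jl_matrix(k, d, s, rng)
x = rng.standard_normal(d)
print(np.linalg.norm(A @ x) / np.linalg.norm(x))  # close to 1 for most x
```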
Network topology as a source of biological information
February 14, 2012 10:15 am - 11:15 am
Keywords of the presentation: biological networks, graph algorithms, network alignment
Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) provides additional insight into the inner workings of the cell, over and above the sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before it is integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.
Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms (which we call GRAAL algorithms) that produces by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furthermore, we demonstrate that NA quality improves with the integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN, suggesting broad similarities in internal cellular wiring across all life on Earth. We also demonstrate that the topology around cancer and non-cancer genes is different and that, when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.
Sylvie Ricard-Blum - Université Claude-Bernard (Lyon I) http://www.ibcp.fr/scripts/affiche_detail.php?n_id=174
A dynamic and quantitative protein interaction network regulating angiogenesis
February 16, 2012 3:45 pm - 4:15 pm
Keywords of the presentation: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance
Angiogenesis, the formation of new blood vessels from preexisting ones, is of crucial importance in pathological situations such as cancer and diabetes and is a therapeutic target for these two diseases. We have developed protein and sugar arrays probed by surface plasmon resonance (SPR) imaging to identify binding partners of proteins, polysaccharides and receptors regulating angiogenesis. Interactions collected from our own experiments and from literature curation have been used to build a network comprising protein-protein and protein-polysaccharide interactions. To switch from a static to a dynamic and quantitative interaction network, we have measured the kinetics and affinity of interactions by SPR to discriminate transient from stable interactions and to prioritize interactions in the network. We have also identified protein interaction sites, either experimentally or by molecular modeling, to discriminate interactions occurring simultaneously from those taking place sequentially. The ultimate step is to integrate, using bioinformatics tools, all these parameters in the interaction network together with inhibitors of these interactions and with gene and protein expression data available from ArrayExpress or Gene Expression Omnibus and from the Human Protein Atlas. This dynamic network will allow us to understand how angiogenesis is regulated in a concerted fashion via several receptors and signaling pathways, to identify interactions crucial for the integrity of the network that are potential therapeutic targets, and to predict the side effects of anti-angiogenic treatments.
Tutorial: Group Testing and Coding Theory
February 14, 2012 9:00 am - 10:00 am
Keywords of the presentation: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery
Group testing was formalized by Dorfman in his 1943 paper and was originally used in World War II to identify soldiers with syphilis. The main insight in this application is that blood samples from different soldiers can be combined to check whether at least one of the soldiers in the pool has the disease. Since then, group testing has found numerous applications in many areas such as (computational) biology, combinatorics and (theoretical) computer science.
The theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.
Kautz and Singleton connected these two areas in their 1964 paper by using "code concatenation" to design good group testing schemes. All of the (asymptotically) best known explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the "decoding" problem for group testing: i.e., given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithms require sub-linear time decoding, which is not guaranteed by the traditional constructions.
The talk will first survey the Kautz-Singleton construction and then show how recent developments in list decoding of codes lead, in a modular way, to sub-linear time decodable group testing schemes.
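A small illustrative sketch of the Kautz-Singleton recipe (concatenating a Reed-Solomon outer code with the identity inner code): polynomials of degree < k over a prime field GF(q) are evaluated at every field element, and each symbol is replaced by a unit vector, yielding a q·q by q^k binary test matrix. The field size and parameters here are toy assumptions.

```python
from itertools import product

import numpy as np

def kautz_singleton(q, k):
    """Binary test matrix from Reed-Solomon codewords over GF(q), q prime.

    Each column corresponds to a polynomial of degree < k over GF(q); the
    polynomial is evaluated at all q field points, and every symbol a is
    replaced by the length-q unit vector e_a.  Distinct polynomials agree in
    at most k-1 evaluation points, which gives the matrix its disjunctness
    (union-detection) property.
    """
    M = np.zeros((q * q, q ** k), dtype=int)
    for col, coeffs in enumerate(product(range(q), repeat=k)):
        for x in range(q):
            symbol = sum(c * pow(x, i, q) for i, c in enumerate(coeffs)) % q
            M[x * q + symbol, col] = 1
    return M

# q=5, k=3: 25 tests for 125 items; columns pairwise intersect in at most
# 2 rows, so the matrix is 2-disjunct and up to 2 defectives are identifiable.
M = kautz_singleton(q=5, k=3)
print(M.shape)  # (25, 125)
```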
Vyacheslav V. Rykov - University of Nebraska http://www.unomaha.edu/~wwwmath/faculty/rykov/index.html
Superimposed Codes and Designs for Group Testing Models
February 14, 2012 11:30 am - 12:30 pm
Keywords of the presentation: superimposed codes, screening designs, rate of code, threshold designs and codes
We will discuss superimposed codes and non-adaptive group testing designs arising from the potential of compressed genotyping models in molecular biology. The survey is also motivated by the 30th anniversary of our recurrent upper bound on the rate of superimposed codes, published in 1982.
Testing Boolean functions
February 14, 2012 2:00 pm - 3:00 pm
I will talk about property testing of Boolean functions, concentrating on two topics: testing monotonicity (an example of a combinatorial property) and testing the property of being a low-degree polynomial (an algebraic property).
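For the monotonicity part, here is a hedged sketch of the standard "edge tester" (a natural example of such a property test, not necessarily the exact algorithm covered in the talk): sample random edges of the hypercube and reject if any edge violates monotonicity.

```python
import random

def monotonicity_edge_tester(f, n, trials):
    """Property test for monotonicity of f: {0,1}^n -> {0,1}.

    Repeatedly pick a random point and a random coordinate, compare the
    function value with the bit set to 0 versus 1, and reject if the value
    decreases.  Monotone functions are always accepted; functions far from
    monotone are rejected with constant probability for trials ~ O(n/eps).
    """
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        i = random.randrange(n)
        lo, hi = list(x), list(x)
        lo[i], hi[i] = 0, 1
        if f(tuple(lo)) > f(tuple(hi)):
            return "reject"
    return "accept"

# Example: the majority function on 3 bits is monotone.
print(monotonicity_edge_tester(lambda x: int(sum(x) >= 2), n=3, trials=100))
```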
Mining Rare Human Variations using Combinatorial Pooling
February 13, 2012 3:15 pm - 3:45 pm
Keywords of the presentation: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing
Finding rare genetic variations in large cohorts requires tedious preparation of large numbers of specimens for sequencing. We are developing a solution, called DNA Sudoku, to reduce prep time and increase the throughput of samples. By using a combinatorial pooling approach, we multiplex specimens and then barcode the pools, rather than individuals, for sequencing in a single lane on the Illumina platform.
We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce the coefficient of variation (CV). For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from the sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.
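As a rough sketch of how a combinatorial pooling design can pinpoint a single carrier (a Chinese-remainder-style layout in the spirit of DNA Sudoku; the actual design, its redundancy, and the handling of multiple carriers are more involved):

```python
from math import gcd

def crt_pools(n, m1, m2):
    """Assign each of n specimens to one pool in each of two pooling windows.

    Specimen i goes into pool (i mod m1) of window 1 and pool (i mod m2) of
    window 2.  If m1 and m2 are coprime and m1*m2 >= n, the pair of positive
    pools uniquely identifies a single carrier (Chinese Remainder Theorem).
    """
    assert gcd(m1, m2) == 1 and m1 * m2 >= n
    return [((i % m1), (i % m2)) for i in range(n)]

def decode_single_carrier(positive1, positive2, n, m1, m2):
    """Find the specimen indices consistent with the positive pools."""
    return [i for i in range(n) if i % m1 == positive1 and i % m2 == positive2]

assignments = crt_pools(n=96, m1=10, m2=11)       # 21 pools for 96 specimens
carrier = 57
p1, p2 = assignments[carrier]
print(decode_single_carrier(p1, p2, 96, 10, 11))  # -> [57]
```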
There are many instances in Coding Theory when codewords must be restored from partial information, such as corrupted data (error-correcting codes) or some superposition of the strings. These lead to superimposed codes, a close relative of group testing problems.
There are many versions and related problems, such as Sidon sets, sum-free sets, union-free families, locally thin families, cover-free codes and families, etc. We discuss two cases: cancellative and union-free codes.
A family of sets Ƒ (and the corresponding code of 0-1 vectors) is called union-free if A ∪ B = C ∪ D and A, B, C, D ∈ Ƒ imply {A, B} = {C, D}. Ƒ is called t-cancellative if for all distinct t + 2 members A1, …, At and B, C ∈ Ƒ,
A1 ∪ … ∪ At ∪ B ≠ A1 ∪ … ∪ At ∪ C.
Let ct(n) be the size of the largest t-cancellative code on n elements. We significantly improve the previous upper bounds of Körner and Sinaimeri; e.g., we show c2(n) ≤ 2^{0.322n} (for n > n0).
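A brute-force checker for the union-free property defined above (an illustrative utility only; it does not reflect the talk's proof techniques):

```python
from itertools import combinations, combinations_with_replacement

def is_union_free(family):
    """Brute-force check of the union-free condition: A ∪ B = C ∪ D
    (with A, B, C, D in the family) forces {A, B} = {C, D}.

    Exponential in the family size; only meant to make the definition
    concrete on small examples.
    """
    pairs = list(combinations_with_replacement(family, 2))
    for (A, B), (C, D) in combinations(pairs, 2):
        if A | B == C | D and {A, B} != {C, D}:
            return False
    return True

F = [frozenset(s) for s in ({1, 2}, {2, 3}, {3, 4})]
print(is_union_free(F))                          # True
print(is_union_free(F + [frozenset({1, 3})]))    # adding {1,3} breaks it
```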
Competitive Testing for Evaluating Priced Functions
February 15, 2012 2:00 pm - 3:00 pm
Keywords of the presentation: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs
The problem of evaluating a function by sequentially testing a subset of variables whose values uniquely identify the function’s value arises in several domains of computer science. A typical example comes from the field of automatic diagnosis in Artificial Intelligence. Automatic diagnosis systems employ a sequence of tests and, based on their outcomes, provide a diagnosis (e.g., a patient has cancer or not, a component of a system is failing or not). In general, it is not necessary to perform all the available tests in order to determine the correct diagnosis. Since different tests might also incur very different costs, it is reasonable to look for testing procedures that minimize the cost incurred to produce the diagnosis. Another example is query optimization, a major issue in the area of databases. Query optimization refers to the problem of defining strategies to evaluate queries in order to minimize the user response time. In a typical database query, thousands of tuples are scanned to determine which of them match the query. By carefully defining a strategy to evaluate the attributes, one can save a considerable amount of computational time. In general, it is not necessary to evaluate all attributes of a tuple in order to determine whether it matches the query, and a smart choice of the attributes to evaluate first can avoid very expensive and/or redundant attribute evaluations. This talk will focus on the function evaluation problem where tests have non-uniform costs and competitive analysis is employed to measure an algorithm's performance. For monotone Boolean functions we will present an optimal algorithm, i.e., one with the best possible competitiveness. We will also discuss the case of some classes of non-monotone functions.
Length Reduction via Polynomials
February 16, 2012 11:30 am - 12:30 pm
Keywords of the presentation: sparse convolutions, polynomials, finite fields, length reduction
Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or the Walsh Transform, are a useful tool in many applications and can be computed efficiently. In the last decade, several problems have required efficient computation of sparse binary convolutions.
Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms is length reduction: the sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to date has been the modulo function, since it preserves the locations (of the ”1” bits) up to cyclic shift.
In this paper we present a new method for length reduction - polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.
(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)
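A hedged sketch of the baseline modulo-based length reduction mentioned above (illustrative parameters; it glosses over collisions and the handling of cyclic shifts that the actual algorithms must address): a sparse vector of length N is folded to length p by summing entries whose indices agree mod p, and the cyclic convolution of the folded vectors equals the fold of the full convolution.

```python
import numpy as np

def fold(x, p):
    """Map a long sparse vector to length p by summing indices mod p."""
    out = np.zeros(p, dtype=x.dtype)
    for i in np.flatnonzero(x):
        out[i % p] += x[i]
    return out

def cyclic_convolve(a, b):
    """Cyclic convolution of two length-p vectors via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# Key identity behind the length-reduction approach: the cyclic convolution
# of the folded vectors equals the fold (mod p) of the full convolution, so
# a sparse convolution of length N can be probed with FFTs of length p << N.
N, p = 10_000, 13
rng = np.random.default_rng(2)
x = np.zeros(N)
h = np.zeros(N)
x[rng.choice(N, 3, replace=False)] = 1.0
h[rng.choice(N, 3, replace=False)] = 1.0
lhs = cyclic_convolve(fold(x, p), fold(h, p))
rhs = fold(np.convolve(x, h), p)
print(np.allclose(lhs, rhs))  # True
```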
RNA Structure Characterization from High-Throughput Chemical Mapping Experiments
February 13, 2012 3:45 pm - 4:45 pm
Keywords of the presentation: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation
New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet the available structure characterization techniques that are reliable are severely limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations thus pose a major barrier to the comprehensive determination of structure from sequence and thereby to the development of a mechanistic understanding of transcriptome dynamics. To address this need, we have recently developed a high-throughput structure characterization technique, called SHAPE-Seq, which simultaneously measures quantitative, single-nucleotide-resolution secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline. Previous bioinformatics methods, in contrast, are laborious, heuristic, and expert-based, and thus prohibit high-throughput chemical mapping.
In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment coupled with a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.
This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.