Thursday, June 07, 2012

What is Faster than Moore's Law and Why You Should Care

The rise of synthetic biology as a tool for probing what is and is not feasible is tied mostly to two endeavors: the accelerating pace of both sequencing and synthesis machines. In the sequencing realm, researchers need to change their slides every fifteen days, but the most astounding area, i.e. where the opportunities for high-dimensional data understanding reside, stems from the faster-than-Moore's-law cost decrease for these machines.

From Wikipedia, here is what it says about Moore's Law:

Moore's law is a rule of thumb in the history of computing hardware whereby the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. The period often quoted as "18 months" is due to Intel executive David House, who predicted that period for a doubling in chip performance (being a combination of the effect of more transistors and their being faster).[1]
Most of the world you know today is built on the ability to ride the economies of scale enabled by the CMOS bandwagon. What does this mean? Let us take a look at a 1997 write-up on CMOS, and let us recall that at that time CMOS was barely in its infancy for use in cameras.

The bottom line is that a new semiconductor technology has to be a really significant advance to displace the current technology. When technologies are competing that have relatively undifferentiated characteristics, like GaAs and Silicon, then the technology that attracts the greatest investment will be the winner. It will also have a strong tendency to take over 100% market share because of economies of scale, such as the spreading of the cost of development of CMOS-related CAD software and manufacturing equipment across a larger industry base.
Eric Fossum, the inventor of the CMOS image sensor, has a nice presentation on CCD vs CMOS here, along with the attendant demise of CCD in the general market. CCD just could not compete. Also of interest is a more recent presentation on Eric's QIS architecture with the following linear curve instantiating Moore's Law (see below):

A short decade later, CCD belongs to niche markets while billions of smartphones are equipped with CMOS camera technology. That technology opened the gates for large datasets, yet the algorithms we produce are still barely capable of handling and making sense of the gigantic data stack generated by homemade videos. Given all this, there is a technology that promises to grow faster than CMOS for cameras: sequencing. (Please note in the following graph a linear interpolation similar to Moore's law, and the attendant cost dropping below that "Moore's law" curve.)
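To get a feel for what "dropping below the Moore's law curve" means, here is a back-of-the-envelope sketch. The dollar figures are rough, round numbers read off the NHGRI "Cost per Genome" graph, and the function name is mine; this is illustrative only, not an exact fit of the data:

```python
# Sketch: compare a Moore's-law cost decline (halving every 2 years)
# with the observed drop in sequencing cost. Figures are rough,
# round numbers from the NHGRI "Cost per Genome" graph.

def moores_law_cost(initial_cost, years, doubling_period=2.0):
    """Cost if it halves every `doubling_period` years."""
    return initial_cost * 0.5 ** (years / doubling_period)

cost_2001 = 100e6   # ~ $100M per genome around 2001 (approximate)
cost_2012 = 10e3    # ~ $10K per genome around 2012 (approximate)
years = 11

projected = moores_law_cost(cost_2001, years)  # what Moore's law alone predicts
print(f"Moore's-law projection: ${projected:,.0f}")
print(f"Observed:               ${cost_2012:,.0f}")
print(f"Sequencing beat Moore's law by roughly {projected / cost_2012:.0f}x")
```

With those round numbers, Moore's law alone would leave a genome at a few million dollars in 2012, two orders of magnitude above what was actually observed.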

What does this mean for science and algorithm development? From the Genomic cost page, one can immediately focus on untamed opportunities screaming for better algorithms:

The costs associated with the following 'non-production' activities are not reflected in the two graphs:
  • Quality assessment/control for sequencing projects
  • Technology development to improve sequencing pipelines
  • Development of bioinformatics/computational tools to improve sequencing pipelines or to improve downstream sequence analysis
  • Management of individual sequencing projects
  • Informatics equipment
  • Data analysis downstream of initial data processing (e.g., sequence assembly, sequence alignments, identifying variants, and interpretation of results)

Of particular interest is the issue of quality:

For the Sanger-based sequence data, the cost accounting reflects the generation of bases with a minimum quality score of Phred20 (or Q20), which represents an error probability of 1 % and is an accepted community standard for a high-quality base. For sequence data generated with second-generation sequencing platforms, there is not yet a single accepted measure of accuracy; each manufacturer provides quality scores that are, at this time, accepted by the NHGRI sequencing centers as equivalent to or greater than Q20.
In the "Cost per Megabase of DNA Sequence" graph, the data reflect the cost of generating raw, unassembled sequence data; no adjustment was made for data generated using different instruments despite significant differences in the sequence read lengths. In contrast, the "Cost per Genome" graph does take these differences into account since sequence read length influences the ability to generate an assembled genome sequence.
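The Phred scale quoted above has a simple closed form: Q = -10 log10(p), where p is the per-base error probability, so the Q20 community standard corresponds exactly to a 1% error rate. A minimal sketch (function names are mine):

```python
import math

# Sketch: the Phred quality score mentioned in the NHGRI excerpt above.
# Q = -10 * log10(p), where p is the per-base error probability.

def phred_to_error_prob(q):
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score implied by an error probability."""
    return -10 * math.log10(p)

print(phred_to_error_prob(20))      # 0.01 -> the Q20 community standard (1%)
print(error_prob_to_phred(0.001))   # 30.0 -> Q30, one error per 1,000 bases
```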

Why should you care about quality? Well, in some cases, it could enable a more straightforward hunt for the killer of Matt's son. It could admittedly also help in curing cancer, designing ways to terraform Mars (btw, Godspeed Ray), designing natural CO2 sinks through synthetic biology, and much more. Beyond the quality issue, there is the central issue of faster data processing, so as to make sense of all these data through proper modeling. Here again, algorithms do help: let us recall that a focused effort on algorithm development in compressive sensing has shown the possibility of a 10-fold improvement per year for a few years. For more information, all Synthetic Biology and Compressive Sensing related posts are here. Videos of interest include:
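To put that 10-fold-per-year algorithmic figure in perspective against hardware scaling, here is a rough comparison of compounded improvement factors (the numbers and function name are illustrative, not measurements):

```python
# Sketch: compound annual improvement factors. An algorithmic speedup
# of 10x per year quickly dwarfs hardware's ~2x every 2 years.

def compound(factor_per_year, years):
    """Total improvement after `years` at a fixed annual factor."""
    return factor_per_year ** years

hardware = compound(2 ** 0.5, 3)   # Moore's law: 2x per 2 years = sqrt(2) per year
algorithms = compound(10.0, 3)     # the 10x/year seen in compressive sensing solvers
print(f"Hardware over 3 years:   {hardware:.1f}x")
print(f"Algorithms over 3 years: {algorithms:.0f}x")
```

Over just three years the gap is roughly 2.8x versus 1000x, which is why algorithm development, not hardware, is the lever emphasized here.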

What are you waiting for?

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.


JD said...

This very badly needs an edit. It appears to be about something interesting, but as written is completely incoherent.

Igor said...


Thanks for the feedback. In short, advanced algorithm development is the only way we are going to be able to make sense of the flurry of sequencing data that is going to be produced. As we speak, current algorithms can barely make sense of these data, yet if we go through the process of engineering a new biology, we need those tools to be a little more advanced so that we can use them in a more iterative fashion.