Nuit Blanche: Predicting the Future: Randomness and Parsimony

Monday, August 27, 2012

Predicting the Future: Randomness and Parsimony

[This is part 2. Part 1 is here: Predicting the Future: The Steamrollers]

For many people, predicting the future means predicting the breakthroughs, and while this might seem reasonable at first, one should probably focus on how the steamrollers can precipitate these findings as opposed to expecting chance to be on our side.

One of the realities of applied mathematics is the dearth of really new and important algorithms. This is not surprising: developing them is a tough and difficult process. In turn, the mathematical tools we will use in the next 20 years are, for the most part, probably in our hands already. How can this fact help in predicting the future, you say? Well, let us combine this observation with another one from the health care area, though it could be any other field that can be transformed through synthetic biology thanks to genomic sequencing.

First, there is this stunning example recounted by David Valle in The Human Genome and Individualized Medicine (it is at 57min32s):

...First of all, acute lymphoblastic leukemia. When I was a house officer in the later '60s and early '70s, acute lymphoblastic leukemia was the most common form of childhood leukemia and had a 95 percent mortality rate - 95 percent mortality. Nowadays, acute lymphoblastic leukemia remains the most common childhood leukemia. It has a 95 percent survival rate, 95 percent survival. So it went from 95 percent mortality to 95 percent survival. So what accounts for that change? So actually if you look at it, the medicines that are currently being used are very similar, if not identical, to the medicines that we used all those years ago. So it's not the kinds of medicines that are being used. What it is, I would argue, is that oncologists have learned that this diagnosis is actually a heterogeneous group of disorders. And they've learned how to use gene expression profiling, age of onset, DNA sequence variation and other tools to subdivide the patients. In other words, move from one collective diagnosis to subcategories of diagnosis, moving towards individualizing the diagnosis to individual patients and then manipulating their treatment according to which subdivision the patient falls. And that approach, a more informed approach in terms of differences between individual patients with the same diagnosis, has had a dramatic effect on the consequences of having ALL....

In other words, starting with the same medicines, it took us 40 years (most of that time without sequencing capabilities) to match a hierarchy of diseases to a hierarchy of drugs and processes. Back in the 70s, this matching of hierarchies entailed:

the ability to get rapid feedback from drug trials
the ability to have enough statistics from a sub-group for certain drug trials

Because of the statistics required, treating rare diseases has been at odds with this process. How is this different nowadays? Hal Dietz discusses that in Rational therapeutics for genetic conditions (see "...The Window Doesn't Close..."), and he points out that if you have the right tool to examine deep inside the metabolic networks through genome sequencing, then the window doesn't close. From the Q&A:

Question: Are adults with Marfan syndrome all treatable?

Hal Dietz: Yeah, so that's a great question. The question is, are adults with Marfan all treatable or is the window of opportunity to make a difference over in childhood? At least in our mice, we can allow them to become mature adults. They're sexually mature at about two months, by six months of age they are sort of mid-adult life and by a year of age they are old mice. And whether we start treatment right after birth, in the middle of that sequence, or at the end, we see the same kind of benefits. So we think that the window doesn't close, that there is an opportunity even later in life.

In short, with genomic sequencing, the matching process occurring in health care (a data-driven hypothesis process) now becomes:

the ability to get rapid feedback from drug trials
the ability to get information-rich feedback from these drug trials

The Steamrollers that are Moore's law and Rapid Genomic Sequencing point to an ability to generate higher quality data at a faster pace than ever before while profoundly changing survival rates or curing diseases.

All would be well if the quality of the information from genomic sequencing did not come at the expense of an attendant large quantity of data. Let's put this in perspective: the genome comprises on the order of a billion items of information, the microbiome about ten times that, and there are about seven billion people on Earth. If one were to decode the genome of the entire population, we would generate about 10^19 data points. This is huge: far more than there are stars in our galaxy. However huge, this data is not that information rich. Simply put, there is more variety in the human genome among people from the same tribe in Africa than among all the other humans living on the four other continents.
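As a back-of-the-envelope check, here is the arithmetic behind that estimate, using only the round figures quoted above. The variable and function names are mine, and the inputs are orders of magnitude, not measurements.

```python
# Round, order-of-magnitude figures taken from the paragraph above.
GENOME_ITEMS = 10**9            # "a billion" items of information per genome
MICROBIOME_FACTOR = 10          # the microbiome is "about ten times that"
WORLD_POPULATION = 7 * 10**9    # "about seven billion people"


def population_total(items_per_person, people=WORLD_POPULATION):
    """Items of data generated if every person were sequenced once."""
    return items_per_person * people


# Genomes alone, then genomes plus microbiomes.
genomes_only = population_total(GENOME_ITEMS)
with_microbiome = population_total(GENOME_ITEMS + MICROBIOME_FACTOR * GENOME_ITEMS)

print(f"genomes only:    {genomes_only:.1e}")     # 7.0e+18
print(f"with microbiome: {with_microbiome:.1e}")  # 7.7e+19
```

Genomes alone land just under 10^19; adding the microbiome pushes the total close to 10^20.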

In short, the useful data actually "lives" in a much, much, much smaller world than the one produced by the combination of the Steamrollers. To handle these parsimonious needles within these very large haystacks, mathematical results of the concentration-of-measure type have recently yielded new tools. Some of these methods use randomness as an efficient way of compressing this useful but sparse information.
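To make the "randomness as compression" idea a little more concrete, here is a minimal sketch of a random projection in the spirit of the Johnson-Lindenstrauss lemma, one of the best known concentration-of-measure results. It is plain Python written for illustration only: the sizes (2000 dimensions, 10 nonzeros, 200 random measurements) and all function names are assumptions of mine, not something taken from the tools mentioned in this post.

```python
import math
import random


def random_projection_matrix(k, n, rng):
    """k x n matrix with i.i.d. Gaussian entries of variance 1/k."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]


def project(matrix, sparse_vec):
    """Apply the matrix to a sparse vector stored as {index: value}."""
    return [sum(row[i] * v for i, v in sparse_vec.items()) for row in matrix]


def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys))


def dense_distance(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def random_sparse_vector(n, nonzeros, rng):
    """Sparse vector with `nonzeros` Gaussian entries at random positions."""
    return {i: rng.gauss(0.0, 1.0) for i in rng.sample(range(n), nonzeros)}


rng = random.Random(0)          # fixed seed so the run is repeatable
n, k, nonzeros = 2000, 200, 10  # illustrative sizes only

a = random_sparse_vector(n, nonzeros, rng)
b = random_sparse_vector(n, nonzeros, rng)
phi = random_projection_matrix(k, n, rng)

# Ratio of the distance after compression to the distance before.
ratio = dense_distance(project(phi, a), project(phi, b)) / sparse_distance(a, b)
print(f"distance ratio after compressing {n} -> {k} dimensions: {ratio:.3f}")
```

The ratio should come out close to 1: the random matrix knows nothing about where the nonzeros are, yet the geometry of the data survives a tenfold reduction in size.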

What is the time frame for these tools using parsimony and randomness to be part of the standard toolbox in personalized medicine ?

Certainly less than 18 years. It took about 27 years to build efficient tools in linear algebra (EISPACK (1972) to LAPACK (1999)) that are just now considering randomization (see Slowly but surely they'll join our side of the Force...). Using the parsimony of the data will probably be handled at a faster pace by crowdsourcing efforts such as scikit-learn. In the next eighteen years, we should expect standardized Advanced Matrix Factorization Techniques, as well as factorization in the Streaming Data model, to be readily available in ready-to-use toolboxes. Parsimony also effectively embeds graph-related concepts, and one already sees the development of distributed computing frameworks such as GraphLab that go beyond the now seven-year-old Hadoop.
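For readers wondering what randomization in linear algebra can look like, here is a small sketch of one common idea, the randomized range finder: multiply the matrix by a thin random matrix, orthonormalize the result, and use that basis to build a low-rank factorization. This is a toy in plain Python under stated assumptions (an exactly rank-3 test matrix, made-up sizes, hypothetical function names); a real library would rely on optimized routines and is not claimed to work this way.

```python
import math
import random


def matmul(a, b):
    """Product of two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthonormalize_columns(y, tol=1e-10):
    """Modified Gram-Schmidt on the columns of y, dropping dependent columns."""
    basis = []
    for col in transpose(y):
        original_norm = math.sqrt(sum(x * x for x in col))
        v = list(col)
        for q in basis:
            d = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - d * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol * original_norm:
            basis.append([x / norm for x in v])
    return transpose(basis)


def frobenius(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def randomized_range_finder(a, sketch_size, rng):
    """Orthonormal basis for the range of a, found from a random sketch a @ omega."""
    n = len(a[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(sketch_size)] for _ in range(n)]
    return orthonormalize_columns(matmul(a, omega))


rng = random.Random(0)
m, n, rank = 60, 40, 3  # illustrative sizes only

# Build a test matrix that is exactly rank 3.
left = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(m)]
right = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(rank)]
a = matmul(left, right)

q = randomized_range_finder(a, sketch_size=rank + 2, rng=rng)
b = matmul(transpose(q), a)   # small factor: (columns of q) x n
approx = matmul(q, b)         # low-rank reconstruction q @ (q^T @ a)

residual = [[x - y for x, y in zip(ra, rp)] for ra, rp in zip(a, approx)]
relative_error = frobenius(residual) / frobenius(a)
print(f"basis size: {len(q[0])}, relative error: {relative_error:.1e}")
```

The point of the design is that the expensive matrix is only touched through a handful of random combinations of its columns, which is what makes the approach attractive for very large or streaming data.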

But the concepts of parsimony and randomness will also play a tremendous role in how we acquire data in the first place, by changing the way we design diagnostic instruments. Sensing with parsimony, aka Compressive Sensing, will help in making this a reality. Besides aiding in reverse engineering biochemical networks or providing an effective way to compare genomic data, it will also help engineers devise new sensors or perfect older ones such as MRI. Expect new diagnostic tools.
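Here is a toy illustration of what sensing with parsimony means: a signal with 256 entries but only 3 nonzeros is measured through 100 random linear combinations, then recovered with orthogonal matching pursuit, one of several greedy sparse recovery algorithms used in compressive sensing. Everything below (sizes, names, the choice of algorithm, the noiseless measurements) is an assumption made for the sake of a short, self-contained sketch, not a description of any actual instrument.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares(columns, y):
    """Coefficients c minimizing ||sum_j c[j] * columns[j] - y||,
    via the normal equations and Gauss-Jordan elimination."""
    k = len(columns)
    aug = [[dot(columns[i], columns[j]) for j in range(k)] + [dot(columns[i], y)]
           for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(k):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[i])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def omp(columns, y, sparsity):
    """Orthogonal matching pursuit: greedily pick the column most correlated
    with the residual, then refit all picked columns by least squares."""
    support, coeffs, residual = [], [], list(y)
    for _ in range(sparsity):
        scores = [abs(dot(col, residual)) for col in columns]
        for j in support:
            scores[j] = -1.0  # never pick the same column twice
        support.append(max(range(len(columns)), key=scores.__getitem__))
        chosen = [columns[j] for j in support]
        coeffs = least_squares(chosen, y)
        residual = [yi - sum(c * col[i] for c, col in zip(coeffs, chosen))
                    for i, yi in enumerate(y)]
    return dict(zip(support, coeffs))


rng = random.Random(0)
n, m, sparsity = 256, 100, 3  # signal length, measurements, nonzeros

# Random sensing matrix, stored column by column (n columns of length m).
columns = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]

# Sparse ground truth: a few entries with magnitude between 1 and 2.
truth = {j: rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 2.0)
         for j in rng.sample(range(n), sparsity)}

# The m measurements are random linear combinations of the signal entries.
y = [sum(v * columns[j][i] for j, v in truth.items()) for i in range(m)]

recovered = omp(columns, y, sparsity)
max_error = max(abs(recovered.get(j, 0.0) - v) for j, v in truth.items())
print("support recovered:", sorted(recovered) == sorted(truth))
print("largest coefficient error:", max_error)
```

In this noiseless setting the 256-entry signal is expected to be reconstructed from fewer than half as many measurements, which is the kind of saving that motivates redesigning sensors around sparsity.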

Which gets us back to the original question: What can I say with certainty about August 25th, 2030? We will manage the large datasets coming out of the steamrollers only through the use of near-philosophical concepts such as parsimony and randomness. By doing so, we are likely to tremendously reduce the toll of our current number 1 and number 2 causes of death.