Nuit Blanche: Sunday Morning Insight: All hard problems are just slow feedback loops in disguise.

Sunday, June 21, 2015

Sunday Morning Insight: All hard problems are just slow feedback loops in disguise.

Today's insights are made up of different observations from the past few weeks:

Company X
Third Generation Genomics and a Kaggle ?
IoT and Machine Learning
Planet Machine Learning in Paris
Of Electric Dreams

Company X

With some friends, we think there is room for some disruptive hardware in Machine Learning and other areas. More on that later but don't be surprised if I get to talk more and more about it over the next few months.

Third Generation Genomics and a Kaggle ?

Earlier this month, as I mentioned in Compressive phase transition, a million genomes and 3rd generation sequencing, I attended the Bioinformatics for Third Generation Sequencing workshop organized by Hélène Touzet and colleagues. Some of the photos taken during the workshop can be found on my twitter feed under the #b3gs hashtag.

One of the most interesting parts of the current Oxford Nanopore sequencer technology is that it is very much portable, thereby changing some of the dynamics behind identifying strains in outbreaks. Nick Loman provided a few examples of that. One of the most awesome projects by Nick and his team was the use of current Oxford Nanopore sequencers in West Africa to map Ebola. At http://ebola.nextflu.org/ you get to see, in near real time, which strains of Ebola travel where and how some of the strains mutate. It was absolutely fascinating. I wonder how long it will take people to use additional side information such as weather or other IoT data to gain a deeper understanding of the evolution and mutation of an outbreak. A similar map exists for the MERS-CoV outbreak. For more information, go read today's blog entry by Lauren on Sequencing Ebola in the field: a tale of nanopores, mosquitos and whatsapp.

Nick provided other fascinating examples such as strain identification in a hospital setting: the idea here is to figure out if you are fighting the same strain or a different strain coming from outside sources (hospitals remain open during outbreaks). The overall identification process seemed to take about 92 minutes. I wonder how much time it will take before the sampling chain implemented by Nick and his team becomes a requirement as opposed to a "nice to have" technology.

Going deeper into the sequencing realm, Clive Brown of Oxford Nanopore gave a full-hour talk. Nanopore technology allows for long reads and hence reduces the complexity of DNA alignment, thereby enabling genome sequencing with mildly good greedy solvers. Some bioinformatics people are skeptical about the technology because the readings of the base pairs do not seem to reach the precision required for medical diagnosis. The current accuracy of the system hovers in the 90% range, whereas most medical applications require something like 99.9999%.

If you've been reading this blog long enough, you know that DNA goes through the pores at different speeds. In fact, the reason nanopore technology has taken so much time to flourish is a sampling issue. In the crudest form, the DNA piece goes through the nanopores at a speed of 1 million base pairs per second. This is too fast for the electronics used to sample the DNA, so they use different chemical engines (they call them "chemistry") to slow down the travel of the DNA strand through the pores. With these chemistries, 1 million base pairs per second slows down to an electronically manageable 500 to 1000 base pairs per second. In fact, 1000 base pairs per second is the fastest speed that Oxford Nanopore is currently bringing to the market.

At that point, one simply wonders if there is a way to sample at the initial speed of 1 000 000 base pairs per second using a compressive sensing approach such as A2I or the Xampling approach. Why would sampling at this rate be possible? For one, the signal is constrained not by sparsity but by the fact that it is made up of four DC subsignals (there are roughly 4 voltage levels characteristic of the 4 base pairs), hence a few approaches would do well on a signal made up of a restrained (and quite literal) alphabet. The estimation process I just mentioned is called "base calling", i.e. the electrical signal from the pore is used to estimate which of the four letters A, G, T or C went through it. If the base caller and the electronics were to go a thousand times faster than the fastest product out there, Nick told me that quite a few applications could arise (it currently takes about 4 days to sequence the human genome with this technology; what if it were to take 6 minutes instead? It would really become commodity sequencing).
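To make the "four DC levels" picture concrete, here is a minimal toy sketch of base calling in Python. Everything in it is an assumption made for illustration: the level values in LEVELS, the Gaussian noise, and above all the fixed number of samples per base (real strands move through the pore at varying speeds, and real base callers are far more elaborate). It only shows how a small alphabet turns estimation into a nearest-level decision.

```python
import random

# Hypothetical nominal levels in arbitrary units (an assumption, not measured values).
LEVELS = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def simulate_signal(sequence, samples_per_base, noise_std, rng):
    """Piecewise-constant signal: each base holds its level for a fixed number of samples."""
    signal = []
    for base in sequence:
        for _ in range(samples_per_base):
            signal.append(LEVELS[base] + rng.gauss(0.0, noise_std))
    return signal

def call_bases(signal, samples_per_base):
    """Average each window, then pick the letter whose nominal level is closest."""
    called = []
    for start in range(0, len(signal), samples_per_base):
        window = signal[start:start + samples_per_base]
        mean = sum(window) / len(window)
        called.append(min(LEVELS, key=lambda b: abs(LEVELS[b] - mean)))
    return "".join(called)

rng = random.Random(0)
truth = "".join(rng.choice("ACGT") for _ in range(200))
signal = simulate_signal(truth, samples_per_base=8, noise_std=0.3, rng=rng)
called = call_bases(signal, samples_per_base=8)
accuracy = sum(a == b for a, b in zip(truth, called)) / len(truth)
print(f"fraction of bases called correctly: {accuracy:.3f}")
```

Lowering samples_per_base or raising noise_std mimics faster translocation with the same electronics, which is where the sampling question above starts to bite. As a sanity check on the figures quoted: 4 days is 5760 minutes, so a thousandfold speedup gives a little under 6 minutes.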

Further down the pipeline, there is the alignment process, whereby one uses each set of identified series of base pairs to form a larger sequence. What differentiates third generation sequencing from the first two generations is the long length of the series of base pairs found. It makes the problem of aligning base pairs not NP-hard any more. In order to enable that process, the technology oversamples the genome (they call this coverage: a 30X coverage means that the DNA has on average been covered 30 times). In short, we have grouped readings, oversampling of said readings and a 4-letter alphabet: all these elements point to compressive sensing and related matrix factorization algorithms for resolution. At the very least, one wonders if current greedy algorithms are far from the phase transition of the problem or if current base callers can provide additional information.
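As a rough illustration of coverage and of what a greedy solver does with overlapping reads, here is a toy sketch in Python. It is not how any production assembler works: the reads are error-free, taken at a regular stride, and merged on exact suffix/prefix matches, all of which are simplifying assumptions. The function names (coverage, overlap, greedy_assemble) are made up for this example.

```python
import random

def coverage(reads, genome_length):
    """Average number of times each position is covered: total read length / genome length."""
    return sum(len(r) for r in reads) / genome_length

def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=10):
    """Repeatedly merge the pair of contigs with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge: return the remaining contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
# Error-free reads of length 50 taken every 10 positions, then shuffled (toy assumption).
reads = [genome[s:s + 50] for s in range(0, 151, 10)]
rng.shuffle(reads)

print(f"coverage: {coverage(reads, len(genome))}X")
contigs = greedy_assemble(reads)
print("recovered the genome:", contigs == [genome])
```

With reads this long relative to the genome, the true overlaps dominate any chance overlap, which is the intuition behind long reads making the greedy approach workable. With reads at the roughly 90% accuracy mentioned earlier, exact matching would have to be replaced by approximate matching.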

An action item I took out of this meeting was to explore the possibility of exposing some of these data to a Kaggle-like contest, whether at the base caller level or at the alignment level (to be defined).

Other presentations at the meeting mostly tried to refine the quality of 3rd generation sequencing using 1st and 2nd generation high accuracy sequencing methods as side information. The last talk, by Vincent Croquette, introduced us to a different kind of hardware that used holographic measurements and some mechanical means to figure out DNA sequencing information.

These sequencers can be seen as IoT sensors, which brings us to the next meeting.

IoT and Machine Learning

Last week, I went to the Samsung VIP meeting that focused on Samsung's investment in France in IoT and in a "Strategy and Innovation center" here in Paris. I briefly spoke with Young Sohn and Luc Julia about their ongoing Machine Learning efforts.

My view: the IoT business is tough because, in part, we have not had many large datasets in that area. It is also a treacherous business: if your sensor is in direct competition with data that can be had with a camera, it will eventually lose out to imaging CMOS. Furthermore, as I mentioned in a previous Paris IoT meetup, "making sense" of the data from these streams is only possible after accumulating enough of it.

Genome sequencing, for instance, can become a science only when you hit a high sampling bound. IoT enterprises are currently at the stage of building the data gathering infrastructure, so we can only expect the insights to build up over some time. With that in mind, techniques such as compressive sensing are key to IoT development because most of the data in that field cannot leverage the compression technology used for images and videos. That is a blessing because:

randomized algorithms are a cheap way of getting compression

random projections keep some information in a way that overoptimized compression technology like jpeg cannot.
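A small sketch of the second point, in Python with only the standard library: a Gaussian random matrix compresses 1000 numbers into 200, and the distance between two signals is roughly preserved without the projection knowing anything about the signals. The dimensions, the uniform test vectors and the function names are arbitrary choices for illustration; this shows distance preservation only, not a full compressive sensing reconstruction.

```python
import math
import random

def random_projection_matrix(m, n, rng):
    """m x n matrix of i.i.d. Gaussian entries, scaled so squared norms are preserved on average."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]

def project(matrix, x):
    """Matrix-vector product: one inner product per compressed measurement."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
n, m = 1000, 200  # original and compressed dimensions (arbitrary choices)
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(n)]
phi = random_projection_matrix(m, n, rng)

ratio = distance(project(phi, x), project(phi, y)) / distance(x, y)
print(f"{n} -> {m} dimensions, distance ratio = {ratio:.3f}")
```

The ratio should stay close to 1, and the spread around 1 shrinks as m grows. The sensor side only has to compute inner products with random rows, which is what makes this kind of compression cheap.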

Planet Machine Learning in Paris

Last but not least, we had our last regular Machine Learning meetup of the season (videos and slides are at: Paris Machine Learning Meetup #10 Season 2 Finale: "And so it begins": Deep Learning, Recovering Robots, Vowpal and Hadoop, Predicsis, Matlab, Bayesian test, Experiments on #ComputationalComedy & A.I.).

We also have two "Hors Séries" coming up, as planet ML is coming to France in the next few weeks with the advent of COLT in Paris and ICML in Lille. These meetups will be

If you think you'll be in Paris and want to talk Machine Learning to an adoring crowd, send me an email and we will see if we can set up some other "Hors Séries" within the next few weeks. We can have access to rooms for presentations within days and have more than 2300 members, a subset of which will like your presentation. We can also simply meet for a coffee!

Of Electric Dreams

Last but not least, this past week, the interwebs went crazy about a certain kind of image generated by deep neural networks. Aside from the obvious trippy images, one wonders how long it will take before we have a subfield of machine learning investigating specific types of psychiatric diseases from the DSM-V list and images generated by current Machine Learning techniques. In my view, some recommender systems also have behaviors that fit diseases listed in the DSM-V: simple re-targeting being one.

Coming back to images, this interest in producing images that are close to our experience or to artistic endeavors is not new and did not start with deep neural networks. Back in 2007, we noticed that regularization had the possibility of generating artistic images (Can I tell you my secret now ?....I see dead reconstructions). More recently, reconstructing images from seemingly deteriorated information has produced similarly interesting patterns (Do Androids Recall Dreams of Electric Sheeps ?, From Bits to Images: Inversion of Local Binary Descriptors - implementation -).

For instance, back in season 1, at Paris Machine Learning Meetup #3 ("We already have cats under control"), Patrick Perez gave a presentation, From image to descriptors and back again, and mentioned a series of photos reconstructed from descriptors on Herve Jegou's page (Reconstruction of images from their local descriptors).