Saturday, January 31, 2015

Sunday Morning Insight: The Hardest Challenges We Should be Unwilling to Postpone

Last Wednesday, at the Paris Machine Learning meetup, we had some of the most interesting talks around Data Science, Machine Learning and non-profits. For the purpose of providing a panorama of non-profits, we included universities, foundations, NGOs and an open source project. It is no wonder that they are non-profits: they tackle some of the hardest challenges in Data Science and Machine Learning.

We first heard Paul Duan, the founder of BayesImpact, a YCombinator-backed non-profit. His presentation is here.

Paul presented how they produce code and algorithms to serve the ambulance dispatch system in San Francisco (at some point he even mentioned the issue of fair algorithms, a topic we will try to address in a future meetup). One of the challenging issues found by BayesImpact was the aging equipment of some of these systems and the fact that some are disconnected from each other by law. Another issue is that it can sometimes be difficult to justify spending money on algorithms while there is an ongoing funding shortage.

Then we had Isabelle Guyon ( AutoML Challenge presentation (pdf), and site: ChaLearn Automatic Machine Learning Challenge (AutoML), Fully Automatic Machine Learning without ANY human intervention. )

Isabelle talked about the recent ML challenge she and ChaLearn, a non-profit foundation, are putting in place. The goal of the challenge is to see how well a model can deal with increasing complexity while removing humans from feature engineering and hyperparameter tuning, which, as we all know, is a black art on most accounts. Many people see this effort as potentially killing the business of being a data scientist, but this is erroneous in my view. First, Kaggle-type efforts give the impression that overfitting is not a crime. This type of challenge should squarely bring some common sense back to that view of the world. Second, most current challenges have no real strategy for dealing with non-stationary datasets (i.e. datasets that increase in complexity with time). This type of challenge opens the door to developing strategies in that regard. Definitely another tough problem.
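To make the idea concrete, here is a minimal sketch of hyperparameter tuning with no human in the loop: a random search over a ridge regression's regularization strength, selected purely on validation error. This is an illustration of the principle, not the challenge's actual protocol, and the synthetic dataset is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task standing in for a challenge dataset.
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)

# Train/validation split: hyperparameters are picked purely from
# validation error, with no human judgment anywhere in the loop.
X_tr, X_va = X[:150], X[150:]
y_tr, y_va = y[:150], y[150:]

def fit_ridge(X, y, lam):
    """Closed-form ridge regression solution."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Random search over the regularization strength on a log scale.
best_lam, best_err = None, np.inf
for lam in 10 ** rng.uniform(-4, 2, size=50):
    w = fit_ridge(X_tr, y_tr, lam)
    err = np.mean((X_va @ w - y_va) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err

print(f"selected lambda={best_lam:.4g}, validation MSE={best_err:.3f}")
```

The same loop generalizes to searching over model families and feature pipelines, which is closer to what the AutoML challenge asks of participants.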
Here is the longer presentation: AutoML Challenge presentation (ppt), (pdf)

We then went on with  Frederic le Manach from the Bloom Association, an NGO, on the topic of Subsidizing overfishing (pdf), (ppt)

Frederic talked about a specific problem related to data-driven policy making. His NGO is focused on shedding some light on the tortuous paths taken by the various governmental subsidies (negative or positive) to the fishing industry. As background, he mentioned some interesting information I was not particularly aware of: namely that some fishing nets can go as deep as 1800 m (5900 feet). His NGO's thesis is that one of the reasons for overfishing may have to do with the various opaque mechanisms by which subsidies are handed out to the different stakeholders. His organization intends to untangle this maze so that policymakers understand the real effect of the current system. Frederic is looking for data scientists who could find ways to gather and clean data from various sources.

The other item of interest to me was that, in terms of jobs, having small fishing outfits seemed to be a win-win on many accounts (fishermen get paid better, fish reserves are not depleted, yielding potentially larger fish populations, etc.).

In the beer session Franck and I had with him afterwards, we noted that while the subsidies issue could have an impact on policy making, it might not be the most effective way of bringing overfishing to the attention of the general public. One item that seemed obvious to us was that the current system does not have a good fish stock counting process. And sure enough, Frederic mentioned two studies that clearly showed an awful mismatch between certain fish population counts and predictions.

One of these studies is by Carl Walters and Jean-Jacques Maguire.

The fascinating thing is that there are some more open datasets here (as opposed to the subsidies data).

The count does not seem to take into account the underlying structure of the signal (the fish population). Think of the problem a little bit like a matrix completion problem of sorts. What sort of side information do we have? According to Frederic, there are several instances of fish stocks that were depleted in a matter of a few years (and put entire industries out of business in that same time span). The underlying reason is that certain species only travel in flocks (flocks of about 40,000 individuals). Think of them as clusters. If a flock is fished and only 20,000 individuals remain, those individuals will look for another cluster to merge with, in order to get back to about 40,000+ individuals.

If you sample the sea without knowing about the social structure of the population, then whatever underlying assumptions sit behind a linear model will probably over- or undercount the actual population. At the very least, it will be easy for any stakeholder to discount the counting method as a tale based on a mere interpolation with no real value.
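A toy simulation makes the point (the numbers are made up, not Frederic's data): sample a fraction of the sea's cells at random and scale the count up linearly. For a uniformly spread population the estimate is tight; for the same population packed into a few flocks, the linear scale-up is wildly unreliable.

```python
import numpy as np

rng = np.random.default_rng(1)

AREA_CELLS = 10_000        # sea area divided into survey cells
TOTAL_FISH = 400_000       # true population
N_SAMPLED = 100            # cells visited by the survey

def survey_estimate(cells):
    """Count fish in randomly sampled cells, scale up linearly."""
    idx = rng.choice(AREA_CELLS, size=N_SAMPLED, replace=False)
    return cells[idx].sum() * (AREA_CELLS / N_SAMPLED)

# Scenario A: fish spread uniformly over the area.
uniform = np.bincount(rng.integers(0, AREA_CELLS, TOTAL_FISH),
                      minlength=AREA_CELLS)

# Scenario B: the same population packed into 10 flocks of 40,000.
clustered = np.zeros(AREA_CELLS)
flock_cells = rng.choice(AREA_CELLS, size=10, replace=False)
clustered[flock_cells] = 40_000

est_u = np.array([survey_estimate(uniform) for _ in range(1000)])
est_c = np.array([survey_estimate(clustered) for _ in range(1000)])

print("uniform   : mean %.0f  std %.0f" % (est_u.mean(), est_u.std()))
print("clustered : mean %.0f  std %.0f" % (est_c.mean(), est_c.std()))
```

Most surveys of the clustered population hit no flock at all and report zero, while the rare survey that does hit one extrapolates to millions: the estimator is unbiased but its variance makes any single count nearly worthless, which is exactly the interpolation objection above.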

Yet there is real-time information that could be used as a proxy for the count. Boats currently carry GPS, and some of that data stream is available. Through their radars, fishing boats hunt out flocks of fish and can therefore be seen as a good proxy for where the flocks are located.

And this is where the matrix completion problem comes in. We are looking at a problem very similar to Robust PCA, where one wants to image all the flocks at once from very incomplete information; yet the spatio-temporal dataset has some definite structure that comes from what we know of the social behavior of these animals. The problem could also fit a group testing/compressive sensing approach.
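As a sketch of that idea, here is a minimal Robust PCA (Principal Component Pursuit solved with a basic ADMM scheme) splitting a matrix into a low-rank part plus sparse corruptions. The "ocean grid" framing and all sizes are purely illustrative, not an actual fisheries dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

def shrink(X, tau):
    """Entrywise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def robust_pca(M, n_iter=300):
    """Principal Component Pursuit via a basic ADMM scheme:
    min ||L||_* + lam*||S||_1  subject to  L + S = M."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))
    mu = n1 * n2 / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)   # dual variable for the constraint L + S = M
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
    return L, S

# Synthetic "ocean grid": a low-rank map of flock intensity plus
# sparse corruptions (spurious or missing pings).
n = 60
L0 = rng.normal(size=(n, 3)) @ rng.normal(size=(3, n))   # rank-3 structure
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = rng.choice([-5.0, 5.0], size=mask.sum())
M = L0 + S0

L, S = robust_pca(M)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
print(f"relative recovery error of low-rank part: {rel_err:.4f}")
```

In the fisheries analogy, the low-rank part would play the role of the slowly varying flock map and the sparse part the unreliable individual pings; in practice one would also need to handle the fact that most entries are never observed at all, which is where the matrix completion variant comes in.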

In the end, a more exact count would have an effect on all stakeholders. For instance, if there were only 30 flocks of a certain species left in the Mediterranean Sea, even bankers would make different decisions when it comes to loaning money for a new ship. Other stakeholders would equally make different choices so that the fishing of that stock could last a much longer time.

Our next speaker was Emmanuel Dupoux, of EHESS/ENS/LPS who presented The Zero Resource Speech Challenge (presentation pdf, presentation ppt )  


Emmanuel described the Zero Resource Speech Challenge by arguing that the current approach to language learning is mostly through peer-related interactions. Babies, in particular, do a lot of unsupervised learning, which is currently not the path taken by most algorithm development in Machine Learning. He also argued that the current path could probably not scale to languages that do not have large corpora on which to train current ML algorithms.

Emmanuel also mentioned the MIT dataset (see Deb Roy's talk "The Birth of a Word") where privacy issues have overwhelmed the project to the point that the data is essentially closed. Emmanuel mentioned a similar project in his lab where similar privacy issues have to be sorted through.

Eventually, we listened to Jean-Philippe Encausse, who talked to us about S.A.R.A.H (here is his presentation pdf, (ppt)). S.A.R.A.H is a system you can install at home that enables people to communicate with their in-house connected devices. There is a potential for this system to produce large amounts of data that could eventually be used by the academic community. It was interesting to see how quickly an obvious match emerged between the datasets potentially generated by S.A.R.A.H and those of direct relevance to the previous talk. Jean-Philippe described how a new kind of plugin could help in this endeavor.

Jean-Philippe wrote two blog entries on this potential use of S.A.R.A.H.
Here is a video of what S.A.R.A.H can do.

To paraphrase President Kennedy, we saw some of the hardest challenges we should be unwilling to postpone.

Godspeed Ian !

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

1 comment:

Anonymous said...

"First, Kaggle type of efforts give the impression that overfitting is not a crime"

What is this supposed to mean? On Kaggle, overfitting is immediately revealed at the contest finish, and will cost you dearly!