Sunday, September 18, 2016

Sunday Morning Insight: "More F$*%(g Data"

On Friday, we had a too-short panel discussion on the AI ecosystem here in France. At the end, Paul asked us what would be needed to have an even more dynamic ecosystem.

After thinking about it for a little while, it became clear to me that the answer was "More Data". Later that day, I met a person who wanted to organize a hackathon in the area of Data Science. Again, it was clear that to attract local data scientists, you need unheard-of and fascinating datasets. As a result of these interactions, I have been thinking about the types of datasets, or access to datasets, that could have a large impact on start-ups and on our local community:

Health: Given that the British NHS and DeepMind struck a deal in Great Britain, it would seem a shrewd move for the French state to open the health datasets owned by the wide variety of health stakeholders. Yes, privacy issues are central, and yes, they can be solved efficiently. Most importantly, if we do not use these datasets, we are not making our health-AI ecosystem dynamic. It is important to remember that, by default, the safer move for the state is to not move in that direction. Until people realize that you will get better diagnostics and treatment in the UK than in France, there will not be a move to exploit these data. By that time, the game will be over for French start-ups hoping to make an impact. Let me add that the start-ups currently doing well in that realm are companies that have been able to get their data from outside French soil. It is not just a shame, it is tragic.

Movie industry: The FranceisAI meeting took place at BPI France, a short 200 meters away from the site of the first public movie screening at which admission was charged. The French created the movie industry and started many of the business practices and narrative conventions used in scripts that stand to this day. Nowadays, under the reasoning that some things need to be protected, entire collections are seldom reachable by start-ups and artist types. The database of INA is one such example, but I am sure it is not the only one. What I notice is that a company like Netflix, which did not exist 15 years ago, is becoming a giant in the TV industry because of its awesome data-driven work. It looks as if it is becoming even better now because of its ability to create and analyse "More F$*%(g Data". I am sure start-ups here in Paris could fast-track their analytical tools based on local databases such as that of INA.

Environment: At the last Paris Machine Learning meetup this past Wednesday, we had David Klein, a data scientist in California who helps entities like Conservation Metrics quantify conservation actions throughout the world (his presentation is here). It is exciting and is also listed in the meetup's archives here. Go read it, I'll wait... During the Q&A, one of our audience members asked if, as with the turtles David could detect, we could do the same for bull sharks. Why bull sharks, you say? Well, these sharks have had a tremendous impact on the local economy of Reunion island, a French "département" in the Indian Ocean. Currently, the French fisheries ministry staff goes through the lengthy process of tagging the sharks, which can then be detected by sparsely located offline sensors. After watching David's presentation, we were wondering about the possibility of using acoustic sensors or drones with cameras and Deep Learning to detect in real time (not after the fact) whether any bull sharks were getting too close to the beach and surf areas. Because of the non-responsiveness of the current detection system, the state conservation department has had to remove some sharks. While that solution might ease the contentious relations with the population, it is neither a guarantee of safety for humans nor ideal for the shark population. Combining AI/ML capabilities with an offline detection system would still detect untagged sharks, a capability that currently simply does not exist, or that would cost a lot of money if it required tagging every shark in the region. I am sure that there are other wildlife issues that could be solved using some of these techniques, but since the environment has mostly been the realm of the state, it is high time the state opens up and provides "More F$*%(g Data".

I could go on and on about different subject areas where the state owns more data than it can make sense of. Let me point out one dataset that does not seem to have a direct economic impact because it looks too sciency: earthquake detection. The Institut de physique du globe de Paris has some very large datasets that would be ideal for a beautiful hackathon or for start-ups that want to try out their algorithms. These datasets should not be seen as being just for the science community. If the story of Kaggle is any indication, even science can change as a result of releasing "More F$*%(g Data".

Current models such as those used in Deep Learning/AI require large amounts of data for their training. Policymakers need to understand that if homegrown start-ups and the attendant ecosystems in general are to thrive and have an edge, it will be because, and only because, they have access to large amounts of F$*%(g data.

Image Credit: NASA/JPL-Caltech
This image was taken by Rear Hazcam: Left B (RHAZ_LEFT_B) onboard NASA's Mars rover Curiosity on Sol 1463 (2016-09-17 11:56:14 UTC).
