Nuit Blanche: 50 years of Data Science by David Donoho

Friday, October 23, 2015

50 years of Data Science by David Donoho

The genesis of Nuit Blanche is intimately linked to Dave Donoho's work (and its webpage). Today's entry is a preprint based on a keynote speech on defining Data Science that Dave gave at the John W. Tukey 100th Birthday Celebration at Princeton University on September 18, 2015. Here is the attendant preprint: 50 years of Data Science by David Donoho

More than 50 years ago, John Tukey called for a reformation of academic statistics. In 'The Future of Data Analysis', he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or 'data analysis'. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name "Data Science" for his envisioned field. A recent and growing phenomenon is the emergence of "Data Science" programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M "Data Science Initiative" that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments. This paper reviews some ingredients of the current "Data Science moment", including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics. The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for 'scaling up' to 'big data'. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.
Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere 'scaling up', but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are 'learning from data', and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today's Data Science Initiatives, while being able to accommodate the same short-term goals.

There are many passages I liked, including this one:

"..Machine Translation research fi nally re-emerged decades later from the Piercian limbo, but only because it found a way to avoid a susceptibility to Pierce's accusations of glamor and deceit. A research team led by Fred Jelinek at IBM, which included true geniuses like John Cocke, began to make de nite progress towards machine translation based on an early application of the common task framework. A key resource was data: they had obtained a digital copy of the so-called Canadian Hansards, a corpus of government documents which had been translated into both English and French. By the late 1980's DARPA was convinced to adopt the CTF as a new paradigm for machine translation research. NIST was contracted to produce the sequestered data and conduct the refereeing, and DARPA challenged teams of researchers to produce rules that correctly classifi ed under the CTF...."

h/t Diego and Victoria

David Donoho reflects on "50 Years of #DataScience" >https://t.co/ntUnWMaQpR #Statistics #BigData HT @victoriastodden pic.twitter.com/dwZ0yd0S2g
— Dr. Diego Kuonen (@DiegoKuonen) October 13, 2015

Credit: NASA/Johns Hopkins University Applied Physics Laboratory/Southwest Research Institute

Pluto's Blue Sky
Release Date: October 8, 2015
Keywords: MVIC, Pluto, Ralph
Pluto's haze layer shows its blue color in this picture taken by the New Horizons Ralph/Multispectral Visible Imaging Camera (MVIC). The high-altitude haze is thought to be similar in nature to that seen at Saturn's moon Titan. The source of both hazes likely involves sunlight-initiated chemical reactions of nitrogen and methane, leading to relatively small, soot-like particles (called tholins) that grow as they settle toward the surface. This image was generated by software that combines information from blue, red and near-infrared images to replicate the color a human eye would perceive as closely as possible.
