Sunday, May 04, 2014

Sunday Morning Insights: The Power of Zero

Triggered by the discussion we had with Brenda McCowan on Unraveling Dolphin Communication Complexity, Past Approaches and Next Steps (video is here) at the Paris Machine Learning Meetup #10, I was fascinated by the potential of machine learning algorithms focused on exploration when there is no obvious prior knowledge, something along the lines of zero-knowledge proofs, where one wants to learn more while knowing nothing [1]. I went ahead and asked several interweb communities about learning without training samples. Here is the initial request I sent out:

Dumb Question: Unsupervised Learning in Science?
The vast majority of ML-related work and investigation revolves around some amount of supervised learning, where there are examples with known meanings. Does anybody know of an example where the learning was unsupervised? For instance, someone (re-)learning the Egyptian hieroglyphs without the Rosetta stone, or somebody (re-)discovering the Mayan script using strictly machine learning techniques? The example does not have to be about human languages.

I obviously received good answers on unsupervised learning such as dictionary learning in image processing but I wanted to see what else was out there. On Google+ (here), John Taylor mentioned the following:

In your Rosetta/Maya examples, decoding was done by people who already knew several languages. That definitely helped, as opposed to the case where the discoverers of the Rosetta stone would have had no clue about the existence of languages. And yes, there are papers about such things, so-called "zero-shot learning". Basically, ML systems were trained and "knew" something already, but were given something they had never seen, with the task of figuring out what it could be. And they did well; at least one system's guesses were orders of magnitude better than a random guess. If that is something that you are looking for, go to Google Scholar and search for zero-shot learning papers.
From Zero-Shot Learning by Convex Combination of Semantic Embeddings by Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean, one can read that "zero-shot learning" is defined as the "annotation of images with new labels corresponding to previously unseen object categories". That seems pretty close, because in effect we are not starting from nothing, and we all know from compressive sensing that with an additional prior one can go much farther. In the case of dolphin communication, Brenda and colleagues are asking whether there are discussions and/or dialects. If one assumes that there are discussions, the question becomes: how do we learn that language/dialect? Here are the most recent papers that mention zero-shot learning, a subject I will probably come back to every once in a while. Here is an example that involves animals yet is centered on imagery.
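To make the convex-combination idea concrete, here is a minimal NumPy sketch of the ConSE recipe from the Norouzi et al. paper: weight the embeddings of the seen classes by a classifier's probabilities, and label the image with the nearest unseen-class embedding. The class names, embedding dimension, and probabilities below are all made up for illustration; in the paper the embeddings come from word2vec and the probabilities from a trained ConvNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word embeddings for class labels (unit vectors); random here,
# word2vec vectors in the actual paper.
seen = ["cat", "dog", "horse"]
unseen = ["tiger", "wolf"]
emb = {c: rng.normal(size=8) for c in seen + unseen}
for c in emb:
    emb[c] /= np.linalg.norm(emb[c])

# Pretend an image classifier trained only on the seen classes produced
# these probabilities for a test image of an unseen animal.
p_seen = np.array([0.7, 0.2, 0.1])

# ConSE: embed the image as a convex combination of seen-class
# embeddings, weighted by the classifier's probabilities.
f = sum(p * emb[c] for p, c in zip(p_seen, seen))
f /= np.linalg.norm(f)

# Zero-shot prediction: nearest unseen-class embedding (cosine similarity).
scores = {c: float(f @ emb[c]) for c in unseen}
prediction = max(scores, key=scores.get)
print(prediction)
```

With random toy embeddings the predicted label is arbitrary; the point is the mechanics: no image of a tiger or wolf was ever used for training, yet the system can still assign one of those labels.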
On Reddit, I received three answers:
Biologists use a lot of unsupervised learning to look for subtypes in diseases, etc.

Another application of clustering, in biology too, is the clustering of protein-protein interaction networks to figure out what protein complexes exist in an organism.
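As a toy illustration of this kind of unsupervised subtype hunting, here is a small sketch, with no labels used anywhere, that runs plain Lloyd's k-means on synthetic "expression profiles" drawn from two hidden groups; the data, dimensions, and separation are all invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expression profiles": 60 samples from two hidden subtypes.
# The subtype labels below are used only to generate data, never to fit.
subtype_a = rng.normal(loc=0.0, scale=0.5, size=(30, 10))
subtype_b = rng.normal(loc=3.0, scale=0.5, size=(30, 10))
X = np.vstack([subtype_a, subtype_b])

# Plain Lloyd's k-means with farthest-point initialisation.
k = 2
centroids = np.array([X[0], X[np.linalg.norm(X - X[0], axis=1).argmax()]])
for _ in range(20):
    # Assign each sample to its nearest centroid, then recompute means.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(labels)
```

On this well-separated toy data the two recovered clusters coincide exactly with the two generating subtypes, which is the hoped-for outcome when hunting for disease subtypes without labels.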
I've used unsupervised learning to downselect candidate ligands from a much larger pool of materials that would have been economically prohibitive to procure. Each ligand had a set of descriptors associated with it, and once a good ligand had been identified from experiments, additional ligands could be identified using a nearest neighbors approach. You can get a general idea of some of the methods I used if you read some of the cheminformatics publications by Alexander Tropsha. His publications page is here.
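A minimal sketch of that downselection step, with entirely made-up descriptor vectors (this is not Tropsha's method, just the nearest-neighbor idea in descriptor space):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical descriptor vectors (size, charge, hydrophobicity, ...) for
# a large pool of candidate ligands; all numbers are invented.
pool = rng.normal(size=(1000, 5))

# Descriptors of a ligand that experiments identified as a good one (here,
# a pool member plus a little noise, so the example is self-contained).
hit = pool[42] + rng.normal(scale=0.01, size=5)

# Downselect: rank the pool by Euclidean distance in descriptor space and
# keep the 10 candidates most similar to the known hit.
dist = np.linalg.norm(pool - hit, axis=1)
shortlist = np.argsort(dist)[:10]
print(shortlist)
```

The shortlist is what one would actually procure and test next, turning one experimental hit into a cheap, ranked follow-up campaign.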
and on LinkedIn, Bill Winkler mentioned the following:
Various forms of unsupervised learning (clustering), semi-supervised learning (moderate or small amounts of 'truth' data are combined with unlabelled data where 'truth' is not known), and supervised learning have been around for 30+ years.
Supervised learning is the standard machine learning method where training data are available.
Semi-supervised learning is where a moderate amount of labelled ('truth') data is combined with unlabelled data. In active learning (a variant of semi-supervised), 'truth' data may be gradually obtained until sufficient 'truth' is available for training a classifier.
For naive Bayes classifiers, Nigam et al. provide an approach for semi-supervised learning.
Nigam, K., A. K. McCallum, S. Thrun, and T. Mitchell (2000), “Text Classification from Labeled and Unlabeled Documents using EM,” Machine Learning, 39, 103-134.
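To see what the Nigam et al. style of semi-supervised EM looks like in practice, here is a compact sketch on invented binary bag-of-words data: a Bernoulli naive Bayes model is fit on a handful of labelled documents, and EM then alternates between soft-labelling the unlabelled pool (E-step) and refitting from everything (M-step). All data, dimensions, and topic probabilities are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "documents": 6-dim binary word-presence vectors from two topics.
def sample(p, n):
    return (rng.random((n, 6)) < p).astype(float)

topic0 = sample(np.array([.9, .9, .9, .1, .1, .1]), 50)
topic1 = sample(np.array([.1, .1, .1, .9, .9, .9]), 50)

X_lab = np.vstack([topic0[:3], topic1[:3]])    # a few labelled docs
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unl = np.vstack([topic0[3:], topic1[3:]])    # the rest, unlabelled

def fit(X, w):
    # M-step: refit Bernoulli naive Bayes from responsibilities w (n, 2),
    # with Laplace smoothing to keep probabilities off 0 and 1.
    prior = w.sum(0) / w.sum()
    theta = (w.T @ X + 1) / (w.sum(0)[:, None] + 2)
    return prior, theta

def e_step(X, prior, theta):
    # E-step: posterior class responsibilities for each document.
    logp = np.log(prior) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    logp -= logp.max(1, keepdims=True)
    p = np.exp(logp)
    return p / p.sum(1, keepdims=True)

# Initialise from the labelled data only, then run EM over everything.
w_lab = np.eye(2)[y_lab]
prior, theta = fit(X_lab, w_lab)
for _ in range(10):
    w_unl = e_step(X_unl, prior, theta)
    prior, theta = fit(np.vstack([X_lab, X_unl]), np.vstack([w_lab, w_unl]))

pred = e_step(X_unl, prior, theta).argmax(1)
```

Only six labelled documents anchor the class identities; the 94 unlabelled ones sharpen the parameter estimates, which is exactly the leverage the paper reports for text classification.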
The methods can be extended (via a variant of boosting) so that both better parameters and better models are obtained simultaneously. The models/software hold for unsupervised, semi-supervised, and supervised learning.
Winkler, W. E. (2000), “Machine Learning, Information Retrieval, and Record Linkage,” American Statistical Association, Proceedings of the Section on Survey Research Methods, 20-29. (also available at ).
The following provides a method of unsupervised learning that automatically finds optimal parameters in ~500 subareas of the U.S. The methods outperform an active learning approach (semi-supervised learning) that is widely used for computer matching (record linkage). I presented the following paper in the session "Best from Wiley Interdisciplinary Reviews" at the 2011 Conference on the Interface Between Computing Science and Statistics.
Herzog, T. N., Scheuren, F., and Winkler, W. E. (2010), “Record Linkage,” in (D. W. Scott, Y. Said, and E. Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics, New York, N.Y.: Wiley, 2 (5), September/October, 535-543.
The unsupervised learning methods were first applied for production software for the U.S. Decennial Censuses.
Winkler, W. E. (1988), "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 667-671, also at .
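The Winkler (1988) idea, as I understand it, is a two-class latent mixture over field-agreement patterns: under conditional independence, EM can estimate the Fellegi-Sunter m- and u-probabilities with no labelled pairs at all. Here is a sketch on synthetic agreement patterns; the field names, pool sizes, and generating probabilities are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic agreement patterns: each row records whether 4 fields (say
# name, birth year, street, zip) agree for one candidate record pair.
# True matches agree often, non-matches rarely.
matches = (rng.random((200, 4)) < 0.95).astype(float)
nonmatches = (rng.random((1800, 4)) < 0.15).astype(float)
G = np.vstack([matches, nonmatches])

# EM for the two-class Fellegi-Sunter mixture under conditional
# independence: estimate m = P(agree | match), u = P(agree | non-match)
# and pi = P(match) with no labelled pairs whatsoever.
pi, m, u = 0.5, np.full(4, 0.8), np.full(4, 0.4)
for _ in range(50):
    lm = np.log(pi) + (G * np.log(m) + (1 - G) * np.log(1 - m)).sum(1)
    lu = np.log(1 - pi) + (G * np.log(u) + (1 - G) * np.log(1 - u)).sum(1)
    w = 1.0 / (1.0 + np.exp(lu - lm))          # P(match | pattern)
    pi = w.mean()
    m = np.clip((w[:, None] * G).sum(0) / w.sum(), 1e-6, 1 - 1e-6)
    u = np.clip(((1 - w)[:, None] * G).sum(0) / (1 - w).sum(), 1e-6, 1 - 1e-6)

print(pi, m, u)
```

Pairs with w above a threshold are declared matches; on this toy data the estimated pi lands near the true match rate of 10%, with m above u in every field, recovering the generating structure unsupervised.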
The methods were rediscovered in an approach where Generalized Additive Models are used to obtain the best naive Bayes (conditional independence) approximation of a general Bayes network model, but with far less efficient algorithms than the earlier algorithms of Winkler (1988).
Larsen, K. (2005), Generalized Naïve Bayes Classifiers, SIGKDD Explorations, 7 (1), 76-81, doi: 10.1145/1089815.1089826.
Thank you to all who contributed to this post.
[1] The Security of Knowing Nothing, B. Chazelle, Nature 446 (26 April 2007), 992-993; see also zero-knowledge proofs explained to your kids, or on MathOverflow, or one based on Sudoku.

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

1 comment:

Leslie N. Smith said...

I am a bit surprised no one mentioned reinforcement learning, where there isn't any training data. Instead, the agent performs an action and obtains feedback from the environment as to the value of its past actions. Or is this far different from what you were asking?