
Wednesday, August 27, 2014

Yoshua Bengio's view on Deep Learning


Following this entry, Yoshua Bengio just wrote the following on his Google+ stream (I have added a few links and reshaped the text for clearer reading):

There was an interesting debate on deep learning at Technion a couple of days ago: http://nuit-blanche.blogspot.fr/2014/08/is-deep-learning-final-frontier-and-end.html 
I wasn't there, but I wish I had been, in order to clarify a number of points. So I wrote this message to the members of the debate panel (sorry, it's pretty long):
-------- 
I just watched your recent debate on deep learning that took place at Technion. It was very interesting to hear you but made me wish very much that I had been there to inform the debate. So I would like to throw in a few answers to some of the questions and comments about some statements that were made. 
(1) Lack of theory and understanding of why deep learning works
I would be much more nuanced than what was said: there is already a lot of theory, but a lot more remains mysterious and needs to be worked out.
1.1 Regarding depth, there are several theoretical results showing that deep circuits (including deep feedforward nets with rectifiers) can represent some functions exponentially (wrt depth) more efficiently than shallower ones (or ones of depth 2, which is the minimum to get universal approximation). The latest in this series (with references to previous ones) is this one: http://arxiv.org/abs/1402.1869 
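To give a non-specialist flavor of what these depth separations say, here is a rough paraphrase (not the precise statements of the theorems in the papers cited above), in terms of the number of linear regions a rectifier network can carve out of its input space:

```latex
% Rough paraphrase of the depth-separation flavor, not the exact theorems.
% A single hidden layer of n rectifier units on inputs in R^d defines an
% arrangement of n hyperplanes, so it can split the input space into at most
% \sum_{j=0}^{d} \binom{n}{j} linear regions (polynomial in n for fixed d).
% Composing layers lets regions be re-used ("folded") at every level, so with
% a comparable number of units the region count can grow exponentially with
% depth; this is the kind of separation shown in arXiv:1402.1869.
\[
  \#\,\mathrm{regions}\bigl(\text{1 hidden layer of } n \text{ units}\bigr)
    \;\le\; \sum_{j=0}^{d} \binom{n}{j}
  \qquad\text{vs.}\qquad
  \#\,\mathrm{regions}\bigl(\text{deep net}\bigr)
    \;=\; \exp\bigl(\Omega(\text{depth})\bigr).
\]
```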
1.2 Regarding the effect of learning a distributed representation, the same paper (following on many papers where I discuss this non-mathematically, including my 2009 book, and also following Montufar's similar result on RBMs, http://arxiv.org/pdf/1206.0387v4) shows that even a single-layer distributed representation can represent an exponentially richer function (or distribution, for RBMs) than a "non-distributed" learner, e.g., kernel SVM, clustering, k-nearest-neighbors, decision trees, and other such non-parametric machine learning methods. This follows on previous negative results on such non-distributed local learners in my NIPS'2005 paper with Delalleau "The curse of highly variable functions for local kernel machines" and a related paper on the limitations of decision trees ("Decision trees do not generalize to new variations").
1.3 Insight behind the theory. The basic reason we get these potentially exponential gains is that we have compositionality of the parameters, i.e., the same parameters can be re-used in many contexts, so O(N) parameters can distinguish O(2^N) regions in input space, whereas with nearest-neighbor-like methods you need O(N) parameters (i.e., O(N) examples) to characterize a function that can only distinguish between O(N) regions.
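To make the contrast concrete, here is a small numerical sketch (toy random data and random parameters, an illustration rather than an experiment from any of the papers above): it probes how many distinct regions N random hyperplane features induce on sampled points, versus the at-most-N cells of a nearest-prototype learner that also uses N parameter vectors.

```python
# Illustration: N "distributed" features (sign patterns of N random hyperplanes)
# separate far more regions of the input space than N "local" prototypes
# (nearest-centroid cells), even though both use O(N) parameter vectors.
import numpy as np

rng = np.random.default_rng(0)
d, N, n_points = 10, 20, 20_000            # input dim, number of features/prototypes, probe points

X = rng.standard_normal((n_points, d))     # points where we probe the partition

# Distributed: each point is described by which side of each hyperplane it falls on.
W, b = rng.standard_normal((N, d)), rng.standard_normal(N)
sign_patterns = (X @ W.T + b > 0)          # shape (n_points, N), binary codes
n_distributed = len({tuple(row) for row in sign_patterns})

# Local: each point is described only by its nearest of N prototypes.
C = rng.standard_normal((N, d))
nearest = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
n_local = len(set(nearest.tolist()))       # at most N cells, by construction

print(f"regions seen with {N} hyperplane features : {n_distributed}")
print(f"regions seen with {N} prototypes          : {n_local}")
```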
Of course, this "gain" is only for some target functions, or more generally, we can think of it like a prior. If the prior is applicable to our target distribution, we can gain a lot. As usual in ML, there is no real free lunch. The good news is that this prior is very broad and seems to cover most of the things that humans learn about. What it basically says is that the data we observe are explained by a bunch of underlying factors, and that you can learn about each of these factors WITHOUT requiring to see all of the configurations of the other factors. This is how you are able to generalize to new configurations and get this exponential statistical gain. 
(2) Generative & unsupervised models. 
At the end there was a question about unsupervised and generative deep learning, and I would have really liked to be there to say that there is a lot of it; in fact, it is one of the main differences between the neural net research of the previous wave and the current wave, i.e., we have made a lot of progress in designing unsupervised learning algorithms, including generative ones, and including deep ones. I am sure you have already heard about RBMs, and maybe about denoising auto-encoders? They are shallow, but you can use them to pre-train deep generative models such as DBMs and DBNs (although these days we are moving into territory where you don't need pre-training anymore). To see how powerful these are, I invite you to look at some of the images generated in papers from the last few years, e.g., http://www.icml-2011.org/papers/591_icmlpaper.pdf, pretty much all the papers from Ruslan Salakhutdinov, or http://arxiv.org/pdf/1312.6114v10. Some of that literature is reviewed in our 2013 TPAMI review paper.
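For readers who have not played with these models, here is a deliberately tiny denoising auto-encoder in plain NumPy (random toy data, arbitrary hyper-parameters; a sketch of the idea, not a model from any of the papers above): corrupt the input, train the network to reconstruct the clean version, and keep the hidden layer as the learned representation.

```python
# Minimal denoising auto-encoder: corrupt the input, then train the network to
# reconstruct the clean version. The hidden layer is the learned representation.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 512, 30, 10                              # examples, input dim, hidden units
X = (rng.random((n, d)) < 0.3).astype(float)       # toy binary "data"

W = rng.normal(0, 0.1, (d, h)); b = np.zeros(h)    # encoder parameters
V = rng.normal(0, 0.1, (h, d)); c = np.zeros(d)    # decoder parameters
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for step in range(2000):
    Xc = X * (rng.random(X.shape) > 0.3)           # corruption: randomly zero out inputs
    H = sigmoid(Xc @ W + b)                        # encode the corrupted input
    Y = sigmoid(H @ V + c)                         # decode back to input space
    # Cross-entropy reconstruction loss against the *clean* input X.
    dY = (Y - X) / n                               # gradient wrt decoder pre-activation
    dV = H.T @ dY; dc = dY.sum(0)
    dH = (dY @ V.T) * H * (1 - H)
    dW = Xc.T @ dH; db = dH.sum(0)
    for p, g in ((W, dW), (b, db), (V, dV), (c, dc)):
        p -= lr * g                                # plain SGD update (in place)

print("final reconstruction cross-entropy:",
      float(-(X * np.log(Y + 1e-9) + (1 - X) * np.log(1 - Y + 1e-9)).mean()))
```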
(3) How deep is deep enough?  
The above theoretical results say that the answer is data-dependent. For some tasks you may need a deeper (more non-linear, more complex) function. Also, the actual "number" of layers is somewhat immaterial because it depends on the richness of the non-linearities that we put in each layer (i.e., you may simulate a depth-d net with a depth-2d one, depending on what operations each level is allowed to perform). However, with the usual suspects (neuron-like things, RBF-like things, gate-like things, sums and products), two levels give you universal approximation, so what we usually call "shallow" is depth 1 or 2. Depth 1 corresponds to linear systems, which are not even universal approximators, but are still very often used because they are so convenient. Regarding FFTs and such, they are indeed deep, but that is not deep learning. Deep learning is when you learn multiple levels of representation, and the number of levels is something you can learn as well, so you have a kind of generic recipe throughout.
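To ground the depth-1 versus depth-2 remark, here is a toy sketch (arbitrary sizes and learning rate): a depth-1 (linear) model cannot represent even XOR, while a net with a single hidden layer of non-linear units can.

```python
# Depth 1 (linear) vs depth 2 (one hidden layer) on XOR: the linear model is
# not a universal approximator and cannot fit it; the two-layer net can.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR labels

# Depth 1: ordinary least squares. The best a linear model can do on XOR
# is to predict ~0.5 everywhere.
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print("linear predictions :", np.round(A @ w, 2))

# Depth 2: a small hidden tanh layer trained by gradient descent on squared error.
rng = np.random.default_rng(3)
W1, b1 = rng.normal(0, 1.0, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1.0, 4), 0.0
lr = 0.2
for _ in range(10000):
    H = np.tanh(X @ W1 + b1)                       # hidden layer
    p = H @ W2 + b2                                # linear output
    dp = 2 * (p - y) / len(y)                      # d(MSE)/dp
    dW2, db2 = H.T @ dp, dp.sum()
    dH = np.outer(dp, W2) * (1 - H ** 2)
    dW1, db1 = X.T @ dH, dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("2-layer predictions:", np.round(np.tanh(X @ W1 + b1) @ W2 + b2, 2))
```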
(4) Successes of deep learning, beyond object recognition in images. 
Of course, besides object recognition in images, the big success has been in speech recognition. What most people don't know is that in the language modeling part of speech recognition, neural networks have been the state of the art (SOTA) for a number of years (starting with the work of Holger Schwenk), and neural nets (especially learned neural word embeddings, which I started with my NIPS'2000 paper) are quickly becoming a standard and a secret sauce in modern NLP. In particular, this year we are seeing a number of groups reaching and passing the SOTA in machine translation using these ideas (Google, BBN, Oxford, my lab, others). What is interesting is that we have moved beyond "object classification" into "structured output" tasks, where the "output" is a high-dimensional object (e.g., a sentence or a parse tree) which we typically represent by its joint distribution. What is also interesting is that a lot of that work relies on UNSUPERVISED learning from pure text. See below for more on the supervised vs. unsupervised divide.
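For readers who have not seen what a neural word embedding looks like in code, here is a deliberately tiny sketch (toy corpus, arbitrary dimensions; a stripped-down skip-gram-style model, not the NIPS'2000 model or any of the systems mentioned above): each word gets a learned vector, trained so that a word predicts its neighbours, and words used in similar contexts end up with similar vectors.

```python
# Toy word embeddings: train word vectors so each word predicts the words that
# appear next to it (a stripped-down skip-gram with a full softmax).
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus)); V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}
tokens = [idx[w] for w in corpus]

# (center, context) pairs with a window of 1 word on each side.
pairs = [(tokens[i], tokens[j])
         for i in range(len(tokens))
         for j in (i - 1, i + 1) if 0 <= j < len(tokens)]

rng = np.random.default_rng(0)
dim = 8
E = rng.normal(0, 0.1, (V, dim))                 # input (embedding) vectors
U = rng.normal(0, 0.1, (V, dim))                 # output vectors
lr = 0.1

for epoch in range(200):
    for center, context in pairs:
        scores = U @ E[center]                   # (V,) unnormalized scores
        p = np.exp(scores - scores.max()); p /= p.sum()
        grad = p.copy(); grad[context] -= 1.0    # softmax cross-entropy gradient
        dE = U.T @ grad                          # gradient wrt E[center]
        U -= lr * np.outer(grad, E[center])
        E[center] -= lr * dE

# Words used in similar contexts end up with similar (cosine-close) vectors.
def most_similar(w):
    v = E[idx[w]] / np.linalg.norm(E[idx[w]])
    sims = (E / np.linalg.norm(E, axis=1, keepdims=True)) @ v
    return sorted(zip(vocab, sims), key=lambda t: -t[1])[1:3]

print("nearest to 'cat':", most_similar("cat"))
```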
(5) Unsupervised learning is not bullshit, and we can generalize from very few labeled examples thanks to transfer learning. Unsupervised learning is crucial to approach AI for a number of fundamental reasons, including the abundance of unlabeled data and the observed fact that representation learning (whether supervised or unsupervised) allows transfer learning, making it possible to learn new categories from VERY FEW LABELED EXAMPLES. Of course this is only possible if the learner has previously learned good representations from related categories, but with AlexNet it has clearly been shown that from 1000 object categories you can generalize to new categories with just a few examples. This has been demonstrated in many papers, starting with the two transfer learning competitions we won in 2011 (ICML 2011 and NIPS 2011), using UNSUPERVISED transfer learning. More recently, Socher showed that you can even get some decent generalization from ZERO examples simply because you know things from multiple modalities (e.g., that 'dog' and 'cat' are semantically similar in sentences, so you can guess that something in an image could be a dog even if you have only seen images of cats). We had shown a similar result earlier (Larochelle et al., AAAI 2008) in the case of learning a new task for which you have some kind of representation (and these representations are learned across tasks).
So you can't use deep learning on a new field for which there is very little data if there is no relationship with what the learner has learned previously, but that is also true of humans. 
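Here is what that transfer recipe looks like in a toy sketch (the "pretrained" feature extractor below is a stand-in, just a fixed random projection with a rectifier, not AlexNet or any model from the papers above): freeze the representation and fit only a small classifier on a handful of labels for the new classes.

```python
# Sketch of few-shot transfer: keep a previously learned representation fixed
# and train only a tiny classifier on very few labeled examples of new classes.
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(x):
    # Stand-in for a representation learned on other categories/data.
    # (Here: a fixed random projection + rectifier, for illustration only.)
    W = np.random.default_rng(42).standard_normal((x.shape[1], 64))
    return np.maximum(x @ W, 0.0)

# Two "new" classes, each with only 5 labeled examples.
n_per_class, d = 5, 20
X0 = rng.normal(-1.0, 1.0, (n_per_class, d))
X1 = rng.normal(+1.0, 1.0, (n_per_class, d))
X = np.vstack([X0, X1]); y = np.array([0] * n_per_class + [1] * n_per_class)

# Train a logistic regression on the frozen features (the only learned part).
F = pretrained_features(X)
w = np.zeros(F.shape[1]); b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    g = p - y                                   # logistic-loss gradient signal
    w -= 0.1 * (F.T @ g) / len(y)
    b -= 0.1 * g.mean()

# Evaluate on fresh samples from the same two new classes.
Xt = np.vstack([rng.normal(-1.0, 1.0, (100, d)), rng.normal(+1.0, 1.0, (100, d))])
yt = np.array([0] * 100 + [1] * 100)
pt = 1.0 / (1.0 + np.exp(-(pretrained_features(Xt) @ w + b))) > 0.5
print("accuracy on new classes with 10 labels total:", (pt == yt).mean())
```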
(6) Deep learning is not magic. 
I am horrified at how people react to deep learning, either
  • (a) treating it as if it were something magical, or
  • (b) blaming it for not being magical and not solving every problem.
Come on, let's not feed the blind hype; let's stay in the realm of science, which concentrates on the question of UNDERSTANDING. However, the kind of understanding I and others are seeking is not about the physical world directly but about the LEARNING mechanisms. That is very different. That is the thing on which I seek insight, and that is what my papers seek to provide.
(7) About performance bounds: 
Because we are doing machine learning, you won't be able to achieve performance bounds without making assumptions about your data-generating distribution. No magic. The BIG difference between deep learning and classical non-parametric statistical machine learning is that we go beyond the SMOOTHNESS assumption and add other priors, such as:
  • the existence of these underlying generative factors (= distributed representations)
  • assuming that they are organized hierarchically by composition (= depth)
  • assuming that they are causes of the observed data (allows semi-supervised learning to work)
  • assuming that different tasks share different subsets of factors (allows multi-task learning and transfer learning to tasks with very few labeled examples)
  • assuming that the top-level factors are related in simple ways to each other (makes it possible to stick a simple classifier on top of unsupervised learning and get decent results, for example)
See more of a discussion of that in my 2013 PAMI review paper on representation learning.
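One of the priors listed above can be made concrete in a few lines. The toy sketch below (arbitrary shapes, no real data, not from the review paper) shows the "shared factors" structure behind multi-task and transfer learning: several task heads read off the same learned representation, so gradients from any task's labels would update the shared weights that all tasks use.

```python
# Multi-task "shared factors" prior: one learned representation, several heads.
# During training, gradients from every task's loss flow into the shared
# weights, which is why tasks with few labels can piggyback on data-rich tasks.
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 8                                   # input dim, shared representation dim

W_shared = rng.normal(0, 0.1, (d, h))          # representation used by ALL tasks
heads = {"task_A": rng.normal(0, 0.1, (h, 3)), # e.g., a 3-way classifier head
         "task_B": rng.normal(0, 0.1, (h, 1))} # e.g., a scalar regression head

def representation(x):
    return np.maximum(x @ W_shared, 0.0)       # shared rectifier features

x = rng.standard_normal((5, d))                # a small batch of inputs
z = representation(x)
logits_A = z @ heads["task_A"]                 # per-task outputs computed from
pred_B = z @ heads["task_B"]                   # the same shared representation
print(logits_A.shape, pred_B.shape)            # (5, 3) and (5, 1)
```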
I am also posting this answer on Nuit Blanche (where I saw the video), Google+ and Facebook. The new book (called 'Deep Learning') that I am writing should help make these answers even clearer.
Thanks again for helping me clarify things that need to be made clearer.
-- Yoshua




Image Credit: NASA/JPL-Caltech 
This image was taken by Navcam: Right B (NAV_RIGHT_B) onboard NASA's Mars rover Curiosity on Sol 729 (2014-08-25 05:32:20 UTC). 

