Monday, December 30, 2013

Do Deep Nets Really Need to be Deep?

There are several ways of learning the Identity (through compressive sensing in the semilinear model, and through Machine Learning algorithms for nonlinear models, see [5]), with deep or not-so-deep neural networks, and there is currently much back-and-forth in trying to figure out what makes a good model. On the one hand, you want shallow models such as k-Sparse Autoencoders; deep down, one of the reasons is that they hold the potential for Provable Algorithms for Machine Learning Problems in our lifetimes. On the other hand, deep networks seem to provide more accuracy on benchmarks.

One can compare shallow and deep networks in several ways: by comparing their results on good databases / benchmarks, through the acid test of sharp phase transitions (see [1,2,3,4,5,6]), or by seeing how much precision is lost when a deep network is approximated by a shallow one. The idea is that approximating a deeper network with a shallower one gives a sense of whether investing much time in deeper networks is legitimate ... or not. This is precisely what the next paper does:

Do Deep Nets Really Need to be Deep? by Lei Jimmy Ba, Rich Caruana

Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
From the beginning of the paper:

You are given a training set with 10M labeled points. When you train a shallow neural net with one fully-connected feedforward hidden layer on this data you obtain 85% accuracy on test data. When you train a deeper neural net as in [2] consisting of convolutional layers, pooling layers, and multiple fully-connected feedforward layers on the same data you obtain 90% accuracy on the test set. What is the source of this magic? Is the 5% increase in accuracy of the deep net over the shallow net because: a) the deep net has more parameters than the shallow net; b) the deep net is deeper than the shallow net; c) nets without convolution can’t learn what nets with convolution can learn; d) current learning algorithms and regularization procedures work better with deep architectures than with shallow architectures; e) all or some of the above; f) none of the above?
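Option (a) in the quote above is a matter of bookkeeping that is easy to check by hand. Here is a minimal sketch of how one might count the parameters of a fully-connected net and find the width of a one-hidden-layer net that fits the same budget. The layer sizes are made up for illustration and are not taken from the paper, and the function names are hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully-connected net with the given layer widths."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def shallow_width_for_budget(n_inputs, n_outputs, budget):
    """Largest hidden width h whose one-hidden-layer net stays within budget.

    params = (n_inputs + 1) * h + (h + 1) * n_outputs
           = h * (n_inputs + n_outputs + 1) + n_outputs
    """
    return (budget - n_outputs) // (n_inputs + n_outputs + 1)


# Hypothetical deep net: 100 inputs, three hidden layers of 50 units, 10 outputs.
deep_layers = [100, 50, 50, 50, 10]
deep_params = dense_param_count(deep_layers)

# Widest single-hidden-layer net that does not exceed the deep net's budget.
h = shallow_width_for_budget(100, 10, deep_params)
shallow_params = dense_param_count([100, h, 10])

print(deep_params, h, shallow_params)
```

With a matched budget, any remaining accuracy gap has to come from options (b) to (d) rather than from sheer parameter count. Convolutional layers share weights, so they would need a different count than this one.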
From Rich Caruana's page:
We're doing new work on what we call Model Compression where we take a large, slow, but accurate model and compress it into a much smaller, faster, yet still accurate model. This allows us to separate the models used for learning from the models used to deliver the learned function so that we can train large, complex models such as ensembles, but later make them small enough to fit on a PDA, hearing aid, or satellite. With model compression we can make models 1000 times smaller and faster with little or no loss in accuracy. Here's our first paper on model compression.
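To make the compression idea concrete, here is a toy sketch in plain Python. It assumes the small model is fit by regressing on the real-valued outputs of the large model over a set of inputs, so no ground-truth labels are needed. The `teacher` function below is only a stand-in for a trained deep model, and `ShallowNet`, the sizes, and the learning rate are all hypothetical choices, not the authors' setup.

```python
import math
import random

random.seed(0)


def teacher(x):
    """Stand-in for a trained deep model: two stacked tanh layers, fixed weights."""
    h1 = math.tanh(1.5 * x[0] - 0.5 * x[1])
    h2 = math.tanh(0.8 * h1 + 0.3 * x[1])
    return 2.0 * h2  # real-valued score


class ShallowNet:
    """One fully-connected tanh hidden layer with a scalar linear output."""

    def __init__(self, n_in, n_hidden):
        self.w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return out, hidden

    def sgd_step(self, x, target, lr):
        out, hidden = self.forward(x)
        err = out - target  # gradient of 0.5 * (out - target)**2 w.r.t. out
        for j, h in enumerate(hidden):
            # Backpropagate through tanh before touching w2[j].
            grad_pre = err * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= lr * err * h
            self.b1[j] -= lr * grad_pre
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_pre * xi
        self.b2 -= lr * err


def mse(net, inputs, targets):
    return sum((net.forward(x)[0] - t) ** 2
               for x, t in zip(inputs, targets)) / len(inputs)


# Inputs only: the targets come from the teacher, not from labels.
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
soft_targets = [teacher(x) for x in inputs]

student = ShallowNet(n_in=2, n_hidden=8)
mse_before = mse(student, inputs, soft_targets)
for epoch in range(50):
    for x, t in zip(inputs, soft_targets):
        student.sgd_step(x, t, lr=0.05)
mse_after = mse(student, inputs, soft_targets)

print(mse_before, mse_after)
```

The point of the design is that the student never sees a label: it only sees what the teacher outputs, which is what lets the model used for learning differ from the model used for delivery.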
Personally, I think sharp phase transitions will eventually be the great equalizers.

  1. Sunday Morning Insight: Randomization is not a dirty word
  2. Sunday Morning Insight: Sharp Phase Transitions in Machine Learning ?
  3. Sunday Morning Insight: Exploring Further the Limits of Admissibility
  4. Sunday Morning Insight: The Map Makers
  5. Quick Panorama of Sensing from Direct Imaging to Machine Learning 
  6. Faster Than a Blink of an Eye.



Unknown said...

Really interesting, but it seems that the SNN with the same number of parameters performs worse than the DNN or CNN. To get comparable performance from the SNN, they have to substantially increase the number of parameters.

Unknown said...

Sorry for the late comment to the post - your post yesterday reminded me of this question.
The paper "Do deep nets really need to be deep?" should have referenced the paper by Yoshua Bengio (one of the leaders in the field of deep learning) called "Learning Deep Architectures for AI", because its second section, "Theoretical limitations of shallow architectures", is very relevant to this question. It seems like there are strong reasons to use a deep architecture.

Igor said...


You are probably right: if the paper has not mentioned it, it probably should have.

Btw, I mentioned Yoshua Bengio's work here: