Wednesday, January 07, 2015

Why does Deep Learning work? - A perspective from Group Theory


h/T to Giuseppe Guissepe, Gabriel and Suresh for the discussion.

Here is a new insight from a paper that is up for review at ICLR2015:
We give an informal description. Suppose G is a group that acts on a set X by moving its points around (e.g., groups of 2 x 2 invertible matrices acting over the Euclidean plane). Consider x ∈ X, and let O_x be the set of all points reachable from x via the group action. O_x is called an orbit. Some of the group elements may leave x unchanged. This subset, S_x, which is also a subgroup, is the stabilizer of x....
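To make the orbit/stabilizer vocabulary concrete, here is a small Python sketch (mine, not the authors'): the dihedral group of the square, written as eight 2 x 2 matrices, acts on points of the plane; for each point we enumerate its orbit and its stabilizer and check that their sizes multiply to the size of the group.

# A minimal sketch (not from the paper) illustrating orbits and stabilizers:
# the dihedral group D4, realized as eight 2x2 matrices, acts on points of the plane.
import numpy as np

def d4_elements():
    r = np.array([[0, -1], [1, 0]])          # rotation by 90 degrees
    s = np.array([[1, 0], [0, -1]])          # reflection across the x-axis
    rots = [np.linalg.matrix_power(r, k) for k in range(4)]
    return rots + [g @ s for g in rots]      # 8 elements in total

def orbit_and_stabilizer(x, group):
    x = np.asarray(x)
    orbit = {tuple(g @ x) for g in group}                        # points reachable from x
    stabilizer = [g for g in group if np.array_equal(g @ x, x)]  # elements fixing x
    return orbit, stabilizer

group = d4_elements()
for point in [(0, 0), (1, 0), (1, 2)]:
    orb, stab = orbit_and_stabilizer(point, group)
    # Orbit-stabilizer theorem: |orbit| * |stabilizer| = |G|
    print(point, "orbit size:", len(orb), "stabilizer size:", len(stab))
# (0,0): orbit 1, stabilizer 8 ; (1,0): orbit 4, stabilizer 2 ; (1,2): orbit 8, stabilizer 1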
 
To answer, imagine that the space of the autoencoders forms a group. A batch of learning iterations, aka a search for stabilizers, stops whenever a stabilizer is found. Roughly speaking, if the search is a Markov chain (or a guided chain such as MCMC), then the bigger a stabilizer, the earlier it will be hit. The group structure implies that this big stabilizer corresponds to a small orbit (by the orbit-stabilizer theorem, the sizes of the orbit and of the stabilizer multiply to the size of the group). Now, intuition suggests that the simpler a feature, the smaller its orbit. For example, a line segment generates many fewer possible shapes under linear deformations than a flower-like shape does. An autoencoder then should learn these simpler features first, which falls in line with most experiments (see Lee et al. (2009)).
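Here is a toy simulation of that intuition (again my own illustration, not taken from the paper): uniform random draws from the symmetric group S_4 acting on length-4 strings. The more symmetric the string, the larger its stabilizer, the smaller its orbit, and the sooner a random draw happens to fix it.

# A toy simulation: uniform random search over S_4 acting on length-4 strings.
# Larger stabilizer (i.e., more symmetric string) means an earlier hit on average.
import itertools, random

group = list(itertools.permutations(range(4)))   # S_4, 24 elements

def act(perm, word):
    return tuple(word[i] for i in perm)

def stabilizer_size(word):
    return sum(act(g, word) == word for g in group)

def mean_hitting_time(word, trials=5000):
    # Number of uniform draws from the group until one fixes `word`.
    total = 0
    for _ in range(trials):
        draws = 1
        while act(random.choice(group), word) != word:
            draws += 1
        total += draws
    return total / trials

for word in [("a","a","a","a"), ("a","a","b","b"), ("a","b","c","d")]:
    s = stabilizer_size(word)
    # Expected hitting time is |G| / |stabilizer| = orbit size (orbit-stabilizer theorem).
    print(word, "stabilizer:", s, "expected:", len(group) / s,
          "simulated:", round(mean_hitting_time(word), 2))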
This points directly to a new kind of regularizer! The next Sunday Morning Insight will probably be on these new regularizers. Without further ado, here is: Why does Deep Learning work? - A perspective from Group Theory by Arnab Paul, Suresh Venkatasubramanian

Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.
One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of "shadow" groups whose elements serve as close approximations.
Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
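For readers who want to see what "pre-training one layer at a time" looks like operationally, here is a bare-bones NumPy sketch of greedy layer-wise pre-training with plain autoencoders; the layer sizes, activation and learning rate are placeholder choices of mine, not the paper's.

# A minimal sketch of greedy layer-wise pre-training; sizes and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """Train a one-hidden-layer autoencoder on X and return the encoder (W, b)."""
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)          # encode
        R = sigmoid(H @ W2 + b2)          # decode (reconstruction)
        # Backprop of the squared reconstruction error.
        dR = (R - X) * R * (1 - R)
        dH = (dR @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dR / len(X); b2 -= lr * dR.mean(axis=0)
        W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)
    return W1, b1

# Greedy pre-training: each layer is an autoencoder fit to the previous layer's codes.
X = rng.random((256, 64))                 # placeholder data
codes, encoders = X, []
for n_hidden in (32, 16):                 # two layers, sizes are arbitrary
    W, b = train_autoencoder(codes, n_hidden)
    encoders.append((W, b))
    codes = sigmoid(codes @ W + b)        # feed the codes to the next layer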

Let us note that it mentions another paper featured recently in Sunday Morning Insight: An exact mapping between the Variational Renormalization Group and Deep Learning. The authors say the following about that paper:
Mehta & Schwab (2014) recently showed an intriguing connection between renormalization group flow and deep learning. They constructed an explicit mapping from a renormalization group over a block-spin Ising model (as proposed by Kadanoff et al. (1976)) to a DL architecture. On the face of it, this result is complementary to ours, albeit in a slightly different setting. Renormalization is a process of coarse-graining a system by first throwing away small details from its model, and then examining the new system under the simplified model (see Cardy (1996)). In that sense the orbit-stabilizer principle is a renormalizable theory - it allows for the exact same coarse-graining operation at every layer - namely, keeping only minimal orbit shapes and then passing them as new parameters for the next layer - and the theory remains unchanged at every scale.
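To see what the coarse-graining step of a block-spin renormalization does (this is only a toy illustration of mine, not the Mehta & Schwab mapping itself), here is a majority-rule block-spin in Python:

# 2x2 blocks of +/-1 Ising spins are replaced by their majority sign,
# throwing away small-scale detail; the same rule can be applied again at the new scale.
import numpy as np

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(16, 16))       # a random Ising configuration

def block_spin(config, b=2):
    """Majority-rule coarse-graining over b x b blocks (ties broken toward +1)."""
    n = config.shape[0] // b
    blocks = config.reshape(n, b, n, b).sum(axis=(1, 3))
    return np.where(blocks >= 0, 1, -1)

coarse = block_spin(spins)          # 16x16 -> 8x8
coarser = block_spin(coarse)        # 8x8 -> 4x4, same rule at the new scale
print(spins.shape, coarse.shape, coarser.shape)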
As we mentioned then, other approaches such as ScatNet also aim to build a nonlinear representation (in their words, a scattering transform) so that the resulting nonlinear features can be used within a group-theoretic framework.
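For completeness, here is a very rough sketch of one scattering layer in the spirit of ScatNet (a band-pass filter, a modulus nonlinearity, then an average); the Morlet-like filter and its parameters are illustrative choices of mine, not ScatNet's actual filter bank.

# One heavily simplified scattering layer: convolve with a band-pass filter,
# take the modulus (the nonlinearity), then average to get a stable coefficient.
import numpy as np

def morlet(n, xi=0.5, sigma=8.0):
    t = np.arange(n) - n // 2
    return np.exp(1j * xi * t) * np.exp(-t**2 / (2 * sigma**2))

def scattering_layer(x, filters):
    # First-order scattering coefficients: local average of |x * psi| for each filter.
    return [np.abs(np.convolve(x, psi, mode="same")).mean() for psi in filters]

x = np.sin(np.linspace(0, 20 * np.pi, 512))               # a toy signal
filters = [morlet(64, xi=xi) for xi in (0.2, 0.5, 1.0)]   # a small filter bank
print(scattering_layer(x, filters))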

 

2 comments:

rara avis said...

Igor, I know I love you and your blog... but dang! My name is Giuseppe! You put n typos in 8 characters! Not sparse! Not sparse!!

-gappy3000/giuseppe

Igor said...

I have no words. Apologies and sorry !

Igor.
