Deep Learning by
Ian Goodfellow and
Aaron Courville and
Yoshua Bengio
On the number of response regions of deep feed forward networks with piece-wise linear activations by
Razvan Pascanu,
Guido Montufar,
Yoshua Bengio
This paper explores the complexity of deep feedforward networks with linear
pre-synaptic couplings and rectified linear activations. This is a contribution
to the growing body of work contrasting the representational power of deep and
shallow network architectures. In particular, we offer a framework for
comparing deep and shallow models that belong to the family of piecewise linear
functions based on computational geometry. We look at a deep rectifier
multi-layer perceptron (MLP) with linear outputs units and compare it with a
single layer version of the model. In the asymptotic regime, when the number of
inputs stays constant, if the shallow model has $kn$ hidden units and $n_0$
inputs, then the number of linear regions is $O(k^{n_0} n^{n_0})$. For a $k$-layer
model with $n$ hidden units on each layer it is $\Omega(\lfloor n/n_0 \rfloor^{k-1} n^{n_0})$.
The number $\lfloor n/n_0 \rfloor^{k-1}$ grows faster than $k^{n_0}$ when $n$
tends to infinity or when $k$ tends to infinity and $n \geq 2n_0$.
Additionally, even when $k$ is small, if we restrict $n$ to be $2n_0$, we can
show that a deep model has considerably more linear regions than a shallow one.
We consider this as a first step towards understanding the complexity of these
models and specifically towards providing suitable mathematical tools for
future analysis.
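To get a concrete feel for how fast the two bounds quoted above separate, here is a quick Python sketch (mine, not the authors') that simply evaluates the shallow bound $k^{n_0} n^{n_0}$ and the deep bound $\lfloor n/n_0 \rfloor^{k-1} n^{n_0}$ for a few arbitrary choices of $n_0$, $n$ and $k$:

from math import floor

def shallow_bound(k, n, n0):
    # O(k^n0 * n^n0): linear regions of a shallow model with k*n hidden units
    return (k ** n0) * (n ** n0)

def deep_bound(k, n, n0):
    # Omega(floor(n/n0)^(k-1) * n^n0): linear regions of a k-layer model with n units per layer
    return floor(n / n0) ** (k - 1) * (n ** n0)

n0 = 2
for k in (2, 4, 8):
    for n in (4, 16, 64):
        print(f"k={k:2d} n={n:3d}  shallow O(.) = {shallow_bound(k, n, n0):.2e}"
              f"  deep Omega(.) = {deep_bound(k, n, n0):.2e}")

Already for $k = 8$ and $n = 64$ the deep lower bound dwarfs the shallow bound, which is the asymptotic picture the abstract describes.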
On the Number of Linear Regions of Deep Neural Networks by
Guido Montúfar,
Razvan Pascanu,
Kyunghyun Cho,
Yoshua Bengio
We study the complexity of functions computable by deep feedforward neural
networks with piecewise linear activations in terms of the symmetries and the
number of linear regions that they have. Deep networks are able to sequentially
map portions of each layer's input-space to the same output. In this way, deep
models compute functions that react equally to complicated patterns of
different inputs. The compositional structure of these functions enables them
to re-use pieces of computation exponentially often in terms of the network's
depth. This paper investigates the complexity of such compositional maps and
contributes new theoretical results regarding the advantage of depth for neural
networks with piecewise linear activation functions. In particular, our
analysis is not specific to a single family of models, and as an example, we
employ it for rectifier and maxout networks. We improve complexity bounds from
pre-existing work and investigate the behavior of units in higher layers.
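As a rough empirical illustration of the object being counted here (a sketch of mine, not the authors' code): a network with piecewise linear activations is affine on any set of inputs sharing the same on/off activation pattern, so counting distinct patterns along a slice of input space gives a lower bound on the number of linear regions. The network sizes, random weights and sampling below are arbitrary, and with random weights the counts only illustrate the definition, not the paper's bounds:

import numpy as np

rng = np.random.default_rng(0)

def relu_net_patterns(x, weights):
    # Forward pass; returns the concatenated on/off pattern of all hidden units.
    pattern, h = [], x
    for W, b in weights:
        pre = h @ W + b
        pattern.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(pattern, axis=-1)

def random_weights(sizes):
    return [(rng.standard_normal((m, n)) / np.sqrt(m), rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

shallow = random_weights([2, 24])       # one hidden layer of 24 units
deep = random_weights([2, 8, 8, 8])     # three hidden layers of 8 units each

# Sample points along a random line through the 2-D input space.
t = np.linspace(-5, 5, 20000)[:, None]
points = t * rng.standard_normal((1, 2))

for name, net in (("shallow", shallow), ("deep", deep)):
    patterns = relu_net_patterns(points, net)
    print(name, len({row.tobytes() for row in patterns}), "activation patterns along the line")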
When Does a Mixture of Products Contain a Product of Mixtures? by
Guido F. Montufar,
Jason Morton
We derive relations between theoretical properties of restricted Boltzmann
machines (RBMs), popular machine learning models which form the building blocks
of deep learning models, and several natural notions from discrete mathematics
and convex geometry. We give implications and equivalences relating
RBM-representable probability distributions, perfectly reconstructible inputs,
Hamming modes, zonotopes and zonosets, point configurations in hyperplane
arrangements, linear threshold codes, and multi-covering numbers of hypercubes.
As a motivating application, we prove results on the relative representational
power of mixtures of product distributions and products of mixtures of pairs of
product distributions (RBMs) that formally justify widely held intuitions about
distributed representations. In particular, we show that representing the
probability distributions which can be obtained as products of mixtures
requires a mixture of products with an exponentially larger number of parameters.
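To make the two families being compared concrete, here is a tiny numerical sketch (mine, with arbitrary parameter values) that tabulates a mixture of product (Bernoulli) distributions next to an RBM marginal, which factors over hidden units as a product of two-component mixtures:

import itertools
import numpy as np

rng = np.random.default_rng(1)
n_visible, n_components, n_hidden = 4, 3, 3

# All binary visible states.
states = np.array(list(itertools.product([0, 1], repeat=n_visible)), dtype=float)

# Mixture of products: P(v) = sum_k pi_k * prod_i p_ki^v_i * (1 - p_ki)^(1 - v_i)
pi = rng.dirichlet(np.ones(n_components))
p = rng.uniform(0.1, 0.9, size=(n_components, n_visible))
mix_of_prod = np.array([
    sum(pi[k] * np.prod(p[k] ** v * (1 - p[k]) ** (1 - v)) for k in range(n_components))
    for v in states
])

# RBM marginal: P(v) proportional to exp(b.v) * prod_j (1 + exp(c_j + v.W_j)),
# i.e. a product over hidden units of mixtures of pairs of product distributions.
W = rng.standard_normal((n_visible, n_hidden))
b = rng.standard_normal(n_visible)
c = rng.standard_normal(n_hidden)
unnorm = np.exp(states @ b) * np.prod(1.0 + np.exp(c + states @ W), axis=1)
rbm = unnorm / unnorm.sum()

print("mixture of products:      ", np.round(mix_of_prod, 3))
print("RBM (product of mixtures):", np.round(rbm, 3))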
FitNets: Hints for Thin Deep Nets by
Adriana Romero,
Nicolas Ballas,
Samira Ebrahimi Kahou,
Antoine Chassang,
Carlo Gatta,
Yoshua Bengio
While depth tends to improve network performance, it also makes
gradient-based training more difficult since deeper networks tend to be more
non-linear. The recently proposed knowledge distillation approach is aimed at
obtaining small and fast-to-execute models, and it has shown that a student
network could imitate the soft output of a larger teacher network or ensemble
of networks. In this paper, we extend this idea to allow the training of a
student that is deeper and thinner than the teacher, using not only the outputs
but also the intermediate representations learned by the teacher as hints to
improve the training process and final performance of the student. Because the
student intermediate hidden layer will generally be smaller than the teacher's
intermediate hidden layer, additional parameters are introduced to map the
student hidden layer to the prediction of the teacher hidden layer. This allows
one to train deeper students that can generalize better or run faster, a
trade-off that is controlled by the chosen student capacity. For example, on
CIFAR-10, a deep student network with almost 10.4 times fewer parameters
outperforms a larger, state-of-the-art teacher network.
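The hint stage at the heart of the method can be written down in a few lines: a small regressor projects the thin student's guided layer up to the width of the teacher's hint layer, and the two representations are matched with an L2 loss. The numpy sketch below is only a schematic of that loss, with made-up layer widths and random features standing in for the two networks, not the paper's implementation:

import numpy as np

rng = np.random.default_rng(2)
batch, d_student, d_teacher = 32, 64, 256   # placeholder widths

# Intermediate representations (would come from the student and teacher networks).
student_hidden = rng.standard_normal((batch, d_student))
teacher_hidden = rng.standard_normal((batch, d_teacher))

# Extra parameters introduced for training only: a regressor mapping student width to teacher width.
W_r = 0.01 * rng.standard_normal((d_student, d_teacher))
b_r = np.zeros(d_teacher)

def hint_loss(h_student, h_teacher, W, b):
    # 0.5 * || r(h_student) - h_teacher ||^2, averaged over the batch
    pred = h_student @ W + b
    return 0.5 * np.mean(np.sum((pred - h_teacher) ** 2, axis=1))

print("hint loss:", hint_loss(student_hidden, teacher_hidden, W_r, b_r))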
On the Expressive Power of Deep Learning: A Tensor Analysis by
Nadav Cohen,
Or Sharir,
Amnon Shashua
It has long been conjectured that hypothesis spaces suitable for data that is
compositional in nature, such as text or images, may be more efficiently
represented with deep hierarchical architectures than with shallow ones.
Despite the vast empirical evidence, formal arguments to date are limited and
do not capture the kind of networks used in practice. Using tensor
factorization, we derive a universal hypothesis space implemented by an
arithmetic circuit over functions applied to local data structures (e.g. image
patches). The resulting networks first pass the input through a representation
layer, and then proceed with a sequence of layers comprising sum followed by
product-pooling, where sum corresponds to the widely used convolution operator.
The hierarchical structure of networks is born from factorizations of tensors
based on the linear weights of the arithmetic circuits. We show that a shallow
network corresponds to a rank-1 decomposition, whereas a deep network
corresponds to a Hierarchical Tucker (HT) decomposition. Log-space computation
for numerical stability transforms the networks into SimNets.
In its basic form, our main theoretical result shows that the set of
polynomially sized rank-1 decomposable tensors has measure zero in the
parameter space of polynomially sized HT decomposable tensors. In deep learning
terminology, this amounts to saying that besides a negligible set, all
functions that can be implemented by a deep network of polynomial size, require
an exponential size if one wishes to implement (or approximate) them with a
shallow network. Our construction and theory shed new light on various
practices and ideas employed by the deep learning community, and in that sense
bear a paradigmatic contribution as well.
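A minimal way to see the shallow-versus-deep correspondence drawn here (my toy illustration, with arbitrary dimensions and a single level of pairing rather than the paper's full construction): a shallow sum/product-pooling network scores an input with a rank-1 coefficient tensor, an outer product of one weight vector per patch, while a deep network builds the coefficient tensor by recursively pairing patches, in the spirit of a Hierarchical Tucker decomposition:

import numpy as np

rng = np.random.default_rng(3)
m, r = 3, 2  # representation size per patch, internal rank of the deep factorization

# Shallow model: coefficient tensor is an outer product of one vector per patch (rank 1).
a = [rng.standard_normal(m) for _ in range(4)]
shallow_tensor = np.einsum("i,j,k,l->ijkl", *a)

# Deep model, one pairing level: mix patches (1,2) and (3,4), then combine the pairs at the root.
A12 = rng.standard_normal((r, m, m))
A34 = rng.standard_normal((r, m, m))
root = rng.standard_normal((r, r))
deep_tensor = np.einsum("ab,aij,bkl->ijkl", root, A12, A34)

def matricization_rank(T):
    # Rank of the (patches 1,2) x (patches 3,4) matricization; a lower bound on tensor rank.
    return np.linalg.matrix_rank(T.reshape(m * m, m * m))

print("shallow matricization rank:", matricization_rank(shallow_tensor))  # 1
print("deep matricization rank:   ", matricization_rank(deep_tensor))     # r, generically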
How Can Deep Rectifier Networks Achieve Linear Separability and Preserve Distances? by
Senjian An,
Farid Boussaid,
Mohammed Bennamoun (accompanying
slides and
video from ICML 2015)
This paper investigates how hidden layers of deep rectifier networks
are capable of transforming two or more pattern sets to be linearly
separable while preserving distances to a guaranteed degree, and
proves the universal classification power of such distance preserving
rectifier networks. Through the nearly isometric nonlinear
transformation in the hidden layers, the margin of the linear separating
plane in the output layer and the margin of the nonlinear separating
boundary in the original data space can be closely related so that the
maximum margin classification in the input data space can be achieved
approximately via the maximum margin linear classifiers in the output
layer. The generalization performance of such distance preserving deep
rectifier neural networks can be well justified by the
distance-preserving properties of their hidden layers and the maximum
margin property of the linear classifiers in the output layer.
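One simple way to see how a rectifier layer can be distance-friendly (an illustrative construction of mine, not necessarily the one used in the paper): map x to (ReLU(x), ReLU(-x)) using paired weights [I; -I]. For any two inputs the output distance then provably lies between 1/sqrt(2) and 1 times the input distance, which the sketch below checks numerically on random data:

import numpy as np

rng = np.random.default_rng(4)

def mirrored_relu_layer(X):
    # Rectifier layer with paired weights [I; -I]: x -> (ReLU(x), ReLU(-x)).
    return np.concatenate([np.maximum(X, 0.0), np.maximum(-X, 0.0)], axis=1)

def pairwise_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))

X = rng.standard_normal((200, 20))
H = mirrored_relu_layer(X)

iu = np.triu_indices(len(X), k=1)
ratio = pairwise_dists(H)[iu] / pairwise_dists(X)[iu]
print(f"output/input distance ratios: min={ratio.min():.3f}, max={ratio.max():.3f}")
print("theoretical range: [1/sqrt(2), 1] =", (1.0 / np.sqrt(2.0), 1.0))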
Training Very Deep Networks by
Rupesh K. Srivastava, Klaus Greff, Juergen Schmidhuber
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
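The gating idea boils down to one line per layer: with a transform gate T(x) in (0, 1), the layer outputs y = H(x) * T(x) + x * (1 - T(x)), so pushing T toward zero lets the input pass through untouched. Below is a minimal numpy sketch of a fully connected highway layer with made-up dimensions; only the gating formula and the negative gate-bias initialization (which biases layers toward carrying early in training) follow the paper:

import numpy as np

rng = np.random.default_rng(5)
d = 50  # layer width; input and output widths must match for the carry path

W_H = rng.standard_normal((d, d)) / np.sqrt(d)
b_H = np.zeros(d)
W_T = rng.standard_normal((d, d)) / np.sqrt(d)
b_T = np.full(d, -2.0)  # negative bias: gates start mostly in "carry" mode

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x):
    H = np.maximum(x @ W_H + b_H, 0.0)   # transform path (ReLU here)
    T = sigmoid(x @ W_T + b_T)           # transform gate in (0, 1)
    return H * T + x * (1.0 - T)         # gated mix of transform and carry

x = rng.standard_normal((8, d))
y = x
for _ in range(100):  # stack many layers (weights shared here just to keep the sketch short)
    y = highway_layer(y)
print("output after 100 highway layers:", y.shape, "mean abs:", float(np.mean(np.abs(y))))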