Yesterday, Yann was speaking at the Challenge in Optimization for Data Science workshop. One of his slides featured the distribution of losses across different neural network sizes (each color representing a specific network size; for each size, several randomly initialized networks were trained). The figure above was extracted from [1] but it was the one shown in the slide. The crux of the argument was that as neural networks, trained from initially random weights, become larger, their final losses concentrate; in other words, once the networks are large enough, the non-convexity is not that bad because even the local minima are in fact not far from the global minimum. Roughly speaking, enlarging or lifting the parameter space allows the loss to converge toward zero while at the same time these losses become more and more concentrated. Understanding how we can train these networks is important because we currently have no real idea how to streamline the process, nor a particular understanding of the sampling bounds. It turns out that I had a very similar discussion with the folks at Kalray, who were presenting their architectures to the NeuroSTIC people, who happened to be holding their annual meeting just a few meters away.

Coming back to this lifting argument, here are two papers that mention that paper ([1] The Loss Surfaces of Multilayer Networks by Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun), and both take the path of tensorization as a way to do the lifting: first there is [3] Generalization Bounds for Neural Networks through Tensor Factorization by Majid Janzamin, Hanie Sedghi, and Anima Anandkumar, and then there is [2] Global Optimality in Tensor Factorization, Deep Learning, and Beyond by Benjamin D. Haeffele and Rene Vidal.
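As a reminder of the basic object behind these tensor approaches, here is a small numpy sketch (my own illustration, not code from [2] or [3]): a third-order tensor of known CP rank 2 is factored back into rank-1 components by alternating least squares (CP/ALS), the workhorse routine in this line of work.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding (C ordering) of a 3-way tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(U, V):
    """Column-wise Kronecker product, matching the C-ordered unfolding."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

# Build a synthetic tensor of known CP rank 2.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 2))
B = rng.normal(size=(7, 2))
C = rng.normal(size=(8, 2))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# Alternating least squares: solve for one factor at a time.
rank = 2
Ah = rng.normal(size=(6, rank))
Bh = rng.normal(size=(7, rank))
Ch = rng.normal(size=(8, rank))
for _ in range(200):
    Ah = unfold(T, 0) @ np.linalg.pinv(khatri_rao(Bh, Ch).T)
    Bh = unfold(T, 1) @ np.linalg.pinv(khatri_rao(Ah, Ch).T)
    Ch = unfold(T, 2) @ np.linalg.pinv(khatri_rao(Ah, Bh).T)

Th = np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch)
print("relative error:", np.linalg.norm(T - Th) / np.linalg.norm(T))
```

Each factor update is an ordinary least-squares solve, which is why these schemes decompose so naturally into parallel linear-algebra kernels.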

From [3] one can read:

Recently, Choromanska et al. (2015) analyze the loss surface of a multi-layer ReLU network by relating it to a spin glass system. They make several assumptions such as variable independence for the input, redundancy in network parameterization and so on. They show that under these assumptions, the lowest critical values of the random loss function form a layered structure, and the number of local minima outside that band diminishes exponentially with the network size.

Aside from the sampling bound they give for training, their approach is embarrassingly parallel, which has a big impact on the eventual hardware on which it will run. From [2] one can read:

Additionally, we have also shown that if the size of the network is allowed to be large enough then for any initialization a global minimizer can be found from purely local descent, and thus local minima are all equivalent. This is a similar conclusion to the work of Choromanska et al. (2014), who analyzed the problem from a statistical standpoint and showed that with sufficiently large networks and appropriate assumptions about the distributions of the data and network weights, then with high probability any family of networks learned with different initializations will have similar objective values, but we note that our results allow for a well defined set of conditions which will be sufficient to guarantee the property. Finally, many modern large scale networks do not use traditional regularization on the network weight parameters such as an l1 or l2 norm during training and instead rely on alternative forms of regularization such as dropout as it tends to achieve better performance in practice (Srivastava et al., 2014; Krizhevsky et al., 2012; Wan et al., 2013). Given our commentary above regarding the critical importance of balancing the degree of homogeneity between the mapping and the regularizer, an immediate prediction of our analysis is that simply ensuring that the degrees of homogeneity are balanced could be a significant factor in improving the performance of deep networks.

I can't wait to see how this discussion on the mapping and the regularizer will turn out. Without further ado, here are the abstracts:

[1] The Loss Surfaces of Multilayer Networks

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun

We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.

[2] Global Optimality in Tensor Factorization, Deep Learning, and Beyond

Benjamin D. Haeffele, Rene Vidal

Techniques involving factorization are found in a wide range of applications and have enjoyed significant empirical success in many fields. However, common to a vast majority of these problems is the significant disadvantage that the associated optimization problems are typically non-convex due to a multilinear form or other convexity destroying transformation. Here we build on ideas from convex relaxations of matrix factorizations and present a very general framework which allows for the analysis of a wide range of non-convex factorization problems - including matrix factorization, tensor factorization, and deep neural network training formulations. We derive sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and show that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm. Our framework also provides a partial theoretical justification for the increasingly common use of Rectified Linear Units (ReLUs) in deep neural networks and offers guidance on deep network architectures and regularization strategies to facilitate efficient optimization.

[3] Generalization Bounds for Neural Networks through Tensor Factorization

Majid Janzamin, Hanie Sedghi, Anima Anandkumar

Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for training a two-layer neural network. We prove efficient generalization bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for generalizability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, for the first time, we have a computationally efficient method with guaranteed generalization bounds for training neural networks.

**Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !**

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
