
Thursday, August 31, 2017

Videos: Deep Learning (DLSS) and Reinforcement Learning (RLSS) Summer School, Montreal 2017

We mentioned the slides earlier this summer (they are featured here) and we now have the attendant videos ! Thank you to the organizers, Joelle Pineau, Doina Precup, Graham Taylor, Aaron Courville and Yoshua Bengio, for making these available on the interwebs.

Deep neural networks that learn to represent data in multiple layers of increasing abstraction have dramatically improved the state-of-the-art for speech recognition, object recognition, object detection, predicting the activity of drug molecules, and many other tasks. Deep learning discovers intricate structure in large datasets by building distributed representations, either via supervised, unsupervised or reinforcement learning.
The Deep Learning Summer School (DLSS) is aimed at graduate students and industrial engineers and researchers who already have some basic knowledge of machine learning (and possibly but not necessarily of deep learning) and wish to learn more about this rapidly growing field of research.
In collaboration with DLSS we will hold the first edition of the Montreal Reinforcement Learning Summer School (RLSS). RLSS will cover the basics of reinforcement learning and show its most recent research trends and discoveries, as well as present an opportunity to interact with graduate students and senior researchers in the field.
The school is intended for graduate students in Machine Learning and related fields. Participants should have advanced prior training in computer science and mathematics, and preference will be given to students from research labs affiliated with the CIFAR program on Learning in Machines and Brains. 

Deep Learning Summer School

Reinforcement Learning Summer School




Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Wednesday, August 30, 2017

ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections

At our weekly meeting, Iacopo talked about this recent interesting preprint: 



Deep neural networks have become ubiquitous for applications related to visual recognition and language understanding tasks. However, it is often prohibitive to use typical neural networks on devices like mobile phones or smart watches since the model sizes are huge and cannot fit in the limited memory available on such devices. While these devices could make use of machine learning models running on high-performance data centers with CPUs or GPUs, this is not feasible for many applications because data can be privacy sensitive and inference needs to be performed directly "on" device.
We introduce a new architecture for training compact neural networks using a joint optimization framework. At its core lies a novel objective that jointly trains using two different types of networks--a full trainer neural network (using existing architectures like Feed-forward NNs or LSTM RNNs) combined with a simpler "projection" network that leverages random projections to transform inputs or intermediate representations into bits. The simpler network encodes lightweight and efficient-to-compute operations in bit space with a low memory footprint. The two networks are trained jointly using backpropagation, where the projection network learns from the full network similar to apprenticeship learning. Once trained, the smaller network can be used directly for inference at low memory and computation cost. We demonstrate the effectiveness of the new approach at significantly shrinking the memory requirements of different types of neural networks while preserving good accuracy on visual recognition and text classification tasks. We also study the question "how many neural bits are required to solve a given task?" using the new framework and show empirical results contrasting model predictive capacity (in bits) versus accuracy on several datasets.
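The projection idea is easy to prototype. Below is a minimal numpy sketch (my illustration, not the paper's code) of the two ingredients the abstract describes: an LSH-style random projection of inputs into bits, and a small model on those bits trained to mimic the soft outputs of a stand-in "trainer" model. The toy data, dimensions and teacher are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_bits(x, R):
    """LSH-style random projection: the sign of <x, r_i> gives one bit per projection."""
    return (x @ R > 0).astype(np.float32)          # shape (n_samples, n_bits)

# toy data and a stand-in "trainer" network's soft predictions
X = rng.normal(size=(256, 20))
teacher_logits = X @ rng.normal(size=(20, 3))      # pretend these come from the full model
teacher_probs = np.exp(teacher_logits) / np.exp(teacher_logits).sum(1, keepdims=True)

R = rng.normal(size=(20, 64))                      # fixed random projection matrix (64 bits)
B = project_to_bits(X, R)

# tiny projection-side model: logistic regression on the bit features,
# trained to match the teacher's soft outputs (apprenticeship-style loss)
W = np.zeros((64, 3))
for step in range(500):
    logits = B @ W
    probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    grad = B.T @ (probs - teacher_probs) / len(B)  # cross-entropy gradient vs soft targets
    W -= 0.5 * grad
```

The actual paper trains both networks jointly with backpropagation; this sketch only freezes the teacher to keep the code short.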

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Tuesday, August 29, 2017

A Brief Survey of Deep Reinforcement Learning


This is the start of a new season and we could all use some help from reviews such as this one. If you know of others, please ping me. Kai mentioned this review he co-wrote on his Twitter feed:


A Brief Survey of Deep Reinforcement Learning by Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath

Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.  
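As a refresher on the value-based side the survey starts from, here is a minimal tabular Q-learning loop; deep Q-networks replace the table with a neural network and add experience replay and a target network. The chain MDP is a made-up toy, not from the survey.

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain MDP: the Bellman update that DQN
# approximates with a neural network and a replay buffer.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 only on reaching the rightmost state."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s_next, float(s_next == n_states - 1)

s = 0
for t in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Bellman backup; DQN computes Q[s_next].max() with a separate target network
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next
```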







Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Saturday, August 26, 2017

Saturday Morning Videos: Bridging Continuous and Discrete Optimization Boot Camp, Simons Institute at Berkeley, August 21-25, 2017

Nikhil Bansal, Pablo Parrilo and Ben Recht organized a boot camp this week on "Bridging Continuous and Discrete Optimization" at the Simons Institute at Berkeley. The videos are already on the site, awesome !

Here are the abstracts and attendant videos:


In this tutorial, we will start by discussing some of the basic ideas and techniques that allow a discrete optimization problem to be relaxed or immersed into a continuous optimization problem. Typically, the resulting continuous optimization problem is convex, implying strong properties of duality and efficiency. As part of the first lecture, we will review convexification ideas and strong duality in the context of cone optimization problems. The second and third lectures will focus on many approaches for going back from the continuous problem to the discrete problem, i.e. rounding a solution from the larger, continuous problem into a solution to the discrete problem. This is a very rich area, and we will survey a variety of techniques.
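A tiny concrete instance of the relax-and-round recipe, sketched with scipy (my example, not material from the lectures): the minimum vertex cover LP relaxation followed by threshold rounding, which yields the classical 2-approximation.

```python
import numpy as np
from scipy.optimize import linprog

# Minimum vertex cover on a small graph: relax x_v in {0,1} to x_v in [0,1],
# solve the LP, then round every x_v >= 1/2 up to 1.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A_ub, b_ub = [], []
for (u, v) in edges:                       # x_u + x_v >= 1  ->  -x_u - x_v <= -1
    row = np.zeros(n)
    row[u] = row[v] = -1
    A_ub.append(row)
    b_ub.append(-1)
res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * n, method="highs")
cover = [v for v in range(n) if res.x[v] >= 0.5]   # threshold rounding
print(cover)
```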




In these introductory lectures we will present a unified introduction to Semidefinite Programming hierarchies and Sum of Squares techniques for continuous/discrete optimization, highlighting their many different and complementary interpretations (Lagrangian/algebraic duality, geometric embeddings, proof systems, probabilistic, etc). We will concisely summarize the state of the art in both positive and negative results (e.g., planted clique, tensor completion, constraint satisfaction), as well as current algorithmic approaches.
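For a feel of what the first level of such SDP hierarchies looks like in practice, here is a sketch of the Goemans-Williamson SDP relaxation of Max-Cut with random hyperplane rounding, written with cvxpy; the small graph is arbitrary and this is only an illustration, not code from the lectures.

```python
import cvxpy as cp
import numpy as np

# Max-Cut SDP relaxation on a small unweighted graph (symmetric adjacency matrix W):
# maximize (1/4) * sum_ij W_ij (1 - X_ij)  s.t.  X PSD, diag(X) = 1.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = W.shape[0]
X = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(cp.sum(cp.multiply(W, 1 - X)) / 4)
cp.Problem(objective, [cp.diag(X) == 1]).solve()

# random hyperplane rounding: factor X = L L^T, cut by the sign of <v_i, r>
L = np.linalg.cholesky(X.value + 1e-6 * np.eye(n))
r = np.random.default_rng(0).normal(size=n)
cut = np.sign(L @ r)
print(sum(W[i, j] for i in range(n) for j in range(i + 1, n) if cut[i] != cut[j]))
```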





Techniques and insights from convex optimization play an important role in the design of algorithms for discrete optimization, as evidenced by the success of convex relaxations in providing a framework for constructing and analyzing approximation algorithms. In the last decade, iterative methods from convex optimization have served as a springboard to a number of algorithmic advances for many discrete optimization tasks, including fundamental ones such as submodular optimization and maximum flow problems.
In this bootcamp series, we will provide a walk-through of the continuous approach to algorithm design in discrete optimization. We will consider the algorithmic challenges that arise in the choice of formulation and iterative algorithm by casting first-order methods as discretizations of continuous-time dynamics. We will see that the non-smooth nature of discrete optimization problems leads us to smoothen our objectives via regularization and discuss how different choices of regularizers arise in different settings. We will then put these principles to work in the contexts of submodular optimization and competitive analysis of online algorithms.
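One concrete example of "different choices of regularizers": mirror descent over the probability simplex with the entropy regularizer is exactly the multiplicative weights update, itself a discretization of a continuous-time replicator-style dynamic. A minimal numpy sketch on a made-up experts problem (mine, not the lecturers'):

```python
import numpy as np

# Multiplicative weights = entropic mirror descent on the simplex.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(100, 10))   # loss of each of 10 experts over 100 rounds
eta = 0.3
w = np.ones(10) / 10
total = 0.0
for l in losses:
    total += w @ l                     # expected loss of the current mixture
    w = w * np.exp(-eta * l)           # entropic mirror step
    w /= w.sum()                       # normalize back onto the simplex
print(total / len(losses), losses.mean(0).min())   # vs the best fixed expert in hindsight
```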






In this session, we will provide an overview of stochastic and robust optimization, with problems in statistics and machine learning providing a motivating focus. We will discuss connections between convex optimization and regret minimization, both with stochastic and deterministic feedback. We will view these problems through the minimax lens and show how algorithms emerge as approximate dynamic programming solutions. Finally, we will review robust optimization and explore the interactions between robustness and regularization.
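To make the regret-minimization viewpoint concrete, here is a small sketch of online projected gradient descent against a sequence of quadratic losses, compared with the best fixed point in hindsight. The losses are synthetic and the code is only illustrative.

```python
import numpy as np

# Online projected gradient descent: at each round the loss f_t(x) = 0.5*||x - z_t||^2
# is revealed only after we commit to x_t; play over the unit Euclidean ball.
rng = np.random.default_rng(1)
T, d = 200, 5
Z = rng.normal(size=(T, d))            # synthetic loss centers
x = np.zeros(d)
cum_loss = 0.0
for t, z in enumerate(Z, start=1):
    cum_loss += 0.5 * np.sum((x - z) ** 2)
    x -= (1.0 / np.sqrt(t)) * (x - z)  # gradient step with eta_t ~ 1/sqrt(t)
    norm = np.linalg.norm(x)
    if norm > 1.0:                     # Euclidean projection back onto the unit ball
        x /= norm

x_star = Z.mean(axis=0)                # unconstrained minimizer of the total loss
if np.linalg.norm(x_star) > 1.0:
    x_star /= np.linalg.norm(x_star)   # project it into the feasible set
best = 0.5 * np.sum((Z - x_star) ** 2)
print(cum_loss - best)                 # regret; grows sublinearly in T
```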







In this tutorial we will cover the recent progress on lower bounds on the size of linear programs and semidefinite programs for combinatorial optimization problems. We will present the definition and motivation of extension complexity and the relation to communication complexity. Then we will discuss the lower bound of Fiorini et al for the TSP and correlation polytope. We will also review SDP extension complexity with related concept such as the SDP rank of matrices. Finally, we will discuss lower bounds based on Fourier-methods by CLRS and LRS for both the LP model and the SDP model.







Interior point methods are one of the key approaches to solving linear programming formulations as well as other convex programs. They give rise to algorithms that not only are the fastest ones known from asymptotic analysis point of view but also are often superior in practice.
This series of four lectures will present fundamentals of interior point methods as well as touch on the key ideas and concepts behind some of the recent developments. The topics we plan to cover include: (1) background on linear programming optimality conditions, the central path and its neighborhoods, and Newton's method; (2) complete analysis of a primal-dual path-following algorithm for linear programming and its relationship to a basic primal barrier method; (3) how to perturb central path to speed up convergence of an interior point method; and (4) a variant of interior point method that achieves running times that scales with the intrinsic dimension of the problem (rather than its possibly larger standard representation).
No previous experience with interior point methods is required.
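As a companion to the lectures, here is a bare-bones primal log-barrier method with damped Newton steps on a tiny LP (my sketch, not the lecturers' code); the primal-dual path-following methods covered in the series refine this basic scheme.

```python
import numpy as np

# Primal barrier method for  min c^T x  s.t.  A x <= b  on a toy LP.
# Newton steps on  f_t(x) = t * c^T x - sum_i log(b_i - a_i^T x),  t increased geometrically.
c = np.array([-1.0, -1.0])                        # i.e. maximize x1 + x2
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 1.5, 0.0, 0.0])
x = np.array([0.25, 0.25])                        # strictly feasible starting point
t = 1.0
for outer in range(10):
    for newton in range(20):
        s = b - A @ x                             # slacks, kept strictly positive
        grad = t * c + A.T @ (1.0 / s)
        hess = A.T @ np.diag(1.0 / s ** 2) @ A
        dx = np.linalg.solve(hess, -grad)
        step = 1.0                                # backtrack to stay strictly feasible
        while np.any(b - A @ (x + step * dx) <= 0):
            step *= 0.5
        x = x + step * dx
        if np.sqrt(dx @ hess @ dx) < 1e-8:        # small Newton decrement: centered
            break
    t *= 10.0                                     # tighten the barrier
print(x)   # approaches the analytic center of the optimal face x1 + x2 = 1.5, near (0.75, 0.75)
```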










Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Thursday, August 24, 2017

Towards a Deeper Understanding of Training Quantized Neural Networks / BitNet: Bit-Regularized Deep Neural Networks / Deep Binary Reconstruction for Cross-modal Hashing

There is a strong interest in the low bit side of the Force !




Towards a Deeper Understanding of Training Quantized Neural Networks by Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, Tom Goldstein
Training neural networks with coarsely quantized weights is a key step towards learning on embedded platforms that have limited computing resources, memory capacity, and power consumption. Numerous recent publications have studied methods for training quantized networks, but these studies have been purely experimental. In this work, we investigate the theory of training quantized neural networks by analyzing the convergence properties of some commonly used methods. Our main result shows that training algorithms that exploit high-precision representations have an important annealing property that purely quantized training methods lack, which explains many of the observed empirical differences between these types of algorithms. 
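The "high-precision representations" the authors refer to are the shadow-weight schemes used by BinaryConnect-style methods. Here is a minimal numpy sketch of one such training step on a toy logistic regression, purely to illustrate the kind of algorithm the paper analyzes (not its experiments):

```python
import numpy as np

# BinaryConnect-style step: keep real-valued shadow weights, quantize them to +/-1
# for the forward/backward pass, but apply the gradient to the shadow copy.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))
y = (X[:, 0] > 0).astype(float)                  # toy binary labels

w_real = rng.normal(scale=0.1, size=10)          # full-precision "shadow" weights
for step in range(200):
    w_q = np.sign(w_real)                        # quantized weights used by the model
    p = 1.0 / (1.0 + np.exp(-(X @ w_q)))         # logistic forward pass
    grad = X.T @ (p - y) / len(X)                # gradient w.r.t. the quantized weights
    w_real -= 0.1 * grad                         # straight-through: update the shadow copy
    w_real = np.clip(w_real, -1, 1)              # keep shadow weights in the quantization range
```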

We present a novel regularization scheme for training deep neural networks. The parameters of neural networks are usually unconstrained and have a dynamic range dispersed over the real line. Our key idea is to control the expressive power of the network by dynamically quantizing the range and set of values that the parameters can take. We formulate this idea using a novel end-to-end approach that regularizes the traditional classification loss function. Our regularizer is inspired by the Minimum Description Length principle. For each layer of the network, our approach optimizes a translation and scaling factor along with integer-valued parameters. We empirically compare BitNet to an equivalent unregularized model on the MNIST and CIFAR-10 datasets. We show that BitNet converges faster to a superior quality solution. Additionally, the resulting model is significantly smaller in size due to the use of integer parameters instead of floats.


With the increasing demand of massive multimodal data storage and organization, cross-modal retrieval based on hashing technique has drawn much attention nowadays. It takes the binary codes of one modality as the query to retrieve the relevant hashing codes of another modality. However, the existing binary constraint makes it difficult to find the optimal cross-modal hashing function. Most approaches choose to relax the constraint and perform thresholding strategy on the real-value representation instead of directly solving the original objective. In this paper, we first provide a concrete analysis about the effectiveness of multimodal networks in preserving the inter- and intra-modal consistency. Based on the analysis, we provide a so-called Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion. The superiority comes from a proposed simple but efficient activation function, named as Adaptive Tanh (ATanh). The ATanh function can adaptively learn the binary codes and be trained via back-propagation. Extensive experiments on three benchmark datasets demonstrate that DBRC outperforms several state-of-the-art methods in both image2text and text2image retrieval task.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Monday, August 21, 2017

The Solar Eclipse of 2017

Credit: NASA/JPL/Space Science Institute
Released: December 18, 2009 (PIA 11648)


The webcast for this coming eclipse will start in 15 minutes here: https://eclipse2017.nasa.gov/
The eclipse itself will be viewable in an hour.
The first telescope to check seems to be the one from Madras, Oregon (with 2.02 minutes of total darkness).

It's also going to be viewable from the International Space Station, woohoo !









Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

SMASH: One-Shot Model Architecture Search through HyperNetworks - implementation -

Today we have a video, a preprint and an implementation; what more could you ask for ?




Designing architectures for deep neural networks requires expert knowledge and substantial computation time. We propose a technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture. By comparing the relative validation performance of networks with HyperNet-generated weights, we can effectively search over a wide range of architectures at the cost of a single training run. To facilitate this search, we develop a flexible mechanism based on memory read-writes that allows us to define a wide range of network connectivity patterns, with ResNet, DenseNet, and FractalNet blocks as special cases. We validate our method (SMASH) on CIFAR-10 and CIFAR-100, STL-10, ModelNet10, and Imagenet32x32, achieving competitive performance with similarly-sized hand-designed networks. Our code is available at this https URL

An implementation of SMASH is here: https://github.com/ajbrock/SMASH
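To see the one-shot idea in miniature, here is a toy numpy sketch (my illustration, not the authors' code): a linear "hypernetwork" H maps an architecture code to the weights of a linear model, is trained while random codes are sampled, and the candidate codes are then ranked by their validation loss under the hypernet-generated weights, all within a single training run.

```python
import numpy as np

# Toy one-shot search: w = H @ a, where a is an architecture code and H is the hypernetwork.
rng = np.random.default_rng(0)
d, K = 20, 8
X, Xval = rng.normal(size=(200, d)), rng.normal(size=(100, d))
w_true = rng.normal(size=d)
y, yval = X @ w_true, Xval @ w_true

codes = np.eye(K)                                 # K candidate "architectures" (one-hot codes)
H = np.zeros((d, K))
for step in range(2000):
    a = codes[rng.integers(K)]                    # sample an architecture code
    w = H @ a                                     # hypernet-generated weights
    grad_w = X.T @ (X @ w - y) / len(X)
    H -= 0.05 * np.outer(grad_w, a)               # chain rule: dL/dH = (dL/dw) a^T

val = [np.mean((Xval @ (H @ a) - yval) ** 2) for a in codes]
print(int(np.argmin(val)))                        # pick the best-ranked code for retraining
```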

h/t Andrew, GitXiv



Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Friday, August 18, 2017

Learning Transferable Architectures for Scalable Image Recognition / The Cost of Everything and the Value of Nothing


Hardmaru mentioned the following preprint: 


to which David eventually replied (see the thread).

Here are the two papers:


Developing state-of-the-art image classification models often requires significant architecture engineering and tuning. In this paper, we attempt to reduce the amount of architecture engineering by using Neural Architecture Search to learn an architectural building block on a small dataset that can be transferred to a large dataset. This approach is similar to learning the structure of a recurrent cell within a recurrent network. In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.

The Cost of Everything and the Value of Nothing by David Moloney, CTO





Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Thursday, August 17, 2017

Job: Smart Grid Data Scientist, Neuchâtel, Switzerland

Rafael let me know of the following opportunity at this Swiss non-profit corporation:


Dear Igor, 
We have an opening at CSEM (Switzerland) that might be interesting for Nuit-Blanche readers. Can I ask you to post the following announcement in your blog?
http://www.csem.ch/Page.aspx?pid=31515&jobid=107129
Best regards,
Rafael
___________________________________________________
Rafael E. Carrillo, Ph.D.
R&D Engineer
Energy Systems
rafael.carrillo@csem.ch

Wednesday, August 16, 2017

On the Expressive Power of Deep Neural Networks

In a Twitter exchange with Mark, Surya mentioned four background papers from his group supporting his ICML presentation last week. Here they are:





We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows:
(1) The complexity of the computed function grows exponentially with depth.
(2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights.
(3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance.
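Trajectory length is easy to probe numerically. The sketch below (my illustration, not the authors' code) pushes a 1-D circle of inputs through a random tanh network and prints the arc length of its image after each layer; with weight variance in the chaotic regime the length grows rapidly with depth, which is the quantity the paper ties to expressivity.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, sigma_w = 100, 8, 2.0                  # sigma_w = 2 puts tanh nets in the chaotic regime
t = np.linspace(0, 2 * np.pi, 1000)
# embed a 1-D circle of inputs into the network's input space via a random map
h = np.stack([np.cos(t), np.sin(t)], axis=1) @ rng.normal(size=(2, width))

def arc_length(z):
    """Total length of the polyline traced by the rows of z."""
    return np.sum(np.linalg.norm(np.diff(z, axis=0), axis=1))

print(0, arc_length(h))
for layer in range(1, depth + 1):
    W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
    h = np.tanh(h @ W)
    print(layer, arc_length(h))                      # grows quickly with depth
```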


We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.


We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.




Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
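The plateau-then-transition dynamics are easy to reproduce in a two-layer linear network. Below is a small numpy sketch (illustrative, with an invented target map): gradient descent from small random weights picks up the strong singular modes first, so the error drops in stages even though the input-output map is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
S = np.diag([5.0, 2.0, 0.5] + [0.0] * (d - 3))   # three modes with very different strengths
target = U @ S @ V.T                             # input-output map to learn (whitened inputs)

W1 = 0.01 * rng.normal(size=(d, d))              # small random initial conditions
W2 = 0.01 * rng.normal(size=(d, d))
lr = 0.02
for step in range(4000):
    E = target - W2 @ W1                         # error in the composite map
    W1 += lr * (W2.T @ E)                        # gradient descent on 0.5*||target - W2 W1||^2
    W2 += lr * (E @ W1.T)
    if step % 500 == 0:
        print(step, np.linalg.norm(E))           # error falls in stages, one per singular mode
```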


Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !