Thursday, May 19, 2016

Maximal Sparsity with Deep Networks?

 
 
Back on the longest night of 2015 (see The Great Convergence in Action: Learning optimal nonlinearities for iterative thresholding algorithms), I asked whether Winter was coming to applied mathematics, as one can witness a convergence of people tuning the parameters of iterative solvers with Deep Learning techniques. We have also mentioned the following parallel here on Nuit Blanche for a while. Anyway, it is put in better words here:
 
...Although seemingly unrelated at first glance, the layers of a deep neural network (DNN) can be viewed as iterations of some algorithm that have been unfolded into a network structure (Gregor and LeCun, 2010; Hershey et al., 2014). In particular, iterative thresholding approaches such as those mentioned above typically involve an update rule comprised of a fixed, linear filter followed by a non-linear activation function that promotes sparsity. Consequently, algorithm execution can be interpreted as passing an input through an extremely deep network with constant layer weights (dependent on) at every layer...
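
To make that parallel concrete, here is a minimal NumPy sketch (mine, not from the paper) of ISTA written so that each iteration looks like one network layer: a fixed linear map followed by a soft-thresholding nonlinearity. The dictionary A, the measurement y, the regularization weight and the step size below are all illustrative assumptions.

```python
# Minimal sketch: ISTA for min_x 0.5*||y - A x||^2 + lam*||x||_1, written so that
# each iteration is one "layer": a fixed linear map followed by soft thresholding.
import numpy as np

def soft_threshold(v, tau):
    """Elementwise shrinkage: the sparsity-promoting 'activation function'."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_as_unrolled_net(A, y, lam=0.1, n_layers=100):
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    W = np.eye(n) - A.T @ A / L          # fixed "layer weight" (depends only on A)
    b = A.T @ y / L                      # fixed "layer bias" (depends on A and y)
    x = np.zeros(n)
    for _ in range(n_layers):            # every layer shares the same W and b
        x = soft_threshold(W @ x + b, lam / L)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 100)) / np.sqrt(50)
    x_true = np.zeros(100)
    x_true[rng.choice(100, 5, replace=False)] = rng.standard_normal(5)
    y = A @ x_true
    x_hat = ista_as_unrolled_net(A, y)
    print("recovered support:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
```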
Here is the paper; my remarks are below.

Maximal Sparsity with Deep Networks? by Bo Xin, Yizhou Wang, Wen Gao, David Wipf
The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\ell_0$-norm representations in regimes where existing methods fail. The resulting system is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
 It's fascinating. Three items:
  • The paper ought to mention similar efforts such as the one mentioned earlier here (Learning optimal nonlinearities for iterative thresholding algorithms). 
  • While ISTA and related algorithms are interesting, AMP solvers, which have a similar structure, ought to be investigated as well, since their number of iterations is expected to be much smaller than that of traditional FISTA et al. (see the sketch after this list).
  • The two earlier papers listed below focused on sparse coding and NMF. It would seem to me that much more insight could be gained from studying FISTA and related algorithms for compressive sensing, but I can imagine I have a bias on that matter.
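
As for the AMP remark above, here is a hedged sketch (mine, not from any of the papers) of the AMP recursion with soft thresholding: each iteration again has the "linear filter + thresholding" shape, plus an Onsager correction term on the residual. The threshold rule alpha*||z||/sqrt(m) is just one common heuristic, chosen here for illustration.

```python
# Sketch of AMP with soft thresholding: same layer shape as ISTA, plus the
# Onsager correction on the residual, which is what speeds up convergence.
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def amp(A, y, n_iters=30, alpha=1.5):
    m, n = A.shape
    x = np.zeros(n)
    z = y.copy()
    for _ in range(n_iters):
        theta = alpha * np.linalg.norm(z) / np.sqrt(m)   # iteration-dependent threshold
        x_new = soft_threshold(x + A.T @ z, theta)       # matched filter + thresholding
        onsager = (np.count_nonzero(x_new) / m) * z      # Onsager correction term
        z = y - A @ x_new + onsager
        x = x_new
    return x
```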
 
Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures by John R. Hershey, Jonathan Le Roux, Felix Weninger

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.
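
To illustrate the unfolding/untying idea in this abstract, here is a small NumPy sketch of my own (shapes and initialization are assumptions): the multiplicative update for the NMF activations is repeated for a fixed number of layers, and each layer holds its own untied dictionary that could subsequently be trained discriminatively.

```python
# Sketch of deep unfolding for NMF: one multiplicative update per layer,
# with an untied dictionary W_k at each layer instead of a single shared W.
import numpy as np

def unfolded_nmf_layers(V, W_layers, eps=1e-8):
    """V: (n_features, n_frames) nonnegative data; W_layers: per-layer dictionaries."""
    n_components = W_layers[0].shape[1]
    H = np.ones((n_components, V.shape[1]))             # flat nonnegative initialization
    for W in W_layers:                                  # one multiplicative update per layer
        H = H * (W.T @ V) / (W.T @ (W @ H) + eps)
    return H

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    V = rng.random((40, 20))
    W_layers = [rng.random((40, 5)) for _ in range(10)]  # untied weights, one per layer
    H = unfolded_nmf_layers(V, W_layers)
    print(H.shape)  # (5, 20)
```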
 
 

Learning Fast Approximations of Sparse Coding by Karol Gregor, Yann LeCun

In Sparse Coding (SC), input vectors are reconstructed using a sparse linear combination of basis vectors. SC has become a popular method for extracting features from data. For a given input, SC minimizes a quadratic reconstruction error with an L1 penalty term on the code. The process is often too slow for applications such as real-time pattern recognition. We proposed two versions of a very fast algorithm that produces approximate estimates of the sparse code that can be used to compute good visual features, or to initialize exact iterative algorithms. The main idea is to train a non-linear, feed-forward predictor with a specific architecture and a fixed depth to produce the best possible approximation of the sparse code. A version of the method, which can be seen as a trainable version of Li and Osher’s coordinate descent method, is shown to produce approximate solutions with 10 times less computation than Li and Osher’s for the same approximation error. Unlike previous proposals for sparse code predictors, the system allows a kind of approximate “explaining away” to take place during inference. The resulting predictor is differentiable and can be included into globally trained recognition systems.
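
For completeness, here is a minimal forward-pass sketch of the LISTA idea described above: a fixed-depth network with the same shape as the ISTA iteration shown earlier, except that the encoder W_e, the lateral matrix S and the per-layer thresholds are free parameters to be learned from data. The training loop is omitted, and any initialization of these parameters is purely illustrative.

```python
# Forward pass of a LISTA-style predictor: ISTA's structure with learned weights
# and a small, fixed number of layers (training of W_e, S, thetas not shown).
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lista_forward(y, W_e, S, thetas):
    """y: input; W_e: (n, m) learned encoder; S: (n, n) learned lateral matrix;
    thetas: per-layer thresholds (one entry per layer)."""
    b = W_e @ y
    x = soft_threshold(b, thetas[0])
    for theta in thetas[1:]:                  # fixed, small depth
        x = soft_threshold(b + S @ x, theta)  # learned weights replace hand-crafted ones
    return x
```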
 
 
