On top of yesterday's paper, here is the list of accepted papers at NIPS. A few are already on the web or on arXiv; here is a subsample that caught our attention (most of which have been featured here before).
Preconditioned Spectral Descent for Deep Learning by Carlson, David; Collins, Edo; Hsieh, Ya-Ping; Carin, Lawrence; Cevher, Volkan
(supplemental material is here.)
Optimizing objective functions in deep learning is a notoriously difficult task. Classical algorithms, including variants of gradient descent and quasi-Newton methods, can be interpreted as approximations to the objective function in Euclidean norms. However, it has recently been shown that approximations via non-Euclidean norms can significantly improve optimization performance. In this paper, we provide evidence that neither of the above two methods entirely captures the "geometry" of the objective functions in deep learning, while a combination of the two does. We theoretically formalize our arguments and derive novel second-order non-Euclidean algorithms. We implement our algorithms on Restricted Boltzmann Machines, Feedforward Neural Nets, and Convolutional Neural Nets, and demonstrate improvements in both computational efficiency and model fit.
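As a rough illustration of what a non-Euclidean (spectral-norm) descent step looks like, here is a minimal numpy sketch of the "sharp operator" idea: the gradient is replaced by the orthogonal factor of its SVD, scaled by its nuclear norm. This is only a schematic reading of the abstract; the paper's preconditioning and the RBM/CNN specifics are not reproduced, and the function names and step size are ours.

```python
import numpy as np

def spectral_descent_step(W, grad_W, lr=0.1):
    # Steepest descent direction under the spectral (Schatten-infinity) norm:
    # replace the gradient by U V^T from its SVD, scaled by its nuclear norm.
    U, s, Vt = np.linalg.svd(grad_W, full_matrices=False)
    sharp = s.sum() * (U @ Vt)
    return W - lr * sharp

W = np.random.randn(64, 32)
G = np.random.randn(64, 32)        # stand-in for a backpropagated gradient
W_new = spectral_descent_step(W, G)
```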
Tensorizing Neural Networks (presentation slides) by Alexander Novikov, Dmitry Podoprihin, Anton Osokin, Dmitry Vetrov,
Sum-of-Squares Lower Bounds for Sparse PCA
Tengyu Ma, Avi Wigderson
This paper establishes a statistical versus computational trade-off for solving a basic high-dimensional machine learning problem via a basic convex relaxation method. Specifically, we consider the Sparse Principal Component Analysis (Sparse PCA) problem, and the family of Sum-of-Squares (SoS, aka Lasserre/Parrilo) convex relaxations. It was well known that in large dimension $p$, a planted $k$-sparse unit vector can in principle be detected using only $n \approx k\log p$ (Gaussian or Bernoulli) samples, but all efficient (polynomial time) algorithms known require $n \approx k^2\log p$ samples. It was also known that this quadratic gap cannot be improved by the most basic semi-definite (SDP, aka spectral) relaxation, equivalent to a degree-2 SoS algorithm. Here we prove that also degree-4 SoS algorithms cannot improve this quadratic gap. This average-case lower bound adds to the small collection of hardness results in machine learning for this powerful family of convex relaxation algorithms. Moreover, our design of moments (or "pseudo-expectations") for this lower bound is quite different from previous lower bounds. Establishing lower bounds for higher degree SoS algorithms remains a challenging problem.
Natural Neural Networks
Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu
We introduce Natural Neural Networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve conditioning of the Fisher matrix. In particular, we show a specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network. Such networks can be trained efficiently via the proposed Projected Natural Gradient Descent algorithm (PRONG), which amortizes the cost of these reparametrizations over many parameter updates and is closely related to the Mirror Descent online learning algorithm. We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset.
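To give a flavor of the whitening reparametrization at the heart of PRONG, here is a small sketch that computes a ZCA whitening transform from a batch of layer inputs. The actual algorithm amortizes such reparametrizations over many updates and preserves the network's feed-forward computation, neither of which is shown here; the helper names and `eps` are ours.

```python
import numpy as np

def whitening_matrix(H, eps=1e-5):
    """H: (batch, features) activations feeding a layer."""
    mu = H.mean(axis=0)
    cov = np.cov(H - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # ZCA whitening: rotate, rescale by 1/sqrt(eigenvalue), rotate back
    return mu, evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T

H = np.random.randn(256, 50) @ np.random.randn(50, 50)  # correlated activations
mu, U = whitening_matrix(H)
H_white = (H - mu) @ U        # approximately identity covariance
```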
Approximating Sparse PCA from Incomplete Data
Abhisek Kundu, Petros Drineas, Malik Magdon-Ismail
We study how well one can recover sparse principal components of a data matrix using a sketch formed from a few of its elements. We show that for a wide class of optimization problems, if the sketch is close (in the spectral norm) to the original data matrix, then one can recover a near optimal solution to the optimization problem by using the sketch. In particular, we use this approach to obtain sparse principal components and show that for $m$ data points in $n$ dimensions, $O(\epsilon^{-2}\tilde k\max\{m,n\})$ elements gives an $\epsilon$-additive approximation to the sparse PCA problem ($\tilde k$ is the stable rank of the data matrix). We demonstrate our algorithms extensively on image, text, biological and financial data. The results show that not only are we able to recover the sparse PCAs from the incomplete data, but by using our sparse sketch, the running time drops by a factor of five or more.
Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy
Marylou Gabrié, Eric W. Tramel, Florent Krzakala
Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.
Robust Regression via Hard Thresholding
Kush Bhatia, Prateek Jain, Purushottam Kar
We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. More specifically, for a data matrix $X \in \mathbb{R}^{p \times n}$ and an underlying model $w^*$, the response vector is generated as $y = X^T w^* + b$, where $b \in \mathbb{R}^n$ is the corruption vector supported over at most $C \cdot n$ coordinates. Existing exact recovery results for RLSR focus solely on L1-penalty based convex formulations and impose relatively strict model assumptions such as requiring the corruptions $b$ to be selected independently of $X$.
In this work, we study a simple hard-thresholding algorithm called TORRENT which, under mild conditions on $X$, can recover $w^*$ exactly even if $b$ corrupts the response variables in an adversarial manner, i.e., both the support and entries of $b$ are selected adversarially after observing $X$ and $w^*$. Our results hold under deterministic assumptions which are satisfied if $X$ is sampled from any sub-Gaussian distribution. Finally, unlike existing results that apply only to a fixed $w^*$ generated independently of $X$, our results are universal and hold for any $w^* \in \mathbb{R}^p$.
Next, we propose gradient descent-based extensions of TORRENT that can scale efficiently to large scale problems, such as high dimensional sparse recovery, and prove similar recovery guarantees for these extensions. Empirically we find that TORRENT, and more so its extensions, offer significantly faster recovery than the state-of-the-art L1 solvers. For instance, even on moderate-sized datasets (with $p = 50K$) with around 40% corrupted responses, a variant of our proposed method called TORRENT-HYB is more than 20x faster than the best L1 solver.
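A bare-bones version of the hard-thresholding idea described above can be sketched in a few lines: alternately fit least squares on the points currently believed to be clean, and re-estimate that set from the residuals. This is only a schematic of the iteration, assuming the number of corruptions is known; the paper's TORRENT variants and their guarantees are more refined.

```python
import numpy as np

def torrent_like(X, y, n_clean, iters=20):
    """X: (n, p), y: (n,), n_clean: assumed number of uncorrupted responses."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = np.abs(y - X @ w)
        S = np.argsort(residuals)[:n_clean]       # hard threshold: keep smallest residuals
        w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
    return w

n, p = 500, 20
X = np.random.randn(n, p); w_star = np.random.randn(p)
y = X @ w_star
y[:100] += 10 * np.random.randn(100)              # corrupted responses
w_hat = torrent_like(X, y, n_clean=n - 100)
```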
Sparse PCA via Bipartite Matchings
Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis
We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. These components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but as we show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low order polynomial in the ambient dimension of the input data matrix, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtain polynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm on real data-sets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.
Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems
We consider the fundamental problem of solving quadratic systems of equations in $n$ variables, where $y_i = |\langle a_i, x\rangle|^2$, $i = 1, \ldots, m$, and $x \in \mathbb{R}^n$ is unknown. We propose a novel method, which starting with an initial guess computed by means of a spectral method, proceeds by minimizing a nonconvex functional as in the Wirtinger flow approach. There are several key distinguishing features, most notably, a distinct objective functional and novel update rules, which operate in an adaptive fashion and drop terms bearing too much influence on the search direction. These careful selection rules provide a tighter initial guess, better descent directions, and thus enhanced practical performance. On the theoretical side, we prove that for certain unstructured models of quadratic systems, our algorithms return the correct solution in linear time, i.e. in time proportional to reading the data $\{a_i\}$ and $\{y_i\}$, as soon as the ratio $m/n$ between the number of equations and unknowns exceeds a fixed numerical constant. We extend the theory to deal with noisy systems in which we only have $y_i \approx |\langle a_i, x\rangle|^2$ and prove that our algorithms achieve a statistical accuracy, which is nearly un-improvable. We complement our theoretical study with numerical examples showing that solving random quadratic systems is both computationally and statistically not much harder than solving linear systems of the same size---hence the title of this paper. For instance, we demonstrate empirically that the computational cost of our algorithm is about four times that of solving a least-squares problem of the same size.
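The overall recipe, a spectral initialization followed by iterative refinement of a nonconvex objective, can be sketched as follows. The truncation rules that give the paper its improved guarantees are deliberately omitted, so this is closer to a plain Wirtinger-flow baseline than to the proposed algorithm; step size and iteration count are illustrative.

```python
import numpy as np

def solve_quadratic_system(A, y, iters=500, mu=0.2):
    m, n = A.shape
    # Spectral initialization: top eigenvector of (1/m) * sum_i y_i a_i a_i^T
    Y = (A.T * y) @ A / m
    _, evecs = np.linalg.eigh(Y)
    z = evecs[:, -1] * np.sqrt(y.mean())
    step = mu / (z @ z)
    for _ in range(iters):
        r = (A @ z) ** 2 - y                      # residuals of the quadratic equations
        z = z - step * (A.T @ (r * (A @ z))) / m  # gradient step on the quartic objective
    return z

n, m = 100, 600
x = np.random.randn(n); A = np.random.randn(m, n)
y = (A @ x) ** 2
x_hat = solve_quadratic_system(A, y)              # recovers x up to a global sign
```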
Locally Non-linear Embeddings for Extreme Multi-label Learning
Kush Bhatia, Himanshu Jain, Purushottam Kar, Prateek Jain, Manik Varma
The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as the low rank assumption is violated in most real world applications.
This paper develops the X-One classifier to address both limitations. The main technical contribution in X-One is a formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows X-One to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors.
We conducted extensive experiments on several real-world as well as benchmark data sets and compared our method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that X-One can make significantly more accurate predictions than the state-of-the-art methods including both embeddings (by as much as 35%) as well as trees (by as much as 6%). X-One can also scale efficiently to data sets with a million labels which are beyond the pale of leading embedding methods.
Efficient Compressive Phase Retrieval with Constrained Sensing Vectors
Sohail Bahmani, Justin Romberg
We propose a new approach to the problem of compressive phase retrieval in which the goal is to reconstruct a sparse vector from the magnitude of a number of its linear measurements. The proposed framework relies on constrained sensing vectors and a two-stage reconstruction method that consists of two standard convex optimization programs that are solved sequentially.
Various methods for compressive phase retrieval have been proposed in recent years, but none have come with strong efficiency and computational complexity guarantees. The main obstacle has been that there is no straightforward convex relaxation for the type of structure in the target. Given a set of underdetermined measurements, there is a standard framework for recovering a sparse matrix, and a standard framework for recovering a low-rank matrix. However, a general, efficient method for recovering a matrix which is jointly sparse and low-rank has remained elusive.
In this paper, we show that if the sensing vectors are chosen at random from an incoherent subspace, then the low-rank and sparse structures of the target signal can be effectively decoupled. We show that a recovery algorithm that consists of a low-rank recovery stage followed by a sparse recovery stage will produce an accurate estimate of the target when the number of measurements is $O(k\log\frac{d}{k})$, where $k$ and $d$ denote the sparsity level and the dimension of the input signal. We also evaluate the algorithm through numerical simulation.
A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Qinqing Zheng, John Lafferty
We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex objective for the rank minimization problem and a closely related family of semidefinite programs. With $O(r^2\kappa^2 n\log n)$ random measurements of a positive semidefinite $n \times n$ matrix of rank $r$ and condition number $\kappa$, our method is guaranteed to converge linearly to the global optimum.
On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors
Andrea Montanari, Daniel Reichman, Ofer Zeitouni
We consider the following detection problem: given a realization of a symmetric matrix $X$ of dimension $n$, distinguish between the hypothesis that all upper triangular variables are i.i.d. Gaussian variables with mean 0 and variance $1$ and the hypothesis where $X$ is the sum of such a matrix and an independent rank-one perturbation.
This setup applies to the situation where, under the alternative, there is a planted principal submatrix $B$ of size $L$ for which all upper triangular variables are i.i.d. Gaussians with mean $1$ and variance $1$, whereas all other upper triangular elements of $X$ not in $B$ are i.i.d. Gaussian variables with mean 0 and variance $1$. We refer to this as the `Gaussian hidden clique problem.'
When $L = (1+\epsilon)\sqrt{n}$ ($\epsilon > 0$), it is possible to solve this detection problem with probability $1 - o_n(1)$ by computing the spectrum of $X$ and considering the largest eigenvalue of $X$. We prove that this condition is tight in the following sense: when $L < (1-\epsilon)\sqrt{n}$, no algorithm that examines only the eigenvalues of $X$ can detect the existence of a hidden Gaussian clique, with error probability vanishing as $n \to \infty$.
We prove this result as an immediate consequence of a more general result on rank-one perturbations of $k$-dimensional Gaussian tensors. In this context we establish a lower bound on the critical signal-to-noise ratio below which a rank-one signal cannot be detected.
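The spectral test discussed in the abstract is easy to state concretely: declare a planted structure present when the top eigenvalue of the observed matrix exceeds the roughly $2\sqrt{n}$ bulk edge by a margin. The constants and the block size in this toy example are illustrative only.

```python
import numpy as np

def spectral_detect(X, margin=0.1):
    n = X.shape[0]
    lam_max = np.linalg.eigvalsh(X)[-1]            # largest eigenvalue
    return lam_max > 2.0 * np.sqrt(n) * (1.0 + margin)

n = 2000
L = int(3 * np.sqrt(n))                            # planted block well above the sqrt(n) threshold
X = np.random.randn(n, n); X = (X + X.T) / np.sqrt(2)   # symmetric Gaussian noise
X[:L, :L] += 1.0                                   # mean-1 principal submatrix under the alternative
print(spectral_detect(X))                          # True with high probability
```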
Optimized Projections for Compressed Sensing via Direct Mutual Coherence Minimization
Zhouchen Lin, Canyi Lu, Huan Li
Compressed Sensing (CS) is a novel technique for simultaneous signal sampling and compression based on the existence of a sparse representation of the signal and a projected dictionary $PD$, where $P \in \mathbb{R}^{m\times d}$ is the projection matrix and $D \in \mathbb{R}^{d\times n}$ is the dictionary. To exactly recover the signal with a small number of measurements $m$, the projected dictionary $PD$ is expected to be of low mutual coherence. Several previous methods attempt to find a projection $P$ such that the mutual coherence of $PD$ is as low as possible. However, they do not minimize the mutual coherence directly, and thus their methods are far from optimal. Also, the solvers they use lack convergence guarantees, and thus there is no guarantee on the quality of their obtained solutions. This work aims to address these issues. We propose to find an optimal projection by minimizing the mutual coherence of $PD$ directly. This leads to a nonconvex, nonsmooth minimization problem. We then approximate it by smoothing and solve it by alternating minimization. We further prove the convergence of our algorithm. To the best of our knowledge, this is the first work which directly minimizes the mutual coherence of the projected dictionary with a convergence guarantee. Numerical experiments demonstrate that the proposed method can recover sparse signals better than existing methods.
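The quantity being minimized is simple to compute: the mutual coherence of the projected dictionary $PD$ is the largest absolute inner product between distinct normalized columns. A small sketch of just that computation follows; the smoothing and alternating minimization scheme itself is not shown.

```python
import numpy as np

def mutual_coherence(M):
    G = M / np.linalg.norm(M, axis=0, keepdims=True)   # column-normalize
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)                        # ignore self inner products
    return gram.max()

d, n, m = 64, 128, 20
D = np.random.randn(d, n)                              # dictionary
P = np.random.randn(m, d)                              # candidate projection
print(mutual_coherence(P @ D))
```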
Completing Low-Rank Matrices with Corrupted Samples from Few Coefficients in General Basis
Hongyang Zhang, Zhouchen Lin, Chao Zhang
Subspace recovery from corrupted and missing data is crucial for various applications in signal processing and information theory. To complete missing values and detect column corruptions, existing robust Matrix Completion (MC) methods mostly concentrate on recovering a low-rank matrix from few corrupted coefficients w.r.t. the standard basis, which, however, does not apply to more general bases, e.g., the Fourier basis. In this paper, we prove that the range space of an $m\times n$ matrix with rank $r$ can be exactly recovered from few coefficients w.r.t. a general basis, even when the rank $r$ and the number of corrupted samples are both as high as $O(\min\{m,n\}/\log^3 (m+n))$. Thus our results cover previous work as special cases, and robust MC can recover the intrinsic matrix with a higher rank. Moreover, we suggest a universal choice of the regularization parameter, which is $\lambda=1/\sqrt{\log n}$. By our $\ell_{2,1}$ filtering algorithm, which has theoretical guarantees, we can further reduce the computational cost of our model. As an application, we also find that the solutions to extended robust Low-Rank Representation and to our extended robust MC are mutually expressible, so both our theory and algorithm can be immediately applied to the subspace clustering problem with missing values. Experiments verify our theories.
Optimal Rates for Random Fourier Features
Bharath K. Sriperumbudur, Zoltan Szabo
Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide the first detailed theoretical analysis of the approximation quality of RFFs by establishing optimal (in terms of the RFF dimension) performance guarantees in uniform and $L^r$ ($1 \le r < \infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality.
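For reference, here is the standard random Fourier feature construction for the Gaussian (RBF) kernel, the object whose approximation quality the paper analyzes; $z(x)^\top z(y)$ approximates $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$, and the feature dimension `D` below is the quantity the rates are stated in.

```python
import numpy as np

def rff_features(X, D=500, sigma=1.0, rng=np.random.default_rng(0)):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # draws from the RBF kernel's spectral measure
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.randn(10, 5)
Z = rff_features(X)
approx_kernel = Z @ Z.T                               # approximates the exact Gaussian kernel matrix
```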
Learning with Group Invariant Features: A Kernel Perspective
Youssef Mroueh, Stephen Voinea, Tomaso Poggio
We analyze in this paper a random feature map based on a recently introduced theory of invariance, I-theory. More specifically, a group invariant signal signature is obtained through cumulative distributions of group transformed random projections. Our analysis bridges invariant feature learning with kernel methods, as we show that this feature map defines an expected Haar integration kernel that is invariant to the specified group action. We show how this non-linear random feature map approximates this group invariant kernel uniformly on a set of $N$ points. Moreover, we show that it defines a function space that is dense in the equivalent Invariant Reproducing Kernel Hilbert Space. Finally, we quantify error rates of the convergence of the empirical risk minimization, as well as the reduction in
Optimal linear estimation under unknown nonlinear transform
Xinyang Yi, Zhaoran Wang, Constantine Caramanis, Han Liu
Linear regression studies the problem of estimating a model parameter $\beta^* \in \mathbb{R}^p$ from $n$ observations $\{(y_i, x_i)\}_{i=1}^n$ from the linear model $y_i = \langle x_i, \beta^*\rangle + \epsilon_i$. We consider a significant generalization in which the relationship between $\langle x_i, \beta^*\rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear, noninvertible, as well as unknown. This model is known as the single-index model in statistics, and, among other things, it represents a significant generalization of one-bit compressed sensing. We propose a novel spectral-based estimation procedure and show that we can recover $\beta^*$ in settings (i.e., classes of link function $f$) where previous algorithms fail. In general, our algorithm requires only very mild restrictions on the (unknown) functional relationship between $y_i$ and $\langle x_i, \beta^*\rangle$. We also consider the high dimensional setting where $\beta^*$ is sparse, and introduce a two-stage nonconvex framework that addresses estimation challenges in high dimensional regimes where $p \gg n$. For a broad class of link functions between $\langle x_i, \beta^*\rangle$ and $y_i$, we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes.
Deeply Learning the Messages in Message Passing Inference
Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel
Deep structured output learning shows great promise in tasks like semantic image segmentation. We proffer a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to estimate the messages in message passing inference for structured prediction with Conditional Random Fields (CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise when performing structured learning for a CRF with CNN potentials it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension for message estimation is the same as the number of classes, in contrast to the network output for general CNN potential functions in CRFs, which is exponential in the order of the potentials. Hence CNN message learning has fewer network parameters and is more scalable for cases where a large number of classes is involved. We apply our method to semantic image segmentation on the PASCAL VOC 2012 dataset. We achieve an intersection-over-union score of 73.4 on its test set, which is the best reported result for methods using the VOC training images alone. This impressive performance demonstrates the effectiveness and usefulness of our CNN message learning method.
Fast Label Embeddings via Randomized Linear Algebra
Paul Mineiro, Nikos Karampatziakis
Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.
Fast Randomized Kernel Methods With Statistical Guarantees
Ahmed El Alaoui, Michael W. Mahoney
One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statistical leverage scores} to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \emph{effective dimensionality} of the problem. This quantity is often much smaller than previous bounds that depend on the \emph{maximal degrees of freedom}. Our main algorithmic result is to present a fast algorithm to compute approximations to these scores. This algorithm runs in time that is linear in the number of samples---more precisely, the running time is $O(np^2)$, where the parameter $p$ depends only on the trace of the kernel matrix and the regularization parameter---and it can be applied to the matrix of feature vectors, without having to form the full kernel matrix. This is obtained via a variant of length-squared sampling that we adapt to the kernel setting in a way that is of independent interest. Lastly, we provide empirical results illustrating our theory, and we discuss how this new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem.
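The leverage scores in question have a closed form once the full kernel matrix is available; the paper's contribution is a fast approximation, which is not reproduced here. A small sketch of the exact ridge leverage scores and the induced column-sampling distribution (function names and constants are ours):

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Exact ridge leverage scores: diag(K (K + n*lam*I)^{-1})."""
    n = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + n * lam * np.eye(n)))

X = np.random.randn(200, 5)
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))   # Gaussian kernel matrix
scores = ridge_leverage_scores(K, lam=1e-2)
probs = scores / scores.sum()
cols = np.random.choice(len(probs), size=50, replace=False, p=probs)  # columns of the sketch
```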
Randomized Dual Coordinate Ascent with Arbitrary Sampling
Zheng Qu, Peter Richtárik, Tong Zhang
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.
Fast and Guaranteed Tensor Decomposition via Sketching
Yining Wang, Hsiao-Yu Tung, Alexander Smola, Animashree Anandkumar
Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in statistical learning of latent variable models and in data mining. In this paper, we propose fast and randomized tensor CP decomposition algorithms based on sketching. We build on the idea of count sketches, but introduce many novel ideas which are unique to tensors. We develop novel methods for randomized computation of tensor contractions via FFTs, without explicitly forming the tensors. Such tensor contractions are encountered in decomposition methods such as tensor power iterations and alternating least squares. We also design novel colliding hashes for symmetric tensors to further save time in computing the sketches. We then combine these sketching ideas with existing whitening and tensor power iterative techniques to obtain the fastest algorithm on both sparse and dense tensors. The quality of approximation under our method does not depend on properties such as sparsity, uniformity of elements, etc. We apply the method for topic modeling and obtain competitive results.
Learning both Weights and Connections for Efficient Neural Networks
Song Han, Jeff Pool, John Tran, William J. Dally
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG16 found that the network as a whole can be reduced by 13x, again with no loss of accuracy.
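The prune step in the three-step method described above amounts to magnitude thresholding with a mask that is respected during retraining. A minimal sketch, with an illustrative keep fraction:

```python
import numpy as np

def prune_by_magnitude(W, keep_fraction=0.1):
    # Zero out the smallest-magnitude weights; keep a mask for retraining.
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(256, 512)                  # a trained layer's weight matrix
W_pruned, mask = prune_by_magnitude(W, keep_fraction=0.1)
# During retraining, gradients would be multiplied by `mask` so pruned weights stay zero.
```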
Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation
Alaa Saade, Florent Krzakala, Lenka Zdeborová
The completion of low rank matrices from few entries is a task with many practical applications. We consider here two aspects of this problem: detectability, i.e. the ability to estimate the rank $r$ reliably from the fewest possible random entries, and performance in achieving small reconstruction error. We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion with the Bethe Hessian). The rank is estimated as the number of negative eigenvalues of the Bethe Hessian matrix, and the corresponding eigenvectors are used as initial condition for the minimization of the discrepancy between the estimated matrix and the revealed entries. We analyze the performance in a random matrix setting using results from the statistical mechanics of the Hopfield neural network, and show in particular that MaCBetH efficiently detects the rank $r$ of a large $n \times m$ matrix from $C(r)r\sqrt{nm}$ entries, where $C(r)$ is a constant close to $1$. We also evaluate the corresponding root-mean-square error empirically and show that MaCBetH compares favorably to other existing approaches.
Beyond Convexity: Stochastic Quasi-Convex Optimization
Elad Hazan, Kfir Y. Levy, Shai Shalev-Shwartz
Stochastic convex optimization is a basic and well studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.
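The update itself is a one-liner: move along the direction of the minibatch gradient rather than the gradient itself. A toy sketch on a quasi-convex objective, with our own illustrative step size, batch size, and test function:

```python
import numpy as np

def sngd(grad_fn, x0, lr=0.1, batch=128, iters=1000, eps=1e-12):
    x = x0.copy()
    for _ in range(iters):
        g = grad_fn(x, batch)                       # minibatch gradient estimate
        x = x - lr * g / (np.linalg.norm(g) + eps)  # use only the direction of g
    return x

# Toy quasi-convex objective f(x) = sqrt(||x - 1||), with noisy gradients.
def grad_fn(x, batch):
    d = x - 1.0
    g = d / (2 * np.maximum(np.linalg.norm(d), 1e-8) ** 1.5)
    return g + np.random.randn(*x.shape) / np.sqrt(batch)

x_hat = sngd(grad_fn, x0=np.full(10, 5.0))          # approaches the all-ones minimizer
```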
Training Very Deep Networks
Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
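A single highway layer is compact enough to write out: a transform gate T interpolates between a nonlinear transform H(x) and the unchanged input. The negative gate bias, which initially favors carrying the input through, follows the spirit of the paper; the shapes and initial scales here are illustrative stubs.

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    H = np.tanh(x @ W_h + b_h)                       # candidate transform
    T = 1.0 / (1.0 + np.exp(-(x @ W_t + b_t)))       # transform gate in (0, 1)
    return T * H + (1.0 - T) * x                     # gated mix of transform and carry

d = 32
x = np.random.randn(8, d)
W_h, W_t = 0.1 * np.random.randn(d, d), 0.1 * np.random.randn(d, d)
b_h, b_t = np.zeros(d), np.full(d, -2.0)             # negative gate bias favors carrying the input
y = highway_layer(x, W_h, b_h, W_t, b_t)
```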
Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
Spectral Representations for Convolutional Neural Networks
Oren Rippel, Jasper Snoek, Ryan P. Adams
Discrete Fourier transforms provide a significant speedup in the computation of convolutions in deep learning. In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs).
We employ spectral representations to introduce a number of innovations to CNN design. First, we propose spectral pooling, which performs dimensionality reduction by truncating the representation in the frequency domain. This approach preserves considerably more information per parameter than other pooling strategies and enables flexibility in the choice of pooling output dimensionality. This representation also enables a new form of stochastic regularization by randomized modification of resolution. We show that these methods achieve competitive results on classification and approximation tasks, without using any dropout or max-pooling.
Finally, we demonstrate the effectiveness of complex-coefficient spectral parameterization of convolutional filters. While this leaves the underlying model unchanged, it results in a representation that greatly facilitates optimization. We observe on a variety of popular CNN configurations that this leads to significantly faster convergence during training.
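Of the ideas above, spectral pooling is the easiest to sketch: move to the frequency domain, keep only the central low-frequency block, and transform back. The rescaling at the end is a crude way to keep intensities comparable and is our own simplification, not the paper's exact treatment.

```python
import numpy as np

def spectral_pool(img, out_h, out_w):
    F = np.fft.fftshift(np.fft.fft2(img))             # low frequencies moved to the center
    h, w = F.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    F_crop = F[top:top + out_h, left:left + out_w]    # truncate high frequencies
    pooled = np.real(np.fft.ifft2(np.fft.ifftshift(F_crop)))
    return pooled * (out_h * out_w) / (h * w)         # rough amplitude correction

img = np.random.rand(32, 32)
pooled = spectral_pool(img, 16, 16)                   # 32x32 -> 16x16, akin to 2x2 pooling
```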
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.
Fast, Provable Algorithms for Isotonic Regression in all ℓp-norms
Rasmus Kyng, Anup Rao, Sushant Sachdeva
Given a directed acyclic graph $G$, and a set of values $y$ on the vertices, the Isotonic Regression of $y$ is a vector $x$ that respects the partial order described by $G$, and minimizes $\|x-y\|$ for a specified norm. This paper gives improved algorithms for computing the Isotonic Regression for all weighted $\ell_p$-norms with rigorous performance guarantees. Our algorithms are quite practical, and their variants can be implemented to run fast in practice.
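For intuition, the special case of $\ell_2$ isotonic regression on a chain (a total order) is solved exactly by the classic pool-adjacent-violators algorithm; the paper's algorithms handle general DAGs and all $\ell_p$ norms, which this sketch does not.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators for l2 isotonic regression on a chain."""
    values, weights = [], []
    for v in np.asarray(y, dtype=float):
        values.append(v); weights.append(1.0)
        # merge adjacent blocks while the nondecreasing constraint is violated
        while len(values) > 1 and values[-2] > values[-1]:
            w = weights[-2] + weights[-1]
            merged = (values[-2] * weights[-2] + values[-1] * weights[-1]) / w
            values[-2:] = [merged]
            weights[-2:] = [w]
    return np.concatenate([np.full(int(w), v) for v, w in zip(values, weights)])

print(pava([1, 3, 2, 2, 5, 4]))   # a nondecreasing fit minimizing squared error
```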
Efficient and Parsimonious Agnostic Active Learning
Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, Robert E. Schapire
We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It provably works for any classifier representation and classification problem including those with severe noise. 2) It is efficiently implementable with an ERM oracle. 3) It is more aggressive than all previous approaches satisfying 1 and 2. To do this we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, discovering that this one is typically better across a wide variety of datasets and label complexities.
Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton
Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain specific, complex, and inefficient. In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset, when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.
Regularization-free estimation in trace regression with symmetric positive semidefinite matrices
Martin Slawski, Ping Li, Matthias Hein
Over the past few years, trace regression models have received considerable attention in the context of matrix completion, quantum state tomography, and compressed sensing. Estimation of the underlying matrix from regularization-based approaches promoting low-rankedness, notably nuclear norm regularization, have enjoyed great popularity. In the present paper, we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite (\textsf{spd}) and the design satisfies certain conditions. In this situation, simple least squares estimation subject to an \textsf{spd} constraint may perform as well as regularization-based approaches with a proper choice of the regularization parameter, which entails knowledge of the noise level and/or tuning. By contrast, constrained least squares estimation comes without any tuning parameter and may hence be preferred due to its simplicity.
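A minimal sketch of the regularization-free estimator: least squares over symmetric positive semidefinite matrices, solved here by projected gradient descent with the PSD projection implemented by clipping negative eigenvalues. Step size, iteration count, and the Gaussian design are illustrative choices of ours, not the paper's setup.

```python
import numpy as np

def project_psd(B):
    B = (B + B.T) / 2
    evals, evecs = np.linalg.eigh(B)
    return (evecs * np.clip(evals, 0, None)) @ evecs.T   # clip negative eigenvalues

def spd_constrained_ls(Xs, y, d, lr=0.05, iters=500):
    B = np.zeros((d, d))
    for _ in range(iters):
        preds = np.array([np.trace(X @ B) for X in Xs])
        grad = sum((p - yi) * X for X, p, yi in zip(Xs, preds, y)) / len(y)
        B = project_psd(B - lr * grad)                    # projected gradient step
    return B

d, m, r = 10, 300, 2
L = np.random.randn(d, r); B_true = L @ L.T               # ground-truth spd matrix
Xs = [np.random.randn(d, d) for _ in range(m)]
y = np.array([np.trace(X @ B_true) for X in Xs]) + 0.1 * np.random.randn(m)
B_hat = spd_constrained_ls(Xs, y, d)
```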
Winner-Take-All Autoencoders
Alireza Makhzani, Brendan Frey
In this paper, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. We first introduce fully-connected winner-take-all autoencoders which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant sparse representations. We describe a way to train convolutional autoencoders layer by layer, where in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We will show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.
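The lifetime sparsity constraint for the fully-connected case is easy to sketch: for each hidden unit, keep only its largest activations across the minibatch and zero the rest. The keep fraction below is illustrative.

```python
import numpy as np

def lifetime_sparsity(H, keep_fraction=0.05):
    """H: (batch, hidden) activations; keep each unit's top activations across the batch."""
    k = max(1, int(keep_fraction * H.shape[0]))
    thresh = np.sort(H, axis=0)[-k]          # per-unit k-th largest activation
    return H * (H >= thresh)                 # zero everything below that unit's threshold

H = np.maximum(0, np.random.randn(128, 64))  # ReLU-like hidden activations
H_sparse = lifetime_sparsity(H, keep_fraction=0.05)
```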
The LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements
Christos Thrampoulidis, Ehsan Abbasi, Babak Hassibi
Consider estimating an unknown, but structured, signal $x_0 \in \mathbb{R}^n$ from $m$ measurements $y_i = g_i(a_i^T x_0)$, where the $a_i$'s are the rows of a known measurement matrix $A$, and $g$ is a (potentially unknown) nonlinear and random link-function. Such measurement functions could arise in applications where the measurement device has nonlinearities and uncertainties. It could also arise by design, e.g., $g_i(x) = \mathrm{sign}(x + z_i)$ corresponds to noisy 1-bit quantized measurements. Motivated by the classical work of Brillinger, and more recent work of Plan and Vershynin, we estimate $x_0$ via solving the Generalized-LASSO for some regularization parameter $\lambda > 0$ and some (typically non-smooth) convex structure-inducing regularizer function. While this approach seems to naively ignore the nonlinear function $g$, both Brillinger (in the non-constrained case) and Plan and Vershynin have shown that, when the entries of $A$ are iid standard normal, this is a good estimator of $x_0$ up to a constant of proportionality $\mu$, which only depends on $g$. In this work, we considerably strengthen these results by obtaining explicit expressions for the squared error, for the \emph{regularized} LASSO, that are asymptotically \emph{precise} when $m$ and $n$ grow large. A main result is that the estimation performance of the Generalized LASSO with non-linear measurements is \emph{asymptotically the same} as one whose measurements are linear $y_i = \mu a_i^T x_0 + \sigma z_i$, with $\mu = \mathbb{E}[\gamma g(\gamma)]$ and $\sigma^2 = \mathbb{E}[(g(\gamma) - \mu\gamma)^2]$, and $\gamma$ standard normal. To the best of our knowledge, the derived expressions on the estimation performance are the first-known precise results in this context. One interesting consequence of our result is that the optimal quantizer of the measurements that minimizes the estimation error of the LASSO is the celebrated Lloyd-Max quantizer.
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.