
Monday, July 31, 2017

Nuit Blanche in Review (July 2017)

Since the last Nuit Blanche in Review (June 2017), it was found that Titan has some interesting chemistry. On Nuit Blanche, meanwhile, we had four implementations released by their authors, several interesting in-depth articles (some of them related to SGD and hardware), several slides and videos from meetings and schools, and three job offerings. Enjoy !


In-depth

SGD related

CS/ML Hardware


Slides

Videos

Jobs

Other 


Credit: Northern Summer on Titan, NASA/JPL-Caltech/Space Science Institute


Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Randomized apertures: high resolution imaging in far field

Using glitter as a way to replace large mirror structures for space telescopes: this is what is suggested and measured here. The random PSF allows for sharper resolution (and machine learning is used). This is another instance of the Great Convergence, woohoo ! (And by the way, are we ever going to acknowledge that the Random Lens Imaging paper is one of the greatest preprints never to make it into publication?)



We explore opportunities afforded by an extremely large telescope design comprised of ill-figured randomly varying subapertures. The veracity of this approach is demonstrated with a laboratory scaled system whereby we reconstruct a white light binary point source separated by 2.5 times the diffraction limit. With an inherently unknown varying random point spread function, the measured speckle images require a restoration framework that combines support vector machine based lucky imaging and non-negative matrix factorization based multiframe blind deconvolution. To further validate the approach, we model the experimental system to explore sub-diffraction-limited performance and an object comprised of multiple point sources.








Friday, July 28, 2017

Beyond Filters: Compact Feature Map for Portable Deep Model


Beyond Filters: Compact Feature Map for Portable Deep Model by Yunhe Wang, Chang Xu, Chao Xu, Dacheng Tao

Convolutional neural networks (CNNs) have shown extraordinary performance in a number of applications, but they are usually of heavy design for the accuracy reason. Beyond compressing the filters in CNNs, this paper focuses on the redundancy in the feature maps derived from the large number of filters in a layer. We propose to extract intrinsic representation of the feature maps and preserve the discriminability of the features. Circulant matrix is employed to formulate the feature map transformation, which only requires O(d log d) computation complexity to embed a d-dimensional feature map. The filter is then reconfigured to establish the mapping from original input to the new compact feature map, and the resulting network can preserve intrinsic information of the original network with significantly fewer parameters, which not only decreases the online memory for launching CNN but also accelerates the computation speed. Experiments on benchmark image datasets demonstrate the superiority of the proposed algorithm over state-of-the-art methods.
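The O(d log d) claim follows from the fact that multiplying by a circulant matrix is a circular convolution, which the FFT diagonalizes. Here is a minimal NumPy sketch of that equivalence (not the authors' code; the projection vector w is just a random stand-in):

  import numpy as np
  from scipy.linalg import circulant

  rng = np.random.default_rng(0)
  d = 8
  w = rng.normal(size=d)    # defines the circulant transform
  x = rng.normal(size=d)    # a d-dimensional feature map, flattened

  # materializing the d-by-d circulant matrix costs O(d^2) per product...
  y_dense = circulant(w) @ x

  # ...but the same product is a circular convolution, O(d log d) via the FFT
  y_fft = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

  print(np.allclose(y_dense, y_fft))   # True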






Thursday, July 27, 2017

A Randomized Rounding Algorithm for Sparse PCA

I had not noticed this one when it came out. This is now repaired:

We present and analyze a simple, two-step algorithm to approximate the optimal solution of the sparse PCA problem. Our approach first solves an L1-penalized version of the NP-hard sparse PCA optimization problem and then uses a randomized rounding strategy to sparsify the resulting dense solution. Our main theoretical result guarantees an additive error approximation and provides a tradeoff between sparsity and accuracy. Our experimental evaluation indicates that our approach is competitive in practice, even compared to state-of-the-art toolboxes such as Spasm.
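To make the two-step pattern concrete, here is a toy NumPy sketch: a plain leading eigenvector stands in for the L1-penalized solution, and coordinates are then kept at random with probability growing with their magnitude. This only illustrates the solve-then-round idea, not the authors' actual relaxation, rounding distribution or guarantees:

  import numpy as np

  rng = np.random.default_rng(0)
  n, d, s = 200, 50, 5

  # toy data whose leading component is supported on the first s coordinates
  spike = np.zeros(d)
  spike[:s] = 1 / np.sqrt(s)
  X = rng.normal(size=(n, d)) + 3.0 * np.outer(rng.normal(size=n), spike)
  C = X.T @ X / n

  # step 1: a dense surrogate solution (the paper solves an L1-penalized
  # problem; the plain leading eigenvector stands in for it here)
  z = np.linalg.eigh(C)[1][:, -1]

  # step 2: randomized rounding -- keep coordinate i with probability
  # growing with |z_i|, zero out the rest, renormalize
  p = np.minimum(1.0, 2 * s * np.abs(z) / np.abs(z).sum())
  z_sparse = np.where(rng.random(d) < p, z, 0.0)
  z_sparse /= np.linalg.norm(z_sparse)

  # sparsity achieved vs. fraction of explained variance retained
  print(np.count_nonzero(z_sparse), (z_sparse @ C @ z_sparse) / (z @ C @ z))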




Wednesday, July 26, 2017

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Here is a very interesting paper that uses only one sensor:

"... ECG data is sampled at a frequency of 200 Hz and is collected from a single-lead, non- invasive and continuous monitoring device called the Zio Patch which has a wear period up to 14 days (Turakhia et al., 2013)."
What is interesting is that it does very well compared to the gold standard established by the researchers in the paper. The second aspect that is fascinating to me is the need for 34 convolutional layers: an architecture that would have been difficult to guess in the first place.





Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks by Pranav Rajpurkar, Awni Y. Hannun, Masoumeh Haghpanahi, Codie Bourn, Andrew Y. Ng

We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).
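To give a feel for the sequence-to-sequence design described in the abstract, here is a hypothetical, much shallower PyTorch sketch that maps a 1D ECG trace to one rhythm class per output segment. TinyECGNet, its width, depth and the 14-class output are illustrative assumptions of mine, not the paper's 34-layer residual architecture:

  import torch
  import torch.nn as nn

  class TinyECGNet(nn.Module):
      # maps (batch, 1, samples) of 200 Hz ECG to (batch, classes, segments)
      def __init__(self, n_classes=14, width=32, n_blocks=4):
          super().__init__()
          layers = [nn.Conv1d(1, width, kernel_size=16, padding=8)]
          for _ in range(n_blocks):
              # each block halves the temporal resolution, as subsampling
              # networks of this kind typically do
              layers += [nn.BatchNorm1d(width), nn.ReLU(),
                         nn.Conv1d(width, width, kernel_size=16,
                                   stride=2, padding=7)]
          self.features = nn.Sequential(*layers)
          self.classify = nn.Conv1d(width, n_classes, kernel_size=1)

      def forward(self, x):
          return self.classify(self.features(x))

  net = TinyECGNet()
  print(net(torch.randn(2, 1, 4096)).shape)   # one prediction per segment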





Tuesday, July 25, 2017

Unbiased estimates for linear regression via volume sampling

I wonder how leverage scores or the volume sampling technique described below would fare in the sparse pseudoinverse setting we mentioned earlier this month.





Given a full-rank matrix X with more columns than rows, consider the task of estimating the pseudoinverse X+ based on the pseudoinverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (i.e., volume sampling). The resulting estimator is unbiased and, surprisingly, the covariance of the estimator also has a closed form: it equals a specific factor times X+X+⊤. The pseudoinverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of X. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from X. Using our methods we show that the weight vector of the solution for the subproblem is an unbiased estimator of the optimal solution for the whole problem based on all column labels.
We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than the state of the art, and to obtain bounds for the total loss of the estimated least-squares solution on all labeled columns.
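For tiny matrices, the unbiasedness claim can be checked by brute force: enumerate all size-n column subsets, sample them proportionally to their squared volume, and average the scattered inverses. A NumPy sketch of my reading of the estimator, for subsets of size exactly n:

  import numpy as np
  from itertools import combinations

  rng = np.random.default_rng(0)
  n, m = 3, 8                       # full-rank X with more columns than rows
  X = rng.normal(size=(n, m))

  # volume sampling: pick a size-n column subset S with probability
  # proportional to det(X_S)^2, the squared volume it spans
  subsets = list(combinations(range(m), n))
  probs = np.array([np.linalg.det(X[:, S]) ** 2 for S in subsets])
  probs /= probs.sum()

  # estimator: invert the sampled square submatrix and scatter its rows
  # back to the positions of the sampled columns (zeros elsewhere)
  est = np.zeros((m, n))
  n_draws = 50000
  for k in rng.choice(len(subsets), size=n_draws, p=probs):
      S = list(subsets[k])
      est[S, :] += np.linalg.inv(X[:, S])
  est /= n_draws

  # the Monte Carlo average approaches the Moore-Penrose pseudoinverse
  print(np.abs(est - np.linalg.pinv(X)).max())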



Monday, July 24, 2017

Densely Connected Convolutional Networks - implementations -



Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at this https URL .


From the main implementation page at: https://github.com/liuzhuang13/DenseNet

"..Other Implementations
  1. Our Caffe Implementation
  2. Our (much more) space-efficient Caffe Implementation.
  3. PyTorch Implementation (with BC structure) by Andreas Veit.
  4. PyTorch Implementation (with BC structure) by Brandon Amos.
  5. MXNet Implementation by Nicatio.
  6. MXNet Implementation (supporting ImageNet) by Xiong Lin.
  7. Tensorflow Implementation by Yixuan Li.
  8. Tensorflow Implementation by Laurent Mazare.
  9. Tensorflow Implementation (with BC structure) by Illarion Khlestov.
  10. Lasagne Implementation by Jan Schlüter.
  11. Keras Implementation by tdeboissiere.
  12. Keras Implementation by Roberto de Moura Estevão Filho.
  13. Keras Implementation (with BC structure) by Somshubra Majumdar.
  14. Chainer Implementation by Toshinori Hanya.
  15. Chainer Implementation by Yasunori Kudo.
  16. Fully Convolutional DenseNets for segmentation by Simon Jegou...."
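The L(L+1)/2 connection count in the abstract comes from each layer consuming the concatenation of the input and all previous outputs. A minimal PyTorch sketch of one dense block (growth rate and depth are arbitrary here, and the bottleneck/compression layers of the BC variant are omitted):

  import torch
  import torch.nn as nn

  class DenseBlock(nn.Module):
      # layer i consumes the concatenation of the input and all previous
      # outputs, and contributes growth_rate new feature maps
      def __init__(self, in_channels, num_layers, growth_rate=12):
          super().__init__()
          self.layers = nn.ModuleList()
          for i in range(num_layers):
              c = in_channels + i * growth_rate
              self.layers.append(nn.Sequential(
                  nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                  nn.Conv2d(c, growth_rate, kernel_size=3,
                            padding=1, bias=False)))

      def forward(self, x):
          features = [x]
          for layer in self.layers:
              features.append(layer(torch.cat(features, dim=1)))
          return torch.cat(features, dim=1)

  block = DenseBlock(in_channels=16, num_layers=4)
  y = block(torch.randn(1, 16, 32, 32))
  print(y.shape)   # (1, 16 + 4 * 12, 32, 32)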



Localization of Sound Sources in a Room with One Microphone - implementation -

When used fully, sparsity can do wonders !



Estimation of the location of sound sources is usually done using microphone arrays. Such settings provide an environment where we know the difference between the received signals among different microphones in terms of phase or attenuation, which enables localization of the sound sources. In our solution we exploit the properties of the room transfer function in order to localize a sound source inside a room with only one microphone. The shape of the room and the position of the microphone are assumed to be known. The design guidelines and limitations of the sensing matrix are given. Implementation is based on sparsity in terms of the voxels in a room that are occupied by a source. What is especially interesting about our solution is that we provide localization of the sound sources not only in the horizontal plane, but in terms of the 3D coordinates inside the room.
The implementation is in the GitHub repo at https://github.com/epfl-lts2/room_transfer_function_toolkit

This repository contains a library for sparse representation of the room transfer function and code for localization of sound sources in a room with one microphone. Author contact information: helena.peictukuljac@epfl.ch.
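At its core this is a sparse recovery problem over room voxels. Here is a hypothetical sketch of that formulation, with a random matrix standing in for the known room-transfer-function dictionary and orthogonal matching pursuit as the solver (the toolkit above implements the real thing):

  import numpy as np
  from sklearn.linear_model import OrthogonalMatchingPursuit

  rng = np.random.default_rng(0)
  n_samples, n_voxels, n_sources = 128, 512, 2

  # each voxel contributes a known "signature" to the single microphone;
  # a random matrix stands in for the room-transfer-function dictionary
  A = rng.normal(size=(n_samples, n_voxels))
  x_true = np.zeros(n_voxels)
  x_true[rng.choice(n_voxels, n_sources, replace=False)] = 1.0
  y = A @ x_true                      # the one-microphone measurement

  omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_sources)
  omp.fit(A, y)
  print(sorted(np.flatnonzero(omp.coef_)), sorted(np.flatnonzero(x_true)))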


Wednesday, July 19, 2017

On Unlimited Sampling



Ayush Bhandari just let me know about the interesting approach of Unlimited Sampling in an email exchange:

...In practice, ADCs clip or saturate whenever the amplitude of signal x exceeds ADC threshold L. Typical solution is to de-clip the signal for which purpose various methods have been proposed.

Based on a new ADC hardware which allows for sampling using the principle 
y = mod(x,L)

where x is bandlimited and L is the ADC threshold, we show that sampling about πe (~10) times faster than the Nyquist rate guarantees recovery of x from y. For this purpose we outline a new, stable recovery procedure.

Paper and slides are here.

There is also the PhysOrg coverage. Thanks Ayush ! Here is the paper:


Shannon's sampling theorem provides a link between the continuous and the discrete realms, stating that bandlimited signals are uniquely determined by their values on a discrete set. This theorem is realized in practice using so-called analog-to-digital converters (ADCs). Unlike Shannon's sampling theorem, ADCs are limited in dynamic range. Whenever a signal exceeds some preset threshold, the ADC saturates, resulting in aliasing due to clipping. The goal of this work is to analyze an alternative approach that does not suffer from these problems. Our work is based on recent developments in ADC design, which allow for ADCs that reset rather than saturate, thus producing modulo samples. An open problem that remains is: Given such modulo samples of a bandlimited function as well as the dynamic range of the ADC, how can the original signal be recovered and what are the sufficient conditions that guarantee perfect recovery? In this work, we prove such sufficiency conditions and complement them with a stable recovery algorithm. Our results are not limited to certain amplitude ranges; in fact, even the same circuit architecture allows for the recovery of arbitrarily large amplitudes as long as some estimate of the signal norm is available when recovering. Numerical experiments that corroborate our theory indeed show that it is possible to perfectly recover functions that take values that are orders of magnitude higher than the ADC's threshold.

h/t Laurent.
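For intuition, here is a heavily simplified first-order NumPy sketch of modulo recovery: if the signal is oversampled enough that consecutive samples differ by less than L, then the centered modulo of the first difference of the folded samples equals the first difference of the signal, and a cumulative sum recovers it. The paper's actual algorithm uses higher-order differences and does not assume the first sample is known:

  import numpy as np

  def centered_mod(t, L):
      # fold values into [-L, L), an ideal self-reset ADC
      return np.mod(t + L, 2 * L) - L

  t = np.linspace(0, 1, 2000)                      # heavy oversampling
  x = 3.0 * np.sin(2 * np.pi * 3 * t) + 2.0 * np.sin(2 * np.pi * 5 * t)
  L = 1.0                                          # ADC threshold << max |x|
  y = centered_mod(x, L)                           # modulo samples

  # oversampling keeps |x[k+1] - x[k]| < L, so the centered modulo of the
  # first difference of y equals the first difference of x exactly
  dx = centered_mod(np.diff(y), L)
  x_rec = x[0] + np.concatenate(([0.0], np.cumsum(dx)))

  print(np.max(np.abs(x - x_rec)))                 # ~ machine precision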

CSjob: Three postdoc positions, Paris, France

Lenka let me know of three postdoc positions:

If you know talented young researchers looking for a postdoc position, would you please forward them the following openings. One is in my group starting in 2018, and two on more applied subjects in collaboration with my group. 
Thanks in advance for your help. 









Tuesday, July 18, 2017

Object classification through scattering media with deep learning on time resolved measurement


I love it ! Calibration Invariant Imaging.




We demonstrate an imaging technique that allows identification and classification of objects hidden behind scattering media and is invariant to changes in calibration parameters within a training range. Traditional techniques to image through scattering solve an inverse problem and are limited by the need to tune a forward model with multiple calibration parameters (like camera field of view, illumination position etc.). Instead of tuning a forward model and directly inverting the optical scattering, we use a data driven approach and leverage convolutional neural networks (CNN) to learn a model that is invariant to calibration parameters variations within the training range and nearly invariant beyond that. This effectively allows robust imaging through scattering conditions that is not sensitive to calibration. The CNN is trained with a large synthetic dataset generated with a Monte Carlo (MC) model that contains random realizations of major calibration parameters. The method is evaluated with a time-resolved camera and multiple experimental results are provided including pose estimation of a mannequin hidden behind a paper sheet with 23 correct classifications out of 30 tests in three poses (76.6% accuracy on real-world measurements). This approach paves the way towards real-time practical non line of sight (NLOS) imaging applications.




RLAGPU: High-performance Out-of-Core Randomized Singular Value Decomposition on GPU - implementation -

When the GPU cannot handle your randomized SVD:


RLAGPU: High-performance Out-of-Core Randomized Singular Value Decomposition on GPU by Yuechao Lu, Fumihiko Ino, Yasuyuki Matsushita and Kenichi Hagihara

Randomized Singular Value Decomposition (SVD) [1] is gaining attention in finding structure in scientific data. However, processing large-scale data is not easy due to the limited capacity of GPU memory. To deal with this issue, we propose RLAGPU, an out-of-core processing method accelerating large-scale randomized SVD on GPU. The contributions of our method are as follows:
  • An out-of-core implementation that overcomes the GPU memory capacity limit.
  • High performance: in-core and out-of-core routines are switched automatically according to data size and available GPU memory.
We found that our proposed method outperforms the existing cuBLAS-XT by a margin of up to 50%.

An implementation is here: https://github.com/luyuechao/
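For reference, the in-core computation that such systems scale out is small. A textbook-style NumPy sketch of randomized SVD with oversampling and power iterations (not the RLAGPU code):

  import numpy as np

  def randomized_svd(A, rank, n_oversamples=10, n_iter=2, seed=0):
      # sketch the range of A with a Gaussian test matrix, refine it with
      # power iterations, then take an exact SVD in the small subspace
      rng = np.random.default_rng(seed)
      Q, _ = np.linalg.qr(A @ rng.normal(size=(A.shape[1],
                                               rank + n_oversamples)))
      for _ in range(n_iter):
          Q, _ = np.linalg.qr(A @ (A.T @ Q))
      U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
      return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

  A = np.random.default_rng(1).normal(size=(500, 200))
  U, s, Vt = randomized_svd(A, rank=20)
  print(s[:3])
  print(np.linalg.svd(A, compute_uv=False)[:3])   # close to the exact values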



Wednesday, July 12, 2017

Why is #Netneutrality Important to Data Science, Machine Learning and Artificial Intelligence ?


So your ISP decides the speed or the kind of service you can get based on religion or whatnot. What happens to our field? Because you have spent too much time on the following services, you are throttled or have to pay for "premium" service. As a result, you may or may not get to:

  • follow Andrew Ng's Coursera or Siraj Raval classes
  • submit your Kaggle results on time
  • read ArXiv preprints
  • read the latest GAN paper on time
  • watch NIPS/ICLR/CVPR/ACL videos
  • download datasets
  • pay more to use ML/DL on the cloud
  • share reviews
  • download the latest ML/DL frameworks
  • have access to your Slack channels 
  • read Nuit Blanche
  • follow awesome DL thread on Twitter
  • get scholar google alerts
  • .... 


 

SGD, What Is It Good For ?

Image Credit: NASA/JPL-Caltech/Space Science Institute, 
N00284488.jpg, Titan, Jul. 11, 2017 10:12 AM

As the recent activity on the subject shows, there is growing interest in understanding SGD and related methods better; we mentioned two such studies recently on Nuit Blanche.

Sebastian Ruder updated his blog entry on the subject, An overview of gradient descent optimization algorithms (adding derivations of AdaMax and Nadam). In Reinforcement Learning or Evolutionary Strategies? Nature has a solution: Both, Arthur Juliani mentions an insight on gradient-based methods in RL (h/t Tarin for the pointer on Twitter):

It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training. On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available, and we need to learn quickly with less data.

But we also had people trying to speed SGD up, while others added a grain of salt to the whole adaptive approach. We also have one paper where SGD, helped by random features, solves the linear Bellman equation, a tool central to linear control theory.
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency. 
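The linear scaling rule plus warmup from this abstract is easy to state in code. A sketch of my reading of the recipe; the 30/60/80 step-decay milestones are the usual ImageNet/ResNet schedule and are an assumption here:

  def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256,
                    warmup_epochs=5):
      # linear scaling rule: the learning rate grows with the minibatch size
      target = base_lr * batch_size / base_batch
      if epoch < warmup_epochs:
          # gradual warmup: ramp from the base rate up to the scaled target
          return base_lr + (target - base_lr) * epoch / warmup_epochs
      # afterwards, the usual step decay (milestones assumed, see above)
      for milestone, factor in ((80, 1e-3), (60, 1e-2), (30, 1e-1)):
          if epoch >= milestone:
              return target * factor
      return target

  for e in (0, 2, 5, 30, 60):
      print(e, learning_rate(e, batch_size=8192))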


Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

We introduce a data-efficient approach for solving the linear Bellman equation, which corresponds to a class of Markov decision processes (MDPs) and stochastic optimal control (SOC) problems. We show that this class of control problem can be cast as a stochastic composition optimization problem, which can be further reformulated as a saddle point problem and solved via dual kernel embeddings [1]. Our method is model-free and uses only one sample per state transition from stochastic dynamical systems. Different from related work such as Z-learning [2, 3] based on temporal-difference learning [4], our method is an online algorithm following the true stochastic gradient. Numerical results are provided, showing that our method outperforms the Z-learning algorithm.


Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.



Monday, July 10, 2017

Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

Continuing the examination of SGD, on the hardware side this time !


Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent by Christopher De Sa, Matthew Feldman, Christopher Ré, Kunle Olukotun

Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems. 
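A key ingredient in low-precision SGD schemes of this kind is quantization with stochastic rounding, which keeps the stored iterate unbiased. A toy NumPy sketch of the idea (my illustration, not the Buckwild! implementation):

  import numpy as np

  rng = np.random.default_rng(0)

  def quantize_stochastic(v, scale=1e-4, bits=16):
      # signed fixed point with stochastic rounding: round up with
      # probability equal to the fractional part, so E[Q(v)] = v in range
      q = v / scale
      f = np.floor(q)
      q = f + (rng.random(v.shape) < (q - f))
      lim = 2 ** (bits - 1) - 1
      return np.clip(q, -lim - 1, lim) * scale

  # a low-precision SGD run on a least-squares toy problem
  A = rng.normal(size=(64, 10))
  b = rng.normal(size=64)
  w = np.zeros(10)
  for _ in range(5000):
      i = rng.integers(64)
      grad = (A[i] @ w - b[i]) * A[i]               # one-sample gradient
      w = quantize_stochastic(w - 0.02 * grad)      # store w in low precision
  # w ends up near the least-squares solution despite the quantized storage
  print(np.linalg.norm(w - np.linalg.lstsq(A, b, rcond=None)[0]))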




YellowFin: An automatic tuner for momentum SGD - implementation -

Here is an implementation that automatically tunes the momentum (and learning rate) of stochastic gradient descent.


Hyperparameter tuning is one of the big costs of deep learning. State-of-the-art optimizers, such as Adagrad, RMSProp and Adam, make things easier by adaptively tuning an individual learning rate for each variable. This level of fine adaptation is understood to yield a more powerful method. However, our experiments, as well as recent theory by Wilson et al., show that hand-tuned stochastic gradient descent (SGD) achieves better results, at the same rate or faster. The hypothesis put forth is that adaptive methods converge to different minima (Wilson et al.). Here we point out another factor: none of these methods tune their momentum parameter, known to be very important for deep learning applications (Sutskever et al.). Tuning the momentum parameter becomes even more important in asynchronous-parallel systems: recent theory (Mitliagkas et al.) shows that asynchrony introduces momentum-like dynamics, and that tuning down algorithmic momentum is important for efficient parallelization.
We revisit the simple momentum SGD algorithm and show that hand-tuning a single learning rate and momentum value makes it competitive with Adam. We then analyze its robustness in learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for both momentum and learning rate in SGD. YellowFin optionally uses a novel momentum-sensing component along with a negative-feedback loop mechanism to compensate for the added dynamics of asynchrony on the fly. We empirically show YellowFin converges in fewer iterations than Adam on large ResNet and LSTM models, with a speedup up to 2.8x in synchronous and 2.7x in asynchronous settings.
YellowFin: An automatic tuner for momentum SGD 

by Jian Zhang, Ioannis Mitliagkas and Chris Ré.
TLDR; Hand-tuned momentum SGD is competitive with state-of-the-art adaptive methods, like Adam. We introduce YellowFin, an automatic tuner for the hyperparameters of momentum SGD. YellowFin trains models such as large ResNets and LSTMs in fewer iterations than the state of the art. It performs even better in asynchronous settings via an on-the-fly momentum adaptation scheme that uses a novel momentum measurement component along with a negative-feedback loop mechanism.

(Figure) Comparing YellowFin to Adam on training a ResNet on CIFAR100: (left) synchronously; (right) asynchronously, using 16 workers.

.......

To try our tuner, get your fins on here for Tensorflow and here for PyTorch.
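For reference, the update YellowFin tunes is plain heavy-ball momentum SGD with two scalars, the learning rate and the momentum. A minimal NumPy sketch on a quadratic toy problem:

  import numpy as np

  def momentum_sgd(grad, w0, lr=0.01, mu=0.9, steps=500):
      # heavy-ball update: the two scalars (lr, mu) are what YellowFin tunes
      w, v = w0.copy(), np.zeros_like(w0)
      for _ in range(steps):
          v = mu * v - lr * grad(w)
          w = w + v
      return w

  rng = np.random.default_rng(0)
  A = rng.normal(size=(50, 10))
  b = rng.normal(size=50)
  w = momentum_sgd(lambda w: A.T @ (A @ w - b), np.zeros(10))
  print(np.linalg.norm(A.T @ (A @ w - b)))   # near-zero gradient at the end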


And earlier: Asynchrony begets Momentum, with an Application to Deep Learning by Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis and Christopher Ré

Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to non-convex problems. We show that running stochastic gradient descent (SGD) in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration. Our result does not assume convexity of the objective function, so it is applicable to deep learning systems. We observe that a standard queuing model of asynchrony results in a form of momentum that is commonly used by deep learning practitioners. This forges a link between queuing theory and asynchrony in deep learning systems, which could be useful for systems builders. For convolutional neural networks, we experimentally validate that the degree of asynchrony directly correlates with the momentum, confirming our main result. An important implication is that tuning the momentum parameter is important when considering different levels of asynchrony. We assert that properly tuned momentum reduces the number of steps required for convergence. Finally, our theory suggests new ways of counteracting the adverse effects of asynchrony: a simple mechanism like using negative algorithmic momentum can improve performance under high asynchrony. Since asynchronous methods have better hardware efficiency, this result may shed light on when asynchronous execution is more efficient for deep learning systems.





Thursday, July 06, 2017

Beyond Moore-Penrose Part II: The Sparse Pseudoinverse

One of the findings of recent years is that, when faced with underdetermined/undersampled problems, you can choose any regularization you want. The Advanced Matrix Factorization roster of techniques is a prime example of that. Today, we go beyond the traditional Moore-Penrose inverse !




This is the second part of a two-paper series on generalized inverses that minimize matrix norms. In Part II we focus on generalized inverses that are minimizers of entrywise ℓp norms, whose main representative is the sparse pseudoinverse for p = 1. We are motivated by the idea to replace the Moore-Penrose pseudoinverse by a sparser generalized inverse which is in some sense well-behaved. Sparsity implies that it is faster to apply the resulting matrix; well-behavedness would imply that we do not lose much in stability with respect to the least-squares performance of the MPP. We first address questions of uniqueness and non-zero count of (putative) sparse pseudoinverses. We show that a sparse pseudoinverse is generically unique, and that it indeed reaches optimal sparsity for almost all matrices. We then turn to proving our main stability result: finite-size concentration bounds for the Frobenius norm of ℓp-minimal inverses for 1 ≤ p ≤ 2. Our proof is based on tools from convex analysis and random matrix theory, in particular the recently developed convex Gaussian min-max theorem. Along the way we prove several results about sparse representations and convex programming that were known folklore, but of which we could find no proof.
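Prototyping the p = 1 object studied here is straightforward: each column h of a generalized inverse H solves min ||h||_1 subject to X h = e_i, which is a linear program after the usual positive/negative split. A NumPy/SciPy sketch (an illustration, not the authors' code):

  import numpy as np
  from scipy.optimize import linprog

  rng = np.random.default_rng(0)
  n, m = 4, 10                            # wide, full-rank X
  X = rng.normal(size=(n, m))

  # column i of H solves  min ||h||_1  s.t.  X h = e_i ;
  # with h = hp - hm and hp, hm >= 0 this is a standard linear program
  H = np.zeros((m, n))
  A_eq = np.hstack([X, -X])
  for i in range(n):
      res = linprog(np.ones(2 * m), A_eq=A_eq, b_eq=np.eye(n)[i],
                    bounds=(0, None))
      H[:, i] = res.x[:m] - res.x[m:]

  print(np.allclose(X @ H, np.eye(n)))    # H is a generalized inverse
  print(np.count_nonzero(np.abs(H) > 1e-9),                  # ~ n per column
        np.count_nonzero(np.abs(np.linalg.pinv(X)) > 1e-9))  # dense MPP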

