Monday, April 24, 2017

#ICLR2017 Monday Afternoon Program

 
ICLR 2017 is taking place in Toulon this week. There will be a blog post for each half day, with direct links to the papers and to attendant code when there is any. The meeting is streamed live on Facebook at: https://www.facebook.com/iclr.cc/ . If you want to say hi, I am around.
 
Afternoon Session – Session Chair: Joan Bruna (sponsored by Baidu)

14.30 - 15.10 Invited talk 2: Benjamin Recht
15.10 - 15.30 Contributed Talk 3: Understanding deep learning requires rethinking generalization - BEST PAPER AWARD
15.30 - 15.50 Contributed Talk 4: On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
15.50 - 16.10 Contributed Talk 5: Towards Principled Methods for Training Generative Adversarial Networks
16.10 - 16.30 Coffee Break
16.30 - 18.20 Poster Session 2 (Conference Papers, Workshop Papers)
18.20 - 18.30 Group photo at the stadium attached to the Neptune Congress Center.
 
C1: Neuro-Symbolic Program Synthesis
C2: Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy (code)
C3: Trained Ternary Quantization (code)
C4: DSD: Dense-Sparse-Dense Training for Deep Neural Networks (code)
C5: A Compositional Object-Based Approach to Learning Physical Dynamics (code, project site)
C6: Multilayer Recurrent Network Models of Primate Retinal Ganglion Cells
C7: Improving Generative Adversarial Networks with Denoising Feature Matching (chainer implementation)
C8: Transfer of View-manifold Learning to Similarity Perception of Novel Objects
C9: What does it take to generate natural textures?
C10: Emergence of foveal image sampling from learning to attend in visual scenes
C11: PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
C12: Learning to Optimize
C13: Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
C14: Optimal Binary Autoencoding with Pairwise Correlations
C15: On the Quantitative Analysis of Decoder-Based Generative Models (evaluation code)
C16: Adversarial machine learning at scale
C17: Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks
C18: Capacity and Learnability in Recurrent Neural Networks
C19: Deep Learning with Dynamic Computation Graphs  (TensorFlow code)
C20: Exploring Sparsity in Recurrent Neural Networks
C21: Structured Attention Networks (code)
C22: Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning
C23: Variational Lossy Autoencoder
C24: Learning to Query, Reason, and Answer Questions On Ambiguous Texts
C25: Deep Biaffine Attention for Neural Dependency Parsing
C26: A Compare-Aggregate Model for Matching Text Sequences (code)
C27: Data Noising as Smoothing in Neural Network Language Models
C28: Neural Variational Inference For Topic Models
C29: Bidirectional Attention Flow for Machine Comprehension (code, page)
C30: Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
C31: Stochastic Neural Networks for Hierarchical Reinforcement Learning
C32: Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning (video)
C33: Third Person Imitation Learning
 
W1: Audio Super-Resolution using Neural Networks (code)
W2: Semantic embeddings for program behaviour patterns
W3: De novo drug design with deep generative models : an empirical study
W4: Memory Matching Networks for Genomic Sequence Classification
W5: Char2Wav: End-to-End Speech Synthesis
W6: Fast Chirplet Transform Injects Priors in Deep Learning of Animal Calls and Speech
W7: Weight-averaged consistency targets improve semi-supervised deep learning results
W8: Particle Value Functions
W9: Out-of-class novelty generation: an experimental foundation
W10: Performance guarantees for transferring representations (presentation, video)
W11: Generative Adversarial Learning of Markov Chains
W12: Short and Deep: Sketching and Neural Networks
W13: Understanding intermediate layers using linear classifier probes
W14: Symmetry-Breaking Convergence Analysis of Certain Two-layered Neural Networks with ReLU nonlinearity
W15: Neural Combinatorial Optimization with Reinforcement Learning (TensorFlow code)
W16: Tactics of Adversarial Attacks on Deep Reinforcement Learning Agents
W17: Adversarial Discriminative Domain Adaptation (workshop extended abstract)
W18: Efficient Sparse-Winograd Convolutional Neural Networks
W19: Neural Expectation Maximization 

 
 
 
 
 
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

#ICLR2017 Monday Morning Program

 
ICLR 2017 is taking place in Toulon this week. There will be a blog post for each half day, with direct links to the papers from the OpenReview section. The meeting is streamed live on Facebook at: https://www.facebook.com/iclr.cc/ . If you want to say hi, I am around.

Monday April 24, 2017

Morning Session – Session Chair: Dhruv Batra

7.00 - 8.45 Registration
8.45 - 9.00 Opening Remarks
9.00 - 9.40 Invited talk 1: Eero Simoncelli
9.40 - 10.00 Contributed talk 1: End-to-end Optimized Image Compression
10.00 - 10.20 Contributed talk 2: Amortised MAP Inference for Image Super-resolution
10.20 - 10.30 Coffee Break
10.30 - 12.30 Poster Session 1

C1: Making Neural Programming Architectures Generalize via Recursion (slides, code, video)
C2: Learning Graphical State Transitions (code)
C3: Distributed Second-Order Optimization using Kronecker-Factored Approximations
C4: Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes
C5: Neural Program Lattices
C6: Diet Networks: Thin Parameters for Fat Genomics
C7: Unsupervised Cross-Domain Image Generation (TensorFlow implementation)
C8: Towards Principled Methods for Training Generative Adversarial Networks
C9: Recurrent Mixture Density Network for Spatiotemporal Visual Attention
C10: Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (PyTorch code)
C11: Pruning Filters for Efficient ConvNets
C12: Stick-Breaking Variational Autoencoders
C13: Identity Matters in Deep Learning
C14: On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
C15: Recurrent Hidden Semi-Markov Model
C16: Nonparametric Neural Networks
C17: Learning to Generate Samples from Noise through Infusion Training
C18: An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax
C19: Highway and Residual Networks learn Unrolled Iterative Estimation
C20: Soft Weight-Sharing for Neural Network Compression (Tutorial)
C21: Snapshot Ensembles: Train 1, Get M for Free
C22: Towards a Neural Statistician
C23: Learning Curve Prediction with Bayesian Neural Networks
C24: Learning End-to-End Goal-Oriented Dialog
C25: Multi-Agent Cooperation and the Emergence of (Natural) Language
C26: Efficient Vector Representation for Documents through Corruption (code)
C27: Improving Neural Language Models with a Continuous Cache
C28: Program Synthesis for Character Level Language Modeling
C29: Tracking the World State with Recurrent Entity Networks (TensorFlow implementation)
C30: Reinforcement Learning with Unsupervised Auxiliary Tasks (blog post, an implementation )
C31: Neural Architecture Search with Reinforcement Learning (slides, some implementation of Appendix A)
C32: Sample Efficient Actor-Critic with Experience Replay
C33: Learning to Act by Predicting the Future
 
 
 
 

Saturday, April 22, 2017

Sunday Morning Insight: "No Need for the Map of a Cat, Mr Feynman" or The Long Game in Nanopore Sequencing.



About five weeks ago, we wondered how we could tell if the world was changing right before our eyes. Well, it is happening: instance #2 just got more real:


Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of "ultra-long" reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.
Here is some context:

And previously on Nuit Blanche:
 
Credit: NASA, JPL





Friday, April 21, 2017

Random Feature Expansions for Deep Gaussian Processes / AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models - implementation -

[I will be at ICLR next week, let's grab some coffee if you are there]



Random Feature Expansions for Deep Gaussian Processes by Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, Maurizio Filippone
The composition of multiple Gaussian Processes as a Deep Gaussian Process (DGP) enables a deep probabilistic nonparametric approach to flexibly tackle complex machine learning problems with sound quantification of uncertainty. Existing inference approaches for DGP models have limited scalability and are notoriously cumbersome to construct. In this work, we introduce a novel formulation of DGPs based on random feature expansions that we train using stochastic variational inference. This yields a practical learning framework which significantly advances the state-of-the-art in inference for DGPs, and enables accurate quantification of uncertainty. We extensively showcase the scalability and performance of our proposal on several datasets with up to 8 million observations, and various DGP architectures with up to 30 hidden layers.
A python / TensorFlow implementation can be found here: https://github.com/mauriziofilippone/deep_gp_random_features

From AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models:
We investigate the capabilities and limitations of Gaussian process models by jointly exploring three complementary directions: (i) scalable and statistically efficient inference; (ii) flexible kernels; and (iii) objective functions for hyperparameter learning alternative to the marginal likelihood. Our approach outperforms all previously reported GP methods on the standard MNIST dataset; performs comparatively to previous kernel-based methods using the RECTANGLES-IMAGE dataset; and breaks the 1% error-rate barrier in GP models using the MNIST8M dataset, showing along the way the scalability of our method at unprecedented scale for GP models (8 million observations) in classification problems. Overall, our approach represents a significant breakthrough in kernel methods and GP models, bridging the gap between deep learning approaches and kernel machines.
And here is a recent presentation by one of the authors: "Practical and Scalable Inference for Deep Gaussian Processes"
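Since both papers build on random feature expansions of kernels, here is a minimal sketch of the underlying building block, Rahimi and Recht's random Fourier features for an RBF kernel. This is an illustration only, not the authors' code: the lengthscale, feature count, and test sizes are arbitrary.

```python
import numpy as np

def random_fourier_features(X, n_features=256, lengthscale=1.0, seed=0):
    """Approximate the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    so that k(x, x') ~= phi(x) @ phi(x') (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density (Gaussian, std 1/l).
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Sanity check: the feature map should approximate the exact kernel matrix.
X = np.random.default_rng(1).normal(size=(5, 3))
phi = random_fourier_features(X, n_features=20000)
exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.abs(phi @ phi.T - exact).max())  # small, on the order of 1e-2
```

With enough features the inner product of the feature maps tracks the exact kernel closely, which is what makes stacking such expansions into a DGP tractable.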

Wednesday, April 19, 2017

Stochastic Gradient Descent as Approximate Bayesian Inference

[I will be at ICLR next week, let's grab some coffee if you are there]


The recent Distill publication on Why Momentum Really Works by Gabriel Goh provides some insight into why gradient descent might work. Overviews such as the ones listed below also help:
But today, we have an additional insight from the mapping of SGD onto Bayesian inference: Stochastic Gradient Descent as Approximate Bayesian Inference by Stephan Mandt, Matthew D. Hoffman, David M. Blei
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
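To make result (1) concrete, here is a hedged sketch on a toy Bayesian linear regression: run SGD with a constant learning rate past burn-in and treat the iterates as approximate posterior samples. The learning rate, minibatch size, and burn-in below are arbitrary illustrative choices, not the optimal tuning derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w_true + noise, with a standard normal prior on w.
n, d, noise_std = 500, 3, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + noise_std * rng.normal(size=n)

def stochastic_grad(w, batch):
    """Minibatch gradient of the negative log posterior, rescaled to the full dataset."""
    Xb, yb = X[batch], y[batch]
    likelihood_term = (n / len(batch)) * Xb.T @ (Xb @ w - yb) / noise_std**2
    prior_term = w                           # gradient of 0.5 * ||w||^2
    return likelihood_term + prior_term

w = np.zeros(d)
lr, batch_size, burn_in, steps = 1e-4, 32, 2000, 12000
samples = []
for t in range(steps):
    batch = rng.choice(n, size=batch_size, replace=False)
    w = w - lr * stochastic_grad(w, batch)   # constant step size: no decay schedule
    if t >= burn_in:
        samples.append(w.copy())             # iterates kept as approximate posterior draws

samples = np.array(samples)
print("approximate posterior mean:", samples.mean(axis=0))
print("approximate posterior std :", samples.std(axis=0))
```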





Phase Transitions of Spectral Initialization for High-Dimensional Nonconvex Estimation


[Personal message: I will be at ICLR next week, let's grab some coffee if you are there]
Yue just sent me the following:
Dear Igor,

I hope all is well.

We recently posted a paper on arXiv on analyzing the exact asymptotic performance of a popular spectral initialization method for various nonconvex signal estimation problems (such as phase retrieval). We think you and readers of your blog might be interested in this research.

The paper can be found here:

https://arxiv.org/abs/1702.06435

Best regards,
Yue
Thanks Yue, two phase transitions! I like it: Phase Transitions of Spectral Initialization for High-Dimensional Nonconvex Estimation by Yue M. Lu and Gen Li
We study a spectral initialization method that serves a key role in recent work on estimating signals in nonconvex settings. Previous analysis of this method focuses on the phase retrieval problem and provides only performance bounds. In this paper, we consider arbitrary generalized linear sensing models and present a precise asymptotic characterization of the performance of the method in the high-dimensional limit. Our analysis also reveals a phase transition phenomenon that depends on the ratio between the number of samples and the signal dimension. When the ratio is below a minimum threshold, the estimates given by the spectral method are no better than random guesses drawn from a uniform distribution on the hypersphere, thus carrying no information; above a maximum threshold, the estimates become increasingly aligned with the target signal. The computational complexity of the method, as measured by the spectral gap, is also markedly different in the two phases. Worked examples and numerical results are provided to illustrate and verify the analytical predictions. In particular, simulations show that our asymptotic formulas provide accurate predictions for the actual performance of the spectral method even at moderate signal dimensions.
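For readers who have not seen the method being analyzed, here is a minimal sketch of spectral initialization for phase retrieval: form a data-weighted covariance matrix from the measurements and take its leading eigenvector as the starting estimate. This is the plain, untrimmed variant with arbitrary problem sizes; the paper studies more general sensing models and preprocessing functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase retrieval setup: observe y_i = |a_i^T x|^2 for Gaussian sensing vectors a_i.
n, m = 64, 6 * 64          # signal dimension and number of samples (ratio m/n = 6)
x = rng.normal(size=n)
x /= np.linalg.norm(x)
A = rng.normal(size=(m, n))
y = (A @ x) ** 2

# Spectral initialization: leading eigenvector of D = (1/m) sum_i y_i a_i a_i^T.
D = (A.T * y) @ A / m
eigvals, eigvecs = np.linalg.eigh(D)
x_hat = eigvecs[:, -1]     # eigenvector of the largest eigenvalue

# Alignment with the true signal, up to the global sign ambiguity.
print("cosine similarity:", abs(x_hat @ x))
```

Lowering the sampling ratio m/n toward the threshold identified in the paper makes the printed cosine similarity drop toward what a random guess would give.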




Tuesday, April 18, 2017

MLHardware: Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent / WRPN: Training and Inference using Wide Reduced-Precision Networks

(Personal message: I will be at ICLR next week, let's grab some coffee if you are there.)


As ML becomes more and more important, the hardware architecture it runs on needs to change as well. These changes are in turn wholly dependent on a number of trade-offs. Today, we have two such studies: one on quantization issues in neural networks (WRPN) and another on the influence of low precision on Stochastic Gradient Descent (something we have already seen for gradient descent).

For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks but also that reducing the precision of activations hurts model accuracy much more than reducing the precision of model parameters. We study schemes to train networks from scratch using reduced-precision activations without hurting the model accuracy. We reduce the precision of activation maps (along with model parameters) using a novel quantization scheme and increase the number of filter maps in a layer, and find that this scheme compensates or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly reduce the dynamic memory footprint, memory bandwidth, computational energy and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN - wide reduced-precision networks. We report results using our proposed schemes and show that our results are better than previously reported accuracies on ILSVRC-12 dataset while being computationally less expensive compared to previously reported reduced-precision networks.
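As a small illustration of what reduced-precision activations mean in practice, here is a hedged sketch of k-bit uniform quantization of activations clipped to [0, 1]; it is a generic scheme of that kind, not necessarily the exact WRPN quantizer, and the bit-width is an arbitrary example.

```python
import numpy as np

def quantize_activations(a, k=4):
    """Uniformly quantize activations to k bits after clipping to [0, 1].

    Forward-pass sketch only; training such a network typically pairs this
    with a straight-through estimator for the gradient."""
    a = np.clip(a, 0.0, 1.0)
    levels = 2**k - 1
    return np.round(a * levels) / levels

acts = np.random.default_rng(0).uniform(-0.2, 1.2, size=5)
print(acts)
print(quantize_activations(acts, k=4))   # values snapped to 16 levels in [0, 1]
```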


Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.
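The DMGC parameter space and the cache techniques are systems contributions, but the numerical core of low-precision SGD, updates kept on a fixed-point grid with unbiased (stochastic) rounding, can be sketched in a few lines. The fixed-point scale and the toy least-squares problem below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_to_fixed_point(x, scale=128):
    """Round x to a fixed-point grid of spacing 1/scale, choosing the lower or
    upper neighbor at random so that the rounding is unbiased in expectation."""
    scaled = x * scale
    low = np.floor(scaled)
    prob_up = scaled - low
    return (low + (rng.random(x.shape) < prob_up)) / scale

# One low-precision SGD step on a least-squares toy problem.
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10)
w = np.zeros(10)                         # weights live on the fixed-point grid
lr = 0.01
batch = rng.choice(256, size=32, replace=False)
grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
w = stochastic_round_to_fixed_point(w - lr * grad)
print(w)
```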




Monday, April 17, 2017

F2F: A Library For Fast Kernel Expansions

(Personal message: I will be at ICLR next week, let's grab some coffee if you are there.)

F2F is a C++ library for large-scale machine learning. It contains a CPU optimized implementation of the Fastfood algorithm, that allows the computation of approximated kernel expansions in loglinear time. The algorithm requires to compute the product of Walsh-Hadamard Transform (WHT) matrices. A cache friendly SIMD Fast Walsh-Hadamard Transform (FWHT) that achieves compelling speed and outperforms current state-of-the-art methods has been developed. F2F allows to obtain non-linear classification combining Fastfood and a linear classifier.
I am told by one of the authors that the library should be out at some point.
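While waiting for the release, here is a hedged NumPy-only sketch of the structure Fastfood exploits: replace the dense Gaussian projection of random Fourier features with diagonal sign, Gaussian and scaling matrices, a permutation, and two fast Walsh-Hadamard transforms, so the projection costs O(d log d). The scaling constants below follow the original Fastfood construction only loosely, and d is assumed to be a power of two.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector (unnormalized)."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fastfood_features(x, sigma=1.0, seed=0):
    """Sketch of one Fastfood block V x = S H G P H B x for the RBF kernel.

    Plays the role of the dense random projection in random Fourier features,
    but uses only diagonal matrices, a permutation, and two FWHTs."""
    r = np.random.default_rng(seed)
    d = len(x)
    B = r.choice([-1.0, 1.0], size=d)           # random sign flip
    P = r.permutation(d)                        # random permutation
    G = r.normal(size=d)                        # Gaussian scaling
    # S rescales rows so their norms match those of i.i.d. Gaussian rows.
    S = r.chisquare(d, size=d) ** 0.5 / np.linalg.norm(G)
    v = fwht(B * x)
    v = fwht(G * v[P])
    v = S * v / (sigma * np.sqrt(d))
    b = r.uniform(0.0, 2.0 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(v + b)

x = rng.normal(size=64)
print(fastfood_features(x)[:8])
```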




Understanding Shallower Networks

We may need to understand deep neural networks (see Understanding Deep Neural Networks), but we also need to understand shallower networks.




Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there has been a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.



Remarkable success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods have encountered obstacles in scaling to large data, despite excellent performance on smaller datasets, and extensive theoretical analysis. Practical methods, such as variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architecture.
In this paper we identify a basic limitation in gradient descent-based optimization in conjunctions with smooth kernels. An analysis demonstrates that only a vanishingly small fraction of the function space is reachable after a fixed number of iterations drastically limiting its power and resulting in severe over-regularization. The issue is purely algorithmic, persisting even in the limit of infinite data.
To address this issue, we introduce EigenPro iteration, based on a simple preconditioning scheme using a small number of approximately computed eigenvectors. It turns out that even this small amount of approximate second-order information results in significant improvement of performance for large-scale kernel methods. Using EigenPro in conjunction with stochastic gradient descent we demonstrate scalable state-of-the-art results for kernel methods on a modest computational budget.
Finally, these results indicate a need for a broader computational perspective on modern large-scale learning to complement more traditional statistical and convergence analyses. In particular, systematic analysis concentrating on the approximation power of algorithms with a fixed computation budget will lead to progress both in theory and practice.
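To make the preconditioning idea concrete, here is a hedged sketch on a toy kernel regression: damp the top few eigen-directions of the kernel matrix so that gradient descent can take much larger steps on the rest of the spectrum. This illustrates the mechanism only; the actual EigenPro iteration uses stochastic gradients and Nyström-approximated eigenvectors, and the kernel, step size, and eigenvector count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Kernel regression on toy 1-D data with a Gaussian kernel.
n = 300
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)
K = np.exp(-0.5 * (X - X.T) ** 2)

# Preconditioner: shrink the k largest eigen-directions of K down to the
# (k+1)-th eigenvalue, so a much larger step size becomes stable.
k = 10
eigvals, eigvecs = np.linalg.eigh(K)             # ascending order
top_vals, top_vecs = eigvals[-k:], eigvecs[:, -k:]
lam_next = eigvals[-k - 1]

def precondition(g):
    """Apply P = I - sum_i (1 - lam_{k+1}/lam_i) e_i e_i^T to a vector g."""
    coeffs = (1.0 - lam_next / top_vals) * (top_vecs.T @ g)
    return g - top_vecs @ coeffs

alpha = np.zeros(n)
eta = 1.0 / lam_next     # stable with P; typically divergent without it (lam_max >> lam_next)
for _ in range(200):
    alpha -= eta * precondition(K @ alpha - y)   # preconditioned kernel gradient step

print("train MSE:", np.mean((K @ alpha - y) ** 2))
```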



Friday, April 14, 2017

Understanding Deep Neural Networks: Opening the Black Box, Energy Propagation in DNNs, Exploring Loss Function Topology

Today, we have three papers trying to explain why deep neural networks work.  From the first paper:
"In our experiments, most of the optimization epochs are spent on compressing the internal representations under the training error constraint. This compression occurs by the SGD without any other explicit regularization or sparsity, and - we believe - is largely responsible for the absence of overfitting in DL. This observation also suggests that there are many (exponential in the number of weights) different randomized networks with essentially optimal performance. "

 


Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work [Tishby & Zaslavsky (2015)] proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer.
In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. We first show that the stochastic gradient descent (SGD) epochs have two distinct phases: fast empirical error minimization followed by slow representation compression, for each layer. We then argue that the DNN layers end up very close to the IB theoretical bound, and present a new theoretical argument for the computational benefit of the hidden layers.
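The information-plane quantities themselves are straightforward to estimate once activations are discretized; here is a hedged sketch of that measurement step, using binning and empirical counts of mutual information. The stand-in random activations, bin edges, and label setup are arbitrary placeholders; in the paper these come from a trained network on a synthetic task.

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    return sum((c / n) * np.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

# Information-plane style measurement for one hidden layer: discretize the
# layer's activations into bins, then estimate I(T;X) and I(T;Y).
rng = np.random.default_rng(0)
X_id = np.arange(1000)                        # each input treated as its own symbol
Y = rng.integers(0, 2, size=1000)             # binary labels
hidden = rng.normal(size=(1000, 8))           # stand-in for a layer's activations
T = [tuple(row) for row in np.digitize(hidden, bins=np.linspace(-3, 3, 30))]
print("I(T;X) ~", mutual_information(T, X_id))
print("I(T;Y) ~", mutual_information(T, Y))
```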


Many practical machine learning tasks employ very deep convolutional neural networks. Such large depths pose formidable computational challenges in training and operating the network. It is therefore important to understand how many layers are actually needed to have most of the input signal's features be contained in the feature vector generated by the network. This question can be formalized by asking how quickly the energy contained in the feature maps decays across layers. In addition, it is desirable that none of the input signal's features be "lost" in the feature extraction network or, more formally, we want energy conservation in the sense of the energy contained in the feature vector being proportional to that of the corresponding input signal. This paper establishes conditions for energy conservation for a wide class of deep convolutional neural networks and characterizes corresponding feature map energy decay rates. Specifically, we consider general scattering networks, and find that under mild analyticity and high-pass conditions on the filters (which encompass, inter alia, various constructions of Weyl-Heisenberg filters, wavelets, ridgelets, (α)-curvelets, and shearlets) the feature map energy decays at least polynomially fast. For broad families of wavelets and Weyl-Heisenberg filters, the guaranteed decay rate is shown to be exponential. Our results yield handy estimates of the number of layers needed to have at least ((1−ε)⋅100)% of the input signal energy be contained in the feature vector.

Exploring loss function topology with cyclical learning rates by Leslie N. Smith, Nicholay Topin
We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at this https URL
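For readers who want to try the training regime behind these observations, here is a minimal sketch of the triangular cyclical learning rate schedule used in this line of work; the base rate, maximum rate, and step size below are arbitrary example values.

```python
import numpy as np

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-1, step_size=2000):
    """Triangular cyclical learning rate: ramp linearly from base_lr to max_lr
    and back down, repeating every 2 * step_size iterations."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in [0, 1000, 2000, 3000, 4000]:
    print(it, triangular_clr(it))   # rises to max_lr at 2000, back to base_lr at 4000
```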


