Monday, May 22, 2017

Short and Deep: Sketching and Neural Networks

This is a revised version of an earlier preprint.
Short and Deep: Sketching and Neural Networks by Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar
Data-independent methods for dimensionality reduction such as random projections, sketches, and feature hashing have become increasingly popular in recent years. These methods often seek to reduce dimensionality while preserving the hypothesis class, resulting in inherent lower bounds on the size of projected data. For example, preserving linear separability requires Ω(1/γ2 ) dimensions, where γ is the margin, and in the case of polynomial functions, the number of required dimensions has an exponential dependence on the polynomial degree. Despite these limitations, we show that the dimensionality can be reduced further while maintaining performance guarantees, using improper learning with a slightly larger hypothesis class. In particular, we show that any sparse polynomial function of a sparse binary vector can be computed from a compact sketch by a single-layer neural network, where the sketch size has a logarithmic dependence on the polynomial degree. A practical consequence is that networks trained on sketched data are compact, and therefore suitable for settings with memory and power constraints. We empirically show that our approach leads to networks with fewer parameters than related methods such as feature hashing, at equal or better performance. 

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

1 comment:

SeanVN said...

I noticed this paper entitled "Exponential Capacity in an Autoencoder Neural Network with a Hidden Layer."
I would kinda guess the exponential capacity is because real value weights are used to encode the binary output. With finite precision arithmetic say, 16 bit half floats or 32 bit floats there probably is an optimal number of weights to sum together to get a result. After that you should likely use locality sensitive hashing to switch in different weight vectors. I also noticed recently that the new nVidia GPU chip offers a 120 Tflop half float matrix operation that might be suitable for random projections. If you were lucky you might get say 40 million 65536-point RPs per second out of it. That would be about 4000 times faster than I can get from my dual core CPU using SIMD.