Wednesday, May 24, 2017

The High-Dimensional Geometry of Binary Neural Networks

Ah ! This is interesting.

Recent research has shown that one can train a neural network with binary weights and activations at train time by augmenting the weights with a high-precision continuous latent variable that accumulates small changes from stochastic gradient descent. However, there is a dearth of theoretical analysis to explain why we can effectively capture the features in our data with binary weights and activations. Our main result is that the neural networks with binary weights and activations trained using the method of Courbariaux, Hubara et al. (2016) work because of the high-dimensional geometry of binary vectors. In particular, the ideal continuous vectors that extract out features in the intermediate representations of these BNNs are well-approximated by binary vectors in the sense that dot products are approximately preserved. Compared to previous research that demonstrated the viability of such BNNs, our work explains why these BNNs work in terms of the HD geometry. Our theory serves as a foundation for understanding not only BNNs but a variety of methods that seek to compress traditional neural networks. Furthermore, a better understanding of multilayer binary neural networks serves as a starting point for generalizing BNNs to other neural network architectures such as recurrent neural networks.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.


SeanVN said...

It should be possible to do "no multiply" neural nets using random sign flipping + WHT random projections and the signof function using the architecture in this paper:

In hardware all you would need are low transistor count, low power requirement integer add and subtract operations and a few other simple bit operators. Avoiding much more complex and space consuming multiply logic circuit. It should be quite easy to pipe-line the operations on a FPGA. One other thing is that you may not need full precision integer +- because 2's complement overflow would simply increase the amount of nonlinearity, but probably not too much.

SeanVN said...
This comment has been removed by the author.
SeanVN said...

1 billion by 100 million operations is 100 Peta operations per second dude.

SeanVN said...

It would seem you could fit about 100 million integer add/subtract logic units on a current semiconductor die. Clock them at 1 billion operations per second and you have 100 Peta operations per second available for "no multiply" nets.

SeanVN said...

Tested in code:
Conclusion: Very good

The idea is very applicable to locality sensitive hashing as well.