Tuesday, April 18, 2017

MLHardware: Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent / WRPN: Training and Inference using Wide Reduced-Precision Networks

( Personal message: I will be at ICLR next week, let's grab some coffee if you are there. )

As ML becomes more and more important, the hardware architecture on which it runs needs to change as well. These changes, in turn, depend on a number of trade-offs. Today, we have two such studies: one on quantization issues in neural networks and another on the influence of low precision on Stochastic Gradient Descent (something we have already seen for gradient descent).

For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks but also that reducing the precision of activations hurts model accuracy much more than reducing the precision of model parameters. We study schemes to train networks from scratch using reduced-precision activations without hurting the model accuracy. We reduce the precision of activation maps (along with model parameters) using a novel quantization scheme and increase the number of filter maps in a layer, and find that this scheme compensates or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly reduce the dynamic memory footprint, memory bandwidth, computational energy and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN - wide reduced-precision networks. We report results using our proposed schemes and show that our results are better than previously reported accuracies on ILSVRC-12 dataset while being computationally less expensive compared to previously reported reduced-precision networks.
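To make the trade-off in this abstract concrete, here is a minimal sketch of reducing activation precision while widening a layer. It is not the paper's quantization scheme, which the abstract does not spell out: the uniform quantizer on [0, 1], the function names `quantize_unit` and `relative_activation_memory`, and the 32-bit baseline are all assumptions made for illustration.

```python
def quantize_unit(x, bits):
    """Clip x to [0, 1] and snap it to the nearest of 2**bits evenly spaced values.

    Assumption: a plain uniform quantizer, used here only as a stand-in for
    whatever scheme the paper actually proposes.
    """
    levels = (1 << bits) - 1          # number of steps between 0 and 1
    clipped = min(max(x, 0.0), 1.0)   # activations outside [0, 1] saturate
    return round(clipped * levels) / levels


def relative_activation_memory(bits, widen, baseline_bits=32):
    """Activation memory of a widened, low-precision layer relative to the baseline.

    `widen` is the factor by which the number of filter maps is multiplied.
    Assumption: memory scales linearly with both bit width and map count.
    """
    return widen * bits / baseline_bits


# A 2-bit activation can only take the values 0, 1/3, 2/3 and 1.
two_bit = quantize_unit(0.4, bits=2)

# Doubling the filter maps while using 4-bit activations still needs
# only a quarter of the 32-bit baseline's activation memory.
ratio = relative_activation_memory(bits=4, widen=2)
```

Under these assumptions, widening costs far less memory than the precision reduction saves, which is the budget the abstract says it spends on recovering accuracy.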

Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.
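As a rough illustration of what low-precision SGD means, the sketch below fits a single weight while storing it on a fixed-point grid. It is not BUCKWILD!: it is single-threaded, so it models no asynchrony, and the unbiased stochastic rounding, the 8 fractional bits, the toy problem y = 3x and all function names are assumptions chosen for the example.

```python
import math
import random

FRACTION_BITS = 8            # assumption: weight stored in steps of 1/256
SCALE = 1 << FRACTION_BITS


def stochastic_round(value, rng):
    """Round to a neighbouring integer so that the result is unbiased on average."""
    low = math.floor(value)
    return low + (1 if rng.random() < value - low else 0)


def low_precision_sgd(steps, lr, rng):
    """Fit w in y = w * x with the weight kept as a fixed-point integer."""
    w_fixed = 0                              # integer representation of the weight
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x                          # toy target: the true weight is 3
        w = w_fixed / SCALE                  # decode to a real number
        grad = (w * x - y) * x               # gradient of 0.5 * (w*x - y)**2
        w_fixed = stochastic_round((w - lr * grad) * SCALE, rng)
    return w_fixed / SCALE


w_final = low_precision_sgd(steps=2000, lr=0.1, rng=random.Random(0))
```

With stochastic rounding, updates smaller than one grid step still move the weight in expectation, which is why this toy run can approach the true value even though each stored weight has only 8 fractional bits.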



Moseba said...

Hi great work! Can you give me an example how AlexNet arch looks like after using your method?

SeanVN said...

I was kind of interested in how you would evolve low precision neural networks.
There is the idea of scale free optimization where you don't pick a characteristic scale for mutations, instead mutations are evenly spread across the decades of magnitude you are interested in, and obviously there are not so many of those.
An example of such a mutation would be randomly + or - exp(-c*rnd()).
It is also highly related to using simple bit flipping as a mutation since a mutation of one bit in an 8 bit unsigned integer results in a change of 1,2,4,8,16,32,64 or 128. Which also follows an exponential curve.
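
The two mutations described in this comment can be sketched as follows. This is one possible reading of the comment, not code from either paper: the function names, the choice c = 10 and the starting value 100 are assumptions for illustration.

```python
import math
import random


def scale_free_mutation(c, rng):
    """Return +/- exp(-c * u) with u uniform in [0, 1).

    The magnitude is spread evenly on a log scale between exp(-c) and 1,
    so no single characteristic step size is preferred.
    """
    sign = rng.choice((-1.0, 1.0))
    return sign * math.exp(-c * rng.random())


def bit_flip_mutation(value, rng, bits=8):
    """Flip one randomly chosen bit of an unsigned integer `bits` wide."""
    return value ^ (1 << rng.randrange(bits))


rng = random.Random(0)
steps = [scale_free_mutation(c=10.0, rng=rng) for _ in range(1000)]
flips = [bit_flip_mutation(100, rng) for _ in range(1000)]

# Each flip changes the value by exactly one power of two.
deltas = sorted({abs(flipped - 100) for flipped in flips})
```

Here `deltas` collects the distinct step sizes produced by flipping single bits of an 8-bit value, which are the powers of two from 1 to 128 that the comment lists.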