Nuit Blanche: Scalable and Sustainable Deep Learning via Randomized Hashing

Monday, March 07, 2016

Scalable and Sustainable Deep Learning via Randomized Hashing

I very much like the following paper, especially the introduction, which poses in no uncertain terms the scalability issue of current architectures:

"...Deep Learning is revolutionizing big-data applications, after being responsible for groundbreaking improvements in object classification (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). With the recent upsurge in data, at a much faster rate than our computing capabilities, neural networks are growing deeper in order to process the information more effectively. Microsoft’s deep residual network (He et al., 2015) that won the ILSVRC 2015 competition had 152 layers and 3.6 billion FLOPs. To handle such large neural networks, researchers usually train them on high performance graphics cards or large clusters.

Graphic processing units (GPUs) are well suited at processing the expensive matrix multiplication operations found in the forward and back propagation steps of neural network computation. However, there are some challenges that come with using GPUs to train deep networks. For one, the amount of memory available on GPUs is limited, and so transferring data back and forth between main memory and the graphics card is a bottleneck. In addition, the disparity between network bandwidth and GPU processing speed limits scaling a GPU cluster beyond a single machine. These challenges limit the scalability of deep networks with giant parameter spaces on GPUs with current algorithms.

In distributed computing environments, the parameter space of giant deep networks is split across multiple nodes (Dean et al., 2012). This setup requires costly communication and synchronization between the parameter server to transfer the gradient and parameter updates. There is no clear way to avoid the costly synchronization without resorting to some ad-hoc breaking of the network. This ad-hoc breaking of deep networks is not well understood and is likely to hurt performance and to increase the risk of diverging.

While deep networks are growing larger and more complex, there is a push for greater energy efficiency in order to satisfy the growing popularity of machine learning applications on mobile phones and low-power devices. These devices are designed for long battery life, and costly matrix multiplications, although parallelizable, are not energy-efficient. Recent work by (Chen et al., 2015) demonstrates a technique to compress a neural network's parameter space through hashing in order to minimize its memory footprint. However, reducing the computational costs of neural networks, which directly translates into longer battery life, remains a critical issue...."

You can read the rest here: Scalable and Sustainable Deep Learning via Randomized Hashing by Ryan Spring and Anshumali Shrivastava

Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.
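To make the idea in the abstract a little more concrete, here is a minimal sketch of hash-based node selection for a single layer. It is not the authors' implementation. It assumes plain signed random projections (SimHash), which favor a small angle between input and weight vector, as a stand-in for the asymmetric maximum inner product search hashing the paper relies on. The names `HashedLayer`, `simhash`, `num_bits` and `num_tables` are hypothetical, and the sketch leaves out backpropagation and the re-hashing that weight updates would require.

```python
import random


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, planes):
    """Signed random projection: one bit per hyperplane, packed into an int."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


class HashedLayer:
    """One layer whose nodes are indexed in several hash tables (illustrative only)."""

    def __init__(self, weights, num_bits=6, num_tables=8, seed=0):
        rng = random.Random(seed)
        self.weights = weights  # one weight vector per node
        dim = len(weights[0])
        self.planes = [make_hyperplanes(num_bits, dim, rng) for _ in range(num_tables)]
        self.tables = []
        for planes in self.planes:
            table = {}
            for node, w in enumerate(weights):
                table.setdefault(simhash(w, planes), []).append(node)
            self.tables.append(table)

    def active_nodes(self, x):
        """Nodes sharing a bucket with the input in at least one table."""
        active = set()
        for planes, table in zip(self.planes, self.tables):
            active.update(table.get(simhash(x, planes), []))
        return active

    def forward(self, x):
        """Sparse forward pass: ReLU activations for the active nodes only.

        Nodes that were not retrieved are treated as having activation zero.
        """
        out = {}
        for node in self.active_nodes(x):
            z = sum(w * v for w, v in zip(self.weights[node], x))
            out[node] = max(0.0, z)
        return out


if __name__ == "__main__":
    rng = random.Random(1)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
    layer = HashedLayer(weights)
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]
    sparse = layer.forward(x)
    print(len(sparse), "of", len(weights), "nodes computed")
```

In this sketch the savings come from computing inner products only for the retrieved nodes. The fraction of nodes touched depends on `num_bits` and `num_tables`: more bits per table make buckets smaller, while more tables raise the chance that a well-aligned node is retrieved.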