Friday, September 19, 2014

Randomized Kernels, Randomized Solvers, Random features and more

Here are a slew of very interesting papers connecting to the earlier Hardware Based Stochastic Gradient Descent and much earlier random features:

High-performance Kernel Machines with Implicit Distributed Optimization and Randomization by Vikas Sindhwani, Haim Avron
Complex machine learning tasks arising in several domains increasingly require "big models" to be trained on "big data". Such models tend to grow with the complexity and size of the training data, and do not make strong parametric assumptions upfront on the nature of the underlying statistical dependencies. Kernel methods constitute a very popular, versatile and principled statistical methodology for solving a wide range of non-parametric modelling problems. However, their storage requirements and high computational complexity poses a significant barrier to their widespread adoption in big data applications. We propose an algorithmic framework for massive-scale training of kernel-based machine learning models. Our framework combines two key technical ingredients: (i) distributed general purpose convex optimization for a class of problems involving very large but implicit datasets, and (ii) the use of randomization to significantly accelerate the training process as well as prediction speed for kernel-based models. Our approach is based on a block-splitting variant of the Alternating Directions Method of Multipliers (ADMM) which is carefully reconfigured to handle very large random feature matrices only implicitly, while exploiting hybrid parallelism in compute environments composed of loosely or tightly coupled clusters of multicore machines. Our implementation supports a variety of machine learning tasks by enabling several loss functions, regularization schemes, kernels, and layers of randomized approximations for both dense and sparse datasets, in a highly extensible framework. We study the scalability of our framework on both commodity clusters as well as on BlueGene/Q, and provide a comparison against existing sequential and parallel libraries for such problems.

Scalable Kernel Methods via Doubly Stochastic Gradients by Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, Le Song
The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space in rate O(1/t), and achieves a generalization performance of O(1/t√). This doubly stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in number of iterations and independent of data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve competitive performance to neural nets in datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.

Large-scale randomized-coordinate descent methods with non-separable linear constraints by Sashank Reddi, Ahmed Hefny, Carlton Downey, Avinava Dubey, Suvrit Sra
We develop randomized (block) coordinate descent (CD) methods for linearly constrained convex optimization. Unlike most CD methods, we do not assume the constraints to be separable, but let them be coupled linearly. To our knowledge, ours is the first CD method that allows linear coupling constraints, without making the global iteration complexity have an exponential dependence on the number of constraints. We present algorithms and analysis for four key problem scenarios: (i) smooth; (ii) smooth + nonsmooth separable; (iii) asynchronous distributed; and (iv) stochastic. We illustrate empirical behavior of our algorithms by simulation experiments.

Compact Random Feature Maps by Raffay Hamid, Ying Xiao, Alex Gittens, Dennis DeCoste
Kernel approximation using randomized feature maps has recently gained a lot of interest. In this work, we identify that previous approaches for polynomial kernel approximation create maps that are rank deficient, and therefore do not utilize the capacity of the projected feature space effectively. To address this challenge, we propose compact random feature maps (CRAFTMaps) to approximate polynomial kernels more concisely and accurately. We prove the error bounds of CRAFTMaps demonstrating their superior kernel reconstruction performance compared to the previous approximation schemes. We show how structured random matrices can be used to efficiently generate CRAFTMaps, and present a single-pass algorithm using CRAFTMaps to learn non-linear multi-class classifiers. We present experiments on multiple standard data-sets with performance competitive with state-of-the-art results.

Random Laplace Feature Maps for Semigroup Kernels on Histograms Jiyan Yang, Vikas Sindhwani, Quanfu Fan, Haim Avron

With the goal of accelerating the training and test ing complexity of nonlinear kernel methods, several recent papers have proposed explicit embeddings of the input data into low-dimensional feature spaces, where fast linear methods can instead be used to generate approximate solutions. Analogous to random Fourier feature maps to approximate shift-invariant kernels, such as the Gaussian kernel, on Rd, we develop a new randomized technique called random Laplace features, to approximate a family of kernel functions adapted to the semigroup structure of Rd+. This is the natural algebraic structure on the set of histograms and other non-negative data representations. We provide theoretical results on the uniform convergence of random Laplace features. Empirical analyses on image classification and surveillance event detection tasks demonstrate the attractiveness of using random Laplace features relative to several other feature maps proposed in the literature.


Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels by Jiyan Yang, Vikas Sindhwani, Haim Avron, Michael Mahoney

We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.

From Kernel Machines to Ensemble Learning by Chunhua Shen, Fayao Liu

Ensemble methods such as boosting combine multiple learners to obtain better prediction than could be obtained from any individual learner. Here we propose a principled framework for directly constructing ensemble learning methods from kernel methods. Unlike previous studies showing the equivalence between boosting and support vector machines (SVMs), which needs a translation procedure, we show that it is possible to design boosting-like procedure to solve the SVM optimization problems. In other words, it is possible to design ensemble methods directly from SVM without any middle procedure. This finding not only enables us to design new ensemble learning methods directly from kernel methods, but also makes it possible to take advantage of those highly-optimized fast linear SVM solvers for ensemble learning. We exemplify this framework for designing binary ensemble learning as well as a new multi-class ensemble learning methods.  Experimental results demonstrate the flexibility and usefulness of the proposed framework. 
Credits: ESA/Rosetta/MPS for OSIRIS Team MPS/UPD/LAM/IAA/SSO/INTA/UPM/DASP/IDA
Date: 15 September 2014
Satellite: Rosetta
Depicts: Philae's primary landing site

No comments:

Printfriendly