Thursday, July 03, 2014

Rows vs Columns for Linear Systems of Equations - Randomized Kaczmarz or Coordinate Descent ?

Further studies on the Kaczmarz algorithm:

This paper is about randomized iterative algorithms for solving a linear system of equations Xβ=y in different settings. Recent interest in the topic was reignited when Strohmer and Vershynin (2009) proved the linear convergence rate of a Randomized Kaczmarz (RK) algorithm that works on the rows of X (data points). Following that, Leventhal and Lewis (2010) proved the linear convergence of a Randomized Coordinate Descent (RCD) algorithm that works on the columns of X (features). The aim of this paper is to simplify our understanding of these two algorithms, establish the direct relationships between them (though RK is often compared to Stochastic Gradient Descent), and examine the algorithmic commonalities or tradeoffs involved with working on rows or columns. We also discuss Kernel Ridge Regression and present a Kaczmarz-style algorithm that works on data points and having the advantage of solving the problem without ever storing or forming the Gram matrix, one of the recognized problems encountered when scaling kernelized methods.

Let us note that version 3 of reference 6 has been out for the past month and a half:

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning (L/μ)2(where L is a bound on the smoothness and μ on the strong convexity) to a linear dependence on L/μ. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.

Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress at theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least means squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the gap between the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.

Image Credit: NASA/JPL-Caltech
This image was taken by Rear Hazcam: Left B (RHAZ_LEFT_B) onboard NASA's Mars rover Curiosity on Sol 677 (2014-07-02 16:42:12 UTC).
Full Resolution

No comments: