Nuit Blanche: Thesis: Randomized Algorithms For Large-Scale Strongly Overdetermined Linear Regression Problems

Monday, September 15, 2014

Thesis: Randomized Algorithms For Large-Scale Strongly Overdetermined Linear Regression Problems - Xiangrui Meng

Here is another thesis of interest: Randomized Algorithms For Large-Scale Strongly Overdetermined Linear Regression Problems by Xiangrui Meng

In the era of big data, distributed systems built on top of clusters of commodity hardware provide cheap and reliable storage and scalable data processing. With cheap storage, instead of storing only currently relevant data, most people choose to store data as much as possible, expecting that its value can be extracted later. In this way, exabytes (1018) of data are being created on a daily basis. However, extracting value from big data requires scalable implementation of advanced analytical algorithms beyond simple data processing, e.g., regression analysis and optimization. Many traditional methods are designed to minimize floating-point operations, which is the dominant cost of in-memory computation on a single machine. In a distributed environment, load balancing and communication including disk and network I/O can easily dominate computation. These factors greatly increase the complexity and challenge the way of thinking in the design of distributed algorithms. Randomized methods for big data analysis have received a great deal of attention in recent years because they are generally faster and simpler to implement than traditional methods, it is easier to distribute the work, and they sometimes have theoretically provable performance. In this work, we are most interested in random projection and random sampling algorithms for `2 regression and its robust alternative, `1 regression, with strongly rectangular data. Random projection and random sampling are used to create preconditioned systems that are easier to solve or sub-sampled problems that provide relative-error approximations. Our main result shows that in near input-sparsity time and only a few passes through the data we can obtain a good approximate solution, with high probability. Our theory holds for general p ∈ [1, 2], and thus we formulate our results in `p. In the first chapter, we introduce `p regression problems and `p-norm conditioning, as well as traditional solvers for `p regression problems and how they are affected by the condition number. The second chapter describes the solution framework, where we discuss how ellipsoidal rounding and subspace embedding are connected to `p regression and develop faster rounding and embedding algorithms via random projection and random sampling. Chapter 3 describes a parallel solver named LSRN for strongly over- or under-determined linear least squares (`2 regression), and Chapter 4 establishes the theory for `p subspace embedding and its application to `p regression.