Nuit Blanche: Randomized Nonlinear Component Analysis

Wednesday, September 10, 2014

Randomized Nonlinear Component Analysis - implementation -

Following up on Saturday Morning Videos: Some ICML 2014 presentations, here is Randomized Nonlinear Component Analysis by David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, Bernhard Schölkopf:

Classical methods such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques are only able to reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, these are computationally prohibitive in the large scale.

In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving drastic savings in computational requirements.

In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. A simple R implementation of the presented algorithms is provided.

The implementation is here.
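
To make the idea in the abstract concrete, here is a minimal Python sketch of randomized nonlinear PCA: map the data through random Fourier features that approximate a Gaussian kernel, then run ordinary linear PCA on those features. This is not the authors' R code. The function names, the `gamma` bandwidth parameter and the choice of cosine features are assumptions made for illustration only.

```python
import numpy as np


def random_fourier_features(X, n_features=200, gamma=1.0, seed=0):
    """Map X (n x d) to random cosine features whose inner products
    approximate the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2).
    (Illustrative helper; name and parameters are assumptions.)"""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phases make a single cosine per frequency sufficient.
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def randomized_nonlinear_pca(X, n_components=2, **feature_kwargs):
    """Linear PCA on random features: a cheap stand-in for kernel PCA."""
    Z = random_fourier_features(X, **feature_kwargs)
    Z = Z - Z.mean(axis=0)  # center in feature space
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    # Principal component scores of each sample.
    return U[:, :n_components] * s[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    scores = randomized_nonlinear_pca(X, n_components=2, n_features=300, gamma=0.5)
    print(scores.shape)
```

The cost is dominated by an SVD of an n x m feature matrix (m random features) instead of an eigendecomposition of the n x n kernel matrix, which is where the savings come from when m is much smaller than n.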

Let me note something we pointed out earlier on Nuit Blanche:

It is of special interest that randomized algorithms are in many cases more robust than their deterministic analogues (Mahoney, 2011) because of the implicit regularization induced by randomness.

Indeed, the seminal paper by Mike Mahoney was very clear on the advantages of randomization. Re-reading its introduction makes this plain, and the paper is the basis for RandNLA (Randomized Numerical Linear Algebra):

Randomized algorithms for matrices and data

Michael W. Mahoney

(Submitted on 29 Apr 2011 (v1), last revised 15 Nov 2011 (this version, v3))

Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, and this work was performed by individuals from many different research communities. This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. An emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. Randomized methods solve problems such as the linear least-squares problem and the low-rank matrix approximation problem by constructing and operating on a randomized sketch of the input matrix. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail.
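
As a small illustration of "constructing and operating on a randomized sketch of the input matrix", here is a hedged Python sketch of low-rank approximation with a Gaussian random projection. It uses the basic random range-finder idea, not the leverage-score sampling emphasized in the monograph. The function name and the `oversample` parameter are assumptions made for this example.

```python
import numpy as np


def randomized_low_rank(A, rank, oversample=10, seed=0):
    """Approximate the top `rank` singular triplets of A from a random sketch.
    (Illustrative helper; name and parameters are assumptions.)"""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)
    # Sketch: project A onto k random Gaussian directions.
    Omega = rng.normal(size=(n, k))
    Y = A @ Omega
    # Orthonormal basis for the range captured by the sketch.
    Q, _ = np.linalg.qr(Y)
    # Operate on the small k x n matrix instead of the full m x n one.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A 200 x 100 matrix of exact rank 5.
    A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
    U, s, Vt = randomized_low_rank(A, rank=5)
    rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
    print(rel_err)
```

The expensive SVD is applied only to the small matrix `B`; the full matrix `A` is touched through two matrix products, which is the part that parallelizes well.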
