Friday, March 28, 2014

Dimensionality reduction: Spectral Sparse Representation for Clustering and a survey

From The Princess Bride. 
A note on why the name of an algorithm is not the same as the implementation of that algorithm. 

Last week, on top of doing a presentation at Paris IoT #5 on Making Sense of IoT data where I articulated, through three examples, that data sharing among sensors (and algorithms mentioned on Nuit Blanche) was really the only way to "produce business opportunities and save the world" (TM); I was also presenting a poster at JIONC (abstract is at From Direct Imaging to Machine Learning … and more, in 15 minutes) centered on the convergence between several tools in Machine Learning and sensing. The common element that enables this discussion across very different fields is none other that the slew of Advanced Matrix Factorization techniques listed in the Advanced Matrix Factorization Jungle page. Some of these arguments had already been mentioned in the presentation at the Paris Machine Learning Meetup #9 (Advanced Matrix Factorizations, Machine Learning and all that)

To make a long story short, I have since prettied up the Jungle page and found the following two relevant preprints. Enjoy!

Dimensionality reduction methods, e.g. PCA and Laplacian eigenmap (LE), and cluster analysis methods, e.g. K-means and ratio cut (Rcut), are two kinds of widely-used unsupervised data analysis tools. The former concerns high representation accuracy, while the latter semantics. Some preliminary relations between these methods have been established in the literature, but they are not yet integrated into a unified framework. In this paper, we show that under an ideal condition, the four methods: PCA, K-means, LE, and Rcut, are unified together; and when the ideal condition is relaxed, the unification evolves to a new sparse representation method, called spectral sparse representation (SSR). It achieves the same representation accuracy as PCA/LE, and is able to reveal the cluster structure of data as K-means/Rcut. SSR is inherently related to cluster analysis, and the sparse codes can be directly used to do clustering. An efficient algorithm NSCrt is developed to solve the sparse codes of SSR. It is observed that NSCrt is able to effectively recover the underlying solutions. As a direct application of SSR, a new clustering algorithm Scut is devised. It reaches the start-of-the-art performance. Compared with K-means based clustering methods, Scut does not depend on initialization and avoids getting trapped in local minima; and the solutions are comparable to the optimal ones of K-means run many times. Extensive experiments using data sets of different nature demonstrate the properties and strengths of SSR, NSCrt, and Scut.

A survey of dimensionality reduction techniques by C.O.S. Sorzano, J. Vargas, A. Pascual Montano

Experimental life sciences like biology or chemistry have seen in the recent decades an explosion of the data available from experiments. Laboratory instruments become more and more complex and report hundreds or thousands measurements for a single experiment and therefore the statistical methods face challenging tasks when dealing with such high dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures making possible this reduction are called dimensionality reduction techniques; they have widely been developed by fields like Statistics or Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimension reduction techniques available and give the mathematical insight behind them.
Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

No comments: