Sunday, April 20, 2014

Sunday Morning Insight: Why You Should Care About Phase Transitions in Clustering

Figure from [1]

In the Advanced Matrix Factorization page, I have no listing for the phase transition section of the spectral clustering or its nonlinear cousin subspace clustering. Little did I know that Suresh Venkatasubramanian (from the GeomBlog fame) has written about it in his essay on clustering. In particular, he wrote this interesting entry entitled: Choosing the number of clusters III: Phase Transitions where he draws a connection with statistical physics
...How can you tell when this happens ? Here's a very elegant technique, first proposed by Kenneth Rose in his work on deterministic annealing. In the high-T regime, where every point is in the same cluster, the cluster center location can be computed by solving a simple convex optimization. As the process evolves, the positive definite matrix defining the optimization starts losing its positive definiteness, till some point when one of its eigenvalues goes to zero. This point can be computed analytically, and yields a specific temperature value at which the first clusters start to emerge.
Kenneth Rose has some nice examples illustrating this behaviour. As time goes one, more and more phase transitions start to appear, as more and more clusters start to emerge. What you end up with is a hierarchy of clusterings that end with the trivial clustering where all data lie in separate clusters.
This idea has been developed further with the information bottleneck method, which replaces both the distortion and entropy terms by terms involving the mutual information of the assignments. The free energy paradigm works the same way, although now the phase transition points can't be computed analytically (I'll have more to say about the information bottleneck method)
What's entirely weird is that the phase transitions still happen (and I want to stress this point), data sets will split up into different numbers of clusters at the same transition point ! We wrote a paper a few years ago that proposes a particular "temperature" to look for phase transitions in, and lo and behold, we were able to recover the "correct" number of clusters from planted data sets by watching the data split at this point. I'm still not sure why this happens, but it was quite interesting, and yielded one way of identifying "natural clusters" in the data....

from Albert Parker's Ph.D thesis [3]

which leads me to Lenka Zdeborova's presentation at Paris Machine Learning Meetup #8 ( How hard is it to find a needle in a haystack? ) who tells a similar story of a connection between statistical physics and clustering [4,5,6] in finding communities within large sparse networks. In particular, she and co-authors also describe, the non-backtracking matrix a better choice for the transformation of the data/graph representation than the Laplacian or the adjacency matrix [5].

Why should we care ? because as Adam Coates and Andrew Y. Ng stated recently in [7], these clustering algorithms, much like dictionary learning in the early deep neural network papers are the first steps for feeding data into some deep learning algorithms. From [7]:

...Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well...
This must be balanced against a more recent shift on data that Sander Dieleman, a redditor and the winner of the recent Kaggle competition on the Galaxy Challenge, identified in a recent Reddit thread :

...I still see plenty of questions on Metaoptimize, on the Deep Learning G+ community, on the Kaggle forums and on this subreddit, from people who seem to be unaware of this "paradigm shift". They ask about training autoencoders and RBMs for unsupervised feature learning, when it is often clear that a purely supervised approach would probably work at least as well for their problem (and is conceptually much simpler and easier to understand).
I think this is because they read papers from 2010-2012 advertising unsupervised pre-training as the holy grail of deep learning. That was only 2-4 years ago, so they can't really be blamed for assuming that this approach still represents the state of the art.
Of course unsupervised pre-training still has its applications, but for many problems it has been obsoleted. So I don't think it's a bad thing to draw some attention to this fact

so quite clearly supervised learning applications in human speech or images won't need much clustering like K-means/subspace clustering but it will remain key for real exploration like understanding Dolphin communication or learning heterogenous data among other. And as the map makers did in compressive sensing, those phase transitions, when they are identified, will be key to understand these data. Given a dataset, if these algorithms fail (i.e are not robust) in these unsupervised settings, then you are either not catching the right data (and need to expand them by making the network less sparse) or you are not using the right modeling tools (metric, ...) .

PS: All these phase transitions results appear for spectral clustering and none seem to have been found in the subspace clustering arena .... yet.

PPS: I'll add this blog entry to the phase transition of the Spectral Clustering section in the Advanced Matrix Factorization Jungle page. I just split the spectral clustering section from the subspace clustering one and came up with this division:
  • Spectral Clustering, A = DX with unknown D and X, solve for sparse X and X_i = 0 or 1 
  • Subspace Clustering, A = AX with unknown X, solve for sparse/other conditions on X 

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.


Sander Dieleman said...

To avoid any misunderstandings, I should perhaps clarify that I was talking specifically about unsupervised pre-training as a method of initialising the parameters of supervised models. K-means has never been particularly popular for that, to my knowledge (mainly autoencoders and RBMs).

When it comes to unsupervised feature learning I'm a fan of the (spherical) k-means approach as Adam Coates describes it. Its simplicity and speed of learning is unmatched. I've used it myself for the whale detection challenge on Kaggle, among other things.

- Sander

Igor said...

Thank you Sander for the feedback.