
Wednesday, April 30, 2014

Nuit Blanche in Review (April 2014)

What a month it has been since the end of March! First, we lost a unique individual (Seth Roberts). We also witnessed the breakout of convnets (here and here), while being made aware of the intricacies of deciphering dolphin communications. The Sunday Morning Insight entry entitled "Why You Should Care About Phase Transitions in Clustering" received more than +37 Google points, a record, while the list You are not paying attention if... generated about +6 Google points. We also saw several interesting sets of convergences.
We also saw how random projections are slowly permeating several avenues of research.
I also mentioned a thing or two about the sociology of blogging (What Could Possibly Go Wrong ? Peer-Review, Trolls, Data Breaches) while reaching the 3,000th published post. We also had about 15 implementations released by their respective authors, another record I think. Finally, we also had Another Sighting of Citing Nuit Blanche. Enjoy !

Implementations



Focused Interests

Machine Learning Meetups (slides and videos)

Tuesday, April 29, 2014

Seth Roberts

Seth Roberts is no longer with us.
Seth was a truly gifted person and we collectively won't be the same without him. Tonight, many people will have a heavier heart, I know I will.

Godspeed Seth !



Seth was mentioned in the following entries/presentations

TGA: Grassmann Averages for Scalable Robust PCA - implementation -

So it looks like we have a fast implementation of Robust PCA.


Grassmann Averages for Scalable Robust PCA by Søren Hauberg, Aasa Feragen, Michael J. Black

As the collection of large datasets becomes increasingly automated, the occurrence of outliers will increase: “big data” implies “big outliers”. While principal component analysis (PCA) is often used to reduce the size of data, and scalable solutions exist, it is well-known that outliers can arbitrarily corrupt the results. Unfortunately, state-of-the-art approaches for robust PCA do not scale beyond small-to-medium sized datasets. To address this, we introduce the Grassmann Average (GA), which expresses dimensionality reduction as an average of the subspaces spanned by the data. Because averages can be efficiently computed, we immediately gain scalability. GA is inherently more robust than PCA, but we show that they coincide for Gaussian data. We exploit that averages can be made robust to formulate the Robust Grassmann Average (RGA) as a form of robust PCA. Robustness can be with respect to vectors (subspaces) or elements of vectors; we focus on the latter and use a trimmed average. The resulting Trimmed Grassmann Average (TGA) is particularly appropriate for computer vision because it is robust to pixel outliers. The algorithm has low computational complexity and minimal memory requirements, making it scalable to “big noisy data.” We demonstrate TGA for background modeling, video restoration, and shadow removal. We show scalability by performing robust PCA on the entire Star Wars IV movie.


from the text:

To further emphasize the scalability of TGA, we compute the 20 leading components of the entire Star Wars IV movie. This consists of 179,415 frames with a resolution of 352 × 153. Computing these 20 components (see [15]) took 8.5 hours on an Intel Xeon E5-2650 with 128 GB memory.
my emphasis on the underlined word.

The project page is here and the attendant code is here. It will be added to the Advanced Matrix Factorization Jungle page shortly.
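For readers who want a feel for the mechanics, here is a minimal NumPy sketch of the basic Grassmann Average iteration for a single leading component, written from my reading of the paper (the released code is the reference implementation; the trimmed variant would essentially replace the mean below with a per-coordinate trimmed mean, and the toy data is made up):

import numpy as np

def grassmann_average(X, n_iter=20, seed=0):
    """Leading Grassmann Average component of the rows of X (rough sketch)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(X, axis=1)
    keep = norms > 0
    U = X[keep] / norms[keep][:, None]      # unit directions spanned by the data
    w = norms[keep]                          # weight each direction by its energy
    q = rng.standard_normal(X.shape[1])
    q /= np.linalg.norm(q)
    for _ in range(n_iter):
        signs = np.sign(U @ q)               # align each representative with the current estimate
        signs[signs == 0] = 1.0
        q = (w[:, None] * signs[:, None] * U).mean(axis=0)
        q /= np.linalg.norm(q)               # the overall scale is irrelevant
    return q

# toy check: the component should line up with the dominant data direction
X = np.random.default_rng(1).standard_normal((1000, 2)) @ np.diag([5.0, 1.0])
print(grassmann_average(X))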



Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Monday, April 28, 2014

A Comparison of Clustering and Missing Data Methods for Health Sciences

This is interesting.




In this paper, we compare and analyze clustering methods with missing data in health behavior research. In particular, we propose and analyze the use of compressive sensing's matrix completion along with spectral clustering to cluster health related data. The empirical tests and real data results show that these methods can outperform standard methods like LPA and FIML, in terms of lower misclassification rates in clustering and better matrix completion performance in missing data problems. According to our examination, a possible explanation of these improvements is that spectral clustering takes advantage of high data dimension and compressive sensing methods utilize the near-to-low-rank property of health data.
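As a very rough illustration of the pipeline described in the abstract (this is not the authors' code; the imputation step, the toy data and all parameters below are my own stand-ins), one could fill the missing entries with a simple low-rank imputation and hand the completed matrix to an off-the-shelf spectral clustering routine:

import numpy as np
from sklearn.cluster import SpectralClustering

def lowrank_impute(X, mask, rank=5, n_iter=100):
    """Fill missing entries (mask == False) with a rank-constrained estimate (simple sketch)."""
    Z = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # hard rank truncation
        Z = np.where(mask, X, Z_low)                   # keep the observed entries
    return Z

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(3, 1, (50, 20))])  # two hidden groups
mask = rng.random(X.shape) > 0.3                       # about 30% of the entries are missing
X_completed = lowrank_impute(X, mask)
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            random_state=0).fit_predict(X_completed)
print(labels)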



Saturday, April 26, 2014

What Could Possibly Go Wrong ? Peer-Review, Trolls, Data Breaches

Andrew just wrote about my being trolled on his blog a while back. Go read it, it is here: Sleazy sock puppet can’t stop spamming our discussion of compressed sensing and promoting the work of Xiteng Liu.

While re-reading that fateful blog entry and the attendant comments, I could not help thinking about another discussion currently going on at Stack Overflow: Why is Stack Overflow so negative of late? In short, good constructive discussions can be polluted very quickly by very few people, which can essentially destroy the goodwill of the rest of the community.

This led me to ponder the current pre-publication peer-review system. We all agree there should be some peer review, but why constrain it to a one-shot review that might include uneducated or, worse, dishonest people ? Sure, we all think of ourselves as fair and knowledgeable when we agree to do a review. However, since most pre-publication peer review is anonymous, anybody can become your 'peer'. It's not just a figment of my imagination. For some editors, it might be sufficient to give a talk on compressive sensing at a meeting to become a de facto specialist in the field. What could possibly go wrong with that ? Here is what could go wrong: check the S10 section of this conference. Sooner or later, the system becomes less robust because of a few.

The other problem I have with the anonymity of the current pre-publication peer-review system is that many people participating in it are deliberately giving away information about themselves (and the reviews they gave) to for-profit entities that have no compelling interest in protecting it. What could possibly go wrong ? It's not like data breaches are that common anyway. Remember folks, it's not a question of 'if' but a question of 'when'.

Friday, April 25, 2014

Slides: Kernel Methods for Big Data

I just noticed this workshop on Kernel Methods for Big Data, which took place recently in Lille. Thanks to the organizers: Alain Celisse, Julien Jacques and Guillemette Marot, here are the slides and attendant abstracts:

Arthur Gretton (University College London)
A short introduction to reproducing kernel Hilbert spaces
The first lecture is an introduction to RKHS. The second covers embeddings of probabilities to RKHS, and characteristic RKHS (for which these embeddings are unique to each probability measure). The third lecture covers advanced topics: relation of RKHS embeddings of probabilities and energy distances, optimal kernel choice for two-sample testing, testing three-way interactions, and Bayesian inference without models.
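As a small companion to these lectures, here is a hedged sketch of the quantity at the heart of the second one: the (biased) squared Maximum Mean Discrepancy between two samples, with a Gaussian kernel and made-up data; see Arthur's slides for the unbiased estimator and the test thresholds:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2: squared RKHS distance between the two empirical mean embeddings."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
Y = rng.normal(0.5, 1, (200, 2))          # a shifted distribution
print(mmd2_biased(X[:100], X[100:]))      # same distribution: close to zero
print(mmd2_biased(X, Y))                  # different distributions: noticeably larger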

Jean-Philippe Vert (Mines de Paris and Institut Curie)
slides: lecture
Learning with kernels: an introduction
In this short course I will give a general introduction to kernel methods, including support vector machines (SVM), for supervised classification and regression. I will explain how positive definite kernels and reproducing kernel Hilbert spaces (RKHS) allow to extend many simple linear methods to complex, structured and high-dimensional data such as images, texts or graphs, and give some hints on how kernels can be constructed for specific applications. I will also discuss the possibility to integrate heterogeneous data with multiple kernel learning (MKL).
Short talks

Sylvain Arlot (CNRS - Ecole Normale Supérieure)
slides: talk
Kernel change-point detection
We tackle the change-point problem with data belonging to a general set. We propose a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappe (2007). This penalty generalizes the one proposed for one dimensional signals by Lebarbier (2005). We prove it satisfies a non-asymptotic oracle inequality by showing a new concentration result in Hilbert spaces. Experiments on synthetic and real data illustrate the accuracy of our method, showing it can detect changes in the whole distribution of data, even when the mean and variance are constant. Our algorithm can also deal with data of complex nature, such as the GIST descriptors which are commonly used for video temporal segmentation.
Collaboration with Alain Celisse and Zaïd Harchaoui.


Francis Bach (Inria - Ecole Normale Supérieure)
slides: talk
Sharp analysis of low-rank kernel matrix approximation
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., O(n^2). Low-rank approximations of the kernel matrix are often considered as they allow the reduction of running time complexities to O(p^2 n), where p is the rank of the approximation. The practicality of such methods thus depends on the required rank p. In this talk, I will show that for approximations based on a random subset of columns of the original kernel matrix, the rank p may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of non-parametric estimators. This result enables simple algorithms that have subquadratic running time complexity, but provably exhibit the same predictive performance as existing algorithms.
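The column-sampling scheme analyzed in this talk is essentially the Nyström approximation; a minimal sketch of it (uniform sampling of p columns, Gaussian kernel, toy data, all of it my own choice rather than anything from the slides) goes like this:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom(X, p=50, sigma=1.0, seed=0):
    """Rank-p Nystrom approximation K ~ C W^+ C^T built from p random columns."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=p, replace=False)
    C = gaussian_kernel(X, X[idx], sigma)    # n x p block of the kernel matrix
    W = C[idx]                               # p x p block on the sampled points
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
K = gaussian_kernel(X, X)
K_approx = nystrom(X, p=100)
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))   # relative approximation error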

Charles Bouveyron (Université Paris Descartes)
slides: talk
Kernel discriminant analysis and clustering with parsimonious Gaussian process models
We present a family of parsimonious Gaussian process models which allow to build, from a finite sample, a model-based classifier in an infinite dimensional space. The proposed parsimonious models are obtained by constraining the eigen-decomposition of the Gaussian processes modeling each class. This allows in particular to use non-linear mapping functions which project the observations into infinite dimensional spaces. It is also demonstrated that the building of the classifier can be directly done from the observation space through a kernel function. The proposed classification method is thus able to classify data of various types such as categorical data, functional data or networks. Furthermore, it is possible to classify mixed data by combining different kernels. The methodology is as well extended to the unsupervised classification case and an EM algorithm is derived for the inference. Experimental results on various data sets demonstrate the effectiveness of the proposed method.
This is a joint work with S. Girard (INRIA & LJK) and M. Fauvel (INRA & Univ. Toulouse).

Jérémie Kellner (Univ. Lille 1, Inria)
slides: talk
Normality test in high (or infinite) dimension with kernels
Several kernel methods assume data in the kernel space have a Gaussian distribution. We design a goodness-of-fit test for normality in the Reproducing Kernel Hilbert Space (RKHS) to check this crucial assumption. The related statistic relies on Laplace transform, which allows to overcome some drawbacks of previous strategies such as the Maximum Mean Discrepancy (MMD). A theoretical analysis of the new test procedure is provided in terms of Type-I and Type-II errors. These good performances are also confirmed by simulation experiments carried out on both synthetic and real data. In particular our test strategy improves upon ongoing methods in high-dimensional settings as dimension increases, still exhibiting strong power where other competitors usually fail.

Rémi Lajugie (Ecole Normale Supérieure)
slides: talk
Large margin metric learning for constraint partitioning problems
We consider unsupervised partitioning problems based explicitly or implicitly on the minimization of Euclidean distortions, such as clustering, image or video segmentation, and other change-point detection problems. We emphasize on cases with specific structure, which include many practical situations ranging from mean-based change-point detection to image segmentation problems. We aim at learning a Mahalanobis metric for these unsupervised problems, leading to feature weighting and/or selection. This is done in a supervised way by assuming the availability of several (partially) labeled datasets that share the same metric. We cast the metric learning problem as a large-margin structured prediction problem, with proper definition of regularizers and losses, leading to a convex optimization problem which can be solved efficiently. Our experiments show how learning the metric can significantly improve performance on bioinformatics, video or image segmentation problems.

Cristian Preda (Université Lille 1, Inria)
slides: talk
RKHS methods for regression with functional data
We consider the regression problem with both predictor and response of functional type. The regression function is approximated over the space of functionals defined in some RKHS. Consistency of the estimators and numerical results are presented.

Guillem Rigaill (Univ. Evry Val-d'Essonne)
slides: talk
Kernel-based multiple change-point detection
We consider the problem of kernel-based multiple change-point detection. From an algorithmic point of view it is possible to use dynamic programming to recover the best segmentations in 1 to K segments. However a naive implementation of such algorithm is quadratic in time and it is also quadratic in space as one typically needs to store and access the elements of the Gram Matrix. This is a problem to process large profiles. We describe an algorithm which has the same time complexity and yet is linear in space. We also describe a simple heuristic which is linear in time. We illustrate this approach on 2 dimensional DNA copy number profiles.
This is a joint work with Alain Celisse, Guillemette Marot and Morgane Pierre-Jean

Dino Sejdinovic (University College London)
slides: talk
MCMC Kameleon: Kernel Adaptive Metropolis-Hastings
A Kernel Adaptive Metropolis-Hastings algorithm is introduced, for the purpose of sampling from a target distribution with strongly nonlinear support. The algorithm embeds the trajectory of the Markov chain into a reproducing kernel Hilbert space (RKHS), such that the feature space covariance of the samples informs the choice of proposal. The procedure is computationally efficient and straightforward to implement, since the RKHS moves can be integrated out analytically: our proposal distribution in the original space is a normal distribution whose mean and covariance depend on where the current sample lies in the support of the target distribution, and adapts to its local covariance structure. Furthermore, the procedure requires neither gradients nor any other higher order information about the target, making it particularly attractive for contexts such as Pseudo-Marginal MCMC. Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive samplers on multivariate, highly nonlinear target distributions, arising in both real-world and synthetic examples.
Joint work with Heiko Strathmann, Maria Lomeli Garcia, Christophe Andrieu and Arthur Gretton

Zoltan Szabo (University College London)
slides: talk
Learning on Distributions
Problems formulated in terms of distributions have recently gained widespread attention. An important task that belongs to this family is distribution regression: regressing to a real-valued response from a probability distribution. One particularly challenging difficulty of the task is its two-stage sampled nature: in practise we only have samples from sampled distributions. In my presentation I am going to talk about two (intimately related) directions to tackle this difficulty. Firstly, I am going to present a recently released information theoretical estimators open source toolkit capable of estimating numerous dependency, similarity measures on distributions in a nonparametric way. Next, I will propose an algorithmically very simple approach to tackle the distribution regression: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. I will show that (i) this technique is consistent in the two-stage sampled setting under fairly mild conditions, and (ii) it gives state-of-the-art results on supervised entropy learning and the prediction problem of aerosol optical depth based on satellite images.
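Zoltan's two-stage recipe is simple enough to sketch: estimate a kernel mean embedding for each bag of samples, then run a ridge regression on top of the embeddings (here with a plain linear kernel between embeddings for simplicity; the bags, bandwidth and regularization below are toy choices of mine, not anything from the talk):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def embedding_gram(bags_a, bags_b, sigma=1.0):
    """G[i, j] = estimated inner product of the mean embeddings of bag i and bag j."""
    return np.array([[gaussian_kernel(A, B, sigma).mean() for B in bags_b] for A in bags_a])

rng = np.random.default_rng(0)
means = rng.uniform(-2, 2, 60)                         # the response is the mean of each distribution
bags = [rng.normal(m, 1.0, (80, 1)) for m in means]    # we only observe samples from each distribution
G = embedding_gram(bags[:40], bags[:40])
alpha = np.linalg.solve(G + 1e-2 * np.eye(40), means[:40])    # ridge regression in the embedding space
pred = embedding_gram(bags[40:], bags[:40]) @ alpha
print(np.mean((pred - means[40:]) ** 2))                      # test mean squared error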




It's Friday afternoon, it's Hamming's time: Antipode Seismic Maps ?

When Carolyn Porco came into town for the Texas A&M Physics seminar in 1995, she presented us with the then-recent computations of how the atmosphere of Jupiter had been affected by the Shoemaker-Levy 9 comet. They looked like the simulation in the video below, except hers were more amplified, as you could see the wave coming back to the origin, anyhoo...



The fascinating part of that simulation was that, because the sphere (planet) is round, the impact of the comet would likely have an effect some time later on the exact opposite side of the planet: the initial ripple would meet on the other side with a similar force. That made a big impression on me. Ever since that talk, I have wondered whether geophysicists take into account some metric whereby a seismic event has a potential effect on the exact opposite side of the Earth.

Here is a highly speculative and probably wrong example. One of the most impressive volcanic explosions in recent times is that of Krakatoa island in 1883. While the main event occurred on August 26–27, 1883, the whole thing started with a series of lesser eruptions that began on May 20, 1883.

What is at the antipode of Krakatoa ? According to this tool:





a region west of Medellin, Colombia. 

From Wikipedia




Here is what I found on the Interweb, a paragraph from a book entitled Earthquakes by Arnold Boscowitz, published in 1890.


"...but on March 27, 1883, at 9-25 P.M., a long, deep underground muttering was heard at Iquique, the southernmost part of Peru. Soon after an earthquake shock startled the whole town. Two other terrestrial disturbances had occurred on the 7th of March at 1 1 '25 P.M. and at Andes (Chili), as well as in the town of Copiapo at 3 P.M. on the 8th of March. Besides, the volcano of Ometepa, in the lake of Nicaragua, which had been dormant for centuries, burst forth in smoke and flame. The shock felt at Copiapo on the 8th of March was also felt throughout nearly the whole of Colombia. At Cartagena and Turlio, at the mouth of the Atrato, the shock was violent, but did little damage. At Huda, upon the river Magdalena, the oscillation lasted more than a minute, and in the State of Antioquia much damage was done ; while Medellin, the capital of the state, came off almost scathless, though the cathedral suffered a good deal. In the town of Antioquia, the facade of the cathedral was suddenly thrown forwards in a slanting direction, several pillars were overthrown, and all the houses suffered more or less. At Garumal the prison and 35 houses were destroyed ; at Aquedas the town-hall was destroyed ; while at Abejirral the church and several of the houses were severely shaken and partly demolished. At Rinagana, the principal village in the territory of Darien, the huts, made of palm-branches, were overturned, and the rivers rose and fell with alarming rapidity. At the same time, the Taya Indians, living in the same district, had remarked with alarm the frequency of earthquakes and of the topographical changes which, they said, had completely modified the aspect of the country. They say that dull mutterings were constantly heard in the south- eastern region, which very few whites have explored ; and from this we may conclude that there was some volcanic agency at work in the district of Atrato. A large island at the mouth of the river Atrato, which had been hydrographically surveyed by a United States steamer in 1862, disappeared altogether..."
March 27, 1883, about two months before the beginning of Krakatoa's first series of eruptions, there was a seismic event that was worthy of a mention in some history books. I realize there is always seismic activity on Earth, but this one was worth a mention in that book, in the same way the Krakatoa event was.

I wonder if we'd discover more obvious connections between different spots if one were to simply draw past seismic events on antipode maps ? In particular, I wonder about Turkey and its Pacific Ocean antipode.


I am sure it would be easy to do: given the locations of past earthquakes, draw them directly on some antipode map.
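It really is a one-liner; the antipode of a point is just the latitude negated and the longitude shifted by 180 degrees. A quick sketch (the epicenter coordinates below are approximate and the 'catalog' is obviously made up):

def antipode(lat, lon):
    """Antipodal point: flip the latitude, shift the longitude by 180 degrees."""
    anti_lon = lon - 180.0 if lon > 0 else lon + 180.0
    return -lat, anti_lon

# Krakatoa sits at roughly (-6.10, 105.42); its antipode lands in western Colombia,
# consistent with the map above.
events = {"Krakatoa 1883": (-6.10, 105.42)}
for name, (lat, lon) in events.items():
    print(name, "->", antipode(lat, lon))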


Spectral redemption: clustering sparse networks - implementation -





Spectral redemption: clustering sparse networks by Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, Pan Zhang

Spectral algorithms are classic approaches to clustering and community detection in networks. However, for sparse networks the standard versions of these algorithms are suboptimal, in some cases completely failing to detect communities even when other algorithms such as belief propagation can do so. Here we introduce a new class of spectral algorithms based on a non-backtracking walk on the directed edges of the graph. The spectrum of this operator is much better-behaved than that of the adjacency matrix or other commonly used matrices, maintaining a strong separation between the bulk eigenvalues and the eigenvalues relevant to community structure even in the sparse case. We show that our algorithm is optimal for graphs generated by the stochastic block model, detecting communities all the way down to the theoretical limit. We also show the spectrum of the non-backtracking operator for some real-world networks, illustrating its advantages over traditional spectral clustering.

The implementation is on Pan Zhang's website.
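Pan Zhang's code is the reference; for the curious, here is a rough NumPy/networkx sketch of the idea on a toy two-block graph (building the directed-edge non-backtracking matrix explicitly, which only makes sense for small graphs; the parameters are mine):

import numpy as np
import networkx as nx

def non_backtracking_labels(G):
    """Two-community sketch: cluster on the second leading eigenvector of the non-backtracking matrix."""
    edges = [(u, v) for u, v in G.edges()] + [(v, u) for u, v in G.edges()]
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), i in index.items():
        for w in G.neighbors(v):
            if w != u:                              # walks are not allowed to backtrack
                B[i, index[(v, w)]] = 1.0
    vals, vecs = np.linalg.eig(B)
    v2 = vecs[:, np.argsort(-vals.real)[1]].real    # second leading eigenvector
    score = {n: 0.0 for n in G.nodes()}
    for (u, v), i in index.items():                 # aggregate edge entries onto their end node
        score[v] += v2[i]
    return {n: int(score[n] > 0) for n in G.nodes()}

G = nx.planted_partition_graph(2, 60, 0.1, 0.02, seed=0)    # sparse two-block graph
labels = non_backtracking_labels(G)
agreement = np.mean([labels[n] == (n >= 60) for n in G.nodes()])
print(max(agreement, 1 - agreement))                        # well above 0.5 on this toy graph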

STSBL-EM: Spatiotemporal Sparse Bayesian Learning with Applications to Compressed Sensing of Multichannel Physiological Signals - implementation -

Energy consumption is an important issue in continuous wireless telemonitoring of physiological signals. Compressed sensing (CS) is a promising framework to address it, due to its energy-efficient data compression procedure. However, most CS algorithms have difficulty in data recovery due to non-sparsity characteristic of many physiological signals. Block sparse Bayesian learning (BSBL) is an effective approach to recover such signals with satisfactory recovery quality. However, it is time-consuming in recovering multichannel signals, since its computational load almost linearly increases with the number of channels.
This work proposes a spatiotemporal sparse Bayesian learning algorithm to recover multichannel signals simultaneously. It not only exploits temporal correlation within each channel signal, but also exploits inter-channel correlation among different channel signals. Furthermore, its computational load is not significantly affected by the number of channels. The proposed algorithm was applied to brain computer interface (BCI) and EEG-based driver's drowsiness estimation. Results showed that the algorithm had both better recovery performance and much higher speed than BSBL. Particularly, the proposed algorithm ensured that the BCI classification and the drowsiness estimation had little degradation even when data were compressed by 80%, making it very suitable for continuous wireless telemonitoring of multichannel signals.
The implementation for ST-SBL is here. This is very interesting as we know Zhilin has been developing one of the solvers that goes beyond sparsity for reconstruction.

As it turns out, I also came across this competition on Kaggle that asks competitors to Predict visual stimuli from MEG recordings of human brain activity. From the competition page:


...The goal of this competition is to predict the category of a visual stimulus presented to a subject from the concurrent brain activity. The brain activity is captured with an MEG device which records 306 timeseries at 1KHz of the magnetic field associated with the brain currents. The categories of the visual stimulus for this competition are two: face and scrambled face. A stimulus and the concurrent MEG recording is called a trial and thousands of randomized trials were recorded from multiple subjects. The trials of some of the subjects, i.e. the train set, are provided to create prediction models. The remaining trials, i.e. the test set, belong to different subjects and they will be used to score the prediction models. Because of the variability across subjects in brain anatomy and in the patterns of brain activity, a certain degree of difference is expected between the data of different subjects and thus between the train set and the test set...

Why am I mentioning this ?





Because if one uses the right dictionary, ST-SBL might provide a good framework for choosing the right features for consumption in the translational competition set up by Kaggle. How so ? Since you are paying attention, you know that random projections can be equivalent to PCA and that this randomly generated phi matrix is key to getting to 80 percent compression in Zhilin et al.'s paper. In other words, if one uses a random phi and ST-SBL to get results similar to those of the paper with the Kaggle dataset, that will mean the random projections (phi matrix) can provide good features that can then be classified.

 Just a $5000 thought.
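To make that thought a little more concrete, here is a toy sketch of the last step only: compress each trial with a randomly generated phi and feed the compressed measurements straight to a linear classifier. Everything below (shapes, signal model, classifier) is made up and has nothing to do with the actual Kaggle data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_samples = 400, 1500                 # hypothetical trials x time samples
labels = rng.integers(0, 2, n_trials)           # stand-in for face vs. scrambled face
signals = rng.standard_normal((n_trials, n_samples)) + 0.1 * labels[:, None]

m = n_samples // 5                              # keep about 20 percent of the measurements
phi = rng.standard_normal((m, n_samples)) / np.sqrt(m)    # the randomly generated phi matrix
features = signals @ phi.T                      # compressed measurements used as features

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, features, labels, cv=5).mean())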


Thursday, April 24, 2014

CSjob: Internship Livermore, CA, Student Intern - Signal Processing Grad Summer, Sandia National Lab.

Jovana Helms just sent me the following:

Hi Igor, 
I am looking for a graduate summer intern with CS background to help on a research project at Sandia National Labs in Livermore, CA. Could you please post the position on Nuit Blanche? Please note that US citizenship is required for this position. 
Thanks,
Jovana
Jovana Helms, PhD
Senior Member of Technical Staff
Sandia National Laboratories
Sure Jovana, I can. As stated in the announcement, a green card is simply not enough.

PS: If you want to contact Jovana on the specifics of the internship, please contact her directly at: jilic@sandia.gov . 




Reconstructing Nonlinear Biochemical Networks




Inferring Cell-Scale Signalling Networks via Compressive Sensing by Lei Nie, Xian Yang, Ian Adcock, Zhiwei Xu, Yike Guo
Signalling network inference is a central problem in system biology. Previous studies investigate this problem by independently inferring local signalling networks and then linking them together via crosstalk. Since a cellular signalling system is in fact indivisible, this reductionistic approach may have an impact on the accuracy of the inference results. Preferably, a cell-scale signalling network should be inferred as a whole. However, the holistic approach suffers from three practical issues: scalability, measurement and overfitting. Here we make this approach feasible based on two key observations: 1) variations of concentrations are sparse due to separations of timescales; 2) several species can be measured together using cross-reactivity. We propose a method, CCELL, for cell-scale signalling network inference from time series generated by immunoprecipitation using Bayesian compressive sensing. A set of benchmark networks with varying numbers of time-variant species is used to demonstrate the effectiveness of our method. Instead of exhaustively measuring all individual species, high accuracy is achieved from relatively few measurements.

In this paper, we present a distributed algorithm for the reconstruction of large-scale nonlinear networks. In particular, we focus on the identification from time-series data of the nonlinear functional forms and associated parameters of large-scale nonlinear networks. Recently, a nonlinear network reconstruction problem was formulated as a nonconvex optimisation problem based on the combination of a marginal likelihood maximisation procedure with sparsity inducing priors. Using a convex-concave procedure (CCCP), an iterative reweighted lasso algorithm was derived to solve the initial nonconvex optimisation problem. By exploiting the structure of the objective function of this reweighted lasso algorithm, a distributed algorithm can be designed. To this end, we apply the alternating direction method of multipliers (ADMM) to decompose the original problem into several subproblems. To illustrate the effectiveness of the proposed methods, we use our approach to identify a network of interconnected Kuramoto oscillators with different network sizes (500~100,000 nodes).
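The reweighted-lasso core of that scheme is easy to sketch in isolation (the nonlinear dictionary construction and the ADMM-based distribution across subproblems are where the paper's actual work is; the solver, penalty and toy data below are my own choices):

import numpy as np
from sklearn.linear_model import Lasso

def reweighted_lasso(A, y, n_outer=5, lam=0.1, eps=1e-3):
    """Iteratively reweighted l1: re-solve a weighted lasso with weights 1/(|w| + eps)."""
    weights = np.ones(A.shape[1])
    for _ in range(n_outer):
        A_scaled = A / weights                   # absorb the current weights into the design
        fit = Lasso(alpha=lam, max_iter=10000).fit(A_scaled, y)
        w = fit.coef_ / weights                  # map back to the original variables
        weights = 1.0 / (np.abs(w) + eps)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 300))
x_true = np.zeros(300)
x_true[:5] = rng.normal(0, 2, 5)                 # a 5-sparse ground truth
y = A @ x_true + 0.01 * rng.standard_normal(100)
print(np.round(reweighted_lasso(A, y)[:8], 2))   # the first five entries should stand out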

Blind Image Deblurring by Spectral Properties of Convolution Operators - implementation -

Here is a somewhat 'older' blind deconvolution algorithm (compared to this one).


In this paper, we study the problem of recovering a sharp version of a given blurry image when the blur kernel is unknown. Previous methods often introduce an image-independent regularizer (such as Gaussian or sparse priors) on the desired blur kernel. For the first time, this paper shows that the blurry image itself encodes rich information about the blur kernel. Such information can be found through analyzing and comparing how the spectrum of an image as a convolution operator changes before and after blurring. Our analysis leads to an effective convex regularizer on the blur kernel which depends only on the given blurry image. We show that the minimizer of this regularizer guarantees to give good approximation to the blur kernel if the original image is sharp enough. By combining this powerful regularizer with conventional image deblurring techniques, we show how we could significantly improve the deblurring results through simulations and experiments on real images, especially when the blur is large. In addition, our analysis and experiments help explaining why edges are good features for image deblurring.
The implementation is on Guangcan Liu's page (search for "blind deconvolution (image deblurring)").
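The central observation, that blurring shrinks the spectrum of the image viewed as a convolution operator, is easy to poke at numerically: with periodic boundary conditions the operator is diagonalized by the 2D DFT, so its singular values are just the Fourier magnitudes of the image. A rough sketch with a random stand-in image (none of this is the authors' regularizer, just the observation it is built on):

import numpy as np

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))                        # stand-in for a sharp image

k = np.zeros((64, 64))
k[:5, :5] = 1.0 / 25.0                              # normalized 5x5 box blur, circularly embedded
blurred = np.real(np.fft.ifft2(np.fft.fft2(sharp) * np.fft.fft2(k)))

# Singular values of the image acting as a circular convolution operator = |2D DFT| of the image.
sv_sharp = np.sort(np.abs(np.fft.fft2(sharp)).ravel())[::-1]
sv_blurred = np.sort(np.abs(np.fft.fft2(blurred)).ravel())[::-1]
print(sv_sharp[:5])
print(sv_blurred[:5])                               # never larger than the sharp spectrum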



Wednesday, April 23, 2014

Active Subspace: Towards Scalable Low-Rank Learning - implementation -



Active Subspace: Towards Scalable Low-Rank Learning, Guangcan Liu and Shuicheng Yan

We address the scalability issues in low-rank matrix learning problems. Usually, these problems resort to solving nuclear norm regularized optimization problems (NNROPs), which often suffer from high computational complexities if based on existing solvers, especially under large-scale settings. Based on the fact that the optimal solution matrix to an NNROP is often low-rank, we revisit the classic mechanism of low-rank matrix factorization, based on which we present an active subspace algorithm for efficiently solving NNROPs by transforming large-scale NNROPs into small-scale problems. The transformation is achieved by factorizing the large-size solution matrix into the product of a small-size orthonormal matrix (active subspace) and another small-size matrix. Although such a transformation generally leads to non-convex problems, we show that a suboptimal solution can be found by the augmented Lagrange alternating direction method. For the robust PCA (RPCA) (Candes et al., 2009) problem, which is a typical example of NNROPs, theoretical results verify the sub-optimality of the solution produced by our algorithm. For the general NNROPs, we empirically show that our algorithm significantly reduces the computational complexity without loss of optimality.

The implementation is on Guangcan Liu's page, entitled 'solving the rank constrained RPCA problem' (released in Nov 2012). It will be added shortly to the Advanced Matrix Factorization Jungle page.
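The transformation leans on a simple fact that is worth checking numerically: if Q has orthonormal columns then the nuclear norm of QJ equals that of the small matrix J, so the nuclear-norm term of an NNROP can be handled entirely in the small-scale problem. A quick check (dimensions picked arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 500, 400, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))    # n x p active subspace, orthonormal columns
J = rng.standard_normal((p, m))                      # small-size factor

nuclear = lambda M: np.linalg.svd(M, compute_uv=False).sum()
print(nuclear(Q @ J), nuclear(J))                    # identical: ||QJ||_* == ||J||_* when Q^T Q = I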




A Convergence of Interests: IEEE SP Mag, COLT2014 and more...

I don't know if this is a sign of the times, but this Sunday Morning Insight entry entitled Why You Should Care About Phase Transitions in Clustering got 35+ Google recommendations. It's a record as far as this blog is concerned. Why a sign of the times ? Because I think that, in the same way we see a convergence between sensing and machine learning, there seems to be a similar convergence in themes related to the general topic of advanced matrix factorizations in theoretical computer science, mathematics, statistics and signal processing. Here are four examples that just showed up on my radar screen that all relate in one way or another to advanced matrix factorization.

Nicolas Gillis just mentioned to me that the IEEE Signal Processing Magazine is freely available online at: http://online.qmags.com/SIPR0514. Here are the articles he mentioned might be of interest to readers of Nuit Blanche:

Dustin Mixon just wrote a blog entry on Compressive classification and the rare eclipse problem, which we mentioned a while back.

Afonso Bandeira discusses The rank recovery problem, specifically a problem he and co-authors mentioned in [1]. I very much like the clear introduction, which goes like this:
Recovery problems in many fields are commonly solved under the paradigm of maximum likelihood estimation. Despite the rich theory it enjoys, in many instances the parameter space is exponentially large and non-convex, often rendering the computation of the maximum likelihood estimator (MLE) intractable. It is then common to settle for heuristics, such as expectation-maximization. Unfortunately, popular iterative heuristics often get trapped in local minima and one usually does not know if the global optimum was achieved.
A common alternative to these heuristics is the use of convex relaxations - to attempt optimizing (usually minimizing the minus log-likelihood) in a larger convex set that contains the parameter space of interest, as that allows one to leverage the power of convex optimization (e.g. Candes and Tao (2010)). The downside is that the solution obtained might not be in the original feasible set, forcing one to take an extra, potentially suboptimal, rounding step. The upside is that if it does lie in the original set then no rounding step is needed and one is guaranteed to have found the optimal solution to the original problem. Fortunately, this seems to often be the case in several problems.
I had a back-and-forth discussion at a meeting focused on sensor design recently. It was about the fact that Maximum Likelihood solvers were potentially solving something different from what was intended. It triggered this entry: You are not paying attention if... you believe an algorithm implementation is the same as the algorithm itself.

And finally, Sebastien Bubeck listed the COLT2014 accepted papers. Here are a few [2-13] that are connected to the theme of Nuit Blanche on matrix factorization.
Let me note that [8] potentially avoids the traditional iterative computation of A, then X, then A, etc., in dictionary learning (and blind deconvolution/calibration). I, for one, much liked the "previous work" section of versions 1 through 3 of the manuscript (now at version 4). But hey, I am sure peer review has enhanced the clarity of the paper... I also liked the possibilities encapsulated in the end of the abstract of [2]:

On the technical side, we contribute several new ideas on how to encode hard combinatorial problems in low-rank optimization problems. We hope that these techniques will be helpful in further understanding the computational limits of Matrix Completion and related problems.
And here is a connection to machine learning with Matrix Completion (from [2]):
We also consider a natural variant of Matrix Completion where the unknown matrix is positive semidefinite and so must be the output matrix. The positive semidefinite completion problem arises naturally in the context of Support Vector Machines (SVM). The kernel matrix used in SVM learning must be positive semidefinite as it is the Gram matrix of feature vectors. But oftentimes the kernel matrix is derived from partial similarity information resulting in incomplete kernel matrices. In fact, this is a typical situation in medical and biological application domains. In such cases the data analyst would like to complete the partial kernel matrix to a full kernel matrix while ensuring positive semidefiniteness. Moreover, since it is often infeasible to store a dense n × n matrix, it is desirable to also have a low-rank representation of the kernel matrix. This is precisely the low-rank positive semidefinite completion problem. Our results establish strong hardness results for this problem under natural relaxations. In this case we show that for any k ≥ 2 it is NP-hard to complete a partially given rank k matrix by a rank (2k-1) matrix.

References and abstracts:

We have observed an interesting, yet unexplained, phenomenon: Semidefinite programming (SDP) based relaxations of maximum likelihood estimators (MLE) tend to be tight in recovery problems with noisy data, even when MLE cannot exactly recover the ground truth. Several results establish tightness of SDP based relaxations in the regime where exact recovery from MLE is possible. However, to the best of our knowledge, their tightness is not understood beyond this regime. As an illustrative example, we focus on the generalized Procrustes problem.

Matrix Completion is the problem of recovering an unknown real-valued low-rank matrix from a subsample of its entries. Important recent results show that the problem can be solved efficiently under the assumption that the unknown matrix is incoherent and the subsample is drawn uniformly at random. Are these assumptions necessary?
It is well known that Matrix Completion in its full generality is NP-hard. However, little is known if we make additional assumptions such as incoherence and permit the algorithm to output a matrix of slightly higher rank. In this paper we prove that Matrix Completion remains computationally intractable even if the unknown matrix has rank 4 but we are allowed to output any constant rank matrix, and even if additionally we assume that the unknown matrix is incoherent and are shown 90% of the entries. This result relies on the conjectured hardness of the 4-Coloring problem. We also consider the positive semidefinite Matrix Completion problem. Here we show a similar hardness result under the standard assumption that P ≠ NP. Our results greatly narrow the gap between existing feasibility results and computational lower bounds. In particular, we believe that our results give the first complexity-theoretic justification for why distributional assumptions are needed beyond the incoherence assumption in order to obtain positive results. On the technical side, we contribute several new ideas on how to encode hard combinatorial problems in low-rank optimization problems. We hope that these techniques will be helpful in further understanding the computational limits of Matrix Completion and related problems.

Under a standard assumption in complexity theory (NP not in P/poly), we demonstrate a gap between the minimax prediction risk for sparse linear regression that can be achieved by polynomial-time algorithms, and that achieved by optimal algorithms. In particular, when the design matrix is ill-conditioned, the minimax prediction loss achievable by polynomial-time algorithms can be substantially greater than that of an optimal algorithm. This result is the first known gap between polynomial and optimal algorithms for sparse linear regression, and does not depend on conjectures in average-case complexity.


We obtain sharp bounds on the performance of Empirical Risk Minimization performed in a convex class and with respect to the squared loss, without any boundedness assumptions on class members or on the target. Rather than resorting to a concentration-based argument, the method relies on a `small-ball' assumption and thus holds for heavy-tailed sampling and heavy-tailed targets. Moreover, the resulting estimates scale correctly with the `noise'. When applied to the classical, bounded scenario, the method always improves the known estimates.

From concentration inequalities for the suprema of Gaussian or Rademacher processes an inequality is derived. It is applied to sharpen existing and to derive novel bounds on the empirical Rademacher complexities of unit balls in various norms appearing in the context of structured sparsity and multitask dictionary learning or matrix factorization. A key role is played by the largest eigenvalue of the data covariance matrix.

We solve the COLT 2013 open problem of \citet{SCB} on minimizing regret in the setting of advice-efficient multiarmed bandits with expert advice. We give an algorithm for the setting of K arms and N experts out of which we are allowed to query and use only M experts' advices in each round, which has a regret bound of \tilde{O}(\sqrt{\min\{K, M\} N T / M}) after T rounds. We also prove that any algorithm for this problem must have expected regret at least \tilde{\Omega}(\sqrt{\min\{K, M\} N T / M}), thus showing that our upper bound is nearly tight.

We give a robust version of the celebrated result of Kruskal on the uniqueness of tensor decompositions: we prove that given a tensor whose decomposition satisfies a robust form of Kruskal's rank condition, it is possible to approximately recover the decomposition if the tensor is known up to a sufficiently small (inverse polynomial) error.
Kruskal's theorem has found many applications in proving the identifiability of parameters for various latent variable models and mixture models such as Hidden Markov models, topic models etc. Our robust version immediately implies identifiability using only polynomially many samples in many of these settings. This polynomial identifiability is an essential first step towards efficient learning algorithms for these models.
Recently, algorithms based on tensor decompositions have been used to estimate the parameters of various hidden variable models efficiently in special cases as long as they satisfy certain "non-degeneracy" properties. Our methods give a way to go beyond this non-degeneracy barrier, and establish polynomial identifiability of the parameters under much milder conditions. Given the importance of Kruskal's theorem in the tensor literature, we expect that this robust version will have several applications beyond the settings we explore in this work.

In sparse recovery we are given a matrix A (the dictionary) and a vector of the form AX where X is sparse, and the goal is to recover X. This is a central notion in signal processing, statistics and machine learning. But in applications such as sparse coding, edge detection, compression and super resolution, the dictionary A is unknown and has to be learned from random examples of the form Y=AX where X is drawn from an appropriate distribution --- this is the dictionary learning problem. In most settings, A is overcomplete: it has more columns than rows. This paper presents a polynomial-time algorithm for learning overcomplete dictionaries; the only previously known algorithm with provable guarantees is the recent work of Spielman, Wang and Wright who gave an algorithm for the full-rank case, which is rarely the case in applications. Our algorithm applies to incoherent dictionaries which have been a central object of study since they were introduced in seminal work of Donoho and Huo. In particular, a dictionary is μ-incoherent if each pair of columns has inner product at most μ/√n.
The algorithm makes natural stochastic assumptions about the unknown sparse vector X, which can contain k ≤ c min(√n/(μ log n), m^(1/2 - η)) non-zero entries (for any η > 0). This is close to the best k allowable by the best sparse recovery algorithms even if one knows the dictionary A exactly. Moreover, both the running time and sample complexity depend on log(1/ϵ), where ϵ is the target accuracy, and so our algorithms converge very quickly to the true dictionary. Our algorithm can also tolerate substantial amounts of noise provided it is incoherent with respect to the dictionary (e.g., Gaussian). In the noisy setting, our running time and sample complexity depend polynomially on 1/ϵ, and this is necessary.

We consider the problem of sparse coding, where each sample consists of a sparse combination of a set of dictionary atoms, and the task is to learn both the dictionary elements and the mixing coefficients. Alternating minimization is a popular heuristic for sparse coding, where the dictionary and the coefficients are estimated in alternate steps, keeping the other fixed. Typically, the coefficients are estimated via ℓ1 minimization, keeping the dictionary fixed, and the dictionary is estimated through least squares, keeping the coefficients fixed. In this paper, we establish local linear convergence for this variant of alternating minimization and establish that the basin of attraction for the global optimum (corresponding to the true dictionary and the coefficients) is O(1/s^2) where s is the sparsity level in each sample and the dictionary elements are incoherent. Combined with the recent results of approximate dictionary estimation, these yield the provable guarantees for exact recovery of both the dictionary elements and the coefficients.


In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.


Shie Mannor (EE-Technion), Vianney Perchet (LPMA), Gilles Stoltz (GREGH)
In the standard setting of approachability there are two players and a target set. The players play a repeated vector-valued game where one of them wants to have the average vector-valued payoff converge to the target set which the other player tries to exclude. We revisit the classical setting and consider the setting where the player has a preference relation between target sets: she wishes to approach the smallest ("best") set possible given the observed average payoffs in hindsight. Moreover, as opposed to previous works on approachability, and in the spirit of online learning, we do not assume that there is a known game structure with actions for two players. Rather, the player receives an arbitrary vector-valued reward vector at every round. We show that it is impossible, in general, to approach the best target set in hindsight. We further propose a concrete strategy that approaches a non-trivial relaxation of the best-in-hindsight given the actual rewards. Our approach does not require projection onto a target set and amounts to switching between scalar regret minimization algorithms that are performed in episodes.

In this paper, we consider networks consisting of a finite number of non-overlapping communities. To extract these communities, the interaction between pairs of nodes may be sampled from a large available data set, which allows a given node pair to be sampled several times. When a node pair is sampled, the observed outcome is a binary random variable, equal to 1 if nodes interact and to 0 otherwise. The outcome is more likely to be positive if nodes belong to the same communities. For a given budget of node pair samples or observations, we wish to jointly design a sampling strategy (the sequence of sampled node pairs) and a clustering algorithm that recover the hidden communities with the highest possible accuracy. We consider both non-adaptive and adaptive sampling strategies, and for both classes of strategies, we derive fundamental performance limits satisfied by any sampling and clustering algorithm. In particular, we provide necessary conditions for the existence of algorithms recovering the communities accurately as the network size grows large. We also devise simple algorithms that accurately reconstruct the communities when this is at all possible, hence proving that the proposed necessary conditions for accurate community detection are also sufficient. The classical problem of community detection in the stochastic block model can be seen as a particular instance of the problems considered here. But our framework covers more general scenarios where the sequence of sampled node pairs can be designed in an adaptive manner. The paper provides new results for the stochastic block model, and extends the analysis to the case of adaptive sampling.
The paper proposes a novel upper confidence bound (UCB) procedure for identifying the arm with the largest mean in a multi-armed bandit game in the fixed confidence setting using a small number of total samples. The procedure cannot be improved in the sense that the number of samples required to identify the best arm is within a constant factor of a lower bound based on the law of the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence bounds to explicitly account for the infinite time horizon of the algorithm. In addition, by using a novel stopping time for the algorithm we avoid a union bound over the arms that has been observed in other UCB-type algorithms. We prove that the algorithm is optimal up to constants and also show through simulations that it provides superior performance with respect to the state-of-the-art.
credit photo: NASA/ESA

