
Sunday, December 02, 2007
Monday Morning Algorithm Part 5: The 1-D Experimental Probabilistic Hypersurface

Compressed Sensing: Restricted Isometry Properties and Nonconvex Compressive Sensing
In previous work, numerical experiments showed that lp minimization with 0 < p < 1 recovers sparse signals from fewer linear measurements than does l1 minimization. It was also shown that a weaker restricted isometry property is sufficient to guarantee perfect recovery in the lp case. In this work, we generalize this result to an lp variant of the restricted isometry property, and then determine how many random, Gaussian measurements are sufficient for the condition to hold with high probability. The resulting sufficient condition is met by fewer measurements for smaller p.
Reference:
[1] Gilles Pisier, The Volume of Convex Bodies and Banach Space Geometry.
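For those who want to play with this at home, here is a minimal sketch (mine, not the authors' code) of the usual iteratively reweighted least squares heuristic for lp minimization with p less than 1; the problem sizes, the smoothing schedule and the number of iterations are arbitrary choices:
% Minimal IRLS sketch for min ||x||_p^p subject to A*x = b with 0 < p < 1.
% This is a generic heuristic, not the implementation used in the paper.
m = 40; n = 100; k = 8; p = 0.5;             % arbitrary problem sizes
A = randn(m,n);                              % Gaussian measurement matrix
idx = randperm(n);
x0 = zeros(n,1); x0(idx(1:k)) = randn(k,1);  % k-sparse ground truth
b = A*x0;                                    % measurements
x = A\b;                                     % least-squares initialization
epsil = 1;                                   % smoothing parameter
for iter = 1:100
w = (x.^2 + epsil).^(1 - p/2);               % weights, roughly |x|^(2-p)
W = diag(w);
x = W*A'*((A*W*A')\b);                       % weighted least-squares update
epsil = max(epsil/10, 1e-8);                 % gradually sharpen the weights
end
norm(x - x0)/norm(x0)                        % small if recovery succeeded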
Friday, November 30, 2007
Non-Negative Matrix Factorization with Sparsity Constraint for Spectral Unmixing applied to Space Situational Awareness


We formulate the problem of hyperspectral image unmixing as a nonconvex optimization problem, similar to nonnegative matrix factorization. We present a heuristic for approximately solving this problem using an alternating projected subgradient approach. Finally, we present the result of applying this method on the 1990 AVIRIS image of Cuprite, Nevada and show that our results are in agreement with similar studies on the same data.
Reference:
[1] Bob Plemmons, Low-Rank Nonnegative Factorizations for Spectral Imaging Applications.
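As a rough illustration of the X = W*H skeleton the abstract refers to, here is a minimal sketch on synthetic data. For simplicity it uses the classic Lee-Seung multiplicative updates rather than the authors' sparsity-constrained alternating projected subgradient heuristic, and all sizes are arbitrary:
% Minimal NMF sketch: approximate X by W*H with W,H >= 0.
% Plain Lee-Seung multiplicative updates, no sparsity constraint;
% this is NOT the authors' unmixing heuristic.
X = abs(randn(50,200));                  % stand-in for hyperspectral data
r = 5;                                   % assumed number of endmembers
W = rand(size(X,1),r);                   % endmember signatures (learned)
H = rand(r,size(X,2));                   % abundances (learned)
for iter = 1:200
H = H .* (W'*X) ./ (W'*W*H + eps);       % multiplicative update for H
W = W .* (X*H') ./ (W*H*H' + eps);       % multiplicative update for W
end
relative_error = norm(X - W*H,'fro')/norm(X,'fro')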
Thursday, November 29, 2007
Morphological Diversity using Sparse Constraints

Recent advances in harmonic analysis and signal processing have advocated the use of overcomplete signal representations. The ability of such redundant signal dictionaries to lead to very sparse representations has already been used successfully in various fields. In this context, morphological diversity has emerged as an effective signal analysis tool. The gist of morphological diversity consists in modeling signals as the linear combination of several so-called morphological components... Recovering the morphological components from their combination then relies on the incoherence (morphological diversity) of their respective sparse representations (the DCT and the Curvelet tight frame). Morphological Component Analysis (MCA) has been devised to solve this harsh recovery problem.
In the general case, a wide range of signal representations can be accounted for such as wavelets, contourlets, bandlets, wave atoms or curvelets.
Morphological diversity and Morphological Component Analysis (MCA) then turn out to be privileged tools for sparse signal representation.
This website has been designed to give a broad overview of what morphological diversity can do. This particular and effective framework has already been extended and applied to a very wide range of applications in signal and image processing, from texture/contour separation to source separation or compressed sensing. We really think that morphological diversity has just begun a fruitful and long life in signal processing.
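To make the idea concrete, here is a toy sketch of an MCA-style separation (my own simplification, not the MCALab code): a 1-D signal made of a few spikes plus an oscillation that is sparse in the DCT is split into its two morphological components by alternating soft-thresholding in the two dictionaries (dct/idct below come from the Signal Processing Toolbox):
% Toy MCA-style separation: y = spikes + oscillation, the spikes being
% sparse in the Dirac basis and the oscillation sparse in the DCT.
% Simplified sketch, not the MCALab implementation.
N = 256;
idx = randperm(N);
spikes = zeros(N,1); spikes(idx(1:5)) = 3;     % spike component
c = zeros(N,1); c(11) = 5; osc = idct(c);      % one DCT atom
y = spikes + osc;                              % observed mixture
soft = @(v,t) sign(v).*max(abs(v)-t,0);        % soft-thresholding operator
xs = zeros(N,1); xo = zeros(N,1);              % component estimates
lambda = max(abs(y));                          % initial threshold
for it = 1:100
xs = soft(y - xo, lambda);                     % update the spike part
xo = idct(soft(dct(y - xs), lambda));          % update the DCT part
lambda = max(0.1, lambda*0.95);                % lower threshold, stay above cross-talk
end
% xs should now hold (most of) the spikes and xo the oscillation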
The hyperspectral part of the website mentions the use of Mars Data and the upcoming availability of a code that does Blind Source Separation using Spatial and Spectral Sparsity Constraints. On a related note, ESA now provides access to Earth hyperspectral data from MERIS and Proba.
Credit Photo: ESA/DLR/FU Berlin (G. Neukum), hyperspectral photo of Phobos, one of Mars' moons, taken by the Mars Express probe three years ago.
Outliers in the Long Tail: The case of Autism.
Is there a business case for using the web where the amount of traffic is not directly proportional to income or recognition?
Let me take an example featured in this blog. In some cases, people don't know about a subject or the wording associated with it, but they are very much willing to know more because it affects them. Case in point: autism (it could be any other subject). I have written several posts on the business case for an eye-tracking device to detect or quantify autism (Part I, Part II, Part III). When I look back at the log of queries leading to this blog (after these entries were posted), there is a fair number of low-level queries with the following keywords:
- " infant eye tracking",
- "baby not eye tracking well",
- "eye tracking autism",
- "baby's eyes not tracking",
- "newborn eye tracking",
- "detecting autism early",
- "eye tracking kids",
- "babies with eye tracking problems",
- "autism and follow eyes",
- "autism and eye tracking importance",
- "autism and eye movements"........
- "adi-r and ados diagnosis",
- "autism difficulty with standardized testing",
- "ados failed to detect asd",
- "age at which autism diagnosed",
- "pdd-nos stability of diagnosis"
With one out of 166 kids being on the autism spectrum, it would seem to me there is ample work for people just trying to make sense of new findings (and not delivering "solutions", which is the current market). Doctors or families, while directly affected, sometimes do not have the means to interpret the new findings and need somebody who specializes in this area. I have had several hits from countries other than the U.S., and I would guess the "market" is much larger than what the traffic shows.
In all, only the web can provide access to a population of less than 1 in 100, and I would venture that some percentage of that population (my guess is 10 percent) is willing to pay for a service they cannot get otherwise. Eventually, instead of a long tail or power law, we should really think of it in a different manner, not unlike the one featured by Dan Bricklin

which reminds me of the Plouffe's Inverter graphs that show how digits in numbers follow power laws (Benford's law):

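For the curious, the first-digit distribution behind Benford's law is easy to reproduce; here is a quick snippet of mine comparing it to the leading digits of the powers of 2:
% Benford's law: the leading digit d occurs with probability log10(1 + 1/d).
d = 1:9;
benford = log10(1 + 1./d)                      % theoretical frequencies
v = 2.^(1:1000);                               % powers of 2
lead = floor(v ./ 10.^floor(log10(v)));        % leading digits
empirical = histc(lead, d)/length(lead)        % observed frequencies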
P.S. The fact that I am currently not running ads on this blog is very much linked to the fact that the keyword "Autism" attracts many ads for products for which nobody can guarantee anything.
Saturday, November 24, 2007
Monday Morning Algorithm Part 4: Improving Embeddings by Flexible Exploitation of Side Information

In Machine Learning, one is often faced with using the Euclidean distance as the de facto way of comparing objects or features. This is like comparing apples and oranges: little contextual side information is provided, since what matters really depends on the application. Comparing apples and oranges might be OK if you care about eating one fruit a day, but it might be extremely unproductive if you care about comparing vitamin C content between fruits without distinguishing between the two. One way to address this is to construct a metric tailored to the end goal you have in mind and create a Mahalanobis distance function. The main issue is how to impart any type of similarity and dissimilarity between features using prior knowledge of the problem at hand. Ali Ghodsi, Dana Wilkinson and Finnegan Southey have begun answering this question by building a metric based on side information in Improving Embeddings by Flexible Exploitation of Side Information. We are going to feature this algorithm today with Cable Kurwitz (as usual all the good stuff is his and the mistakes are mine). The conclusion of the article reads
Many machine learning techniques handle complex data by mapping it into new spaces and then applying standard techniques, often utilizing distances in the new space. Learning a distance metric directly is an elegant means to this end. This research shows how side information of two forms, class equivalence information and partial distance information, can be used to learn a new distance as a preprocessing step for embedding. The results demonstrate that side information allows us to capture attributes of the data within the embedding in a controllable manner. Even a small amount of side information can improve the embedding’s structure. Furthermore, the method can be kernelized and also used to capture multiple attributes in the same embedding. Its advantages over existing metric learning methods are a simpler optimization formulation that can be readily solved with off-the-shelf software and greater flexibility in the nature of the side information one can use. In all, this technique represents a useful addition to our toolbox for informed embeddings.
In order to implement this algorithm you have to have Michael Grant, Stephen Boyd, and Yinyu Ye's CVX program installed.
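As I read the paper, the small CVX program below solves the following semidefinite program (my notation), where S is the set of pairs declared similar and D the set of pairs declared dissimilar:

\min_{A} \; \sum_{(i,j)\in S} (x_i - x_j)^\top A \,(x_i - x_j) \;-\; \sum_{(i,j)\in D} (x_i - x_j)^\top A \,(x_i - x_j)
\quad \text{subject to} \quad A \succeq 0, \;\; \mathrm{trace}(A) = 1.

The objective is assembled in the code through the identity vec(A)'*vec(uu') = u'*A*u.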
clear
% simple example where x1 and the two columns of x2 are
% declared similar (close) while x3 and x4 are declared
% dissimilar (far apart); the ambient dimension is 5
n =5;
x1=[1 2 3 4 5]';
x2=[0.1 2.1 12 56 100; 7 1.9 98 200 1]';
x3=[1 4 7 8 19]';
x4=[123 43 7.1 29 133]';
% computing elements in the loss function
for i=1:2
xs1(:,i)=x1-x2(:,i);
end
xt1 = zeros(n*n,1);
for i=1:2
xt1= xt1 + vec(xs1(:,i)*xs1(:,i)') ;
end
xs2 = x3 - x4;
xt2 = vec(xs2*xs2');
cvx_begin
variable A(n,n) symmetric;
minimize(vec( A )' *(xt1-xt2))
subject to
A == semidefinite(n);
trace(A) == 1;
cvx_end
% checking results
% distances between the vectors declared similar to each other
for i=1:2
xr1(i)=(x1-x2(:,i))'*A*(x1-x2(:,i))
end
% these distances (x1 against the two columns of x2) are small
% compared to the distance between the dissimilar pair below
%
% now difference between vectors that are supposedly far
% from each other
(x3-x4)'*A*(x3-x4)
% this distance is much larger for the dissimilar pair x3, x4,
% as expected.
% Actual results:
% xr1 =
%
% 2.7221e+003
%
%
%xr1 =
%
% 1.0e+003 *
%
% 2.7221 0.0067
%
%
%ans =
%
% 2.8875e+004
It's a nifty little tool. From there you can use your favorite dimensionality reduction technique.
Some other algorithms I will be featuring in the future Monday Morning Algorithm Series include: The Fast Johnson Lindenstrauss Transform, the Experimental Probabilistic Hypersurface, Diffusion methods for dimensionality reduction and others....
Credit Photo: Chinese Space Agency, Xinhua. First photo of the Moon from the Chinese-made Chang'e lunar orbiter, taken yesterday.
Friday, November 23, 2007
StrangeMaps, Nuclear Power, HASP, ML in Space, GPU with Matlab

In other news from the KDnuggets weekly e-mail, some items caught my attention: maybe the birth of a new kind of journalism in What Will Journalist-Programmers Do?, and there is a CFP on the subject of Machine Learning in Space (Mar 31).
Via GPGPU, AMD is talking about introducing double precision in frameworks using GPUs (graphics cards). On the other hand, the NVIDIA CUDA compiler now has a Matlab plug-in: whenever you use a Matlab function like FFT, the plug-in takes over the job and sends it to the graphics card (GPU). Maybe we can do faster reconstruction in Compressed Sensing using a Partial Fourier Transform and a GPU. Oops, I did it again, I spoke about CS.
Tuesday, November 20, 2007
Compressed Sensing: Sensor Networks, Learning Compressed Sensing by Uncertain Component Analysis, Sparsity or Positivity ?, New CVX build
Shuchin Aeron, Manqi Zhao, and Venkatesh Saligrama, Sensing capacity of sensor networks: Fundamental tradeoffs of SNR, sparsity, and sensing diversity.
The abstract reads:
A fundamental question in sensor networks is to determine the sensing capacity – the minimum number of sensors necessary to monitor a given region to a desired degree of fidelity based on noisy sensor measurements. In the context of the so called compressed sensing problem sensing capacity provides bounds on the maximum number of signal components that can be identified per sensor under noisy measurements. In this paper we show that sensing capacity is a function of SNR, signal sparsity—the inherent complexity/dimensionality of the underlying signal/information space and its frequency of occurrence and sensing diversity – the number of modalities of operation of each sensor. We derive fundamental tradeoffs between SNR, sparsity, diversity and capacity. We show that the capacity is a monotonic function of SNR and diversity. A surprising result is that as sparsity approaches zero so does the sensing capacity irrespective of diversity. This implies for instance that to reliably monitor a small number of targets in a given region requires disproportionately large number of sensors.
Shuchin Aeron, Manqi Zhao, and Venkatesh Saligrama, On sensing capacity of sensor networks for the class of linear observation, fixed SNR models.
The abstract reads:
In this paper we address the problem of finding the sensing capacity of sensor networks for a class of linear observation models and a fixed SNR regime. Sensing capacity is defined as the maximum number of signal dimensions reliably identified per sensor observation. In this context sparsity of the phenomena is a key feature that determines sensing capacity. Precluding the SNR of the environment the effect of sparsity on the number of measurements required for accurate reconstruction of a sparse phenomena has been widely dealt with under compressed sensing. Nevertheless the development there was motivated from an algorithmic perspective. In this paper our aim is to derive these bounds in an information theoretic set-up and thus provide algorithm independent conditions for reliable reconstruction of sparse signals. In this direction we first generalize the Fano’s inequality and provide lower bounds to the probability of error in reconstruction subject to an arbitrary distortion criteria. Using these lower bounds to the probability of error, we derive upper bounds to sensing capacity and show that for fixed SNR regime sensing capacity goes down to zero as sparsity goes down to zero. This means that disproportionately more sensors are required to monitor very sparse events. We derive lower bounds to sensing capacity (achievable) via deriving upper bounds to the probability of error via adaptation to a max-likelihood detection set-up under a given distortion criteria. These lower bounds to sensing capacity exhibit similar behavior though there is an SNR gap in the upper and lower bounds. Subsequently, we show the effect of correlation in sensing across sensors and across sensing modalities on sensing capacity for various degrees and models of correlation. Our next main contribution is that we show the effect of sensing diversity on sensing capacity, an effect that has not been considered before. Sensing diversity is related to the effective coverage of a sensor with respect to the field. In this direction we show the following results (a) Sensing capacity goes down as sensing diversity per sensor goes down; (b) Random sampling (coverage) of the field by sensors is better than contiguous location sampling (coverage). In essence the bounds and the results presented in this paper serve as guidelines for designing efficient sensor network architectures.
I also found Learning Compressed Sensing by Yair Weiss, Hyun Sung Chang and William Freeman
The abstract reads:
Compressed sensing [7], [6] is a recent set of mathematical results showing that sparse signals can be exactly reconstructed from a small number of linear measurements. Interestingly, for ideal sparse signals with no measurement noise, random measurements allow perfect reconstruction while measurements based on principal component analysis (PCA) or independent component analysis (ICA) do not. At the same time, for other signal and noise distributions, PCA and ICA can significantly outperform random projections in terms of enabling reconstruction from a small number of measurements. In this paper we ask: given a training set typical of the signals we wish to measure, what are the optimal set of linear projections for compressed sensing ? We show that the optimal projections are in general not the principal components nor the independent components of the data, but rather a seemingly novel set of projections that capture what is still uncertain about the signal, given the training set. We also show that the projections onto the learned uncertain components may far outperform random projections. This is particularly true in the case of natural images, where random projections have vanishingly small signal to noise ratio as the number of pixels becomes large.
Unrelated to the previous subject, Morten Morup shows in this poster the extensions of and state-of-the-art techniques in Non-Negative Matrix Factorization. The poster (entitled Extensions of Non-negative Matrix Factorization NMF to Higher Order Data) shows how the use of the regularizing l1 norm provides the ability to do Sparse Coding NMF, i.e. find the sparsest positive factorization. Ever since the Nature article of Daniel Lee and Sebastian Seung (Learning the parts of objects by non-negative matrix factorization), much has been done to improve on nonnegative matrix factorization. And so, while the work of Morup is extremely important, it would seem that positivity can be assured only by an argument of sparsity. This is intriguing, especially in light of this article by Alfred Bruckstein, Michael Elad, and Michael Zibulevsky entitled "A Non-Negative and Sparse Enough Solution of an Underdetermined Linear System of Equations is Unique". The abstract reads:
In this paper we consider an underdetermined linear system of equations Ax = b with non-negative entries of A and b, and the solution x being also required to be non-negative. We show that if there exists a sufficiently sparse solution to this problem, it is necessarily unique. Furthermore, we present a greedy algorithm - a variant of the matching pursuit - that is guaranteed to find this sparse solution. We also extend the existing theoretical analysis of the basis pursuit problem, i.e. min ||x||_1 s.t. Ax = b, by studying conditions for perfect recovery of sparse enough solutions. Considering a matrix A with arbitrary column norms, and an arbitrary monotone element-wise concave penalty replacing the l1-norm objective function, we generalize known equivalence results. Beyond the desirable generalization that this result introduces, we show how it is exploited to lead to the above-mentioned uniqueness claim.
In other words, if the solution is sparse enough, then it is unique and one can drop the l1 penalty: a non-negativity constraint is sufficient. Wow. (Thanks Jort for the link.)
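A quick way to see this claim in action is the following toy experiment of mine (not from the paper): build an underdetermined nonnegative system whose solution is sparse and nonnegative, then solve it with plain nonnegative least squares, with no l1 term anywhere:
% Toy check of the "sparse enough + nonnegative => unique" claim:
% recover a sparse nonnegative x from A*x = b using nonnegativity only.
m = 30; n = 80; k = 4;                       % arbitrary sizes
A = abs(randn(m,n));                         % nonnegative matrix
idx = randperm(n);
x0 = zeros(n,1); x0(idx(1:k)) = rand(k,1) + 0.5;   % sparse nonnegative truth
b = A*x0;
xhat = lsqnonneg(A, b);                      % nonnegativity only, no l1 penalty
recovery_error = norm(xhat - x0)/norm(x0)    % small when uniqueness holds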
In a different area, Michael Grant, Stephen Boyd, and Yinyu Ye released a new build (Version 1.1, Build 565) for CVX ( Matlab Software for Disciplined Convex Programming).
Monday, November 19, 2007
Monday Morning Algorithm Part 3: Compressed Sensing meets Machine Learning / Recognition via Sparse Representation Classification Algorithm

The whole list of Monday Morning Algorithms is listed here with attendant .m files. If you want to contribute, like Jort Gemmeke just did, please let me know. For those reading this on Sunday, well, it's Monday somewhere. Now on to today's algorithm:
In a previous entry, we discovered that the Machine Learning community was beginning to use random projections (the ones that satisfy the Restricted Isometry Property!) to, among other things, tremendously reduce the dimension of images for pattern recognition. The thinking has gone further with the clever argument that, when searching for a match between a query and a dictionary composed of training examples, only a few elements will match, thus yielding a sparse representation in that dictionary. And so, using one of the Compressed Sensing reconstruction techniques, one could in effect say that the result of the database query has become a linear programming question (for those of us using L1 reconstruction techniques). John Wright, Arvind Ganesh Balasubramanian, Allen Yang and Yi Ma proposed this in their study on face recognition [1]. The generic approach in Machine Learning has mostly been based on Nearest Neighbor computation.
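As I understand reference [1], the classification rule implemented below can be summarized as follows (my paraphrase):

\hat{x} = \arg\min_x \|x\|_1 \;\; \text{subject to} \;\; \|\tilde{A}x - \tilde{y}\|_2 \le \varepsilon,
\qquad \mathrm{class}(y) = \arg\min_i \|\tilde{y} - \tilde{A}\,\delta_i(\hat{x})\|_2,

where \tilde{A} = RA and \tilde{y} = Ry are the randomly projected dictionary and query, and \delta_i(\hat{x}) keeps only the coefficients of \hat{x} associated with class i. In the code, GPSR solves the l1 step and the residual loop at the end implements the argmin over classes.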
Intrigued by this approach, Jort Gemmeke decided to contribute today's Monday Morning Algorithm. Jort decided to write Algorithm 1 (Recognition via Sparse Representation) of reference [1]. I mostly contributed in prettying it up and then some (i.e. all the good stuff is his and all the mistakes are mine). Thank you Jort. It uses GPSR (Mário Figueiredo, Robert D. Nowak and Stephen Wright have made a new version available), but in the comment section you can also uncomment calls to L1-Magic, Sparsify or CVX.
In this example we have a dictionary made up of cosine functions of different frequencies making up ten different classes (i.e. ten different frequencies), the query is close to the third class. In every class there are two cosine functions of the same frequency but with different noise added to them. The query is free of noise. You need to have GPSR in your path to have it run.
clear all;
% Algorithm implementing Recognition via Sparse Representation
% or Algorithm 1 suggested in
% "Feature selection in face recognition: A sparse representation
% perspective" by Allen Y. Yang, John Wright, Yi Ma, and Shankar Sastry
% written by Jort Florent Gemmeke and Igor Carron.
% n total number of images in the training database
n = 20;
% m is the dimension of the image/ambient
% space (number of samples in each signal
% of the "training" set), here 101
x=[0:0.01:1];
m = size(x,2);
% d is the number of Compressed Sensing
% measurements, it is also the dimension of the feature
% space.
d = 7;
% k is the number of classification groups (classes/subjects)
k = 10;
% e.g. every group consists of kn = n/k = 2 samples.
kn = n/k;
% Building the training database/dictionary A of n elements
% Within that dictionary there are k classes
% Each of these classes correspond to functions:cos(i x)
% with added noise. i is the differentiator between classes.
% Within each class, any element is different from the other
% by some amount of noise.
% Each element is listed in a row. Each function-element is
% sampled m times.
noise=0.1;
for i=1:k
for j=1:kn
yy = noise*rand(1,m);
A(:,(i-1)*kn+j)=cos(i.*x)+yy;
end
end
% In the following, row i of onesvecs is a vector
% of length n with ones at the positions of the
% columns of A belonging to class i (i in [1,k])
% and zeros elsewhere. There is one such row per
% class, so onesvecs is a k x n matrix.
for i=1:k
onesvecs(i,:)=[zeros(1,(i-1)*kn) ones(1,kn) zeros(1,n-((i-1)* kn + kn))];
end
% This is the element for which we want to
% know to which class it belongs. Here we give
% the ground truth answer, used in the plot, in x_true.
x_true = [zeros(1,4) 1 1 zeros(1,n-6)];
x_true = x_true/norm(x_true,1);
% Now the expression stating that y = cos(3.*x);
y = cos(3.*x)';
% Please note it is different from both entries
% in the dictionary as there is no noise.
% we could have put this as well: y = A * x_true'
% but it would have meant we were using the training
% function as a query test.
%
%
%
% Step 2
% Dimensionality reduction from 101 to 7 features
% using random projection.
R = randn(d,m);
Atilde = R * A;
ytilde = R * y;
% Normalize columns of Atilde and ytilde
for i=1:size(A,2)
Atilde(:,i)=Atilde(:,i)/norm(Atilde(:,i),2);
end
ytilde = ytilde/norm(ytilde);
% Step 3
% call any L1 solver you want. Here we call GPSR
%
% ---- L1 Magic Call
%xp = l1qc_logbarrier(Atilde\ytilde, Atilde, [], ytilde, 0.01, 1e-3);
%
% ---- Sparsify Solver Call
%xp = greed_omp(ytilde,Atilde,n);
%
% ---- CVX Solver Call
%sigma = 0.05;
%cvx_begin
% variable xp(n);
% minimize(norm(xp,1));
% subject to
% norm(Atilde*xp - ytilde) <= sigma;
%cvx_end
%
% ---- GPSR Solver call
xp = GPSR_BB(ytilde,Atilde, 0.01*max(abs(Atilde'*ytilde)),'Continuation',1);
% plot results of L1 solution
figure; plot(xp,'v');
title('L1 solution (v), initial signal (o)');
hold; plot(x_true,'o');
%
% Step 4
% Calculate residuals (formula 13)
residual = [];
for i=1:k
% only keeps nonzeros in xp where onesvec is nonzero
deltavec(:,i) = onesvecs(i,:)'.* xp;
% calculate residual on only a subset of Atilde
residual(i) = norm(ytilde-Atilde*deltavec(:,i));
end
figure; bar(residual); % plot residuals
title('The lowest bar indicates the class to which the query belongs')
Please note the substantial dimensionality reduction, from 101 samples down to 7 features, while still retaining a good classification capability even though the noise-to-signal ratio is above 10 percent for the signals in the library.
[1] Robust Face Recognition via Sparse Representation, John Wright, Arvind Ganesh Balasubramanian, Allen Yang and Yi Ma. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
[2] Updated version (
Photo Credit: ESA ©2007 MPS for OSIRIS Team MPS/UPD/LAM/IAA/RSSD/INTA/UPM/DASP/IDA