Nuit Blanche: Sunday Morning Insight: Randomization is not a dirty word

Sunday, December 15, 2013

Sunday Morning Insight: Randomization is not a dirty word

From [8]

The recent announcement of Yann LeCun's appointment as a director of the new Artificial Intelligence Lab at Facebook and Geoff Hinton's move with Google sends a pretty clear signal that neural networks are OK and that the time has come for 'AI for images' i.e. image understanding. Beyond the cleverness of the algorithms developed by those two researchers and their collaborators, lies several technical prowesses. One of the most important one is hardware based: Moore's Law, one of the steamrollers (Predicting the Future: The Steamrollers) that now allows the computation of parameters from larger neural networks with desktop computers.

Parallel to these developments (highlighted in Predicting the Future: Randomness and Parsimony), is the mathematics behind the study of high dimensional objects, a field that has really taken off since compressive sensing back in 2004. In fact, mathematically speaking, the ball got really rolling with Johnson and Lindenstrauss back in 1984. With these results, randomization suddenly became something of a wonder tool. In linear algebra and attendant fields, its use is only starting to get noticed (see the recent Slowly but surely they'll join our side of the Force... ). In fact, several folks decided to give a word to this field: Randomized Numerical Linear Algebra (RandNLA) -Most blog entries on the subject are under the RandNLA tag)-

All is well and peachy keen but how is randomization connected to the nonlinear world of deep learning and neural networks ?

If you were to read the literature, you'd think you'd be watching some episode of Game of Thrones. Not the treacherous and gratitious violence, but rather the many tribes living in their own islands of knowledge. Here is a small case in point: as Joseph Ghafari was making a presentation at the recent Paris Machine Learning Meetup #6 on designing neural networks for botnet detection, I was struck by the similarity between the ELM approach to neural networks spearheaded by Guang-Bin Huang since 2000 [22] and the work on Random Kitchen Sinks [1,2,13] or the work on 1bit and quantization CS [23,24,25] and even more recent work on high dimensional function estimation with ridge functionals [3-9]

Similarly, I do not see papers proposing SVM with Extreme Learning Machine (ELM) techniques mentioning the random kitchen sinks approach,but I am sure I am not reading the right papers either.

None of this really matter. It is a question of time before they actually do talk to each other. But much like what happened in compressive sensing where we had more than 50 sparsity seeking solvers, we really had no real way of judging which one was really interesting. The sharp phase transitions ( see Sunday Morning Insight: The Map Makers ) eventually were the only way to figure this one out. One wonders if they could provide rules of thumbs for understanding these nonlinear approximations of the identity [14]. At the very least, much like in Game of Thrones, the walls, or the sharp phase transition will act as an acid test for all involved. Here are some of the questions that could be answered. From the ELM theory Open Problems page:

As observed in experimental studies, the performance of basic ELM is stable in a wide range of number of hidden nodes. Compared to the BP learning algorithm, the performance of basic ELM is not very sensitive to the number of hidden nodes. However, how to prove it in theory remains open.
One of the typical implementations of ELM is to use random nodes in the hidden layer and the hidden layer of SLFNs need not be tuned. It is interesting to see that the generalization performance of ELM turns out to be very stable. How to estimate the oscillation bound of the generalization performance of ELM remains open too.
It seems that ELM performs better than other conventional learning algorithms in applications with higher noise. How to prove it in theory is not clear.
ELM always has faster learning speed than LS-SVM if the same kernel is used?
ELM provides a batch learning kernel solution which is much simpler than other kernel learning algorithms such as LS-SVM. It is known that it may not be straightforward to have an efficient online sequential implementation of SVM and LS-SVM. However, due to the simplicity of ELM, is it possible to implement the online sequential variant of the kernel based ELM?
ELM always provides similar or better generalization performance than SVM and LS-SVM if the same kernel is used (if not affected by computing devices' precision)?
ELM tends to achieve better performance than SVM and LS-SVM in multiclasses applications, the higher the number of classes is, the larger the difference of their generalization performance will be?
Scalability of ELM with kernels in super large applications.
Parallel and distributed computing of ELM.
ELM will make real-time reasoning feasible?
The hidden node / neuron parameters are not only independent of the training data but also of each other.
Unlike conventional learning methods which MUST see the training data before generating the hidden node / neuron parameters, ELM could randomly generate the hidden node / neuron parameters before seeing the training data.

So far only few works seem to find those transitions. To a certain extent, it looks to be the case in the quantization compressive sensing case [23]

From [23]

or more recently in Learning Functions of Few Arbitrary Linear Parameters in High Dimensions [8]

From [8]

or in the examples featured in last Sunday Morning Insight's that talked about Sharp Phase Transitions in Machine Learning.. . in all, while randomization helps those nonlinear models in getting better results, the acid test that will allow us figure out which one is the most robust, will come from their behaviors with respect the sharp phase transitions...If you are presently using a specific algorithm: Are you going to provide the rest of the community with new charts ? Are you going to be the next Map Makers ?

References:

[1] Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning by Ali Rahimi, Benjamin Recht

[2] Random Features for Large-Scale Kernel Machines by Ali Rahimi and Ben Recht

[3] Learning Non-Parametric Basis Independent Models from Point Queries via Low-Rank Methods by Hemant Tyagi, Volkan Cevher

We consider the problem of learning multi-ridge functions of the form f(x) = g(Ax) from point evaluations of f. We assume that the function f is defined on an l_2-ball in R^d, g is twice continuously differentiable almost everywhere, and A \in R^{k \times d} is a rank k matrix, where k << d. We propose a randomized, polynomial-complexity sampling scheme for estimating such functions. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive a polynomial time estimator of the function f along with uniform approximation guarantees. We prove that our scheme can also be applied for learning functions of the form: f(x) = \sum_{i=1}^{k} g_i(a_i^T x), provided f satisfies certain smoothness conditions in a neighborhood around the origin. We also characterize the noise robustness of the scheme. Finally, we present numerical examples to illustrate the theoretical bounds in action.

[4] Active Learning of Multi-Index Function Models by Hemant Tyagi and Volkan Cevher

[5] LEARNING NON-PARAMETRIC BASIS INDEPENDENT MODELS FROM POINT QUERIES VIA LOW-RANK METHODS by HEMANT TYAGI AND VOLKAN CEVHER

[6] LEARNING RIDGE FUNCTIONS WITH RANDOMIZED SAMPLING IN HIGH DIMENSIONS by Hemant Tyagi and Volkan Cevher

[7] COMPRESSED LEARNING OF HIGH-DIMENSIONAL SPARSE FUNCTIONS by Karin Schnass, Jan Vybıral

[8] Learning Functions of Few Arbitrary Linear Parameters in High Dimensions by Massimo Fornasier, Karin Schnass, Jan Vybiral

Let us assume that f is a continuous function defined on the unit ball of Rd, of the form f(x)=g(Ax), where A is a k×d matrix and g is a function of k variables for k≪d. We are given a budget m∈N of possible point evaluations f(xi), i=1,...,m, of f, which we are allowed to query in order to construct a uniform approximating function. Under certain smoothness and variation assumptions on the function g, and an {\it arbitrary} choice of the matrix A, we present in this paper

1. a sampling choice of the points {xi} drawn at random for each function approximation;

2. algorithms (Algorithm 1 and Algorithm 2) for computing the approximating function, whose complexity is at most polynomial in the dimension d and in the number m of points.

Due to the arbitrariness of A, the choice of the sampling points will be according to suitable random distributions and our results hold with overwhelming probability. Our approach uses tools taken from the {\it compressed sensing} framework, recent Chernoff bounds for sums of positive-semidefinite matrices, and classical stability bounds for invariant subspaces of singular value decompositions.

[9] Entropy and sampling numbers of classes of ridge functions by Sebastian Mayer, Tino Ullrich, Jan Vybiral

We study properties of ridge functions $f(x)=g(a\cdot x)$ in high dimensions $d$ from the viewpoint of approximation theory. The considered function classes consist of ridge functions such that the profile $g$ is a member of a univariate Lipschitz class with smoothness $\alpha > 0$ (including infinite smoothness), and the ridge direction $a$ has $p$-norm $\|a\|_p \leq 1$. First, we investigate entropy numbers in order to quantify the compactness of these ridge function classes in $L_{\infty}$. We show that they are essentially as compact as the class of univariate Lipschitz functions. Second, we examine sampling numbers and face two extreme cases. In case $p=2$, sampling ridge functions on the Euclidean unit ball faces the curse of dimensionality. It is thus as difficult as sampling general multivariate Lipschitz functions, a result in sharp contrast to the result on entropy numbers. When we additionally assume that all feasible profiles have a first derivative uniformly bounded away from zero in the origin, then the complexity of sampling ridge functions reduces drastically to the complexity of sampling univariate Lipschitz functions. In between, the sampling problem's degree of difficulty varies, depending on the values of $\alpha$ and $p$. Surprisingly, we see almost the entire hierarchy of tractability levels as introduced in the recent monographs by Novak and Wo\'zniakowski.

[10] Conic Geometric Programming by Venkat Chandrasekaran and Parikshit Shah

[11] Randomized Algorithms for Scalable Machine Learning by Kleiner, Ariel

[12] Random Conic Pursuit for Semideﬁnite Programming by Ariel Kleiner, Ali Rahimi, Michael I. Jordan

We present a novel algorithm, Random Conic Pursuit, that solves semideﬁnite programs (SDPs) repeatedly solving randomly generated optimization problems over two-dimensional subcones of the PSD cone. This scheme is simple, easy to implement, applicable to general nonlinear SDPs, and scalable. These advantages come at the expense of inexact solutions, though we show these to be practically useful and provide theoretical guarantees for some of them. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are based on random data and so exact minima are often not a priority.

[13] Random Features for Large-Scale Kernel Machines by Ali Rahimi and Ben Recht

[14] Sunday Morning Insight on a Quick Panorama of Sensing from Direct Imaging to Machine Learning

[15] Sunday Morning Insight: Sharp Phase Transitions in Machine Learning ?

[16]Using SVMs with randomised feature spaces: an extreme learning approach by Benoît Frénay and Michel Verleysen

[17] Benoît Frénay's blog , http://bfrenay.wordpress.com/extreme-learning/

[18]Random Projections for Support Vector Machines by Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, Petros Drineas. in Random Projections for Support Vector Machines

[19] Fast Functions via Randomized Algorithms: Linear Regression with Random Projections

[20] Fast Functions via Randomized Algorithms: Fastfood versus Random Kitchen Sinks

[21] Nystrom Method vs Random Fourier Features:: A Theoretical and Empirical Comparison by Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

[22] Extreme Learning Machine (ELM), http://www.ntu.edu.sg/home/egbhuang/

[23] J. N. Laska Regime Change. (includes: BSE, several algorithms, numerical work, and regime change)
[24] http://dsp.rice.edu/1bitCS/

[25] Resources on Quantization, Petros Boufounos

Join the CompressiveSensing subreddit or the Google+ Community and post there !

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

3 comments:

Simone Scardapane said...: Very nice article, the explosion of randomized techniques in ML is one of the things that interest me most. I would add to the discussion the work on Reservoir Computing approaches, e.g. Echo State Networks, that can be seen as the recurrent counterpart to ELM methods. I would love to see the connections between all these subfields analyzed in detail.; Monday, December 16, 2013 at 7:15:00 AM CST
Igor said...: Simone,

Thanks for the heads-up.

The dearth of papers and the fact that these communities barely know each other is a sure sign we are at the edge of our collective knowledge. If you do make the connection between these and the ESN, then please let me know, that new community needs to know about this.

Cheers,

Igor.; Tuesday, December 17, 2013 at 5:29:00 AM CST
Simone Scardapane said...: Yes, diversity of approaches is at the same time stimulating and bewildering.
As an example I found truly fascinating that the same idea of "randomizing" a recurrent neural network came up three times in the span of a few years (Echo State Networks, Liquid State Machines, Backpropagation-Decorrelation). Researchers are starting to appreciate the correlation with ELMs, as an example the following paper uses an architecture composed of a combination of a ESN and a ELM:
http://www.sciencedirect.com/science/article/pii/S0893608012003085
There is also another connection between ELMs and a neuroscientific framework called Neuroengineering, that I first learned about here:
http://www.sciencedirect.com/science/article/pii/S089360801300049X
By going back in time one could also connect these architectures to the earliest works on NNs (eg the Perceptron) or to the works on Random Projections in the Eighties...
Sorry for the long post and congratulations for the blog.
Simone; Tuesday, December 17, 2013 at 3:03:00 PM CST