Sunday, December 15, 2013

Sunday Morning Insight: Randomization is not a dirty word

From [8]

The recent announcement of Yann LeCun's appointment as a director of the new Artificial Intelligence Lab at Facebook and Geoff Hinton's move with Google sends a pretty clear signal that neural networks are OK and that the time has come for 'AI for images' i.e. image understanding. Beyond the cleverness of the algorithms developed by those two researchers and their collaborators, lies several technical prowesses. One of the most important one is hardware based:  Moore's Law, one of the steamrollers (Predicting the Future: The Steamrollers) that now allows the computation of parameters from larger neural networks with desktop computers.

Parallel to these developments (highlighted in Predicting the Future: Randomness and Parsimony), is the mathematics behind the study of high dimensional objects, a field that has really taken off since compressive sensing back in 2004. In fact, mathematically speaking, the ball got really rolling with Johnson and Lindenstrauss back in 1984. With these results, randomization suddenly became something of a wonder tool. In linear algebra and attendant fields, its use is only starting to get noticed (see the recent Slowly but surely they'll join our side of the Force... ). In fact, several folks decided to give a word to this field: Randomized Numerical Linear Algebra (RandNLA) -Most blog entries on the subject are under the RandNLA tag)- 

All is well and peachy keen but how is randomization connected to the nonlinear world of deep learning and neural networks ?

If you were to read the literature, you'd think you'd be watching some episode of Game of Thrones. Not the treacherous and gratitious violence, but rather the many tribes living in their own islands of knowledge. Here is a small case in point: as Joseph Ghafari was making a presentation at the recent Paris Machine Learning Meetup #6 on designing neural networks for botnet detection, I was struck by the similarity between the ELM approach to neural networks spearheaded by Guang-Bin Huang since 2000 [22] and the work on Random Kitchen Sinks [1,2,13] or the work on 1bit and quantization CS [23,24,25] and even more recent work on high dimensional function estimation with ridge functionals [3-9]

Similarly, I do not see papers proposing SVM with Extreme Learning Machine (ELM) techniques mentioning the random kitchen sinks approach,but I am sure I am not reading the right papers either.

None of this really matter. It is a question of time before they actually do talk to each other. But much like what happened in compressive sensing where we had more than 50 sparsity seeking solvers, we really had no real way of judging which one was really interesting. The sharp phase transitions ( see Sunday Morning Insight: The Map Makers ) eventually were the only way to figure this one out. One wonders if they could provide rules of thumbs for understanding these nonlinear approximations of the identity [14]. At the very least, much like in Game of Thrones, the walls, or the sharp phase transition will act as an acid test for all involved. Here are some of the questions that could be answered. From the ELM theory Open Problems page:

  1. As observed in experimental studies, the performance of basic ELM is stable in a wide range of number of hidden nodes. Compared to the BP learning algorithm, the performance of basic ELM is not very sensitive to the number of hidden nodes. However, how to prove it in theory remains open.
  2. One of the typical implementations of ELM is to use random nodes in the hidden layer and the hidden layer of SLFNs need not be tuned. It is interesting to see that the generalization performance of ELM turns out to be very stable. How to estimate the oscillation bound of the generalization performance of ELM remains open too.
  3. It seems that ELM performs better than other conventional learning algorithms in applications with higher noise. How to prove it in theory is not clear.
  4. ELM always has faster learning speed than LS-SVM if the same kernel is used?
  5. ELM provides a batch learning kernel solution which is much simpler than other kernel learning algorithms such as LS-SVM. It is known that it may not be straightforward to have an efficient online sequential implementation of SVM and LS-SVM. However, due to the simplicity of ELM, is it possible to implement the online sequential variant of the kernel based ELM?
  6. ELM always provides similar or better generalization performance than SVM and LS-SVM if the same kernel is used (if not affected by computing devices' precision)?
  7. ELM tends to achieve better performance than SVM and LS-SVM in multiclasses applications, the higher the number of classes is, the larger the difference of their generalization performance will be?
  8. Scalability of ELM with kernels in super large applications.
  9. Parallel and distributed computing of ELM.
  10. ELM will make real-time reasoning feasible?
  11. The hidden node / neuron parameters are not only independent of the training data but also of each other.
  12. Unlike conventional learning methods which MUST see the training data before generating the hidden node / neuron parameters, ELM could randomly generate the hidden node / neuron parameters before seeing the training data.
So far only few works seem to find those transitions. To a certain extent, it looks to be the case in the quantization compressive sensing case [23]  

From [23]

From [8]

or in the examples featured in last  Sunday Morning Insight's that talked about Sharp Phase Transitions in Machine Learning.. . in all, while randomization helps those nonlinear models in getting better results, the acid test that will allow us figure out which one is the most robust, will come from their behaviors with respect the sharp phase transitions...If you are presently using a specific algorithm: Are you going to provide the rest of the community with new charts ? Are you going to be the next Map Makers ?

[2] Random Features for Large-Scale Kernel Machines by Ali Rahimi and Ben Recht
We consider the problem of learning multi-ridge functions of the form f(x) = g(Ax) from point evaluations of f. We assume that the function f is defined on an l_2-ball in R^d, g is twice continuously differentiable almost everywhere, and A \in R^{k \times d} is a rank k matrix, where k << d. We propose a randomized, polynomial-complexity sampling scheme for estimating such functions. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive a polynomial time estimator of the function f along with uniform approximation guarantees. We prove that our scheme can also be applied for learning functions of the form: f(x) = \sum_{i=1}^{k} g_i(a_i^T x), provided f satisfies certain smoothness conditions in a neighborhood around the origin. We also characterize the noise robustness of the scheme. Finally, we present numerical examples to illustrate the theoretical bounds in action.

Let us assume that f is a continuous function defined on the unit ball of Rd, of the form f(x)=g(Ax), where A is a k×d matrix and g is a function of k variables for k≪d. We are given a budget m∈N of possible point evaluations f(xi), i=1,...,m, of f, which we are allowed to query in order to construct a uniform approximating function. Under certain smoothness and variation assumptions on the function g, and an {\it arbitrary} choice of the matrix A, we present in this paper
1. a sampling choice of the points {xi} drawn at random for each function approximation;
2. algorithms (Algorithm 1 and Algorithm 2) for computing the approximating function, whose complexity is at most polynomial in the dimension d and in the number m of points.
Due to the arbitrariness of A, the choice of the sampling points will be according to suitable random distributions and our results hold with overwhelming probability. Our approach uses tools taken from the {\it compressed sensing} framework, recent Chernoff bounds for sums of positive-semidefinite matrices, and classical stability bounds for invariant subspaces of singular value decompositions.

We study properties of ridge functions $f(x)=g(a\cdot x)$ in high dimensions $d$ from the viewpoint of approximation theory. The considered function classes consist of ridge functions such that the profile $g$ is a member of a univariate Lipschitz class with smoothness $\alpha > 0$ (including infinite smoothness), and the ridge direction $a$ has $p$-norm $\|a\|_p \leq 1$. First, we investigate entropy numbers in order to quantify the compactness of these ridge function classes in $L_{\infty}$. We show that they are essentially as compact as the class of univariate Lipschitz functions. Second, we examine sampling numbers and face two extreme cases. In case $p=2$, sampling ridge functions on the Euclidean unit ball faces the curse of dimensionality. It is thus as difficult as sampling general multivariate Lipschitz functions, a result in sharp contrast to the result on entropy numbers. When we additionally assume that all feasible profiles have a first derivative uniformly bounded away from zero in the origin, then the complexity of sampling ridge functions reduces drastically to the complexity of sampling univariate Lipschitz functions. In between, the sampling problem's degree of difficulty varies, depending on the values of $\alpha$ and $p$. Surprisingly, we see almost the entire hierarchy of tractability levels as introduced in the recent monographs by Novak and Wo\'zniakowski.
[10] Conic Geometric Programming by Venkat Chandrasekaran and Parikshit Shah
[12] Random Conic Pursuit for Semidefinite Programming by Ariel Kleiner, Ali Rahimi, Michael I. Jordan
We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) repeatedly solving randomly generated optimization problems over two-dimensional subcones of the PSD cone. This scheme is simple, easy to implement, applicable to general nonlinear SDPs, and scalable. These advantages come at the expense of inexact solutions, though we show these to be practically useful and provide theoretical guarantees for some of them. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are based on random data and so exact minima are often not a priority.
[13] Random Features for Large-Scale Kernel Machines by Ali Rahimi and Ben Recht
[17] Benoît Frénay's blog ,
[21] Nystrom Method vs Random Fourier Features:: A Theoretical and Empirical Comparison  by Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou
[22] Extreme Learning Machine (ELM),
[23] J. N. Laska Regime Change. (includes: BSE, several algorithms, numerical work, and regime change)


Simone Scardapane said...

Very nice article, the explosion of randomized techniques in ML is one of the things that interest me most. I would add to the discussion the work on Reservoir Computing approaches, e.g. Echo State Networks, that can be seen as the recurrent counterpart to ELM methods. I would love to see the connections between all these subfields analyzed in detail.

Igor said...


Thanks for the heads-up.

The dearth of papers and the fact that these communities barely know each other is a sure sign we are at the edge of our collective knowledge. If you do make the connection between these and the ESN, then please let me know, that new community needs to know about this.



Simone Scardapane said...

Yes, diversity of approaches is at the same time stimulating and bewildering.
As an example I found truly fascinating that the same idea of "randomizing" a recurrent neural network came up three times in the span of a few years (Echo State Networks, Liquid State Machines, Backpropagation-Decorrelation). Researchers are starting to appreciate the correlation with ELMs, as an example the following paper uses an architecture composed of a combination of a ESN and a ELM:
There is also another connection between ELMs and a neuroscientific framework called Neuroengineering, that I first learned about here:
By going back in time one could also connect these architectures to the earliest works on NNs (eg the Perceptron) or to the works on Random Projections in the Eighties...
Sorry for the long post and congratulations for the blog.