Spotted the following paper in the blog entry More Deep Learning Musings by Paul Mineiro, one can read the following in that blog entry:
...For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “hey 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is happening here, but that's just a guess, and I certainly don't understand it....
there is more in the blog entry go read it. Here is the paper pointed to in that blog entry
Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale-up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature space data matrix, while the second scheme gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state of the art Deep Neural Networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
This comment has been removed by the author.
ReplyDeleteWhat comment in particular did you have in mind ?
ReplyDeleteIgor.
This comment has been removed by the author.
ReplyDeleteI just found this early paper about using the Walsh Hadamard transform to generate Gaussian random numbers.
ReplyDeletehttps://archive.org/details/bitsavers_mitreESDTe69266ANewMethodofGeneratingGaussianRando_2706065
That then is the earliest I know.