From the conclusion:
Finally, we note that stability may also provide new ideas for designing learning rules. There are a variety of successful methods in machine learning and signal processing that do not compute an exact stochastic gradient, yet are known to find quality stationary points in theory and practice. Do the ideas developed in this paper provide new insights into how to design learning rules that accelerate convergence and improve the generalization of SGM?
Train faster, generalize better: Stability of stochastic gradient descent by Moritz Hardt, Benjamin Recht, Yoram Singer
We show that any model trained by a stochastic gradient method with few iterations has vanishing generalization error. We prove this by showing the method is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. Our results apply to both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.
Applying our results to the convex case, we provide new explanations for why multiple epochs of stochastic gradient descent generalize well in practice. In the nonconvex case, we provide a new interpretation of common practices in neural networks, and provide a formal rationale for stability-promoting mechanisms in training large, deep models. Conceptually, our findings underscore the importance of reducing training time beyond its obvious benefit.
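To make the stability notion concrete, here is a minimal sketch (not from the paper; the dataset, logistic model, and learning rate are illustrative assumptions). Algorithmic stability in the sense of Bousquet and Elisseeff asks how much the learned parameters change when a single training example is replaced. The sketch runs multiple epochs of plain SGD, in the same sample order, on two datasets that differ in one example and measures the resulting parameter divergence:

```python
import numpy as np

# Illustrative sketch (not the authors' code): compare SGD runs on two
# datasets that differ in a single example -- the quantity controlled
# by uniform-stability arguments.

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

# Neighboring dataset: replace one training example.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = 1.0 - y2[0]

def sgd_logistic(X, y, epochs, lr=0.1):
    """Multiple passes of SGD on the logistic loss, in a fixed sample
    order shared across both runs (as in the stability analysis)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))  # sigmoid prediction
            w -= lr * (p - y[i]) * X[i]          # per-example gradient step
    return w

for epochs in (1, 5, 20):
    div = np.linalg.norm(sgd_logistic(X, y, epochs)
                         - sgd_logistic(X2, y2, epochs))
    print(f"epochs={epochs:2d}  parameter divergence={div:.4f}")
```

Under the paper's Lipschitz and smoothness assumptions, the divergence between the two runs stays small when the number of iterations is small, which is the mechanism behind "train faster, generalize better."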