Comments on Nuit Blanche: L1 -- What is it good for?

Igor (2007-08-16 15:03):

Yaroslav,

This L1/L0 discovery is really about introducing the concept of prior knowledge about the solution (something that was merely hoped for with L2). When all the features are important, L2 might as well be used, I would venture.

As for Lp with p less than 1, maybe you should ask Rick Chartrand directly (http://math.lanl.gov/~rick/) about the difficulties he may have had?

Igor.

Yaroslav Bulatov (2007-08-08 13:29):

Well, I was thinking more from the classification point of view: people say that L1 is good when only a few features are relevant, so I was wondering how it compares to L2 when all the features are relevant. In other words, are there practical situations where L2 regularization is better, in the sense of producing lower prediction error?

I suspect that Alex/Gelman are using the t-distribution for Bayesian model averaging, not for sparse decomposition. Since the t-distribution is flatter than a Gaussian, doing BMA with it would result in smaller variance.

I'd be interested to see whether anyone else has gotten good results doing Lp regularization for p < 1 with gradient descent. The problem I see is that the norm of the gradient is infinite whenever one of the coefficients is 0. If you initialize your parameters at 0, gradient descent will be stuck at 0. If you initialize one of the parameters close to 0, that parameter should go to 0 fast. In other words, it seems like gradient descent with such a regularizer would be overly sensitive to initial conditions.
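The failure mode described above is easy to reproduce. Below is a minimal sketch, assuming a least-squares loss plus an Lp penalty with p = 0.5; the data, step size, and the convention of freezing exact zeros (the penalty rises with infinite slope at 0, so 0 is a local minimizer of the penalized objective for any positive weight) are illustrative choices, not anything from the thread.

import numpy as np

# Least-squares loss plus an Lp penalty, p = 0.5 (all settings illustrative).
rng = np.random.default_rng(0)
m, n = 8, 12
A = rng.normal(size=(m, n))                  # underdetermined system
x_true = np.zeros(n)
x_true[[0, 3]] = [1.5, -2.0]
y = A @ x_true

p, lam, lr, steps = 0.5, 0.1, 1e-3, 20000

def descend(x0):
    x = x0.astype(float).copy()
    for _ in range(steps):
        g = A.T @ (A @ x - y)                # gradient of the smooth data term
        nz = x != 0
        # d/dx |x|^p = p * sign(x) * |x|^(p-1), which blows up as |x| -> 0;
        # at exactly 0 the penalty has infinite upward slope, so we treat
        # exact zeros as absorbing (0 is a local minimizer for any lam > 0).
        g[nz] += lam * p * np.sign(x[nz]) * np.abs(x[nz]) ** (p - 1)
        x[nz] -= lr * g[nz]
    return x

print(descend(np.zeros(n)))          # initialized at 0: stuck at 0 forever
print(descend(np.full(n, 1e-3)))     # near-zero start: most coordinates pinned
print(descend(rng.normal(size=n)))   # generic start: a different endpoint

With a fixed step size, coordinates approaching 0 do not settle but chatter in a small neighborhood of it (the penalty gradient grows without bound there), and the three runs typically end at visibly different points, which is exactly the sensitivity to initial conditions noted above.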
Igor (2007-08-07 17:28):

I don't know. L1 is enjoying a nice ride because of the sparsity property. If the decomposition is not sparse (exactly or approximately), then solving an underdetermined system will produce infinitely many solutions, whether you use L2 or L1.

Some people would say that if your decomposition is not sparse, then maybe you are solving the wrong problem.

Does that answer your question?

Igor.

Yaroslav Bulatov (2007-08-07 14:55):

The second question is how L1 compares to L2 when the signal is not sparse in the basis we are considering. Will L1 work better than no regularization at all? Will L2 work better than L1?

Igor (2007-08-06 09:11):

Yaroslav,

You are absolutely right. There is a good reason I needed a break; let me make a huge correction in the post. It still remains that, empirically, the statistics folks seem to find that the t-distribution yields sparser representations, and as you point out, that does not make sense in light of what we currently know about L1 regularization doing better than L2.

Thanks for pointing out this gross error.

I am not sure I fully understand the second question.

Igor.

Yaroslav Bulatov (2007-07-26 14:31):

I don't see how the t-distribution can favor sparse representations. log(1 + d^2) has the same contours as d^2, so MAP estimation with a t-distribution or a Gaussian should follow the same regularization path.

One thing that intrigues me is how important the original encoding of the features is for L1 to work. Suppose you take your data X to AX, where A is some dense matrix. If a good solution was sparse in the original representation, it is no longer sparse in the new representation. Will L1 regularization still work better than L2?
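Yaroslav's last question can be made concrete with a small experiment. The sketch below uses equality-constrained L1 minimization (basis pursuit, solved as a linear program) as a stand-in for L1 regularization; the matrix names (M for the system, B for the dense mixing matrix), the sizes, and the seed are assumptions for illustration, not anything from the thread.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, y):
    """min ||x||_1 s.t. Mx = y, via the standard LP split x = x+ - x-."""
    m, n = M.shape
    res = linprog(c=np.ones(2 * n),
                  A_eq=np.hstack([M, -M]), b_eq=y,
                  bounds=(0, None))
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(1)
m, n, k = 20, 50, 3
M = rng.normal(size=(m, n))          # underdetermined system
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = 3 * rng.normal(size=k)
y = M @ x_true

# Original coordinates: the target is k-sparse, and min-L1 typically recovers it.
print(np.linalg.norm(basis_pursuit(M, y) - x_true))           # usually ~0

# Dense change of variables z = B x: z_true = B @ x_true is dense, and the
# min-L1 point in z-coordinates is generally a different feasible z entirely.
B = rng.normal(size=(n, n))
z_hat = basis_pursuit(M @ np.linalg.inv(B), y)
print(np.linalg.norm(np.linalg.inv(B) @ z_hat - x_true))      # typically large

After the dense change of variables nothing feasible is sparse, so the L1 objective has no particular reason to single out z_true; this is the sense in which L1's advantage is tied to the coordinate system.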