L1 and L2 Regularization for Multiclass Hinge Loss Models
Venue
Symposium on Machine Learning in Speech and Natural Language Processing (2011)
Publication Year
2011
Authors
Robert C. Moore, John DeNero
Abstract
This paper investigates the relationship between the loss function, the type of
regularization, and the resulting model sparsity of discriminatively-trained
multiclass linear models. The effects on sparsity of optimizing log loss are
straightforward: L2 regularization produces very dense models while L1
regularization produces much sparser models. However, optimizing hinge loss yields
more nuanced behavior. We give experimental evidence and theoretical arguments
that, for a class of problems that arises frequently in natural-language
processing, both L1- and L2-regularized hinge loss lead to sparser models than
L2-regularized log loss, but less sparse models than L1-regularized log loss.
Furthermore, we give evidence and arguments that for models with only indicator
features, there is a critical threshold on the weight of the regularizer below
which L1- and L2-regularized hinge loss tend to produce models of similar
sparsity.
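
As a rough illustration of the kind of comparison the abstract describes (not the paper's experimental setup), the following sketch trains multiclass linear models with log loss and hinge loss under L1 and L2 regularization and counts the nonzero weights in each. It assumes scikit-learn and a synthetic dense dataset; the estimators, hyperparameters, and data are illustrative assumptions rather than the authors' method, and the paper's experiments concern natural-language tasks with indicator features.

# Illustrative sketch (assumed setup, not the paper's): compare the number of
# nonzero weights in multiclass linear models trained with log loss vs. hinge
# loss under L1 and L2 regularization, using scikit-learn on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Synthetic multiclass problem; the real experiments use NLP indicator features.
X, y = make_classification(
    n_samples=2000, n_features=500, n_informative=50,
    n_classes=4, n_clusters_per_class=2, random_state=0,
)

models = {
    "log loss, L2":   LogisticRegression(penalty="l2", C=1.0, max_iter=2000),
    "log loss, L1":   LogisticRegression(penalty="l1", C=1.0,
                                         solver="liblinear", max_iter=2000),
    "hinge loss, L2": SGDClassifier(loss="hinge", penalty="l2",
                                    alpha=1e-3, max_iter=2000, random_state=0),
    "hinge loss, L1": SGDClassifier(loss="hinge", penalty="l1",
                                    alpha=1e-3, max_iter=2000, random_state=0),
}

for name, clf in models.items():
    clf.fit(X, y)
    # coef_ has shape (n_classes, n_features); count nonzero weights overall.
    nnz = np.count_nonzero(clf.coef_)
    print(f"{name:15s} nonzero weights: {nnz}/{clf.coef_.size}")

With hyperparameters chosen to make the regularizer influential, a comparison of this kind would be expected to show L2-regularized log loss producing the densest model and L1-regularized log loss the sparsest, with the two hinge-loss variants in between, mirroring the ordering claimed in the abstract.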
