Publication Data
L1 and L2 Regularization for Multiclass Hinge Loss Models
Abstract: This paper investigates the relationship between the loss
function, the type of regularization, and the resulting model sparsity of
discriminatively trained multiclass linear models. The effects on sparsity of
optimizing log loss are straightforward: L2 regularization produces very dense models
while L1 regularization produces much sparser models. However, optimizing hinge loss
yields more nuanced behavior. We give experimental evidence and theoretical arguments
that, for a class of problems that arises frequently in natural-language processing,
both L1- and L2-regularized hinge loss lead to sparser models than L2-regularized log
loss, but less sparse models than L1-regularized log loss. Furthermore, we give
evidence and arguments that for models with only indicator features, there is a
critical threshold on the weight of the regularizer below which L1- and L2-regularized
hinge loss tend to produce models of similar sparsity.
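
One common intuition for why hinge loss can yield sparser models than log loss can be sketched in code. This is a minimal illustration, not the paper's formulation: the abstract does not give the exact losses, so the sketch assumes a Crammer-Singer style multiclass hinge loss with margin 1 and a softmax log loss, and the function names and the toy weights are hypothetical. It shows that an example classified with sufficient margin contributes a zero hinge subgradient, while the log loss gradient for the same example touches every weight.

```python
import numpy as np

def hinge_loss_and_grad(W, x, y):
    """Multiclass hinge loss (assumed Crammer-Singer style, margin 1) for one example.

    W: (num_classes, num_features) weights, x: (num_features,) features, y: true label.
    Returns the loss and one subgradient with respect to W.
    """
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0                      # the true class incurs no violation
    k = int(np.argmax(margins))           # most-violating competing class
    loss = max(0.0, float(margins[k]))
    grad = np.zeros_like(W)
    if loss > 0.0:                        # only margin violations update weights
        grad[k] += x
        grad[y] -= x
    return loss, grad

def log_loss_and_grad(W, x, y):
    """Softmax log loss for one example and its gradient with respect to W."""
    scores = W @ x
    scores = scores - scores.max()        # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = float(-np.log(probs[y]))
    delta = probs.copy()
    delta[y] -= 1.0                       # predicted probabilities minus one-hot label
    return loss, np.outer(delta, x)

# Toy example (hypothetical values): 3 classes, 2 indicator features, both active.
W = np.array([[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
x = np.array([1.0, 1.0])
y = 0

hinge_loss, hinge_grad = hinge_loss_and_grad(W, x, y)
log_loss, log_grad = log_loss_and_grad(W, x, y)

hinge_nonzero = int(np.count_nonzero(hinge_grad))
log_nonzero = int(np.count_nonzero(log_grad))
print(hinge_loss, hinge_nonzero)
print(log_loss, log_nonzero)
```

In this toy case the true class wins by a margin of 2, so the hinge loss and its subgradient are both zero and only the regularizer acts on the weights. The log loss remains positive and its gradient has a nonzero entry for every class and active feature.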
