Training Highly Multi-class Linear Classifiers
Venue
Journal of Machine Learning Research (JMLR) (2014), 1461–1492
Publication Year
2014
Authors
Maya R. Gupta, Samy Bengio, Jason Weston
Abstract
Classification problems with thousands or more classes often have a large variance
in the confusability between classes, and we show that the more-confusable classes
add more noise to the empirical loss that is minimized during training. We propose
an online solution that reduces the effect of highly confusable classes in training
the classifier parameters, and focuses the training on pairs of classes that are
easier to differentiate at any given time in the training. We also show that the
Adagrad method, recently proposed for automatically decreasing step sizes in
convex stochastic gradient descent optimization, can also be profitably applied to
the nonconvex stochastic gradient descent training of a joint supervised
dimensionality reduction and linear classifier. Experiments on ImageNet
benchmark datasets and proprietary image recognition problems with 15,000 to 97,000
classes show substantial gains in classification accuracy compared to one-vs-all
linear SVMs and Wsabie.
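To make the second idea concrete, the following is a minimal sketch, not the paper's
algorithm, of Adagrad-style per-coordinate step sizes applied to stochastic gradient
descent training of a joint low-rank projection and linear classifier. It uses a
pairwise hinge loss between the true class and one uniformly sampled competing class;
the paper instead biases that choice toward pairs of classes that are currently easier
to differentiate. All names, dimensions, and loss details here are illustrative
assumptions.

import numpy as np

# Hypothetical sketch: Adagrad per-coordinate step sizes for SGD training of a
# joint low-rank projection A (dimensionality reduction) and linear classifier B,
# so that class scores are s(x) = B @ (A @ x).  The pairwise hinge loss below is
# a stand-in for the paper's actual objective.

rng = np.random.default_rng(0)

d, k, C = 100, 16, 50                    # input dim, embedding dim, number of classes
A = 0.1 * rng.standard_normal((k, d))    # dimensionality-reduction matrix
B = 0.1 * rng.standard_normal((C, k))    # linear classifier on the embedding

eta, eps = 0.1, 1e-8
GA = np.zeros_like(A)                    # accumulated squared gradients (Adagrad)
GB = np.zeros_like(B)

def adagrad_step(param, grad, G):
    """Per-coordinate Adagrad update: step sizes shrink where past gradients are large."""
    G += grad ** 2
    param -= eta * grad / (np.sqrt(G) + eps)

def train_step(x, y):
    z = A @ x                            # embed the example
    scores = B @ z
    # Sample one competing class uniformly; the paper instead focuses training on
    # pairs of classes that are easier to differentiate at the current iterate.
    c = rng.integers(C - 1)
    c = c + (c >= y)                     # any class other than y
    margin = 1.0 - scores[y] + scores[c]
    if margin > 0:                       # pairwise hinge loss is active
        gB = np.zeros_like(B)
        gB[y] -= z
        gB[c] += z
        gA = np.outer(B[c] - B[y], x)
        adagrad_step(B, gB, GB)
        adagrad_step(A, gA, GA)

# Toy usage on random data.
for _ in range(1000):
    x = rng.standard_normal(d)
    y = rng.integers(C)
    train_step(x, y)

The per-coordinate accumulators GA and GB reduce the effective step size along
directions whose gradients have historically been large, which is what makes the
Adagrad rule attractive for this kind of nonconvex joint training.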
