No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World
Venue
NIPS 2017 workshop: Machine Learning for the Developing World
Publication Year
2017
Authors
Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, D. Sculley
Abstract
Modern machine learning systems such as image classifiers rely heavily on
large-scale data sets for training. Such data sets are costly to create, so in practice
a small number of freely available, open source data sets are widely used. Such
strategies may be particularly important for ML applications in the developing
world, where resources may be constrained and the cost of creating suitable
large-scale data sets may be a blocking factor. However, we suggest that examining the
geo-diversity of open data sets is critical before adopting a data set for
such use cases. In particular, we analyze two large, publicly available image data
sets to assess geo-diversity and find that these data sets appear to exhibit an
observable Amerocentric and Eurocentric representation bias. Further, we perform
targeted analysis on classifiers that use these data sets as training data to
assess the impact of these training distributions, and find strong differences in
the relative performance on images from different locales. These results emphasize
the need to ensure geo-representation when constructing data sets for use in the
developing world.
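
The two analyses described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `country_shares` and `accuracy_by_locale`, the record layout, and the toy values are all assumptions made up for illustration. The first function measures how a data set's images are distributed across countries; the second compares classifier accuracy across locales.

```python
from collections import Counter, defaultdict


def country_shares(countries):
    """Return each country's fraction of the images (hypothetical helper)."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


def accuracy_by_locale(records):
    """Return accuracy per locale (hypothetical helper).

    `records` is an iterable of (locale, true_label, predicted_label) tuples.
    """
    seen = defaultdict(int)
    correct = defaultdict(int)
    for locale, truth, predicted in records:
        seen[locale] += 1
        correct[locale] += int(truth == predicted)
    return {locale: correct[locale] / seen[locale] for locale in seen}


# Toy inputs, invented for illustration only; they are not results from the paper.
toy_countries = ["US", "US", "US", "GB", "IN"]
toy_records = [
    ("US", "a", "a"),
    ("US", "b", "b"),
    ("IN", "a", "b"),
    ("IN", "b", "b"),
]

shares = country_shares(toy_countries)
accuracy = accuracy_by_locale(toy_records)
```

A large gap between the shares of different countries, or between per-locale accuracies, is the kind of signal the abstract refers to as representation bias and locale-dependent performance.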