Goods: Organizing Google's Datasets
Venue
SIGMOD (2016) (to appear)
Publication Year
2016
Authors
Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang
Abstract
Enterprises increasingly rely on structured datasets to run their businesses. These
datasets take a variety of forms, such as structured files, databases,
spreadsheets, or even services that provide access to the data. The datasets often
reside in different storage systems, may vary in their formats, and may change every
day. In this paper, we present Goods, a project to rethink how we organize
structured datasets at scale, in a setting where teams use diverse and often
idiosyncratic ways to produce the datasets and where there is no centralized system
for storing and querying them. Goods extracts metadata ranging from
salient information about each dataset (owners, timestamps, schema) to
relationships among datasets, such as similarity and provenance. It then exposes
this metadata through services that allow engineers to find datasets within the
company, to monitor datasets, to annotate them in order to enable others to use
their datasets, and to analyze relationships between them. We discuss the technical
challenges that we had to overcome in order to crawl and infer the metadata for
billions of datasets, to maintain the consistency of our metadata catalog at scale,
and to expose the metadata to users. We believe that many of the lessons that we
learned are applicable to building large-scale enterprise-level data management
systems in general.
