TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Venue
KDD 2017
Publication Year
2017
Authors
Akshay Naresh Modi, Chiu Yuen Koo, Chuan Yu Foo, Clemens Mewald, Denis M. Baylor, Eric Breck, Heng-Tze Cheng, Jarek Wilkiewicz, Levent Koc, Lukasz Lew, Martin A. Zinkevich, Martin Wicke, Mustafa Ispir, Neoklis Polyzotis, Noah Fiedel, Salem Elie Haykal, Steven Whang, Sudip Roy, Sukriti Ramesh, Vihan Jain, Xin Zhang, Zakaria Haque
Abstract
Creating and maintaining a platform for reliably producing and deploying machine
learning models requires careful orchestration of many components—a learner for
generating models based on training data, modules for analyzing and validating both
data and models, and finally infrastructure for serving models in
production. This becomes particularly challenging when data changes over time and
fresh models need to be produced continuously. Unfortunately, such orchestration is
often done ad hoc using glue code and custom scripts developed by individual teams
for specific use cases, leading to duplicated effort and fragile systems with high
technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based
general-purpose machine learning platform implemented at Google. By integrating the
aforementioned components into one platform, we were able to standardize the
components, simplify the platform configuration, and reduce the time to production
from the order of months to weeks, while providing platform stability that
minimizes disruptions. We present the case study of one deployment of TFX in the
Google Play app store, where the machine learning models are refreshed continuously
as new data arrive. Deploying TFX led to reduced custom code, faster experiment
cycles, and a 2% increase in app installs resulting from improved data and model
analysis.