PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Venue
Proceedings of the 35th International Conference on Very Large Data Bases (VLDB-2009)
Publication Year
2009
Authors
Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo
Abstract
Classification and regression tree learning on massive datasets is a common data
mining task at Google, yet many state-of-the-art tree learning algorithms require
training data to reside in memory on a single machine. While more scalable
implementations of tree learning have been proposed, they typically require
specialized parallel computing architectures. In contrast, the majority of Google’s
computing infrastructure is based on commodity hardware. In this paper, we describe
PLANET: a scalable distributed framework for learning tree models over large
datasets. PLANET defines tree learning as a series of distributed computations, and
implements each one using the MapReduce model of distributed computation. We show
how this framework supports scalable construction of classification and regression
trees, as well as ensembles of such models. We discuss the benefits and challenges
of using a MapReduce compute cluster for tree learning, and demonstrate the
scalability of this approach by applying it to a real world learning task from the
domain of computational advertising.
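To make the abstract's core idea concrete, here is a minimal single-process sketch (not PLANET's actual code or API; all names are illustrative) of how one step of regression-tree induction, finding the best split for a node, can be phrased as a map phase emitting partial statistics and a reduce phase aggregating them:

```python
# Illustrative sketch only: one tree-growing step expressed as map + reduce.
# The mapper emits partial sufficient statistics per candidate split; the
# reducer aggregates them so the best split can be chosen without holding
# the training data in memory on one machine.
from collections import defaultdict

def map_phase(records, candidate_splits):
    """For each record, emit (count, sum, sum-of-squares) partial stats
    keyed by (feature, threshold, side)."""
    for x, y in records:
        for feature, threshold in candidate_splits:
            side = "left" if x[feature] < threshold else "right"
            yield (feature, threshold, side), (1, y, y * y)

def reduce_phase(pairs):
    """Aggregate partial stats per key, as a MapReduce reducer would."""
    agg = defaultdict(lambda: [0, 0.0, 0.0])
    for key, (n, s, ss) in pairs:
        a = agg[key]
        a[0] += n; a[1] += s; a[2] += ss
    return agg

def best_split(records, candidate_splits):
    """Pick the split minimizing the summed squared error of the children."""
    agg = reduce_phase(map_phase(records, candidate_splits))
    def sse(stats):
        n, s, ss = stats
        return ss - s * s / n if n else 0.0
    best, best_err = None, float("inf")
    for feature, threshold in candidate_splits:
        err = (sse(agg[(feature, threshold, "left")]) +
               sse(agg[(feature, threshold, "right")]))
        if err < best_err:
            best, best_err = (feature, threshold), err
    return best

records = [({"x": 1.0}, 10.0), ({"x": 2.0}, 11.0),
           ({"x": 8.0}, 50.0), ({"x": 9.0}, 52.0)]
print(best_split(records, [("x", 1.5), ("x", 5.0), ("x", 8.5)]))  # → ("x", 5.0)
```

Because the emitted statistics are additive, mappers can run over disjoint shards of the data on commodity machines and the reducer only sees small aggregates, which is the property the paper exploits to scale tree learning with MapReduce.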
