
Algorithms & optimization

We perform fundamental research in algorithms, markets, optimization, and graph analysis, and use it to deliver solutions to challenges across Google’s business.


About the team

Our team comprises multiple overlapping research groups working on graph mining, large-scale optimization, and market algorithms. We collaborate closely with teams across Google, benefiting Ads, Search, YouTube, Play, Infrastructure, Geo, Social, Image Search, Cloud, and more. Alongside these collaborations, we perform research on the algorithmic foundations of machine learning, distributed optimization, economics, data mining, and data-driven optimization. Our researchers are involved in both long-term research efforts and immediate applications of our technology.

Examples of recent research interests include online ad allocation problems, distributed algorithms for large-scale graph mining, mechanism design for advertising exchanges, and robust and dynamic pricing for ad auctions.

Team focus summaries

Large-scale optimization

Our mission is to develop large-scale, distributed, and data-driven optimization techniques and use them to improve the efficiency and robustness of infrastructure and machine learning systems at Google. We achieve such goals as increasing throughput and decreasing latency in distributed systems, or improving feature selection and parameter tuning in machine learning. To do this, we apply techniques from areas such as combinatorial optimization, online algorithms, and control theory. Our research is used in critical infrastructure that supports products such as Search and Cloud.


Understanding places

Our mission is to discover all the world’s places and to understand people’s interactions with those places. We accomplish this by using ML to develop deep understanding of user trajectories and actions in the physical world, and we apply that understanding to solve the recurrent hard problems in geolocation data analysis. This research has enabled many of the novel features that appear in Google geo applications such as Maps.

Structured information extraction

Our mission is to extract salient information from templated documents and web pages and then use that information to assist users. We focus our efforts on extracting data such as flight information from email, event data from the web, and product information from the web. This enables many features in products such as Google Now, Search, and Shopping.

Search and information retrieval

Our mission is to conduct research to enable new or more effective search capabilities. This includes developing a deeper understanding of correlations between documents and queries; modeling user attention and product satisfaction; developing Q&A models, particularly for the “next billion Internet users”; and developing effective personal search models even when Google engineers cannot inspect private user input data.

Medical knowledge and learning

Our mission is to offer a premier source of high-quality medical information throughout the user's online health journey. We provide relevant, targeted medical information to users by applying advanced ML to Google Search data. Examples of technologies created by this team include Symptom Search, Allergy Prediction, and other epidemiological applications.

Featured publications

Quick Access: Building a Smart Experience for Google Drive
Alexandrin Popescul
Julian Gibbons
Alan Green
Michael James Smith
Cayden Meyer
Reuben Kan
Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), pp. 1643-1651
Google Drive is a cloud storage and collaboration service used by hundreds of millions of users around the world. Quick Access is a new feature in Google Drive that surfaces relevant documents to the user on the home page. We describe the development of a machine-learned service behind this feature. Our metrics show that this feature cuts the time it takes for users to locate their documents in half. The development of this product feature illustrates a number of more general challenges and constraints associated with machine learning product deployment, such as dealing with private corpora and protecting user privacy; working with data services that are not designed with machine learning in mind and may be owned and operated by different teams with different constraints; and evolving product definitions that inform the metric being optimized. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine-learning problems.
Segmenting Two-Sided Markets
Siddhartha Banerjee
Kamesh Munagala
WWW (2017)
Recent years have witnessed the rise of many successful ecommerce marketplaces like the Amazon marketplace, Uber, AirBnB, and Upwork, where a central platform mediates economic transactions between buyers and sellers. A common feature of many of these two-sided marketplaces is that the platform has full control over search and discovery, but prices are determined by the buyers and sellers. Motivated by this, we study the algorithmic aspects of market segmentation via directed discovery in two-sided markets with endogenous prices. We consider a model where an online platform knows each buyer/seller’s characteristics, and associated demand/supply elasticities. Moreover, the platform can use discovery mechanisms (search/recommendation/etc.) to control which buyers/sellers are visible to each other. This leads to a segmentation of the market into pools, following which buyers and sellers endogenously determine market clearing transaction prices within each pool. The aim of the platform is to maximize the resulting volume of transactions/welfare in the market. We develop efficient algorithms with provable guarantees under a variety of assumptions on the demand and supply functions. We also test the validity of our assumptions on demand curves inferred from NYC taxicab log data, as well as show the performance of our algorithms on synthetic experiments.
We study utility games (Vetta, FOCS 2002) where a set of players join teams to produce social utility, and receive individual utility in the form of payments in return. These games have many natural applications in competitive settings such as labor markets, crowdsourcing, etc. The efficiency of such a game depends on the profit sharing mechanism -- the rule that maps utility produced by the players to their individual payments. We study three natural and widely used profit sharing mechanisms -- egalitarian or equal sharing, marginal gain or value addition when a player joins, and marginal loss or value depletion when a player leaves. For these settings, we give tight bounds on the price of anarchy, thereby allowing comparison between these popular mechanisms from a (worst case) social welfare perspective.
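As a rough illustration (not code from the paper), the following Python sketch computes the three profit-sharing rules compared above for a single team, using a toy utility function; the paper's analysis covers general utility games.

```python
def team_utility(members):
    """Toy utility with diminishing returns in team size (placeholder, not from the paper)."""
    return sum(1.0 / (i + 1) for i in range(len(members)))

def equal_sharing(members):
    """Egalitarian: split the team's utility equally."""
    u = team_utility(members)
    return {p: u / len(members) for p in members}

def marginal_gain(members):
    """Pay each player the value added at the moment they join (in the given order)."""
    payments, joined = {}, []
    for p in members:
        before = team_utility(joined)
        joined.append(p)
        payments[p] = team_utility(joined) - before
    return payments

def marginal_loss(members):
    """Pay each player the value lost if they alone left the full team."""
    full = team_utility(members)
    return {p: full - team_utility([q for q in members if q != p]) for p in members}

if __name__ == "__main__":
    team = ["alice", "bob", "carol"]
    print(equal_sharing(team))
    print(marginal_gain(team))
    print(marginal_loss(team))
```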
We study the problem of automatically and efficiently generating itineraries for users who are on vacation. We focus on the common case, wherein the trip duration is more than a single day. Previous efficient algorithms based on greedy heuristics suffer from two problems. First, the itineraries are often unbalanced, with excellent days visiting top attractions followed by days of exclusively lower-quality alternatives. Second, the trips often re-visit neighborhoods repeatedly in order to cover increasingly low-tier points of interest. Our primary technical contribution is an algorithm that addresses both these problems by maximizing the quality of the worst day. We give theoretical results showing that this algorithm's competitive factor is within a factor two of the guarantee of the best available algorithm for a single day, across many variations of the problem. We also give detailed empirical evaluations using two distinct datasets: (a) anonymized Google historical visit data and (b) Foursquare public check-in data. We show first that the overall utility of our itineraries is almost identical to that of algorithms specifically designed to maximize total utility, while the utility of the worst day of our itineraries is roughly twice that obtained from other approaches. We then turn to evaluation based on human raters who score our itineraries only slightly below the itineraries created by human travel experts with deep knowledge of the area.
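The following Python snippet is only an illustrative max-min heuristic in the spirit of "maximize the quality of the worst day"; the paper's algorithm and its competitive analysis are more involved, and the attraction scores and per-day capacity here are made up.

```python
def balance_days(attraction_scores, num_days, per_day=3):
    """Greedily give the next-best attraction to the currently worst (lowest-total) day."""
    days = [{"total": 0.0, "items": []} for _ in range(num_days)]
    for name, score in sorted(attraction_scores.items(), key=lambda kv: -kv[1]):
        open_days = [d for d in days if len(d["items"]) < per_day]
        if not open_days:
            break
        worst = min(open_days, key=lambda d: d["total"])
        worst["items"].append(name)
        worst["total"] += score
    return days

if __name__ == "__main__":
    scores = {"museum": 9, "tower": 8, "park": 7, "gallery": 6, "market": 5, "bridge": 4}
    for i, day in enumerate(balance_days(scores, num_days=2)):
        print(f"day {i}: {day['items']} (total quality {day['total']:.1f})")
```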
Distributed Balanced Partitioning via Linear Embedding
Ninth ACM International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 387-396
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communication cost, hence the desire for a small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks, e.g., MapReduce. The algorithm first embeds the nodes of the graph onto a line, and then processes the nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, e.g., via hierarchical clustering or Hilbert curves. We then apply four techniques: local swaps, minimum cuts on partition boundaries, contraction, and dynamic programming. Our empirical study compares these techniques with each other and with previous work in distributed algorithms, e.g., a label propagation method [34], FENNEL [32], and Spinner [23]. We report results both on a private map graph and on several public social networks, and show that our results beat previous distributed algorithms: we observe, e.g., a 15-25% reduction in cut size over [34]. We also observe that our algorithms allow for a scalable distributed implementation for any number of partitions. Finally, we apply our techniques to Google Maps Driving Directions to minimize the number of multi-shard queries, with the goal of saving CPU usage. In live experiments, we observe an ≈ 40% drop in the number of multi-shard queries when comparing our method with a standard geography-based method.
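For intuition, here is a small single-machine Python sketch of the pipeline's overall shape (order nodes along a line, cut the order into k contiguous ranges, then try boundary swaps to shrink the cut); the paper's actual algorithms are distributed and considerably more sophisticated, and the graph below is a toy example.

```python
def cut_size(edges, assignment):
    """Number of edges whose endpoints land in different parts."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def linear_partition(order, edges, k):
    """Cut the linear order into k equal contiguous ranges, then try boundary swaps."""
    n = len(order)
    assignment = {node: min(i * k // n, k - 1) for i, node in enumerate(order)}
    for i in range(n - 1):
        u, v = order[i], order[i + 1]
        if assignment[u] != assignment[v]:
            before = cut_size(edges, assignment)
            assignment[u], assignment[v] = assignment[v], assignment[u]
            if cut_size(edges, assignment) >= before:
                assignment[u], assignment[v] = assignment[v], assignment[u]  # undo the swap
    return assignment

if __name__ == "__main__":
    order = [0, 1, 2, 3, 4, 5]   # e.g., produced by a Hilbert-curve or hierarchical-clustering embedding
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
    print(linear_partition(order, edges, k=2))
```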
We study the dynamic mechanism design problem of a seller who repeatedly sells independent items to a buyer with private values. In this setting, the seller could potentially extract the entire buyer surplus by running efficient auctions and charging an upfront participation fee at the beginning of the horizon. In some markets, such as internet advertising, participation fees are not practical since buyers expect to inspect items before purchasing them. This motivates us to study the design of dynamic mechanisms under successively more stringent requirements that capture the implicit business constraints of these markets. We first consider a periodic individual rationality constraint, which limits the mechanism to charge at most the buyer's value in each period. While this prevents large upfront participation fees, the seller can still design mechanisms that spread a participation fee across the first few auctions. These mechanisms have the unappealing feature that they provide close-to-zero buyer utility in the first auctions in exchange for higher utility in future auctions. To address this problem, we introduce a martingale utility constraint, which imposes the requirement that, from the perspective of the buyer, the next item's expected utility is equal to the present one's. Our main result is a dynamic auction satisfying martingale utility and periodic individual rationality whose profit loss with respect to first-best (full extraction of buyer surplus) is optimal up to polylogarithmic factors. The proposed mechanism is a dynamic two-tier auction with a hard floor and a soft floor: it allocates the item whenever the buyer's bid is above the hard floor and charges the minimum of the bid and the soft floor.
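A minimal Python sketch of the per-period allocation and pricing rule described above; how the two floors are set over time (to satisfy martingale utility and periodic individual rationality) is the substance of the paper and is not reproduced here, and the floor values below are placeholders.

```python
def two_tier_round(bid, hard_floor, soft_floor):
    """One period: allocate iff the bid clears the hard floor; charge min(bid, soft floor)."""
    if bid >= hard_floor:
        return True, min(bid, soft_floor)
    return False, 0.0

if __name__ == "__main__":
    for bid in (0.3, 0.8, 1.5):
        print(bid, two_tier_round(bid, hard_floor=0.5, soft_floor=1.0))
```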
An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece and compute a representative solution from it, and finally obtain a solution inside the union of the representative solutions for all pieces. This technique can be captured via the concept of composable core-sets, and has recently been applied to solve diversity maximization problems as well as several clustering problems [7,15,8]. However, for coverage and submodular maximization problems, impossibility bounds are known for this technique [15]. In this paper, we focus on the efficient construction of a randomized variant of composable core-sets, where the above idea is applied to a random clustering of the data. We employ this technique for coverage, monotone, and non-monotone submodular maximization problems. Our results significantly improve upon the hardness results for non-randomized core-sets, and imply improved results for submodular maximization in distributed and streaming settings. The effectiveness of this technique has been confirmed empirically for several machine learning applications [22], and our proof provides a theoretical foundation for this idea. In summary, we show that a simple greedy algorithm results in a 1/3-approximate randomized composable core-set for submodular maximization under a cardinality constraint. Our result also extends to non-monotone submodular functions, and leads to the first 2-round MapReduce-based constant-factor approximation algorithm with O(n) total communication complexity for either monotone or non-monotone functions. Finally, using an improved analysis technique and a new algorithm, PseudoGreedy, we present an improved 0.545-approximation algorithm for monotone submodular maximization, which is in turn the first MapReduce-based algorithm beating factor 1/2 in a constant number of rounds.
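A hedged Python sketch of the randomized composable core-set recipe described above, using a toy coverage objective: randomly partition the ground set, run greedy on each piece to produce a small core-set, then run greedy again on the union of the core-sets. The data and objective are illustrative placeholders, not from the paper.

```python
import random

def coverage(selected, sets):
    """Toy monotone submodular objective: size of the union of the chosen sets."""
    return len(set().union(*(sets[i] for i in selected))) if selected else 0

def greedy(candidates, sets, k):
    """Standard greedy for cardinality-constrained submodular maximization."""
    chosen = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        chosen.append(max(remaining, key=lambda i: coverage(chosen + [i], sets)))
    return chosen

def randomized_core_set_max(sets, k, num_machines=3, seed=0):
    rng = random.Random(seed)
    pieces = [[] for _ in range(num_machines)]
    for i in range(len(sets)):
        pieces[rng.randrange(num_machines)].append(i)           # random partition of the data
    core_sets = [greedy(piece, sets, k) for piece in pieces]     # per-machine greedy core-set
    union = [i for cs in core_sets for i in cs]
    return greedy(union, sets, k)                                # final greedy on the union

if __name__ == "__main__":
    sets = [{1, 2}, {2, 3, 4}, {4, 5}, {1, 5, 6}, {6, 7}, {3, 7}]
    print(randomized_core_set_max(sets, k=2))
```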
Online Stochastic Matching: Beating 1-1/e
S. Muthukrishnan
Symposium on the Foundations of Computer Science (FOCS) (2009)
We study the online stochastic bipartite matching problem, in a form motivated by display ad allocation on the Internet. In the online but adversarial case, the celebrated result of Karp, Vazirani, and Vazirani gives an approximation ratio of 1 − 1/e ≈ 0.632, a very familiar bound that holds for many online problems; further, the bound is tight in this case. In the online, stochastic case when nodes are drawn repeatedly from a known distribution, the greedy algorithm matches this approximation ratio, but still, no algorithm is known that beats the 1 − 1/e bound. Our main result is a 0.67-approximation online algorithm for stochastic bipartite matching, breaking this 1 − 1/e barrier. Furthermore, we show that no online algorithm can produce a 1 − ϵ approximation for an arbitrarily small ϵ for this problem. Our algorithms are based on computing an optimal offline solution to the expected instance, and using this solution as a guideline in the process of online allocation. We employ a novel application of the idea of the power of two choices from load balancing: we compute two disjoint solutions to the expected instance, and use both of them in the online algorithm in a prescribed preference order. To identify these two disjoint solutions, we solve a max-flow problem in a boosted flow graph, and then carefully decompose this maximum flow into two edge-disjoint (near-) matchings. In addition to guiding the online decision making, these two offline solutions are used to characterize an upper bound for the optimum in any scenario. This is done by identifying a cut whose value we can bound under the arrival distribution. At the end, we discuss extensions of our results to more general bipartite allocations that are important in a display ad application.
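For intuition, a small Python sketch of the "power of two choices" allocation described above: each arriving node is matched to its first-choice partner if still free, otherwise to its second choice. The two edge-disjoint matchings are hand-picked here, whereas the paper derives them from a max-flow computation on the expected instance.

```python
def online_match(arrivals, first_choice, second_choice):
    """Match each arriving node to its first-choice partner if free, else its second choice."""
    used, matches = set(), []
    for node in arrivals:
        for partner in (first_choice.get(node), second_choice.get(node)):
            if partner is not None and partner not in used:
                used.add(partner)
                matches.append((node, partner))
                break
    return matches

if __name__ == "__main__":
    # Two hand-picked edge-disjoint matchings over offline nodes {"x", "y", "z"}.
    first = {"a": "x", "b": "y", "c": "z"}
    second = {"a": "y", "b": "z", "c": "x"}
    arrivals = ["a", "c", "a", "b"]   # online node types drawn i.i.d. from a known distribution
    print(online_match(arrivals, first, second))
```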
As a fundamental tool for modeling and analyzing social and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible by developing distributed algorithms under distributed graph mining frameworks such as MapReduce, Pregel, Giraph, and the like. For these distributed algorithms to work well in practice, we need to take into account several metrics, such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease of use of the MapReduce framework, developing practical algorithms with good theoretical guarantees for basic graph problems is of great importance. In this tutorial, we first discuss how to design and implement algorithms based on the traditional MapReduce architecture. In this regard, we discuss various basic graph-theoretic problems such as computing connected components, maximum matching, MST, triangle counting, and overlapping or balanced clustering. We describe a computation model for MapReduce and the sampling, filtering, local random walk, and core-set techniques used to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting MapReduce with a distributed hash table (DHT) service, and discuss the use of a new graph processing framework called ASYMP based on asynchronous message passing. We show that using ASYMP, one can improve CPU usage and achieve significantly better running times.

Highlighted work

Some of our people