Jump to Content
Ashish Gupta

Ashish Gupta

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    F1 Query: Declarative Querying at Scale
    Bart Samwel
    Ben Handy
    Jason Govig
    Chanjun Yang
    Daniel Tenedorio
    Felix Weigel
    David G Wilhite
    Jiacheng Yang
    Jun Xu
    Jiexing Li
    Zhan Yuan
    Qiang Zeng
    Ian Rae
    Anurag Biyani
    Andrew Harn
    Yang Xia
    Andrey Gubichev
    Amr El-Helw
    Orri Erling
    Allen Yan
    Mohan Yang
    Yiqun Wei
    Thanh Do
    Colin Zheng
    Somayeh Sardashti
    Ahmed Aly
    Divy Agrawal
    Shivakumar Venkataraman
    PVLDB (2018), pp. 1835-1848
    Preview abstract F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems (e.g., BigTable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines transforming data from multiple data sources into formats more suitable for analysis and reporting. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems. View details
    Ubiq: A Scalable and Fault-tolerant Log Processing Infrastructure
    Alexander Smolyanov
    Divy Agrawal
    Haifeng Jiang
    Manish Bhatia
    Monica Chawathe Lenart
    Namit Sikka
    Navin Melville
    Scott Holzer
    Shan He
    Shivakumar Venkataraman
    Tianhao Qiu
    Venkatesh Basker
    Vinny Ganeshan
    Yuri Vasilevski
    Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2016)
    Preview abstract Most of today’s Internet applications are data-centric and generate vast amounts of data (typically, in the form of event logs) that needs to be processed and analyzed for detailed reporting, enhancing user experience and increasing monetization. In this paper, we describe the architecture of Ubiq, a geographically distributed framework for processing continuously growing log files in real time with high scalability, high availability and low latency. The Ubiq framework fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. It also guarantees exactly-once semantics for application pipelines to process logs in the form of event bundles. Ubiq has been in production for Google’s advertising system for many years and has served as a critical log processing framework for hundreds of pipelines. Our production deployment demonstrates linear scalability with machine resources, extremely high availability even with underlying infrastructure failures, and an end-to-end latency of under a minute. View details
    High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
    Workshop on Business Intelligence for the Real Time Enterprise (BIRTE), Springer (2015) (to appear)
    Preview abstract Google’s Ads Data Infrastructure systems run the multi- billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g. Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently. This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years. View details
    Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
    Fan Yang
    Jason Govig
    Adam Kirsch
    Kelvin Chan
    Kevin Lai
    Shuo Wu
    Sandeep Dhoot
    Abhilash Kumar
    Mingsheng Hong
    Jamie Cameron
    Masood Siddiqi
    David Jones
    Andrey Gubarev
    Shivakumar Venkataraman
    Divyakant Agrawal
    VLDB (2014)
    Preview abstract Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves. View details
    Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
    Rajagopal Ananthanarayanan
    Venkatesh Basker
    Sumit Das
    Haifeng Jiang
    Tianhao Qiu
    Alexey Reznichenko
    Deomid Ryabkov
    Shivakumar Venkataraman
    SIGMOD '13: Proceedings of the 2013 international conference on Management of data, ACM, New York, NY, USA, pp. 577-588
    Preview abstract Web-based enterprises process events generated by millions of users interacting with their websites. Rich statistical data distilled from combining such interactions in near real-time generates enormous business value. In this paper, we describe the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency, where the streams may be unordered or delayed. The system fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. Photon guarantees that there will be no duplicates in the joined output (at-most-once semantics) at any point in time, that most joinable events will be present in the output in real-time (near-exact semantics), and exactly-once semantics eventually. Photon is deployed within Google Advertising System to join data streams such as web search queries and user clicks on advertisements. It produces joined logs that are used to derive key business metrics, including billing for advertisers. Our production deployment processes millions of events per minute at peak with an average end-to-end latency of less than 10 seconds. We also present challenges and solutions in maintaining large persistent state across geographically distant locations, and highlight the design principles that emerged from our experience. View details
    QoS Aware Path Protection Schemes for MPLS Networks
    B. N. Jain
    Satish Tripathi
    QoS Aware Path Protection Schemes for MPLS Networks, International Conference on Computer Communications (2002)