Arif Merchant

Arif Merchant is a Research Scientist with the Storage Analytics group at Google, where he studies interactions between components of the storage stack. Before that, he was with HP Labs, where he worked on storage QoS, distributed storage systems, and stochastic models of storage. He holds a B.Tech. from IIT Bombay and a Ph.D. in Computer Science from Stanford University. He is an ACM Distinguished Scientist.
Authored Publications
    Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
    Shashwat Silas
    Dave Clausen
    File and Storage Technologies (FAST), USENIX (2023)
    Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is critically important. Locally recoverable codes (LRCs) have deservedly gained central importance in this field because they can balance many of these requirements. In our work we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners seeking higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures. We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
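The locality property that makes LRCs attractive is easy to see in miniature. Below is a hedged Python sketch of a toy stripe with XOR-only local parity groups; it illustrates why a single lost block is cheap to rebuild, and it is not the paper's Uniform Cauchy construction (which also carries Reed-Solomon-style global parities). The block values and group size are made up.

```python
# A toy XOR-based stripe with local parity groups; illustrative only, not the
# paper's Uniform Cauchy LRC construction (which also has global parities).
from functools import reduce

def make_stripe(data_blocks, local_group_size):
    """Split data blocks into local groups, each with one XOR local parity."""
    groups = [data_blocks[i:i + local_group_size]
              for i in range(0, len(data_blocks), local_group_size)]
    return [(g, reduce(lambda a, b: a ^ b, g)) for g in groups]

def repair_single_failure(group, parity, lost_index):
    """Rebuild one lost block from its local group only: XOR the local
    parity with the surviving blocks of that group."""
    survivors = (b for i, b in enumerate(group) if i != lost_index)
    return reduce(lambda a, b: a ^ b, survivors, parity)

data = [0x1A, 0x2B, 0x3C, 0x4D, 0x5E, 0x6F]       # 6 hypothetical data blocks
stripe = make_stripe(data, local_group_size=3)    # 2 local groups + 2 parities
group0, parity0 = stripe[0]
assert repair_single_failure(group0, parity0, lost_index=1) == data[1]
# The repair read 3 blocks (2 survivors + 1 local parity), not the whole
# 6-block stripe: that read savings is the "locality" in LRC.
```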
    CacheSack: Admission Optimization for Datacenter Flash Caches
    Homer Wolfmeister
    Mustafa Uysal
    Seth Bradley Pollen
    Tzu-Wei Yang
    2022 USENIX Annual Technical Conference (2022)
    This paper describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google's datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache (LARC) algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies, providing a 6.5% improvement in total operational cost as well as significant improvements in disk reads and flash wearout.
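As a rough illustration of the knapsack formulation the abstract describes, here is a minimal Python sketch that picks one admission policy per traffic category to maximize disk reads saved under a flash-footprint budget. The categories, candidate policies, and every number in `CANDIDATES` are hypothetical, and the brute-force search merely stands in for the paper's actual optimization.

```python
# One admission policy per traffic category, chosen to maximize disk reads
# saved under a flash-footprint budget. Categories, policies, and all numbers
# are hypothetical; brute force stands in for the paper's knapsack solver.
from itertools import product

CANDIDATES = {
    # category: [(policy_name, flash_gb_used, disk_reads_saved_per_sec), ...]
    "logs":   [("never", 0, 0), ("on_second_miss", 40, 300), ("always", 120, 350)],
    "index":  [("never", 0, 0), ("on_second_miss", 25, 900), ("always", 60, 1400)],
    "photos": [("never", 0, 0), ("on_second_miss", 80, 200), ("always", 150, 260)],
}

def best_assignment(candidates, flash_budget_gb):
    """Exhaustively try every per-category policy combination (fine for a
    sketch; real deployments have far more categories and need a solver)."""
    best, best_saved = None, -1.0
    for combo in product(*candidates.values()):
        flash_used = sum(choice[1] for choice in combo)
        reads_saved = sum(choice[2] for choice in combo)
        if flash_used <= flash_budget_gb and reads_saved > best_saved:
            best, best_saved = dict(zip(candidates, combo)), reads_saved
    return best, best_saved

assignment, saved = best_assignment(CANDIDATES, flash_budget_gb=200)
for category, (policy, gb, _) in assignment.items():
    print(f"{category}: admit {policy} ({gb} GB of flash)")
print("total disk reads saved per second:", saved)
```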
    Tiger: disk-adaptive redundancy without placement restrictions
    Francisco Maturana
    Sanjith Athlur
    Rashmi KV
    Gregory R. Ganger
    USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2022)
    Large-scale cluster storage systems use redundancy (via erasure coding) to ensure data durability. Disk-adaptive redundancy, which dynamically tailors the redundancy scheme to observed disk failure rates, promises significant space and cost savings. Existing disk-adaptive redundancy systems, however, pose undesirable constraints on data placement, partitioning disks into subclusters with homogeneous failure rates and forcing each erasure-coded stripe to be placed entirely on the disks within one subcluster. This design increases risk, by reducing intra-stripe diversity and being more susceptible to unanticipated changes in a make/model's failure rate, and it only works for very large storage clusters fully committed to disk-adaptive redundancy. Tiger is a new disk-adaptive redundancy system that efficiently avoids adoption-blocking placement constraints while also providing higher space savings and lower risk relative to prior designs. To do so, Tiger introduces the eclectic stripe, in which disks with different failure rates can be used to store a stripe whose redundancy is tailored to the set of failure rates of those disks. With eclectic stripes, pre-existing placement policies can be used while still enjoying the space-savings and robustness benefits of disk-adaptive redundancy. This paper introduces eclectic striping and Tiger's design, including a new mean-time-to-data-loss (MTTDL) approximation technique and new approaches for ensuring safe per-stripe settings given that failure rates of different devices change over time. Evaluation with logs from real-world clusters shows that Tiger provides better space savings, less bursty IO for changing redundancy schemes, and better robustness (due to increased risk diversity) than prior disk-adaptive redundancy designs.
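To see why a stripe's reliability depends on the mix of per-disk failure rates (the heart of the eclectic-stripe idea), the following sketch computes a back-of-the-envelope MTTDL for a stripe over heterogeneous disks using the classic overlapping-failures argument. It is explicitly not the paper's MTTDL approximation technique, and all rates and parameters are hypothetical.

```python
# Back-of-the-envelope MTTDL for a stripe over disks with different failure
# rates: data is lost when f+1 disks fail within overlapping repair windows.
# Not the paper's approximation technique; all rates are hypothetical.
from itertools import combinations
from math import factorial, prod

def approx_mttdl_hours(rates_per_hour, tolerated_failures, repair_hours):
    """Sum, over every (f+1)-disk subset, the product of its failure rates;
    each of the (f+1)! failure orderings needs its f follow-on failures to
    land inside a repair window of length repair_hours."""
    f = tolerated_failures
    loss_rate = (factorial(f + 1) * repair_hours**f *
                 sum(prod(c) for c in combinations(rates_per_hour, f + 1)))
    return 1.0 / loss_rate

HOURS_PER_YEAR = 24 * 365
# An eclectic 12-disk stripe tolerating 3 failures: six 1%-AFR disks mixed
# with six 8%-AFR disks (annualized failure rates converted to hourly).
rates = [0.01 / HOURS_PER_YEAR] * 6 + [0.08 / HOURS_PER_YEAR] * 6
mttdl_years = approx_mttdl_hours(rates, tolerated_failures=3,
                                 repair_hours=24) / HOURS_PER_YEAR
print(f"approximate MTTDL: {mttdl_years:.3g} years")
```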
    Reliability of NAND-Based SSDs: What Field Studies Tell Us
    Bianca Schroeder
    Raghav Lagisetty
    Proceedings of the IEEE (2017)
    Solid-state drives (SSDs) based on NAND flash are making deep inroads into data centers as well as the consumer market. In 2016, manufacturers shipped more than 130 million units totaling around 50 exabytes of storage capacity. As the amount of data stored on solid-state drives keeps increasing, it is important to understand the reliability characteristics of these devices. For a long time, our knowledge about flash reliability was derived from controlled experiments in lab environments under synthetic workloads, often using methods for accelerated testing. However, within the last two years, three large-scale field studies have been published that report on the failure behavior of flash devices in production environments subjected to real workloads and operating conditions. The goal of this paper is to provide an overview of what we have learned about flash reliability in production, contrasting it, where appropriate, with prior studies based on controlled experiments.
    Flash Reliability in Production: The Expected and the Unexpected
    Bianca Schroeder
    Raghav Lagisetty
    File and Storage Technologies (FAST), USENIX (2016)
    The use of solid-state drives based on NAND flash technology is continuously growing. As more data either lives on flash or is cached on flash, data durability and availability critically depend on flash reliability. This paper provides a detailed field study of flash reliability based on data collected over six years in a large-scale data center production environment. The data spans many millions of drive days, ten different drive models, different flash technologies (MLC and SLC), and feature sizes ranging from 24nm to 50nm. The paper analyzes this data to derive a better understanding of flash reliability in the field, including the most prevalent types of errors and hardware failures, their frequency, and how different factors impact flash reliability.
    Slicer: Auto-Sharding for Datacenter Applications
    Atul Adya
    Jon Howell
    Jeremy Elson
    Colin Meek
    Vishesh Khemani
    Stefan Fulger
    Pan Gu
    Lakshminath Bhuvanagiri
    Jason Hunter
    Roberto Peon
    Alexander Shraer
    Kfir Lev-Ari
    OSDI (2016)
    Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is Google's general-purpose sharding service. It monitors signals such as load hotspots and server health and dynamically shards work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work. In this paper, we describe Slicer's design and implementation. Slicer has the consistency and global optimization of a centralized sharder while approaching the high availability, scalability, and low latency of systems that make local decisions. It achieves this by separating concerns: a reliable data plane forwards requests, and a smart control plane makes load-balancing decisions off the critical path. Slicer's small but powerful API has proven useful and easy to adopt in dozens of Google applications. It is used to allocate resources for web service front-ends, coalesce writes to increase storage bandwidth, and increase the efficiency of a web cache. It currently handles 2-6M req/s of production traffic. Production workloads using Slicer exhibit a most-loaded task at 30%–180% of the mean load, even for highly skewed and time-varying loads.
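A toy sketch of the core abstraction: application keys are hashed into a fixed keyspace, the keyspace is cut into contiguous slices, and slices are (re)assigned to tasks to even out observed load. The functions `slice_key`, `make_slices`, and `assign` below are illustrative inventions, and the greedy rebalance only stands in for Slicer's control plane, which also minimizes churn against the previous assignment.

```python
# Hash keys into a keyspace, cut it into slices, assign slices to tasks.
# Illustrative simplification, not Slicer's actual implementation.
import bisect
import hashlib

KEYSPACE = 2**63

def slice_key(key: str) -> int:
    """Map an application key to a point in the hashed keyspace."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big") % KEYSPACE

def make_slices(n_slices):
    """Cut the keyspace into equal contiguous slices; return the right edges."""
    return [KEYSPACE * (i + 1) // n_slices for i in range(n_slices)]

def assign(slice_loads, tasks):
    """Greedy rebalance: give each slice (heaviest first) to the least-loaded
    task. Real systems also minimize churn vs. the previous assignment."""
    task_load = {t: 0.0 for t in tasks}
    assignment = {}
    for s, load in sorted(enumerate(slice_loads), key=lambda x: -x[1]):
        t = min(task_load, key=task_load.get)
        assignment[s] = t
        task_load[t] += load
    return assignment, task_load

edges = make_slices(8)
loads = [5.0, 1.0, 9.0, 2.0, 2.0, 7.0, 1.0, 3.0]  # hypothetical per-slice req/s
assignment, task_load = assign(loads, tasks=["t0", "t1", "t2"])
key_slice = bisect.bisect_right(edges, slice_key("user:42"))
print("user:42 ->", assignment[key_slice], "| per-task load:", task_load)
```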
    Take me to your leader! Online Optimization of Distributed Storage Configurations
    Artyom Sharov
    Alexander Shraer
    Murray Stokely
    Proceedings of the 41st International Conference on Very Large Data Bases, VLDB Endowment (2015), pp. 1490-1501
    The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the “best” configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader- and quorum-based replication schemes and divide the framework into three optimization tiers, dynamically optimizing different configuration aspects: 1) leader placement, 2) roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally at Google and demonstrate that most client applications significantly benefit from using our framework, reducing average operation latency by up to 94%.
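The first optimization tier, leader placement, reduces to a small weighted-minimization problem. Here is a minimal Python sketch under assumed inputs: the sites, client request rates, and latency matrix are all hypothetical, and the real framework additionally optimizes replica roles and locations.

```python
# Pick the replica site that minimizes request-weighted client latency.
# All sites, rates, and latencies below are hypothetical.
SITES = ["us-east", "us-west", "europe"]
CLIENT_RATES = {"us-east": 900.0, "us-west": 400.0, "europe": 250.0}  # req/s
LATENCY_MS = {  # client region -> candidate leader site -> round-trip ms
    "us-east": {"us-east": 2, "us-west": 65, "europe": 90},
    "us-west": {"us-east": 65, "us-west": 2, "europe": 140},
    "europe":  {"us-east": 90, "us-west": 140, "europe": 2},
}

def best_leader(sites, rates, latency):
    """Return the site minimizing the request-weighted mean leader latency."""
    total = sum(rates.values())
    def weighted_latency(site):
        return sum(r * latency[region][site] for region, r in rates.items()) / total
    return min(sites, key=weighted_latency)

print("leader:", best_leader(SITES, CLIENT_RATES, LATENCY_MS))
```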
    Poster Paper: Automatic Reconfiguration of Distributed Storage
    Artyom Sharov
    Alexander Shraer
    Murray Stokely
    The 12th International Conference on Autonomic Computing, IEEE (2015), pp. 133-134
    The configuration of a distributed storage system with multiple data replicas typically includes the set of servers and their roles in the replication protocol. The configuration can usually be changed manually, but in most cases, system administrators have to determine a good configuration by trial and error. We describe a new workload-driven optimization framework that dynamically determines the optimal configuration at run time. Applying the framework to a large-scale distributed storage system used internally in Google resulted in halving the operation latency in 17% of the tested databases, and reducing it by more than 90% in some cases.
    Janus: Optimal Flash Provisioning for Cloud Storage Workloads
    Christoph Albrecht
    Murray Stokely
    Muhammad Waliji
    Francois Labelle
    Xudong Shi
    Eric Schrock
    Proceedings of the USENIX Annual Technical Conference, USENIX (2013), pp. 91-102
    Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers that used these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
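A sketch of the allocation idea: if each workload has a concave "cacheability" curve (reads served from flash as a function of flash allotted), a fixed budget can be spent one increment at a time on whichever workload gains the most. The curves and numbers in `CURVES` are hypothetical, and this greedy loop only approximates the paper's optimization, which also honors priorities and write-rate bounds.

```python
# Greedy marginal-benefit allocation of a flash budget across workloads with
# concave cacheability curves. Curves and numbers are hypothetical.
import math

# reads/s captured by flash as a function of allocated GB (concave, saturating)
CURVES = {
    "workload_a": lambda gb: 1200 * (1 - math.exp(-gb / 50.0)),
    "workload_b": lambda gb: 700 * (1 - math.exp(-gb / 200.0)),
    "workload_c": lambda gb: 300 * (1 - math.exp(-gb / 20.0)),
}

def allocate(curves, budget_gb, step_gb=5):
    alloc = {w: 0 for w in curves}
    for _ in range(budget_gb // step_gb):
        # marginal benefit of the next step_gb for each workload
        gain = {w: curves[w](alloc[w] + step_gb) - curves[w](alloc[w])
                for w in curves}
        winner = max(gain, key=gain.get)
        alloc[winner] += step_gb
    return alloc

alloc = allocate(CURVES, budget_gb=300)
total = sum(CURVES[w](gb) for w, gb in alloc.items())
print(alloc, f"-> {total:.0f} reads/s served from flash")
```

With concave curves, this increment-by-increment greedy matches the optimal allocation up to the step size, which is why the marginal-gain view is a natural fit for the problem.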
    Hathi: durable transactions for memory using flash
    Mohit Saxena
    Mehul A. Shah
    Stavros Harizopoulos
    Michael M. Swift
    Proceedings of the Eighth International Workshop on Data Management on New Hardware, ACM, New York, NY, USA (2012), pp. 33-38
    Projecting Disk Usage Based on Historical Trends in a Cloud Environment
    Murray Stokely
    Amaan Mehrabian
    Christoph Albrecht
    Francois Labelle
    Proceedings of the 3rd International Workshop on Scientific Cloud Computing (ScienceCloud 2012), ACM, pp. 63-70
    Provisioning scarce resources among competing users and jobs remains one of the primary challenges of operating large-scale, distributed computing environments. Distributed storage systems, in particular, typically rely on hard operator-set quotas to control disk allocation and enforce isolation for space and I/O bandwidth among disparate users. However, users and operators are very poor at predicting future requirements and, as a result, tend to grossly over-provision. For three years, we collected detailed usage information for data stored in distributed filesystems in a large private cloud spanning dozens of clusters on multiple continents. Specifically, we measured the disk space usage, I/O rate, and age of stored data for thousands of different engineering users and teams. We find that although the individual time series often have unstable usage trends, regional aggregations, user classification, and ensemble forecasting methods can be combined to provide a more accurate prediction of future use for the majority of users. We applied this methodology to the storage users in one geographic region and back-tested these techniques over the past three years to compare our forecasts against actual usage. We find that by classifying a small subset of users with unforecastable trend changes due to known product launches, we can generate three-month-out forecasts with mean absolute errors below roughly 12%. This compares favorably to the amount of allocated but unused quota that is typically wasted under manual operator-set quotas.
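A toy sketch of the ensemble-and-backtest idea: fit a couple of simple models to an aggregated usage series, average their forecasts, and score the backtest with mean absolute error. The series and the two member models are hypothetical stand-ins for the paper's regional aggregation and ensemble methodology.

```python
# Ensemble forecast of a usage series, backtested with MAE.
# Series and member models are hypothetical stand-ins.

def linear_trend_forecast(history, horizon):
    """Least-squares line through the history, extrapolated."""
    n = len(history)
    xbar, ybar = (n - 1) / 2, sum(history) / n
    slope = (sum(i * y for i, y in enumerate(history)) - n * xbar * ybar) / \
            (sum(i * i for i in range(n)) - n * xbar * xbar)
    return [ybar + slope * (n - 1 + h - xbar) for h in range(1, horizon + 1)]

def recent_growth_forecast(history, horizon, window=4):
    """Extrapolate the average step over the last `window` points."""
    step = (history[-1] - history[-1 - window]) / window
    return [history[-1] + step * h for h in range(1, horizon + 1)]

def ensemble(history, horizon):
    members = [linear_trend_forecast(history, horizon),
               recent_growth_forecast(history, horizon)]
    return [sum(m[h] for m in members) / len(members) for h in range(horizon)]

usage = [100, 104, 110, 113, 121, 125, 133, 138, 149, 155, 161, 170]  # TB/month
train, actual = usage[:9], usage[9:]
forecast = ensemble(train, horizon=3)
mae = sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)
print([round(f, 1) for f in forecast], f"MAE = {mae:.1f} TB")
```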
    Uncertainty in Aggregate Estimates from Sampled Distributed Traces
    Murray Stokely
    2012 Workshop on Managing Systems Automatically and Dynamically, USENIX
    Tracing mechanisms in distributed systems give important insight into system properties and are usually sampled to control overhead. At Google, Dapper [8] is the always-on system for distributed tracing and performance analysis, and it samples a fraction of all RPC traffic. Because of implementation difficulty, excessive data volume, or a lack of perfect foresight, there are times when system quantities of interest have not been measured directly, and Dapper samples can be aggregated to estimate those quantities in the short or long term. Here we find unbiased variance estimates of linear statistics over RPCs, taking into account all layers of sampling that occur in Dapper, allowing us to quantify the sampling uncertainty in the aggregate estimates. We apply this methodology to the problem of assigning jobs and data to Google datacenters, using estimates of the resulting cross-datacenter traffic as an optimization criterion, and also to the detection of change points in access patterns to certain data partitions.
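The estimation idea in one layer: a total over all RPCs is estimated Horvitz-Thompson style from the sampled traces by weighting each record with its inverse sampling probability, and the sampling uncertainty comes from an unbiased variance estimate. The sketch below assumes a single layer of independent (Poisson) sampling with hypothetical byte counts; the paper derives the analogous estimates across all of Dapper's sampling layers.

```python
# Horvitz-Thompson estimate of a total, with an unbiased variance estimate,
# under one layer of independent (Poisson) sampling. Values are hypothetical.
from math import sqrt

# (bytes_for_this_rpc, probability this RPC was sampled), for sampled RPCs
sampled = [(12_000, 0.01), (8_500, 0.01), (640_000, 0.10), (410_000, 0.10)]

total_hat = sum(y / p for y, p in sampled)
var_hat = sum((1 - p) * y * y / (p * p) for y, p in sampled)
stderr = sqrt(var_hat)
print(f"estimated total bytes: {total_hat:,.0f} +/- {stderr:,.0f} (1 s.e.)")
```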
    Maestro: Quality-of-Service in Large Disk Arrays
    Mustafa Uysal
    Pradeep Padala
    Xiaoyun Zhu
    Sharad Singhal
    Kang Shin
    Proceedings of the 8th ACM international conference on Autonomic computing (ICAC), ACM, New York, NY, USA (2011), pp. 245-254
    Provisioning storage in disk arrays is a difficult problem because many applications with different workload characteristics and priorities share resources provided by the array. Currently, storage in disk arrays is statically partitioned, leading to difficult choices between over-provisioning to meet peak demands and resource sharing to meet efficiency targets. In this paper, we present Maestro, a feedback controller that can manage resources on large disk arrays to provide performance differentiation among multiple applications. Maestro monitors the performance of each application and dynamically allocates the array resources so that diverse performance requirements can be met without static partitioning. It supports multiple performance metrics (e.g., latency and throughput) and application priorities so that important applications receive better performance in case of resource contention. By ensuring that high-priority applications sharing storage with other applications obtain the performance levels they require, Maestro makes it possible to use storage resources efficiently. We evaluate Maestro using both synthetic and real-world workloads on a large, commercial disk array. Our experiments indicate that Maestro can reliably adjust the allocation of disk array resources to achieve application performance targets.
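The feedback loop can be sketched with a simple proportional controller: compare each application's measured latency to its target and nudge its resource share accordingly. The gain, targets, and the fake `measure_latency` model below are all hypothetical, and the real controller also handles priorities and multiple metric types.

```python
# Proportional feedback on per-application resource shares to hit latency
# targets. Gains, targets, and the plant model are hypothetical.
TARGET_MS = {"oltp": 10.0, "analytics": 50.0}
shares = {"oltp": 0.5, "analytics": 0.5}
GAIN = 0.4

def measure_latency(shares):
    """Stand-in for the array: latency falls as an app's share rises."""
    return {app: 8.0 / max(share, 1e-6) for app, share in shares.items()}

for step in range(10):
    latency = measure_latency(shares)
    for app in shares:
        error = (latency[app] - TARGET_MS[app]) / TARGET_MS[app]
        shares[app] *= (1 + GAIN * error)      # too slow -> grow the share
    total = sum(shares.values())               # renormalize to a full array
    shares = {app: s / total for app, s in shares.items()}

print({app: round(s, 2) for app, s in shares.items()},
      {app: round(l, 1) for app, l in measure_latency(shares).items()})
```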
    Sequential Prefetch Cache Sizing for Maximal Hit Rate
    Swapnil Bhatia
    Elizabeth Varki
    MASCOTS (2010), pp. 89-98
    mClock: Handling Throughput Variability for Hypervisor IO Scheduling
    Ajay Gulati
    Peter J. Varman
    OSDI (2010), pp. 437-450
    Designing Dependable Storage Solutions for Shared Application Environments
    Shravan Gaonkar
    Kimberly Keeton
    William H. Sanders
    IEEE Trans. Dependable Sec. Comput., vol. 7 (2010), pp. 366-380
    Sinfonia: A new paradigm for building scalable distributed systems
    Marcos Kawazoe Aguilera
    Mehul A. Shah
    Alistair C. Veitch
    Christos T. Karamanolis
    ACM Trans. Comput. Syst., vol. 27 (2009)
    Technical perspective - Disk array models for automating storage management
    Commun. ACM, vol. 52 (2009), p. 90
    Efficient and adaptive proportional share I/O scheduling
    Ajay Gulati
    Mustafa Uysal
    Pradeep Padala
    Peter J. Varman
    SIGMETRICS Performance Evaluation Review, vol. 37 (2009), pp. 79-80
    What does control theory bring to systems research?
    Xiaoyun Zhu
    Mustafa Uysal
    Zhikui Wang
    Sharad Singhal
    Pradeep Padala
    Kang G. Shin
    Operating Systems Review, vol. 43 (2009), pp. 62-69
    Autograph: automatically extracting workflow file signatures
    Anna Povzner
    Kimberly Keeton
    Charles B. Morrey III
    Mustafa Uysal
    Marcos Kawazoe Aguilera
    Operating Systems Review, vol. 43 (2009), pp. 76-83
    Automated control of multiple virtualized resources
    Pradeep Padala
    Kai-Yuan Hou
    Kang G. Shin
    Xiaoyun Zhu
    Mustafa Uysal
    Zhikui Wang
    Sharad Singhal
    EuroSys (2009), pp. 13-26
    TaP: Table-based Prefetching for Storage Caches
    Mingju Li
    Elizabeth Varki
    Swapnil Bhatia
    FAST (2008), pp. 81-96
    d-clock: distributed QoS in heterogeneous resource environments
    Ajay Gulati
    Peter J. Varman
    PODC (2007), pp. 330-331
    pClock: an arrival curve based approach for QoS guarantees in shared storage systems
    Ajay Gulati
    Peter J. Varman
    SIGMETRICS (2007), pp. 13-24
    Don't Settle for Less Than the Best: Use Optimization to Make Decisions
    Kimberly Keeton
    Terence Kelly
    Cipriano A. Santos
    Janet L. Wiener
    Xiaoyun Zhu
    Dirk Beyer
    HotOS (2007)
    Improving Recoverability in Multi-tier Storage Systems
    Marcos Kawazoe Aguilera
    Kimberly Keeton
    Kiran-Kumar Muniswamy-Reddy
    Mustafa Uysal
    DSN (2007), pp. 677-686
    Adaptive control of virtualized resources in utility computing environments
    Pradeep Padala
    Kang G. Shin
    Xiaoyun Zhu
    Mustafa Uysal
    Zhikui Wang
    Sharad Singhal
    Kenneth Salem
    EuroSys (2007), pp. 289-302
    Altering document term vectors for classification: ontologies as expectations of co-occurrence
    Meenakshi Nagarajan
    Amit P. Sheth
    Marcos Kawazoe Aguilera
    Kimberly Keeton
    Mustafa Uysal
    WWW (2007), pp. 1225-1226
    Towards fairness and efficiency in storage systems
    Ajay Gulati
    Peter J. Varman
    Mustafa Uysal
    SIGMETRICS Performance Evaluation Review, vol. 35 (2007), pp. 56-58
    Sinfonia: a new paradigm for building scalable distributed systems
    Marcos Kawazoe Aguilera
    Mehul A. Shah
    Alistair C. Veitch
    Christos T. Karamanolis
    SOSP (2007), pp. 159-174
    Proportional-Share Scheduling for Distributed Storage Systems
    Yin Wang
    FAST (2007), pp. 47-60
    Designing and managing storage systems: issues, techniques, and challenges
    QEST (2006), pp. 265-268
    Designing dependable storage solutions for shared application environments
    Shravan Gaonkar
    Kimberly Keeton
    William H. Sanders
    DSN (2006), pp. 371-382
    On the road to recovery: restoring data after disasters
    Kimberly Keeton
    Dirk Beyer
    Ernesto Brau
    Cipriano A. Santos
    Alex Zhang
    EuroSys (2006), pp. 235-248
    Challenges in managing dependable data systems
    Kimberly Keeton
    SIGMETRICS Performance Evaluation Review, vol. 33 (2006), pp. 4-10
    A Framework for Evaluating Storage System Dependability
    Kimberly Keeton
    DSN (2004), pp. 877-886
    FAB: building distributed enterprise disk arrays from commodity components
    Yasushi Saito
    Svend Frølund
    Alistair C. Veitch
    Susan Spence
    ASPLOS (2004), pp. 48-58
    A Decentralized Algorithm for Erasure-Coded Virtual Disks
    Svend Frølund
    Yasushi Saito
    Susan Spence
    Alistair C. Veitch
    DSN (2004), pp. 125-134
    Lessons and challenges in automating data dependability
    Kimberly Keeton
    Jeffrey S. Chase
    Dirk Beyer
    Cipriano A. Santos
    ACM SIGOPS European Workshop (2004), p. 4
    Issues and Challenges in the Performance Analysis of Real Disk Arrays
    Elizabeth Varki
    Jianzhang Xu
    Xiaozhou Qiu
    IEEE Trans. Parallel Distrib. Syst., vol. 15 (2004), pp. 559-574
    An integrated performance model of disk arrays
    Elizabeth Varki
    Jianzhang Xu
    Xiaozhou Qiu
    MASCOTS (2003), pp. 296-305
    Façade: Virtual Storage Devices with Performance Guarantees
    Christopher R. Lumb
    Guillermo A. Alvarez
    FAST (2003)
    Using MEMS-Based Storage in Disk Arrays
    Mustafa Uysal
    Guillermo A. Alvarez
    FAST (2003)
    FAB: Enterprise Storage Systems on a Shoestring
    Svend Frølund
    Yasushi Saito
    Susan Spence
    Alistair C. Veitch
    HotOS (2003), pp. 169-174
    Minerva: An automated resource provisioning tool for large-scale storage systems
    Guillermo A. Alvarez
    Elizabeth Borowsky
    Susie Go
    Theodore H. Romer
    Ralph A. Becker-Szendy
    Richard A. Golding
    Mirjana Spasojevic
    Alistair C. Veitch
    ACM Trans. Comput. Syst., vol. 19 (2001), pp. 483-518
    A Modular, Analytical Throughput Model for Modern Disk Arrays
    Mustafa Uysal
    Guillermo A. Alvarez
    MASCOTS (2001), pp. 183-192
    Capacity planning with phased workloads
    Elizabeth Borowsky
    Richard A. Golding
    P. Jacobson
    L. Schreier
    Mirjana Spasojevic
    WOSP (1998), pp. 199-207
    An Analytic Behavior Model for Disk Drives with Readahead Caches and Request Reordering
    Elizabeth A. M. Shriver
    SIGMETRICS (1998), pp. 182-191
    Analysis of a Control Mechanism for a Variable Speed Processor
    Benjamin Melamed
    Eugen Schenfeld
    Bhaskar Sengupta
    IEEE Trans. Computers, vol. 45 (1996), pp. 793-801
    Analytic Modeling of Clustered RAID with Mapping Based on Nearly Random Permutation
    Philip S. Yu
    IEEE Trans. Computers, vol. 45 (1996), pp. 367-373
    Performance Analysis of Dynamic Finite Versioning Schemes: Storage Cost vs. Obsolescence
    Kun-Lung Wu
    Philip S. Yu
    Ming-Syan Chen
    IEEE Trans. Knowl. Data Eng., vol. 8 (1996), pp. 985-1001
    Assignment of cells to switches in PCS networks
    Bhaskar Sengupta
    IEEE/ACM Trans. Netw., vol. 3 (1995), pp. 521-526
    Analytic Modeling and Comparisons of Striping Strategies for Replicated Disk Arrays
    Philip S. Yu
    IEEE Trans. Computers, vol. 44 (1995), pp. 419-433
    An Analytical Model of Reconstruction Time in Mirrored Disks
    Philip S. Yu
    Perform. Eval., vol. 20 (1994), pp. 115-129
    Multiway Graph Partitioning with Applications to PCS Networks
    Bhaskar Sengupta
    INFOCOM (1994), pp. 593-600
    Performance Analysis of a Dual Striping Strategy for Replicated Disk Arrays
    Philip S. Yu
    PDIS (1993), pp. 148-157
    Design and Modeling of Clustered RAID
    Philip S. Yu
    FTCS (1992), pp. 140-149
    Analytical Models of Combining Banyan Networks
    SIGMETRICS (1992), pp. 205-212
    Performance Analysis of Dynamic Finite Versioning for Concurrent Transaction and Query Processing
    Kun-Lung Wu
    Philip S. Yu
    Ming-Syan Chen
    SIGMETRICS (1992), pp. 103-114
    Settling Time Bounds for M/G/1 Queues
    Queueing Systems, vol. 8 (1991), pp. 105-110
    A Markov Chain Approximation for the Analysis of Banyan Networks
    SIGMETRICS (1991), pp. 60-67