Thinking about Availability in Large Service Infrastructures

Jeffrey C. Mogul

Rebecca Isaacs

Brent Welch

Proc. HotOS XVI (2017)

Download Google Scholar

Abstract

We increasingly depend on the availability of online services, either directly as users, or indirectly, when cloud-provider services support directly-accessed services. The availability of these "visible services" depends in complex ways on the availability of a complex underlying set of invisible infrastructure services. In our experience, most software engineers lack useful frameworks to create and evaluate designs for individual services that support end-to-end availability in these infrastructures, especially given cost, performance, and other constraints on viable commercial services. Even given the extensive research literature on techniques for replicated state machines and other fault-tolerance mechanisms, we found little help in this literature for addressing infrastructure-wide availability. Past research has often focused on point solutions, rather than end-to-end ones. In particular, it seems quite difficult to define useful targets for infrastructure-level availability, and then to translate these to design requirements for individual services. We argue that, in many but not all ways, one can think about availability with the mindset that we have learned to use for security, and we discuss some general techniques that appear useful for implementing and operating high-availability infrastructures. We encourage a shift in emphasis for academic research into availability.

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Thinking about Availability in Large Service Infrastructures

Abstract

Research Areas

Learn more about how we conduct our research

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Thinking about Availability in Large Service Infrastructures

Abstract

Research Areas

Learn more about how we conduct our research

AI/ML Foundations  & Capabilities