DRAM Errors in the Wild: A Large-Scale Field Study
Venue
SIGMETRICS (2009)
Publication Year
2009
Authors
Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber
Abstract
Errors in dynamic random access memory (DRAM) are a common form of hardware failure
in modern compute clusters. Failures are costly both in terms of hardware
replacement costs and service disruption. While a large body of work exists on DRAM
in laboratory conditions, little has been reported on real DRAM failures in large
production clusters. In this paper, we analyze measurements of memory errors in a
large fleet of commodity servers over a period of 2.5 years. The collected data
covers multiple vendors, DRAM capacities and technologies, and comprises many
millions of DIMM days. The goal of this paper is to answer questions such as the
following: How common are memory errors in practice? What are their statistical
properties? How are they affected by external factors, such as temperature and
utilization, and by chip-specific factors, such as chip density, memory technology
and DIMM age? We find that DRAM error behavior in the field differs in many key
aspects from commonly held assumptions. For example, we observe DRAM error rates
that are orders of magnitude higher than previously reported, with 25,000 to 70,000
errors per billion device hours per Mbit and more than 8% of DIMMs affected by
errors per year. We provide strong evidence that memory errors are dominated by
hard errors, rather than soft errors, which previous work suspects to be the
dominant error mode. We find that temperature, known to strongly impact DIMM error
rates in lab conditions, has a surprisingly small effect on error behavior in the
field, when taking all other factors into account. Finally, contrary to common fears,
we do not observe any indication that newer generations of DIMMs have worse error
behavior.
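
To give a sense of scale for the rate quoted above, the following minimal sketch converts "errors per billion device hours per Mbit" into a mean number of errors per DIMM per year. The 1 GiB DIMM capacity and the function name are assumptions chosen for illustration and are not taken from the paper. The result is a fleet-wide mean: since the abstract reports that only somewhat more than 8% of DIMMs see errors in a year, a typical DIMM would see far fewer errors than this figure suggests.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def errors_per_dimm_year(rate_per_1e9_hours_per_mbit: float,
                         dimm_capacity_mbit: float) -> float:
    """Convert errors per billion device hours per Mbit into
    mean errors per DIMM per year (hypothetical helper)."""
    # Scale the per-Mbit rate by capacity, then from 1e9 hours to 1 hour.
    errors_per_hour = rate_per_1e9_hours_per_mbit * dimm_capacity_mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed capacity: 1 GiB = 8 * 1024 Mbit = 8192 Mbit.
    capacity_mbit = 8 * 1024
    for rate in (25_000, 70_000):
        mean_errors = errors_per_dimm_year(rate, capacity_mbit)
        print(f"{rate} per 1e9 h per Mbit: about {mean_errors:.0f} errors per DIMM-year")
```

Under these assumptions, the quoted range corresponds to a mean on the order of a few thousand errors per DIMM per year.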
