Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units
Venue
IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2014), pp. 458-467
Publication Year
2014
Authors
Abstract
In N-body programs, the trajectories of simulated particles exhibit chaotic patterns
if errors are present in the initial conditions or occur during computation steps. It
was believed that the global properties (e.g., total energy) of simulated particles are
unlikely to be affected by a small number of such errors. In this paper, we present
a quantitative analysis of the impact of transient faults in GPU devices on a
global property of simulated particles. We experimentally show that a single-bit
error in non-control data can change the final total energy of a large-scale
N-body program with ~2.1% probability. We also find that the corrupted total energy
values have certain biases (e.g., the values are not normally distributed), which
can be used to reduce the expected number of re-executions. In this paper, we also
present a data error detection technique for N-body programs that utilizes two
types of properties that hold in simulated physical models. Together, the presented
technique and an existing redundancy-based technique cover many data errors (e.g.,
>97.5%) with a small performance overhead (e.g., 2.3%).
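
The idea of detecting data errors through properties of the physical model can be illustrated with a small sketch. The abstract does not name the two property types the paper uses, so the example below assumes conservation of total energy as one plausible invariant. It runs on the CPU in plain Python rather than on a GPU, and the helper names (`flip_bit`, `total_energy`, `energy_check`), the softening length, and the tolerance are all hypothetical choices made for illustration, not the authors' implementation.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float64 encoding flipped (bit 0 = least significant)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def total_energy(pos, vel, mass, g=1.0, eps=1e-3):
    """Kinetic energy plus softened pairwise gravitational potential energy."""
    n = len(pos)
    kinetic = sum(0.5 * mass[i] * sum(v * v for v in vel[i]) for i in range(n))
    potential = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
            potential -= g * mass[i] * mass[j] / (r2 + eps * eps) ** 0.5
    return kinetic + potential


def energy_check(e_ref, e_now, rel_tol=1e-3):
    """Return True if the energy drift is within tolerance (no error suspected)."""
    return abs(e_now - e_ref) <= rel_tol * abs(e_ref)


# Hypothetical three-particle system, used only to exercise the check.
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
vel = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
mass = [1.0, 1.0, 1.0]
e_ref = total_energy(pos, vel, mass)

# Inject a single-bit error into an exponent bit of one velocity component.
vel_high = [tuple(v) for v in vel]
vel_high[0] = (flip_bit(vel[0][0], 54), 0.0, 0.0)
detected_high = not energy_check(e_ref, total_energy(pos, vel_high, mass))

# Inject a single-bit error into the lowest mantissa bit instead.
vel_low = [tuple(v) for v in vel]
vel_low[0] = (flip_bit(vel[0][0], 0), 0.0, 0.0)
detected_low = not energy_check(e_ref, total_energy(pos, vel_low, mass))
```

In this sketch the exponent-bit flip shifts the total energy well beyond the tolerance and is flagged, while the low mantissa-bit flip changes the energy by far less than the tolerance and passes unnoticed. That gap is one reason an invariant check of this kind would be paired with another mechanism, such as the redundancy-based technique the abstract mentions.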
