Gestalt: Fast, Unified Fault Localization for Networked Systems
Venue
Proceedings of the USENIX Annual Technical Conference (2014)
Publication Year
2014
Authors
Radhika Niranjan Mysore, Amin Vahdat, Ratul Mahajan, George Varghese
Abstract
We show that the performance of existing fault localization algorithms differs
markedly across networks, and that no algorithm simultaneously provides high
localization accuracy and low computational overhead. We develop a framework to
explain these behaviors by anatomizing the algorithms with respect to six important
characteristics of real networks, such as uncertain dependencies, noise, and
covering relationships. We use this analysis to develop Gestalt, a new algorithm
that combines the best elements of existing ones and includes a new technique to
explore the space of fault hypotheses. We run experiments on three real, diverse
networks. For each, Gestalt has either significantly higher localization accuracy
or an order of magnitude lower running time. For example, when applied to the Lync
messaging system that is used widely within corporations, Gestalt localizes faults
with the same accuracy as Sherlock, while reducing fault localization time from
days to 23 seconds.
