Norming to Performing: Failure Analysis and Deployment Automation of Big Data Software Developed by Highly Iterative Models
Venue
IEEE International Symposium on Software Reliability Engineering (2014), pp. 144-155
Publication Year
2014
Authors
Abstract
We observe several notable failure characteristics in Big Data software developed
and released using highly iterative development models (e.g., agile). About 16% of
failures are caused by faults in software deployment (e.g., packaging and pushing
to production). Our analysis shows that many such production outages are at least
partially due to human errors rooted in the high frequency and complexity of
software deployments. About 51% of the observed human errors (e.g., transcription,
education, and communication error types) are avoidable through automation. We
therefore develop a fault-tolerant automation framework that makes it efficient to
automate end-to-end software deployment procedures. We apply the
framework to two Big Data products. Our case studies show the complexity of the
deployment procedures of multi-homed Big Data applications and help us to study the
effectiveness of the validation and verification techniques for user-provided
automation programs. We analyze the production failures of the two products again
after the automation. Our experimental data shows how automation and the
associated procedure improvements reduce deployment faults and the overall failure
rate, and improve feature launch velocity. Automation facilitates more formal,
procedure-driven software engineering practices, which not only reduce manual
work and avoidable, human-caused production outages but also help engineers
better understand overall software engineering procedures, making them more
auditable, predictable, reliable, and efficient. We discuss two novel metrics for
evaluating progress in mitigating human errors, and the conditions that indicate
when to begin such a transition away from owner-driven deployment practice.
