Obfuscatory obscanturism: making workload traces of commercially-sensitive systems safe to release
Abstract
Cloud providers such as Google are interested in
fostering research on the daunting technical challenges they
face in supporting planetary-scale distributed systems, but no
academic organizations have similar-scale systems on which
to experiment. Fortunately, good research can still be done
using traces of real-life production workloads, but there are
risks in releasing such data, including inadvertently disclosing
confidential or proprietary information, as happened with the
Netflix Prize data. This paper discusses these risks, and our
approach to them, which we call \emph{systematic obfuscation}. It protects proprietary and personal data while still making it possible to answer interesting research questions. We explain and motivate the risks and concerns, and propose how they can best be mitigated, using as an example our recent publication of a month-long trace of a production workload on an 11k-machine cluster.
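To give a concrete flavor of the kind of transformation systematic obfuscation applies, the sketch below shows two representative steps in Python: keyed hashing of sensitive identifiers and linear rescaling of resource measurements. It is a minimal illustration under our own assumptions; the key, function names, and record fields are hypothetical, not the paper's actual tooling.

\begin{verbatim}
import hashlib
import hmac

# Hypothetical site-private key: generated once, kept secret,
# and never released alongside the trace.
SECRET_KEY = b"site-private-key"

def obfuscate_name(name: str) -> str:
    # Map a sensitive identifier (e.g., a user or job name) to an
    # opaque token with a keyed cryptographic hash: equal names
    # yield equal tokens, but the originals cannot be recovered.
    return hmac.new(SECRET_KEY, name.encode(), hashlib.sha256).hexdigest()

def rescale_usage(value: float, max_capacity: float) -> float:
    # Rescale a resource measurement relative to the largest
    # machine in the cluster, hiding absolute machine sizes
    # while preserving relative load.
    return value / max_capacity

# A hypothetical trace record before and after obfuscation.
record = {"user": "alice@example.com", "cpu_cores": 2.0, "mem_bytes": 8 << 30}
obfuscated = {
    "user": obfuscate_name(record["user"]),
    "cpu": rescale_usage(record["cpu_cores"], max_capacity=32.0),
    "mem": rescale_usage(record["mem_bytes"], max_capacity=128 << 30),
}
print(obfuscated)
\end{verbatim}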