Data-driven software security: Models and methods
Abstract
For computer software, our security models, policies, mechanisms, and means of
assurance were primarily conceived and developed before the end of the 1970s.
However, since that time, software has changed radically: it is thousands of times
larger, comprises countless libraries, layers, and services, and is used for more
purposes, in far more complex ways. It is worthwhile to revisit our core computer
security concepts. For example, it is unclear whether the Principle of Least
Privilege can help dictate security policy when software is too complex for either
its developers or its users to explain its intended behavior. One possibility is to
take an empirical, data-driven approach to modern software, and determine its
exact, concrete behavior via comprehensive, online monitoring. Such an approach can
be a practical, effective basis for security, as demonstrated by its success in
fighting spam and abuse, but its use to constrain software behavior raises many
questions. In particular, three questions seem critical. First, can we efficiently
monitor the details of how software is behaving, in the large? Second, is it
possible to learn those details without intruding on users’ privacy? Third, are those
details a good foundation for security policies that constrain how software should
behave? This paper outlines what a data-driven model for software security could
look like, and describes how the above three questions can be answered
affirmatively. Specifically, this paper briefly describes methods for efficient,
detailed software monitoring; methods for learning detailed software statistics
while providing differential privacy for the software’s users; and, finally, how
machine learning methods can help discover users’ expectations for intended
software behavior, and thereby help set security policy. Those methods can be
adopted in practice, even at very large scales, and they demonstrate that data-driven
software security models can provide real-world benefits.
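
As a concrete illustration of the differentially private collection mentioned
above, which deployed systems such as RAPPOR build upon, the Python fragment
below sketches classic randomized response for a single boolean software
statistic. This is a minimal sketch only, not the paper's mechanism; the
function names, the p = 0.5 setting, and the simulated 30% behavior rate are
illustrative assumptions.

    import random

    def report(truth: bool, p: float = 0.5) -> bool:
        """Randomized response: with probability p send the true bit,
        otherwise send a fair coin flip. For p = 0.5, each client's bit
        is protected with epsilon = ln(3) differential privacy, since
        P(report=1 | truth=1) / P(report=1 | truth=0) = 0.75 / 0.25."""
        if random.random() < p:
            return truth
        return random.random() < 0.5

    def estimate(reports: list[bool], p: float = 0.5) -> float:
        """Unbiased estimate of the true rate from noisy reports:
        E[report] = p * rate + (1 - p) / 2, solved for rate."""
        mean = sum(reports) / len(reports)
        return (mean - (1 - p) / 2) / p

    # Simulated population: 30% of clients exhibit the monitored behavior.
    truths = [random.random() < 0.30 for _ in range(100_000)]
    noisy = [report(t) for t in truths]
    print(f"estimated rate: {estimate(noisy):.3f}")  # close to 0.30

Because no individual report reveals its sender's true bit with certainty, the
aggregator learns the population-level statistic without intruding on any one
user's privacy, which is the trade-off the second question above concerns.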
