Learning to Attack: Adversarial Transformation Networks
Venue: Proceedings of AAAI-2018, AAAI (to appear)
Publication Year: 2018
Abstract
With the rapidly increasing popularity of deep neural networks for image
recognition tasks, a parallel interest in generating adversarial examples to attack
the trained models has arisen. To date, these approaches have involved either
directly computing gradients with respect to the image pixels or directly solving
an optimization on the image pixels. We generalize this pursuit in a novel
direction: can a separate network be trained to efficiently attack another fully
trained network? We demonstrate that it is possible, and that the generated attacks
yield startling insights into the weaknesses of the target network. We call such a
network an Adversarial Transformation Network (ATN). ATNs transform any input into
an adversarial attack on the target network, while being minimally perturbing to
the original inputs and the target network’s outputs. Further, we show that ATNs
are capable of not only causing the target network to make an error, but can be
constructed to explicitly control the type of misclassification made. We
demonstrate ATNs on both simple MNIST digit classifiers and a state-of-the-art
ImageNet classifier deployed by Google, Inc. (Inception ResNet-v2).
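To make the abstract's idea concrete, the sketch below illustrates one way an ATN's training objective can be set up: a transformation network g perturbs the input as little as possible while pushing the fixed target classifier f toward a chosen wrong class. The specific loss form, the `rerank_target` helper, and all parameter names (`alpha`, `beta`) are illustrative assumptions for this sketch, not details quoted from the paper.

```python
import numpy as np

def rerank_target(probs, target_class, alpha=1.5):
    """Illustrative target vector: boost `target_class` above the current max
    while roughly preserving the ordering of the other classes, then renormalize.
    (Assumed helper, not taken verbatim from the paper.)"""
    y = probs.copy()
    y[target_class] = alpha * probs.max()
    return y / y.sum()

def atn_loss(x, gx, f, target_class, beta=0.1):
    """Two-term sketch of an ATN objective: stay close to the original input,
    and move the target network's output toward the reranked distribution."""
    l_x = np.sum((gx - x) ** 2)  # minimal-perturbation penalty on the input
    y, y_adv = f(x), f(gx)
    l_y = np.sum((y_adv - rerank_target(y, target_class)) ** 2)
    return beta * l_x + l_y

# Toy usage: f is a fixed softmax classifier; gx is a slightly perturbed input.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
f = lambda x: np.exp(W @ x) / np.exp(W @ x).sum()
x = rng.normal(size=4)
gx = x + 0.01 * rng.normal(size=4)
loss = atn_loss(x, gx, f, target_class=0)
```

In a full setup, g would be a neural network and this loss would be minimized over g's parameters by gradient descent, yielding a network that maps any input to a targeted adversarial example in a single forward pass.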