Are adversarial examples inevitable?

A number of adversarial attacks on neural networks have recently been proposed. To counter these attacks, various authors have proposed a range of defenses. However, these defenses are often quickly broken by new and revised attacks. Given the lack of success at generating robust defenses, we are led to ask a fundamental question: are adversarial examples inevitable?

We identify a broad class of problems for which adversarial examples cannot be avoided. We also derive fundamental limits on the susceptibility of a classifier to adversarial attacks that depend on properties of the data distribution as well as the dimensionality of the dataset.

A full description of these results, and experiments exploring their implications, is available in our article, “Are adversarial examples inevitable?”

Existence of adversarial examples

We assume that images have their pixels scaled between 0 and 1. We also assume that images in an object class are sampled from a probability density function whose maximum value is \(U_c\). Under these assumptions, we can derive the following fundamental limit on the susceptibility of a classifier to adversarial examples.

Sample an image \(x\) at random from the class distribution, and choose any measurable classifier on the space of images. Then, with probability at least

\( 1-U_c \exp(-\pi \epsilon^2)/(2\pi), \)

at least one of the following holds: 1) the point \(x\) is misclassified by the classifier, or 2) there is an adversarial example \(\hat x\) in a different class with \(\|x-\hat x\|_2 < \epsilon\).

This bound guarantees the existence of adversarial examples whenever the density bound \(U_c\) is not too large. We provide theoretical results showing that, for very simple image classes, \(U_c\) grows rapidly as the dimension of the problem increases. However, we also provide experimental results suggesting that complex image classes may have much lower values of \(U_c\), which would make the susceptibility bound highly non-trivial.
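To get a feel for when the bound is informative, the short sketch below plugs values of \(U_c\) and \(\epsilon\) into the expression above. The specific numbers are purely illustrative assumptions, not estimates of the density bound for any real image class.

```python
import numpy as np

def susceptibility_bound(U_c, eps):
    """Evaluate 1 - U_c * exp(-pi * eps^2) / (2*pi), the probability bound stated above."""
    return 1.0 - U_c * np.exp(-np.pi * eps ** 2) / (2.0 * np.pi)

# Hypothetical values, chosen only to illustrate the bound's behavior; they are
# not measured density bounds for real image classes.
for U_c in [1.0, 1e2, 1e6]:
    for eps in [1.0, 3.0, 5.0]:
        b = susceptibility_bound(U_c, eps)
        # A value near 1 makes misclassification or an epsilon-close adversarial
        # example nearly certain; a value <= 0 means the bound is vacuous.
        print(f"U_c = {U_c:9.0e}  eps = {eps:.1f}  bound = {b: .6f}")
```

Even a large density bound is overwhelmed once \(\epsilon\) grows, because the \(\exp(-\pi\epsilon^2)\) factor shrinks so quickly; conversely, for small \(\epsilon\) and large \(U_c\) the bound says nothing at all.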

Sparse adversarial examples

We next consider sparse adversarial examples, in which only a subset of image pixels is manipulated. In this setting, we can prove that an adversarial perturbation confined to a small subset of pixels exists, provided the density bound \(U_c\) does not grow too large.
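As a concrete illustration of what a sparse perturbation looks like in practice, here is a minimal gradient-based sketch that perturbs only the \(k\) pixels to which the loss is most sensitive. This is not the construction used in our analysis; `model`, `loss_fn`, and the pixel budget `k` are placeholder assumptions.

```python
import torch

def sparse_perturbation(model, loss_fn, x, label, k=100, delta=0.5):
    """Perturb only the k pixels to which the classification loss is most sensitive.

    Illustrative sketch only: `model` is a differentiable classifier, `loss_fn`
    its loss, `x` an image tensor with values in [0, 1], and `label` the true
    class index as a scalar tensor.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()

    # Pick the k pixel locations with the largest gradient magnitude.
    grad = x.grad.flatten()
    idx = grad.abs().topk(k).indices

    # Nudge only those pixels in the direction that increases the loss,
    # leaving every other pixel untouched, and keep values in [0, 1].
    perturbation = torch.zeros_like(grad)
    perturbation[idx] = delta * torch.sign(grad[idx])
    x_adv = (x.detach().flatten() + perturbation).clamp(0.0, 1.0).view_as(x)
    return x_adv
```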


Sparse adversarial examples perturb a small subset of pixels and can hide adversarial "fuzz" inside high-frequency image regions. The original image (left) is classified as an "ox." Under \( \ell_\infty \)-norm perturbations (center-left), it is classified as a "traffic light," but the perturbations visibly distort smooth regions of the image (the sky). These effects are hidden in the grass (center-right) using sparse perturbations limited to a small subset of pixels (right).