Papers I Read: Notes and Summaries

Compositional Explanations of Neurons

Introduction

  • The paper describes a method to explain/interpret the representations learned by individual neurons in deep neural networks.

  • The explanations are generated by searching for logical forms defined by a set of composition operators (like OR, AND, NOT) over primitive concepts (like water).

  • Link to the paper

Generating compositional explanations

  • Given a neural network $f$, the goal is to explain the behavior of the network's individual neurons in human-understandable terms.

  • Previous work builds on the idea that a good explanation is a description that identifies the inputs for which the neuron activates.

  • Given a set of pre-defined atomic concepts $c \in C$ and a similarity measure $\delta(n, c)$, where $n$ denotes the activation of the $n^{th}$ neuron, the explanation for the $n^{th}$ neuron is the concept most similar to it, i.e., $\arg\max_{c \in C} \delta(n, c)$.

  • For images, a concept could be represented as an image segmentation map. For example, the water concept can be represented by the segments of the images that show water.

  • The similarity can be measured by first thresholding the neuron activations (to get a binary neuron mask) and then computing the IoU score (or Jaccard similarity) between the neuron mask and the concept mask.

  • One limitation of this approach is that the explanations are restricted to pre-defined concepts.

  • The paper expands the set of candidate concepts by considering logical forms over the atomic concepts.

  • In theory, the search space grows exponentially with the number of concepts in an explanation. In practice, the search is restricted to explanations with at most $N$ atomic concepts, and beam search is used instead of an exhaustive search (a minimal sketch follows this list).
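
Below is a minimal sketch of this procedure, assuming the neuron's activations and each concept are available as boolean masks over the same set of units (pixels for vision, examples for NLI). The function names and the threshold/beam parameters are illustrative, not taken from the paper's code.

```python
import heapq
import itertools

import numpy as np


def iou(neuron_mask: np.ndarray, concept_mask: np.ndarray) -> float:
    """Jaccard similarity between two boolean masks of identical shape."""
    union = np.logical_or(neuron_mask, concept_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(neuron_mask, concept_mask).sum() / union)


def explain_neuron(activations, concepts, threshold, max_len=3, beam_size=5):
    """Beam search over logical forms (OR / AND / AND NOT) built from atomic concepts.

    activations: float array with one activation value per unit (pixel or example).
    concepts:    dict mapping concept name -> boolean mask with the same shape.
    Returns the best (formula, iou) pair found with at most `max_len` atoms.
    """
    neuron_mask = activations > threshold

    # Start the beam from the atomic concepts themselves (length-1 formulas).
    scored_atoms = [(name, mask, iou(neuron_mask, mask)) for name, mask in concepts.items()]
    beam = heapq.nlargest(beam_size, scored_atoms, key=lambda x: x[2])
    best = max(beam, key=lambda x: x[2])

    for _ in range(max_len - 1):
        candidates = []
        for (formula, mask, _), (name, atom) in itertools.product(beam, concepts.items()):
            # Extend each beam entry by one atomic concept via each operator.
            candidates.append((f"({formula} OR {name})", np.logical_or(mask, atom)))
            candidates.append((f"({formula} AND {name})", np.logical_and(mask, atom)))
            candidates.append((f"({formula} AND NOT {name})", np.logical_and(mask, ~atom)))
        scored = [(f, m, iou(neuron_mask, m)) for f, m in candidates]
        beam = heapq.nlargest(beam_size, scored, key=lambda x: x[2])
        best = max([best] + beam, key=lambda x: x[2])

    return best[0], best[2]
```

With a beam of size $b$, each step scores only on the order of $b \cdot |C|$ candidate formulas instead of the exponentially many logical forms of length $N$.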

Setup

  • Image Classification Setup

  • NLI Setup

    • A BiLSTM baseline followed by MLP layers, trained on the Stanford Natural Language Inference (SNLI) corpus.

    • Probing the penultimate hidden layer (of the MLP component) for sentence-level explanations.

    • Concepts are created using the 2000 most common words in the validation split of the SNLI dataset.

    • Additional concepts are created based on the lexical overlap between the premise and hypothesis (a sketch of this concept construction follows this list).
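
A rough sketch of how such sentence-level concepts could be constructed is given below; the helper name, tokenisation, and overlap buckets are my own illustration and may differ from the paper's exact feature set.

```python
from collections import Counter


def build_nli_concepts(examples, num_words=2000):
    """Build boolean concept masks over a list of (premise, hypothesis) string pairs.

    Each concept is a list with one boolean per example indicating whether the
    concept holds for that example.
    """
    tokenized = [(premise.lower().split(), hypothesis.lower().split())
                 for premise, hypothesis in examples]

    # Word concepts: "word w occurs in the premise" / "word w occurs in the hypothesis".
    counts = Counter(token for prem, hyp in tokenized for token in prem + hyp)
    vocabulary = [word for word, _ in counts.most_common(num_words)]

    concepts = {}
    for word in vocabulary:
        concepts[f"pre:{word}"] = [word in prem for prem, _ in tokenized]
        concepts[f"hyp:{word}"] = [word in hyp for _, hyp in tokenized]

    # Lexical-overlap concepts: fraction of hypothesis tokens that also occur in
    # the premise, discretised into coarse buckets.
    overlaps = [len(set(hyp) & set(prem)) / max(len(set(hyp)), 1)
                for prem, hyp in tokenized]
    for low, high in [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.01)]:
        concepts[f"overlap in [{low}, {high})"] = [low <= o < high for o in overlaps]

    return concepts
```

These masks can then be scored with the same IoU-based beam search sketched earlier, with one entry per SNLI example instead of one per pixel.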

Do neurons learn compositional concepts?

  • Image Classification Setup

    • As $N$ increases, the mean IoU increases (i.e., the explanation quality improves), though with diminishing returns beyond $N=10$.

    • Manual inspection of 128 neurons and their length-10 explanations shows that 69% of the neurons learned some meaningful combination of concepts, while 31% learned some unrelated concepts.

    • The meaningful combinations of concepts include:

      • perceptual abstraction that is also lexically coherent (e.g., “skyscraper OR lighthouse OR water tower”).

      • perceptual abstraction that is not lexically coherent (e.g., “cradle OR autobus OR fire escape”).

      • specialized abstraction of the form “L1 AND NOT L2” (e.g., “(water OR river) AND NOT blue”).

  • NLI Setup

    • As $N$ increases, the mean IoU increases (as in the image classification setup) though the IoU keeps increasing past $N=30$.

    • Many neurons correspond to lexical features. For example, some neurons are gender-sensitive or activate for verbs like “sitting”, “eating”, or “sleeping”. Some neurons activate when the lexical overlap between the premise and hypothesis is high.

Do interpretable neurons contribute to model accuracy?

  • In the image classification setup, the more interpretable a neuron is, the more accurate the model is on the inputs for which that neuron is active.

  • However, the opposite trend holds for the NLI model: the more interpretable neurons are associated with lower accuracy.

  • Key takeaway: interpretability (as measured by the paper) is not necessarily correlated with performance. Given a concept space, the identified behaviors may be either correlated or anti-correlated with the model's performance (a sketch of this analysis follows this list).
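
A small sketch of how this relationship can be measured, assuming we already have each neuron's best explanation IoU, a record of which neurons fire on which validation examples, and the model's per-example correctness (the array names are illustrative):

```python
import numpy as np


def interpretability_vs_accuracy(best_ious, firings, correct):
    """Correlate explanation quality with model accuracy on inputs where each neuron fires.

    best_ious: (num_neurons,) array of the best explanation IoU per neuron.
    firings:   (num_examples, num_neurons) boolean array, True where a neuron is active.
    correct:   (num_examples,) boolean array, True where the model's prediction is correct.
    Returns the Pearson correlation between per-neuron IoU and per-neuron accuracy.
    """
    per_neuron_accuracy = np.array([
        correct[firings[:, j]].mean() if firings[:, j].any() else np.nan
        for j in range(firings.shape[1])
    ])
    valid = ~np.isnan(per_neuron_accuracy)
    return float(np.corrcoef(best_ious[valid], per_neuron_accuracy[valid])[0, 1])
```

A positive correlation corresponds to the image classification result above, a negative one to the NLI result.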

Targeting explanations to change model behavior

  • The idea is to construct examples that activate (or inhibit) certain neurons, causing a change in the model’s predictions.

  • These adversarial examples are referred to as “copy-paste” adversarial examples.

  • For example, the neuron corresponding to “(water OR river) AND (NOT blue)” is a major contributor to detecting the “swimming hole” class. An adversarial example is created by making the water blue, which prompts the model to predict “grotto” instead of “swimming hole.”

  • Similarly, in the NLI model, a neuron treats the presence of the word “nobody” in the hypothesis as highly indicative of contradiction. An adversarial example can be created by adding the word “nobody” to the hypothesis, prompting the model to predict contradiction even though the true label should be neutral.

  • These observations support the hypothesis that explanations can be used to construct adversarial examples; a minimal sketch of such a probe for the NLI model is given below.
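
As a sketch of the NLI probe described above, assuming a `predict(premise, hypothesis)` function that returns one of "entailment", "neutral", or "contradiction"; both the function and the example sentences are hypothetical placeholders, not from the paper.

```python
def copy_paste_probe(predict, premise, hypothesis, perturbed_hypothesis):
    """Compare model predictions on an original and a manually perturbed hypothesis.

    `predict` is assumed to map (premise, hypothesis) strings to one of
    "entailment", "neutral", or "contradiction".
    """
    return predict(premise, hypothesis), predict(premise, perturbed_hypothesis)


# Hypothetical usage: the perturbation adds "nobody" while keeping the true label
# non-contradictory, so a flip to "contradiction" suggests the model over-relies
# on the "nobody"-sensitive neuron.
# original, perturbed = copy_paste_probe(
#     predict,
#     premise="Two dogs are playing in the park.",
#     hypothesis="Some animals are outside.",
#     perturbed_hypothesis="Some animals are outside and nobody is indoors.",
# )
```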