[OpenReview] [arXiv] [Code]
TL;DR: The most important principal components provide more complete and interpretable explanations than the most important neurons.
High-quality explanations of neural networks (NNs) should exhibit two key properties. Completeness ensures that they accurately reflect a network's function, and interpretability makes them understandable to humans. The most complete explanation would be to simply display the equation for a layer's forward pass; however, this explanation has poor interpretability. At the opposite extreme, many popular NN explanation methods make choices that increase interpretability at the expense of completeness; in particular, many explain individual neurons within a network. We provide evidence that for AlexNet, neuron-based explanation methods sacrifice both completeness and interpretability compared to activation principal components (PCs). Neurons are a poor basis for AlexNet embeddings because they do not account for the distributed nature of these representations.
The problem of explaining a NN can be decomposed into understanding the nonlinear transformation applied by each layer in terms of the NN's input space. To facilitate this for a particular layer, we sample activations, fit a basis for activation space, and visualize points along each basis vector.
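As a rough, hedged illustration of this pipeline (not the paper's released code), the sketch below hooks one AlexNet layer, collects activations over a few batches, and fits a PCA basis for that layer's activation space. The pretrained torchvision AlexNet, the choice of classifier[1] as the "fc1" layer, and the random placeholder images are all assumptions made to keep the example self-contained.

```python
# Sketch only: collect activations from one AlexNet layer via a forward hook,
# then fit a PCA basis for that layer's activation space.
import torch
import torchvision
from sklearn.decomposition import PCA

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

activations = []

def save_activations(module, inputs, output):
    # Flatten to (batch, features) so each sample is a point in activation space.
    activations.append(output.detach().flatten(start_dim=1))

# Assumption: classifier[1] (the first fully connected layer) plays the role of "fc1".
handle = model.classifier[1].register_forward_hook(save_activations)

with torch.no_grad():
    for _ in range(10):
        # Placeholder inputs; in practice these would be real dataset images.
        model(torch.randn(32, 3, 224, 224))
handle.remove()

A = torch.cat(activations).numpy()   # (n_samples, n_neurons) activation matrix
pca = PCA().fit(A)                   # principal-component basis for activation space
print(pca.components_.shape)         # (n_components, n_neurons) basis vectors
```

Each row of pca.components_ is one basis vector; finding inputs whose activations lie far along a given row is one way to realize the "visualize points along each basis vector" step described above.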
Using the interface below, you can interpret visualizations of the following:
Explanation completeness is an abstract concept that could be measured in a variety of ways. We use two complementary measures of subspace completeness below.
One measure of completeness is the fraction of activation variance explained by a set of basis vectors. Below, we plot the cumulative explained variance ratio of the top-k basis vectors. Much of the activation variance is concentrated in the most important PCs (blue line) whereas explained variance is far less concentrated in the neuron basis (orange line). For example, to explain 80% of the activation variance for fc1, one could either study the first 42 PCs, or the 2782 highest variance neurons.
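Below is a minimal sketch of how the two curves above could be computed, assuming an activation matrix A like the one collected in the earlier snippet: the PC curve uses PCA's explained variance ratios, while the neuron curve uses per-neuron activation variances sorted in descending order.

```python
# Sketch: cumulative explained variance for the PC basis vs. the neuron basis.
import numpy as np
from sklearn.decomposition import PCA

def cumulative_explained_variance(A):
    """Return cumulative variance ratios for top-k PCs and top-k highest-variance neurons."""
    A = A - A.mean(axis=0)
    pc_ratio = np.cumsum(PCA().fit(A).explained_variance_ratio_)

    neuron_var = np.sort(A.var(axis=0))[::-1]             # highest-variance neurons first
    neuron_ratio = np.cumsum(neuron_var) / neuron_var.sum()
    return pc_ratio, neuron_ratio

pc_ratio, neuron_ratio = cumulative_explained_variance(A)
# How many basis vectors are needed to explain 80% of the activation variance?
print("PCs needed:", int(np.searchsorted(pc_ratio, 0.8)) + 1)
print("Neurons needed:", int(np.searchsorted(neuron_ratio, 0.8)) + 1)
```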
Another measure of completeness is to cumulatively ablate basis vectors and observe how much accuracy degrades. Ablating basis vectors that are more important to the network's function should degrade accuracy more rapidly than ablating less important ones. For most layers in AlexNet, ablating the highest-variance PCs (solid blue line) damages accuracy more than ablating the highest-variance neurons (solid orange line).
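Below is one hedged way such an ablation could be implemented, not necessarily the authors' exact procedure: a forward hook projects the top-k basis directions out of the layer's activations, and accuracy is then re-measured with the hook attached. The helper ablate_topk, the orthonormal basis argument, and the commented usage with a labeled loader are all illustrative assumptions.

```python
# Sketch: cumulative ablation of basis directions in a layer's activation space.
import torch

def ablate_topk(basis, k):
    """Return a forward hook that removes the top-k basis directions from a layer's output.
    `basis` is an orthonormal (n_vectors, n_neurons) array such as pca.components_;
    for the neuron basis it would be one-hot rows for the highest-variance neurons."""
    V = torch.as_tensor(basis[:k], dtype=torch.float32)   # (k, n_neurons)

    def hook(module, inputs, output):
        coords = output @ V.T         # components along the ablated directions
        return output - coords @ V    # returning a value replaces the layer's output
    return hook

# Illustrative usage (assumes `model` and `pca` from the earlier sketch and a labeled
# `loader` yielding (images, labels) batches):
# handle = model.classifier[1].register_forward_hook(ablate_topk(pca.components_, k=10))
# with torch.no_grad():
#     correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in loader)
# handle.remove()
```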
Nolan Dey, Eric Taylor, Alexander Wong, Bryan Tripp, Graham Taylor. Neuron-based explanations of neural networks sacrifice completeness and interpretability. Transactions on Machine Learning Research, 2025.
@article{dey2025neurons,
  author  = {Dey, Nolan and Taylor, Eric and Wong, Alexander and Tripp, Bryan and Taylor, Graham},
  title   = {Neuron-based explanations of neural networks sacrifice completeness and interpretability},
  year    = {2025},
  journal = {Transactions on Machine Learning Research},
  url     = {https://openreview.net/forum?id=UWNa9Pv6qA}
}
This research was supported by funding from BMO Bank of Montreal through the Waterloo Artificial Intelligence Institute (SRA #081648). The authors thank Thomas Fortin for helping to run experiments with ResNet and ViT. This research was supported, in part, by the Province of Ontario and the Government of Canada through the Canadian Institute for Advanced Research (CIFAR), and companies sponsoring the Vector Institute. GWT is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs program, and the Canada CIFAR AI Chairs program. This research was conducted with approval from the University of Guelph Research Ethics Board (REB #20-12-003).