Backup Transformer Heads are Robust to Ablation Distribution
Mechanistic interpretability techniques can be used to characterize the function that specific attention heads in a transformer model perform on a given task. Prior work has shown, however, that when all heads performing a particular function are ablated during a forward pass, other attention heads take over and perform the original function of the ablated heads. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used for the ablation: interfering with the function of a given head in different ways elicits similar backup head behavior. We also find that "backup backup head" behavior exists and is likewise robust to the choice of ablation distribution.
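As a concrete illustration of what ablating a head under different distributions means, the sketch below zero-ablates and mean-ablates the output of a single attention head. It assumes the TransformerLens library and GPT-2 small; the layer, head index, and prompt are hypothetical placeholders, not the specific heads or task studied in the writeup.

```python
# Minimal sketch: ablate one attention head under two different distributions
# (zero-ablation vs. mean-ablation) and compare downstream behavior.
# Model, layer, head, and prompt are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
LAYER, HEAD = 9, 6  # hypothetical head to ablate

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

def zero_ablate(z, hook):
    # z: [batch, pos, head_index, d_head]; silence the head entirely
    z[:, :, HEAD, :] = 0.0
    return z

def mean_ablate(z, hook):
    # Replace the head's output with its mean over batch and position,
    # a different (gentler) intervention than zeroing.
    z[:, :, HEAD, :] = z[:, :, HEAD, :].mean(dim=(0, 1), keepdim=True)
    return z

hook_name = utils.get_act_name("z", LAYER)
clean_logits = model(tokens)
zero_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, zero_ablate)])
mean_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, mean_ablate)])

# Downstream heads whose contributions change similarly under both
# interventions are candidate backup heads.
```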
Code supporting the writeup can be found at the following Colab Notebook:
https://colab.research.google.com/drive/1Qa58m1X_bgsV2QT9mIpP-OlcMAGchSnO?usp=sh...