
Mechanistic interpretability techniques can be used to characterize the function of specific attention heads in transformer models on a given task. Prior work has shown, however, that when all heads performing a particular function are ablated during a run of the model, other attention heads take over and perform the function of the ablated heads. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used to perform the ablation: interfering with the function of a given head in different ways elicits similar backup head behavior. We also find that "backup backup head" behavior exists and is likewise robust to the ablation distribution.

Code supporting the writeup can be found at the following Colab Notebook: 
https://colab.research.google.com/drive/1Qa58m1X_bgsV2QT9mIpP-OlcMAGchSnO?usp=sh...
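As a rough illustration of what ablating a head under different distributions means, here is a minimal sketch using TransformerLens-style hooks. The model name, layer/head indices, prompt, and the particular zero- and mean-ablation choices are illustrative assumptions, not the exact setup used in the notebook.

```python
# Sketch: ablate one attention head under two different "ablation distributions"
# (zero-ablation and mean-ablation) and compare the model's top prediction.
# Assumes a TransformerLens-style setup; the layer/head choice is hypothetical.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

LAYER, HEAD = 9, 9  # example head to ablate (placeholder, not from the writeup)

# Cache clean per-head outputs so we can build a mean-ablation replacement.
_, cache = model.run_with_cache(tokens)
hook_name = utils.get_act_name("z", LAYER)  # shape [batch, pos, head, d_head]

def zero_ablate(z, hook):
    # Replace the head's output with zeros.
    z[:, :, HEAD, :] = 0.0
    return z

def mean_ablate(z, hook):
    # Replace the head's output with its mean over positions
    # (one simple alternative ablation distribution).
    z[:, :, HEAD, :] = cache[hook_name][:, :, HEAD, :].mean(dim=1, keepdim=True)
    return z

clean_logits = model(tokens)
zero_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, zero_ablate)])
mean_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, mean_ablate)])

# Compare how each ablation distribution changes the final-token prediction.
for name, logits in [("clean", clean_logits), ("zero", zero_logits), ("mean", mean_logits)]:
    top_token = logits[0, -1].argmax()
    print(name, model.to_string(top_token))
```

In the writeup, backup head behavior is assessed by checking which downstream heads change their contribution when ablations like these are applied; the sketch above only shows the ablation step itself.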

Download

backup_transformer_heads_are_robust_to_ablation_distribution.pdf (69 kB)
appendix_notebook.ipynb (3 MB)
