Backup Transformer Heads are Robust to Ablation Distribution
Mechanistic interpretability techniques can be used to characterize the function that specific attention heads in a transformer model perform on a given task. Prior work has shown, however, that when all heads performing a particular function are ablated during a forward pass, other attention heads take over and perform the original function of the ablated heads. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used for the ablation: interfering with the function of a given head in different ways elicits similar backup head behavior. We also find that "backup backup head" behavior exists and is likewise robust to the choice of ablation distribution.
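As a concrete illustration of what ablating a head under different distributions means, the sketch below zero-ablates and mean-ablates the output of a single attention head. It assumes the TransformerLens library and GPT-2 small; the layer, head index, and prompt are hypothetical placeholders, not the specific heads or task studied in the writeup.

```python
# Minimal sketch: ablate one attention head under two different distributions
# (zero-ablation vs. mean-ablation) and compare downstream behavior.
# Model, layer, head, and prompt are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")
LAYER, HEAD = 9, 6  # hypothetical head to ablate

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

def zero_ablate(z, hook):
    # z: [batch, pos, head_index, d_head]; silence the head entirely
    z[:, :, HEAD, :] = 0.0
    return z

def mean_ablate(z, hook):
    # Replace the head's output with its mean over batch and position,
    # a different (gentler) intervention than zeroing.
    z[:, :, HEAD, :] = z[:, :, HEAD, :].mean(dim=(0, 1), keepdim=True)
    return z

hook_name = utils.get_act_name("z", LAYER)
clean_logits = model(tokens)
zero_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, zero_ablate)])
mean_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, mean_ablate)])

# Downstream heads whose contributions change similarly under both
# interventions are candidate backup heads.
```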
Code supporting the writeup can be found at the following Colab Notebook:
https://colab.research.google.com/drive/1Qa58m1X_bgsV2QT9mIpP-OlcMAGchSnO?usp=sh...