Provably Overwhelming Transformer Models with Designed Inputs
We develop an algorithm which, given a trained transformer model $\mathcal{M}$ as input, as well as a string of tokens $s$ of length $n$ and an integer $m$, can generate a mathematical proof that $\mathcal{M}$ is ``overwhelmed'' by $s$, in time and space polynomial in $n$ and $m$. We say that $\mathcal{M}$ is ``overwhelmed'' by $s$ when the output of the model evaluated on this string plus any additional string of tokens $t$, i.e. $\mathcal{M}(s + t)$, is completely insensitive to the value of the string $t$ whenever $\mathrm{length}(t) \leq m$. Along the way, we prove a particularly strong worst-case form of ``over-squashing'', which we use to bound the model's behavior. Our technique uses computer-aided proofs to establish this type of operationally relevant guarantee about transformer models. We empirically test our algorithm on a single-layer transformer, complete with an attention head, layer-norm, MLP/ReLU layers, and RoPE positional encoding. We believe that this work is a stepping stone towards the difficult task of obtaining useful guarantees for trained transformer models.
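As a concrete illustration of the ``overwhelmed'' property (and not of the paper's certified proof procedure), the sketch below samples random suffixes $t$ and measures how much the model's output on $s + t$ moves as $t$ varies; a near-zero spread is only empirical evidence, whereas the algorithm above produces a proof. Everything here is a hypothetical stand-in: `toy_model`, `empirical_overwhelm_gap`, and the decaying-weight averaging are invented for illustration and are not the trained transformer or the method studied in the paper.

```python
# Hypothetical sketch: an empirical (sampling-based) proxy for the
# "overwhelmed" property, NOT the paper's certified proof algorithm.
import numpy as np

VOCAB = 32      # illustrative vocabulary size (assumption)
D_MODEL = 8     # illustrative model width (assumption)

rng = np.random.default_rng(0)
EMBED = rng.normal(size=(VOCAB, D_MODEL))  # random token embeddings

def toy_model(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a trained transformer's output on a token string.
    Positional weights decay geometrically, so tokens appended after a
    long prefix contribute almost nothing -- loosely mimicking the
    over-squashing effect the paper bounds rigorously."""
    emb = EMBED[tokens]                        # (len, d) embeddings
    weights = 0.5 ** np.arange(len(tokens))    # decaying position weights
    weights /= weights.sum()
    return weights @ emb                       # (d,) output vector

def empirical_overwhelm_gap(model, prefix, max_suffix_len, n_samples=500):
    """Largest observed deviation of model(prefix + t) across random
    suffixes t with 1 <= length(t) <= max_suffix_len. A tiny value is
    evidence (not a proof) that the prefix overwhelms the model."""
    outputs = []
    for _ in range(n_samples):
        k = int(rng.integers(1, max_suffix_len + 1))
        t = rng.integers(0, VOCAB, size=k)
        outputs.append(model(np.concatenate([prefix, t])))
    ref = outputs[0]
    return max(float(np.linalg.norm(out - ref)) for out in outputs)

prefix = rng.integers(0, VOCAB, size=64)       # fixed string s
print(empirical_overwhelm_gap(toy_model, prefix, max_suffix_len=4))
```

On this toy model the reported gap is essentially zero because suffix positions carry vanishing weight; for a real trained transformer such sampling can only suggest the property, which is why the paper instead constructs a worst-case, computer-aided proof.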