Add attention mechanism#

Estimate value#

Should you add an “attention” mechanism to your model? For one argument, see the “Why Self-Attention” section of “Attention is All You Need” (1706.03762.pdf - Why Self-Attention).

The authors give three “desiderata” motivating self-attention. They wanted a cheaper model (“total computational complexity per layer”), which they argue they achieved in Table 1 and Table 2 (via FLOPs). They wanted a highly parallelizable model (“amount of computation that can be parallelized”), which they argue they achieved in Table 1. Finally, they wanted to minimize the path length between long-range dependencies, which they also argue for in Table 1.
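To make the Table 1 comparison concrete, here is a rough sketch that evaluates the asymptotic per-layer figures from the paper for a given sequence length `n`, representation dimension `d`, and convolution kernel width `k`. Constant factors are ignored, and the function and key names are made up for illustration, so treat the numbers as orders of magnitude rather than measurements.

```python
def log_ceil(n: int, k: int) -> int:
    """Smallest number of steps s such that k**s >= n (integer arithmetic only)."""
    steps, reach = 0, 1
    while reach < n:
        reach *= k
        steps += 1
    return steps


def per_layer_costs(n: int, d: int, k: int = 3) -> dict:
    """Asymptotic per-layer figures in the spirit of Table 1 (constants dropped).

    ops        -> total computational complexity per layer
    sequential -> minimum number of sequential operations
    max_path   -> maximum path length between any two positions
    """
    return {
        "self_attention": {"ops": n * n * d, "sequential": 1, "max_path": 1},
        "recurrent": {"ops": n * d * d, "sequential": n, "max_path": n},
        "convolutional": {
            "ops": k * n * d * d,
            "sequential": 1,
            "max_path": log_ceil(n, k),
        },
    }


# Short sequence relative to the model width: self-attention needs fewer ops.
short = per_layer_costs(n=50, d=512)
# Long sequence relative to the model width: the n^2 term dominates.
long_ = per_layer_costs(n=2048, d=512)

for name, cost in short.items():
    print(name, cost)
```

With these assumed sizes, self-attention is the cheaper layer for the short sequence and the more expensive one for the long sequence, while its sequential-operation count and maximum path length stay at 1 in both cases.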

The first goal is about reducing costs, the second is about both scale and fast feedback, and the third is about model performance. They argue they hit the third goal with their SOTA results.

Many people criticize transformers for being expensive, but an attention mechanism on its own is only expensive if you take advantage of its scalability. If you don’t, it is by design meant to be cheaper (goal #1): per Table 1, a self-attention layer costs less than a recurrent layer when the sequence length is smaller than the representation dimension. The Transformer model itself (because it has both encoders and decoders) has scalability issues, however. See Transformer (machine learning model) - Alternative.

The original authors were probably smart enough to keep it to three points because they know most ML practitioners have a short Attention span, but I’ll add a fourth goal for attention mechanisms: Generalizability.


The attention mechanism is useful across both NLP and CV (as are, e.g., residual connections). The authors only touched on this in 1706.03762.pdf - Conclusion, but attention has found success in many CV problems. See also:

Attention vs. Self-Attention#

See 5. in What’s the difference between Attention vs Self-Attention?. The generalizability of attention mechanisms means that attention (in contrast to self-attention) can be used to combine modalities. See further comments about this application to Perceiver models in KQV attention.
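As a minimal sketch of that distinction, the function below computes scaled dot-product attention where the queries come from one array and the keys and values from another. Passing the same array twice gives self-attention; passing two different arrays gives attention across sources. The "text" and "image" arrays, their sizes, and the random weight matrices are all hypothetical stand-ins (nothing here is trained), and NumPy is assumed to be available.

```python
import numpy as np


def attention(X_q, X_kv, W_Q, W_K, W_V):
    """Scaled dot-product attention with queries from X_q and keys/values from X_kv."""
    Q = X_q @ W_Q    # (n_q, d_k)
    K = X_kv @ W_K   # (n_kv, d_k)
    V = X_kv @ W_V   # (n_kv, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n_q, n_kv)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights


rng = np.random.default_rng(0)
n_text, d_text = 3, 4    # hypothetical text tokens and feature size
n_image, d_image = 6, 5  # hypothetical image patches and feature size
d_k, d_v = 8, 7

text = rng.normal(size=(n_text, d_text))
image = rng.normal(size=(n_image, d_image))

# Random matrices standing in for trained projections.
W_Q_text = rng.normal(size=(d_text, d_k))
W_Q_image = rng.normal(size=(d_image, d_k))
W_K = rng.normal(size=(d_image, d_k))
W_V = rng.normal(size=(d_image, d_v))

# Self-attention: image patches attend to image patches.
self_out, self_w = attention(image, image, W_Q_image, W_K, W_V)

# Attention across modalities: text tokens query the image patches.
cross_out, cross_w = attention(text, image, W_Q_text, W_K, W_V)

print(self_w.shape, cross_w.shape)
```

The only structural requirement is that queries and keys share the dimension `d_k`; the two inputs can differ in both length and feature size, which is what lets one modality query another.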

Soft weights#

From Attention (machine learning) - Wikipedia:

Its flexibility comes from its role as “soft weights” that can change during runtime, in contrast to standard weights that must remain fixed at runtime.

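In terms of QKV attention, the “soft weights” are what are sometimes called the attention weights. These are formed by using the WQ and WK matrices (from training) to build Q and K matrices from X. Q is then multiplied by the transpose of K, and in scaled dot-product attention the result is scaled by the square root of the key dimension and passed through a row-wise softmax to produce the attention weights. They’re called “weights” only because you get a matrix here, and a matrix is what you’d typically use in a fully connected (FC) layer (ignoring the bias offsets b). Unlike the weights of an FC layer, they are recomputed for every input X.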
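The sketch below illustrates the “soft weights” idea under the assumption of scaled dot-product attention as in the paper (scaling by the square root of the key dimension, then a row-wise softmax). The matrices `W_Q` and `W_K` are random stand-ins for trained weights, the input sizes are arbitrary, and NumPy is assumed to be available.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(X, W_Q, W_K):
    """Build Q and K from X, then return softmax(Q K^T / sqrt(d_k))."""
    Q = X @ W_Q
    K = X @ W_K
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1)


rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8

# "Hard" weights: fixed after training (random placeholders here).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

# Two different inputs processed with the same fixed W_Q and W_K.
X_a = rng.normal(size=(n, d_model))
X_b = rng.normal(size=(n, d_model))

A_a = attention_weights(X_a, W_Q, W_K)  # (n, n) soft weights for X_a
A_b = attention_weights(X_b, W_Q, W_K)  # (n, n) soft weights for X_b

print(A_a.round(2))
print(A_b.round(2))
```

`W_Q` and `W_K` never change between the two calls, yet `A_a` and `A_b` differ because they are functions of the input. Each row sums to 1, so a row can be read as how strongly one position weighs every other position.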

Said another way, the “Inquiry system” described in mon’s answer to - What exactly are keys, queries, and values in attention mechanisms? is a function that takes a function (the trained WK and WQ matrices). For another simple explanation, see what’s the difference between “self-attention mechanism” and “full-connection” layer?.


The term “attention” is used in this article following the definition in Attention (machine learning) - Wikipedia. Does Attention (machine learning) compress everything the word Attention means to humans? Not even close. The self-attention mechanism can at best be described as a tool to help a model decide what to weigh highly (pay “attention” to), given what it is currently processing. For another simple definition, see the start of Attention? Attention! | Lil’Log. ML practitioners love the phrase “attend to” though.

We may need to start inventing new words. In my opinion there is already little to no distinction between the words “attention” and “focus” except perhaps that the former is often treated as a resource; see etymology - Attention, focus, and respect as distributable resources and Attention economy. See also Focus (linguistics), which has some clear parallels to Attention (machine learning). The term Hyperfocus is defined in terms of attention. The word concentrate is even less independent of the word focus than attention. In concentrate - Wiktionary, the verb is defined in terms of focus - Wiktionary (and vice-versa).

The following graph shows the development history of variants of attention mechanisms, linking papers that reference each other and removing transitive dependencies. Dates are those of the original submission to arXiv rather than of publication at a venue. The list is roughly based on:


The nodes in this graph are clickable links that default to Wikipedia (a collection of links). Otherwise, each link goes to the paper’s abstract page on arXiv (in the same order), which makes it easy to get the latest version (e.g. v7 for some papers) even if the author adds an update.

See Connected Papers for a denser graph, though CP is not a citation graph (see Connected Papers | About). See this link for a custom graph with three of these papers as origins.

Estimated cost#

See KQV attention for community-sourced annotations to AIAYN and references to code implementations of KQV attention mechanisms.