KQV attention#

Supplementary material to the questions and answers in What exactly are keys, queries, and values in attention mechanisms? - CV. Indirectly, commentary on Attention is All You Need (AIAYN).

Why is this question important? Many variants of attention were developed for older RNN-based models, and it’s not clear those are still used in practice. On the other hand, the KQV (or QKV) method still seems to be used extensively. See An Overview of Attention | Papers With Code. Beyond its ubiquity, see Add attention mechanism for comments on the value of attention mechanisms in general (not just KQV).

The QKV attention mechanism is particularly interesting because it’s used in Perceivers. That is, QKV makes it easy to connect two different modalities (e.g. text and image) because the queries can come from one modality while the keys and values come from the other, as in the image search built into web browsers.

Library versions#

%pip install numpy pandas
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (1.21.5)
Requirement already satisfied: pandas in /opt/conda/lib/python3.9/site-packages (1.4.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.9/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.9/site-packages (from pandas) (2022.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

Sam’s answer#

The YouTube video Sam’s answer links to seems to have changed since the author added the reference. It also leaves a lot to be desired, covering the topic for only a few minutes. If you want to avoid the YouTube paywall (or advertisements), an arguably better resource to learn from is the “Derivation” section of Latent semantic analysis - Wikipedia. Using the same data as the image in Sam’s answer, but following the logic of the Wikipedia page on LSA:

import numpy as np
import pandas as pd

X = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
])
pd.DataFrame(X,
    index=['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
    columns=["Star Wars", "The Matrix", "Iron man", "U got mail", "Titanic"])
Star Wars The Matrix Iron man U got mail Titanic
P1 1 1 1 0 0
P2 3 3 3 0 0
P3 4 4 4 0 0
P4 5 5 5 0 0
P5 0 2 0 4 4
P6 0 0 0 5 5
P7 0 1 0 2 2
from numpy import linalg as la

np.set_printoptions(precision=2, suppress=True, floatmode='maxprec_equal')

U, s, Vt = la.svd(X, full_matrices=False)
U, s, Vt
(array([[-0.14, -0.02, -0.01,  0.31,  0.84],
        [-0.41, -0.07, -0.03,  0.83, -0.35],
        [-0.55, -0.09, -0.04, -0.17,  0.34],
        [-0.69, -0.12, -0.05, -0.43, -0.23],
        [-0.15,  0.59,  0.65, -0.00,  0.00],
        [-0.07,  0.73, -0.68,  0.00,  0.00],
        [-0.08,  0.30,  0.33, -0.00,  0.00]]),
 array([12.48,  9.51,  1.35,  0.00,  0.00]),
 array([[-0.56, -0.59, -0.56, -0.09, -0.09],
        [-0.13,  0.03, -0.13,  0.70,  0.70],
        [-0.41,  0.80, -0.41, -0.09, -0.09],
        [-0.71,  0.00,  0.71, -0.00,  0.00],
        [ 0.00, -0.00,  0.00, -0.71,  0.71]]))
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
V_k = Vt_k.T
U_k, s_k, V_k
(array([[-0.14, -0.02],
        [-0.41, -0.07],
        [-0.55, -0.09],
        [-0.69, -0.12],
        [-0.15,  0.59],
        [-0.07,  0.73],
        [-0.08,  0.30]]),
 array([12.48,  9.51]),
 array([[-0.56, -0.13],
        [-0.59,  0.03],
        [-0.56, -0.13],
        [-0.09,  0.70],
        [-0.09,  0.70]]))
P8 = np.array([5, 0, 0, 0, 0])
P9 = np.array([0, 4, 5, 0, 0])
cos_sim = P8.dot(P9) / (la.norm(P8) * la.norm(P9))
cos_sim
0.0

The variables P8 and P9 are equivalent to \(\textbf{t}_i^T\) in the Wikipedia article; they are the “queries” (Q) in the language of KQV and information retrieval. To project them into the “semantic” or “concept” space we’ll use a variation on this equation from the Wikipedia article:

\[\begin{split} \begin{align} \hat{\textbf{t}}_i & = \Sigma_k^{-1} V_k^T \textbf{t}_i \\ \end{align} \end{split}\]

Taking the transpose:

\[\begin{split} \begin{align} (\hat{\textbf{t}}_i)^T & = (\Sigma_k^{-1} V_k^T \textbf{t}_i)^T \\ \hat{\textbf{t}}_i^T & = \textbf{t}_i^T V_k \Sigma_k^{-1} \\ \end{align} \end{split}\]

Initially we’ll drop the \(\Sigma_k^{-1}\) term:

P8_hat = P8 @ V_k
P9_hat = P9 @ V_k
cos_sim = P8_hat.dot(P9_hat) / (la.norm(P8_hat) * la.norm(P9_hat))
P8_hat, P9_hat, cos_sim
(array([-2.81, -0.63]), array([-5.18, -0.52]), 0.9925794247632609)

Re-adding the \(\Sigma_k^{-1}\) term:

P8_hat = P8 @ V_k @ np.diag(1 / s_k)
P9_hat = P9 @ V_k @ np.diag(1 / s_k)
P8_hat, P9_hat
(array([-0.23, -0.07]), array([-0.42, -0.05]))

Notice these two new \(\hat{\textbf{t}}_i^T\) are on the same scale as the seven original \(\hat{\textbf{t}}_i^T\), which are just the rows of \(U_k\):

U_k
array([[-0.14, -0.02],
       [-0.41, -0.07],
       [-0.55, -0.09],
       [-0.69, -0.12],
       [-0.15,  0.59],
       [-0.07,  0.73],
       [-0.08,  0.30]])

We also end up with (in general) a different cosine similarity measure:

cos_sim = P8_hat.dot(P9_hat) / (la.norm(P8_hat) * la.norm(P9_hat))
cos_sim
0.9877037969111454

To try to summarize, Sam is saying that the \(W_K\) and \(W_Q\) matrices in KQV attention both represent something like the matrix \(V_k\) of right-singular vectors above, where we also disregard the \(\Sigma_k^{-1}\) term. In optimization we can learn this scaling as part of the weight matrix, that is, learn \(V_k \Sigma_k^{-1}\) rather than only \(V_k\).
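
As a quick check of that claim (a sketch only, reusing the data from above; the SVD is recomputed so the cell stands alone), folding the scaling into a single combined weight matrix \(W = V_k \Sigma_k^{-1}\) gives exactly the same projection as applying the two factors separately:

import numpy as np

X = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k, s_k = Vt[:2].T, s[:2]
P8 = np.array([5, 0, 0, 0, 0])

W = V_k @ np.diag(1 / s_k)                        # one combined "weight matrix"
np.allclose(P8 @ W, P8 @ V_k @ np.diag(1 / s_k))  # True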

Said another way, in KQV attention we use two potentially different linear maps, \(W_K\) and \(W_Q\), to transform vectors from their original basis to a “semantic” (or “contextualized”) space where a similarity measure produces reasonable values. In latent semantic indexing (LSI) there is only one such weight matrix, represented above as \(V_k\) (not two); it already transforms both the original (P1-P7) and new (P8-P9) vectors to the same “semantic” space.

In KQV attention the loss function is also different; it’s not as simple as the Frobenius norm, both because a torch.nn.Linear layer (like linear layers in most frameworks) also learns a bias by default and because the network’s loss function is not always L2. To avoid confusion over whether a bias term is implied, it’s best not to call such a layer a “linear function.”
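
For reference (this is standard PyTorch behavior, not something specific to the SE answers), torch.nn.Linear includes a learnable bias unless you explicitly disable it, which is why a “Linear” layer is strictly speaking an affine map:

import torch.nn as nn

affine = nn.Linear(4, 3)                       # default bias=True, so y = x W^T + b
strictly_linear = nn.Linear(4, 3, bias=False)  # y = x W^T only
affine.bias is not None, strictly_linear.bias is None  # (True, True)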

Sam’s answer mentions the SVD; it would probably improve the answer to reference PCA as well. For more details see SVD for PCA. You can interpret point 2 in Sam’s answer as a reference to Feature learning - PCA and point 3 as a reference to Dimensionality reduction - PCA. See the comments following “Feature extraction and dimension reduction can be combined in one step” in Dimensionality reduction - Dimension reduction for other techniques that do both steps at once.

Sam’s answer provides a decent common-sense explanation in point 1 for why we need at least one of the \(W_Q\) or \(W_K\) matrices: so that we don’t leave our input embeddings (the \(x\) in each row of \(X\)) in their original vector space. Using slightly cleaner notation than Sam’s answer, for each head we have:

\[\begin{split} \begin{align} K & = X W_K \\ Q & = X W_Q \end{align} \end{split}\]

If we eliminated both the \(W_Q\) and \(W_K\) matrices then the \(Q K^T\) term in the attention calculation would be \(X X^T\), a Gram matrix, which is symmetric; the attention weights could then only capture information about the similarity of words in their original representation. There would be no need for multiple heads (it would be hard to justify a separate \(W_V\) in each, the only remaining learnable component) and the mechanism would provide no contextualization. See further comments about eliminating weight matrices below.
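
A minimal numeric check of the symmetry claim (random data, made up purely for illustration): without the projection matrices the score matrix collapses to \(X X^T\) and is exactly symmetric, while separate \(W_Q\) and \(W_K\) break the symmetry:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))               # 6 "words", d_model = 8

scores_no_proj = X @ X.T                  # what Q K^T collapses to without W_Q, W_K
print(np.allclose(scores_no_proj, scores_no_proj.T))  # True: symmetric

W_Q = rng.normal(size=(8, 4))             # d_model -> d_k
W_K = rng.normal(size=(8, 4))
scores = (X @ W_Q) @ (X @ W_K).T
print(np.allclose(scores, scores.T))      # False (in general): asymmetric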

mon’s answer#

See mon’s answer for a conversation more focused on the “meaning” of KQV attention rather than the mechanics. The answer is relatively high-level, however, and doesn’t even try to address multi-head attention. It references Self-attention: step-by-step video | Peltarion, which provides a similar high-level discussion.

A single-head KQV attention mechanism can really only provide a guess at what other words are important to include in a “contextualized” embedding of a more generic word. That is, it can pick out only one kind of generic Anaphora (linguistics) to include; see also an anaphora head in AIAYN - Pg14. Hence, mon’s answer simply says K/Q is about finding the “most related” word.
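
As a toy illustration of that “most related word” reading (the 2-D vectors below are entirely made up), the softmax of the query-key scores acts as a set of soft lookup weights over the other words:

import numpy as np

# Hypothetical 2-D "embeddings", chosen by hand so that "he" lies closest to "Hans".
words = ["he", "Hans", "Mary", "walks"]
E = np.array([
    [ 1.0,  0.1],   # he
    [ 0.9,  0.2],   # Hans
    [-0.8,  0.3],   # Mary
    [ 0.0, -1.0],   # walks
])

q = E[0]                                   # query for "he"
scores = E @ q / np.sqrt(E.shape[1])       # dot-product scores against every key
weights = np.exp(scores) / np.exp(scores).sum()
dict(zip(words, weights.round(2)))         # most of the weight lands on "he" and "Hans"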

The answer implies single-head attention is about where you should look for the most useful word. That is, that attention provides a probabilistic estimate of “value” for understanding. Is this where our eyes search (guess) in practice? Could we measure our saccades and compare them to the results of an attention mechanism? Arguably a search engine provides the same single-head attention scores (what you should pay attention to, what’s “valuable” for understanding) based on training on e.g. web links.

Multi-headed attention#

If you have more than one attention head, however, the different heads should be pulling different features out of the sentence. See What Does BERT Look At?. Like the features provided by a CNN, only some of the heads provide features interpretable by a human being (see Why use multi-headed attention in Transformers?).

The Peltarion author refers to Polysemy, closely related to Homonymy. You need to be able to find anaphora with attention in order to distinguish between these kinds of words. If the anaphora you need aren’t part of your sentence, you’re out of luck. You can see attention as providing a “feature” on top of your word, to help contextualize it (add or refine information for e.g. a polyseme) or disambiguate it (change its default meaning for e.g. a homonym).

Why do we need both a \(W_Q\) and \(W_K\) matrix?#

Do the \(W_K\) and \(W_Q\) matrices learn to project to the same “semantic” or “contextualized” vector space? If so, perhaps there is no need to keep both of them. Let’s say we applied this single matrix to both:

\[\begin{split} \begin{align} K & = X W_{KQ} \\ Q & = X W_{KQ} \end{align} \end{split}\]

This approach allows for some contextualization, but the product \(Q K^T\) will be a symmetric matrix. Few word relationships are symmetric like this; for example, when you want a pronoun’s query to find its associated proper noun, you do not want a proper noun’s query to pull up pronouns with the same strength.
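
Extending the earlier numeric check: with a single shared projection the scores are \((X W_{KQ})(X W_{KQ})^T\), a Gram matrix of the projected rows, so they are still symmetric:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))
W_KQ = rng.normal(size=(8, 4))     # one shared projection for both K and Q

K = Q = X @ W_KQ
scores = Q @ K.T                   # = (X W_KQ)(X W_KQ)^T
np.allclose(scores, scores.T)      # True: still symmetric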

What if we only applied the single weight matrix to one of the inputs?

\[\begin{split} \begin{align} K & = X W_K \\ Q & = X \end{align} \end{split}\]

This approach only allows limited contextualization because \(W_K\) cannot be learned in a way that produces \(k\) and \(q\) dot products significantly different from those in the original (e.g. word2vec) embedding. That is, because the \(q\) vectors are unchanged, there’s limited flexibility in creating a new semantic (or “contextualized”) space for this particular head.

So, do \(W_K\) and \(W_Q\) learn to project to the same “semantic” or “contextualized” space? Yes, but because we are interested in modeling a non-symmetric relationship, both must exist. I’d disagree with mon’s comment here.

Just to be clear, let’s work through a specific example with pronouns and proper nouns. If “he” (a pronoun) is the query then “Hans” (a proper noun) may be the most similar key we want to pull up. The word “he” (an \(x\), i.e. a row of \(X\)) would hopefully be transformed (if the weights are learned properly) by \(W_Q\) into a \(q\) that is much more similar to the \(k\) that \(W_K\) produces for “Hans” than to the \(k\) it produces for “Mary”. We want the dot-product attention score produced by this pronoun attention head to be high, which requires the two vectors to be in the same space: we need a “pronoun” semantic (or “contextualized”) space.

Because \(W_Q\) and \(W_K\) are jointly trained, they have time to work out this common representation. Remember, however, that it’s only at run time that we get a specific answer about which pronouns are most likely associated with which proper nouns; at that point the “soft” attention weights act as a linear map of their own, combining the newly mapped \(v\) vectors into a single weighted \(v\).

In this case the \(W_Q\) and \(W_K\) matrices may need to learn to emphasize gender information from the original embedding in order to find associated proper nouns (despite the reduction in dimension from \(d_{model}\) to \(d_k\)), but only \(W_K\) would need to learn that e.g. “Hans” is a Germanic boy’s name that is a proper noun (and not a pronoun), because e.g. it’s capitalized. An Article (grammar) head, by contrast, may be able to strip gender information if its only concern is e.g. whether the referenced word is singular or plural.
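
A contrived sketch of this pronoun head (the 3-D embeddings and projections below are invented for illustration, not learned): a query projection that keeps only a hypothetical “gender” dimension, paired with a key projection that also keeps a “proper noun” dimension, scores “Hans” far above “Mary” for the query “he”:

import numpy as np

# Entirely contrived 3-D embeddings: [gender, is-proper-noun, is-capitalized].
emb = {
    "he":   np.array([ 1.0, 0.0, 0.0]),
    "Hans": np.array([ 1.0, 1.0, 1.0]),
    "Mary": np.array([-1.0, 1.0, 1.0]),
}

# Hypothetical "learned" projections (d_model=3 -> d_k=2): the query side only
# emphasizes gender, while the key side also keeps the proper-noun signal.
W_Q = np.array([[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
W_K = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

q_he = emb["he"] @ W_Q
k_hans, k_mary = emb["Hans"] @ W_K, emb["Mary"] @ W_K
q_he @ k_hans, q_he @ k_mary       # 4.0 vs. -4.0: "Hans" scores much higher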

Why do we need a \(W_V\) matrix?#

Could we skip the \(W_V\) matrix if all we are doing is forwarding the original word to the next layer? Once we’ve identified the word as e.g. a pronoun it seems the attention layer has done its job and we can use our attention weights to properly emphasize the word relative to others.

The obvious reason is that our original embedded words of dimension \(d_{model}\) may be in a different dimension than the \(d_v\) we want to use downstream. That is, we may need to do dimensionality reduction to avoid an explosion in the size of our representation over the course of several layers.

Another advantage of a \(W_V\) matrix is that we can map the original embedding to a new custom (“contextualized”) embedding coming out of the whole multi-head attention mechanism. Recall that at the end of multi-head attention we concatenate the results of every head and then apply another weight matrix (\(W^O\) in AIAYN) to build a new embedding of size \(d_{model}\), matching the size of the original (e.g. word2vec) representations.
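
A shape-only sketch of that last step (the sizes are arbitrary): each head produces an \(n_{words} \times d_v\) output, the \(h\) outputs are concatenated to \(n_{words} \times h d_v\), and a final \(h d_v \times d_{model}\) matrix (the \(W^O\) of AIAYN) maps the result back to \(d_{model}\):

import numpy as np

rng = np.random.default_rng(3)
n_words, d_model, h = 5, 12, 3
d_k = d_v = d_model // h                       # 4, following the usual convention

X = rng.normal(size=(n_words, d_model))
heads = []
for _ in range(h):
    W_Q, W_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    heads.append(weights @ (X @ W_V))          # (n_words, d_v) per head

W_O = rng.normal(size=(h * d_v, d_model))
out = np.concatenate(heads, axis=1) @ W_O      # back to (n_words, d_model)
out.shape                                      # (5, 12)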

Continuing the pronoun example, the \(v\) associated with “Hans” would thus get multiplied by an attention weight near one. This new representation may contain information specific to the name “Hans” such as that it’s a Germanic name, but only if that’s information that’s important to downstream layers. Including this information in the word “he” may help future decisions if geography is important in the sentence.

The linear map that \(W_V\) provides does more than just add information to words, however (e.g. to deal with a polyseme). It can completely transform the original words (“he” and “Hans”), potentially changing their meaning (e.g. to deal with a homonym).

Let’s consider an example with modifying words and the word “bank” (actually both a polyseme and a homonym). If we see the word “river” immediately before “bank” we know the word has an almost completely different meaning than if we see e.g. “savings” before it. Remember we can use the positional encoding in the word to check whether we have e.g. an immediately preceding modifier. If an attention head recognizes these particular two-word tuples then it can change the meaning of the noun to include the information in the modifier. You could then build up higher-level concepts like phrases through multiple layers.

The Annotated Transformer#

See The Annotated Transformer (old version) and The Annotated Transformer (new version) for a helpful multi-modal summary of the Transformer paper. The text in the newer version renders too large, though it can be zoomed out. It also, annoyingly, doesn’t automatically fill anywhere near the width of a standard computer monitor; someone should republish it so it resizes to the full width of the screen. A drawing of some of the classes:

[Class diagram omitted.]

As mentioned in a comment on this SE question, the implementation of MultiHeadedAttention is not easy to follow. Quoting the code:

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

We expect to see e.g. \(W^Q_i \in \mathbb{R}^{d_{model} \times d_k}\) but instead see four square weight matrices instantiated at once with clones. Why are these square rather than rectangular? We’ll focus on the first three which are associated with KQV; the fourth is not the issue here.

Read through EncoderDecoder, EncoderLayer, Embeddings, and nn.Embedding to confirm the query, key, value arguments to the forward method are all equal for self-attention and of shape \(N_{batch} \times n_{words} \times d_{model}\). The second dimension \(n_{words}\) is the maximum number of words (embedded words) per sentence.

Look only at the line lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2). Considering only dimensions, the nn.Linear operation produces a tensor of the same shape as its argument (given above) because the weight matrices are square. We then reshape the result to \(N_{batch} \times h \times n_{words} \times d_k\) (after the transpose) which works because \(d_{model} = d_k h\) where h is the number of attention heads.
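
A quick shape check of that line in isolation (toy sizes, and a standalone nn.Linear rather than the module’s self.linears):

import torch
import torch.nn as nn

nbatches, n_words, d_model, h = 2, 7, 12, 3
d_k = d_model // h

x = torch.randn(nbatches, n_words, d_model)
lin = nn.Linear(d_model, d_model)           # square, as in the quoted code

projected = lin(x)                          # same shape as the input: (2, 7, 12)
reshaped = projected.view(nbatches, -1, h, d_k).transpose(1, 2)
projected.shape, reshaped.shape             # (2, 7, 12) and (2, 3, 7, 4)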

How is this reshaping justified? What’s happening here is probably best understood as Block matrix multiplication. The code makes it look like we’re working with one plain square matrix, but conceptually each weight matrix is \(h\) stacked blocks of shape \(d_k \times d_{model}\) (note \(d_v = d_k\)). Let’s imagine the operation is \(Y = X A^T\) where an embedded word \(x\) is on every row of \(X\) (along the last dimension, the column direction). If \(A\) is \(m \times n\) the input dimension is \(n\) and the output dimension is \(m\); notice you pass these to nn.Linear in the opposite order (its first argument is the input dimension \(n\)).

Clearly the first row of \(Y\) is influenced by all the weights but only the first example \(x\), and the first column of \(Y\) is influenced by all the examples but only the first column of \(A^T\). Extend this to a “block” to say the first \(d_k\) columns of \(Y\) are influenced by all the examples but only the first \(d_k\) columns of \(A^T\). Conceptually, then, we can see \(W^Q_i \in \mathbb{R}^{d_{model} \times d_k}\) fitting in these columns of \(A^T\) (rows of \(A\)). Notice we can also add bias offsets \(b\) whose influence is limited to a single head.

If we want to reinterpret these columns of \(Y\) as a separate head dimension then we need to acknowledge that the \(d_k\) dimension is contiguous; see Row- and column-major order. Because PyTorch tensors are row-major, self.d_k is the last argument passed to view.
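
To verify the block-matrix reading (again with toy sizes): slicing \(d_k\) contiguous rows out of the square weight matrix (rows of \(A\), i.e. columns of \(A^T\)), along with the matching slice of the bias, reproduces exactly what head \(i\) sees after the view and transpose:

import torch
import torch.nn as nn

torch.manual_seed(0)
nbatches, n_words, d_model, h = 2, 7, 12, 3
d_k = d_model // h

x = torch.randn(nbatches, n_words, d_model)
lin = nn.Linear(d_model, d_model)
per_head = lin(x).view(nbatches, -1, h, d_k).transpose(1, 2)  # (2, 3, 7, 4)

i = 1                                            # pick any head
W_i = lin.weight[i * d_k:(i + 1) * d_k, :]       # a d_k x d_model block (rows of A)
b_i = lin.bias[i * d_k:(i + 1) * d_k]
head_i = x @ W_i.T + b_i                         # what W^Q_i (or W^K_i, W^V_i) computes
torch.allclose(per_head[:, i], head_i, atol=1e-6)  # True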

Other implementations make all this clearer by using d_model, n_head, and d_k; see attention-is-all-you-need-pytorch/SubLayers.py. See also MultiheadAttention - PyTorch.

If you want more context, note that the first three arguments to the attention implementation below are then four-dimensional. This logic does batching at two levels: the normal mini-batch, and another across all heads. See the “batched” examples in torch.matmul:

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
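
A shape-only illustration of that double batching (random tensors, nothing reused from the quoted code): torch.matmul broadcasts over the leading batch and head dimensions, producing one \(n_{words} \times n_{words}\) score matrix per head per example:

import torch

nbatches, h, n_words, d_k = 2, 3, 7, 4
query = torch.randn(nbatches, h, n_words, d_k)
key = torch.randn(nbatches, h, n_words, d_k)
value = torch.randn(nbatches, h, n_words, d_k)

scores = torch.matmul(query, key.transpose(-2, -1))  # (2, 3, 7, 7)
p_attn = scores.softmax(dim=-1)
out = torch.matmul(p_attn, value)                    # (2, 3, 7, 4)
scores.shape, out.shape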

Other annotations#

See The Illustrated Transformer and Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) for parts of the model translated to visualizations.

The tutorials Language Modeling with nn.Transformer and TorchText and Language Translation with nn.Transformer and torchtext are more focused on the details of implementing a Transformer model than how or why it works; for example they don’t describe “attention” in detail and only mention MultiheadAttention - PyTorch rather than use it in the code (much less look at its internals).

The other answers on this SE question weren’t particularly helpful to me (2022-08). See About if you want me to review your answer again.