Calculation · Difficulty 2/5 · Target edge: matrix-multiplication-algorithms → attention-mechanism-theory

Attention shape trace through QK^T

Question

In a single attention head, the query matrix Q has shape [B, n_q, d_k], the key matrix K has shape [B, n_k, d_k], and the value matrix V has shape [B, n_k, d_v], where B is the batch size. What is the shape of the attention-score tensor QK^T (before softmax and before multiplication by V)?

Why this matters

Shape tracing is the lowest-level interview reflex for transformer internals. Getting the (B, n_q, n_k) score shape wrong by one axis is among the most common sources of silent bugs in custom attention implementations. The score tensor's shape is also what makes softmax over the key axis (the previous proof task) well-defined: the last axis must be n_k so that each query row normalizes over all keys.
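The shape trace can be checked directly in NumPy. The sizes below are hypothetical, chosen only to make each axis distinguishable; the point is that batched matmul contracts the last axis of Q (d_k) against the second-to-last axis of the transposed K, leaving (B, n_q, n_k):

```python
import numpy as np

# Hypothetical sizes for illustration: every axis is distinct so a
# mix-up shows up immediately in the printed shape.
B, n_q, n_k, d_k, d_v = 2, 5, 7, 64, 32

Q = np.random.randn(B, n_q, d_k)
K = np.random.randn(B, n_k, d_k)

# Transpose only the last two axes of K, then batched matmul:
# (B, n_q, d_k) @ (B, d_k, n_k) -> (B, n_q, n_k)
scores = Q @ K.swapaxes(-1, -2)
print(scores.shape)  # (2, 5, 7) == (B, n_q, n_k)

# Softmax over the key axis (the last axis) is well-defined on this shape:
# each query row normalizes over all n_k keys.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```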

Common mistake

Transposing all axes of K instead of only the last two when batched. Calling K.T on a 3-D array in NumPy reverses all axes, giving shape [d_k, n_k, B], which cannot be matrix-multiplied against Q: the contracted dimensions no longer line up. Use K.transpose(0, 2, 1) or K.swapaxes(-1, -2) for batched attention.
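The pitfall and its fix can be demonstrated in a few lines (sizes are hypothetical, reused from a typical single-head setup):

```python
import numpy as np

B, n_q, n_k, d_k = 2, 5, 7, 64
Q = np.random.randn(B, n_q, d_k)
K = np.random.randn(B, n_k, d_k)

# WRONG: .T on a 3-D array reverses ALL axes.
print(K.T.shape)  # (64, 7, 2) == (d_k, n_k, B)
try:
    Q @ K.T  # contracted dims mismatch: d_k=64 vs n_k=7
except ValueError:
    pass  # matmul fails instead of silently producing a wrong shape

# RIGHT: swap only the last two axes.
print(K.swapaxes(-1, -2).shape)  # (2, 64, 7) == (B, d_k, n_k)
scores = Q @ K.swapaxes(-1, -2)
print(scores.shape)              # (2, 5, 7) == (B, n_q, n_k)
```

Here the wrong version fails loudly only because B, n_q, and d_k all differ; with coincidentally matching sizes it can succeed with a nonsense shape, which is why the axis-only transpose should be a habit.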

Source anchor

content/topics/attention-mechanism-theory.mdx#scaled-dot-product-attention