Attention shape trace through QK^T
Question
In a single attention head, the query matrix Q has shape [B, n_q, d_k], the key matrix K has shape [B, n_k, d_k], and the value matrix V has shape [B, n_k, d_v], where B is the batch size. What is the shape of the attention-score tensor QK^T (before softmax and before multiplication by V)?
Why this matters
Shape tracing is the lowest-level interview reflex for transformer internals. Getting the [B, n_q, n_k] score shape wrong by one axis is among the most common sources of silent bugs in custom attention implementations. The score tensor's shape is also what makes softmax over the key axis (the previous proof task) well-defined: each query row holds exactly one score per key.
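A minimal NumPy shape trace of the full scaled-dot-product path, using illustrative sizes (B=2, n_q=5, n_k=7, d_k=16, d_v=32 are assumptions, not from the source):

```python
import numpy as np

B, n_q, n_k, d_k, d_v = 2, 5, 7, 16, 32  # illustrative sizes (assumed)

rng = np.random.default_rng(0)
Q = rng.standard_normal((B, n_q, d_k))
K = rng.standard_normal((B, n_k, d_k))
V = rng.standard_normal((B, n_k, d_v))

# QK^T contracts over d_k, leaving one score per (query, key) pair:
# [B, n_q, d_k] @ [B, d_k, n_k] -> [B, n_q, n_k]
scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)

# Softmax over the key axis (the last axis of the score tensor),
# with the usual max-subtraction for numerical stability.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Weighted sum of values: [B, n_q, n_k] @ [B, n_k, d_v] -> [B, n_q, d_v]
out = weights @ V
```

The trace makes the answer concrete: `scores.shape == (B, n_q, n_k)`, and each row of `weights` sums to 1 over the keys.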
Common mistake
Transposing all axes of K instead of only the last two. Calling K.T on a 3-D tensor in NumPy reverses ALL axes, giving shape [d_k, n_k, B], whose dimensions no longer line up with Q under matmul. For batched attention, use K.swapaxes(-1, -2) in NumPy (or K.transpose(-1, -2) in PyTorch; note that NumPy's ndarray.transpose expects a full axis permutation, so the two-argument form is PyTorch-only).
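A small sketch of this pitfall, using the same assumed sizes as above (B=2, n_q=5, n_k=7, d_k=16):

```python
import numpy as np

B, n_q, n_k, d_k = 2, 5, 7, 16  # illustrative sizes (assumed)

rng = np.random.default_rng(0)
Q = rng.standard_normal((B, n_q, d_k))
K = rng.standard_normal((B, n_k, d_k))

# .T on a 3-D array reverses every axis: (B, n_k, d_k) -> (d_k, n_k, B)
print(K.T.shape)                 # (16, 7, 2)
# swapaxes touches only the last two: (B, n_k, d_k) -> (B, d_k, n_k)
print(K.swapaxes(-1, -2).shape)  # (2, 16, 7)

# With K.T the matmul dimensions no longer align, so NumPy raises.
failed = False
try:
    Q @ K.T
except ValueError:
    failed = True
print("Q @ K.T raised:", failed)

# Correct batched form yields the (B, n_q, n_k) score tensor.
scores = Q @ K.swapaxes(-1, -2)
print(scores.shape)              # (2, 5, 7)
```

Note that with these sizes the bug fails loudly; if n_q, n_k, and d_k happen to coincide, the wrong transpose can produce a validly shaped but incorrect result, which is why the document calls this a source of silent bugs.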
Source anchor
content/topics/attention-mechanism-theory.mdx#scaled-dot-product-attention