Lab 2 Walkthrough – Multi-Head Attention & Transformer Decoder
Goal: In this lab, you will implement the core building blocks of a Transformer: first at the C++ level (custom matrix math and attention kernel), then at the PyTorch level (modular deep learning architecture).
By the end, you'll understand both the numerical flow and the architectural flow of attention.
🧩 Task 1 – Matrix Multiplication (matmul)
🎯 Goal
Implement C = A × B + bias in row-major order.
This forms the foundation for the Q, K, V projections later: every linear layer in your Transformer depends on it.
📐 Understanding the math
If $A \in \mathbb{R}^{d_0 \times d_1}$ and $B \in \mathbb{R}^{d_1 \times d_2}$, then

$$C[i,j] \;=\; \sum_{k=0}^{d_1-1} A[i,k]\,B[k,j] \;+\; \text{bias}[j]$$

Each output element is the dot product of row $i$ of $A$ with column $j$ of $B$, shifted by the bias for column $j$.
🔧 Step-by-Step

- Initialize C
  - If bias exists → start each C[i,j] = bias[j].
  - Else → start from 0.0.
- Compute dot products

  ```cpp
  for (unsigned i = 0; i < d0; ++i)
      for (unsigned j = 0; j < d2; ++j)
          for (unsigned k = 0; k < d1; ++k)
              C[i*d2 + j] += A[i*d1 + k] * B[k*d2 + j];
  ```

- Check indexing carefully. Remember: row-major means the elements of a row are contiguous in memory, so element (i, j) of A lives at flat index `i*d1 + j`.
❓ Common Questions
Q: Why do we add the bias per column?
A: Because each output neuron (column) has its own bias term; it shifts that column of every row by the same amount.
Q: Why three nested loops?
A: We're performing a full matrix product. Hardware accelerators can parallelize it, but logically it's a triple loop over rows, columns, and the shared inner dimension.
✅ Checkpoint
Compare your result against NumPy or PyTorch:

```python
torch.allclose(torch.tensor(C), A @ B + bias, atol=1e-5)
```
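If you want an end-to-end reference to test the C++ kernel against, here is a minimal NumPy sketch that mirrors the same row-major triple loop on flat arrays. The function name `matmul_bias_rowmajor` and the test sizes are illustrative only; you can print its inputs and outputs and reuse them as expected values in your C++ test.

```python
# NumPy reference for the row-major matmul-plus-bias kernel, using flat arrays
# indexed exactly like the C++ version.
import numpy as np

def matmul_bias_rowmajor(A_flat, B_flat, bias, d0, d1, d2):
    """C[i*d2 + j] = sum_k A[i*d1 + k] * B[k*d2 + j] + bias[j], all row-major."""
    C = np.zeros(d0 * d2, dtype=np.float32)
    for i in range(d0):
        for j in range(d2):
            acc = bias[j] if bias is not None else 0.0
            for k in range(d1):
                acc += A_flat[i * d1 + k] * B_flat[k * d2 + j]
            C[i * d2 + j] = acc
    return C

d0, d1, d2 = 4, 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d0, d1)).astype(np.float32)
B = rng.standard_normal((d1, d2)).astype(np.float32)
bias = rng.standard_normal(d2).astype(np.float32)

C_ref = matmul_bias_rowmajor(A.ravel(), B.ravel(), bias, d0, d1, d2)
assert np.allclose(C_ref.reshape(d0, d2), A @ B + bias, atol=1e-5)
print(A.ravel(), B.ravel(), bias)    # row-major inputs for the C++ test
print(C_ref.reshape(d0, d2))         # expected output values
```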
🧮 Task 2 – Row-Wise Softmax
🎯 Goal
Convert each row of logits into probabilities that sum to 1.
📐 The math
For each row $i$:

$$\text{softmax}(A)[i,j] \;=\; \frac{e^{\,A[i,j] - m_i}}{\sum_{j'=0}^{d_1-1} e^{\,A[i,j'] - m_i}}, \qquad m_i = \max_{j} A[i,j]$$

Subtracting the row max $m_i$ prevents overflow in the exponential, and it does not change the result: the common factor $e^{-m_i}$ cancels between numerator and denominator.
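To see the overflow concretely, here is a tiny demo (illustrative only, not part of the lab code): without the max-subtraction the exponentials overflow and the result becomes NaN, while the shifted version stays finite and gives the mathematically identical answer.

```python
# Why the row max is subtracted: exp() overflows float32 around x ~ 88,
# so large logits blow up unless the max is removed first.
import numpy as np

row = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

naive = np.exp(row) / np.exp(row).sum()   # overflows to inf; inf/inf -> nan
shifted = np.exp(row - row.max())         # exponents are now <= 0
stable = shifted / shifted.sum()

print(naive)    # [nan nan nan]  (plus an overflow RuntimeWarning)
print(stable)   # approx [0.090 0.245 0.665], sums to 1
```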
🔧 Implementation Steps

- Find the max per row (all three snippets run inside an outer loop over the rows `i`)

  ```cpp
  float maxv = A[i*d1];
  for (unsigned j = 1; j < d1; ++j)
      maxv = std::max(maxv, A[i*d1 + j]);
  ```

- Exponentiate and sum

  ```cpp
  float sum = 0.f;
  for (unsigned j = 0; j < d1; ++j) {
      float e = std::exp(A[i*d1 + j] - maxv);
      A[i*d1 + j] = e;
      sum += e;
  }
  ```

- Normalize

  ```cpp
  for (unsigned j = 0; j < d1; ++j)
      A[i*d1 + j] /= sum;
  ```
💡 Why this matters
Without this function, your attention weights would be arbitrary; the softmax enforces that they represent a probability distribution over tokens.
✅ Checkpoint
Each row should sum to ~1:

```cpp
// After softmax:
for (unsigned i = 0; i < d0; ++i) {
    float s = 0;
    for (unsigned j = 0; j < d1; ++j) s += A[i*d1+j];
    assert(std::abs(s - 1.0f) < 1e-3);
}
```
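Row sums close to 1 are necessary but not sufficient. For a stronger check, compare the actual values against torch.softmax on the same input; a small sketch (the random matrix is just a stand-in for whatever test data you feed your kernel):

```python
# Generate a small test matrix and its reference row-wise softmax, so the
# values can be compared element-by-element with the C++ kernel's output.
import torch

torch.manual_seed(0)
A = torch.randn(3, 5)              # d0 = 3 rows, d1 = 5 columns
ref = torch.softmax(A, dim=-1)     # row-wise reference

print(A)                # feed these values to your C++ kernel
print(ref)              # expected output; each row sums to 1
print(ref.sum(dim=-1))  # tensor([1., 1., 1.])
```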
⚙️ Task 3 – Multi-Head Self-Attention (PyTorch)
🎯 Goal
Implement the forward() of MultiHeadSelfAttention.
📖 Conceptual Overview
Each token produces:
- a query vector (Q) – what I'm looking for
- a key vector (K) – what I contain
- a value vector (V) – what information I'll share
We compute the attention weights and output using:

$$\text{Attention}(Q, K, V) \;=\; \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head dimension (`head_dim`).
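Before filling in the class, it can help to run the formula once on a tiny tensor. Here is a minimal single-head sketch with plain tensor ops; the shapes and variable names are illustrative, not the lab's API:

```python
# Single-head scaled dot-product attention, written so each term of the
# formula is visible: scores -> softmax weights -> weighted sum of values.
import torch

torch.manual_seed(0)
B, T, d_k = 1, 4, 8                               # batch, sequence length, head dim
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)
v = torch.randn(B, T, d_k)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (B, T, T) pairwise similarities
weights = torch.softmax(scores, dim=-1)           # each row is a distribution over tokens
out = weights @ v                                 # (B, T, d_k) mixture of value vectors

print(weights[0].sum(dim=-1))  # tensor([1., 1., 1., 1.])
print(out.shape)               # torch.Size([1, 4, 8])
```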
🔧 Step-by-Step Implementation

- Project input into Q, K, V

  ```python
  q = self.q_proj(x)
  k = self.k_proj(x)
  v = self.v_proj(x)
  ```

- Reshape into heads

  ```python
  B, T, C = x.shape
  q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  ```

  → Shape becomes `(B, num_heads, T, head_dim)`.

- Compute attention scores

  ```python
  attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
  ```

- Apply causal mask (optional)

  ```python
  if use_causal_mask:
      mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
      attn_scores.masked_fill_(mask, float('-inf'))
  ```

- Softmax across the last dimension

  ```python
  attn_weights = torch.softmax(attn_scores, dim=-1)
  ```

- Weighted sum of V

  ```python
  attn_output = torch.matmul(attn_weights, v)
  ```

- Reshape back and apply the output projection

  ```python
  attn_output = attn_output.transpose(1, 2).contiguous().view(B, T, C)
  out = self.out_proj(attn_output)
  return out
  ```
❓ Common Questions
Q: Why divide by √d_k?
A: To keep the dot-product magnitude consistent across head sizes; without the scaling, the raw dot products grow with the head dimension and push the softmax into saturation.
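If you want to see this numerically, here is a quick illustrative check (not part of the lab code): for random unit-variance vectors, the standard deviation of q·k grows like √d_k, and dividing by √d_k brings it back to roughly 1 regardless of the head size.

```python
# Unscaled dot products of random vectors get larger as d_k grows;
# the 1/sqrt(d_k) scaling keeps their spread roughly constant.
import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    dots = (q * k).sum(dim=-1)                       # 10k random query/key dot products
    print(d_k, dots.std().item(), (dots / d_k ** 0.5).std().item())
    # raw std grows like sqrt(d_k) (~4, ~8, ~16); scaled std stays near 1
```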
Q: Why transpose(1, 2)?
A: To bring num_heads in front of seq_len so that each head performs its own (T × T) attention matmul.
Q: What does "causal" mean?
A: In language models, a token can't attend to future tokens.
Masking ensures position i only attends to positions ≤ i.
✅ Checkpoint
You can test equivalence against PyTorch's built-in nn.MultiheadAttention. A freshly constructed built-in module has its own random weights, so first copy your q/k/v and output-projection weights into it (and construct it with batch_first=True), then compare:

```python
torch.allclose(my_mhsa(x), mha(x, x, x)[0], atol=1e-3)
```
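Here is a minimal sketch of that comparison. It assumes your module is constructed as MultiHeadSelfAttention(embed_dim, num_heads), exposes q_proj / k_proj / v_proj / out_proj as nn.Linear layers with biases, and runs without the causal mask by default; adjust the names to match your skeleton.

```python
# Compare a custom multi-head self-attention module with nn.MultiheadAttention
# after copying the custom projection weights into the built-in module.
import torch
import torch.nn as nn

embed_dim, num_heads = 32, 4
x = torch.randn(2, 8, embed_dim)                   # (batch, seq, embed)

my_mhsa = MultiHeadSelfAttention(embed_dim, num_heads)        # your Task 3 module
ref = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

with torch.no_grad():
    # The built-in module packs the q/k/v projections into one (3*E, E) weight.
    ref.in_proj_weight.copy_(torch.cat(
        [my_mhsa.q_proj.weight, my_mhsa.k_proj.weight, my_mhsa.v_proj.weight], dim=0))
    ref.in_proj_bias.copy_(torch.cat(
        [my_mhsa.q_proj.bias, my_mhsa.k_proj.bias, my_mhsa.v_proj.bias], dim=0))
    ref.out_proj.weight.copy_(my_mhsa.out_proj.weight)
    ref.out_proj.bias.copy_(my_mhsa.out_proj.bias)

mine = my_mhsa(x)                                  # causal mask off for this test
theirs, _ = ref(x, x, x, need_weights=False)
print(torch.allclose(mine, theirs, atol=1e-3))     # expect True
```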
🧠 Task 4 – Decoder-Only Transformer
🎯 Goal
Build the GPT-style stack: embeddings → N decoder blocks → layer norm → linear output.
📖 Flow Diagram
```
Input Tokens ──▶ Embedding ──▶ +PosEnc ──▶ [DecoderBlock × N]
                                                  │
                                                  ├── Each block:
                                                  │     ├─ LayerNorm
                                                  │     ├─ MHSA (+Residual)
                                                  │     ├─ LayerNorm
                                                  │     └─ FFN (+Residual)
                                                  │
                                                  └──▶ LayerNorm ──▶ Linear (vocab projection) ──▶ Logits
```
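As a reference for what one block in the diagram looks like in code, here is a minimal pre-norm sketch. The class name DecoderBlock, the constructor arguments, the GELU activation, and the assumption that the Task 3 module is called as self.attn(x) with its default masking are all illustrative; match them to your lab skeleton.

```python
# One pre-norm decoder block: LayerNorm -> self-attention -> residual,
# then LayerNorm -> feed-forward -> residual, matching the diagram above.
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)  # Task 3 module
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.attn(self.ln1(x)))   # attention sub-layer + residual
        x = x + self.dropout(self.ffn(self.ln2(x)))    # feed-forward sub-layer + residual
        return x
```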
🔧 Implementation Steps

- Embed tokens + add position encoding (see the sketch after this list for one common way to build `pos_encoding`)

  ```python
  x = self.embed(x)
  x = x + self.pos_encoding[:x.size(1), :].to(x.device)
  ```

- Pass through the decoder layers

  ```python
  for layer in self.layers:
      x = layer(x)
  ```

- Normalize + project to vocabulary logits

  ```python
  x = self.ln_f(x)
  logits = self.head(x)
  return logits
  ```
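Step 1 assumes self.pos_encoding already exists as a (max_len, embed_dim) table. If your skeleton leaves its construction to you, a sinusoidal table is one common choice (a learned nn.Embedding over positions also works; check what the lab expects). A minimal sketch, assuming an even embed_dim:

```python
# Build a (max_len, embed_dim) sinusoidal positional-encoding table:
# even columns get sin, odd columns get cos, at geometrically spaced frequencies.
import math
import torch

def sinusoidal_pos_encoding(max_len, embed_dim):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    freq = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / embed_dim))                # (embed_dim/2,)
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even columns
    pe[:, 1::2] = torch.cos(pos * freq)   # odd columns
    return pe

# In your model's __init__, for example:
# self.register_buffer("pos_encoding", sinusoidal_pos_encoding(max_len, embed_dim))
```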
❓ Questions You Might Have
Q: Why add positional encoding?
A: Self-attention has no notion of order; the positional encoding injects the sequential structure.
Q: Why apply LayerNorm twice?
A: Each sub-layer gets its own normalization: one before the attention, one before the feed-forward network.
Q: Why dropout?
A: To regularize training and reduce overfitting in large models.
✅ Checkpoint
Feed dummy input:

```python
x = torch.randint(0, vocab_size, (2, 8))
logits = model(x)
print(logits.shape)  # [2, 8, vocab_size]
```
🧭 Summary
| Task | Concept | Core Skill | Checkpoint |
|---|---|---|---|
| 1 | Matrix multiply | Memory indexing, loops | Matches PyTorch matmul |
| 2 | Softmax | Numerical stability | Rows sum ≈ 1 |
| 3 | MHSA | Attention mechanism | Matches built-in attention |
| 4 | Decoder block | Architectural flow | Output logits valid |
Next: run your implementation side-by-side with the reference model to verify numerical equivalence. Understanding why each piece exists is key: you've just built the backbone of GPT-style transformers from scratch!
