Lab 2 Walkthrough – Multi-Head Attention & Transformer Decoder
Goal: In this lab, you will implement the core building blocks of a Transformer: first at the C++ level (custom matrix math and attention kernel), then at the PyTorch level (modular deep learning architecture).
By the end, you'll understand both the numerical flow and the architectural flow of attention.
🧩 Task 1 – Matrix Multiplication (matmul)
🎯 Goal
Implement C = A × B + bias in row-major order.
This forms the foundation for the Q, K, V projections later: every linear layer in your Transformer depends on it.
📐 Understanding the math
If $A \in \mathbb{R}^{d_0 \times d_1}$ and $B \in \mathbb{R}^{d_1 \times d_2}$, then

$$C[i,j] \;=\; \sum_{k=0}^{d_1-1} A[i,k]\,B[k,j] \;+\; \text{bias}[j]$$

Each output element is the dot product of row $i$ of $A$ with column $j$ of $B$, shifted by the bias for column $j$.
🔧 Step-by-Step

- Initialize C
  - If bias exists → start each C[i,j] = bias[j].
  - Else → start from 0.0.
- Compute dot products

  ```cpp
  for (unsigned i = 0; i < d0; ++i)
      for (unsigned j = 0; j < d2; ++j)
          for (unsigned k = 0; k < d1; ++k)
              C[i*d2 + j] += A[i*d1 + k] * B[k*d2 + j];
  ```

- Check indexing carefully. Remember: row-major means the elements of a row are contiguous in memory, so element (i, j) of A lives at flat index `i*d1 + j`.
❓ Common Questions
Q: Why do we add the bias per column?
A: Because each output neuron (column) has its own bias term; it shifts that column of every row by the same amount.
Q: Why three nested loops?
A: We're performing a full matrix product. Hardware accelerators can parallelize it, but logically it's a triple loop over rows, columns, and the shared inner dimension.
✅ Checkpoint
Compare your result against NumPy or PyTorch:

```python
torch.allclose(torch.tensor(C), A @ B + bias, atol=1e-5)
```
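If you want an end-to-end reference to test the C++ kernel against, here is a minimal NumPy sketch that mirrors the same row-major triple loop on flat arrays. The function name `matmul_bias_rowmajor` and the test sizes are illustrative only; you can print its inputs and outputs and reuse them as expected values in your C++ test.

```python
# NumPy reference for the row-major matmul-plus-bias kernel, using flat arrays
# indexed exactly like the C++ version.
import numpy as np

def matmul_bias_rowmajor(A_flat, B_flat, bias, d0, d1, d2):
    """C[i*d2 + j] = sum_k A[i*d1 + k] * B[k*d2 + j] + bias[j], all row-major."""
    C = np.zeros(d0 * d2, dtype=np.float32)
    for i in range(d0):
        for j in range(d2):
            acc = bias[j] if bias is not None else 0.0
            for k in range(d1):
                acc += A_flat[i * d1 + k] * B_flat[k * d2 + j]
            C[i * d2 + j] = acc
    return C

d0, d1, d2 = 4, 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d0, d1)).astype(np.float32)
B = rng.standard_normal((d1, d2)).astype(np.float32)
bias = rng.standard_normal(d2).astype(np.float32)

C_ref = matmul_bias_rowmajor(A.ravel(), B.ravel(), bias, d0, d1, d2)
assert np.allclose(C_ref.reshape(d0, d2), A @ B + bias, atol=1e-5)
print(A.ravel(), B.ravel(), bias)    # row-major inputs for the C++ test
print(C_ref.reshape(d0, d2))         # expected output values
```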
🧮 Task 2 – Row-Wise Softmax
🎯 Goal
Convert each row of logits into probabilities that sum to 1.
📐 The math
For each row $i$:

$$\text{softmax}(A)[i,j] \;=\; \frac{e^{\,A[i,j] - m_i}}{\sum_{j'=0}^{d_1-1} e^{\,A[i,j'] - m_i}}, \qquad m_i = \max_{j} A[i,j]$$

Subtracting the row max $m_i$ prevents overflow in the exponential, and it does not change the result: the common factor $e^{-m_i}$ cancels between numerator and denominator.
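To see the overflow concretely, here is a tiny demo (illustrative only, not part of the lab code): without the max-subtraction the exponentials overflow and the result becomes NaN, while the shifted version stays finite and gives the mathematically identical answer.

```python
# Why the row max is subtracted: exp() overflows float32 around x ~ 88,
# so large logits blow up unless the max is removed first.
import numpy as np

row = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

naive = np.exp(row) / np.exp(row).sum()   # overflows to inf; inf/inf -> nan
shifted = np.exp(row - row.max())         # exponents are now <= 0
stable = shifted / shifted.sum()

print(naive)    # [nan nan nan]  (plus an overflow RuntimeWarning)
print(stable)   # approx [0.090 0.245 0.665], sums to 1
```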
🔧 Implementation Steps

- Find the max per row (all three snippets run inside an outer loop over the rows `i`)

  ```cpp
  float maxv = A[i*d1];
  for (unsigned j = 1; j < d1; ++j)
      maxv = std::max(maxv, A[i*d1 + j]);
  ```

- Exponentiate and sum

  ```cpp
  float sum = 0.f;
  for (unsigned j = 0; j < d1; ++j) {
      float e = std::exp(A[i*d1 + j] - maxv);
      A[i*d1 + j] = e;
      sum += e;
  }
  ```

- Normalize

  ```cpp
  for (unsigned j = 0; j < d1; ++j)
      A[i*d1 + j] /= sum;
  ```
💡 Why this matters
Without this function, your attention weights would be arbitrary; the softmax enforces that they represent a probability distribution over tokens.
✅ Checkpoint
Each row should sum to ~1:

```cpp
// After softmax:
for (unsigned i = 0; i < d0; ++i) {
    float s = 0;
    for (unsigned j = 0; j < d1; ++j) s += A[i*d1+j];
    assert(std::abs(s - 1.0f) < 1e-3);
}
```
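Row sums close to 1 are necessary but not sufficient. For a stronger check, compare the actual values against torch.softmax on the same input; a small sketch (the random matrix is just a stand-in for whatever test data you feed your kernel):

```python
# Generate a small test matrix and its reference row-wise softmax, so the
# values can be compared element-by-element with the C++ kernel's output.
import torch

torch.manual_seed(0)
A = torch.randn(3, 5)              # d0 = 3 rows, d1 = 5 columns
ref = torch.softmax(A, dim=-1)     # row-wise reference

print(A)                # feed these values to your C++ kernel
print(ref)              # expected output; each row sums to 1
print(ref.sum(dim=-1))  # tensor([1., 1., 1.])
```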
⚙️ Task 3 – Multi-Head Self-Attention (PyTorch)
🎯 Goal
Implement the forward() of MultiHeadSelfAttention.
📖 Conceptual Overview
Each token produces:
- a query vector (Q) – what I'm looking for
- a key vector (K) – what I contain
- a value vector (V) – what information I'll share
We compute the attention weights and output using:

$$\text{Attention}(Q, K, V) \;=\; \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head dimension (`head_dim`).
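Before filling in the class, it can help to run the formula once on a tiny tensor. Here is a minimal single-head sketch with plain tensor ops; the shapes and variable names are illustrative, not the lab's API:

```python
# Single-head scaled dot-product attention, written so each term of the
# formula is visible: scores -> softmax weights -> weighted sum of values.
import torch

torch.manual_seed(0)
B, T, d_k = 1, 4, 8                               # batch, sequence length, head dim
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)
v = torch.randn(B, T, d_k)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (B, T, T) pairwise similarities
weights = torch.softmax(scores, dim=-1)           # each row is a distribution over tokens
out = weights @ v                                 # (B, T, d_k) mixture of value vectors

print(weights[0].sum(dim=-1))  # tensor([1., 1., 1., 1.])
print(out.shape)               # torch.Size([1, 4, 8])
```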
🔧 Step-by-Step Implementation

- Project input into Q, K, V

  ```python
  q = self.q_proj(x)
  k = self.k_proj(x)
  v = self.v_proj(x)
  ```

- Reshape into heads

  ```python
  B, T, C = x.shape
  q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
  ```

  → Shape becomes `(B, num_heads, T, head_dim)`.

- Compute attention scores

  ```python
  attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
  ```

- Apply causal mask (optional)

  ```python
  if use_causal_mask:
      mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
      attn_scores.masked_fill_(mask, float('-inf'))
  ```

- Softmax across the last dimension

  ```python
  attn_weights = torch.softmax(attn_scores, dim=-1)
  ```

- Weighted sum of V

  ```python
  attn_output = torch.matmul(attn_weights, v)
  ```

- Reshape back and apply the output projection

  ```python
  attn_output = attn_output.transpose(1, 2).contiguous().view(B, T, C)
  out = self.out_proj(attn_output)
  return out
  ```
❓ Common Questions
Q: Why divide by √d_k?
A: To keep the dot-product magnitude consistent across head sizes; without the scaling, the raw dot products grow with the head dimension and push the softmax into saturation.
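If you want to see this numerically, here is a quick illustrative check (not part of the lab code): for random unit-variance vectors, the standard deviation of q·k grows like √d_k, and dividing by √d_k brings it back to roughly 1 regardless of the head size.

```python
# Unscaled dot products of random vectors get larger as d_k grows;
# the 1/sqrt(d_k) scaling keeps their spread roughly constant.
import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    dots = (q * k).sum(dim=-1)                       # 10k random query/key dot products
    print(d_k, dots.std().item(), (dots / d_k ** 0.5).std().item())
    # raw std grows like sqrt(d_k) (~4, ~8, ~16); scaled std stays near 1
```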
Q: Why transpose(1, 2)?
A: To bring num_heads in front of seq_len so that each head performs its own (T × T) attention matmul.
Q: What does "causal" mean?
A: In language models, a token can't attend to future tokens.
Masking ensures position i only attends to positions ≤ i.
✅ Checkpoint
You can test equivalence against PyTorch's built-in nn.MultiheadAttention. A freshly constructed built-in module has its own random weights, so first copy your q/k/v and output-projection weights into it (and construct it with batch_first=True), then compare:

```python
torch.allclose(my_mhsa(x), mha(x, x, x)[0], atol=1e-3)
```
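Here is a minimal sketch of that comparison. It assumes your module is constructed as MultiHeadSelfAttention(embed_dim, num_heads), exposes q_proj / k_proj / v_proj / out_proj as nn.Linear layers with biases, and runs without the causal mask by default; adjust the names to match your skeleton.

```python
# Compare a custom multi-head self-attention module with nn.MultiheadAttention
# after copying the custom projection weights into the built-in module.
import torch
import torch.nn as nn

embed_dim, num_heads = 32, 4
x = torch.randn(2, 8, embed_dim)                   # (batch, seq, embed)

my_mhsa = MultiHeadSelfAttention(embed_dim, num_heads)        # your Task 3 module
ref = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

with torch.no_grad():
    # The built-in module packs the q/k/v projections into one (3*E, E) weight.
    ref.in_proj_weight.copy_(torch.cat(
        [my_mhsa.q_proj.weight, my_mhsa.k_proj.weight, my_mhsa.v_proj.weight], dim=0))
    ref.in_proj_bias.copy_(torch.cat(
        [my_mhsa.q_proj.bias, my_mhsa.k_proj.bias, my_mhsa.v_proj.bias], dim=0))
    ref.out_proj.weight.copy_(my_mhsa.out_proj.weight)
    ref.out_proj.bias.copy_(my_mhsa.out_proj.bias)

mine = my_mhsa(x)                                  # causal mask off for this test
theirs, _ = ref(x, x, x, need_weights=False)
print(torch.allclose(mine, theirs, atol=1e-3))     # expect True
```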
🧠 Task 4 – Decoder-Only Transformer
🎯 Goal
Build the GPT-style stack: embeddings → N decoder blocks → layer norm → linear output.
📖 Flow Diagram
```
Input Tokens ──▶ Embedding ──▶ +PosEnc ──▶ [DecoderBlock × N]
                                                  │
                                                  ├── Each block:
                                                  │     ├─ LayerNorm
                                                  │     ├─ MHSA (+Residual)
                                                  │     ├─ LayerNorm
                                                  │     └─ FFN (+Residual)
                                                  │
                                                  └──▶ LayerNorm ──▶ Linear (vocab projection) ──▶ Logits
```
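As a reference for what one block in the diagram looks like in code, here is a minimal pre-norm sketch. The class name DecoderBlock, the constructor arguments, the GELU activation, and the assumption that the Task 3 module is called as self.attn(x) with its default masking are all illustrative; match them to your lab skeleton.

```python
# One pre-norm decoder block: LayerNorm -> self-attention -> residual,
# then LayerNorm -> feed-forward -> residual, matching the diagram above.
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_hidden, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)  # Task 3 module
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.attn(self.ln1(x)))   # attention sub-layer + residual
        x = x + self.dropout(self.ffn(self.ln2(x)))    # feed-forward sub-layer + residual
        return x
```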
🔧 Implementation Steps

- Embed tokens + add position encoding (see the sketch after this list for one common way to build `pos_encoding`)

  ```python
  x = self.embed(x)
  x = x + self.pos_encoding[:x.size(1), :].to(x.device)
  ```

- Pass through the decoder layers

  ```python
  for layer in self.layers:
      x = layer(x)
  ```

- Normalize + project to vocabulary logits

  ```python
  x = self.ln_f(x)
  logits = self.head(x)
  return logits
  ```
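Step 1 assumes self.pos_encoding already exists as a (max_len, embed_dim) table. If your skeleton leaves its construction to you, a sinusoidal table is one common choice (a learned nn.Embedding over positions also works; check what the lab expects). A minimal sketch, assuming an even embed_dim:

```python
# Build a (max_len, embed_dim) sinusoidal positional-encoding table:
# even columns get sin, odd columns get cos, at geometrically spaced frequencies.
import math
import torch

def sinusoidal_pos_encoding(max_len, embed_dim):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    freq = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / embed_dim))                # (embed_dim/2,)
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(pos * freq)   # even columns
    pe[:, 1::2] = torch.cos(pos * freq)   # odd columns
    return pe

# In your model's __init__, for example:
# self.register_buffer("pos_encoding", sinusoidal_pos_encoding(max_len, embed_dim))
```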
❓ Questions You Might Have
Q: Why add positional encoding?
A: Self-attention has no notion of order; the positional encoding injects the sequential structure.
Q: Why apply LayerNorm twice?
A: Each sub-layer gets its own normalization: one before the attention, one before the feed-forward network.
Q: Why dropout?
A: To regularize training and reduce overfitting in large models.
✅ Checkpoint
Feed dummy input:

```python
x = torch.randint(0, vocab_size, (2, 8))
logits = model(x)
print(logits.shape)  # [2, 8, vocab_size]
```
🧭 Summary
| Task | Concept | Core Skill | Checkpoint |
|---|---|---|---|
| 1 | Matrix multiply | Memory indexing, loops | Matches PyTorch matmul |
| 2 | Softmax | Numerical stability | Rows sum ≈ 1 |
| 3 | MHSA | Attention mechanism | Matches built-in attention |
| 4 | Decoder block | Architectural flow | Output logits valid |
Next: run your implementation side-by-side with the reference model to verify numerical equivalence. Understanding why each piece exists is key: you've just built the backbone of GPT-style transformers from scratch!
