Lab 2 Companion - Full Technical Specification with Visuals
Purpose. This companion document explains every operation and conceptual link behind Lab 2: from low-level linear algebra in C++ to high-level transformer architectures in PyTorch. It assumes zero background in AI or programming and aims to make the mechanics of attention and transformer models fully transparent.
0) Overview and Non-Spoiler Policy
This document helps you understand how and why each task works without giving away code implementations. You will learn the mathematical structure, data flow, and shape transformations — enough to debug and verify your implementation.
- No numeric results or completed code are shown.
- All examples are shape-based and conceptual.
1) From Linear Algebra to Machine Learning Computation
1.1 The three primitives
Every modern ML model — from CNNs to Transformers — ultimately depends on three core operations:
- Vector dot product: combines two 1-D arrays → scalar.
- Matrix–vector multiply: combines 2-D and 1-D → 1-D.
- Matrix–matrix multiply: combines two 2-D → 2-D.
These three follow the same pattern:
- Perform elementwise multiply across a shared dimension.
- Perform reduction (sum) over that dimension.
For example, matrix multiply computes each output element as

C[i,j] = Σₖ A[i,k] · B[k,j]

This summation is the dot product of row i from A and column j from B, performed once for every (i, j) output position.
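To see the shared multiply-then-reduce pattern concretely, here is a small Python/NumPy sketch (purely illustrative; the lab's C++ code uses explicit loops rather than NumPy):

```python
import numpy as np

x = np.random.rand(4)        # vector, shape [4]
A = np.random.rand(3, 4)     # matrix, shape [3, 4]
B = np.random.rand(4, 5)     # matrix, shape [4, 5]

# Dot product: elementwise multiply, then reduce over the shared dim -> scalar
dot = (x * x).sum()

# Matrix-vector: each output i reduces A[i, :] * x over the shared dim -> [3]
mv = (A * x[None, :]).sum(axis=1)

# Matrix-matrix: each output (i, j) reduces A[i, :] * B[:, j] -> [3, 5]
mm = (A[:, :, None] * B[None, :, :]).sum(axis=1)

assert np.allclose(mv, A @ x) and np.allclose(mm, A @ B)
```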
1.2 Parallelism and Latency
Hardware designers view this operation as two stages:
| Stage | Description | Parallelism |
|---|---|---|
| Elementwise multiply | Multiply all pairs (A[i,k], B[k,j]) | Embarrassingly parallel |
| Reduction (sum) | Accumulate partial results per output | Requires synchronization |
Even with infinite compute units, the reduction imposes a lower bound on latency: all partial sums must complete before output.
In hardware terms: Multiplications can run in parallel; additions must merge results step-by-step. This insight drives accelerator design (Task 1 Q1–Q3).
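To illustrate why the reduction stage bounds latency, here is a small Python sketch (not part of the lab code) of a pairwise tree reduction: even if every addition within a level runs in parallel, roughly log2(n) sequential levels remain.

```python
import math

def tree_reduce_levels(partial_products):
    """Sum a list by pairwise addition; each while-iteration is one 'parallel level'."""
    values, levels = list(partial_products), 0
    while len(values) > 1:
        # All additions within this level are independent and could run in parallel.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, levels = tree_reduce_levels([1.0] * 16)
assert total == 16.0 and levels == math.ceil(math.log2(16))  # 4 sequential levels
```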
1.3 Memory layout and access
In real systems, matrices live in linear memory, not true 2-D arrays. Access pattern matters:
For an M × N matrix (M rows, N columns):
- Row-major: addr(i,j) = base + (i*N + j)
- Column-major: addr(i,j) = base + (j*M + i)
Why this matters:
- Stride patterns affect cache efficiency.
- AI accelerators often re-order tensors to minimize memory stalls.
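A short Python sketch of the two addressing formulas for an M × N matrix (illustrative only; the base address and sizes are made up):

```python
M, N = 3, 4  # 3 rows, 4 columns

def row_major(i, j, base=0):
    # Elements of the same row sit next to each other (stride 1 along j).
    return base + (i * N + j)

def col_major(i, j, base=0):
    # Elements of the same column sit next to each other (stride 1 along i).
    return base + (j * M + i)

# Walking along a row: row-major is contiguous, column-major strides by M.
print([row_major(1, j) for j in range(N)])  # [4, 5, 6, 7]
print([col_major(1, j) for j in range(N)])  # [1, 4, 7, 10]
```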
1.4 Connecting to C++ Implementation
In mhsa.cpp, the function signature:
void matmul(const float *A, const float *B, float *C, unsigned d0, unsigned d1, unsigned d2, const float *bias)
represents:
| Symbol | Meaning | Shape |
|---|---|---|
| A | Left matrix | [d0 × d1] |
| B | Right matrix | [d1 × d2] |
| C | Output matrix | [d0 × d2] |
| bias | Optional vector | [d2] |
Row-major indexing:
A[i*d1 + k] → element Aᵢₖ
So the loop nest:
for i in rows(A):
    for j in cols(B):
        C[i,j] = bias[j]              # start from the bias (or 0 when no bias is given)
        for k in shared_dim:
            C[i,j] += A[i,k] * B[k,j]
explicitly performs the multiply-accumulate pattern.
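One way to sanity-check your loop nest is against a NumPy reference on flat, row-major arrays. The sketch below is illustrative; the helper name ref_matmul_flat is not part of the lab code:

```python
import numpy as np

def ref_matmul_flat(A, B, bias, d0, d1, d2):
    """Reference for C = A @ B + bias on flat row-major arrays (illustrative only)."""
    C = [0.0] * (d0 * d2)
    for i in range(d0):
        for j in range(d2):
            acc = bias[j] if bias is not None else 0.0
            for k in range(d1):
                acc += A[i * d1 + k] * B[k * d2 + j]
            C[i * d2 + j] = acc
    return C

d0, d1, d2 = 2, 3, 4
A, B, bias = np.random.rand(d0, d1), np.random.rand(d1, d2), np.random.rand(d2)
C = ref_matmul_flat(A.ravel(), B.ravel(), bias, d0, d1, d2)
assert np.allclose(np.array(C).reshape(d0, d2), A @ B + bias)
```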
2) The Softmax Operation
2.1 What it does
softmax(x) converts a vector of raw scores (logits) into probabilities:

softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)
2.2 Why it matters
- Converts unbounded logits → bounded probabilities (0–1).
- Makes outputs comparable across classes.
- Used in attention to turn similarity scores into weights that sum to 1.
2.3 Numerical stability
Direct exponentiation can cause overflow/underflow. The standard fix is to subtract the maximum first:

softmax(x)ᵢ = exp(xᵢ − max(x)) / Σⱼ exp(xⱼ − max(x))

This subtraction doesn't change the result; it just keeps the numbers in a safe range.
2.4 Implementation insight
In mhsa.cpp, Task 2 asks you to implement row-wise softmax:
- Find row max → subtract.
- Compute exp of shifted values.
- Divide each by the row sum.
Why row-wise? Each query token’s attention weights must sum to 1 across all keys.
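A minimal NumPy sketch of those three row-wise steps (illustrative; your C++ version operates on flat arrays instead):

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax over the last axis: one distribution per row."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # subtract each row's max
    exps = np.exp(shifted)                                   # exponentiate shifted values
    return exps / exps.sum(axis=-1, keepdims=True)           # normalize so each row sums to 1

weights = softmax_rows(np.array([[1000.0, 1001.0, 1002.0],
                                 [-5.0, 0.0, 5.0]]))
assert np.allclose(weights.sum(axis=-1), 1.0)  # no overflow despite large logits
```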
3) Attention and Its Intuition
3.1 What is attention?
Given a sequence of inputs x₁, …, xₙ, attention learns which other positions matter when computing the output at position i.
Each token forms three learned projections:
- Query (Q): what am I looking for?
- Key (K): what do I have to offer?
- Value (V): what information do I share?
3.2 Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Why divide by √dₖ? Without it, dot products grow with vector length, pushing softmax into saturated regions. Scaling keeps gradients well-behaved.
3.3 Shape trace
| Tensor | Shape |
|---|---|
| Q | [seq, dₖ] |
| K | [seq, dₖ] |
| V | [seq, dᵥ] |
| QKᵀ | [seq, seq] (attention scores) |
| softmax(QKᵀ) | [seq, seq] (weights per query) |
| output | [seq, dᵥ] |
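The shape trace maps directly onto a few tensor operations. A minimal PyTorch sketch without masking (Section 3.4 adds the mask); the tensor sizes here are arbitrary:

```python
import math
import torch

seq, d_k, d_v = 4, 8, 8
Q, K, V = torch.randn(seq, d_k), torch.randn(seq, d_k), torch.randn(seq, d_v)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [seq, seq] similarity per (query, key)
weights = torch.softmax(scores, dim=-1)             # [seq, seq], each row sums to 1
output = weights @ V                                # [seq, d_v] weighted mix of values

assert output.shape == (seq, d_v)
```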
3.4 Causal masking
For autoregressive models (like language generators):
- Each token should only attend to itself and previous tokens.
- The mask sets future positions to a very negative value (≈ −1e9) so their softmax weight ≈ 0.
Visual example for seq_len = 4:
[[0, -inf, -inf, -inf],
[0, 0, -inf, -inf],
[0, 0, 0, -inf],
[0, 0, 0, 0 ]]
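One common way to build and apply such a mask in PyTorch (a sketch; models.py may construct it differently):

```python
import torch

T = 4
# True above the diagonal = future positions that must be hidden.
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

scores = torch.randn(T, T)                        # raw attention scores
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)           # future positions get weight 0

assert torch.allclose(weights.triu(1), torch.zeros(T, T))
```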
4) Multi-Head Self-Attention (MHSA)
4.1 Why multiple heads?
Different heads can learn different dependency patterns:
- One head focuses on short-range relations.
- Another tracks long-range or semantic relationships.
Each head runs its own attention operation with its own parameters.
4.2 Structure overview
For H heads:
- Split the embedding dimension: embed_dim = H × head_dim.
- Compute Q, K, V for each head.
- Apply attention(Q,K,V) separately.
- Concatenate all heads → [seq, embed_dim].
- Project the concatenated result through a final W_o.
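A shape-only PyTorch sketch of the split → per-head attention → concatenate → project pipeline (the sizes are arbitrary and the per-head attention itself is elided):

```python
import torch

B, T, H, head_dim = 2, 5, 4, 16
embed_dim = H * head_dim

x = torch.randn(B, T, embed_dim)

# Split: carve embed_dim into H heads and move the head axis forward.
heads = x.view(B, T, H, head_dim).transpose(1, 2)     # [B, H, T, head_dim]

# ... each head would run its own attention here ...

# Concatenate: undo the transpose and merge heads back into embed_dim.
merged = heads.transpose(1, 2).contiguous().view(B, T, embed_dim)

# Final projection W_o mixes information across heads.
W_o = torch.nn.Linear(embed_dim, embed_dim)
out = W_o(merged)                                      # [B, T, embed_dim]
assert out.shape == (B, T, embed_dim)
```

Note that head_dim = embed_dim / H, so splitting into more heads does not add parameters; it partitions the same embedding width.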
4.3 C++ side
In mhsa.cpp, the loop over heads explicitly:
- Slices each weight matrix (W_q, W_k, W_v, W_o).
- Calls your matmul() and softmax() functions.
- Applies the causal mask if enabled.
- Combines the per-head results into the final output.
This structure shows the mechanical composition of attention layers — useful for future accelerator mapping.
4.4 PyTorch side
In models.py → MultiHeadSelfAttention.forward():
- Project input x → q, k, v (linear layers).
- Reshape to (B, num_heads, T, head_dim).
- Compute attention scores via torch.matmul(q, k.transpose(-2, -1)) / sqrt(d_k).
- Apply the causal mask.
- Apply softmax across the last dimension.
- Compute the weighted sum with v.
- Reshape back to (B, T, embed_dim) and project through out_proj.
5) Transformer Decoder-Only Architecture
5.1 Structure
A decoder block combines:
- Masked self-attention (causal).
- Feed-forward network (FFN): 2 linear layers + activation.
- Residual connections around both sub-modules.
- Layer normalization for stability.
The forward path:
x → LN → SelfAttention + Residual → LN → FFN + Residual
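A conceptual sketch of that forward path in the pre-norm residual style shown above; the module names (ln1, self_attn, ln2, ffn) are placeholders, and nn.MultiheadAttention stands in for the lab's own attention module:

```python
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Illustrative pre-norm decoder block: masked attention + FFN, each with a residual."""
    def __init__(self, embed_dim, num_heads, ffn_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, ffn_dim), nn.GELU(),
                                 nn.Linear(ffn_dim, embed_dim))

    def forward(self, x, causal_mask):
        h = self.ln1(x)
        a, _ = self.self_attn(h, h, h, attn_mask=causal_mask)  # masked self-attention
        x = x + a                                               # residual around attention
        x = x + self.ffn(self.ln2(x))                           # residual around FFN
        return x

# Example: block = DecoderBlockSketch(embed_dim=64, num_heads=4, ffn_dim=256)
```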
5.2 Positional encoding
Transformers have no built-in notion of sequence order, so we add positional encodings:
- Fixed sinusoidal patterns:
  - Even dims: sin(position / 10000^(2i/dim))
  - Odd dims: cos(position / 10000^(2i/dim))
- Added directly to the token embeddings.
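A compact NumPy sketch of the sinusoidal table (illustrative; models.py may build it with torch instead):

```python
import numpy as np

def sinusoidal_positions(max_len, dim):
    """Return a [max_len, dim] table: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]             # [max_len, 1]
    i = np.arange(0, dim, 2)[None, :]             # even dimension indices (2i in the formula)
    angle = pos / np.power(10000.0, i / dim)      # [max_len, dim/2]
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(angle)                   # even dims
    pe[:, 1::2] = np.cos(angle)                   # odd dims
    return pe

pe = sinusoidal_positions(max_len=16, dim=32)
# Added directly to the token embeddings: embeddings + pe[:seq_len]
```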
5.3 Decoder-only workflow
During training:
- Input: [x₁, x₂, …, xₙ, <start>, reversed_targets, <end>].
- All timesteps processed in parallel.
- Mask ensures token t only sees ≤ t positions.
At inference:
- Predict one token → append → repeat (autoregressive generation).
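A schematic Python sketch of that loop using greedy decoding; model, prompt_ids, start_token, end_token, and max_len are placeholders, and your generation code may differ:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, start_token, end_token, max_len):
    """Greedy autoregressive decoding: predict one token, append it, repeat."""
    seq = prompt_ids + [start_token]
    for _ in range(max_len):
        logits = model(torch.tensor([seq]))     # assumed output shape: [1, len(seq), vocab]
        next_id = int(logits[0, -1].argmax())   # most likely next token
        seq.append(next_id)
        if next_id == end_token:
            break
    return seq
```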
5.4 Why Transformers beat CNNs here
- Parallelism: No sequential dependency like RNNs.
- Global context: Each token can consider all previous tokens, regardless of distance.
- Parameter efficiency: Same weights reused via attention instead of large kernels.
6) Sanity Checks and Debugging Guide
| Symptom | Likely Cause | Fix |
|---|---|---|
| Output all zeros | Forgetting bias init in matmul | Initialize C = bias or 0 |
| NaNs in softmax | Forgot to subtract max or used wrong dimension | Subtract row max before exp |
| Loss not decreasing | Gradient blocked by softmax misuse | Don’t apply softmax before CrossEntropyLoss |
| C++ vs PyTorch mismatch | Row/col indexing swapped | Verify indexing order and shape |
| Model diverges | LR too high | Reduce LR or use Adam |
7) Conceptual FAQs
Q1. Why does softmax appear again in attention? It converts raw similarity scores into a distribution of attention weights.
Q2. Why is causal masking essential? Without it, a language model could peek at future words — invalidating autoregressive training.
Q3. Why use scaling? Prevents large dot-products from pushing softmax into saturation (where gradients vanish).
Q4. Why are transformers better for long sequences? Because every token can attend to any other directly — no distance decay like in CNN or RNN windows.
Q5. What makes the C++ version educational? It exposes the exact loop structure behind high-level PyTorch ops — vital for hardware design intuition.
8) Verification Steps
- Matrix multiply test: compare C++ vs PyTorch (--task 1).
- Softmax test: row sums ≈ 1 (--task 2).
- Multi-head attention test: compare C++ vs PyTorch MHSA (--task 3).
- Model training test: compare CNN vs Transformer (--task 4).
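If you want a quick programmatic spot check beyond the provided --task scripts, here is a hedged sketch with placeholder helpers (the dummy tensor stands in for whatever your comparison harness produces):

```python
import torch

def check_rows_sum_to_one(attn_weights, atol=1e-5):
    """Every query's attention weights should sum to ~1 across keys."""
    row_sums = attn_weights.sum(dim=-1)
    return torch.allclose(row_sums, torch.ones_like(row_sums), atol=atol)

def check_outputs_match(cpp_output, torch_output, atol=1e-4):
    """C++ and PyTorch results should agree up to floating-point tolerance."""
    return torch.allclose(cpp_output, torch_output, atol=atol)

# Example with a dummy tensor (replace with your real comparison data):
w = torch.softmax(torch.randn(4, 4), dim=-1)
assert check_rows_sum_to_one(w)
```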
Expected trends:
- CNN performs well for short sequences but fails for long ones.
- Transformer learns reversal faster and generalizes better even with fewer parameters.
9) Closing Remarks
This lab bridges the mathematical, implementation, and architectural views of deep learning. By writing low-level kernels in C++ and connecting them to PyTorch’s high-level abstractions, you’ll develop the essential skill of reasoning across all layers — from tensor algebra → accelerator design → full model behavior.
The transformer isn’t just a neural network — it’s a composition of the same linear algebra principles from Part 1, scaled up and optimized for parallel hardware. Once you grasp that, the rest of deep learning becomes a matter of shape management and memory control.
