Lab 2 Companion - Full Technical Specification with Visuals

Purpose. This companion document explains every operation and conceptual link behind Lab 2: from low-level linear algebra in C++ to high-level transformer architectures in PyTorch. It assumes zero background in AI or programming and aims to make the mechanics of attention and transformer models fully transparent.


0) Overview and Non-Spoiler Policy

This document helps you understand how and why each task works without giving away code implementations. You will learn the mathematical structure, data flow, and shape transformations — enough to debug and verify your implementation.

  • No numeric results or completed code are shown.
  • All examples are shape-based and conceptual.

1) From Linear Algebra to Machine Learning Computation

1.1 The three primitives

Every modern ML model — from CNNs to Transformers — ultimately depends on three core operations:

  1. Vector dot product: combines two 1-D arrays → scalar.
  2. Matrix–vector multiply: combines 2-D and 1-D → 1-D.
  3. Matrix–matrix multiply: combines two 2-D → 2-D.

These three follow the same pattern:

  • Perform elementwise multiply across a shared dimension.
  • Perform reduction (sum) over that dimension.

For example, matrix multiply:

C_{ij} = \sum_k A_{ik} B_{kj}

This sum over the shared index k is the dot product of row i of A with column j of B, computed for every (i, j) pair.
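
To make the pattern concrete, here is a tiny PyTorch check (a sketch for illustration only, not part of the lab code) that the explicit multiply-then-reduce loop matches the library call:

```python
import torch

# Small random matrices: A is [3 x 4], B is [4 x 2].
A = torch.randn(3, 4)
B = torch.randn(4, 2)

# Explicit elementwise multiply followed by a reduction over the shared dimension k.
C_manual = torch.zeros(3, 2)
for i in range(3):
    for j in range(2):
        C_manual[i, j] = (A[i, :] * B[:, j]).sum()

# The library matmul performs the same multiply-accumulate pattern, just vectorized.
assert torch.allclose(C_manual, A @ B, atol=1e-5)
```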


1.2 Parallelism and Latency

Hardware designers view this operation as two stages:

| Stage | Description | Parallelism |
| --- | --- | --- |
| Elementwise multiply | Multiply all pairs (Aᵢₖ, Bₖⱼ) | Embarrassingly parallel |
| Reduction (sum) | Accumulate partial results per output | Requires synchronization |

Even with unlimited compute units, the reduction imposes a lower bound on latency: every partial product must be accumulated before an output element is ready.

In hardware terms: Multiplications can run in parallel; additions must merge results step-by-step. This insight drives accelerator design (Task 1 Q1–Q3).
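
As a toy illustration of that lower bound (purely conceptual, not lab code), compare sequential accumulation with a pairwise tree reduction: even with one multiplier per pair, the sum still needs about log2(K) dependent merge steps:

```python
import math

K = 8                       # length of the shared dimension
partials = [1.0] * K        # stand-ins for the K partial products

# Sequential accumulation: K - 1 dependent additions, one after another.
sequential_steps = K - 1

# Pairwise (tree) reduction: halve the number of partials at every step.
tree_steps = 0
while len(partials) > 1:
    partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    tree_steps += 1

assert tree_steps == math.ceil(math.log2(K))   # ~log2(K) synchronized merge steps
```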


1.3 Memory layout and access

In real systems, matrices live in linear memory, not true 2-D arrays. Access pattern matters:

  • Row-major (for an M × N matrix): addr(i, j) = base + (i*N + j)
  • Column-major: addr(i, j) = base + (j*M + i)
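
A quick way to see both layouts from Python (illustrative only; the lab's C++ code uses row-major indexing directly) is to inspect tensor strides:

```python
import torch

M, N = 3, 4
x = torch.arange(M * N, dtype=torch.float32).reshape(M, N)   # row-major by default

# Row-major: stepping along a row moves 1 element in memory, stepping down a column moves N.
assert x.stride() == (N, 1)

# A transposed view behaves like column-major storage over the same buffer:
# consecutive elements along a row are now N apart in memory.
assert x.t().stride() == (1, N)
```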

Why this matters:

  • Stride patterns affect cache efficiency.
  • AI accelerators often re-order tensors to minimize memory stalls.

1.4 Connecting to C++ Implementation

In mhsa.cpp, the function signature:

```cpp
void matmul(const float *A, const float *B, float *C,
            unsigned d0, unsigned d1, unsigned d2, const float *bias)
```

represents:

| Symbol | Meaning | Shape |
| --- | --- | --- |
| A | Left matrix | [d0 × d1] |
| B | Right matrix | [d1 × d2] |
| C | Output matrix | [d0 × d2] |
| bias | Optional vector | [d2] |

Row-major indexing: A[i*d1 + k] → element Aᵢₖ

So the loop nest:

```cpp
for (unsigned i = 0; i < d0; ++i) {          // rows of A
  for (unsigned j = 0; j < d2; ++j) {        // columns of B
    C[i*d2 + j] = bias ? bias[j] : 0.0f;     // start from the bias (or zero)
    for (unsigned k = 0; k < d1; ++k) {      // shared dimension
      C[i*d2 + j] += A[i*d1 + k] * B[k*d2 + j];
    }
  }
}
```

explicitly performs the multiply-accumulate pattern.


2) The Softmax Operation

2.1 What it does

softmax(x) converts a vector of raw scores (logits) into probabilities:

\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

2.2 Why it matters

  • Converts unbounded logits → bounded probabilities (0–1).
  • Makes outputs comparable across classes.
  • Used in attention to turn similarity scores into weights that sum to 1.

2.3 Numerical stability

Direct exponentiation can cause overflow/underflow. The standard fix is:

\mathrm{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This shift does not change the result, because the common factor cancels between numerator and denominator; it simply keeps every exponential in a representable range.

2.4 Implementation insight

In mhsa.cpp, Task 2 asks you to implement row-wise softmax:

  • Find row max → subtract.
  • Compute exp of shifted values.
  • Divide each by the row sum.

Why row-wise? Each query token’s attention weights must sum to 1 across all keys.
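
For reference only (this is a PyTorch sketch of the same three steps, not the C++ you are asked to write), a numerically stable row-wise softmax looks like:

```python
import torch

def rowwise_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Stable softmax over the last dimension of a [rows, cols] score matrix."""
    shifted = scores - scores.max(dim=-1, keepdim=True).values   # subtract the row max
    exps = shifted.exp()                                         # exp of shifted values
    return exps / exps.sum(dim=-1, keepdim=True)                 # divide by the row sum

scores = torch.randn(3, 5) * 50          # large logits that could overflow a naive exp
weights = rowwise_softmax(scores)
assert torch.allclose(weights.sum(dim=-1), torch.ones(3))        # each row sums to 1
```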


3) Attention and Its Intuition

3.1 What is attention?

Given a sequence of inputs (x₁, x₂, …, xₙ), attention learns which other positions matter when computing an output at position i.

Each token forms three learned projections:

  • Query (Q): what am I looking for?
  • Key (K): what do I have to offer?
  • Value (V): what information do I share?

3.2 Scaled dot-product attention

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Why divide by √dₖ? The dot product of two dₖ-dimensional vectors grows in magnitude with dₖ, so unscaled scores push softmax into saturated regions where gradients vanish. Dividing by √dₖ keeps the scores, and hence the gradients, well-behaved.

3.3 Shape trace

| Tensor | Shape |
| --- | --- |
| Q | [seq, dₖ] |
| K | [seq, dₖ] |
| V | [seq, dᵥ] |
| QKᵀ | [seq, seq] (attention scores) |
| softmax(QKᵀ) | [seq, seq] (weights per query) |
| output | [seq, dᵥ] |
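
The same trace in code (shapes only, random data, single sequence without a batch dimension for clarity):

```python
import torch

seq, d_k, d_v = 6, 16, 16
Q = torch.randn(seq, d_k)
K = torch.randn(seq, d_k)
V = torch.randn(seq, d_v)

scores = Q @ K.T / d_k ** 0.5              # [seq, seq] similarity scores
weights = torch.softmax(scores, dim=-1)    # [seq, seq], each row sums to 1
output = weights @ V                       # [seq, d_v] weighted mix of values

assert scores.shape == (seq, seq) and output.shape == (seq, d_v)
```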

3.4 Causal masking

For autoregressive models (like language generators):

  • Each token should only attend to itself and previous tokens.
  • The mask sets future positions to a very negative value (≈ −1e9) so their softmax weight ≈ 0.

Visual example for seq_len = 4:

[[0, -inf, -inf, -inf],
 [0,  0,   -inf, -inf],
 [0,  0,    0,   -inf],
 [0,  0,    0,    0 ]]
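
One common way to build such a mask in PyTorch (a sketch; not necessarily how models.py constructs it) uses torch.triu to mark the strictly upper-triangular "future" positions:

```python
import torch

seq_len = 4
# True strictly above the diagonal = positions in the future of each query token.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
mask = torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))

# Adding `mask` to the raw scores drives the future weights to ~0 after softmax.
print(mask)
```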

4) Multi-Head Self-Attention (MHSA)

4.1 Why multiple heads?

Different heads can learn different dependency patterns:

  • One head focuses on short-range relations.
  • Another tracks long-range or semantic relationships.

Each head runs its own attention operation with its own parameters.

4.2 Structure overview

For H heads:

  1. Split the embedding dimension: embed_dim = H × head_dim.
  2. Compute Q, K, V for each head.
  3. Apply attention(Q,K,V) separately.
  4. Concatenate all heads → [seq, embed_dim].
  5. Project through a final W_o.

4.3 C++ side

In mhsa.cpp, the loop over heads explicitly:

  • Slices each weight matrix (W_q, W_k, W_v, W_o).
  • Calls your matmul() and softmax() functions.
  • Applies causal mask if enabled.
  • Combines results per head into the final output.

This structure shows the mechanical composition of attention layers — useful for future accelerator mapping.

4.4 PyTorch side

In models.py → MultiHeadSelfAttention.forward():

  1. Project input x → q,k,v (linear layers).
  2. Reshape to (B, num_heads, T, head_dim).
  3. Compute attention weights via torch.matmul(q, k.transpose(-2,-1)) / sqrt(d_k).
  4. Apply mask.
  5. softmax across last dimension.
  6. Compute weighted sum with v.
  7. Reshape back to (B, T, embed_dim) and project through out_proj.
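
Putting those seven steps together, a minimal sketch of such a forward pass might look like the following (variable names and the standalone-function form are illustrative assumptions, not the exact code in models.py):

```python
import math
import torch

def mhsa_forward(x, q_proj, k_proj, v_proj, out_proj, num_heads, mask=None):
    """Sketch of multi-head self-attention; x has shape (B, T, embed_dim)."""
    B, T, E = x.shape
    head_dim = E // num_heads

    # Steps 1-2: project, then reshape to (B, num_heads, T, head_dim).
    q = q_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
    k = k_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
    v = v_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)

    # Steps 3-5: scaled scores, optional causal mask, softmax over the key dimension.
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    if mask is not None:
        scores = scores + mask                 # mask holds 0 / -inf entries
    weights = torch.softmax(scores, dim=-1)

    # Steps 6-7: weighted sum of values, merge heads, final projection.
    out = (weights @ v).transpose(1, 2).reshape(B, T, E)
    return out_proj(out)

q_proj, k_proj, v_proj, out_proj = (torch.nn.Linear(32, 32) for _ in range(4))
y = mhsa_forward(torch.randn(2, 6, 32), q_proj, k_proj, v_proj, out_proj, num_heads=4)
assert y.shape == (2, 6, 32)
```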

5) Transformer Decoder-Only Architecture

5.1 Structure

A decoder block combines:

  1. Masked self-attention (causal).
  2. Feed-forward network (FFN): 2 linear layers + activation.
  3. Residual connections around both sub-modules.
  4. Layer normalization for stability.

The forward path:

x → LN → SelfAttention + Residual → LN → FFN + Residual
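
A pre-norm block with exactly this data flow can be sketched as follows (this sketch uses PyTorch's built-in nn.MultiheadAttention for brevity; the lab's module and argument names may differ):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: LN -> self-attention -> residual, LN -> FFN -> residual."""
    def __init__(self, embed_dim, num_heads, ffn_dim):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, embed_dim)
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask)[0]   # masked self-attention
        x = x + self.ffn(self.ln2(x))                        # feed-forward network
        return x

block = DecoderBlock(embed_dim=32, num_heads=4, ffn_dim=64)
assert block(torch.randn(2, 6, 32)).shape == (2, 6, 32)
```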

5.2 Positional encoding

Transformers have no built-in notion of sequence order, so we add positional encodings:

  • Fixed sinusoidal patterns:

    • Even dims: sin(position / 10000^(2i/dim))
    • Odd dims: cos(position / 10000^(2i/dim))
  • Added directly to embeddings.
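
A compact way to build these encodings (an illustrative sketch; the helper name is made up and the lab's implementation may differ) is:

```python
import torch

def sinusoidal_encoding(max_len: int, dim: int) -> torch.Tensor:
    """Return a [max_len, dim] table of fixed sinusoidal positional encodings (dim even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # [max_len, 1]
    i = torch.arange(0, dim, 2, dtype=torch.float32)                  # even dimension indices
    angle = pos / torch.pow(10000.0, i / dim)                         # [max_len, dim/2]
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(angle)    # even dimensions use sin
    pe[:, 1::2] = torch.cos(angle)    # odd dimensions use cos
    return pe

# Usage: x = token_embeddings + sinusoidal_encoding(seq_len, embed_dim)
```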

5.3 Decoder-only workflow

During training:

  1. Input: [x₁, x₂, …, xₙ, <start>, reversed_targets, <end>].
  2. All timesteps processed in parallel.
  3. The causal mask ensures token t only attends to positions ≤ t.

At inference:

  • Predict one token → append → repeat (autoregressive generation).
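
The loop itself is only a few lines; here is a hedged sketch with greedy decoding (function and variable names are illustrative, and the real script may sample or stop on <end> instead):

```python
import torch

def generate(model, prompt_ids, num_steps):
    """Greedy autoregressive decoding sketch: predict, append, repeat."""
    tokens = prompt_ids.clone()                      # shape (1, T)
    for _ in range(num_steps):
        logits = model(tokens)                       # assumed shape (1, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)    # most likely next token
        tokens = torch.cat([tokens, next_id.unsqueeze(1)], dim=1)
    return tokens
```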

5.4 Why Transformers beat CNNs here

  • Parallelism: No sequential dependency like RNNs.
  • Global context: Each token can consider all previous tokens, regardless of distance.
  • Parameter efficiency: Same weights reused via attention instead of large kernels.

6) Sanity Checks and Debugging Guide

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Output all zeros | Forgetting bias init in matmul | Initialize C = bias or 0 |
| NaNs in softmax | Forgot to subtract max or used wrong dimension | Subtract row max before exp |
| Loss not decreasing | Gradient blocked by softmax misuse | Don't apply softmax before CrossEntropyLoss |
| C++ vs PyTorch mismatch | Row/col indexing swapped | Verify indexing order and shape |
| Model diverges | LR too high | Reduce LR or use Adam |

7) Conceptual FAQs

Q1. Why does softmax appear again in attention? It converts raw similarity scores into a distribution of attention weights.

Q2. Why is causal masking essential? Without it, a language model could peek at future words — invalidating autoregressive training.

Q3. Why use √dₖ scaling? It prevents large dot-products from pushing softmax into saturation (where gradients vanish).

Q4. Why are transformers better for long sequences? Because every token can attend to any other directly — no distance decay like in CNN or RNN windows.

Q5. What makes the C++ version educational? It exposes the exact loop structure behind high-level PyTorch ops — vital for hardware design intuition.


8) Verification Steps

  1. Matrix multiply test: Compare C++ vs PyTorch (--task 1).
  2. Softmax test: Row sums ≈ 1 (--task 2).
  3. Multi-head attention test: Compare C++ vs PyTorch MHSA (--task 3).
  4. Model training test: Compare CNN vs Transformer (--task 4).
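
Two small idioms cover most of these checks (the tolerances are suggestions, not lab requirements):

```python
import torch

# Tasks 1 and 3: two implementations should agree up to floating-point tolerance.
def outputs_match(a: torch.Tensor, b: torch.Tensor) -> bool:
    return torch.allclose(a, b, atol=1e-5, rtol=1e-4)

# Task 2: every row of the softmax output should sum to (approximately) 1.
def rows_sum_to_one(weights: torch.Tensor) -> bool:
    return torch.allclose(weights.sum(dim=-1), torch.ones(weights.shape[:-1]))
```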

Expected trends:

  • CNN performs well for short sequences but fails for long ones.
  • Transformer learns reversal faster and generalizes better even with fewer parameters.

9) Closing Remarks

This lab bridges the mathematical, implementation, and architectural views of deep learning. By writing low-level kernels in C++ and connecting them to PyTorch’s high-level abstractions, you’ll develop the essential skill of reasoning across all layers — from tensor algebra → accelerator design → full model behavior.

The transformer isn’t just a neural network — it’s a composition of the same linear algebra principles from Part 1, scaled up and optimized for parallel hardware. Once you grasp that, the rest of deep learning becomes a matter of shape management and memory control.
