Lab 1 Companion - Full Technical Specification with Visuals
Purpose. This document is a mechanical + theoretical spec for Lab 1. It is intentionally dense: every component includes what, why, how the shapes move, and how to verify your implementation. It avoids spoilers (no results), but removes ambiguity so students don’t get blocked.
0) Non‑Spoiler Policy
- No numeric results, no “which model is better,” no hyperparameter answers.
- You are given procedures, mechanics, and validation steps. Conclusions must come from your own runs.
1) Problem, Data, and Tensor Conventions
Goal. Recognize digits 0–9 from 28×28 grayscale images.
1.1 Image format
- Each image is a 2D grid of intensities in [0, 1].
- Grayscale ⇒ one channel. Shape of one image: 1 × 28 × 28.
1.2 Tensor layout (channels‑first)
- Single image (unbatched): C × H × W = 1 × 28 × 28
- Batch of images: N × C × H × W, where N is the batch size, C the channels, and H, W the height/width.
PyTorch defaults to channels‑first. If you ever see channels‑last (N × H × W × C), you must permute to match the model’s expectation.
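A quick check of that conversion, as a sketch; x_hwc is a hypothetical channels‑last tensor, not something produced by the lab's data pipeline:

import torch

x_hwc = torch.rand(32, 28, 28, 1)    # hypothetical channels-last batch (N, H, W, C)
x_chw = x_hwc.permute(0, 3, 1, 2)    # reorder dimensions to (N, C, H, W); values are untouched
print(x_chw.shape)                   # torch.Size([32, 1, 28, 28])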
1.3 Labels and logits
- Labels: integers 0..9 (not one‑hot).
- Logits: 10 real‑valued scores, one per class, before softmax.
Larger logits for a class ⇒ model is more confident in that class. Softmax turns logits into probabilities.
2) Training vs. Testing; Accuracy vs. Loss
- Training set: updates parameters using gradients from the loss.
- Test set: never used for updating parameters; only for generalization.
- Loss (cross‑entropy): optimization target; differentiable, produces gradients.
- Accuracy: evaluation metric computed via argmax(logits); not used to update weights.
Track both. Loss tells you if learning is happening; accuracy tells you if predictions are correct.
3) Epochs, Batches, and the DataLoader
- Epoch: one full pass over the training set.
- Mini‑batch: subset of samples processed before each update; shapes: x: N × 1 × 28 × 28, y: N.
- DataLoader: batches, shuffles, and (optionally) parallelizes loading (num_workers).
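For concreteness, a minimal sketch of building the loaders, assuming torchvision's MNIST dataset and a local data/ directory; the batch sizes are placeholders, and your lab template may already provide equivalent code:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()   # PIL image -> float32 tensor in [0, 1], shape 1 × 28 × 28
train_dataset = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
test_dataset = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)    # reshuffle every epoch
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)    # order does not matter for evaluation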
Sanity check: print the first batch’s shapes before training starts.
print(next(iter(train_loader))[0].shape) # expect: (N, 1, 28, 28)
4) Autograd, Optimizer, and the Training Step
- Autograd: records ops to compute gradients via backprop.
- Optimizer (Adam/SGD): applies gradient‑based updates using the learning rate, momentum, etc.
Per‑batch training step
model.train()
logits = model(x)
loss = criterion(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Evaluation step
model.eval() + with torch.no_grad(): (no gradient tracking)
Forgetting zero_grad() causes gradients to accumulate across steps.
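Putting the pieces together, a sketch of one training epoch; model, criterion, optimizer, and train_loader are assumed to exist already, and the names are illustrative:

model.train()
for x, y in train_loader:
    logits = model(x)               # forward pass: (N, 1, 28, 28) -> (N, 10)
    loss = criterion(logits, y)     # cross-entropy on raw logits and integer labels
    optimizer.zero_grad()           # clear gradients left over from the previous step
    loss.backward()                 # backprop: fill p.grad for every parameter
    optimizer.step()                # apply the gradient-based update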
5) Layers and Their Mechanics (with Visuals)
5.1 nn.Flatten() - from grid to vector
What it does (shape). N × C × H × W → N × (C·H·W)
Why it’s needed. nn.Linear expects a vector per example. Images are 2D grids (with channels). Flatten rearranges the same numbers into a single vector so a fully connected layer can process them.
Data‑value invariant. Flatten does not change values, only the layout.
Concrete toy example (C=1, H=W=2, N=2)
Before flatten (shape N × C × H × W = 2 × 1 × 2 × 2):
[
[ # sample 0, shape 1×2×2
[ [1, 2],
[3, 4] ]
],
[ # sample 1, shape 1×2×2
[ [5, 6],
[7, 8] ]
]
]
After flatten (shape N × (C·H·W) = 2 × 4):
[
[1, 2, 3, 4],
[5, 6, 7, 8]
]
For MNIST: 1×28×28 → 784 features per sample.
Ordering note. PyTorch flattens in row‑major order with channels‑first layout: iterate channels, then rows, then columns.
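You can reproduce the toy example and the ordering note directly; a short check using only torch:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.], [3., 4.]]],    # sample 0, shape 1 × 2 × 2
                  [[[5., 6.], [7., 8.]]]])   # sample 1, shape 1 × 2 × 2
flat = nn.Flatten()(x)                       # default start_dim=1 keeps the batch dimension
print(flat.shape)    # torch.Size([2, 4])
print(flat)          # tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]])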
5.2 nn.Linear(d_in, d_out) - affine map
Formula. y = x Wᵀ + b
- x: shape N × d_in
- W: shape d_out × d_in
- b: shape d_out
- Output: N × d_out
Parameter count. (d_in + 1)·d_out (weights + biases)
Interpretation. Each output unit is a learned weighted sum of all input features (plus bias). A stack of Linear layers without activation is still a single affine map overall (see §7.2).
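A quick check of the shapes and the parameter‑count formula on a standalone layer; the sizes are arbitrary and deliberately do not match any lab task:

import torch.nn as nn

layer = nn.Linear(20, 5)
print(layer.weight.shape, layer.bias.shape)    # torch.Size([5, 20]) torch.Size([5])
n_params = sum(p.numel() for p in layer.parameters())
print(n_params, (20 + 1) * 5)                  # both print 105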
5.3 Activation functions - why linear stacks are not enough
Without a nonlinearity, any stack of Linear layers collapses to one Linear layer (proof in §7.2). Activations (ReLU/Tanh/Sigmoid) insert nonlinear steps so the network can learn curved decision boundaries.
- ReLU: max(0, x); piecewise linear, fast, robust.
- Tanh/Sigmoid: saturating; useful but less common for hidden layers here.
Geometric intuition. Linear ⇒ hyperplanes only. With ReLU, you can compose many piecewise linear regions ⇒ arbitrarily complex shapes.
5.4 nn.Conv2d - local, shared detectors
Purpose. Learn small filters (e.g., 3×3) that detect edges/strokes anywhere.
Output shape. For input N × Cin × H × W and conv params (Cout, K, stride=s, padding=p):
- H' = ⌊(H + 2p − K)/s⌋ + 1
- W' = ⌊(W + 2p − K)/s⌋ + 1
- Output: N × Cout × H' × W'
Parameter count. (Cin·K·K + 1)·Cout
Why this helps for images.
- Local connectivity: each output depends on a small neighborhood.
- Weight sharing: one filter scans all positions ⇒ far fewer params than a fully connected layer on raw pixels.
Visual: 1 filter on a toy 3×3 input (no padding, stride 1)
input(1×3×3): kernel(1×2×2): conv map(1×2×2):
[ [a b c], [ [k1 k2], [ [a·k1+b·k2+d·k3+e·k4, b·k1+c·k2+e·k3+f·k4],
[d e f], * [k3 k4] ] = [ d·k1+e·k2+g·k3+h·k4, e·k1+f·k2+h·k3+i·k4] ]
[g h i] ]
(Plus bias per output channel.)
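Both the shape and parameter‑count formulas can be verified on a standalone layer; the channel count and kernel size below are arbitrary examples, not a blueprint for Task 4:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=0)
out = conv(torch.rand(4, 1, 28, 28))
print(out.shape)    # torch.Size([4, 8, 26, 26]); H' = (28 + 0 - 3)/1 + 1 = 26
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, (1 * 3 * 3 + 1) * 8)    # both print 80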
5.5 nn.MaxPool2d(2) - downsample with a max
- Takes each non‑overlapping 2×2 block and outputs its maximum.
- Halves H and W (for stride 2), keeping the strongest activations and reducing compute.
Toy visual
input (1×4×4) → MaxPool2d(2) → output (1×2×2)
[ [1, 9, 2, 3],        [ [9, 3],
  [0, 4, 1, 1],    →     [7, 9] ]
  [7, 6, 2, 8],
  [5, 3, 9, 0] ]
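The toy result can be reproduced directly with the same 4×4 input (leading dimensions added so the tensor is N × C × H × W):

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 9., 2., 3.],
                    [0., 4., 1., 1.],
                    [7., 6., 2., 8.],
                    [5., 3., 9., 0.]]]])    # shape 1 × 1 × 4 × 4
out = nn.MaxPool2d(2)(x)
print(out.shape)    # torch.Size([1, 1, 2, 2])
print(out)          # tensor([[[[9., 3.], [7., 9.]]]])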
6) Logits, Softmax, and Cross‑Entropy (why no manual softmax)
- Softmax: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
- CrossEntropyLoss: implements log_softmax + NLL in one numerically stable op. Feed logits directly to CrossEntropyLoss.
Why. Doing your own softmax then NLLLoss is redundant and can be less stable due to floating‑point range.
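A minimal sketch of the intended usage: raw logits and integer labels go straight into the loss (the values here are arbitrary):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)             # raw model scores; no softmax applied
labels = torch.tensor([3, 0, 7, 1])     # integer class IDs, not one-hot
loss = criterion(logits, labels)        # log_softmax + NLL handled internally
print(loss.item())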
7) Task Specifications - Contracts without Blueprints
To avoid spoilers, this section states interface contracts and constraints only. It deliberately omits exact layer sequences. Your implementation must satisfy the contracts and pass the checks, but how you assemble compliant modules is up to you.
7.1 Task 1 - LinearModel (single-step classifier)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Must include a stage that converts image grids into feature vectors (no spatial dims remain when producing logits).
- No convolution allowed.
Checks
- Final tensor rank: 2 (batch, class)
- Parameter count matches a single affine mapping from the flattened input to 10 outputs (you should derive and verify).
7.2 Task 2 - TwoLinearModel (no nonlinearity)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Contains exactly two affine transformations with no activation in between.
- Intermediate feature width is a free hyperparameter H that you choose.
Checks
- Intermediate activation is strictly forbidden (no ReLU/Tanh/Sigmoid/… between the two affine maps).
- Parameter count matches two affine maps with hidden width H (derive yourself; verify at runtime).
- Overall mapping remains affine in the input (theory note: composition of affine maps is affine).
7.3 Task 3 - NeuralNetworkModel (introduce nonlinearity)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Contains at least one nonlinear activation between two affine transformations.
- Hidden width H is your choice; the nonlinearity choice is yours (ReLU recommended, but not mandated here).
Checks
- Ensure the nonlinearity is applied between affine stages (not after the final logits feeding CrossEntropy).
- Parameter count equals two affine maps with hidden width H (nonlinearities carry no trainable params).
7.4 Task 4 - ConvNetworkModel (use spatial bias)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Must include at least one convolutional stage that preserves or deliberately transforms spatial structure before collapsing to class logits.
- Include at least one spatial downsampling step (e.g., pooling or strided conv) so the final classifier head operates on a reduced spatial representation.
- You choose the kernel size K, number of output channels C_out, padding/stride policy, and where to collapse spatial dims.
Checks
- After your spatial downsampling, the flattened feature size times the classifier width yields a parameter count consistent with your chosen C_out, spatial dims, and classes.
- Verify output shapes at each checkpoint (print shapes); confirm the final tensor is N × 10.
General validation for all tasks
- Do not apply softmax in the model body when training with CrossEntropyLoss.
- Use the same preprocessing and dataloaders across tasks when performing comparisons.
- Print sum(p.numel() for p in model.parameters()) and reconcile with your own derivations.
8) DTypes, Devices, and Reproducibility
- DTypes: prefer float32 for speed; float64 increases precision but is slower.
- Devices: CPU is fine; if using a GPU, ensure both data and model are on the same device.
- Reproducibility: torch.manual_seed(SEED)
- Fix DataLoader worker seeds if using num_workers > 0.
- Disable nondeterministic algorithms if strict reproducibility is required.
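One common way to wire these together, as a sketch; the seed value is a placeholder, and seed_worker is a hypothetical helper following the worker-seeding pattern recommended in the PyTorch documentation:

import random
import numpy as np
import torch

SEED = 0    # placeholder; any fixed value works
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

def seed_worker(worker_id):
    # derive a reproducible seed for each DataLoader worker process
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(SEED)
# pass worker_init_fn=seed_worker and generator=g to DataLoader(..., num_workers=...)
# torch.use_deterministic_algorithms(True) enforces strict (but slower) determinism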
9) Initialization and Optimization
- PyTorch defaults (Kaiming/Glorot) are adequate for these shallow networks.
- Optimizer: Adam is a good starting point.
- LR: too high → divergence; too low → slow/no learning.
- Weight decay: optional L2 regularization for generalization.
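For reference, constructing the optimizer is a one‑liner; model is assumed to exist, and the values below are placeholders for you to choose and justify, not recommendations:

import torch

lr = 1e-3           # placeholder learning rate; tune and justify your own value
weight_decay = 0.0  # placeholder; > 0 enables L2 regularization
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)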
10) Metrics and Evaluation Protocol
Track for train and test:
- Loss per epoch (aggregate over all batches)
- Accuracy per epoch
- (Optional) Confusion matrix for error analysis
Pseudocode (test pass)
model.eval()
correct = 0
with torch.no_grad():
    for x, y in test_loader:
        logits = model(x)
        pred = logits.argmax(dim=1)
        correct += (pred == y).sum().item()
acc = correct / len(test_dataset)
11) Parameter Counting - formulas and checks
- Linear(d_in → d_out): (d_in + 1)·d_out
- Conv2d(C_in → C_out, K): (C_in·K·K + 1)·C_out
- Flatten (MNIST): 1·28·28 = 784 (no trainable parameters; this is the resulting feature count)
Verify in code
sum(p.numel() for p in model.parameters())
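A small hypothetical helper, report_params, breaks the total down per tensor so any mismatch with your hand derivation is easy to locate:

import torch.nn as nn

def report_params(model: nn.Module) -> int:
    # print each parameter tensor's name, shape, and element count, then the total
    total = 0
    for name, p in model.named_parameters():
        print(f"{name:30s} {str(tuple(p.shape)):20s} {p.numel()}")
        total += p.numel()
    print("total:", total)
    return total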
12) Debugging Playbook
- Shape errors: print .shape at each stage (input → flatten → layer outputs).
- Softmax misuse: don’t apply softmax before CrossEntropyLoss.
- Wrong targets: pass integer class IDs, not one‑hot.
- Learning stalls: lower the LR if the loss diverges, or raise it if learning is too slow. Ensure optimizer.zero_grad() is called before backward().
- Device/dtype mismatch: keep model and data aligned.
13) Fair Experimentation Protocol (no spoilers)
- Fix a baseline (LR, batch size, epochs, seed, dtype).
- Change one variable at a time (e.g., hidden_dim).
- Log configuration + outcomes per run.
- Compare curves and final metrics fairly.
- Explain why differences occur (capacity, nonlinearity, spatial bias), not just which wins.
14) FAQ (Targeted Clarifications)
Q1. Why must we Flatten before Linear? Linear expects N × d_in where d_in is a single number. Images are N × C × H × W. Flatten computes d_in = C·H·W and reshapes the data accordingly without changing values.
Q2. Where does softmax fit? Nowhere in the model for training - CrossEntropyLoss handles it internally (via log_softmax + NLL).
Q3. Why is a two‑Linear layer model still linear? Because composing affine maps yields another affine map: B(Ax + b) + c = (BA)x + (Bb + c). Nonlinearity is required to exceed linear expressivity.
Q4. What does pooling really buy us? Fewer spatial locations (compute savings), some translation tolerance, and a regularized representation that highlights the strongest local responses.
Q5. Does Flatten lose spatial information? Yes - once flattened, the model no longer knows which pixel was near which. CNNs intentionally delay flattening to preserve spatial structure longer.
15) Shape Cheat‑Sheets (non‑spoiler edition)
These summarize required interfaces, not blueprints. They tell you what must go in and what must come out - not how to wire every layer.
Task 1 (LinearModel)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: single affine mapping from flattened inputs to classes.
Task 2 (TwoLinearModel, no activation)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: exactly two affine maps with a linear (identity) connection between them; no nonlinearity.
Task 3 (NeuralNetworkModel, with nonlinearity)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: at least one nonlinearity strictly between two affine maps.
Task 4 (ConvNetworkModel)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: contains convolution; includes at least one spatial reduction before collapsing to class logits.
16) Verification & Sanity Tests
- Module summary: print the model to inspect layer order and parameter counts.
- Tiny overfit: train on ~128 samples for a few epochs; training loss should drop clearly (see the sketch after this list).
- Type checks: confirm dtype and device consistency throughout.
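A sketch of the tiny‑overfit check, assuming train_dataset already exists; torch.utils.data.Subset restricts training to a fixed handful of samples:

from torch.utils.data import DataLoader, Subset

tiny_dataset = Subset(train_dataset, range(128))                     # first ~128 training samples
tiny_loader = DataLoader(tiny_dataset, batch_size=32, shuffle=True)
# run your usual training loop on tiny_loader for a few epochs;
# if the training loss does not drop clearly, check shapes, zero_grad(), and the loss inputs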
17) What You Must Choose (and Justify)
- hidden_dim for MLPs
- dtype (float32 vs float64)
- optimizer hyperparameters (LR, epochs, batch size)
- evaluation cadence (per‑epoch vs per‑k‑steps)
- what you log and how you compare runs
This companion gives you the mechanics and invariant checks. Your conclusions should come from your experiments.
