Lab 1 Companion - Full Technical Specification with Visuals
Purpose. This document is a mechanical + theoretical spec for Lab 1. It is intentionally dense: every component includes what, why, how the shapes move, and how to verify your implementation. It avoids spoilers (no results), but removes ambiguity so students don’t get blocked.
0) Non‑Spoiler Policy
- No numeric results, no “which model is better,” no hyperparameter answers.
- You are given procedures, mechanics, and validation steps. Conclusions must come from your own runs.
1) Problem, Data, and Tensor Conventions
Goal. Recognize digits 0–9 from 28×28 grayscale images.
1.1 Image format
- Each image is a 2D grid of intensities in [0, 1].
- Grayscale ⇒ one channel. Shape of one image: 1 × 28 × 28.
1.2 Tensor layout (channels‑first)
- Single image (unbatched): C × H × W = 1 × 28 × 28
- Batch of images: N × C × H × W, where N is the batch size, C the channels, and H, W the height/width.
PyTorch defaults to channels‑first. If you ever see channels‑last (N × H × W × C), you must permute to match the model’s expectation.
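A quick check of that conversion, as a sketch; x_hwc is a hypothetical channels‑last tensor, not something produced by the lab's data pipeline:

import torch

x_hwc = torch.rand(32, 28, 28, 1)    # hypothetical channels-last batch (N, H, W, C)
x_chw = x_hwc.permute(0, 3, 1, 2)    # reorder dimensions to (N, C, H, W); values are untouched
print(x_chw.shape)                   # torch.Size([32, 1, 28, 28])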
1.3 Labels and logits
- Labels: integers 0..9 (not one‑hot).
- Logits: 10 real‑valued scores, one per class, before softmax.
Larger logits for a class ⇒ model is more confident in that class. Softmax turns logits into probabilities.
2) Training vs. Testing; Accuracy vs. Loss
- Training set: updates parameters using gradients from the loss.
- Test set: never used for updating parameters; only for generalization.
- Loss (cross‑entropy): optimization target; differentiable, produces gradients.
- Accuracy: evaluation metric computed via argmax(logits); not used to update weights.
Track both. Loss tells you if learning is happening; accuracy tells you if predictions are correct.
3) Epochs, Batches, and the DataLoader
- Epoch: one full pass over the training set.
- Mini‑batch: subset of samples processed before each update; shapes: x: N × 1 × 28 × 28, y: N.
- DataLoader: batches, shuffles, and (optionally) parallelizes loading (num_workers).
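For concreteness, a minimal sketch of building the loaders, assuming torchvision's MNIST dataset and a local data/ directory; the batch sizes are placeholders, and your lab template may already provide equivalent code:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()   # PIL image -> float32 tensor in [0, 1], shape 1 × 28 × 28
train_dataset = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
test_dataset = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)    # reshuffle every epoch
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)    # order does not matter for evaluation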
Sanity check: print the first batch’s shapes before training starts.
print(next(iter(train_loader))[0].shape) # expect: (N, 1, 28, 28)
4) Autograd, Optimizer, and the Training Step
- Autograd: records ops to compute gradients via backprop.
- Optimizer (Adam/SGD): applies gradient‑based updates using the learning rate, momentum, etc.
Per‑batch training step
model.train()
logits = model(x)
loss = criterion(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Evaluation step
model.eval() + with torch.no_grad(): (no gradient tracking)
Forgetting zero_grad() causes gradients to accumulate across steps.
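Putting the pieces together, a sketch of one training epoch; model, criterion, optimizer, and train_loader are assumed to exist already, and the names are illustrative:

model.train()
for x, y in train_loader:
    logits = model(x)               # forward pass: (N, 1, 28, 28) -> (N, 10)
    loss = criterion(logits, y)     # cross-entropy on raw logits and integer labels
    optimizer.zero_grad()           # clear gradients left over from the previous step
    loss.backward()                 # backprop: fill p.grad for every parameter
    optimizer.step()                # apply the gradient-based update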
5) Layers and Their Mechanics (with Visuals)
5.1 nn.Flatten() - from grid to vector
What it does (shape). N × C × H × W → N × (C·H·W)
Why it’s needed. nn.Linear expects a vector per example. Images are 2D grids (with channels). Flatten rearranges the same numbers into a single vector so a fully connected layer can process them.
Data‑value invariant. Flatten does not change values, only the layout.
Concrete toy example (C=1, H=W=2, N=2)
Before flatten (shape N × C × H × W = 2 × 1 × 2 × 2):
[
[ # sample 0, shape 1×2×2
[ [1, 2],
[3, 4] ]
],
[ # sample 1, shape 1×2×2
[ [5, 6],
[7, 8] ]
]
]
After flatten (shape N × (C·H·W) = 2 × 4):
[
[1, 2, 3, 4],
[5, 6, 7, 8]
]
For MNIST: 1×28×28 → 784 features per sample.
Ordering note. PyTorch flattens in row‑major order with channels‑first layout: iterate channels, then rows, then columns.
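You can reproduce the toy example and the ordering note directly; a short check using only torch:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.], [3., 4.]]],    # sample 0, shape 1 × 2 × 2
                  [[[5., 6.], [7., 8.]]]])   # sample 1, shape 1 × 2 × 2
flat = nn.Flatten()(x)                       # default start_dim=1 keeps the batch dimension
print(flat.shape)    # torch.Size([2, 4])
print(flat)          # tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]])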
5.2 nn.Linear(d_in, d_out) - affine map
Formula. y = x Wᵀ + b
- x: shape N × d_in
- W: shape d_out × d_in
- b: shape d_out
- Output: N × d_out
Parameter count. (d_in + 1)·d_out (weights + biases)
Interpretation. Each output unit is a learned weighted sum of all input features (plus bias). A stack of Linear layers without activation is still a single affine map overall (see §7.2).
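A quick check of the shapes and the parameter‑count formula on a standalone layer; the sizes are arbitrary and deliberately do not match any lab task:

import torch.nn as nn

layer = nn.Linear(20, 5)
print(layer.weight.shape, layer.bias.shape)    # torch.Size([5, 20]) torch.Size([5])
n_params = sum(p.numel() for p in layer.parameters())
print(n_params, (20 + 1) * 5)                  # both print 105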
5.3 Activation functions - why linear stacks are not enough
Without a nonlinearity, any stack of Linear layers collapses to one Linear layer (proof in §7.2). Activations (ReLU/Tanh/Sigmoid) insert nonlinear steps so the network can learn curved decision boundaries.
- ReLU: max(0, x); piecewise linear, fast, robust.
- Tanh/Sigmoid: saturating; useful but less common for hidden layers here.
Geometric intuition. Linear ⇒ hyperplanes only. With ReLU, you can compose many piecewise linear regions ⇒ arbitrarily complex shapes.
5.4 nn.Conv2d - local, shared detectors
Purpose. Learn small filters (e.g., 3×3) that detect edges/strokes anywhere.
Output shape. For input N × Cin × H × W and conv params (Cout, K, stride=s, padding=p):
- H' = ⌊(H + 2p − K)/s⌋ + 1
- W' = ⌊(W + 2p − K)/s⌋ + 1
- Output: N × Cout × H' × W'
Parameter count. (Cin·K·K + 1)·Cout
Why this helps for images.
- Local connectivity: each output depends on a small neighborhood.
- Weight sharing: one filter scans all positions ⇒ far fewer params than a fully connected layer on raw pixels.
Visual: 1 filter on a toy 3×3 input (no padding, stride 1)
input(1×3×3): kernel(1×2×2): conv map(1×2×2):
[ [a b c], [ [k1 k2], [ [a·k1+b·k2+d·k3+e·k4, b·k1+c·k2+e·k3+f·k4],
[d e f], * [k3 k4] ] = [ d·k1+e·k2+g·k3+h·k4, e·k1+f·k2+h·k3+i·k4] ]
[g h i] ]
(Plus bias per output channel.)
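Both the shape and parameter‑count formulas can be verified on a standalone layer; the channel count and kernel size below are arbitrary examples, not a blueprint for Task 4:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=0)
out = conv(torch.rand(4, 1, 28, 28))
print(out.shape)    # torch.Size([4, 8, 26, 26]); H' = (28 + 0 - 3)/1 + 1 = 26
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, (1 * 3 * 3 + 1) * 8)    # both print 80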
5.5 nn.MaxPool2d(2) - downsample with a max
- Takes each non‑overlapping 2×2 block and outputs its maximum.
- Halves H and W (for stride 2), keeping the strongest activations and reducing compute.
Toy visual
input (1×4×4) → MaxPool2d(2) → output (1×2×2)
[ [1, 9, 2, 3],        [ [9, 3],
  [0, 4, 1, 1],    →     [7, 9] ]
  [7, 6, 2, 8],
  [5, 3, 9, 0] ]
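The toy result can be reproduced directly with the same 4×4 input (leading dimensions added so the tensor is N × C × H × W):

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 9., 2., 3.],
                    [0., 4., 1., 1.],
                    [7., 6., 2., 8.],
                    [5., 3., 9., 0.]]]])    # shape 1 × 1 × 4 × 4
out = nn.MaxPool2d(2)(x)
print(out.shape)    # torch.Size([1, 1, 2, 2])
print(out)          # tensor([[[[9., 3.], [7., 9.]]]])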
6) Logits, Softmax, and Cross‑Entropy (why no manual softmax)
- Softmax: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
- CrossEntropyLoss: implements log_softmax + NLL in one numerically stable op. Feed logits directly to CrossEntropyLoss.
Why. Doing your own softmax then NLLLoss is redundant and can be less stable due to floating‑point range.
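A minimal sketch of the intended usage: raw logits and integer labels go straight into the loss (the values here are arbitrary):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)             # raw model scores; no softmax applied
labels = torch.tensor([3, 0, 7, 1])     # integer class IDs, not one-hot
loss = criterion(logits, labels)        # log_softmax + NLL handled internally
print(loss.item())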
7) Task Specifications - Contracts without Blueprints
To avoid spoilers, this section states interface contracts and constraints only. It deliberately omits exact layer sequences. Your implementation must satisfy the contracts and pass the checks, but how you assemble compliant modules is up to you.
7.1 Task 1 - LinearModel (single-step classifier)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Must include a stage that converts image grids into feature vectors (no spatial dims remain when producing logits).
- No convolution allowed.
Checks
- Final tensor rank: 2 (batch, class)
- Parameter count matches a single affine mapping from the flattened input to 10 outputs (you should derive and verify).
7.2 Task 2 - TwoLinearModel (no nonlinearity)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Contains exactly two affine transformations with no activation in between.
- Intermediate feature width is a free hyperparameter H that you choose.
Checks
- Intermediate activation is strictly forbidden (no ReLU/Tanh/Sigmoid/… between the two affine maps).
- Parameter count matches two affine maps with hidden width H (derive yourself; verify at runtime).
- Overall mapping remains affine in the input (theory note: composition of affine maps is affine).
7.3 Task 3 - NeuralNetworkModel (introduce nonlinearity)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Contains at least one nonlinear activation between two affine transformations.
- Hidden width H is your choice; the nonlinearity choice is yours (ReLU recommended, but not mandated here).
Checks
- Ensure the nonlinearity is applied between affine stages (not after the final logits feeding CrossEntropy).
- Parameter count equals two affine maps with hidden width H (nonlinearities carry no trainable params).
7.4 Task 4 - ConvNetworkModel (use spatial bias)
Contract
- Accepts: N × 1 × 28 × 28
- Produces: N × 10 logits
- Must include at least one convolutional stage that preserves or deliberately transforms spatial structure before collapsing to class logits.
- Include at least one spatial downsampling step (e.g., pooling or strided conv) so the final classifier head operates on a reduced spatial representation.
- You choose the kernel size K, number of output channels C_out, padding/stride policy, and where to collapse spatial dims.
Checks
- After your spatial downsampling, the flattened feature size times the classifier width yields a parameter count consistent with your chosen C_out, spatial dims, and classes.
- Verify output shapes at each checkpoint (print shapes); confirm the final tensor is N × 10.
General validation for all tasks
- Do not apply softmax in the model body when training with CrossEntropyLoss.
- Use the same preprocessing and dataloaders across tasks when performing comparisons.
- Print sum(p.numel() for p in model.parameters()) and reconcile with your own derivations.
8) DTypes, Devices, and Reproducibility
- DTypes: prefer float32 for speed; float64 increases precision but is slower.
- Devices: CPU is fine; if using a GPU, ensure both data and model are on the same device.
- Reproducibility: torch.manual_seed(SEED)
- Fix DataLoader worker seeds if using num_workers > 0.
- Disable nondeterministic algorithms if strict reproducibility is required.
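One common way to wire these together, as a sketch; the seed value is a placeholder, and seed_worker is a hypothetical helper following the worker-seeding pattern recommended in the PyTorch documentation:

import random
import numpy as np
import torch

SEED = 0    # placeholder; any fixed value works
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

def seed_worker(worker_id):
    # derive a reproducible seed for each DataLoader worker process
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(SEED)
# pass worker_init_fn=seed_worker and generator=g to DataLoader(..., num_workers=...)
# torch.use_deterministic_algorithms(True) enforces strict (but slower) determinism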
9) Initialization and Optimization
- PyTorch defaults (Kaiming/Glorot) are adequate for these shallow networks.
- Optimizer: Adam is a good starting point.
- LR: too high → divergence; too low → slow/no learning.
- Weight decay: optional L2 regularization for generalization.
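For reference, constructing the optimizer is a one‑liner; model is assumed to exist, and the values below are placeholders for you to choose and justify, not recommendations:

import torch

lr = 1e-3           # placeholder learning rate; tune and justify your own value
weight_decay = 0.0  # placeholder; > 0 enables L2 regularization
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)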
10) Metrics and Evaluation Protocol
Track for train and test:
- Loss per epoch (aggregate over all batches)
- Accuracy per epoch
- (Optional) Confusion matrix for error analysis
Pseudocode (test pass)
model.eval()
correct = 0
with torch.no_grad():
    for x, y in test_loader:
        logits = model(x)
        pred = logits.argmax(dim=1)
        correct += (pred == y).sum().item()
acc = correct / len(test_dataset)
11) Parameter Counting - formulas and checks
- Linear(d_in → d_out): (d_in + 1)·d_out
- Conv2d(C_in → C_out, K): (C_in·K·K + 1)·C_out
- Flatten (MNIST): 1·28·28 = 784 (no trainable parameters; this is the resulting feature count)
Verify in code
sum(p.numel() for p in model.parameters())
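A small hypothetical helper, report_params, breaks the total down per tensor so any mismatch with your hand derivation is easy to locate:

import torch.nn as nn

def report_params(model: nn.Module) -> int:
    # print each parameter tensor's name, shape, and element count, then the total
    total = 0
    for name, p in model.named_parameters():
        print(f"{name:30s} {str(tuple(p.shape)):20s} {p.numel()}")
        total += p.numel()
    print("total:", total)
    return total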
12) Debugging Playbook
- Shape errors: print .shape at each stage (input → flatten → layer outputs).
- Softmax misuse: don’t apply softmax before CrossEntropyLoss.
- Wrong targets: pass integer class IDs, not one‑hot.
- Learning stalls: lower the LR if the loss diverges, or raise it if learning is too slow. Ensure optimizer.zero_grad() is called before backward().
- Device/dtype mismatch: keep model and data aligned.
13) Fair Experimentation Protocol (no spoilers)
- Fix a baseline (LR, batch size, epochs, seed, dtype).
- Change one variable at a time (e.g., hidden_dim).
- Log configuration + outcomes per run.
- Compare curves and final metrics fairly.
- Explain why differences occur (capacity, nonlinearity, spatial bias), not just which wins.
14) FAQ (Targeted Clarifications)
Q1. Why must we Flatten before Linear? Linear expects N × d_in where d_in is a single number. Images are N × C × H × W. Flatten computes d_in = C·H·W and reshapes the data accordingly without changing values.
Q2. Where does softmax fit? Nowhere in the model for training - CrossEntropyLoss handles it internally (via log_softmax + NLL).
Q3. Why is a two‑Linear layer model still linear? Because composing affine maps yields another affine map: B(Ax + b) + c = (BA)x + (Bb + c). Nonlinearity is required to exceed linear expressivity.
Q4. What does pooling really buy us? Fewer spatial locations (compute savings), some translation tolerance, and a regularized representation that highlights the strongest local responses.
Q5. Does Flatten lose spatial information? Yes - once flattened, the model no longer knows which pixel was near which. CNNs intentionally delay flattening to preserve spatial structure longer.
15) Shape Cheat‑Sheets (non‑spoiler edition)
These summarize required interfaces, not blueprints. They tell you what must go in and what must come out - not how to wire every layer.
Task 1 (LinearModel)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: single affine mapping from flattened inputs to classes.
Task 2 (TwoLinearModel, no activation)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: exactly two affine maps with a linear (identity) connection between them; no nonlinearity.
Task 3 (NeuralNetworkModel, with nonlinearity)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: at least one nonlinearity strictly between two affine maps.
Task 4 (ConvNetworkModel)
- Input: N × 1 × 28 × 28
- Output: N × 10
- Constraint: contains convolution; includes at least one spatial reduction before collapsing to class logits.
16) Verification & Sanity Tests
- Module summary: print the model to inspect layer order and parameter counts.
- Tiny overfit: train on ~128 samples for a few epochs; training loss should drop clearly (see the sketch after this list).
- Type checks: confirm dtype and device consistency throughout.
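A sketch of the tiny‑overfit check, assuming train_dataset already exists; torch.utils.data.Subset restricts training to a fixed handful of samples:

from torch.utils.data import DataLoader, Subset

tiny_dataset = Subset(train_dataset, range(128))                     # first ~128 training samples
tiny_loader = DataLoader(tiny_dataset, batch_size=32, shuffle=True)
# run your usual training loop on tiny_loader for a few epochs;
# if the training loss does not drop clearly, check shapes, zero_grad(), and the loss inputs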
17) What You Must Choose (and Justify)
- hidden_dim for MLPs
- dtype (float32 vs float64)
- optimizer hyperparameters (LR, epochs, batch size)
- evaluation cadence (per‑epoch vs per‑k‑steps)
- what you log and how you compare runs
This companion gives you the mechanics and invariant checks. Your conclusions should come from your experiments.
