Loss functions for neural networks
Before we begin
The forward pass and backprop need a scalar loss: one number that says how wrong the model was on the batch.
For MNIST (10 digit classes), the standard pair is:
Softmax outputs + cross-entropy loss
What you will learn
- Explain cross-entropy in plain language.
- Know why MSE is a weak default for classification.
- Read a simple training curve (loss down, accuracy up).
Cross-entropy (one example)
True digit: 3 (one-hot: index 3 = 1, others 0).
Model probabilities after softmax: p₀…p₉.
Loss = −log(p₃)
- If model is confident and correct (p₃ ≈ 1) → loss ≈ 0.
- If model assigns low probability to the true class → large loss.
Average loss over the batch → one number for backprop.
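The single-example rule and the batch average above can be sketched in a few lines of plain Python. The probability vectors are made up for illustration (they do not come from a trained model), and the helper names `cross_entropy_one` and `batch_loss` are hypothetical.

```python
import math

def cross_entropy_one(probs, true_class):
    """Cross-entropy for one example: -log of the probability on the true class."""
    return -math.log(probs[true_class])

def batch_loss(batch_probs, true_classes):
    """Mean cross-entropy over a batch: the single scalar handed to backprop."""
    losses = [cross_entropy_one(p, t) for p, t in zip(batch_probs, true_classes)]
    return sum(losses) / len(losses)

# Illustrative softmax outputs for three images whose true digit is 3.
confident_right = [0.01] * 10
confident_right[3] = 0.91        # most of the mass on the true class
unsure = [0.1] * 10              # uniform guess, so p3 = 0.1
confident_wrong = [0.001] * 10
confident_wrong[7] = 0.991       # almost all mass on the wrong digit 7

loss_right = cross_entropy_one(confident_right, 3)   # small
loss_unsure = cross_entropy_one(unsure, 3)           # equals log(10)
loss_wrong = cross_entropy_one(confident_wrong, 3)   # large
mean_loss = batch_loss([confident_right, unsure, confident_wrong], [3, 3, 3])

print(loss_right, loss_unsure, loss_wrong, mean_loss)
```

The three losses are ordered as the bullets describe: near zero when the model is confident and correct, and largest when it is confident and wrong.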
Softmax + cross-entropy together
Softmax converts 10 logits z₀…z₉ to probabilities summing to 1:
pᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
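Cross-entropy then pushes probability mass onto the correct class. Frameworks often fuse the two steps: PyTorch's CrossEntropyLoss, for example, takes raw logits and applies log-softmax internally for numerical stability, so the model should not apply its own softmax before that loss.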
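To see why the combined form is more stable, here is a minimal pure-Python sketch. It is not how any particular framework implements its loss; it only illustrates the usual log-sum-exp trick, and the function names and logit values are assumptions chosen for the example.

```python
import math

def softmax(logits):
    """Logits -> probabilities. Subtracting the max keeps exp() from overflowing."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_class):
    """-log(softmax(logits)[true_class]), computed as logsumexp(logits) - logit of the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

# Made-up logits for one image; the true digit is 3.
logits = [1.0, 0.5, -1.2, 4.0, 0.0, 0.3, -0.7, 2.1, 0.9, -2.0]
probs = softmax(logits)
loss = cross_entropy_from_logits(logits, 3)

# Extreme logits: a naive exp(1000.0) would overflow, and the true-class
# probability would round to 0, making log() fail. The fused form stays finite.
extreme = [1000.0] + [0.0] * 9
loss_extreme = cross_entropy_from_logits(extreme, 1)

print(sum(probs), loss, loss_extreme)
```

Working on logits directly means the loss never has to take the log of a probability that has already rounded to zero.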
Why not MSE on one-hot labels?
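Mean squared error can work, but it often trains more slowly and fits probabilistic classification less well. With softmax outputs, the squared error on one example is bounded, so a confident wrong answer costs only a little more than an unsure one. Cross-entropy's −log(p) grows without bound as the true-class probability approaches 0, so it penalizes confident wrong answers far more sharply.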
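A small numeric comparison makes the difference concrete. This is a sketch under simple assumptions: the wrong predictions are constructed by hand (all remaining mass on one wrong digit), MSE is averaged over the 10 classes, and the helper names are hypothetical.

```python
import math

def mse_one_hot(probs, true_class):
    """Mean squared error between the probabilities and the one-hot target."""
    return sum((p - (1.0 if i == true_class else 0.0)) ** 2
               for i, p in enumerate(probs)) / len(probs)

def cross_entropy(probs, true_class):
    """Cross-entropy for one example: -log of the true-class probability."""
    return -math.log(probs[true_class])

def wrong_prediction(p_true, true_class=3, wrong_class=7, n_classes=10):
    """Give p_true to the true class and everything else to one wrong class."""
    probs = [0.0] * n_classes
    probs[true_class] = p_true
    probs[wrong_class] = 1.0 - p_true
    return probs

rows = []
for p_true in (0.1, 0.01, 0.001):
    probs = wrong_prediction(p_true)
    rows.append((p_true, mse_one_hot(probs, 3), cross_entropy(probs, 3)))

for p_true, mse, ce in rows:
    print(f"p_true={p_true:<6} mse={mse:.4f} cross_entropy={ce:.4f}")
```

In this setup the MSE can never exceed 0.2 however wrong the model is, while the cross-entropy keeps growing as the true-class probability shrinks.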
| Task | Common loss |
|---|---|
| MNIST digits | Cross-entropy |
| House price | MSE / MAE |
| Module 1 patch brightness | MSE |
Reading training curves
Healthy training often shows:
- Training loss trending down (not necessarily to zero).
- Validation accuracy rising, then flattening.
- If train acc ↑ but val acc ↓ → overfitting (Module 2).
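The overfitting pattern in the last bullet can be checked mechanically once accuracy is logged per epoch. The sketch below uses made-up accuracy values and a hypothetical helper, `overfitting_epochs`; a single flagged epoch can be noise, so treat a sustained run of them as the real signal.

```python
def overfitting_epochs(train_acc, val_acc):
    """Epochs where train accuracy rose but validation accuracy fell vs. the previous epoch."""
    return [epoch for epoch in range(1, len(train_acc))
            if train_acc[epoch] > train_acc[epoch - 1]
            and val_acc[epoch] < val_acc[epoch - 1]]

# Illustrative per-epoch accuracies (epoch 0 first); not real MNIST results.
train_acc = [0.900, 0.950, 0.970, 0.985, 0.992, 0.996]
val_acc = [0.910, 0.950, 0.965, 0.970, 0.968, 0.964]

best_epoch = val_acc.index(max(val_acc))          # where validation accuracy peaked
flagged = overfitting_epochs(train_acc, val_acc)  # epochs showing the train-up / val-down pattern

print("best validation epoch:", best_epoch)
print("epochs flagged as possible overfitting:", flagged)
```

Here validation accuracy peaks at epoch 3, and the two epochs after it are flagged because training accuracy keeps climbing while validation accuracy slips.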
Checkpoint
When is accuracy a misleading metric during training?
Answer sketch
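Accuracy can hide poor performance on rare classes, and it ignores confidence: a prediction counts as right or wrong no matter how much probability the model put on it. Loss captures how confident the wrong predictions are. For MNIST, track validation accuracy and a confusion matrix alongside the loss.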
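The rare-class problem is easy to reproduce with a toy example. The sketch below uses an invented two-class dataset (95 common examples, 5 rare ones) and a model that always predicts the common class; the helper functions are hypothetical, written in plain Python rather than taken from a library.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts examples whose true class is t and predicted class is p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Fraction of all examples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def per_class_recall(cm):
    """For each true class, the fraction of its examples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(cm)]

# Toy imbalanced labels: class 1 is rare, and the model never predicts it.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

cm = confusion_matrix(y_true, y_pred, n_classes=2)
acc = accuracy(cm)
recall = per_class_recall(cm)

print("confusion matrix:", cm)
print("accuracy:", acc)
print("per-class recall:", recall)
```

Overall accuracy looks strong at 95%, yet recall on the rare class is zero, which only the confusion matrix reveals.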