Loss functions for neural networks

Before we begin

Training with a forward pass and backprop needs a scalar loss: one number that says how wrong the model was on the batch.

For MNIST (10 digit classes), the standard pair is:

Softmax outputs + cross-entropy loss

What you will learn

Explain cross-entropy in plain language.
Know why MSE is a weak default for classification.
Read a simple training curve (loss down, accuracy up).

Cross-entropy (one example)

True digit: 3 (one-hot: index 3 = 1, others 0).
Model probabilities after softmax: p₀…p₉.

Loss = −log(p₃)

If the model is confident and correct (p₃ ≈ 1) → loss ≈ 0.
If the model assigns low probability to the true class → large loss (it grows without bound as p₃ → 0).

Average loss over the batch → one number for backprop.
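
The single-example loss and the batch average can be sketched in a few lines of plain Python. The three probability vectors below are made-up softmax outputs for images whose true digit is 3, and the function name cross_entropy is only illustrative.

```python
import math

def cross_entropy(probs, true_class):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_class])

# Hypothetical softmax outputs for three images whose true digit is 3.
confident_right = [0.01] * 10
confident_right[3] = 0.91          # 9 * 0.01 + 0.91 = 1.0
unsure = [0.1] * 10                # uniform guess over the 10 digits
confident_wrong = [0.001] * 10
confident_wrong[7] = 0.991         # almost all mass on the wrong digit 7

batch = [confident_right, unsure, confident_wrong]
losses = [cross_entropy(p, 3) for p in batch]
batch_loss = sum(losses) / len(losses)   # the single scalar handed to backprop

for name, loss in zip(["confident right", "unsure", "confident wrong"], losses):
    print(f"{name:16s} loss = {loss:.3f}")
print(f"batch mean loss = {batch_loss:.3f}")
```

The per-example losses come out at roughly 0.09, 2.30 and 6.91, so the one confident wrong answer dominates the batch mean.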

Softmax + cross-entropy together

Softmax converts 10 logits to probabilities summing to 1:

$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$

Cross-entropy pushes probability mass onto the correct class. Frameworks often combine the two into a single loss that takes raw logits, such as PyTorch's CrossEntropyLoss, and apply softmax internally in log space for numerical stability.
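
To see why the combined form helps, here is a minimal sketch in plain Python. It is not any framework's actual implementation: the function names and the logit values are assumptions chosen for illustration. The fused version uses the log-sum-exp trick, so it never takes the log of a probability that has underflowed to 0.

```python
import math

def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_class):
    """-log(softmax(logits)[true_class]) via log-sum-exp, without forming the probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

# Made-up logits for one image whose true digit is 3.
logits = [1.0, 0.5, -1.0, 4.0, 0.0, 0.2, -0.5, 2.0, 0.1, -2.0]
probs = softmax(logits)
loss_two_step = -math.log(probs[3])                  # softmax, then -log
loss_fused = cross_entropy_from_logits(logits, 3)    # same value, one step

# Extreme logits: the two-step route would hit log(0) because p_3 underflows to 0.
big = [1000.0] + [0.0] * 9
loss_big = cross_entropy_from_logits(big, 3)         # stays finite

print(loss_two_step, loss_fused, loss_big)
```

For ordinary logits the two routes agree. For the extreme case the fused loss is 1000, while computing softmax first and then the log would fail.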

Why not MSE on one-hot labels?

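Mean squared error can work, but it often trains more slowly and fits probabilistic classification less well. Its penalty on a probability vector is bounded, whereas cross-entropy grows without bound as the probability of the true class approaches 0, so confident wrong answers are penalized far more sharply.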
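
A small comparison makes the difference concrete. This is a sketch under simple assumptions: the true digit is 3, the model puts probability p_true on it and all remaining mass on the wrong digit 7, and MSE is averaged over the 10 outputs. The helper names are illustrative.

```python
import math

def one_hot(true_class, num_classes=10):
    return [1.0 if i == true_class else 0.0 for i in range(num_classes)]

def mse(probs, target):
    """Mean squared error between predicted probabilities and the one-hot target."""
    return sum((p - t) ** 2 for p, t in zip(probs, target)) / len(probs)

def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

target = one_hot(3)
rows = []
for p_true in [0.5, 0.1, 0.01, 0.001]:
    # Hypothetical prediction: p_true on digit 3, the rest on the wrong digit 7.
    probs = [0.0] * 10
    probs[3] = p_true
    probs[7] = 1.0 - p_true
    rows.append((p_true, mse(probs, target), cross_entropy(probs, 3)))

for p_true, m, ce in rows:
    print(f"p_true={p_true:<6} MSE={m:.4f}  cross-entropy={ce:.4f}")
```

In this setup MSE can never exceed 0.2 however wrong the model is, while cross-entropy keeps climbing as p_true shrinks.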

Task	Common loss
MNIST digits	Cross-entropy
House price	MSE / MAE
Module 1 patch brightness	MSE

Reading training curves

Healthy training often shows:

Training loss trending down (not necessarily to zero).
Validation accuracy rising, then flattening.
If training accuracy keeps rising while validation accuracy falls → overfitting (Module 2).
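
The same reading can be done on logged numbers. The per-epoch values below are invented for illustration and are not results from a real run. The check in looks_overfit is one simple heuristic (compare the last epoch with the epoch of best validation accuracy), not a standard rule.

```python
# Made-up per-epoch metrics, for illustration only.
train_acc = [0.90, 0.95, 0.97, 0.98, 0.99, 0.995]
val_acc = [0.89, 0.94, 0.96, 0.965, 0.96, 0.955]

def best_epoch(val_acc):
    """Index of the epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i])

def looks_overfit(train_acc, val_acc):
    """True if, after the validation peak, train accuracy rose while validation fell."""
    peak = best_epoch(val_acc)
    last = len(val_acc) - 1
    return (
        peak < last
        and train_acc[last] > train_acc[peak]
        and val_acc[last] < val_acc[peak]
    )

peak = best_epoch(val_acc)
overfit = looks_overfit(train_acc, val_acc)
print(f"best validation epoch: {peak}, overfitting signal: {overfit}")
```

Here validation accuracy peaks at epoch index 3 and then drops while training accuracy keeps improving, which is the pattern described above.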

Checkpoint

When is accuracy a misleading metric during training?

Answer sketch

Accuracy can hide poor performance on rare classes, and it ignores confidence: loss captures how confident the wrong predictions are. For MNIST, track validation accuracy and a confusion matrix as well.
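
A toy example of the rare-class problem, using a two-class setup with made-up labels rather than MNIST (whose digit classes are roughly balanced). The confusion_matrix helper is written here for illustration and is not a library function.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true class t that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels: 95 examples of a common class 0 and 5 of a rare class 1.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100                     # a model that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
cm = confusion_matrix(y_true, y_pred, num_classes=2)
# Per-class recall: correct predictions for a class / examples of that class.
recall = [cm[c][c] / sum(cm[c]) for c in range(2)]

print(f"accuracy = {accuracy:.2f}")
print(f"confusion matrix = {cm}")
print(f"per-class recall = {recall}")
```

Accuracy is 0.95 even though the model never finds the rare class. The confusion matrix row for class 1 shows all 5 examples misclassified.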

What's next

Module 3 quiz