Light, sensors, and the imaging pipeline

This lesson builds a mental model of what a digital image is before you touch convolutions or neural networks. If you know what is being measured (and what noise is), every later algorithm makes more sense.

Figure: Photons to pixels at a glance

A typical RGB pipeline: every stage is a choice that can later show up as an artifact.

Learning objectives

By the end of this lesson you should be able to:

Trace the path from scene radiance to stored pixel values in plain language.
Distinguish radiance, irradiance, and sensor response — and explain why algorithms care about linearity.
Relate exposure settings (ISO, shutter, aperture) to signal-to-noise ratio.
Explain quantization, demosaicing, and gamma with enough precision to debug artifacts.
List the main sources of sensor noise and model shot noise with a simple formula.
Sketch an ISP (image signal processor) pipeline and name what each stage changes.

Prerequisites

Comfort with basic algebra (ratios, averages, square roots).
Optional: any first exposure to RGB images as 3D arrays (height × width × channels).

Step 1 — What a camera actually measures

A camera does not “capture reality.” It samples light that reaches a sensor plane over a finite exposure time, through optics with a particular spectral response.

Radiance, irradiance, and the pixel well

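Radiance $L$ (here spectral radiance, units: W·sr⁻¹·m⁻²·nm⁻¹) describes how much power leaves a surface patch along a given direction, per unit solid angle, per unit projected area, and per unit wavelength. It depends on material (BRDF), lighting, and geometry.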
At the sensor, you care about irradiance $E$ — power per unit area hitting the photodiode. The lens maps scene radiance to sensor irradiance; aperture and focal length set how much light is collected.
Photons are converted to electrical charge in each pixel well. More photons → more electrons, up to full well capacity.

The lens focuses a bundle of rays so that (ideally) each small region on the sensor corresponds to a direction in the scene (pinhole / thin-lens story). Defocus spreads one scene point across multiple pixels — that blur is optical, not algorithmic.

Checkpoint (conceptual): In one sentence, what physical quantity is ultimately turned into an integer in your image file?

Answer sketch: Electrons accumulated in a pixel well during exposure, then amplified, digitized, and heavily processed — the stored integer is only loosely “brightness in the scene.”

Step 2 — Exposure triangle and signal-to-noise

Three controls dominate how many photons you collect:

Control	Effect on photons	Typical side effect
Shutter time	Linear: 2× time ≈ 2× photons	Motion blur if scene or camera moves
Aperture (f-number)	Area ∝ $1/N^2$; f/2.8 vs f/5.6 is ~4× photons	Depth of field, vignetting
ISO / gain	Amplifies electronic signal after collection	Does not add photons; amplifies read noise too
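The scaling in the table can be checked with a few lines of arithmetic. This is a minimal sketch that assumes photon count is proportional to shutter time divided by $N^2$ and ignores lens transmission and vignetting; the function name `relative_photons` is made up for illustration.

```python
def relative_photons(shutter_s: float, f_number: float) -> float:
    """Photon count up to a constant factor: proportional to time / N^2."""
    return shutter_s / f_number**2


base = relative_photons(1 / 30, 5.6)
wider = relative_photons(1 / 30, 2.8)    # open up two stops
longer = relative_photons(1 / 15, 5.6)   # double the shutter time

print(f"f/2.8 vs f/5.6: {wider / base:.2f}x photons")
print(f"1/15 s vs 1/30 s: {longer / base:.2f}x photons")
```

ISO does not appear in the function on purpose: gain is applied after the photons have been collected.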

Signal-to-noise ratio (SNR) in a single pixel (schematic):

$$\mathrm{SNR} \approx \frac{N_{\text{signal}}}{\sqrt{N_{\text{signal}} + \sigma_{\text{read}}^2}}$$

where $N_{\text{signal}}$ is mean electron count from photons and $\sigma_{\text{read}}$ is read noise (electrons). At high light, shot noise $\sqrt{N_{\text{signal}}}$ dominates; at low light, read noise sets the floor — boosting ISO does not fix missing photons.
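To get a feel for the two regimes, the schematic formula can be evaluated at a few signal levels. The read noise of 3 electrons below is an assumed, illustrative value, not a figure for any particular sensor.

```python
import math


def pixel_snr(n_signal: float, sigma_read: float) -> float:
    """Schematic single-pixel SNR: signal over combined shot and read noise."""
    return n_signal / math.sqrt(n_signal + sigma_read**2)


SIGMA_READ = 3.0  # electrons; assumed value for illustration

for n in (5, 50, 500, 5000):
    shot_limited = math.sqrt(n)  # SNR if read noise were zero
    print(f"N={n:5d}  SNR={pixel_snr(n, SIGMA_READ):7.2f}  shot-limited={shot_limited:7.2f}")
```

At the smallest signal the SNR falls well short of the shot-limited value, while at the largest the two are nearly equal.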

Exercise: You shoot indoors at 1/30 s, f/2.0, ISO 3200 and see grain. List two capture changes that increase photons without raising ISO, and one change that only amplifies electronically.

Step 3 — From analog signal to discrete pixels

After readout, the analog signal is amplified and passed through an ADC (analog-to-digital converter).

Bit depth (e.g. 10–14 bits on many RAW pipelines) sets how finely intensity is quantized before compression. Quantization step size $\Delta$ adds noise on the order of $\Delta/\sqrt{12}$ (uniform quantizer intuition).
Saturation happens when the well fills: highlights clip — no recovery in a single exposure without HDR fusion.
Black level offsets exist: “zero light” is not digital zero after processing. RAW developers subtract a black level per channel before scaling.
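Black-level subtraction and scaling can be sketched as follows. This assumes a single black level and white level for the whole frame (real RAW files usually carry per-channel values), and the numbers are invented for illustration.

```python
import numpy as np


def normalize_raw(raw: np.ndarray, black_level: float, white_level: float) -> np.ndarray:
    """Subtract the black level and scale so white_level maps to 1.0."""
    # Convert to float first: subtracting from unsigned integers would wrap around.
    linear = (raw.astype(np.float64) - black_level) / (white_level - black_level)
    return np.clip(linear, 0.0, 1.0)


raw = np.array([60, 64, 2080, 4095], dtype=np.uint16)  # hypothetical 12-bit codes
print(normalize_raw(raw, black_level=64, white_level=4095))
```

Codes below the black level (noise around "zero light") are clipped to 0 here; some pipelines keep them negative until later to avoid biasing shadow averages.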

Worked example: A 12-bit ADC gives $2^{12} = 4096$ levels. If full well is 60,000 electrons mapped to 3800 codes, one ADU ≈ 16 electrons. Clipping at code 4095 loses highlight detail permanently in that frame.
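The arithmetic of the worked example, plus the $\Delta/\sqrt{12}$ quantization noise expressed in electrons, can be written out as a short sketch. The variable names are illustrative.

```python
import math

full_well_e = 60_000   # electrons, from the worked example
codes_used = 3_800     # ADC codes spanned by the full well, from the worked example

gain_e_per_adu = full_well_e / codes_used         # electrons per ADU
quant_noise_adu = 1 / math.sqrt(12)               # uniform quantizer with a step of 1 ADU
quant_noise_e = gain_e_per_adu * quant_noise_adu  # the same noise in electrons

print(f"{gain_e_per_adu:.1f} e-/ADU, quantization noise ~ {quant_noise_e:.1f} e-")
```

With these numbers one ADU is about 15.8 electrons and the quantization noise is roughly 4.6 electrons, which is small next to the shot noise of a well-exposed pixel.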

Checkpoint: Why do two different phones sometimes show different brightness for the same scene even before “filters”?

Answer sketch: Different metering, tone mapping, color matrices, and auto-exposure targets, not necessarily different photon counts.

Step 4 — Color filters and demosaicing (Bayer)

Most color cameras place a CFA (color filter array) over the sensor. A common pattern is the Bayer mosaic (often RGGB): each pixel measures mostly one spectral band.

Because each spatial location does not have full RGB immediately, the camera (or RAW developer) interpolates missing colors — demosaicing.

Figure: Bayer color filter array (RGGB)

Each sensor pixel measures only one color; the missing two are filled in by demosaicing from neighbors.
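The sampling itself is easy to simulate: start from a full RGB image and keep one channel per pixel. The sketch below assumes an RGGB tile whose red sample sits at row 0, column 0; real sensors may start the pattern at a different offset, and `mosaic_rggb` is an illustrative name.

```python
import numpy as np


def mosaic_rggb(rgb: np.ndarray) -> np.ndarray:
    """Keep one channel per pixel following an RGGB tile; returns an H x W array."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even columns
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd columns
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even columns
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd columns
    return raw


# A flat color (R=10, G=20, B=30) makes the tile pattern visible in the output.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0], rgb[..., 1], rgb[..., 2] = 10, 20, 30
print(mosaic_rggb(rgb))
```

Two thirds of the color information at every pixel is discarded, which is exactly what demosaicing has to estimate back.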

What demosaicing must infer

At a green site, R and B are unknown; algorithms use spatial and sometimes spectral correlation:

Bilinear: average available neighbors — fast, zippering on edges.
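Edge-directed or gradient-corrected (e.g. Malvar-He-Cutler): steer interpolation along the estimated edge direction, or correct the bilinear estimate using gradients from the other channels. Fewer color fringes, more compute.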

Demosaicing choices affect edges (zippering, false color) and fine texture. Aggressive sharpening after demosaicing can create halos that downstream detectors treat as real structure.
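The bilinear option listed above can be sketched as neighbor averaging: each missing sample is the mean of the known samples of that color in the surrounding 3×3 window. In the image interior this reproduces the usual bilinear weights; at the borders it averages whatever neighbors exist. The code assumes an RGGB layout with red at row 0, column 0 and an image of at least 2×2 pixels; it is an illustration, not a production demosaicer.

```python
import numpy as np


def box3_sum(a: np.ndarray) -> np.ndarray:
    """Sum over each 3x3 neighborhood, with zero padding at the borders."""
    h, w = a.shape
    p = np.pad(a, 1)
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))


def demosaic_average_rggb(raw: np.ndarray) -> np.ndarray:
    """Fill missing colors with the mean of known same-color samples in a 3x3 window."""
    h, w = raw.shape
    masks = np.zeros((h, w, 3), dtype=bool)
    masks[0::2, 0::2, 0] = True  # R sites
    masks[0::2, 1::2, 1] = True  # G sites on red rows
    masks[1::2, 0::2, 1] = True  # G sites on blue rows
    masks[1::2, 1::2, 2] = True  # B sites
    out = np.empty((h, w, 3), dtype=np.float64)
    for c in range(3):
        m = masks[..., c].astype(np.float64)
        # Sum of known samples divided by how many known samples fell in the window.
        estimate = box3_sum(raw * m) / box3_sum(m)
        # Measured samples are kept as-is; only missing ones are interpolated.
        out[..., c] = np.where(masks[..., c], raw, estimate)
    return out


# A flat-color mosaic (R=10, G=20, B=30) should be recovered exactly.
raw = np.zeros((4, 6))
raw[0::2, 0::2], raw[0::2, 1::2], raw[1::2, 0::2], raw[1::2, 1::2] = 10, 20, 20, 30
print(demosaic_average_rggb(raw)[0, 0])
```

On flat regions this is exact; the zippering and false color mentioned above appear when the window straddles an edge and mixes samples from both sides.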

Exercise (paper / notes): Sketch a 4×4 RGGB pattern. For one of the green pixels near the center, write which neighbors you would use in bilinear R and B estimates. Why does a red–blue edge confuse naive interpolation?

Step 5 — Gamma, linear light, and display encoding

Linear light: pixel value (after black-level correction) proportional to photoelectrons / irradiance.

Display-referred / sRGB: a transfer function that spends more code values on shadows and fewer on highlights, roughly matching human perception and making 8-bit storage workable:

$$V_{\text{sRGB}} \approx \begin{cases} 12.92\, L & L \le 0.0031308 \\ 1.055\, L^{1/2.4} - 0.055 & L > 0.0031308 \end{cases}$$

(Piecewise form with a linear toe near black, for $L$ in $[0, 1]$; a pure power law with exponent about $1/2.2$ is a common approximation.)
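The piecewise curve above translates directly into code. The sketch below also includes the inverse, which uses the standard sRGB threshold of 0.04045 on the encoded side; the function names are illustrative, and inputs are assumed to be normalized to $[0, 1]$.

```python
import numpy as np


def srgb_encode(linear):
    """Linear light in [0, 1] -> sRGB-encoded value in [0, 1]."""
    linear = np.clip(np.asarray(linear, dtype=np.float64), 0.0, 1.0)
    low = 12.92 * linear
    high = 1.055 * np.power(linear, 1 / 2.4) - 0.055
    return np.where(linear <= 0.0031308, low, high)


def srgb_decode(encoded):
    """sRGB-encoded value in [0, 1] -> linear light in [0, 1]."""
    encoded = np.clip(np.asarray(encoded, dtype=np.float64), 0.0, 1.0)
    low = encoded / 12.92
    high = np.power((encoded + 0.055) / 1.055, 2.4)
    return np.where(encoded <= 0.04045, low, high)


levels = np.array([0.0, 0.001, 0.18, 0.5, 1.0])
print(srgb_encode(levels))               # 18% linear gray lands near 0.46 encoded
print(srgb_decode(srgb_encode(levels)))  # round trip returns the inputs
```

Note how an 18% linear gray ends up close to the middle of the encoded range: that is the shadow expansion at work.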

Domain	Use in vision
Linear RAW	Photometric stereo, HDR merge, physically motivated shading
sRGB / JPEG	What most datasets and pretrained nets see
Log / PQ (video)	Wide dynamic range display pipelines

Many classical algorithms assume linear intensity for physically meaningful operations. Deep networks often train on display-referred images anyway — but blending, sharpening, or shadow recovery in sRGB is not the same as in linear space.

Checkpoint: If you blur a JPEG in an editor and edges look “glowy,” what non-linearity might be involved?

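Answer sketch: The sRGB transfer curve. Averaging encoded values gives a darker result than averaging linear light and then re-encoding, so a blur computed on encoded values distorts edges (sometimes called gamma bleeding).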
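A two-pixel example makes the difference concrete: average a black pixel and a white pixel, once directly on the encoded values and once in linear light. The helper functions repeat the sRGB curve from this step for scalars; their names are illustrative.

```python
def encode(linear: float) -> float:
    """sRGB encoding of a linear value in [0, 1]."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055


def decode(encoded: float) -> float:
    """Inverse of encode for values in [0, 1]."""
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4


black, white = 0.0, 1.0  # sRGB-encoded pixel values

avg_encoded = (black + white) / 2                         # naive average of stored values
avg_linear = encode((decode(black) + decode(white)) / 2)  # average the light, then re-encode

print(f"averaged in sRGB:   {avg_encoded:.3f}")
print(f"averaged in linear: {avg_linear:.3f}")
```

The linear-light average comes out noticeably brighter (about 0.735 versus 0.5), which is why a blur across a high-contrast edge looks different depending on the space it is computed in.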

Step 6 — Noise: models you can use

Shot (Poisson) noise

Photon arrivals are random. If the mean count is $\mu$, the variance is also $\mu$:

$$\sigma_{\text{shot}} = \sqrt{\mu}$$

Relative noise $\sigma/\mu = 1/\sqrt{\mu}$ improves with more light. Exposing to the right (ETTR) in RAW, without clipping highlights, exploits this fact.
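The $1/\sqrt{\mu}$ behavior is easy to confirm by simulation. The sketch below draws Poisson samples with NumPy for a few mean counts and compares the measured relative noise with the prediction; the pixel count and seed are arbitrary choices.

```python
import numpy as np


def relative_noise(rng: np.random.Generator, mean_photons: float, n_pixels: int = 100_000) -> float:
    """Measured sigma / mean for Poisson-distributed photon counts."""
    samples = rng.poisson(mean_photons, size=n_pixels)
    return float(samples.std() / samples.mean())


rng = np.random.default_rng(0)
for mu in (10, 100, 1000):
    measured = relative_noise(rng, mu)
    predicted = 1 / np.sqrt(mu)
    print(f"mu={mu:5d}  measured={measured:.4f}  predicted={predicted:.4f}")
```

Each tenfold increase in light cuts the relative noise by about $\sqrt{10} \approx 3.2$.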

Read noise

Approximately additive Gaussian noise, measured in electrons and independent of signal. It dominates in shadows and short exposures.

Other sources

Thermal / dark current — grows with exposure time and temperature.
Fixed pattern noise (FPN) — column/row offsets; often calibrated out in ISP.
Quantization noise — from ADC and 8-bit export.

Figure: Noise regimes vs signal level

Schematic: read noise sets a floor; shot noise grows like √signal. At low light the floor dominates; at bright light shot noise wins.
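The crossover in the schematic can be located with one line of algebra, assuming shot and read noise are independent so that their variances add (the same assumption behind the SNR formula in Step 2):

```latex
\sigma_{\text{total}}^2 = \mu + \sigma_{\text{read}}^2,
\qquad
\sigma_{\text{shot}} = \sigma_{\text{read}}
\;\Longleftrightarrow\;
\mu = \sigma_{\text{read}}^2
```

For an illustrative read noise of 3 electrons, the two contributions are equal at a mean signal of 9 electrons; below that the floor dominates, above it shot noise does.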

Exercise: For a dark indoor frame vs a bright outdoor frame, which noise source tends to dominate visually in each case? How would stacking $N$ identical frames change SNR (approximately)?

Answer sketch: Indoor: read and quantization noise; outdoor: shot noise. Stacking $N$ frames improves SNR by about $\sqrt{N}$ if the noise is independent between frames.
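The $\sqrt{N}$ claim can be checked with a small simulation that adds Poisson shot noise and Gaussian read noise, then averages frames. The mean signal of 50 electrons and read noise of 5 electrons are assumed values chosen to mimic a dim scene.

```python
import numpy as np


def stacked_snr(n_frames: int, mean_e: float = 50.0, sigma_read: float = 5.0,
                n_pixels: int = 50_000, seed: int = 0) -> float:
    """SNR of the per-pixel mean of n_frames simulated frames of a flat scene."""
    rng = np.random.default_rng(seed)
    shot = rng.poisson(mean_e, size=(n_frames, n_pixels)).astype(np.float64)
    read = rng.normal(0.0, sigma_read, size=(n_frames, n_pixels))
    stacked = (shot + read).mean(axis=0)  # average the frames pixel by pixel
    return float(stacked.mean() / stacked.std())


single = stacked_snr(1)
for n in (1, 4, 16):
    print(f"N={n:2d}  SNR gain = {stacked_snr(n) / single:.2f}  sqrt(N) = {np.sqrt(n):.2f}")
```

The gain only holds when noise is independent between frames; fixed pattern noise, for example, does not average away.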

Step 7 — The ISP: from RAW to what algorithms see

A phone ISP typically runs the following stages, on-sensor or immediately after readout:

Stage	What it does	Vision impact
Black level / OB	Subtract offsets	Prevents color cast in shadows
Lens shading	Per-channel vignette correction	Uniform illumination for photometry
Demosaic	CFA → RGB	Edge artifacts if aggressive
White balance	Diagonal color scaling	Changes “true” color ratios
Color matrix	Sensor RGB → display RGB	Dataset color statistics
Tone / gamma	Dynamic range compression	Non-linear; affects gradients
Sharpen / NR	High-frequency boost or suppression	Fake edges, texture loss
JPEG encode	Lossy compression	Blocking, ringing

Putting it together:

Scene → optics → CFA + exposure → RAW → ISP chain → stored file.
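To see how the stages compose, here is a deliberately tiny ISP sketch. It starts from already-demosaiced sensor RGB and applies black level, white balance, a color matrix, sRGB encoding, and 8-bit quantization; lens shading, noise reduction, sharpening, and JPEG are left out. All parameter values are invented for illustration and do not describe any real camera.

```python
import numpy as np


def toy_isp(raw_rgb, black_level, white_level, wb_gains, ccm):
    """Black level -> white balance -> color matrix -> sRGB encode -> 8-bit."""
    x = (raw_rgb.astype(np.float64) - black_level) / (white_level - black_level)
    x = np.clip(x, 0.0, 1.0)
    x = x * np.asarray(wb_gains, dtype=np.float64)      # diagonal per-channel scaling
    x = x @ np.asarray(ccm, dtype=np.float64).T         # 3x3 sensor RGB -> display RGB
    x = np.clip(x, 0.0, 1.0)                            # highlights clip here
    encoded = np.where(x <= 0.0031308, 12.92 * x, 1.055 * np.power(x, 1 / 2.4) - 0.055)
    return np.round(encoded * 255).astype(np.uint8)


# Hypothetical parameters; each matrix row sums to 1 so neutral grays stay neutral.
wb_gains = (2.0, 1.0, 1.5)
ccm = [[1.6, -0.4, -0.2],
       [-0.3, 1.5, -0.2],
       [-0.1, -0.5, 1.6]]

raw_rgb = np.array([[[300, 500, 350]]], dtype=np.uint16)  # one 12-bit sensor pixel
print(toy_isp(raw_rgb, black_level=64, white_level=4095, wb_gains=wb_gains, ccm=ccm))
```

Changing the order of stages, or clipping earlier, changes the output, which is one reason two devices rarely agree on pixel values.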

You do not need to implement each stage yet. You need the vocabulary to read datasheets, papers, and failure cases.

Deep dive — HDR, rolling shutter, and metrology

Multi-exposure HDR fuses short (highlights) and long (shadows) frames with alignment — ghosting if objects move.
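A minimal two-frame merge, assuming the frames are linear, already aligned, and the scene is static: use the long exposure wherever it is not saturated, and fall back to the short exposure in the highlights. The hard switch and the 0.95 saturation threshold are simplifications; practical pipelines blend with smooth weights.

```python
import numpy as np


def merge_two_exposures(short_exp, long_exp, ratio: float, sat: float = 0.95):
    """Merge linear frames in [0, 1]; ratio = long exposure time / short exposure time.

    The result is expressed in the units of the short exposure.
    """
    short_exp = np.asarray(short_exp, dtype=np.float64)
    long_exp = np.asarray(long_exp, dtype=np.float64)
    use_long = long_exp < sat  # trust the long frame unless it is clipped
    return np.where(use_long, long_exp / ratio, short_exp)


scene = np.array([0.01, 0.1, 0.5])        # relative scene brightness
short_exp = np.clip(scene, 0.0, 1.0)
long_exp = np.clip(scene * 8, 0.0, 1.0)   # 8x longer exposure clips the brightest pixel
print(merge_two_exposures(short_exp, long_exp, ratio=8))
```

If an object moves between the two frames, the mask selects mismatched content on either side of its boundary, which is the ghosting mentioned above.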

Rolling shutter reads rows sequentially; fast motion or vibration skews geometry (wobbly buildings, bent propellers). Global shutter sensors avoid this at higher cost.
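The size of the skew follows from simple arithmetic: the last row is read later than the first by the number of rows times the row readout time, and a horizontally moving object shifts by its speed times that delay. The readout time and speed below are assumed, illustrative numbers.

```python
def rolling_shutter_skew_px(height_rows: int, row_time_s: float, speed_px_per_s: float) -> float:
    """Horizontal offset between the first and last row for constant horizontal motion."""
    readout_delay_s = (height_rows - 1) * row_time_s
    return readout_delay_s * speed_px_per_s


# Hypothetical sensor: 1080 rows at 10 microseconds per row, object moving at 2000 px/s.
skew = rolling_shutter_skew_px(1080, 10e-6, 2000.0)
print(f"top-to-bottom skew: {skew:.1f} px")
```

A vertical pole in such a frame would lean by that many pixels from top to bottom.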

Radiometric calibration maps digital numbers to irradiance via flat-field panels and known lights — required for quantitative vision (agriculture, medical imaging, satellite).
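The flat-field part of such a calibration can be sketched as follows, assuming a dark frame and a flat-field frame have already been captured with the same settings. This removes pixel-to-pixel and vignetting variation and yields relative values only; mapping to absolute irradiance would additionally need a source of known brightness.

```python
import numpy as np


def flat_field_correct(image: np.ndarray, flat: np.ndarray, dark: np.ndarray) -> np.ndarray:
    """Remove fixed offsets (dark) and relative gain variation (flat) from an image."""
    gain = flat.astype(np.float64) - dark
    gain = gain / gain.mean()  # normalize so the average gain is 1
    return (image.astype(np.float64) - dark) / gain


# Synthetic check: a uniform scene seen through strong vignetting.
vignette = np.array([[1.0, 0.5], [0.5, 0.25]])
dark = np.full((2, 2), 10.0)
flat = 200.0 * vignette + dark
image = 100.0 * vignette + dark

print(flat_field_correct(image, flat, dark))
```

After correction the uniform scene is uniform again, even though the raw image varied by a factor of four across the frame.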

When things go wrong (debugging checklist)

Symptom	Often caused by
Purple/green fringes on edges	Demosaic + chromatic aberration
Banding in smooth skies	8-bit + aggressive tone mapping
Flickering exposure between frames	Auto-exposure hunting — breaks optical flow / SLAM
“Crunchy” fine detail	Oversharpening after NR
Color shift under LED lights	Narrow-band spectra vs daylight-trained AWB

Check your understanding

Why does a single RAW “pixel” not immediately give you an RGB triple at that location?
Name two reasons identical scenes might produce different digital numbers on two devices.
Why might edge-aware vision algorithms behave differently on JPEG vs RAW-derived linear images?
If you double shutter time and halve ISO, what happens to photon count and read-noise contribution?
Why is averaging three JPEGs not equivalent to averaging three linear RAW frames?

Lab-style stretch goals

Histograms: Load an image, split channels, plot R/G/B histograms. Note clipping at 0 and 255 and skew — relate to exposure and tone curve.
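A possible starting point for the histogram exercise, using a synthetic 8-bit array so it runs without any image file; swapping in a loaded image of shape height × width × 3 is left to the reader. The function name `clipping_report` is illustrative.

```python
import numpy as np


def clipping_report(img_u8: np.ndarray) -> dict:
    """Per-channel fraction of pixels at code 0 and at code 255."""
    report = {}
    for i, name in enumerate("RGB"):
        channel = img_u8[..., i]
        hist = np.bincount(channel.ravel(), minlength=256)  # 256-bin histogram
        report[name] = (float(hist[0] / channel.size), float(hist[255] / channel.size))
    return report


# Synthetic overexposed image: normal noise around a bright mean, clipped to 8 bits.
rng = np.random.default_rng(0)
img = np.clip(rng.normal(220, 40, size=(64, 64, 3)), 0, 255).astype(np.uint8)

for name, (frac_black, frac_white) in clipping_report(img).items():
    print(f"{name}: {frac_black:.1%} at 0, {frac_white:.1%} at 255")
```

A spike at 255 like the one this produces is the histogram signature of clipped highlights.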

RAW vs sRGB (if available): Develop the same RAW twice — “linear 16-bit” vs “camera JPEG.” Run Sobel magnitude on both (next lesson) and compare edge energy in shadows.