Camera models and projection

You will connect 3D geometry to 2D image coordinates using the pinhole camera model, then understand intrinsics, extrinsics, distortion, and calibration well enough to implement projection in code.

Figure

Pinhole projection — rays through the camera center

Each 3D point's ray hits the image plane at (x, y) = (f·X/Z, f·Y/Z). Scale is lost — depth and size are entangled in a single image.

Learning objectives

Write pinhole projection in camera coordinates and in homogeneous form.
Build the intrinsic matrix $K$ and compose $P = K[R|t]$ .
Project a 3D world point to pixels with a worked numeric example.
Explain radial and tangential distortion and the Brown–Conrady model qualitatively.
Describe Zhang's calibration method and reprojection error.
State why a single view cannot recover metric scale.

Prerequisites

Basic linear algebra: matrix–vector multiply, homogeneous coordinates.
Convolution lesson (helpful for understanding image planes as grids).

Step 1 — The pinhole idealization

A pinhole camera maps a 3D point $\mathbf{X}_c = (X, Y, Z)^\top$ in camera coordinates ( $Z$ along the optical axis) to the image plane:

x = f \frac{X}{Z}, \quad y = f \frac{Y}{Z}

$f$ is the focal length, expressed in the same units as $X, Y, Z$ (for example meters), so $x, y$ are metric positions on the image plane before any pixel scaling.
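The mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the point is already in camera coordinates and lies in front of the camera; the function name `pinhole_project` and the 50 mm focal length are illustrative choices, not part of the lesson.

```python
import numpy as np

def pinhole_project(X_c, f):
    """Project a camera-frame point (X, Y, Z) onto the image plane at distance f."""
    X, Y, Z = np.asarray(X_c, dtype=float)
    if Z <= 0:
        # The ideal model divides by Z, so points at or behind the center are rejected.
        raise ValueError("point must be in front of the camera (Z > 0)")
    return np.array([f * X / Z, f * Y / Z])

# Assumed example: f = 0.05 m, point 1 m to the right and 5 m ahead.
xy = pinhole_project([1.0, 0.0, 5.0], f=0.05)
print(xy)  # image-plane position in meters
```

The explicit `Z > 0` check is one way to make the failure at $Z \to 0^{+}$ visible instead of silently returning huge values.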

Checkpoint: What happens as $Z \to 0^{+}$ ? Why do real lenses fail at macro distances?

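The division by $Z$ blows up, so the projected coordinates diverge. Real lenses are not true pinholes: they have a finite aperture and a minimum focus distance, so very close objects blur instead of following the ideal model.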

Step 2 — Homogeneous coordinates

Augment 3D points: $\tilde{\mathbf{X}} = [X, Y, Z, 1]^\top$ . Then

\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & \mathbf{t} \end{bmatrix} \tilde{\mathbf{X}}_w

Extrinsics $[R|\mathbf{t}]$ : rigid transform world → camera, $R \in \mathrm{SO}(3)$ .
Homogeneous scale $\lambda$ is arbitrary until you divide by the third component; because the last row of $K$ is $(0, 0, 1)$, $\lambda = Z_c$, the depth of the point in the camera frame.

Exercise: Why do we use 4 components for a 3D point?

So translation becomes matrix multiplication: $\mathbf{X}_c = R\mathbf{X}_w + \mathbf{t}$ is $[R|\mathbf{t}]\tilde{\mathbf{X}}_w$ .
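A quick numeric check of that identity, as a sketch: the rotation, translation, and point below are arbitrary values chosen for illustration.

```python
import numpy as np

# Hypothetical pose: 90 degree rotation about the Z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 2.0])
X_w = np.array([1.0, 2.0, 3.0])

Rt = np.hstack([R, t[:, None]])   # 3x4 matrix [R | t]
X_w_h = np.append(X_w, 1.0)       # homogeneous point [X, Y, Z, 1]

X_c_affine = R @ X_w + t          # rotate, then translate
X_c_homog = Rt @ X_w_h            # one matrix multiply

print(X_c_affine, X_c_homog)
```

Both expressions give the same camera-frame point, which is what lets the whole projection chain be written as a product of matrices.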

Step 3 — Intrinsics: meters to pixels

u = f_x \frac{X_c}{Z_c} + c_x, \quad v = f_y \frac{Y_c}{Z_c} + c_y

K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

$(c_x, c_y)$ : principal point — intersection of optical axis with sensor (often near image center, rarely exact center pixel).
$f_x, f_y$ in pixels: $f_x = f_{\text{mm}} \cdot s_x$ where $s_x$ is pixels per mm.
Skew $s$ : shear if sensor axes not perpendicular; ~0 on modern phones.
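As a sketch of how these three ingredients assemble into $K$: the helper name `intrinsic_matrix` and the lens and sensor numbers (4 mm lens, 200 pixels per mm, 640×480 image) are assumptions picked for illustration.

```python
import numpy as np

def intrinsic_matrix(f_mm, px_per_mm_x, px_per_mm_y, c_x, c_y, skew=0.0):
    """Build K from a focal length in mm and the sensor's pixel density."""
    f_x = f_mm * px_per_mm_x   # focal length in horizontal pixels
    f_y = f_mm * px_per_mm_y   # focal length in vertical pixels
    return np.array([[f_x, skew, c_x],
                     [0.0, f_y,  c_y],
                     [0.0, 0.0,  1.0]])

# Assumed example: square 5 micron pixels, principal point at the image center.
K = intrinsic_matrix(4.0, 200.0, 200.0, c_x=320.0, c_y=240.0)
print(K)
```

With square pixels the two densities are equal, so $f_x = f_y$; here both come out to 800 pixels.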

Figure

World → camera → pixels

Extrinsics [R | t] move a 3D point into the camera frame; intrinsics K turn normalized coordinates into pixels.

Checkpoint: If you crop the center 512×512 from a 4K frame in software (no sensor change), what changes in $K$ ?

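$c_x, c_y$ shift by the crop offset (subtract the top-left corner of the crop); $f_x, f_y$ are unchanged in pixel units because the crop is a pure translation of the pixel grid with no resampling.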
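The crop case can be sketched as follows. The function name `crop_intrinsics`, the 3840×2160 frame size, and the 3000 pixel focal length are assumptions for illustration only.

```python
import numpy as np

def crop_intrinsics(K, x0, y0):
    """Return K for a crop whose top-left corner is pixel (x0, y0) of the original."""
    K_crop = K.copy()
    K_crop[0, 2] -= x0   # principal point moves with the new origin
    K_crop[1, 2] -= y0
    return K_crop        # f_x, f_y, and skew are untouched

# Assumed 4K UHD frame with the principal point at the image center.
K = np.array([[3000.0, 0.0, 1920.0],
              [0.0, 3000.0, 1080.0],
              [0.0, 0.0, 1.0]])
x0, y0 = (3840 - 512) // 2, (2160 - 512) // 2   # top-left of the centered crop
K_crop = crop_intrinsics(K, x0, y0)
print(K_crop)
```

Resizing the image, unlike cropping, would also scale $f_x, f_y$ and the principal point; that case is not covered by this sketch.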

Step 4 — Worked projection example

World point $\mathbf{X}_w = (1, 0, 5)^\top$ m. Suppose

R = I, \quad \mathbf{t} = (0, 0, 0)^\top, \quad K = \begin{bmatrix} 800 & 0 & 320 \\ 0 & 800 & 240 \\ 0 & 0 & 1 \end{bmatrix}

Camera coords = world coords. Normalized coordinates: $\hat{x} = 1/5 = 0.2$ , $\hat{y} = 0/5 = 0$ . Pixels: $u = 800 \cdot 0.2 + 320 = 480$ , $v = 800 \cdot 0 + 240 = 240$ .

Exercise: Move the point to $Z=2.5$ . How do $u,v$ change? What does that say about apparent size vs depth?

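The offset from the principal point doubles: $u - c_x$ goes from 160 to 320, so $u = 640$ while $v$ stays at 240. Closer points project larger, and a single image cannot separate size from depth.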
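The numbers in this step can be checked with a short script. It uses the $K$ from the example; the helper name `project_camera_point` is illustrative.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def project_camera_point(K, X_c):
    """Apply K to a camera-frame point and divide by the third component."""
    uvw = K @ np.asarray(X_c, dtype=float)
    return uvw[:2] / uvw[2]

uv_far = project_camera_point(K, [1.0, 0.0, 5.0])    # Z = 5 m
uv_near = project_camera_point(K, [1.0, 0.0, 2.5])   # Z = 2.5 m

# Offsets from the principal point (c_x, c_y) are what scale with 1/Z.
offset_far = uv_far - K[:2, 2]
offset_near = uv_near - K[:2, 2]
print(uv_far, uv_near, offset_far, offset_near)
```

Halving the depth doubles the offset from the principal point (160 to 320 pixels in $u$), not the raw pixel coordinate.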

Step 5 — Full projection matrix

P = K [R \,|\, \mathbf{t}] \in \mathbb{R}^{3 \times 4}, \quad \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \tilde{\mathbf{X}}_w

$P$ has 11 degrees of freedom: 12 entries minus one for the overall scale (equivalently, 5 intrinsic plus 6 extrinsic parameters). Calibration estimates $K$ and distortion; pose estimation finds $R,\mathbf{t}$ per view.
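A minimal sketch of composing $P$ and projecting with it, reusing the values from the worked example. The function names `projection_matrix` and `project` are illustrative.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Compose the 3x4 matrix P = K [R | t]."""
    return K @ np.hstack([R, np.reshape(t, (3, 1))])

def project(P, X_w):
    """Project a 3D world point to pixels via homogeneous coordinates."""
    x = P @ np.append(np.asarray(X_w, dtype=float), 1.0)
    return x[:2] / x[2]   # divide out the homogeneous scale lambda

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

P = projection_matrix(K, R, t)
uv = project(P, [1.0, 0.0, 5.0])
uv_scaled = project(3.0 * P, [1.0, 0.0, 5.0])   # any nonzero multiple of P
print(P.shape, uv, uv_scaled)
```

Multiplying $P$ by a nonzero constant leaves the pixels unchanged, which is why one of the 12 entries does not count as a degree of freedom.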

Checkpoint: Why can you not recover absolute metric scale from one image?

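Scaling every world point and the camera translation by the same factor leaves every pixel unchanged, so the scene is recoverable only up to a similarity transform. Fix the scale with a known object size, or with multi-view triangulation in which the baseline between cameras is known.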
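The ambiguity can be demonstrated numerically. In this sketch the pose, the point, and the scale factor `s` are arbitrary illustrative values, and `project_point` is a hypothetical helper.

```python
import numpy as np

def project_point(K, R, t, X_w):
    """World point -> camera frame -> pixels (no distortion)."""
    X_c = R @ X_w + t
    uvw = K @ X_c
    return uvw[:2] / uvw[2]

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 1.0])
X_w = np.array([1.0, 0.5, 4.0])

s = 10.0   # make the whole scene, and the camera offset, ten times larger
uv = project_point(K, R, t, X_w)
uv_scaled = project_point(K, R, s * t, s * X_w)
print(uv, uv_scaled)
```

A small scene close to the camera and a large scene far away produce identical pixels, so the image alone carries no metric scale.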

Step 6 — Lens distortion

Real lenses bend rays. Common Brown–Conrady radial model (2D normalized coords $\hat{x}, \hat{y}$ ):

\hat{x}_d = \hat{x}(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \quad r^2 = \hat{x}^2 + \hat{y}^2

(similarly for $\hat{y}_d$ ; tangential terms $p_1, p_2$ model decentering).
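A sketch of the forward distortion on normalized coordinates. The radial part follows the formula above; the exact form of the tangential terms is not given in this lesson, so the one below is an assumption that follows the convention used in OpenCV's documentation. The function name `distort` and the coefficient values are illustrative.

```python
import numpy as np

def distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Apply radial (k1..k3) and tangential (p1, p2) distortion to normalized coords."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    # Tangential terms: assumed OpenCV-style convention.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# Barrel distortion (k1 < 0): compare a point near the center with one near the edge.
x_center, _ = distort(0.05, 0.0, k1=-0.2)
x_edge, _ = distort(0.5, 0.0, k1=-0.2)
print(x_center / 0.05, x_edge / 0.5)   # relative shrink grows with r
```

With all coefficients at zero the function returns its input, which is a convenient sanity check when wiring it into a projection pipeline.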

Figure

Pinhole vs barrel vs pincushion

A perfectly straight world grid bends near the edges under real lenses. Calibration estimates the coefficients that undo it.

Undistortion: solve for undistorted normalized coords (iterative or closed-form approximations), then apply $K$ .
SLAM / AR: wide-angle phone lenses distort strongly, and uncorrected distortion biases feature positions and therefore the estimated poses and map.
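The iterative option can be sketched as a fixed-point iteration. This is a minimal version assuming a radial-only model with $k_1, k_2$ and mild distortion, where the iteration converges quickly; the function names and coefficient values are illustrative.

```python
import numpy as np

def distort(xy, k1, k2=0.0):
    """Radial-only forward model on a normalized 2D point."""
    r2 = np.sum(xy**2)
    return xy * (1.0 + k1 * r2 + k2 * r2**2)

def undistort(xy_d, k1, k2=0.0, iters=20):
    """Invert the radial model by fixed-point iteration, starting from the distorted point."""
    xy = xy_d.copy()
    for _ in range(iters):
        r2 = np.sum(xy**2)                      # radius of the current estimate
        xy = xy_d / (1.0 + k1 * r2 + k2 * r2**2)
    return xy

xy_true = np.array([0.4, 0.3])
xy_d = distort(xy_true, k1=-0.2, k2=0.05)
xy_rec = undistort(xy_d, k1=-0.2, k2=0.05)
print(xy_true, xy_d, xy_rec)
```

For strong distortion the fixed-point scheme may converge slowly or not at all, and a Newton-style solver is a common alternative.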

Checkpoint: Where are radial effects usually strongest?

Image periphery where $r$ is large.

Step 7 — Calibration with Zhang's method (outline)

Given a planar checkerboard of known square size:

Detect corners in multiple images at different poses.
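Each view gives a homography between the board plane and the image; each homography contributes two constraints on $K$ .
Solve for an initial $K$ and per-image $R,\mathbf{t}$ in closed form from the homographies, then refine them together with distortion $(k_1, k_2, \ldots)$ via nonlinear least squares.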
Minimize reprojection error:

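\sum_{i,j} \left\| \mathbf{x}_{ij} - \pi(K, \mathbf{k}, R_i, \mathbf{t}_i, \mathbf{X}_j) \right\|^2

where $\mathbf{x}_{ij}$ is observed corner $j$ in image $i$ , $\mathbf{X}_j$ is that corner's known position on the board, $R_i, \mathbf{t}_i$ is the pose of image $i$ , $\mathbf{k}$ collects the distortion coefficients, and $\pi$ is projection including distortion.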

RMS reprojection (px)	Typical interpretation
< 0.3	Excellent
0.3 – 0.7	Usable for many apps
> 1.0	Re-check board size, focus, motion blur
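The RMS figure in the table can be computed from matched observed and reprojected corners. This sketch assumes RMS is taken over the per-point Euclidean pixel distance; tools may normalize differently, so compare numbers only within one convention. The corner coordinates are made up for illustration.

```python
import numpy as np

def rms_reprojection_error(observed, projected):
    """Root mean square of per-point pixel distances between two (N, 2) arrays."""
    residuals = np.asarray(observed, dtype=float) - np.asarray(projected, dtype=float)
    return np.sqrt(np.mean(np.sum(residuals**2, axis=1)))

# Hypothetical detections and their reprojections, in pixels.
observed = np.array([[100.0, 200.0], [150.0, 210.0], [300.0, 120.0]])
projected = observed + np.array([[0.3, 0.4], [0.0, 0.0], [-0.3, 0.4]])

rms = rms_reprojection_error(observed, projected)
print(round(rms, 3))
```

Two of the three points are off by 0.5 pixels and one is exact, which lands this toy example in the middle band of the table.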

Deep dive — coordinate conventions and hand–eye

OpenCV vs OpenGL vs robotics: $Y$ may point down in image rows but up in world; always document which frame $R,\mathbf{t}$ maps between.
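One concrete case, as a sketch: converting extrinsics between two camera-frame conventions. It assumes the commonly documented axes, an OpenCV-style camera frame with $X$ right, $Y$ down, $Z$ forward and an OpenGL-style camera frame with $X$ right, $Y$ up, and the camera looking along $-Z$. The function name is hypothetical; verify the conventions against your own pipeline before relying on it.

```python
import numpy as np

# Flips Y and Z; determinant is +1, so the result is still a rotation.
FLIP_YZ = np.diag([1.0, -1.0, -1.0])

def opencv_to_opengl_extrinsics(R_cv, t_cv):
    """Re-express world -> camera extrinsics in the assumed OpenGL-style camera frame."""
    return FLIP_YZ @ R_cv, FLIP_YZ @ t_cv

R_cv, t_cv = np.eye(3), np.zeros(3)
X_w = np.array([1.0, 2.0, 5.0])

X_cv = R_cv @ X_w + t_cv                  # in front of the camera: Z > 0
R_gl, t_gl = opencv_to_opengl_extrinsics(R_cv, t_cv)
X_gl = R_gl @ X_w + t_gl                  # same point: Y flipped, Z negative
print(X_cv, X_gl)
```

Writing the conversion as one named function, with the frames stated in its docstring, is one way to follow the advice above about documenting which frames $R,\mathbf{t}$ map between.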

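Hand–eye calibration (robotics track preview): estimates the fixed transform between the camera frame and the gripper frame so that detections in pixels can be turned into grasp poses. It needs known robot motions plus a calibration target or other known scene structure.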

Check your understanding

What is the difference between extrinsics and intrinsics?
Why is homogeneous scale arbitrary in projection, yet pixel coordinates are unique after division?
Give one application where ignoring distortion would break downstream geometry.
If $f_x \neq f_y$ , what physical imperfection might that encode?
Why are at least two views needed to triangulate a 3D point?

Lab-style stretch goals

Calibrate with a printed checkerboard (OpenCV calibrateCamera or similar). Report RMS error and show undistorted lines on a straight-edge scene.

Code sketch: Implement project(K, R, t, X_w) returning $(u,v)$ ; verify on one corner of your board.