Module 2 — Geometry & correspondence

Camera models and projection

Homogeneous projection, K and [R|t], worked pixel examples, Brown–Conrady distortion, and Zhang calibration with reprojection error.

~85 min read + exercises

You will connect 3D geometry to 2D image coordinates using the pinhole camera model, then understand intrinsics, extrinsics, distortion, and calibration well enough to implement projection in code.

Figure

Pinhole projection — rays through the camera center

[Figure: rays from the 3D points X₁ and X₂ pass through the camera center C and meet the image plane (z = f) at x₁ and x₂; Z is the optical axis and f the focal length.]
Each 3D point's ray hits the image plane at (x, y) = (f·X/Z, f·Y/Z). Scale is lost — depth and size are entangled in a single image.

Learning objectives

  • Write pinhole projection in camera coordinates and in homogeneous form.
  • Build the intrinsic matrix $K$ and compose $P = K[R|\mathbf{t}]$.
  • Project a 3D world point to pixels with a worked numeric example.
  • Explain radial and tangential distortion and the Brown–Conrady model qualitatively.
  • Describe Zhang's calibration method and reprojection error.
  • State why a single view cannot recover metric scale.

Prerequisites

  • Basic linear algebra: matrix–vector multiply, homogeneous coordinates.
  • Convolution lesson (helpful for understanding image planes as grids).

Step 1 — The pinhole idealization

A pinhole camera maps a 3D point $\mathbf{X}_c = (X, Y, Z)^\top$ in camera coordinates ($Z$ along the optical axis) to the image plane:

$$x = f \frac{X}{Z}, \quad y = f \frac{Y}{Z}$$

$f$ is the focal length in the same units as $X, Y, Z$ (meters on the sensor plane before pixel scaling).
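The two equations above translate almost directly into code. The following is a minimal sketch; the function name `pinhole_project` and the numeric values are illustrative assumptions, not part of the lesson.

```python
def pinhole_project(point_c, f):
    """Project a camera-frame point (X, Y, Z) onto the image plane z = f.

    Returns (x, y) in the same length unit as f.
    """
    X, Y, Z = point_c
    if Z <= 0:
        # The model only covers points in front of the camera; Z -> 0 diverges.
        raise ValueError("point must have Z > 0")
    return (f * X / Z, f * Y / Z)


# Hypothetical values: f = 0.05 m (a 50 mm lens), point 5 m in front of the camera.
x, y = pinhole_project((1.0, 0.5, 5.0), f=0.05)

# Doubling every coordinate of the point leaves the image point unchanged.
x_far, y_far = pinhole_project((2.0, 1.0, 10.0), f=0.05)
print(x, y, x_far, y_far)
```

The second call lands on the same image point as the first, which is the loss of scale noted in the figure caption.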

Checkpoint: What happens as $Z \to 0^{+}$? Why do real lenses fail at macro distances?

The division blows up: $x$ and $y$ grow without bound. Real lenses have a minimum focus distance and a finite aperture, so they are not true pinholes.


Step 2 — Homogeneous coordinates

Augment 3D points: $\tilde{\mathbf{X}} = [X, Y, Z, 1]^\top$. Then

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & \mathbf{t} \end{bmatrix} \tilde{\mathbf{X}}_w$$

  • Extrinsics $[R|\mathbf{t}]$: rigid transform world → camera, $R \in \mathrm{SO}(3)$.
  • Homogeneous scale $\lambda$ is arbitrary until you divide; when the last row of $K$ is $(0, 0, 1)$, $\lambda = Z_c$, the depth of the point in the camera frame.

Exercise: Why do we use 4 components for a 3D point?

So translation becomes matrix multiplication: $\mathbf{X}_c = R\mathbf{X}_w + \mathbf{t}$ is $[R|\mathbf{t}]\tilde{\mathbf{X}}_w$.
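A small numeric check can make the equivalence concrete. This sketch uses plain Python lists; the helper names and the example pose are assumptions chosen for illustration.

```python
import math


def rigid_transform(R, t, X_w):
    """Compute X_c = R X_w + t with R as 3 rows and t, X_w of length 3."""
    return [sum(R[i][j] * X_w[j] for j in range(3)) + t[i] for i in range(3)]


def rigid_transform_homogeneous(R, t, X_w):
    """Same result as a single product of the 3x4 matrix [R|t] with [X, Y, Z, 1]."""
    Rt = [list(R[i]) + [t[i]] for i in range(3)]
    X_h = list(X_w) + [1.0]
    return [sum(Rt[i][j] * X_h[j] for j in range(4)) for i in range(3)]


# Hypothetical pose: rotate 90 degrees about the Z axis, then translate.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
t = [0.1, -0.2, 2.0]
X_w = [1.0, 0.0, 5.0]

a = rigid_transform(R, t, X_w)
b = rigid_transform_homogeneous(R, t, X_w)
print(a, b)
```

Both routes give the same camera-frame point; the appended 1 is what lets the last column of the matrix act as the translation.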


Step 3 — Intrinsics: meters to pixels

$$u = f_x \frac{X_c}{Z_c} + c_x, \quad v = f_y \frac{Y_c}{Z_c} + c_y$$

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

  • $(c_x, c_y)$: principal point, the intersection of the optical axis with the sensor (often near the image center, rarely the exact center pixel).
  • $f_x, f_y$ in pixels: $f_x = f_{\text{mm}} \cdot s_x$ where $s_x$ is pixels per mm (likewise $f_y = f_{\text{mm}} \cdot s_y$).
  • Skew $s$: shear if the sensor axes are not perpendicular; ~0 on modern phones. The expression for $u$ above assumes $s = 0$.
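To see how the bullets above fit together, here is a sketch that builds $K$ from a focal length in mm and a pixel density, then maps a camera-frame point to pixels. The function names and the sensor numbers (4 mm lens, 200 px/mm) are hypothetical.

```python
def make_K(f_mm, px_per_mm_x, px_per_mm_y, cx, cy, skew=0.0):
    """Build the 3x3 intrinsic matrix from physical lens and sensor values."""
    fx = f_mm * px_per_mm_x  # focal length in pixels along u
    fy = f_mm * px_per_mm_y  # focal length in pixels along v
    return [[fx, skew, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def camera_to_pixel(K, X_c):
    """Divide by depth, then apply K (skew term included for completeness)."""
    X, Y, Z = X_c
    x_n, y_n = X / Z, Y / Z
    u = K[0][0] * x_n + K[0][1] * y_n + K[0][2]
    v = K[1][1] * y_n + K[1][2]
    return u, v


# Hypothetical sensor: 4 mm lens, 200 px/mm (5 micron pixels), 640x480 image.
K = make_K(4.0, 200.0, 200.0, cx=320.0, cy=240.0)
u, v = camera_to_pixel(K, (1.0, 0.0, 5.0))
print(K[0][0], u, v)
```

With these assumed numbers $f_x = f_y = 800$ px, so a point on the optical axis lands exactly on the principal point.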

Figure

World → camera → pixels

[Figure: pipeline World $\mathbf{X}_w$ → (extrinsics $[R|\mathbf{t}]$) → Camera $\mathbf{X}_c = R\mathbf{X}_w + \mathbf{t}$ → (divide by $Z$) → Normalized $(X/Z, Y/Z, 1)$ → (intrinsics $K$) → Pixels $(u, v)$.]
Extrinsics [R | t] move a 3D point into the camera frame; intrinsics K turn normalized coordinates into pixels.

Checkpoint: If you crop the center 512×512 from a 4K frame in software (no sensor change), what changes in $K$?

$c_x, c_y$ shift by the crop offset; $f_x, f_y$ are unchanged in pixel units because the crop is a pure translation on the pixel grid.
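The crop case can be sketched in a few lines. The 4K intrinsics below are made-up values (3840×2160 frame, principal point at the exact center), and `crop_intrinsics` is a hypothetical helper name.

```python
def crop_intrinsics(K, x0, y0):
    """Return K for an image cropped so that pixel (x0, y0) becomes (0, 0)."""
    K_crop = [row[:] for row in K]  # copy so the original K is untouched
    K_crop[0][2] -= x0  # principal point shifts by the crop offset
    K_crop[1][2] -= y0
    return K_crop  # fx, fy, skew are unchanged


# Hypothetical 4K camera with the principal point at the image center.
K_full = [[2800.0, 0.0, 1920.0], [0.0, 2800.0, 1080.0], [0.0, 0.0, 1.0]]
x0 = (3840 - 512) // 2  # left edge of the centered 512x512 crop
y0 = (2160 - 512) // 2  # top edge of the centered 512x512 crop
K_crop = crop_intrinsics(K_full, x0, y0)
print(K_crop)
```

Resizing (as opposed to cropping) is a different operation: it would scale $f_x, f_y, c_x, c_y$ together, which this sketch does not cover.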


Step 4 — Worked projection example

World point $\mathbf{X}_w = (1, 0, 5)^\top$ m. Suppose

$$R = I, \quad \mathbf{t} = (0, 0, 0)^\top, \quad K = \begin{bmatrix} 800 & 0 & 320 \\ 0 & 800 & 240 \\ 0 & 0 & 1 \end{bmatrix}$$

Camera coords = world coords. Normalized: $(X_c/Z_c, Y_c/Z_c) = (1/5, 0/5) = (0.2, 0)$. Pixels: $u = 800 \cdot 1/5 + 320 = 480$, $v = 800 \cdot 0/5 + 240 = 240$.
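The same numbers can be reproduced in code. This is a sketch with a hypothetical helper `world_to_pixel` that returns both the normalized coordinates and the pixel coordinates, using the $K$, $R$, $\mathbf{t}$ of this example.

```python
def world_to_pixel(K, R, t, X_w):
    """World point -> camera frame -> normalized coords -> pixels."""
    X_c = [sum(R[i][j] * X_w[j] for j in range(3)) + t[i] for i in range(3)]
    x_n, y_n = X_c[0] / X_c[2], X_c[1] / X_c[2]
    u = K[0][0] * x_n + K[0][1] * y_n + K[0][2]
    v = K[1][1] * y_n + K[1][2]
    return (x_n, y_n), (u, v)


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.0, 0.0, 0.0]

normalized, pixel = world_to_pixel(K, R, t, [1.0, 0.0, 5.0])
print(normalized, pixel)
```

Changing the depth of the input point and re-running is a quick way to check an answer to the exercise that follows.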

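Exercise: Move the point to $Z = 2.5$. How do $u, v$ change? What does that say about apparent size vs depth?

$u = 800 \cdot 1/2.5 + 320 = 640$ and $v$ stays at 240: the offset from the principal point doubles ($160 \to 320$ px), not the raw pixel coordinates. Closer points project larger, so a single image cannot tell a small near object from a large far one.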


Step 5 — Full projection matrix

$$P = K [R \,|\, \mathbf{t}] \in \mathbb{R}^{3 \times 4}, \quad \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \tilde{\mathbf{X}}_w$$

$P$ has 11 degrees of freedom: 12 entries minus one for the arbitrary overall scale, which matches 5 intrinsic parameters ($f_x, f_y, s, c_x, c_y$) plus 6 for pose (3 rotation, 3 translation). Calibration estimates $K$ and distortion; pose estimation finds $R, \mathbf{t}$ per view.
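As a sketch of what "up to scale" means in practice, the code below composes $P$ from the Step 4 values and shows that multiplying $P$ by a constant changes $\lambda$ but not the pixel. The helper names `matmul`, `projection_matrix`, and `apply_P` are assumptions for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product for small matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def projection_matrix(K, R, t):
    """Compose the 3x4 matrix P = K [R | t]."""
    Rt = [list(R[i]) + [t[i]] for i in range(3)]
    return matmul(K, Rt)


def apply_P(P, X_w):
    """Return (u, v, lam) where lam is the homogeneous scale before division."""
    X_h = list(X_w) + [1.0]
    h = [sum(P[i][j] * X_h[j] for j in range(4)) for i in range(3)]
    lam = h[2]
    return h[0] / lam, h[1] / lam, lam


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
P = projection_matrix(K, R, t)

u, v, lam = apply_P(P, [1.0, 0.0, 5.0])

# Scaling every entry of P by 3 scales lam but leaves (u, v) untouched.
P3 = [[3.0 * p for p in row] for row in P]
u3, v3, lam3 = apply_P(P3, [1.0, 0.0, 5.0])
print((u, v, lam), (u3, v3, lam3))
```

For the unscaled $P$ the value of `lam` equals the depth of the point in the camera frame, here 5.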

Checkpoint: Why can you not recover absolute metric scale from one image?

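Scaling the whole scene and the camera translation by the same factor leaves every pixel unchanged, so a single image determines the world only up to a similarity transform. Fix the scale with a known object size or with multi-view triangulation over a known baseline.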
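A short numeric experiment illustrates the ambiguity. The pose, point, and scale factor below are arbitrary assumed values; `world_to_pixel` is a hypothetical helper.

```python
def world_to_pixel(K, R, t, X_w):
    """Project a world point to pixels (no distortion, skew included)."""
    X_c = [sum(R[i][j] * X_w[j] for j in range(3)) + t[i] for i in range(3)]
    x_n, y_n = X_c[0] / X_c[2], X_c[1] / X_c[2]
    return (K[0][0] * x_n + K[0][1] * y_n + K[0][2], K[1][1] * y_n + K[1][2])


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.2, -0.1, 1.0]
X_w = [1.0, 0.5, 4.0]

pixel_small = world_to_pixel(K, R, t, X_w)

# A world 10x larger, viewed from 10x farther away, gives the same pixel.
scale = 10.0
pixel_big = world_to_pixel(K, R, [scale * ti for ti in t], [scale * xi for xi in X_w])
print(pixel_small, pixel_big)
```

Nothing in the pixel coordinates distinguishes the two scenes, which is why an external measurement is needed to pin down metric scale.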


Step 6 — Lens distortion

Real lenses bend rays. Common Brown–Conrady radial model (2D normalized coords $\hat{x}, \hat{y}$):

$$\hat{x}_d = \hat{x}(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \quad r^2 = \hat{x}^2 + \hat{y}^2$$

(similarly for $\hat{y}_d$; tangential terms $p_1, p_2$ model decentering).
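The sketch below applies the radial polynomial above together with tangential terms. The lesson does not spell out the tangential formula, so the form used here is an assumption: it follows the convention commonly used in OpenCV-style calibration, and the coefficient values are made up.

```python
def distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Map undistorted normalized coords (x, y) to distorted ones.

    Radial part: the polynomial from the text.
    Tangential part: assumed OpenCV-style convention for p1, p2.
    """
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


# Hypothetical coefficients: negative k1 pulls points inward (barrel),
# positive k1 pushes them outward (pincushion).
barrel = distort(0.5, 0.0, k1=-0.2)
pincushion = distort(0.5, 0.0, k1=0.2)
center = distort(0.0, 0.0, k1=-0.2, p1=0.01, p2=0.01)
print(barrel, pincushion, center)
```

Distortion acts on normalized coordinates, so $K$ is applied afterwards to get distorted pixel positions.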

Figure

Pinhole vs barrel vs pincushion

[Figure: a straight grid under three lens models: pinhole (no distortion), barrel ($k_1 < 0$), pincushion ($k_1 > 0$). The effect is strongest at the periphery.]
A perfectly straight world grid bends near the edges under real lenses. Calibration estimates the coefficients that undo it.
  • Undistortion: solve for undistorted normalized coords (iterative or closed-form approximations), then apply $K$.
  • SLAM / AR: ignore distortion at your peril on wide-angle phone lenses.
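One simple iterative approach to undistortion is fixed-point iteration: start from the distorted point and repeatedly divide out the current estimate of the distortion. The sketch below is one such scheme under assumed mild coefficients and an OpenCV-style tangential convention; it is not guaranteed to converge for strong distortion.

```python
def distort(x, y, coeffs):
    """Forward model: undistorted -> distorted normalized coords."""
    k1, k2, k3, p1, p2 = coeffs
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


def undistort(x_d, y_d, coeffs, iterations=20):
    """Invert the forward model by fixed-point iteration."""
    k1, k2, k3, p1, p2 = coeffs
    x, y = x_d, y_d  # initial guess: no distortion
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (x_d - dx) / radial
        y = (y_d - dy) / radial
    return x, y


# Hypothetical mild distortion: (k1, k2, k3, p1, p2).
coeffs = (-0.2, 0.05, 0.0, 0.001, -0.0005)
x_true, y_true = 0.3, -0.2
x_d, y_d = distort(x_true, y_true, coeffs)
x_rec, y_rec = undistort(x_d, y_d, coeffs)
print((x_d, y_d), (x_rec, y_rec))
```

The round trip (distort, then undistort) recovering the starting point is a useful self-check before trusting the routine on real data.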

Checkpoint: Where are radial effects usually strongest?

Image periphery where $r$ is large.


Step 7 — Calibration with Zhang's method (outline)

Given a planar checkerboard of known square size:

  1. Detect corners in multiple images at different poses.
  2. Each view gives homography constraints linking board plane to image.
  3. Solve for $K$, distortion $(k_1, k_2, \ldots)$, and per-image $R, \mathbf{t}$ via nonlinear least squares.
  4. Minimize reprojection error:
$$\sum_{i,j} \left\| \mathbf{x}_{ij} - \pi(P_i, \mathbf{X}_j) \right\|^2$$

where $\mathbf{x}_{ij}$ is observed corner $j$ in image $i$, $P_i = K[R_i|\mathbf{t}_i]$ is the projection for image $i$, and $\pi$ is projection including distortion.
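The summed squared error is usually reported as a root-mean-square value in pixels. The sketch below assumes one common convention, dividing by the number of points (each point contributing its squared Euclidean pixel distance); tools differ on this, so check the convention of whichever library you use. The corner coordinates are made up.

```python
import math


def rms_reprojection_error(observed, predicted):
    """RMS pixel distance between observed and reprojected corners.

    Both arguments are equal-length lists of (u, v) tuples in the same order.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    total_sq = sum((uo - up) ** 2 + (vo - vp) ** 2
                   for (uo, vo), (up, vp) in zip(observed, predicted))
    return math.sqrt(total_sq / len(observed))


# Hypothetical data: one corner is off by (0.3, 0.4) px, the other is exact.
observed = [(100.0, 100.0), (200.0, 200.0)]
predicted = [(100.3, 100.4), (200.0, 200.0)]
rms = rms_reprojection_error(observed, predicted)
print(rms)
```

In a real calibration, `predicted` would come from projecting the known board corners through the estimated intrinsics, distortion, and per-image pose.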

RMS reprojection (px)    Typical interpretation
< 0.3                    Excellent
0.3 – 0.7                Usable for many apps
> 1.0                    Re-check board size, focus, motion blur

Deep dive — coordinate conventions and hand–eye

OpenCV vs OpenGL vs robotics: $Y$ may point down in image rows but up in world; always document which frame $R, \mathbf{t}$ maps between.
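As one concrete instance, the usual OpenCV camera frame has x right, y down, z forward, while the usual OpenGL camera frame has x right, y up, and looks down the negative z axis. Converting between them flips the y and z axes. The helper names below are hypothetical, and the sketch assumes those two standard conventions.

```python
def opencv_to_opengl_point(X_c):
    """Re-express a camera-frame point: flip y (down -> up) and z (forward -> backward)."""
    x, y, z = X_c
    return (x, -y, -z)


def opencv_to_opengl_pose(R, t):
    """Convert world -> camera extrinsics by negating the y and z rows.

    If X_cv = R X_w + t, then X_gl = F X_cv with F = diag(1, -1, -1),
    so R_gl = F R and t_gl = F t.
    """
    signs = (1.0, -1.0, -1.0)
    R_gl = [[signs[i] * R[i][j] for j in range(3)] for i in range(3)]
    t_gl = [signs[i] * t[i] for i in range(3)]
    return R_gl, t_gl


# A point 5 units in front of an OpenCV camera, 2 units below the axis.
p_gl = opencv_to_opengl_point((1.0, 2.0, 5.0))

R_cv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_cv = [0.5, 0.25, 2.0]
R_gl, t_gl = opencv_to_opengl_pose(R_cv, t_cv)
print(p_gl, R_gl, t_gl)
```

The flip matrix has determinant +1, so the converted $R$ is still a proper rotation; applying the conversion twice returns the original frame.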

Hand–eye calibration (robotics track preview): relates camera frame to gripper frame so pixels → grasp poses. Needs known motion and calibration target or structure.


Check your understanding

  1. What is the difference between extrinsics and intrinsics?
  2. Why is homogeneous scale arbitrary in projection, yet pixel coordinates are unique after division?
  3. Give one application where ignoring distortion would break downstream geometry.
  4. If $f_x \neq f_y$, what physical imperfection might that encode?
  5. Why are at least two views needed to triangulate a 3D point?

Lab-style stretch goals

Calibrate with a printed checkerboard (OpenCV calibrateCamera or similar). Report RMS error and show undistorted lines on a straight-edge scene.

Code sketch: Implement project(K, R, t, X_w) returning $(u, v)$; verify on one corner of your board.