Features, matching, and robust estimation

This lesson is the bridge between pixels and discrete geometric constraints: detect repeatable keypoints, build descriptors, propose matches, then fit models (homography, essential matrix) despite outliers with RANSAC.

Figure: The classical feature pipeline. Four stages: detect repeatable keypoints, describe them with a vector, find candidate matches, then keep only the geometry-consistent ones.

Learning objectives

Derive the Harris corner criterion from the structure tensor.
Compare SIFT, ORB, and learned local features at a practical level.
Apply Lowe's ratio test and explain mutual nearest neighbor matching.
Fit a homography and essential matrix; know when each model applies.
Compute RANSAC iteration counts from outlier fraction.
Outline epipolar geometry for two calibrated views.

Prerequisites

Convolution / gradients lesson (structure tensor uses $I_x, I_y$).
Camera projection lesson (homogeneous coordinates, $K$).

Step 1 — Structure tensor and Harris corners

In a window $W$, accumulate gradient statistics into the structure tensor $M$:

$$M = \sum_{(x,y)\in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

Eigenvalues $\lambda_1, \lambda_2$ of $M$ characterize local structure:

$\lambda_1, \lambda_2$	Region type
both small	flat — no corner
one large, one small	edge — ambiguous along edge
both large	corner — distinctive

Harris response (one common form):

$$R = \det(M) - k\,(\mathrm{trace}\,M)^2 = \lambda_1\lambda_2 - k(\lambda_1+\lambda_2)^2$$

with $k \approx 0.04$–$0.06$. $R$ is large and positive at corners, negative along edges, and near zero in flat regions. Peaks in $R$ are corner candidates; non-maximum suppression thins them.
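A minimal NumPy sketch of the structure tensor and Harris response defined above. The box window, the synthetic square image, and the helper names `box_sum` and `harris_response` are illustrative assumptions, not part of the lesson; production detectors typically use a Gaussian window and add non-maximum suppression.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window around each pixel (zero-padded)."""
    p = np.pad(a, r)
    out = np.zeros_like(a)
    h, w = a.shape
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out

def harris_response(img, k=0.05, r=2):
    """Harris response R = det(M) - k * trace(M)^2 at every pixel."""
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = box_sum(Ix * Ix, r)   # entries of the structure tensor M
    Syy = box_sum(Iy * Iy, r)
    Sxy = box_sum(Ix * Iy, r)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# Synthetic image (assumption): a bright square on a dark background.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0

R = harris_response(img)
corner = R[10, 10]   # a corner of the square
edge = R[20, 10]     # middle of the left edge
flat = R[20, 20]     # interior of the square
print(corner, edge, flat)
```

On this image the three samples follow the table: the corner response is positive, the edge response is negative, and the flat response is zero.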

Figure: Flat patch vs edge vs corner. A patch only counts as a distinctive feature if shifting it in any direction makes the patch content change.

Checkpoint: Why is a straight step edge a poor unique landmark?

Sliding along the edge does not change appearance — only one large eigenvalue.

Step 2 — Detect → describe → match

Stage	Output	Failure mode
Detector	$(x,y,\sigma)$ keypoints	Poor repeatability under blur or exposure change
Descriptor	vector $\mathbf{d} \in \mathbb{R}^D$	Not invariant to desired transforms
Matcher	pairs $(i,j)$	Ambiguity on repetitive texture

SIFT (classical gold standard)

DoG scale-space extrema for scale selection.
Dominant orientation from gradient histogram.
128-D histogram-of-gradients descriptor, normalized for illumination.

ORB (fast, binary)

FAST corners + oriented BRIEF (binary tests) → Hamming distance.
Common on mobile; less robust to large scale change than SIFT.
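Binary descriptors are compared by counting differing bits. The sketch below computes pairwise Hamming distances with NumPy; the 32-byte (256-bit) descriptor length and the random data are assumptions for illustration, and `hamming_matrix` is a hypothetical helper name.

```python
import numpy as np

def hamming_matrix(A, B):
    """Pairwise Hamming distances between packed binary descriptors.

    A: (n, nbytes) uint8, B: (m, nbytes) uint8 -> (n, m) integer distances.
    """
    xor = A[:, None, :] ^ B[None, :, :]            # differing bits, still packed
    return np.unpackbits(xor, axis=2).sum(axis=2)  # count set bits per pair

rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)  # 256-bit descriptors
desc_b = desc_a.copy()
desc_b[0, 0] ^= 0b00000111      # flip 3 bits in the first descriptor

D = hamming_matrix(desc_a, desc_b)
nearest = D.argmin(axis=1)      # nearest neighbor in desc_b for each row of desc_a
```

XOR followed by a bit count needs only integer operations, which is a large part of why binary descriptors are cheap to match.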

Learned features (SuperPoint, etc.)

A CNN predicts keypoints and descriptors jointly, end-to-end; often more robust on texture-poor scenes, at higher compute cost.

Exercise: List three transforms descriptors target (e.g. rotation) and one that still breaks matchers (e.g. strong specular highlight).

Step 3 — Matching and Lowe's ratio test

Nearest neighbor in descriptor space: $j^* = \arg\min_j \|\mathbf{d}_i - \mathbf{d}'_j\|$.

Lowe's ratio test: accept match $i \to j^*$ only if

$$\frac{\|\mathbf{d}_i - \mathbf{d}'_{j^*}\|}{\|\mathbf{d}_i - \mathbf{d}'_{j^{**}}\|} < \rho$$

(e.g. $\rho = 0.7$–$0.8$), where $j^{**}$ is the second-best match. This rejects ambiguous matches, such as those on a repetitive brick wall.

Mutual consistency: keep $i \to j$ only if $i$ is also the nearest neighbor of $j$ in the reverse direction, under the same metric. This removes many one-way false positives.
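The ratio test and the mutual check can be sketched together with brute-force distances in NumPy. The function name `match_descriptors`, the default `ratio=0.75`, and the toy 4-D descriptors are assumptions for illustration; the third query descriptor has two equally close candidates, mimicking repetitive texture.

```python
import numpy as np

def match_descriptors(d1, d2, ratio=0.75, mutual=True):
    """Return (i, j) pairs that pass Lowe's ratio test and, optionally, mutual NN.

    Assumes d2 contains at least two descriptors.
    """
    # Pairwise Euclidean distances, shape (len(d1), len(d2)).
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(d1))
    keep = dist[rows, best] < ratio * dist[rows, second]
    if mutual:
        back = np.argmin(dist, axis=0)   # best d1 index for each d2 descriptor
        keep &= back[best] == rows
    return [(int(i), int(best[i])) for i in rows[keep]]

d1 = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0]], dtype=float)
d2 = np.array([[0.9, 0.1, 0, 0],
               [0, 0.95, 0.05, 0],
               [0, 0, 1, 0.1],      # two near-identical candidates
               [0, 0.1, 1, 0]])     # for the third query descriptor

matches = match_descriptors(d1, d2)
print(matches)
```

The first two descriptors have a clear winner and are kept; the third is dropped because its best and second-best distances are the same, so the ratio is 1.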

Checkpoint: Why do brick walls break naive matching?

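Many descriptors are nearly equidistant, so the nearest neighbor is essentially arbitrary. Without a distinct gap to the second-nearest candidate, such matches fail the ratio test and are rejected.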

Step 4 — Homography (planar / rotating camera)

If the scene is planar or the camera rotates about its optical center, 2D points relate by a homography $H \in \mathbb{R}^{3\times 3}$ (8 DOF, because $H$ is defined only up to scale):

$$\lambda \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = H \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

DLT: each match gives 2 linear equations in the entries of $H$; 4 matches are the minimum, and more can be used via least squares on inliers.
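A minimal DLT sketch in NumPy, assuming noise-free matches and skipping the coordinate normalization that is generally recommended for numerical stability. The function names and the synthetic `H_true` are illustrative assumptions.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H (3x3, scaled so H[2, 2] = 1) with dst ~ H @ src from >= 4 matches."""
    rows = []
    for (u, v), (up, vp) in zip(src, dst):
        # Two linear equations per match, from u' = (h1.x)/(h3.x), v' = (h2.x)/(h3.x).
        rows.append([-u, -v, -1, 0, 0, 0, up * u, up * v, up])
        rows.append([0, 0, 0, -u, -v, -1, vp * u, vp * v, vp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)       # right singular vector of the smallest singular value
    return H / H[2, 2]             # assumes H[2, 2] is not (near) zero

def apply_homography(H, pts):
    """Map (n, 2) points through H, including the projective division."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

H_true = np.array([[1.1, 0.02, 5.0],
                   [-0.03, 0.95, -3.0],
                   [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [30, 10], [20, 70]], dtype=float)
dst = apply_homography(H_true, src)

H_est = homography_dlt(src, dst)
```

With exactly 4 matches the 8x9 system has a one-dimensional null space; with more, the smallest singular vector gives the algebraic least-squares solution.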

Exercise: Panorama stitching of a flat mural — homography or essential matrix? Why?

Homography — plane induces projective warp between views.

Step 5 — RANSAC and iteration budget

RANSAC loop:

Sample minimal set (4 for homography, 5 for essential matrix in calibrated case).
Fit model; count inliers within tolerance $\epsilon$ (pixels or Sampson distance).
Keep best model; refine on all inliers.

Probability that at least one sample is all-inlier in $k$ iterations:

$$P_{\text{success}} = 1 - (1 - (1-e)^s)^k$$

where $e$ is the outlier fraction and $s$ the sample size. Solving for $k$ at a desired $P_{\text{success}}$ gives $k = \lceil \log(1 - P_{\text{success}}) / \log(1 - (1-e)^s) \rceil$.

Example: $e=0.5$, $s=4$, want 99% success: $(1-0.5^4)^k = 0.01 \Rightarrow k = \ln 0.01 / \ln 0.9375 \approx 71.4$, so $k = 72$ iterations.

Figure: RANSAC keeps the line with the most votes. Inliers fall inside an ε-tolerance band around the candidate model. Outliers don't influence the chosen line, which is why RANSAC beats least squares here.
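The loop can be sketched on the line-fitting case from the figure. To stay short, this swaps in a 2D line model (minimal sample of 2 points) for the homography; the function name, tolerance, iteration count, and synthetic data are all assumptions.

```python
import numpy as np

def ransac_line(points, n_iters=100, eps=0.1, seed=0):
    """Fit a 2D line a*x + b*y + c = 0 (with a^2 + b^2 = 1) using RANSAC."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # 1. Sample a minimal set: two distinct points define a line.
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        normal = np.array([p[1] - q[1], q[0] - p[0]], dtype=float)
        norm = np.linalg.norm(normal)
        if norm == 0:
            continue                      # coincident points, no line
        normal /= norm
        c = -normal @ p
        # 2. Count inliers inside the eps tolerance band (perpendicular distance).
        inliers = np.abs(points @ normal + c) < eps
        # 3. Keep the model with the most votes.
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refine: total least squares on all inliers of the best model.
    pts = points[best_inliers]
    centroid = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - centroid)
    normal = Vt[-1]
    return normal, -normal @ centroid, best_inliers

# Synthetic data (assumption): 60 points near y = 0.5 x + 1, plus 40 uniform outliers.
rng = np.random.default_rng(42)
x_in = rng.uniform(0, 10, 60)
inlier_pts = np.column_stack([x_in, 0.5 * x_in + 1 + rng.normal(0, 0.02, 60)])
outlier_pts = rng.uniform([0, -5], [10, 10], size=(40, 2))
points = np.vstack([inlier_pts, outlier_pts])

normal, c, inliers = ransac_line(points)
slope, intercept = -normal[0] / normal[1], -c / normal[1]
# Plain least squares on all points, for comparison.
ls_slope, ls_intercept = np.polyfit(points[:, 0], points[:, 1], 1)
```

Comparing `slope, intercept` with `ls_slope, ls_intercept` shows the contrast: the least-squares fit is pulled by every outlier, while the RANSAC fit depends only on points inside the band.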

Checkpoint: Outlier fraction doubles. What happens to the required $k$?

It grows steeply: once all-inlier samples are rare, $k$ scales roughly as $1/(1-e)^s$. This RANSAC cost is why good descriptors and a ratio-test front end matter.
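To get a feel for that growth, the success-probability formula can be solved for $k$ and evaluated at a few outlier fractions. The helper name `ransac_iterations` and the chosen fractions are assumptions for illustration.

```python
import math

def ransac_iterations(outlier_fraction, sample_size, p_success=0.99):
    """Smallest k with 1 - (1 - (1 - e)^s)^k >= p_success."""
    good_sample = (1.0 - outlier_fraction) ** sample_size  # P(sample is all inliers)
    if good_sample <= 0.0:
        raise ValueError("no all-inlier sample is possible when e = 1")
    if good_sample >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - good_sample))

# Doubling the outlier fraction twice, for a homography (s = 4).
counts = {e: ransac_iterations(e, sample_size=4) for e in (0.2, 0.4, 0.8)}
for e, k in counts.items():
    print(f"e = {e}: k = {k}")
```

For $s = 4$ and 99% success this gives 9, 34, and 2876 iterations: each doubling of $e$ costs far more than a doubling of $k$.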

Step 6 — Epipolar geometry (two views, general 3D)

For calibrated cameras, normalized coordinates $\hat{\mathbf{x}} = K^{-1}\mathbf{x}$ and $\hat{\mathbf{x}}' = K'^{-1}\mathbf{x}'$ satisfy the epipolar constraint:

$$\hat{\mathbf{x}}'^\top E \hat{\mathbf{x}} = 0$$

Here $E = [\mathbf{t}]_\times R$ is the essential matrix (5 DOF). The uncalibrated case uses the fundamental matrix $F$ (7 DOF), which acts on pixel coordinates directly.

Epipolar line: in image 2, the match for $\hat{\mathbf{x}}$ lies on the line $l' = E\hat{\mathbf{x}}$, so the search is 1D instead of 2D.
Pose from $E$: decompose $E$ into four $(R,\mathbf{t})$ pairs; disambiguate with cheirality (points in front of both cameras).
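The constraint, the epipolar lines, and the four-way decomposition can be checked on synthetic data. This is a sketch under stated assumptions: the relative pose convention $X_2 = R X_1 + \mathbf{t}$, the noise-free random points, and the helper names are all illustrative, and the SVD-based decomposition is the standard textbook construction rather than anything specific to this lesson.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]], dtype=float)

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by E (t has unit length)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:      # force proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def count_in_front(Rc, tc, x1, x2):
    """Cheirality: count points with positive depth in both cameras."""
    n = 0
    for a, b in zip(x1, x2):
        # Solve z1 * (Rc @ a) - z2 * b = -tc for the two depths.
        A = np.stack([Rc @ a, -b], axis=1)
        z, *_ = np.linalg.lstsq(A, -tc, rcond=None)
        n += int(z[0] > 0 and z[1] > 0)
    return n

# Hypothetical relative pose: 10 degree rotation about y, mostly sideways translation.
angle = np.deg2rad(10)
R = np.array([[np.cos(angle), 0, np.sin(angle)],
              [0, 1, 0],
              [-np.sin(angle), 0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R

rng = np.random.default_rng(1)
X1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))  # points in camera-1 frame
X2 = X1 @ R.T + t
x1 = X1 / X1[:, 2:3]     # normalized homogeneous coordinates (x, y, 1)
x2 = X2 / X2[:, 2:3]

residuals = np.einsum("ni,ij,nj->n", x2, E, x1)   # x2^T E x1 for each point
lines2 = x1 @ E.T                                  # epipolar lines l' = E x1 in image 2

candidates = decompose_essential(E)
counts = [count_in_front(Rc, tc, x1, x2) for Rc, tc in candidates]
R_best, t_best = candidates[int(np.argmax(counts))]
```

Only one candidate places every point in front of both cameras. The recovered translation has unit length, since $E$ fixes only its direction.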

Triangulation: with known $P, P'$ and correspondences, least-squares triangulation (DLT or midpoint) yields 3D points. The reconstruction is metric only if the baseline length is known in metric units; otherwise it is defined up to a global scale.
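A minimal linear (DLT) triangulation sketch for one point. The intrinsics `K`, the 0.5-unit baseline, and the test point are made-up values for illustration; with noisy measurements one would usually refine this linear estimate by minimizing reprojection error.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one point from two 3x4 projection matrices.

    x1, x2 are (u, v) pixel coordinates in the respective views.
    """
    # Each view contributes two equations: u * (P[2] . X) - P[0] . X = 0, etc.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# Second camera: same orientation, shifted 0.5 units along x (t = -0.5 in its frame).
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 5.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

Because the baseline here is given in known units, `X_est` comes out in those same units; scaling the baseline would scale the reconstructed point with it.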

Model	DOF	When valid
Homography	8	Planar scene / pure rotation
Essential $E$	5	General 3D, calibrated
Fundamental $F$	7	General 3D, uncalibrated

Deep dive — failure modes in production

Symptom	Likely cause
Panorama tears	Parallax — non-planar scene, homography wrong
Few inliers	Motion blur, exposure change, repetitive texture
Ghost duplicates	Symmetric structures, wrong second-best in ratio test
Drift in VO	Pure rotation misread as translation; without parallax, translation is not observable

Check your understanding

What is the difference between a feature detector and a descriptor?
Why does least-squares homography on all matches fail?
Name two sources of false matches unrelated to descriptor distance.
How many DOF does $E$ have, and why fewer than 9 entries?
When is a homography exactly valid for a 3D scene?

Lab-style stretch goals

Match two desk photos with ORB or SIFT: visualize all matches, then RANSAC homography inliers vs outliers. Repeat with a scene that has depth variation — watch inlier count drop.

Stretch: Estimate $F$ with RANSAC, draw epipolar lines on a few points (OpenCV findFundamentalMat).