Pixels, convolution, and edges

Here you will treat an image as a function on a grid and build intuition for linear filtering — the same family of operations that generalizes into the first layers of convolutional neural networks.

Figure: A 3×3 kernel sweeping across pixels. At every location the kernel multiplies neighbors by its weights and sums them; the figure uses Sobel-x weights.

Learning objectives

Represent grayscale and multi-channel images as $\mathrm{I}[x,y,c]$ on a discrete lattice.
Apply 2D convolution with small kernels by hand on a numeric patch.
Derive gradients, Sobel operators, and the Canny pipeline step by step.
Explain separable kernels and count operations saved.
Connect linear filters to frequency intuition (low-pass vs high-pass).
Recognize when intensity edges do not imply geometric edges.

Prerequisites

Lesson: Light, sensors, and the imaging pipeline (pixel values, linear vs sRGB).
Basic idea of averaging and weighted sums.

Step 1 — Images as discrete signals

Let $\mathrm{I}[x,y]$ be intensity at integer coordinates $(x,y)$.

Domain: $0 \le x < W$, $0 \le y < H$.
Channels: color images stack $c \in \{R,G,B\}$ or luma–chroma (e.g. Y in YCbCr) — many filters run per-channel; others (color edge detectors) mix channels deliberately.
Boundary handling: zero padding, reflect, replicate, or wrap. For a $k \times k$ kernel, the choice of policy changes the output within $\lfloor k/2 \rfloor$ pixels of the border.

Checkpoint: Why do boundary pixels behave differently under convolution almost no matter what you do?

The kernel window extends outside the image; synthetic border values invent content that is not in the scene.
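The four boundary policies can be compared on a single row. This is a minimal sketch that assumes NumPy's `np.pad` as the padding tool (the lesson does not prescribe a library); NumPy names zero padding `constant` and replicate `edge`.

```python
import numpy as np

row = np.array([1, 2, 3, 4])
pad = 2  # enough for a 5-tap kernel, since floor(5 / 2) = 2

padded = {
    "zero": np.pad(row, pad, mode="constant", constant_values=0),
    "reflect": np.pad(row, pad, mode="reflect"),
    "replicate": np.pad(row, pad, mode="edge"),
    "wrap": np.pad(row, pad, mode="wrap"),
}
for name, values in padded.items():
    print(f"{name:9s} {values.tolist()}")
# zero      [0, 0, 1, 2, 3, 4, 0, 0]
# reflect   [3, 2, 1, 2, 3, 4, 3, 2]
# replicate [1, 1, 1, 2, 3, 4, 4, 4]
# wrap      [3, 4, 1, 2, 3, 4, 1, 2]
```

The interior samples are identical in all four cases; only the invented border samples differ, and those are what the kernel sees near the edge.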

Step 2 — Convolution vs correlation

In vision libraries, conv2d often implements cross-correlation:

(I * K)[x,y] = \sum_{i}\sum_{j} I[x+i, y+j]\, K[i,j]

True convolution flips the kernel: $K'[i,j] = K[-i,-j]$. For symmetric kernels (Gaussian, Laplacian) the distinction vanishes. Sobel-x is antisymmetric under the flip ($K'_x = -K_x$), so correlation and convolution outputs differ only in sign, which you can absorb into downstream logic.
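The flip can be checked numerically. The sketch below uses hypothetical helper names (`correlate2d_valid`, `convolve2d_valid`) written for this lesson, with arrays indexed `[row, col]`, that is `[y, x]`; it is an illustration, not any particular library's implementation.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Cross-correlation over positions where the kernel fits entirely (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def convolve2d_valid(image, kernel):
    """True convolution: correlate with the kernel flipped along both axes."""
    return correlate2d_valid(image, kernel[::-1, ::-1])

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
binomial = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0  # symmetric, Gaussian-like

image = np.arange(25, dtype=float).reshape(5, 5) ** 2  # arbitrary test image

corr = correlate2d_valid(image, sobel_x)
conv = convolve2d_valid(image, sobel_x)

# Flipping Sobel-x negates it, so the two results differ only in sign.
assert np.allclose(conv, -corr)
# A symmetric kernel is unchanged by the flip, so the results coincide.
assert np.allclose(convolve2d_valid(image, binomial),
                   correlate2d_valid(image, binomial))
```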

Worked example (3×3 patch): Suppose a neighborhood (row-major) is

\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}

and kernel $K = \frac{1}{9}\mathbf{1}_{3\times 3}$ (box filter). Output = $(1+2+1+0+0+0+1+2+1)/9 = 8/9 \approx 0.89$.

Exercise: With zero padding on a 4×4 image, how many output positions does a 3×3 kernel produce? (Answer: 16, i.e. the output is still 4×4, if you pad by 1 on each side; with no padding only the 2×2 interior positions remain.)
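Both the worked example and the padding exercise can be reproduced in a few lines. This is a sketch under the assumption of odd kernel sizes; `filter_same_zero` is a hypothetical helper name, not a library function.

```python
import numpy as np

patch = np.array([[1, 2, 1],
                  [0, 0, 0],
                  [1, 2, 1]], dtype=float)
box = np.full((3, 3), 1.0 / 9.0)

# One output value: multiply elementwise and sum.
center_response = np.sum(patch * box)
print(round(center_response, 2))  # 0.89

def filter_same_zero(image, kernel):
    """Correlate with zero padding so the output has the input's shape."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

image = np.ones((4, 4))
out = filter_same_zero(image, box)
print(out.shape)            # (4, 4)
print(round(out[0, 0], 3))  # 0.444: only 4 of the 9 samples under the corner window are real
print(round(out[1, 1], 3))  # 1.0: the interior window lies fully inside the image
```

The corner value shows the border effect directly: a constant image does not stay constant under zero padding.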

Step 3 — Gaussian smoothing and scale

Continuous 2D Gaussian:

G_\sigma(x,y) = \frac{1}{2\pi\sigma^2} e^{-(x^2+y^2)/(2\sigma^2)}

Discrete kernels are typically truncated at a radius of about $3\sigma$. A larger $\sigma$ gives more blur, which is the idea behind scale space: structures much smaller than $\sigma$ are averaged away in the smoothed image.
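One common way to build the discrete kernel is to sample the Gaussian on integer taps, cut it off at $\lceil 3\sigma \rceil$, and renormalize so the weights sum to 1 (the renormalization replaces the $\frac{1}{2\pi\sigma^2}$ factor). The function names below are hypothetical, and the exact truncation rule is an assumption; libraries differ in the details.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Sample a 1D Gaussian on integer taps, truncate at ceil(3*sigma), normalize to sum 1."""
    radius = int(np.ceil(3.0 * sigma))
    taps = np.arange(-radius, radius + 1)
    weights = np.exp(-(taps ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

def gaussian_kernel_2d(sigma):
    """2D kernel as the outer product of the 1D kernel with itself."""
    k = gaussian_kernel_1d(sigma)
    return np.outer(k, k)

for sigma in (1.0, 2.0):
    kernel = gaussian_kernel_2d(sigma)
    print(sigma, kernel.shape, round(kernel.sum(), 6))
```

With this rule, $\sigma = 1$ gives a 7×7 kernel and $\sigma = 2$ gives a 13×13 kernel, so the support grows linearly with the chosen scale.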

Kernel	Role
Box 3×3	Fast, blocky frequency response
Gaussian	Smooth, no sharp nulls in spectrum
Median 3×3	Nonlinear — removes salt-and-pepper, preserves step edges better than mean
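The difference between the mean and the median is easiest to see in 1D. The sketch below uses a made-up signal with one salt spike and one step, and assumes NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# One impulse (255) followed by a genuine step from 10 to 50.
signal = np.array([10, 10, 255, 10, 10, 50, 50, 50], dtype=float)
windows = sliding_window_view(signal, 3)  # every length-3 neighborhood, no padding

mean3 = windows.mean(axis=1)
median3 = np.median(windows, axis=1)

print(np.round(mean3, 1).tolist())  # [91.7, 91.7, 91.7, 23.3, 36.7, 50.0]
print(median3.tolist())             # [10.0, 10.0, 10.0, 10.0, 50.0, 50.0]
```

The mean smears the spike over three outputs and turns the step into a ramp; the median removes the spike and keeps the step sharp.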

Checkpoint: If you smooth before edge detection, what happens to edge maps visually and why?

Noise spikes shrink; true edges widen and weaken — threshold trade-off moves.
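The trade-off in this answer can be made concrete on a synthetic row containing one noise spike and one real step. The sketch assumes a 3-tap binomial kernel as a stand-in for a small Gaussian and forward differences as the gradient.

```python
import numpy as np

# Noise spike at index 3, genuine step between indices 6 and 7.
row = np.array([10, 10, 10, 40, 10, 10, 10, 50, 50, 50, 50], dtype=float)
smooth_kernel = np.array([1, 2, 1]) / 4.0  # Gaussian-like binomial weights

raw_grad = np.diff(row)
smoothed = np.convolve(row, smooth_kernel, mode="valid")
smooth_grad = np.diff(smoothed)

print(raw_grad.tolist())     # [0.0, 0.0, 30.0, -30.0, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0]
print(smooth_grad.tolist())  # [7.5, 7.5, -7.5, -7.5, 10.0, 20.0, 10.0, 0.0]
```

The spike response drops from 30 to 7.5, while the step response drops only from 40 to 20 but now spans three samples. The edge-to-noise ratio improves, and the edge becomes wider and weaker, so any fixed threshold has to move.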

Step 4 — Gradients and discrete derivatives

Forward difference:

G_x[x,y] \approx I[x+1,y] - I[x,y]

Central difference (better symmetry):

G_x[x,y] \approx \frac{I[x+1,y] - I[x-1,y]}{2}

Gradient magnitude and orientation:

\|\nabla I\| = \sqrt{G_x^2 + G_y^2}, \quad \theta = \operatorname{atan2}(G_y, G_x)
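In code, magnitude and orientation are elementwise operations on the two gradient images. A small sketch with made-up gradient values, assuming NumPy:

```python
import numpy as np

gx = np.array([[3.0, 0.0], [-1.0, 0.0]])
gy = np.array([[4.0, 2.0], [0.0, 0.0]])

magnitude = np.hypot(gx, gy)   # sqrt(gx**2 + gy**2), elementwise
theta = np.arctan2(gy, gx)     # radians in [-pi, pi]

print(magnitude.tolist())                        # [[5.0, 2.0], [1.0, 0.0]]
print(np.round(np.rad2deg(theta), 2).tolist())   # [[53.13, 90.0], [180.0, 0.0]]
```

Note that `arctan2(0, 0)` returns 0, so the orientation is not meaningful where the magnitude is zero.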

Figure: Step edge → peaked gradient. Forward differences on a 1D row of pixels: the gradient lights up exactly where intensity jumps. The 2D story is the same in each direction.

Exercise: Row $[10, 10, 10, 50, 50, 50]$. Compute the central difference $G_x$ at the step (index 3, counting from 0). Where does $|G_x|$ peak?
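One way to check your answer numerically, assuming 0-based indexing and leaving the border samples undefined rather than padding:

```python
import numpy as np

row = np.array([10, 10, 10, 50, 50, 50], dtype=float)

forward = row[1:] - row[:-1]           # defined at indices 0..4
central = (row[2:] - row[:-2]) / 2.0   # defined at indices 1..4

print(forward.tolist())  # [0.0, 0.0, 40.0, 0.0, 0.0]
print(central.tolist())  # [0.0, 20.0, 20.0, 0.0]
```

The forward difference responds at a single index with the full jump of 40. The central difference gives 20 at both index 2 and index 3: the response is symmetric about the step but spread over two pixels at half the height.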

Step 5 — Sobel and structured derivatives

Sobel-x and Sobel-y (unnormalized classic form):

K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}

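The $[1, 2, 1]$ weighting perpendicular to the derivative direction (the $\times 2$ center weights) acts as a small smoothing filter, so each Sobel kernel approximates a Gaussian-smoothed derivative and is less sensitive to isolated noise than bare differences.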
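The smoothing-times-derivative structure can be verified by building both kernels as outer products. A short sketch, assuming NumPy and a made-up patch containing a vertical step edge:

```python
import numpy as np

smooth = np.array([1, 2, 1])    # binomial smoothing taps
diff = np.array([-1, 0, 1])     # central difference taps (unnormalized)

sobel_x = np.outer(smooth, diff)  # smooth across rows, differentiate along x
sobel_y = np.outer(diff, smooth)  # differentiate along y, smooth across columns

patch = np.array([[10, 10, 50],
                  [10, 10, 50],
                  [10, 10, 50]])  # vertical step edge

gx = int(np.sum(patch * sobel_x))  # correlation response at the patch center
gy = int(np.sum(patch * sobel_y))
print(sobel_x.tolist())  # [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
print(gx, gy)            # 160 0
```

The vertical edge produces a strong horizontal gradient and no vertical gradient, as expected.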

Checkpoint: Why is “differentiate then smooth” equivalent to “smooth then differentiate” for linear operators?

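Convolution is associative and commutative, and a discrete derivative is itself a convolution with a small kernel $D$, so $G * (D * I) = (G * D) * I = D * (G * I)$. In continuous notation: $G * (\partial I) = \partial (G * I)$.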
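The identity can be checked in 1D with integer data, where the equality is exact. The sketch assumes NumPy's `np.convolve` in its default full mode; with truncated or padded borders the two orders can disagree near the ends.

```python
import numpy as np

signal = np.array([3, 1, 4, 1, 5, 9, 2, 6])
smooth = np.array([1, 2, 1])    # unnormalized smoothing kernel
deriv = np.array([1, 0, -1])    # central difference written as a convolution kernel

smooth_then_diff = np.convolve(np.convolve(signal, smooth), deriv)
diff_then_smooth = np.convolve(np.convolve(signal, deriv), smooth)

# Associativity also lets us merge both kernels into one and filter once.
combined_kernel = np.convolve(smooth, deriv)
single_pass = np.convolve(signal, combined_kernel)

print(combined_kernel.tolist())  # [1, 2, 0, -2, -1]
assert np.array_equal(smooth_then_diff, diff_then_smooth)
assert np.array_equal(smooth_then_diff, single_pass)
```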

Step 6 — Canny edge detector (full pipeline)

Canny (1986) is still the reference classical edge detector:

Gaussian smooth — control $\sigma$ .
Gradient magnitude and angle — often Sobel.
Non-maximum suppression (NMS): keep a pixel only if it is a local maximum along the gradient direction (thin edges).
Double threshold: pixels at or above $T_h$ are strong, pixels between $T_l$ and $T_h$ are weak. Strong pixels are edges; weak pixels are kept only if connected to a strong pixel (hysteresis).
Optional morphological cleanup.
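The non-maximum suppression step of the pipeline above can be sketched as follows. This is a simplified version under stated assumptions: the gradient direction is quantized to four sectors (0°, 45°, 90°, 135°) rather than interpolated, border pixels are left at zero, and the function name is hypothetical.

```python
import numpy as np

def non_max_suppression(mag, theta):
    """Zero out pixels that are not local maxima along the gradient direction.

    mag, theta: 2D arrays of gradient magnitude and angle (radians), indexed [y, x].
    """
    height, width = mag.shape
    out = np.zeros_like(mag)
    # Direction matters only modulo 180 degrees; quantize to 0, 45, 90, 135.
    sector = np.round((np.rad2deg(theta) % 180.0) / 45.0).astype(int) % 4
    # (dy, dx) step along the gradient for each sector, with y growing downward.
    steps = {0: (0, 1), 1: (1, 1), 2: (1, 0), 3: (1, -1)}
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            dy, dx = steps[int(sector[y, x])]
            m = mag[y, x]
            # Strict on one side so a two-pixel plateau keeps a single pixel.
            if m > mag[y + dy, x + dx] and m >= mag[y - dy, x - dx]:
                out[y, x] = m
    return out

# Vertical ridge: magnitude peaks in column 2, gradient points along +x (theta = 0).
mag = np.zeros((5, 5))
mag[:, 1], mag[:, 2], mag[:, 3] = 4.0, 10.0, 4.0
thin = non_max_suppression(mag, np.zeros((5, 5)))
print(thin[2].tolist())  # [0.0, 0.0, 10.0, 0.0, 0.0]
```

Only the ridge column survives: the weaker flanking columns lose the comparison against their neighbor along the gradient.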

Parameter	Too low	Too high
$\sigma$	Noisy, cluttered edges	Miss thin structures
$T_h$	Everything is an edge	Broken contours
$T_l$	Streaks of weak edges	Gaps in boundaries
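The roles of $T_h$ and $T_l$ are easiest to see in a small hysteresis sketch. The function below is a hypothetical helper that assumes 8-connectivity and treats values at or above a threshold as passing it; real implementations may differ in both choices.

```python
import numpy as np
from collections import deque

def hysteresis(mag, t_low, t_high):
    """Return a boolean edge map: strong pixels plus weak pixels 8-connected to them."""
    strong = mag >= t_high
    weak = (mag >= t_low) & ~strong
    edges = strong.copy()
    height, width = mag.shape
    queue = deque(zip(*np.nonzero(strong)))  # grow outward from every strong pixel
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and weak[ny, nx] and not edges[ny, nx]:
                    edges[ny, nx] = True
                    queue.append((ny, nx))
    return edges

mag = np.array([[0, 0, 0, 0, 0],
                [9, 5, 5, 0, 5],
                [0, 0, 0, 0, 0]], dtype=float)
edges = hysteresis(mag, t_low=4, t_high=8)
print(edges[1].tolist())  # [True, True, True, False, False]
```

The two weak pixels next to the strong one are kept because they form a chain back to it; the isolated weak pixel on the right is dropped.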

Exercise: Why does NMS require knowing gradient orientation, not just magnitude?

Without orientation you cannot decide which two neighbors to compare: the comparison runs across the edge, along the gradient direction, not along the ridge itself.

Step 7 — Separable kernels

If $K = \mathbf{k}\,\mathbf{k}^\top$, then

I * K = (I * \mathbf{k}) * \mathbf{k}^\top

Cost: $2WHk$ vs $WHk^2$ multiplies for a $k\times k$ kernel; for $k=11$ that is $121/22 = 5.5\times$ fewer multiplies.
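Both the equivalence and the operation count can be checked directly. The sketch reuses a naive valid-mode correlation (a hypothetical helper, not a library call) and a 3-tap binomial kernel as the separable example.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Naive cross-correlation over positions where the kernel fits entirely."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

k = np.array([1.0, 2.0, 1.0]) / 4.0
K = np.outer(k, k)  # 3x3 separable kernel

image = np.arange(36, dtype=float).reshape(6, 6) ** 2  # arbitrary test image

full_2d = correlate2d_valid(image, K)
rows_done = correlate2d_valid(image, k.reshape(1, 3))     # horizontal 1D pass
two_pass = correlate2d_valid(rows_done, k.reshape(3, 1))  # vertical 1D pass
assert np.allclose(full_2d, two_pass)

def multiplies_per_pixel(taps):
    """Return (separable, naive) multiply counts per output pixel."""
    return 2 * taps, taps ** 2

sep, naive = multiplies_per_pixel(11)
print(sep, naive, naive / sep)  # 22 121 5.5
```

The saving grows linearly with the kernel width, since the ratio $k^2 / 2k$ equals $k/2$.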

Figure: Separable kernel, 2D = row ⊗ column. A 3×3 Gaussian-like kernel as the outer product of two 1D kernels. Two 1D passes replace one 2D pass, which saves more as kernels grow.

Exercise: A 1D Gaussian has 11 taps. How many multiplies per pixel for separable 2D vs naive 11×11?

Separable: $11+11=22$; full: $121$.

Step 8 — Frequency intuition (short)

Convolution in space is multiplication in frequency: low-pass kernels attenuate high frequencies (noise, fine texture); high-pass kernels (Laplacian, second derivative) suppress slowly varying regions and emphasize edges.
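The low-pass and high-pass labels can be read off the magnitude of each kernel's Fourier transform. The sketch below works in 1D with three 3-tap kernels and assumes NumPy's `np.fft.rfft`; the binomial kernel stands in for a small Gaussian.

```python
import numpy as np

n = 96  # FFT length; bin 32 then sits exactly at omega = 2*pi/3
kernels = {
    "box": np.array([1, 1, 1]) / 3.0,
    "binomial": np.array([1, 2, 1]) / 4.0,
    "laplacian": np.array([1, -2, 1], dtype=float),
}
# Magnitude response |H(omega)| from DC (bin 0) to Nyquist (last bin).
response = {name: np.abs(np.fft.rfft(k, n=n)) for name, k in kernels.items()}

for name, mag in response.items():
    print(f"{name:9s} DC={mag[0]:.2f}  2pi/3={mag[32]:.2f}  Nyquist={mag[-1]:.2f}")
# box       DC=1.00  2pi/3=0.00  Nyquist=0.33
# binomial  DC=1.00  2pi/3=0.25  Nyquist=0.00
# laplacian DC=0.00  2pi/3=3.00  Nyquist=4.00
```

The box filter has a null at $\omega = 2\pi/3$ followed by a side lobe, which is the source of its blocky behavior. The binomial kernel decays smoothly to zero at Nyquist. The Laplacian rejects constant regions entirely and has its largest gain at the highest frequency.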

Laplacian of Gaussian (LoG): $\nabla^2 (G_\sigma * I)$ acts as a blob detector at scale $\sigma$, and its zero-crossings locate edges. Used historically; today learned filters subsume much of this.
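The zero-crossing idea is visible in a 1D analogue: smooth a step, take the second difference, and look for a sign change. This is a sketch only; it assumes a 3-tap binomial kernel in place of a true Gaussian and a made-up row.

```python
import numpy as np

row = np.array([10, 10, 10, 10, 50, 50, 50, 50])  # step between indices 3 and 4
smooth = np.array([1, 2, 1])         # unnormalized Gaussian-like kernel
second_diff = np.array([1, -2, 1])   # 1D Laplacian

smoothed = np.convolve(row, smooth, mode="valid")            # centers at indices 1..6
response = np.convolve(smoothed, second_diff, mode="valid")  # centers at indices 2..5

# A zero-crossing lies between neighbors whose responses have opposite signs.
# Adding 2 maps positions in `response` back to indices in `row`.
crossings = np.nonzero(response[:-1] * response[1:] < 0)[0] + 2
print(response.tolist(), crossings.tolist())  # [40, 40, -40, -40] [3]
```

The response is positive on the dark side of the step and negative on the bright side, and the sign flips between indices 3 and 4, exactly where the intensity jumps.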

Deep dive — edges are not always “object boundaries”

Edge cause	Geometric boundary?
Depth discontinuity	Often yes
Cast shadow	No — same surface
Texture (stripes)	No
Specular highlight	No
Albedo change (paint)	Sometimes

Intensity-only edge detectors cannot disambiguate these without depth, motion, stereo, or learning.

Bridge to the next lessons

Corners (next module) need variation in two directions — built from gradient structure tensors.
CNNs replace hand-designed $K$ with learned filters but keep locality and translation equivariance.

Check your understanding

What is the difference between an edge due to depth discontinuity vs cast shadow — can intensity alone always tell them apart?
Why do CNNs use small kernels repeatedly rather than one giant kernel?
Name two boundary policies and one artifact each can introduce.
In Canny, what problem does hysteresis solve?
Why is median filtering not equivalent to a single convolution kernel?

Lab-style stretch goals

Implement Sobel magnitude + NMS + hysteresis on grayscale (or use OpenCV Canny and compare your thresholds).

Color: Convert to LAB, run edges on L channel only vs each RGB channel — when does chroma create spurious edges?