Summary of "CAP5415 Lecture 9 [Features - Part 2] - Fall 2020"

Summary — main ideas and lessons

Recap of prior topics

Corner / interest-point detection: measure change in intensity when a small window is shifted.
- Large change in multiple directions → corner (interest point).
- Large change in one direction → edge.
- Small change → flat region.
Common methods: autocorrelation and gradient-based approximations (e.g., Harris corner detector). Windows can be uniform or Gaussian-weighted.
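
The gradient-based approximation can be sketched as follows. This is a minimal illustration, not the lecture's code: the helper names, the Gaussian window width `sigma`, and the constant `kappa = 0.05` are assumptions chosen for the example. The response is positive at corners, negative along edges, and near zero in flat regions.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # Normalized 1-D Gaussian, truncated at about 3 sigma.
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    # Separable Gaussian window: filter rows, then columns (zero padding).
    k = gaussian_kernel1d(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def harris_response(img, sigma=1.5, kappa=0.05):
    # Image derivatives (np.gradient returns d/dy first, then d/dx).
    iy, ix = np.gradient(img.astype(float))
    # Gaussian-weighted sums of derivative products (structure tensor entries).
    sxx = smooth(ix * ix, sigma)
    syy = smooth(iy * iy, sigma)
    sxy = smooth(ix * iy, sigma)
    det = sxx * syy - sxy**2
    trace = sxx + syy
    # Harris measure: large positive for corners, negative for edges.
    return det - kappa * trace**2

# Usage: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
print(R[8, 8] > 0, R[8, 16] < 0)  # corner vs. edge midpoint
```
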
Histogram of Oriented Gradients (HOG): divide the image into cells → compute a gradient-orientation histogram per cell → group neighboring cells into blocks and normalize each block → concatenate the normalized block vectors to form the descriptor.
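
A rough sketch of the HOG recap above. The cell size (8 pixels), block size (2×2 cells), 9 unsigned bins, and the function name `hog_descriptor` are common choices assumed for illustration; the lecture summary does not fix them, and real implementations also interpolate votes between bins.

```python
import numpy as np

def hog_descriptor(img, cell=8, block=2, bins=9):
    iy, ix = np.gradient(img.astype(float))
    mag = np.hypot(ix, iy)
    # Unsigned orientation in [0, 180) degrees.
    ang = np.rad2deg(np.arctan2(iy, ix)) % 180.0
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)

    ny, nx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ny, nx, bins))
    for i in range(ny):
        for j in range(nx):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            # Magnitude-weighted orientation histogram for this cell.
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)

    feats = []
    for i in range(ny - block + 1):
        for j in range(nx - block + 1):
            v = hist[i:i + block, j:j + block].ravel()
            # L2-normalize each overlapping block of cells.
            feats.append(v / np.sqrt(np.sum(v**2) + 1e-6))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
print(hog_descriptor(rng.random((32, 32))).shape)
```

For a 32×32 image this gives 4×4 cells, 3×3 overlapping blocks, and 36 values per block.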

A good interest point detector identifies locations where a small window shift causes significant intensity change; corners produce large changes in multiple directions, edges in one direction, and flat regions produce little change.

Introduction to SIFT (Scale-Invariant Feature Transform)

Purpose: find keypoints and extract descriptors that are invariant to translation, rotation, and scale (and robust to other imaging changes).
The SIFT pipeline is commonly broken into four main stages:
1. Scale-space extrema detection — find blob-like structures across scale (introduces scale invariance).
2. Keypoint localization — refine locations and discard low-contrast or unstable detections.
3. Orientation assignment — assign one or more dominant orientations to each keypoint for rotation invariance.
4. Keypoint description — extract a fixed-length descriptor from a normalized patch around each keypoint.

Scale invariance and blob detection

A detector should give similar responses for the same scene structure at different image scales; a fixed-size kernel on differently scaled images will not do that unless the kernel scale is adapted.
The Laplacian-of-Gaussian (LoG, “Mexican hat”) is a blob detector: its scale-normalized response peaks when the kernel scale (sigma) matches the blob size.
SIFT approximates LoG with a Difference-of-Gaussians (DoG): blur the image with two Gaussians at nearby sigmas and subtract the results. This reuses blurred images that are needed anyway, so it is cheaper than convolving with LoG kernels directly.
Processing is done across multiple scales and reduced-resolution copies (octaves). Extrema are found by comparing each location with neighbors in the same and adjacent scales.
Full SIFT includes post-processing to filter unstable points and edge responses (not detailed in the lecture).
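
Why DoG approximates LoG can be seen from a short derivation. The notation is an assumption introduced here, not taken from the lecture: G(x, y, σ) is the Gaussian kernel and k > 1 is the ratio between the two nearby sigmas. The argument follows the standard heat-equation relation used in the original SIFT paper.

```latex
\begin{aligned}
\frac{\partial G}{\partial \sigma} &= \sigma \nabla^2 G
  && \text{(Gaussian satisfies the heat equation)} \\
\sigma \nabla^2 G &\approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}
  && \text{(finite difference in } \sigma\text{)} \\
G(x, y, k\sigma) - G(x, y, \sigma) &\approx (k - 1)\,\sigma^2 \nabla^2 G
\end{aligned}
```

The right-hand side is the scale-normalized LoG, σ²∇²G, times the constant factor (k − 1), which is the same at every scale and so does not change where the extrema are.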

Orientation assignment

Compute image derivatives (dx, dy) in the patch around each keypoint to obtain per-pixel gradient magnitude and orientation.
Build an orientation histogram (0–360° in the lecture) weighted by gradient magnitudes and typically by a Gaussian window centered on the keypoint.
Identify peaks in the histogram. Each sufficiently strong peak yields a keypoint orientation; multiple orientations per location are allowed.
To achieve rotation invariance, rotate the patch (or align descriptor bins) so the dominant orientation is treated as “north.”
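
The orientation step can be sketched like this. The 36 bins and the 80%-of-maximum peak rule come from Lowe's paper rather than from the lecture, and the function name and the Gaussian window width (a quarter of the patch width) are assumptions made for the example. Angles are measured in image coordinates (y pointing down).

```python
import numpy as np

def dominant_orientations(patch, bins=36, peak_ratio=0.8):
    iy, ix = np.gradient(patch.astype(float))
    mag = np.hypot(ix, iy)
    ang = np.rad2deg(np.arctan2(iy, ix)) % 360.0

    # Gaussian spatial weight centered on the keypoint (patch center).
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = w / 4.0  # assumed window width
    weight = np.exp(-((yy - (h - 1) / 2) ** 2 + (xx - (w - 1) / 2) ** 2)
                    / (2 * sigma**2))

    # Orientation histogram over 0-360 degrees, weighted by magnitude and window.
    bin_idx = (ang / (360.0 / bins)).astype(int) % bins
    hist = np.bincount(bin_idx.ravel(), weights=(mag * weight).ravel(),
                       minlength=bins)

    # Circular local maxima that reach peak_ratio of the highest bin.
    left, right = np.roll(hist, 1), np.roll(hist, -1)
    is_peak = (hist > left) & (hist >= right) & (hist >= peak_ratio * hist.max())
    centers = (np.flatnonzero(is_peak) + 0.5) * (360.0 / bins)
    return centers, hist

# Usage: intensity increasing equally in x and y gives one peak near 45 degrees.
yy, xx = np.mgrid[0:16, 0:16]
peaks, _ = dominant_orientations((xx + yy).astype(float))
print(peaks)
```

Each returned angle would spawn its own oriented keypoint; a fuller implementation would also interpolate the peak position between bins.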

Descriptor extraction (SIFT descriptor)

For each keypoint:
- Extract a canonical patch aligned to the keypoint’s scale and orientation (lecture: 16×16 window after scaling/resizing).
- Divide the 16×16 patch into a 4×4 grid of subregions (cells) → 16 cells total.
- For each cell compute an orientation histogram (lecture: 8 bins).
- Concatenate the 16 histograms → 16 × 8 = 128-dimensional descriptor.
- Normalize the descriptor vector and clamp large values (lecture referenced a clipping threshold; in practice values are often clipped at 0.2 then renormalized) to reduce illumination/contrast effects.
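Comparison to HOG: both are built from gradient-orientation histograms. SIFT uses a 4×4 grid of cells × 8 bins = 128 dimensions, computed on a scale- and rotation-normalized patch and normalized once over the whole vector; HOG often uses 9 bins per cell, is computed on a fixed image grid, and normalizes over blocks of cells.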
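
The normalize, clip, renormalize step is small enough to show directly. This is a sketch assuming the 0.2 threshold mentioned above; the function name is made up for the example.

```python
import numpy as np

def normalize_clip(v, clip=0.2):
    v = np.asarray(v, dtype=float)
    # Unit length removes a global contrast (gain) change.
    v = v / max(np.linalg.norm(v), 1e-12)
    # Clipping limits the influence of a few very large gradient magnitudes.
    v = np.minimum(v, clip)
    # Renormalize so descriptors stay comparable.
    return v / max(np.linalg.norm(v), 1e-12)

# Usage: one dominant entry is damped, and scaling the input changes nothing.
v = np.ones(128)
v[0] = 100.0
out = normalize_clip(v)
print(np.allclose(out, normalize_clip(3 * v)))
```

Note that after the final renormalization individual entries can exceed 0.2 again; the point of clipping is to reduce the relative weight of dominant entries, not to bound the final values.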

Detailed SIFT methodology (step-by-step)

Build scale-space (octaves and scales)
- Convolve the input with Gaussians at increasing sigma values to form progressively blurred images.
- For efficiency, form octaves: downsample the image between octaves and repeat blurring within each octave.
Compute Difference-of-Gaussians (DoG)
- Subtract adjacent Gaussian-blurred images within each octave to produce DoG images (an LoG approximation).
- Repeat across several scales per octave.
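
The two steps above (scale space, then DoG) might look like the following sketch. The parameter values (`sigma0 = 1.6`, `s = 3` intervals per octave, hence `s + 3` blurred images and `s + 2` DoG images per octave) follow Lowe's paper and are not from the lecture; the function names are hypothetical, and the input is assumed to have negligible blur of its own.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian blur with reflect padding.
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="reflect")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, tmp)

def dog_pyramid(img, n_octaves=3, s=3, sigma0=1.6):
    k = 2.0 ** (1.0 / s)  # sigma ratio between adjacent scales
    base = gaussian_blur(img.astype(float), sigma0)
    pyramid = []
    for _ in range(n_octaves):
        gauss = [base]
        for i in range(1, s + 3):
            # Incremental blur so that level i has total sigma0 * k**i.
            prev, nxt = sigma0 * k ** (i - 1), sigma0 * k**i
            gauss.append(gaussian_blur(gauss[-1], np.sqrt(nxt**2 - prev**2)))
        # Subtract adjacent blurred images to get the DoG stack.
        pyramid.append(np.stack([b - a for a, b in zip(gauss[:-1], gauss[1:])]))
        # Next octave: take the image with twice the base sigma, halve resolution.
        base = gauss[s][::2, ::2]
    return pyramid

rng = np.random.default_rng(0)
for dogs in dog_pyramid(rng.random((64, 64))):
    print(dogs.shape)
```
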
Detect scale-space extrema (candidate keypoints)
- For each pixel in each DoG image, compare its value to its 26 neighbors: 8 in the same scale, 9 in the scale above, 9 in the scale below.
- If it is a local maximum or minimum compared to these neighbors, mark it as a candidate keypoint.
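
The 26-neighbor test can be written directly, if slowly. This sketch assumes `dog` is one octave's DoG images stacked as an array of shape (scales, height, width); the function name and the optional `threshold` on the absolute response are additions for illustration.

```python
import numpy as np

def scale_space_extrema(dog, threshold=0.0):
    n, h, w = dog.shape
    keypoints = []
    # Skip the first and last scale: they lack a neighbor scale on one side.
    for s in range(1, n - 1):
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                v = dog[s, y, x]
                if abs(v) <= threshold:
                    continue
                # 3x3x3 cube around the pixel; index 13 is the pixel itself.
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2].ravel()
                neighbors = np.delete(cube, 13)  # the 26 neighbors
                if v > neighbors.max() or v < neighbors.min():
                    keypoints.append((s, y, x))
    return keypoints

# Usage: a single bump in the middle scale is the only extremum.
dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 1.0
print(scale_space_extrema(dog))
```
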
Keypoint localization and filtering (brief)
- Fit a local quadratic (Taylor expansion) to refine the location and scale (not detailed in the lecture).
- Discard low-contrast keypoints or those along edges (post-processing details omitted).
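
The lecture omits the filtering details. As a hedged illustration of the edge test only, the sketch below uses the principal-curvature ratio check from Lowe's paper (with his suggested `r = 10`); the function name is hypothetical, and the quadratic refinement and contrast threshold are not shown.

```python
import numpy as np

def passes_edge_test(dog_img, y, x, r=10.0):
    # 2x2 Hessian of the DoG image at (y, x) from finite differences.
    dxx = dog_img[y, x + 1] - 2 * dog_img[y, x] + dog_img[y, x - 1]
    dyy = dog_img[y + 1, x] - 2 * dog_img[y, x] + dog_img[y - 1, x]
    dxy = (dog_img[y + 1, x + 1] - dog_img[y + 1, x - 1]
           - dog_img[y - 1, x + 1] + dog_img[y - 1, x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy**2
    if det <= 0:
        # Curvatures of opposite sign (or degenerate): not a stable blob.
        return False
    # Keep the point only if the curvature ratio is below r.
    return trace**2 / det < (r + 1) ** 2 / r

# Usage: a round blob passes, an elongated ridge (edge-like) is rejected.
yy, xx = np.mgrid[-2:3, -2:3]
blob = -(xx**2 + yy**2).astype(float)
ridge = -(xx**2) - 0.001 * yy**2
print(passes_edge_test(blob, 2, 2), passes_edge_test(ridge, 2, 2))
```
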
Orientation assignment
- For pixels in a region around each keypoint compute gradient magnitude and orientation.
- Form an orientation histogram (0–360°), weight by gradient magnitude (and typically a Gaussian spatial window).
- Select dominant orientation(s): peaks above a threshold determine keypoint orientations; multiple peaks produce multiple oriented keypoints.
Descriptor formation (per keypoint and per orientation)
- Extract a patch around the keypoint scaled to a canonical size (lecture: 16×16).
- Rotate patch so dominant orientation is aligned (rotation normalization).
- Divide patch into a 4×4 grid (cells typically 4×4 pixels for a 16×16 patch).
- For each cell compute an 8-bin orientation histogram, weighted by gradient magnitude (and optionally a spatial Gaussian).
- Concatenate the 16 histograms → 128-D vector.
- Normalize the vector, clamp large values (lecture mentioned clipping threshold), and renormalize to improve illumination invariance.
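
Putting the descriptor steps together gives roughly the following. This is a simplified sketch under several assumptions: the 16×16 patch is taken to be already scaled and rotated to its canonical frame, the optional Gaussian weighting and the interpolation between bins used in full SIFT are left out, the clip value 0.2 is the commonly used one, and the function name is made up.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, bins=8, clip=0.2):
    if patch.shape != (16, 16):
        raise ValueError("expected a 16x16 canonical patch")
    iy, ix = np.gradient(patch.astype(float))
    mag = np.hypot(ix, iy)
    ang = np.rad2deg(np.arctan2(iy, ix)) % 360.0
    bin_idx = (ang / (360.0 / bins)).astype(int) % bins

    cell = patch.shape[0] // grid  # 4x4 pixels per cell
    hists = []
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            # 8-bin orientation histogram, weighted by gradient magnitude.
            hists.append(np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins))
    v = np.concatenate(hists)  # 16 cells x 8 bins = 128 values

    # Normalize, clip large entries, renormalize.
    v = v / max(np.linalg.norm(v), 1e-12)
    v = np.minimum(v, clip)
    return v / max(np.linalg.norm(v), 1e-12)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (16, 16)).astype(float)
d = sift_like_descriptor(patch)
print(d.shape, np.allclose(d, sift_like_descriptor(2 * patch + 10)))
```

The usage lines check the illumination claim: doubling the contrast and adding a constant brightness offset leaves the descriptor unchanged, because the offset vanishes in the gradients and the gain is removed by normalization.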

Practical notes and comparisons

The SIFT detector finds blobs (LoG/DoG extrema), while Harris finds corners; both kinds of interest point are useful, and detector and descriptor can be mixed (e.g., computing SIFT descriptors at Harris keypoints).
Local descriptors like HOG and SIFT are template/histogram-based, rely on image derivatives (edges), and tend to be robust, distinctive, and computationally efficient.
These gradient-based descriptors generally operate on intensity (edges) and do not use color information.

Speakers / sources mentioned

Course lecturer (CAP5415 instructor delivering the lecture)
Dr. David G. Lowe — author of the original SIFT paper (referenced as “Dr Lo” in the transcript)
Students/questions: Zai (asked about point meanings), Brett (mentioned in transcript)
Methods/sources referenced: Harris corner detector, HOG, Laplacian-of-Gaussian (LoG / “Mexican hat”), Difference-of-Gaussians (DoG) approximation

Key takeaways

SIFT provides scale- and rotation-invariant keypoints and distinctive 128-D descriptors built from gradient histograms.
DoG approximates the computationally expensive LoG, enabling efficient multi-scale blob detection.
Orientation assignment and descriptor normalization/clipping are critical steps for achieving invariance and robustness.