Summary of "R-CNN Explained"
Summary of "R-CNN Explained" Video
Main Ideas and Concepts
Introduction to Object Detection and R-CNN
- R-CNN stands for "Regions with CNN features."
- This video is the first in a series covering major object detection methods from R-CNN to YOLOv8.
- Object detection differs from image classification and localization in that it must find and classify every object instance in an image, not just a single one.
Difference Between Image Classification, Localization, and Object Detection
- Classification: Assign a single class label to the entire image.
- Localization: Predict a bounding box around a single object along with its class.
- Detection: Identify and locate multiple objects of different classes in an image with bounding boxes.
Bounding Boxes
- Boxes are usually axis-aligned rectangles defined either by:
- Top-left and bottom-right coordinates (X1, Y1, X2, Y2), or
- Center coordinates plus width and height.
- Boxes should tightly cover objects without including unnecessary background.
- Coordinates can be absolute pixel values or normalized/scaled values.
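As a small illustration of the two coordinate formats and of normalization, here is a minimal sketch; the helper names are illustrative, not from the video.

```python
def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) corner format to (cx, cy, w, h) center format."""
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def center_to_corners(cx, cy, w, h):
    """Convert (cx, cy, w, h) center format back to corner format."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def normalize(x1, y1, x2, y2, img_w, img_h):
    """Scale absolute pixel coordinates into the [0, 1] range."""
    return x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h
```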
Challenges in Object Detection
- Detecting multiple objects of different sizes and aspect ratios requires scanning many possible windows (crops) across scales and shapes.
- This brute-force sliding window approach is computationally expensive.
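To make the cost of the brute-force approach concrete, the sketch below enumerates windows over a few scales and aspect ratios; the specific scales, ratios, and stride are illustrative assumptions, not values from the video.

```python
def sliding_windows(img_w, img_h, sizes=(64, 128, 256),
                    aspect_ratios=(0.5, 1.0, 2.0), stride=16):
    """Enumerate candidate crops across scales, aspect ratios, and positions."""
    for size in sizes:
        for ar in aspect_ratios:
            w = int(size * ar ** 0.5)
            h = int(size / ar ** 0.5)
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, x + w, y + h)

# Even on a modest 500x375 image this produces thousands of crops,
# each of which would need its own CNN forward pass.
print(sum(1 for _ in sliding_windows(500, 375)))
```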
R-CNN Approach
- Instead of classifying all possible windows, R-CNN uses a region proposal method to reduce candidate windows.
- The method used is Selective Search, which proposes regions likely to contain objects.
- Only these proposed regions are processed by the CNN, reducing computation.
Selective Search for Region Proposals
- Two-step process:
- Graph Segmentation: Image pixels are nodes in a graph, edges weighted by similarity (color, texture, etc.). Components (regions) are merged based on edge weights until no more merges are possible.
- Region Merging: Smaller regions are merged into larger ones based on similarity metrics (color, texture, size, shape).
- This generates a hierarchy of regions at multiple scales.
- Selective Search returns roughly 2,000 region proposals per image in the standard R-CNN setup.
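A readily available Selective Search implementation ships with OpenCV's contrib module; the sketch below shows how proposals can be generated with it. This is not necessarily the implementation used in the video, and the image path is a placeholder.

```python
import cv2  # requires the opencv-contrib-python package

img = cv2.imread("image.jpg")  # placeholder path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # graph segmentation + hierarchical region merging
rects = ss.process()               # array of (x, y, w, h) proposals
print(len(rects), "proposals")     # typically a few thousand per image
```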
Training the R-CNN Model
- Start with a CNN (AlexNet) pretrained on ImageNet.
- Replace the last classification layer to match detection classes plus background.
- Resize region proposals to fixed input size (227x227), adding some context padding.
- Assign labels to proposals based on Intersection over Union (IoU) with ground truth boxes:
- Proposals with IoU > 0.5 are positive (object class).
- Others are background.
- Fine-tune the CNN on these labeled proposals using cross-entropy loss.
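The IoU computation and the 0.5 labeling rule can be written out directly. The sketch below is illustrative: class label 0 standing for background is an assumption, not something stated in the video.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_boxes, gt_labels, pos_thresh=0.5):
    """Assign the best-matching class if overlap reaches the 0.5 threshold,
    otherwise background (label 0 is assumed to mean background)."""
    best_iou, best_label = 0.0, 0
    for box, label in zip(gt_boxes, gt_labels):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_label = overlap, label
    return best_label if best_iou >= pos_thresh else 0
```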
Support Vector Machines (SVMs) for Classification
- After fine-tuning, the CNN’s fully connected layer outputs are used as features.
- Train one linear SVM per class to classify proposals as positive or negative.
- SVM training uses a stricter IoU threshold:
- Only the exact ground truth box is positive.
- Proposals with IoU < 0.3 are negative.
- Proposals with IoU of at least 0.3 that are not ground truth boxes are ignored during SVM training.
- Hard negative mining is used to focus training on difficult negative examples.
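A rough sketch of the per-class SVM stage on precomputed CNN features, using scikit-learn. The regularization constant, the initial negative sampling, and the details of the mining loop are illustrative assumptions rather than the exact procedure from the video.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svm(pos_feats, neg_feats, rounds=3):
    """Train one linear SVM for a single class with a simple hard-negative-mining loop."""
    rng = np.random.default_rng(0)
    # Start from all positives and a random subset of the (large) negative pool.
    idx = rng.choice(len(neg_feats), size=min(5000, len(neg_feats)), replace=False)
    active_neg = neg_feats[idx]
    svm = LinearSVC(C=0.001)  # C value is illustrative
    for _ in range(rounds):
        X = np.vstack([pos_feats, active_neg])
        y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(active_neg))])
        svm.fit(X, y)
        # Hard negatives: background features the current SVM still scores near or above the margin.
        scores = svm.decision_function(neg_feats)
        hard = neg_feats[scores > -1.0]
        active_neg = np.unique(np.vstack([active_neg, hard]), axis=0)
    return svm
```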
Bounding Box Regression
- To improve localization accuracy, R-CNN trains a class-specific bounding box regressor.
- This linear regression model predicts adjustments to the proposal box coordinates to better fit the object.
- Trained using features from the CNN and ground truth boxes with IoU > 0.6.
- During inference, predicted adjustments refine the bounding box locations.
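The box parameterization commonly described for R-CNN's regressor can be sketched as follows; the helper names are mine, and only the target/refinement math is shown (the linear model on CNN features is omitted).

```python
import numpy as np

def to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

def regression_targets(proposal, gt):
    """Targets (tx, ty, tw, th) mapping a proposal box toward its ground-truth box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return (gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)

def apply_deltas(proposal, deltas):
    """At inference time, refine a proposal with predicted (tx, ty, tw, th)."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = deltas
    cx, cy = px + tx * pw, py + ty * ph
    w, h = pw * np.exp(tw), ph * np.exp(th)
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2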
Non-Maximum Suppression (NMS)
- Multiple proposals often overlap the same object.
- NMS removes redundant detections by:
- Sorting predictions by confidence score.
- Iteratively selecting the highest scoring box and removing boxes with IoU > 0.5 overlap.
- Applied separately for each class to yield final detections.
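The greedy NMS loop described above is short enough to write out. A minimal per-class sketch follows; the IoU helper is repeated so the snippet stands alone.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it by more than
    iou_thresh, and repeat until no boxes remain. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```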
Results and Performance
- Without fine-tuning: ~44.7% mean average precision (mAP) on PASCAL VOC 2007.
- With fine-tuning: improves to ~54.2%.
- Adding bounding box regression: ~58.5%.
- Using a deeper network (VGG) further improves accuracy to ~66% at the cost of higher computation.
- R-CNN significantly outperforms previous methods in detection accuracy.
Discussion on Design Choices
- Different IoU thresholds are used during fine-tuning and SVM training to balance positive sample counts and prevent overfitting.
- SVMs help improve discriminative power and localization precision, which fine-tuning alone cannot fully achieve.
Category: Educational