Object Detection
YOLO, R-CNN families, anchor-based and anchor-free methods.
Anchor-Free Detection
Anchor-free detectors eliminate predefined anchor boxes by predicting objects directly, either per pixel with a class score and distances to the four box edges (FCOS) or as peaks in a center-point heatmap (CenterNet). This removes a major source of hyperparameter tuning while matching or exceeding the accuracy of comparable anchor-based detectors.
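The per-pixel idea can be made concrete with a minimal sketch of FCOS-style regression targets. The function names and the strict "inside the box" test are illustrative assumptions, not a reference implementation; the centerness formula follows the FCOS formulation.

```python
import math

def fcos_targets(x, y, box):
    """Distances from location (x, y) to the left, top, right, bottom box edges.

    box is (x1, y1, x2, y2). Returns None when the location is not strictly
    inside the box (treated as background in this sketch).
    """
    x1, y1, x2, y2 = box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
    if min(l, t, r, b) <= 0:
        return None
    return (l, t, r, b)

def centerness(l, t, r, b):
    """1.0 at the box center, falling toward 0 near the box edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

targets = fcos_targets(2, 5, (0, 0, 10, 10))  # (2, 5, 8, 5)
score = centerness(*targets)                  # sqrt(2/8 * 5/5)
```

No anchor shapes or scales appear anywhere: each location regresses the box that contains it, and centerness is used to down-weight predictions made far from the object center.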
DETR (Detection Transformer)
DETR reformulates object detection as a direct set prediction problem using a transformer encoder-decoder architecture with bipartite matching, eliminating the need for anchors, non-maximum suppression, and most hand-designed components.
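As a rough illustration of bipartite matching, the sketch below finds the one-to-one assignment of predictions to ground-truth objects with the lowest total cost. It uses a brute-force search over permutations, which is only practical for tiny examples; real implementations use the Hungarian algorithm. The cost values are made up for illustration, and in DETR the cost combines class probability and box terms.

```python
from itertools import permutations

def match(cost):
    """Return the lowest-cost one-to-one matching and its total cost.

    cost[i][j] is the cost of assigning prediction i to target j.
    Assumes at least one target and at least as many predictions as targets.
    """
    n_pred, n_tgt = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    # Each candidate picks a distinct prediction for every target.
    for preds in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][j] for j, p in enumerate(preds))
        if total < best_total:
            best_total, best_assign = total, preds
    pairs = [(p, j) for j, p in enumerate(best_assign)]
    return pairs, best_total

# Three predictions, two targets (illustrative costs).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = match(cost)  # prediction 1 -> target 0, prediction 0 -> target 1
```

Prediction 2 is left unmatched and would be trained toward the "no object" class. Because every target is matched to exactly one prediction, duplicate detections are penalized during training rather than filtered afterwards.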
Fast R-CNN and Faster R-CNN
Fast R-CNN shares convolutional computation across all proposals via RoI pooling and trains the detector in a single stage with a multi-task loss, while Faster R-CNN replaces external proposals with a learned Region Proposal Network (RPN) to achieve near-real-time detection at ~5 FPS (VGG-16 on a GPU).
Feature Pyramid Network
Feature Pyramid Networks (FPN) build a multi-scale feature hierarchy by combining top-down semantically strong features with bottom-up spatially precise features through lateral connections, enabling robust detection of objects at all sizes.
Focal Loss
Focal loss down-weights the contribution of easy, well-classified examples during training by multiplying the cross-entropy loss by a modulating factor (1 - p_t)^\gamma, mitigating the extreme foreground-background class imbalance that limits single-stage detector accuracy.
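A minimal sketch of the binary form, assuming p is the predicted foreground probability and y is the 0/1 label. The optional class-balancing weight is omitted here, and gamma=2.0 is only an illustrative default.

```python
import math

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Cross-entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_ce = focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 recovers plain cross-entropy
easy_fl = focal_loss(0.9, 1)             # scaled by (1 - 0.9) ** 2
hard_fl = focal_loss(0.1, 1)             # scaled by (1 - 0.1) ** 2
```

With gamma = 2, an example classified correctly with probability 0.9 contributes about 1% of its cross-entropy loss, while a badly misclassified example at 0.1 keeps about 81% of it.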
Intersection over Union
Intersection over Union (IoU) measures the overlap between two bounding boxes as the ratio of their intersection area to their union area, serving as the standard measure of localization quality in object detection.
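The ratio can be computed in a few lines. This sketch assumes axis-aligned boxes in (x1, y1, x2, y2) corner format with x2 > x1 and y2 > y1; other conventions, such as center format or inclusive pixel coordinates, need a conversion first.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

Identical boxes give 1.0 and disjoint boxes give 0.0, so the value is always in [0, 1].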
Multi-Scale Detection
Multi-scale detection addresses the challenge of recognizing objects that vary enormously in size (from a few pixels to thousands) within a single image, using strategies ranging from image pyramids to feature pyramids to scale-aware architectures.
Non-Maximum Suppression
Non-maximum suppression (NMS) is a greedy post-processing algorithm that removes duplicate detections by iteratively keeping the highest-scoring box and discarding all boxes that overlap with it above an IoU threshold.
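The greedy loop can be sketched as follows, assuming boxes in (x1, y1, x2, y2) format and a single class. The 0.5 threshold is an illustrative default, and production code would use a vectorized library routine instead.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # box 1 overlaps box 0 with IoU of about 0.68
```

Here the second box is suppressed as a duplicate of the first, while the third survives because it does not overlap either of them.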
R-CNN
Region-based Convolutional Neural Network (R-CNN) applies a deep CNN to each of ~2,000 region proposals independently, achieving a dramatic leap in detection accuracy while being prohibitively slow at 47 seconds per image at test time (VGG-16 on a GPU).
Sliding Window and Region Proposals
Sliding windows exhaustively scan every location and scale in an image, while region proposals intelligently suggest a small subset of likely object locations to dramatically reduce computation.
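To give a feel for the cost of exhaustive scanning, the sketch below enumerates square windows of a single size. The image size, window size, and stride are illustrative values, not taken from any particular detector.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield (x1, y1, x2, y2) for every win x win window that fits in the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)

# One window size on a 640x480 image, stepping 16 pixels at a time.
count = sum(1 for _ in sliding_windows(640, 480, win=64, stride=16))
```

A single 64-pixel window at stride 16 already produces 999 crops on a 640x480 image, and every additional scale or aspect ratio adds its own full scan. Region proposals avoid this growth by evaluating only a fixed budget of candidate boxes.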
SSD (Single Shot MultiBox Detector)
SSD performs object detection in a single forward pass by predicting bounding boxes and class scores from multiple convolutional feature maps at different scales; with 300x300 inputs it reaches 59 FPS on a Titan X GPU with accuracy competitive with two-stage detectors.
YOLO (You Only Look Once)
YOLO frames object detection as a single regression problem from image pixels to bounding box coordinates and class probabilities, enabling real-time detection by processing the entire image in one pass through the network.
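A minimal sketch of how a ground-truth box becomes a regression target in the original YOLO formulation: the image is divided into an S x S grid, and the cell containing the box center is responsible for predicting it. The function name is hypothetical, and S=7 with a 448x448 input follows the first YOLO version; later versions differ.

```python
def encode_yolo_target(box, img_w, img_h, S=7):
    """Map an (x1, y1, x2, y2) box to its grid cell and normalized targets."""
    x1, y1, x2, y2 = box
    # Box center, normalized to [0, 1] relative to the image.
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    # Grid cell that contains the center (clamped for centers on the far edge).
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    # Center offset within the cell; width and height relative to the image.
    x_off = cx * S - col
    y_off = cy * S - row
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return row, col, (x_off, y_off, w, h)

row, col, target = encode_yolo_target((100, 100, 300, 300), 448, 448)
```

For this box the center (200, 200) falls in row 3, column 3 of the 7x7 grid. All four target values lie in [0, 1], which is what lets the network treat detection as regression over a fixed-size output tensor.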