Evaluation & Datasets
Benchmarks, metrics, and standard datasets.
Benchmark Leaderboards
Benchmark leaderboards – hosted on Papers With Code and on the COCO and ImageNet evaluation servers – standardize model comparison, drive competitive progress, and shape research priorities, but they also bias work toward benchmark-specific optimization.
Classification Metrics
Classification metrics – accuracy, precision, recall, F1, and their variants – quantify model performance from different angles, with the choice of metric depending on class balance, error costs, and deployment context.
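As a rough illustration of why the choice of metric depends on class balance, the sketch below computes the four metrics for a binary task from raw labels. The function name `binary_metrics` and the toy labels are assumptions made for this example, not part of any library.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is empty (a common convention, not the only one).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced data: nine negatives, one positive, and a model that
# always predicts the negative class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the always-negative model reaches 90% accuracy while its recall and F1 are zero, which is the kind of gap that makes accuracy alone misleading on imbalanced classes.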
Detection Metrics
Object detection evaluation uses mean Average Precision (mAP), computed from precision-recall curves at one or more IoU thresholds; the COCO protocol (AP@[.50:.05:.95]), which averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05, is the standard benchmark.
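The IoU thresholds behind the COCO notation can be made concrete with a small sketch. It covers only the box-matching step (IoU and the ten thresholds), not the full precision-recall integration; `box_iou` and the example boxes are hypothetical names and values chosen for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds in AP@[.50:.05:.95]: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

prediction = (0, 0, 9, 9)
ground_truth = (0, 0, 10, 10)
iou = box_iou(prediction, ground_truth)

# Thresholds at which this prediction would count as a true positive.
passed = [t for t in COCO_IOU_THRESHOLDS if iou >= t]
print(iou, passed)
```

A detection that is a true positive at a loose threshold can become a false positive at a strict one, so averaging AP over all ten thresholds rewards tighter localization.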
Generative Model Metrics
Generative model quality is measured by FID (distance between real and generated feature distributions, lower is better), Inception Score (diversity and quality, higher is better), CLIP Score (text-image alignment), LPIPS (perceptual similarity), and KID (an alternative to FID with an unbiased estimator, suited to small samples).
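To show what "distribution distance" means for FID, here is a minimal sketch of the Fréchet distance between two Gaussians in the one-dimensional special case. Real FID fits multivariate Gaussians to Inception-network features and needs a matrix square root; the scalar version, the function name `frechet_distance_1d`, and the toy feature values are simplifying assumptions for illustration only.

```python
import math
import statistics


def frechet_distance_1d(real, generated):
    """Squared Frechet distance between Gaussians fitted to two 1-D samples.

    General form: |mu_r - mu_g|^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    With scalar variances this reduces to the expression below.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    # Sample variance (n - 1 denominator); this choice is an assumption here.
    var_r, var_g = statistics.variance(real), statistics.variance(generated)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


real_features = [0.0, 2.0]       # toy stand-ins for real-image features
generated_features = [1.0, 3.0]  # toy stand-ins for generated-image features

distance = frechet_distance_1d(real_features, generated_features)
same = frechet_distance_1d(real_features, real_features)
print(distance, same)
```

The distance is zero when both samples have the same mean and variance and grows as either statistic drifts apart, which is why lower is better.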
Landmark Datasets
Landmark datasets – ImageNet (the ILSVRC subset: about 1.2M training images, 1K classes), COCO (330K images, 80 object categories), Pascal VOC, ADE20K, Cityscapes, and Open Images – define the benchmarks that drive computer vision progress and shape architectural design decisions.
Segmentation Metrics
Segmentation is evaluated using mean Intersection over Union (mIoU) for semantic tasks, Dice/F1 for medical imaging, pixel accuracy for basic assessment, and Panoptic Quality (PQ = SQ × RQ) for unified panoptic evaluation.
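The metrics named above can be sketched on flattened label maps. The helper names (`mean_iou`, `dice`, `panoptic_quality`) and the toy inputs are assumptions for this example; the panoptic function takes the IoUs of already-matched segments as given and does not perform the matching itself.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


def dice(pred, target, cls=1):
    """Dice coefficient (equal to F1) for a single class."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    total = sum(1 for p in pred if p == cls) + sum(1 for t in target if t == cls)
    return 2 * inter / total if total else 1.0


def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from the IoUs of matched segments plus FP/FN counts."""
    tp = len(matched_ious)
    sq = sum(matched_ious) / tp if tp else 0.0        # segmentation quality
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    rq = tp / denom if denom else 0.0                 # recognition quality
    return sq * rq, sq, rq


target = [0, 0, 1, 1]  # toy ground-truth labels, one per pixel
pred = [0, 1, 1, 1]    # toy predicted labels

miou = mean_iou(pred, target, num_classes=2)
dice_fg = dice(pred, target, cls=1)
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(miou, dice_fg, pq)
```

For a single class, Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank predictions the same way; they differ in how strongly they penalize partial overlap.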