Evaluation & Datasets

Benchmarks, metrics, and standard datasets.

Benchmark Leaderboards

Benchmark leaderboards – such as those hosted by Papers With Code and the COCO and ImageNet evaluation servers – standardize model comparison, drive competitive progress, and shape research priorities, but they also encourage optimizing for a specific benchmark rather than for general performance.

Classification Metrics

Classification metrics – accuracy, precision, recall, F1, and their variants – quantify model performance from different angles, with the choice of metric depending on class balance, error costs, and deployment context.
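As a minimal sketch of why the choice of metric matters, the snippet below computes accuracy, precision, recall, and F1 for a binary problem from raw label lists. The function name classification_metrics and the toy labels are illustrative assumptions, not part of any particular library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Binary accuracy, precision, recall, and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Convention chosen here: undefined ratios (zero denominator) become 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Imbalanced toy data (hypothetical): one positive among ten samples,
# and a classifier that always predicts the negative class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the always-negative classifier reaches 90% accuracy while recall and F1 are zero, which is the usual argument for reporting more than accuracy when classes are imbalanced.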

Detection Metrics

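Object detection evaluation uses mean Average Precision (mAP): the area under each class's precision-recall curve, averaged over classes, where a detection counts as correct only if its IoU with a ground-truth box meets a threshold. The COCO protocol, the de facto standard, additionally averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 (AP@[.50:.05:.95]).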
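The sketch below illustrates the idea for a single class on a single image: box IoU, greedy matching of detections to ground truth at one IoU threshold, AP as the area under the precision-recall curve, and the average over the ten COCO thresholds. It is a simplified assumption-laden illustration, not the official COCO evaluation code (which uses 101-point interpolation, handles crowd regions and multiple classes and images); all function names and boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(is_tp, num_gt):
    """Area under the PR curve for TP/FP flags sorted by descending score."""
    if num_gt == 0:
        return 0.0
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def ap_at_threshold(detections, gt_boxes, iou_thr):
    """Greedily match (score, box) detections to unmatched ground truth."""
    matched, flags = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            iou = box_iou(box, gt)
            if idx not in matched and iou > best_iou:
                best_iou, best_idx = iou, idx
        hit = best_idx is not None and best_iou >= iou_thr
        if hit:
            matched.add(best_idx)
        flags.append(1 if hit else 0)
    return average_precision(flags, len(gt_boxes))


# The ten COCO IoU thresholds: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical toy scene: two objects, three detections.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    (0.9, (0, 0, 10, 10)),    # perfect localization, IoU 1.0
    (0.8, (21, 20, 30, 28)),  # looser localization, IoU 0.72
    (0.3, (50, 50, 60, 60)),  # false positive
]
ap50 = ap_at_threshold(detections, gt_boxes, 0.50)
coco_ap = sum(ap_at_threshold(detections, gt_boxes, t)
              for t in COCO_IOU_THRESHOLDS) / len(COCO_IOU_THRESHOLDS)
print(ap50, coco_ap)
```

In this toy scene the loosely localized box counts as correct at the lower thresholds but not at 0.75 and above, so the threshold-averaged AP is lower than AP at 0.50. This is how the COCO protocol rewards tighter localization.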

Generative Model Metrics

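Generative model quality is measured by FID (distance between real and generated feature distributions, lower is better), Inception Score (diversity and quality, higher is better), CLIP Score (text-image alignment, higher is better), LPIPS (perceptual distance, lower means more similar), and KID (an unbiased alternative to FID that is better suited to small sample sets).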
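FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: the squared distance between the means plus Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) for the covariances S_r and S_g. The sketch below covers only the special case where both covariances are assumed diagonal, in which the trace term reduces to a sum of squared differences of per-dimension standard deviations. Real FID uses full covariance matrices over Inception-v3 features, so this is an illustration of the formula, not a substitute for it; the function name and toy features are hypothetical.

```python
import math
from statistics import fmean, variance


def diagonal_frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians with diagonal covariance.

    Each argument is a list of equal-length feature vectors (at least two
    per set). Assumes features are uncorrelated, which real FID does not.
    """
    total = 0.0
    for d in range(len(real_feats[0])):
        r = [f[d] for f in real_feats]
        g = [f[d] for f in gen_feats]
        mean_term = (fmean(r) - fmean(g)) ** 2
        # Diagonal case: var_r + var_g - 2*sd_r*sd_g == (sd_r - sd_g)**2
        std_term = (math.sqrt(variance(r)) - math.sqrt(variance(g))) ** 2
        total += mean_term + std_term
    return total


# Hypothetical 2-D "features": the generated set is the real set shifted by 3.
real = [(0.0, 0.0), (2.0, 2.0)]
generated = [(3.0, 3.0), (5.0, 5.0)]
same = diagonal_frechet_distance(real, real)
shifted = diagonal_frechet_distance(real, generated)
print(same, shifted)
```

Identical feature sets give a distance of zero, and a pure shift of the mean contributes only through the squared mean difference, which matches the "lower is better" reading of FID.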

Landmark Datasets

Landmark datasets – ImageNet-1K (about 1.28M training images, 1,000 classes), COCO (about 330K images, 80 object categories), Pascal VOC, ADE20K, Cityscapes, and Open Images – define the benchmarks that drive computer vision progress and shape architectural design decisions.

Segmentation Metrics

Segmentation is evaluated using mean Intersection over Union (mIoU) for semantic tasks, Dice/F1 for medical imaging, pixel accuracy for basic assessment, and Panoptic Quality (PQ = SQ × RQ, the product of segmentation quality and recognition quality) for unified panoptic evaluation.
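A minimal sketch of these measures on flattened label masks follows. The helper names and toy masks are illustrative assumptions; the Panoptic Quality helper takes the IoUs of already-matched segment pairs (in the standard definition, a predicted and a ground-truth segment match when their IoU exceeds 0.5) together with the counts of unmatched predicted and ground-truth segments.

```python
def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over the classes that occur in either mask."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def dice(y_true, y_pred, positive=1):
    """Dice coefficient (equal to F1) for one foreground label."""
    inter = sum(1 for t, p in zip(y_true, y_pred)
                if t == positive and p == positive)
    size = (sum(1 for t in y_true if t == positive)
            + sum(1 for p in y_pred if p == positive))
    return 2 * inter / size if size else 1.0  # both empty: perfect match


def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from matched-segment IoUs and FP/FN counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if tp == 0 or denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp  # segmentation quality: mean IoU of matches
    rq = tp / denom              # recognition quality: F1 over segments
    return sq * rq, sq, rq


# Hypothetical 4-pixel masks with two classes.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
acc = pixel_accuracy(y_true, y_pred)
miou = mean_iou(y_true, y_pred, num_classes=2)
dice_fg = dice(y_true, y_pred, positive=1)
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(acc, miou, dice_fg, pq, sq, rq)
```

On the same masks, Dice for the foreground class (0.8) is higher than its IoU (2/3), which is always the case for imperfect overlap and is worth keeping in mind when comparing numbers reported with different metrics.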