Optical Flow Estimation
Optical flow estimation computes dense per-pixel motion vectors between consecutive video frames, evolving from variational energy minimization to learned architectures like FlowNet, PWC-Net, and RAFT.
What Is Optical Flow Estimation?
Imagine holding a transparent sheet over a photograph and marking, for every single point, an arrow showing where that point moved in the next photograph. The collection of all these arrows – one per pixel – is the optical flow field. It answers the question: “Where did each pixel go?”
Formally, optical flow is a 2D vector field $\mathbf{w}(x, y) = (u(x, y), v(x, y))$ defined over the image plane, where $(u, v)$ represents the horizontal and vertical displacement of pixel $(x, y)$ between frame $t$ and frame $t+1$. The brightness constancy assumption underlying classical methods states:
$$I(x, y, t) = I(x + u, y + v, t + 1)$$
Taking a first-order Taylor expansion yields the optical flow constraint equation:
$$I_x u + I_y v + I_t = 0$$
where $I_x, I_y, I_t$ are spatiotemporal image gradients. Since this single equation has two unknowns per pixel (the aperture problem), additional constraints (smoothness, local constancy) are needed.
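To make the "additional constraints" point concrete, here is a minimal NumPy sketch of the local constancy idea: it assumes a single flow vector for the whole window and solves the stacked constraint equations by least squares (the approach usually associated with Lucas and Kanade). The synthetic image, the helper names `make_frame` and `estimate_constant_flow`, and the chosen displacement are illustrative assumptions, not part of any method described here.

```python
import numpy as np

def make_frame(shift_x=0.0, shift_y=0.0, size=64):
    """Smooth synthetic image whose content is displaced by (shift_x, shift_y)."""
    y, x = np.mgrid[0:size, 0:size].astype(np.float64)
    xs, ys = x - shift_x, y - shift_y  # content now at (x, y) came from (x - u, y - v)
    return np.sin(0.2 * xs) + np.cos(0.15 * ys) + np.sin(0.1 * (xs + ys))

def estimate_constant_flow(frame0, frame1):
    """Least-squares (u, v), assuming one flow vector for the whole window."""
    I_y, I_x = np.gradient(frame0)   # np.gradient returns d/drow first, then d/dcol
    I_t = frame1 - frame0            # temporal derivative for a unit time step
    A = np.stack([I_x.ravel(), I_y.ravel()], axis=1)
    b = -I_t.ravel()                 # each row encodes I_x * u + I_y * v = -I_t
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow

true_flow = np.array([0.3, -0.2])    # hypothetical sub-pixel motion (u, v)
frame0 = make_frame()
frame1 = make_frame(*true_flow)
estimated_flow = estimate_constant_flow(frame0, frame1)
print("estimated (u, v):", estimated_flow)
```

One equation per pixel is underdetermined, but many pixels sharing one unknown vector give an overdetermined system. The linearization only holds for small displacements, which is one reason coarse-to-fine schemes appear later in this article.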
How It Works
Classical Variational Methods
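Horn and Schunck (1981) minimized a global energy combining brightness constancy and smoothness:
$$E(u, v) = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx \, dy$$
where $\alpha$ weights the smoothness term against the data term.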
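The global energy can be minimized with a simple fixed-point iteration. The following is a rough sketch rather than a faithful reproduction of the original algorithm: it assumes a 4-neighbour mean in place of the weighted stencil used in the 1981 paper, and the test image, `alpha`, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def neighbor_mean(field):
    """Mean of the 4-connected neighbours, with edge values replicated."""
    padded = np.pad(field, 1, mode="edge")
    return 0.25 * (padded[:-2, 1:-1] + padded[2:, 1:-1]
                   + padded[1:-1, :-2] + padded[1:-1, 2:])

def horn_schunck(frame0, frame1, alpha=0.1, num_iters=1000):
    """Jacobi-style iteration for the Horn-Schunck energy; returns (u, v)."""
    I_y, I_x = np.gradient(frame0)
    I_t = frame1 - frame0
    u = np.zeros_like(frame0)
    v = np.zeros_like(frame0)
    denom = alpha ** 2 + I_x ** 2 + I_y ** 2
    for _ in range(num_iters):
        u_bar, v_bar = neighbor_mean(u), neighbor_mean(v)
        # How far the smoothed flow violates the brightness constancy constraint.
        residual = (I_x * u_bar + I_y * v_bar + I_t) / denom
        u = u_bar - I_x * residual
        v = v_bar - I_y * residual
    return u, v

def synthetic_frame(shift_x=0.0, shift_y=0.0, size=64):
    """Hypothetical smooth test image displaced by (shift_x, shift_y)."""
    y, x = np.mgrid[0:size, 0:size].astype(np.float64)
    xs, ys = x - shift_x, y - shift_y
    return np.sin(0.2 * xs) + np.cos(0.15 * ys) + np.sin(0.1 * (xs + ys))

hs_u, hs_v = horn_schunck(synthetic_frame(), synthetic_frame(0.3, -0.2))
print("mean flow:", hs_u.mean(), hs_v.mean())
```

The update keeps the neighbourhood average where the image has no gradient and corrects it along the gradient direction elsewhere, which is how the smoothness term fills in flow that the data term alone cannot determine.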
TV-L1 (Zach et al., 2007) replaced the quadratic data term with an $L^1$ norm for robustness to outliers and the smoothness term with total variation. TV-L1 became the standard for pre-computing flow in action recognition (~0.5s per frame pair on CPU, ~0.06s on GPU).
FlowNet: Learning to Estimate Flow
Dosovitskiy et al. (2015) proposed FlowNet, the first end-to-end CNN for optical flow:
FlowNetS (Simple): Encoder-decoder with skip connections. Two frames are concatenated as a 6-channel input and processed through a contracting encoder. The decoder upsamples with deconvolutions and skip connections from the encoder.
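FlowNetC (Correlation): Two frames are processed by separate (shared-weight) encoder branches up to a correlation layer. The correlation layer computes local matching scores between feature maps $\mathbf{f}_1$ and $\mathbf{f}_2$:
$$c(\mathbf{x}_1, \mathbf{x}_2) = \langle \mathbf{f}_1(\mathbf{x}_1), \mathbf{f}_2(\mathbf{x}_2) \rangle$$
evaluated for every $\mathbf{x}_2$ in a $D \times D$ search neighborhood around $\mathbf{x}_1$ (with $D = 2d + 1$ for maximum displacement $d$), producing a $D^2$-channel correlation volume. This explicit matching computation improved accuracy on small displacements.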
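A local correlation layer of this kind can be sketched in a few lines of NumPy. This is an illustrative version under simplifying assumptions (single-pixel patches, stride 1, zero padding, a small `max_disp`); the function name `local_correlation` is hypothetical and the real FlowNetC layer differs in its strides and normalization.

```python
import numpy as np

def local_correlation(feat1, feat2, max_disp=2):
    """feat1, feat2: (C, H, W). Returns (D*D, H, W) with D = 2 * max_disp + 1."""
    _, height, width = feat1.shape
    side = 2 * max_disp + 1
    # Zero-pad the second feature map so every displacement stays in bounds.
    padded = np.pad(feat2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    volume = np.empty((side * side, height, width), dtype=feat1.dtype)
    for index, (dy, dx) in enumerate(np.ndindex(side, side)):
        # shifted[:, i, j] equals feat2[:, i + dy - max_disp, j + dx - max_disp]
        shifted = padded[:, dy:dy + height, dx:dx + width]
        volume[index] = (feat1 * shifted).sum(axis=0)  # dot product over channels
    return volume

rng = np.random.default_rng(0)
features_a = rng.standard_normal((64, 10, 12))
# Second map: the same features moved one row down and one column left.
features_b = np.roll(features_a, shift=(1, -1), axis=(1, 2))
corr_volume = local_correlation(features_a, features_b, max_disp=2)
best_channel = corr_volume.argmax(axis=0)
print(corr_volume.shape, best_channel[5, 5])
```

Each output channel corresponds to one candidate displacement, so the channel with the highest score at a pixel is that pixel's best local match. Here the winning channel encodes the offset (dy, dx) = (1, -1).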
FlowNet was trained on the synthetic FlyingChairs dataset (22k image pairs). Its endpoint error (EPE) on Sintel was ~6.0 pixels, versus ~5.0 for EpicFlow, a leading classical method at the time.
FlowNet2.0: Stacking and Scheduling
Ilg et al. (2017) improved FlowNet through:
- Stacking: Cascading multiple FlowNets where each refines the previous estimate. FlowNet2 stacks FlowNetC -> FlowNetS -> FlowNetS with warping between stages.
- Training schedule: Curriculum learning from FlyingChairs (simple) to FlyingThings3D (complex).
- Small displacement network: A specialized sub-network for sub-pixel motions.
FlowNet2.0 achieved 2.02 EPE on Sintel Clean, matching classical state-of-the-art at ~25 FPS on GPU (vs. minutes for variational methods).
PWC-Net: Pyramidal Processing
Sun et al. (2018) introduced PWC-Net (Pyramid, Warping, Cost volume), achieving better accuracy with a fraction of FlowNet2’s parameters:
- Feature pyramid: Extract features at multiple scales using a learnable pyramid (not fixed Gaussian).
- Warping: At each level, warp features of the second image using the upsampled flow estimate from the coarser level.
- Cost volume: Compute a partial cost volume by correlating features within a limited search range ($\pm 4$ pixels at each level).
- CNN estimator: A compact CNN predicts the flow residual from the cost volume and context features.
Processing coarse-to-fine enables handling large motions efficiently. PWC-Net: 2.55 EPE on Sintel Clean, 4.38 on Sintel Final, with only 8.75M parameters (vs. 162M for FlowNet2) at ~35 FPS.
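The warping step is what keeps the per-level search range small: once the second image's features are pulled back by the current flow estimate, only the residual motion remains to be matched. Below is a minimal NumPy sketch of backward warping with bilinear interpolation. It clamps samples at the image border, which is a simplification assumed here for brevity, and `backward_warp` is a hypothetical helper name.

```python
import numpy as np

def backward_warp(features, flow):
    """features: (C, H, W) from frame 2; flow: (2, H, W) holding (u, v) in pixels.
    Returns features sampled at (x + u, y + v) using bilinear interpolation."""
    _, height, width = features.shape
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    # Clamp sample positions to the image (border replication).
    sample_x = np.clip(xs + flow[0], 0, width - 1)
    sample_y = np.clip(ys + flow[1], 0, height - 1)
    x0 = np.floor(sample_x).astype(int)
    y0 = np.floor(sample_y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = sample_x - x0  # fractional offsets act as interpolation weights
    wy = sample_y - y0
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# A horizontal ramp: the value at column j is j, so warping is easy to check.
ramp = np.tile(np.arange(6.0), (1, 4, 1))          # shape (1, 4, 6)
half_pixel_flow = np.zeros((2, 4, 6))
half_pixel_flow[0] = 0.5                           # u = 0.5, v = 0
warped_ramp = backward_warp(ramp, half_pixel_flow)
print(warped_ramp[0, 0])
```

In a coarse-to-fine pipeline the flow passed to the warp would be the upsampled (and rescaled) estimate from the coarser level, and the cost volume would then be computed between the first image's features and the warped ones.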
RAFT: Recurrent All-Pairs Field Transforms
Teed and Deng (2020) introduced RAFT, which fundamentally changed the approach:
- Feature extraction: A shared encoder extracts features from both frames at 1/8 resolution.
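- All-pairs correlation: Compute correlations between ALL pairs of pixels (not just local neighborhoods), producing a 4D correlation volume $\mathbf{C} \in \mathbb{R}^{H \times W \times H \times W}$ from the feature maps $g_\theta(I_1)$ and $g_\theta(I_2)$, where $H \times W$ is the 1/8-resolution feature size:
  $$C_{ijkl} = \sum_h g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}$$
  This volume is constructed once and stored as a correlation pyramid (pooled at multiple scales) for efficient lookup.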
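- Iterative refinement: A GRU-based recurrent unit iteratively updates the flow estimate. At each iteration $k$:
  $$\mathbf{h}_k = \text{GRU}(\mathbf{h}_{k-1}, \mathbf{x}_k), \qquad \mathbf{f}_k = \mathbf{f}_{k-1} + \Delta \mathbf{f}_k$$
  where $\mathbf{x}_k$ includes correlation lookups indexed by the current flow estimate, the current flow, and context features, and the update $\Delta \mathbf{f}_k$ is predicted from the hidden state $\mathbf{h}_k$. Typically 12 iterations are used during training and 12–32 at inference.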
RAFT achieved 1.61 EPE on Sintel Clean and 2.86 on Sintel Final, a major leap. Its recurrent design allows trading compute for accuracy at inference time (more iterations = better flow).
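The all-pairs correlation and its pyramid are compact enough to sketch directly. The NumPy version below is an illustration under assumptions, not the reference implementation: it omits any feature normalization, uses plain 2x2 average pooling on the last two dimensions, and the function names are hypothetical.

```python
import numpy as np

def all_pairs_correlation(feat1, feat2):
    """feat1, feat2: (H, W, D) feature maps. Returns a (H, W, H, W) volume
    whose entry [i, j, k, l] is the dot product of feat1[i, j] and feat2[k, l]."""
    return np.einsum("ijh,klh->ijkl", feat1, feat2)

def correlation_pyramid(volume, num_levels=4):
    """Repeatedly average-pool the LAST two dimensions by a factor of 2.
    The first two dimensions keep full resolution, as the lookup is per source pixel."""
    pyramid = [volume]
    for _ in range(num_levels - 1):
        h1, w1, h2, w2 = pyramid[-1].shape
        cropped = pyramid[-1][:, :, : h2 - h2 % 2, : w2 - w2 % 2]
        pooled = cropped.reshape(h1, w1, h2 // 2, 2, w2 // 2, 2).mean(axis=(3, 5))
        pyramid.append(pooled)
    return pyramid

rng = np.random.default_rng(0)
fmap1 = rng.standard_normal((8, 16, 32))   # stands in for 1/8-resolution features
fmap2 = rng.standard_normal((8, 16, 32))
corr = all_pairs_correlation(fmap1, fmap2)
corr_pyramid = correlation_pyramid(corr)
print([level.shape for level in corr_pyramid])
```

Pooling only the target dimensions means coarse levels summarize where a match might be far away while each source pixel still has its own entry, which is what lets a fixed-size lookup window cover both small and large displacements.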
Post-RAFT Developments
- GMA (Jiang et al., 2021): Added global motion aggregation via self-attention to handle occlusions; 1.39 EPE on Sintel Clean
- FlowFormer (Huang et al., 2022): Transformer-based cost volume processing; 1.16 EPE on Sintel Clean
- VideoFlow (Shi et al., 2023): Exploits temporal context from multiple frames; further improvements on Sintel
Evaluation Metrics
Endpoint Error (EPE): The Euclidean distance between predicted and ground-truth flow vectors, averaged over all $N$ pixels:
$$\text{EPE} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{(u_i - u_i^{gt})^2 + (v_i - v_i^{gt})^2}$$
Fl-all (KITTI): Percentage of pixels whose endpoint error exceeds both 3 pixels and 5% of the ground-truth flow magnitude.
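Both metrics are short to compute. The sketch below assumes flows stored as `(H, W, 2)` arrays and an optional boolean `valid` mask for pixels that have ground truth; the function names and the tiny example values are made up for illustration.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Per-pixel Euclidean distance between two (H, W, 2) flow fields."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def average_epe(flow_pred, flow_gt, valid=None):
    """Mean endpoint error over valid pixels."""
    error = endpoint_error(flow_pred, flow_gt)
    if valid is None:
        valid = np.ones(error.shape, dtype=bool)
    return error[valid].mean()

def fl_all(flow_pred, flow_gt, valid=None):
    """Percentage of valid pixels with error > 3 px AND > 5% of the GT magnitude."""
    error = endpoint_error(flow_pred, flow_gt)
    magnitude = np.linalg.norm(flow_gt, axis=-1)
    if valid is None:
        valid = np.ones(error.shape, dtype=bool)
    outlier = (error > 3.0) & (error > 0.05 * magnitude)
    return 100.0 * outlier[valid].mean()

# Four pixels with errors of 4, 4, 10, and 0 pixels.
gt_example = np.array([[[10.0, 0.0], [100.0, 0.0]],
                       [[100.0, 0.0], [0.0, 0.0]]])
pred_example = np.array([[[14.0, 0.0], [104.0, 0.0]],
                         [[110.0, 0.0], [0.0, 0.0]]])
print(average_epe(pred_example, gt_example), fl_all(pred_example, gt_example))
```

The second pixel shows why the relative criterion matters: a 4-pixel error on a 100-pixel motion is under 5% and is not counted as an outlier, whereas the same 4-pixel error on a 10-pixel motion is.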
Why It Matters
- Video understanding backbone: Optical flow provides motion information critical for action recognition (two-stream networks), video segmentation, and frame interpolation.
- Autonomous driving: Dense motion estimation enables independent motion detection (identifying other moving vehicles), ego-motion estimation, and scene flow computation. KITTI flow benchmarks directly evaluate this.
- Video editing and VFX: Motion compensation, temporal interpolation (slow-motion), stabilization, and object tracking all rely on accurate flow.
- Self-supervised learning signal: Flow provides a free supervisory signal for learning video representations without human annotations, and flow prediction can itself be trained self-supervised using photometric losses.
Key Technical Details
- Sintel Clean / Final EPE: RAFT 1.61 / 2.86, FlowFormer 1.16 / 2.09, GMA 1.39 / 2.47
- KITTI-2015 Fl-all: RAFT 5.10%, FlowFormer 4.09%
- RAFT inference: ~10 FPS on 1088x436 frames on a GTX 1080Ti GPU, as reported by Teed and Deng (2020); faster with fewer iterations
- PWC-Net: 8.75M params, ~35 FPS at 1024x436; RAFT: ~5.3M params, ~10 FPS
- FlowNet2.0: 162M params, 25 FPS at 1024x436
- Training data: FlyingChairs (22k), FlyingThings3D (22k), Sintel (1k), KITTI (200); pre-train on synthetic, fine-tune on target
- RAFT correlation volume memory: $O\big((H/8 \cdot W/8)^2\big)$ at 1/8 resolution; e.g., ~600 MB in float32 for a $1024 \times 768$ input
- TV-L1 flow (classical): ~0.06s/frame on GPU, ~0.5s on CPU; main bottleneck for two-stream training pipelines
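The quadratic growth of the all-pairs correlation volume is easy to check with arithmetic. The helper below is a back-of-the-envelope sketch assuming float32 storage and 1/8-resolution features. The 1024x768 input is simply one size that lands near the ~600 MB figure quoted above and is an assumption, not necessarily the size that figure was measured on.

```python
import math

def correlation_volume_bytes(height, width, downsample=8,
                             bytes_per_value=4, num_levels=1):
    """Memory of an all-pairs correlation volume (optionally with pooled levels)."""
    h = math.ceil(height / downsample)
    w = math.ceil(width / downsample)
    total_values = 0
    for level in range(num_levels):
        # Source dims stay at (h, w); target dims are pooled by 2 per level.
        total_values += (h * w) * (h // 2 ** level) * (w // 2 ** level)
    return total_values * bytes_per_value

for size in [(436, 1024), (768, 1024), (1080, 1920)]:
    megabytes = correlation_volume_bytes(*size) / 1e6
    print(f"{size[1]}x{size[0]}: {megabytes:.0f} MB")
```

Doubling both image dimensions multiplies the volume by 16, which is why high-resolution inputs are the main practical constraint on all-pairs designs.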
Common Misconceptions
- “Learned optical flow has completely replaced classical methods.” TV-L1 remains widely used for pre-computing flow in action recognition pipelines due to its simplicity and sufficient quality. Classical methods also have no data distribution shift issues and work reliably on arbitrary inputs.
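- “Optical flow requires two frames and cannot handle occlusions.” Multi-frame methods (VideoFlow) and occlusion-aware architectures (GMA) explicitly reason about occluded regions. Forward-backward consistency checks ($\|\mathbf{w}_f(\mathbf{x}) + \mathbf{w}_b(\mathbf{x} + \mathbf{w}_f(\mathbf{x}))\| > \tau$, with forward flow $\mathbf{w}_f$, backward flow $\mathbf{w}_b$, and threshold $\tau$) detect occluded pixels, where flow is undefined.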
- “More parameters always improve flow quality.” RAFT (5.3M params) significantly outperforms FlowNet2 (162M params), demonstrating that architectural design (all-pairs correlation, iterative refinement) matters more than model size.
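The forward-backward consistency check mentioned in the occlusion item can be sketched as follows. This is a simplified illustration: it samples the backward flow with nearest-neighbour rounding rather than bilinear interpolation, uses a fixed threshold, and treats pixels that leave the frame as occluded. All of these are assumptions made here for brevity, and `occlusion_mask` is a hypothetical name.

```python
import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, threshold=1.0):
    """flow_fwd, flow_bwd: (H, W, 2) holding (u, v). Returns a boolean (H, W)
    mask that is True where the forward-backward check fails."""
    height, width, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:height, 0:width]
    # Where each pixel lands in the second frame (nearest pixel).
    target_x = np.rint(xs + flow_fwd[..., 0]).astype(int)
    target_y = np.rint(ys + flow_fwd[..., 1]).astype(int)
    inside = ((target_x >= 0) & (target_x < width)
              & (target_y >= 0) & (target_y < height))
    bwd_at_target = flow_bwd[np.clip(target_y, 0, height - 1),
                             np.clip(target_x, 0, width - 1)]
    # For a visible pixel, going forward and then backward returns to the start.
    mismatch = np.linalg.norm(flow_fwd + bwd_at_target, axis=-1)
    return (mismatch > threshold) | ~inside

fwd = np.zeros((6, 8, 2))
fwd[..., 0] = 2.0                 # everything moves 2 pixels to the right
bwd = -fwd                        # perfectly consistent backward flow
consistent_mask = occlusion_mask(fwd, bwd)

bwd_conflict = bwd.copy()
bwd_conflict[2, 5] = (3.0, 0.0)   # backward flow disagrees at one target pixel
conflict_mask = occlusion_mask(fwd, bwd_conflict)
print(consistent_mask.sum(), conflict_mask.sum())
```

In the consistent case only the two rightmost columns are flagged, because those pixels move out of the frame. Published variants of this check often scale the threshold with the flow magnitude instead of using a constant.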
Connections to Other Concepts
- Two Stream Networks: Optical flow is the input to the temporal stream, so flow quality directly affects action recognition accuracy.
- Video Representation: Stacked flow fields are a primary video representation for motion-sensitive tasks.
- Video Object Tracking: Flow provides motion cues for predicting object locations in subsequent frames.
- Video Generation: Temporal coherence in generated videos can be evaluated and enforced using optical flow consistency.
- 3D Convolutions: 3D CNNs learn implicit motion representations that partially overlap with optical flow information.
Further Reading
- Dosovitskiy et al., “FlowNet: Learning Optical Flow with Convolutional Networks” (2015) – First end-to-end CNN for optical flow.
- Ilg et al., “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks” (2017) – Stacked refinement and training schedule improvements.
- Sun et al., “PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume” (2018) – Efficient coarse-to-fine learned flow.
- Teed & Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow” (2020) – All-pairs correlation with iterative GRU refinement; current paradigm.
- Huang et al., “FlowFormer: A Transformer Architecture for Optical Flow” (2022) – Transformer-based cost volume processing.