Video Understanding

Temporal modeling, action recognition, and video analysis.

3D Convolutions

3D convolutions extend standard 2D spatial filters with a temporal dimension, enabling neural networks to learn spatiotemporal features directly from raw video clips.

Action Recognition

Action recognition classifies human activities in video clips, evolving from hand-crafted features through two-stream CNNs and 3D convolutions to transformer-based models evaluated on benchmarks like Kinetics, UCF-101, and HMDB-51.

Optical Flow Estimation

Optical flow estimation computes dense per-pixel motion vectors between consecutive video frames, evolving from variational energy minimization to learned architectures like FlowNet, PWC-Net, and RAFT.

Two-Stream Networks

Two-stream networks process video through parallel spatial (RGB) and temporal (optical flow) pathways, fusing their predictions to capture both appearance and motion for action recognition.

Video Generation

Video generation extends image synthesis to the temporal domain, using diffusion models or autoregressive approaches to produce temporally coherent frame sequences while battling flickering, motion artifacts, and immense computational costs.

Video Object Tracking

Video object tracking localizes a target object across video frames, encompassing single-object tracking (SOT) with template matching and multi-object tracking (MOT) with detection-and-association pipelines.

Video Representation

Video representation converts raw video into structured tensors suitable for neural networks through frame stacking, temporal differencing, and clip sampling strategies.

Video Transformers

Video transformers apply self-attention to spatiotemporal tokens extracted from video, achieving strong accuracy but facing a quadratic cost challenge that demands factorized attention strategies.