Video Understanding
Temporal modeling, action recognition, and video analysis.
3D Convolutions
3D convolutions extend standard 2D spatial filters with a temporal dimension, enabling neural networks to learn spatiotemporal features directly from raw video clips.
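As a rough illustration of the description above, the sketch below slides a small kernel over time, height, and width at once. It is a naive single-channel version with no padding, written for clarity rather than speed; the function names `conv3d_output_shape` and `conv3d_single_channel` are hypothetical, and the output-size formula is the usual one for strided, zero-padded convolution.

```python
def conv3d_output_shape(in_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Return (T_out, H_out, W_out) for a 3D convolution over a (T, H, W) volume."""
    return tuple(
        (size + 2 * p - k) // s + 1
        for size, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def conv3d_single_channel(clip, kernel):
    """Naive 'valid' 3D cross-correlation.

    clip is indexed as clip[t][y][x], kernel as kernel[dt][dy][dx].
    """
    in_shape = (len(clip), len(clip[0]), len(clip[0][0]))
    k_shape = (len(kernel), len(kernel[0]), len(kernel[0][0]))
    out_t, out_h, out_w = conv3d_output_shape(in_shape, k_shape)
    kt, kh, kw = k_shape

    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_t)]
    for t in range(out_t):
        for y in range(out_h):
            for x in range(out_w):
                # Sum over a kt x kh x kw spatiotemporal neighbourhood.
                out[t][y][x] = sum(
                    clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                    for dt in range(kt)
                    for dy in range(kh)
                    for dx in range(kw)
                )
    return out
```

With a 2x1x1 kernel holding the values -1 and 1, this reduces to differencing consecutive frames, which is one way to see how the temporal extent of the filter lets it respond to motion.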
Action Recognition
Action recognition classifies human activities in video clips, evolving from hand-crafted features through two-stream CNNs and 3D convolutions to transformer-based models evaluated on benchmarks like Kinetics, UCF-101, and HMDB-51.
Optical Flow Estimation
Optical flow estimation computes dense per-pixel motion vectors between consecutive video frames, evolving from variational energy minimization to learned architectures like FlowNet, PWC-Net, and RAFT.
Two-Stream Networks
Two-stream networks process video through parallel spatial (RGB) and temporal (optical flow) pathways, fusing their predictions to capture both appearance and motion for action recognition.
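One simple way to fuse the two pathways is late fusion: turn each stream's class scores into probabilities and take a weighted average. The sketch below assumes each stream has already produced a list of logits; `late_fusion` and its `temporal_weight` parameter are hypothetical names, and averaging is only one of several fusion schemes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of per-class probabilities from the two streams."""
    p_spatial = softmax(spatial_logits)    # appearance (RGB) stream
    p_temporal = softmax(temporal_logits)  # motion (optical flow) stream
    w = temporal_weight
    return [(1 - w) * a + w * b for a, b in zip(p_spatial, p_temporal)]


def predict(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(spatial_logits, temporal_logits, temporal_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

Setting `temporal_weight` to 0 or 1 recovers the prediction of a single stream, which makes it easy to check how much each pathway contributes.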
Video Generation
Video generation extends image synthesis to the temporal domain, using diffusion models or autoregressive approaches to produce temporally coherent frame sequences while contending with flicker, motion artifacts, and high computational cost.
Video Object Tracking
Video object tracking localizes a target object across video frames, encompassing single-object tracking (SOT) with template matching and multi-object tracking (MOT) with detection-and-association pipelines.
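The association step of a detection-and-association pipeline can be sketched as matching existing track boxes to new detections by overlap. The example below uses intersection-over-union with a greedy matcher as a simplification; practical trackers often use an optimal assignment solver and motion or appearance cues instead. The names `iou` and `greedy_associate` and the 0.3 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, iou_threshold=0.3):
    """Match track boxes to detection boxes, highest IoU first.

    Returns a list of (track_index, detection_index) pairs. Unmatched
    tracks and detections are simply absent from the result.
    """
    candidates = sorted(
        (
            (iou(t, d), ti, di)
            for ti, t in enumerate(tracks)
            for di, d in enumerate(detections)
        ),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, ti, di in candidates:
        if score < iou_threshold:
            break  # remaining candidates overlap even less
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return matches
```

In a full tracker, unmatched detections would typically start new tracks and tracks left unmatched for several frames would be dropped.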
Video Representation
Video representation converts raw video into structured tensors suitable for neural networks through frame stacking, temporal differencing, and clip sampling strategies.
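The clip sampling and temporal differencing mentioned above can be sketched with plain index arithmetic. The helper names below are hypothetical, and frames are represented as flat lists of pixel values to keep the example free of dependencies.

```python
def strided_clip_indices(num_frames, clip_len, stride=1, start=0):
    """Indices of clip_len frames spaced stride apart.

    Indices past the end of the video are clamped to the last frame,
    which repeats it as padding.
    """
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def segment_center_indices(num_frames, num_segments):
    """Split the video into equal segments and take each segment's centre frame."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def temporal_difference(frames):
    """Per-pixel difference between each pair of consecutive frames."""
    return [
        [later - earlier for earlier, later in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]
```

Strided sampling keeps a clip local in time, while segment-based sampling spreads a few frames across the whole video; which one suits a task depends on how long the relevant motion lasts.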
Video Transformers
Video transformers apply self-attention to spatiotemporal tokens extracted from video, achieving strong accuracy; because attention cost grows quadratically with the number of tokens, they typically rely on factorized attention strategies to stay tractable.
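To make the cost argument concrete, the sketch below counts query-key pairs for joint spatiotemporal attention versus one common factorization that attends over space and time separately. It assumes non-overlapping square patches and counts pairs only, ignoring heads, embedding width, and any class token; the function names are hypothetical.

```python
def token_grid(frames, height, width, patch=16, tubelet=1):
    """Return (temporal_tokens, spatial_tokens_per_frame) for a patchified clip."""
    temporal = frames // tubelet
    spatial = (height // patch) * (width // patch)
    return temporal, spatial


def joint_attention_pairs(temporal, spatial):
    """Every token attends to every other token across space and time."""
    n = temporal * spatial
    return n * n


def factorized_attention_pairs(temporal, spatial):
    """Spatial attention within each frame, then temporal attention per location."""
    spatial_pairs = temporal * spatial * spatial    # each frame: spatial^2 pairs
    temporal_pairs = spatial * temporal * temporal  # each location: temporal^2 pairs
    return spatial_pairs + temporal_pairs


if __name__ == "__main__":
    t, s = token_grid(frames=8, height=224, width=224)
    print("tokens:", t * s)
    print("joint pairs:", joint_attention_pairs(t, s))
    print("factorized pairs:", factorized_attention_pairs(t, s))
```

The gap between the two counts widens as clips get longer or resolution increases, which is the motivation for factorized designs.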