BD Brain Drip
Convolutional Neural Networks

Depthwise Separable Convolutions

Depthwise separable convolutions factorize a standard convolution into a spatial depthwise convolution and a channel-wise pointwise convolution, reducing computation by 8–9x with minimal accuracy loss.

Prerequisites | Convolution in neural networks 1x1 convolutions computational cost of convolutions

What Is Depthwise Separable Convolution?

Consider washing a stack of colored plates. A standard approach is to scrub each plate with a sponge that handles both the circular scrubbing motion (spatial) and the color-specific soap (channel) simultaneously. A depthwise separable approach splits this into two steps: first, scrub each plate individually with a plain sponge (spatial filtering per channel), then apply the color-specific treatment across all plates at once (channel mixing). By decoupling where to look from what to combine, each step becomes simpler and cheaper.

Technically, a standard convolution with kernel size k×kk \times k, CinC_{in} input channels, and CoutC_{out} output channels applies CoutC_{out} filters, each of size k×k×Cink \times k \times C_{in}. A depthwise separable convolution breaks this into:

  1. Depthwise convolution: Apply one k×kk \times k filter per input channel (total: CinC_{in} filters), producing CinC_{in} feature maps.
  2. Pointwise convolution: Apply CoutC_{out} filters of size 1×1×Cin1 \times 1 \times C_{in} to mix channels, producing the final CoutC_{out} feature maps.

How It Works

Standard Convolution Cost

For an input of size H×W×CinH \times W \times C_{in} and CoutC_{out} output channels with kernel size kk:

Coststandard=HWCinCoutk2\text{Cost}_{\text{standard}} = H \cdot W \cdot C_{in} \cdot C_{out} \cdot k^2

Depthwise Convolution

Each of the CinC_{in} channels is convolved independently with its own k×kk \times k filter:

Costdepthwise=HWCink2\text{Cost}_{\text{depthwise}} = H \cdot W \cdot C_{in} \cdot k^2

This produces CinC_{in} output feature maps. No inter-channel information is mixed at this stage. In group convolution terms, depthwise convolution is a grouped convolution with g=Cing = C_{in} (one group per channel).

Pointwise Convolution

A standard 1×11 \times 1 convolution combines the CinC_{in} depthwise outputs into CoutC_{out} channels:

Costpointwise=HWCinCout\text{Cost}_{\text{pointwise}} = H \cdot W \cdot C_{in} \cdot C_{out}

Total Reduction

Costseparable=HWCin(k2+Cout)\text{Cost}_{\text{separable}} = H \cdot W \cdot C_{in} \cdot (k^2 + C_{out})

The ratio of separable to standard convolution cost:

CostseparableCoststandard=k2+Coutk2Cout=1Cout+1k2\frac{\text{Cost}_{\text{separable}}}{\text{Cost}_{\text{standard}}} = \frac{k^2 + C_{out}}{k^2 \cdot C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^2}

For k=3k = 3 and Cout=256C_{out} = 256: ratio =1256+190.115= \frac{1}{256} + \frac{1}{9} \approx 0.115, which is approximately an 8.7x reduction in computation. For larger CoutC_{out}, the savings approach 1k2\frac{1}{k^2} (9x for 3×33 \times 3 kernels).

Parameter Reduction

Standard: k2CinCoutk^2 \cdot C_{in} \cdot C_{out} parameters. Separable: k2Cin+CinCoutk^2 \cdot C_{in} + C_{in} \cdot C_{out} parameters.

For k=3k=3, Cin=Cout=256C_{in} = C_{out} = 256: Standard has 589,824589{,}824 params; separable has 67,84067{,}840 – an 8.7x reduction.

Implementation Details

# PyTorch implementation
import torch.nn as nn

# Depthwise convolution: groups = C_in
depthwise = nn.Conv2d(
    in_channels=64, out_channels=64,
    kernel_size=3, padding=1, groups=64  # key: groups = in_channels
)

# Pointwise convolution: 1x1
pointwise = nn.Conv2d(
    in_channels=64, out_channels=128,
    kernel_size=1
)

The groups parameter in nn.Conv2d controls the grouping. Setting groups=in_channels produces a depthwise convolution.

Batch Normalization and Activation

In practice, BN and ReLU (or ReLU6, swish) are applied after each stage:

Input -> DepthwiseConv -> BN -> ReLU -> PointwiseConv -> BN -> ReLU -> Output

Some architectures (MobileNetV2) apply a linear bottleneck after the pointwise convolution, omitting the final ReLU to prevent information loss in low-dimensional spaces.

Why It Matters

  1. Enables mobile and edge deployment. The 8–9x computation reduction makes it feasible to run CNNs on smartphones, IoT devices, and other resource-constrained hardware.
  2. Foundation of efficient architectures. MobileNet, Xception, EfficientNet, and many NAS-discovered architectures rely on depthwise separable convolutions as their core building block.
  3. Minimal accuracy trade-off. On ImageNet, MobileNetV1 with depthwise separable convolutions achieves 70.6% top-1 accuracy with only 569M FLOPs, compared to VGG-16 at 71.5% with 15,500M FLOPs – a 27x compute reduction for less than 1% accuracy loss.
  4. Influenced hardware design. Mobile NPUs and DSPs now include specific optimizations for depthwise convolutions due to their prevalence in deployed models.

Key Technical Details

  • The depthwise stage accounts for only about 3% of total computation but ~33% of total runtime on GPUs due to poor arithmetic intensity (low ratio of computation to memory access).
  • On CPUs and DSPs, depthwise separable convolutions are proportionally faster because these platforms are more memory-bound and the reduced computation translates directly to speedup.
  • Xception (Chollet, 2017) uses depthwise separable convolutions throughout and achieves 79.0% top-1 accuracy on ImageNet (vs. Inception V3’s 78.8%) with the same parameter count.
  • The depthwise filter has k2k^2 parameters per channel, which is quite small. This low parameter count per channel can limit the expressive power of the spatial filtering step.
  • Grouped convolutions with group sizes between 1 and CinC_{in} provide a tunable middle ground between standard and depthwise convolutions.

Common Misconceptions

  • “Depthwise separable convolutions are always faster.” On GPUs, the depthwise stage has very low arithmetic intensity, making it memory-bound and difficult to parallelize efficiently. The theoretical 8-9x FLOP reduction often translates to only 2-3x wall-clock speedup on GPUs without specialized kernels.
  • “The two stages are just a mathematical trick.” The factorization assumes that spatial and channel correlations can be learned independently, which is a genuine modeling constraint. Chollet (2017) argued this is actually a beneficial inductive bias, as the Xception results suggest.
  • “Depthwise separable convolutions cannot match standard convolution accuracy.” With sufficient width and depth, networks using depthwise separable convolutions can match or exceed standard convolution networks (Xception vs. Inception V3, EfficientNet vs. ResNet).

Connections to Other Concepts

  • Convolution In Neural Networks: Depthwise separable convolution is a strict factorization of the standard convolution operation into spatial and channel components.
  • Inception: Inception’s use of 1×11 \times 1 convolutions for dimensionality reduction is conceptually related; depthwise separable convolutions take the factorization further by making every spatial convolution single-channel.
  • Mobilenet: Built entirely on depthwise separable convolutions, applying them systematically for mobile-efficient architectures.
  • Efficientnet: Uses depthwise separable convolutions as the primary operator within its mobile inverted bottleneck blocks.

Further Reading

  • Sifre & Mallat, “Rigid-Motion Scattering for Texture Classification” (2014) – Early work proposing separable filters in deep networks. [Scholar]
  • Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions” (2017) – Systematic replacement of Inception modules with depthwise separable convolutions. [Scholar]
  • Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications” (2017) – Established depthwise separable convolutions as the standard for efficient mobile CNNs. [Scholar]