Depthwise Separable Convolution

Implement depthwise separable convolution as used in MobileNet. This factorizes a standard convolution into two cheaper operations:

Depthwise convolution: apply one filter per input channel (no cross-channel mixing). Each channel is convolved independently.
Pointwise convolution: a 1×1 convolution that mixes channels, projecting from C_in to C_out.

Relative to a full convolution, the multiply count drops to a fraction of about 1/C_out + 1/k² (k² = kernel size squared). When C_out is much larger than k², that is roughly a k²-fold reduction.
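To make the saving concrete, the multiply counts of the two approaches can be compared directly. The sketch below is illustrative only: the helper name conv_multiply_counts and the layer sizes are assumptions chosen for the example, and it counts multiplications for a stride-1, unpadded layer.

```python
def conv_multiply_counts(h_out, w_out, k_h, k_w, c_in, c_out):
    """Return (standard, separable) multiply counts for one forward pass.

    Hypothetical helper for illustration; it is not part of the exercise.
    """
    positions = h_out * w_out
    # Full convolution: every output channel sees every input channel and tap.
    standard = positions * k_h * k_w * c_in * c_out
    # Depthwise: one k_h x k_w filter per input channel.
    depthwise = positions * k_h * k_w * c_in
    # Pointwise: a C_in -> C_out projection at every output position.
    pointwise = positions * c_in * c_out
    return standard, depthwise + pointwise


standard, separable = conv_multiply_counts(
    h_out=32, w_out=32, k_h=3, k_w=3, c_in=64, c_out=128
)
ratio = separable / standard  # equals 1/c_out + 1/(k_h * k_w)
```

For these example sizes the ratio is about 0.119, so the separable version needs roughly 8.4 times fewer multiplications, slightly below the k² = 9 upper bound.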

Signature: def depthwise_separable_conv(x, dw_kernel, pw_kernel)

x: (H, W, C_in)
dw_kernel: (kH, kW, C_in) — one spatial filter per input channel (no C_out axis)
pw_kernel: (C_out, C_in) — 1×1 pointwise weights
Returns: (H_out, W_out, C_out) where H_out = H - kH + 1 and W_out = W - kW + 1 (stride 1, no padding)

Step 1 — depthwise: For each channel c and output position (i, j):

dw_out[i, j, c] = sum over (kh, kw) of x[i+kh, j+kw, c] * dw_kernel[kh, kw, c]
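The depthwise formula above could be sketched in NumPy as follows. This is one possible approach, not a reference solution: it assumes NumPy arrays as inputs, stride 1, and no padding, and the function name depthwise_conv is a hypothetical helper. It loops over kernel taps rather than output positions, so each iteration adds one shifted slice of x.

```python
import numpy as np


def depthwise_conv(x, dw_kernel):
    """Valid, stride-1 depthwise convolution: one spatial filter per channel."""
    H, W, C_in = x.shape
    kH, kW, kC = dw_kernel.shape
    assert kC == C_in, "dw_kernel needs one filter per input channel"
    H_out, W_out = H - kH + 1, W - kW + 1

    dw_out = np.zeros((H_out, W_out, C_in), dtype=np.result_type(x, dw_kernel))
    for kh in range(kH):
        for kw in range(kW):
            # Window of x shifted by (kh, kw), scaled by that tap's per-channel
            # weights. Broadcasting over the last axis keeps channels separate.
            dw_out += x[kh:kh + H_out, kw:kw + W_out, :] * dw_kernel[kh, kw, :]
    return dw_out
```

Because the multiplication broadcasts only along the channel axis, no value from one channel ever contributes to another, which is the defining property of the depthwise step.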

Step 2 — pointwise: mix the depthwise-output channels into the C_out channels using the pw_kernel 1×1 weights at every spatial position. Equivalent to a per-pixel linear projection from C_in to C_out.
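Putting both steps together, a minimal sketch of the full function might look like this. It assumes NumPy array inputs, stride 1, and no padding, and it is one of several reasonable ways to write it. The pointwise step is expressed as a matrix product over the last axis, which applies the same C_in to C_out projection at every pixel.

```python
import numpy as np


def depthwise_separable_conv(x, dw_kernel, pw_kernel):
    """Depthwise separable convolution (valid, stride 1).

    x: (H, W, C_in), dw_kernel: (kH, kW, C_in), pw_kernel: (C_out, C_in).
    Returns an array of shape (H - kH + 1, W - kW + 1, C_out).
    """
    H, W, C_in = x.shape
    kH, kW, _ = dw_kernel.shape
    H_out, W_out = H - kH + 1, W - kW + 1

    # Step 1: depthwise. Accumulate one shifted window per kernel tap;
    # channels never mix because broadcasting is along the channel axis only.
    dw_out = np.zeros((H_out, W_out, C_in), dtype=np.result_type(x, dw_kernel))
    for kh in range(kH):
        for kw in range(kW):
            dw_out += x[kh:kh + H_out, kw:kw + W_out, :] * dw_kernel[kh, kw, :]

    # Step 2: pointwise. (H_out, W_out, C_in) @ (C_in, C_out) applies the
    # 1x1 projection independently at each spatial position.
    return dw_out @ pw_kernel.T


# Example with all-ones inputs: each depthwise output is 9 (3x3 taps),
# and the pointwise step sums 3 channels, giving 27 everywhere.
out = depthwise_separable_conv(
    np.ones((5, 5, 3)), np.ones((3, 3, 3)), np.ones((4, 3))
)
```

Here out has shape (3, 3, 4), matching H_out = 5 - 3 + 1 and C_out = 4.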

Math

\text{out}[i, j, c_{out}] = \sum_{c_{in}} \sum_{k_h, k_w} x[i + k_h, j + k_w, c_{in}] \cdot w_{dw}[k_h, k_w, c_{in}] \cdot w_{pw}[c_{out}, c_{in}]
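The combined formula above sums over input channels and kernel taps in a single expression, and it can be evaluated that way with np.einsum. The following is a sketch under assumptions: it relies on numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), and the function name depthwise_separable_conv_fused is hypothetical. It is mainly useful as a cross-check for a two-step implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def depthwise_separable_conv_fused(x, dw_kernel, pw_kernel):
    """Evaluate the single-sum formula directly (valid, stride 1)."""
    kH, kW, _ = dw_kernel.shape
    # windows[i, j, c, a, b] == x[i + a, j + b, c]
    windows = sliding_window_view(x, (kH, kW), axis=(0, 1))
    # Sum over taps (a, b) and input channel c for each output channel o.
    return np.einsum("ijcab,abc,oc->ijo", windows, dw_kernel, pw_kernel)
```

The einsum subscripts mirror the indices in the formula: i and j are output positions, a and b are the kernel offsets, c is the input channel, and o is the output channel.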
