NumPy Broadcasting: Bias Add

In a transformer block you routinely write x + b, where x has shape (B, T, D) and b has shape (D,). That this works with no reshaping is NumPy broadcasting in action, and it is the foundation for every vectorized formula you write later in this track.

Implement: def bias_add(x, b) returning x + b.

Shapes:

x: (B, T, D) — batch of T-token sequences with D-dim features
b: (D,) — per-feature bias
output: (B, T, D)
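
A minimal sketch of the function with these shapes, assuming x and b are NumPy arrays. The sizes B=2, T=3, D=4 are illustrative and not part of the problem statement.

```python
import numpy as np

def bias_add(x, b):
    """Add a per-feature bias b of shape (D,) to x of shape (B, T, D)."""
    return x + b

# Illustrative sizes (assumed for this example): B=2, T=3, D=4.
x = np.zeros((2, 3, 4))
b = np.arange(4.0)          # one bias value per feature

y = bias_add(x, b)
print(y.shape)              # (2, 3, 4)
```

Because x is all zeros here, every row y[i, t, :] is simply a copy of b, which makes the per-feature behavior easy to inspect.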

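Why this works: NumPy aligns shapes from the right and compares them one dimension at a time; two dimensions are compatible when they are equal or one of them is 1. (B, T, D) and (D,) align as (B, T, D) vs (_, _, D). The missing left dims of b are treated as size 1 and stretched to B and T. The trailing dim D matches in both, so the operation is legal.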
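
The alignment can be checked without doing any arithmetic. The sketch below assumes NumPy 1.20 or newer (for np.broadcast_shapes) and uses illustrative sizes.

```python
import numpy as np

B, T, D = 2, 3, 4  # illustrative sizes

# Ask NumPy what shape the two operands broadcast to.
out_shape = np.broadcast_shapes((B, T, D), (D,))

# broadcast_to returns a read-only view of b "stretched" to the full shape.
# The stretched axes have stride 0, so no data is copied.
b = np.arange(D)
stretched = np.broadcast_to(b, out_shape)
```

The zero strides on the first two axes show what "stretched" means in practice: the same D values are reused for every batch element and every token.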

You can also write x + b[None, None, :] to make the broadcast explicit — b[None, None, :] has shape (1, 1, D), which broadcasts identically.
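
A quick check that the implicit and explicit forms agree, using random data with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4))   # (B, T, D), sizes chosen for illustration
b = rng.standard_normal(4)           # (D,)

b_explicit = b[None, None, :]        # shape (1, 1, D); None is np.newaxis

implicit = x + b
explicit = x + b_explicit
same = np.array_equal(implicit, explicit)
```

Both expressions perform the same elementwise additions, so the results are identical. The explicit form mainly serves readers of the code, since it documents which axis the bias lives on.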

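The trap: x + b where b has shape (B,) does not add a per-batch bias. (B, T, D) aligned with (_, _, B) compares D against B, so NumPy raises a ValueError unless B == D (or one of them is 1). Worse, when B == D the sum succeeds silently and adds b along the feature axis instead of the batch axis. Bias on the wrong axis is one of the most common shape bugs in ML code; getting comfortable with broadcasting alignment is how you avoid it.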
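
The sketch below reproduces the trap with illustrative sizes. It also covers the case where B happens to equal D: no error is raised there, and the values land on the last axis rather than the batch axis. The names per_batch, y_wrong, and y_intended are made up for this example.

```python
import numpy as np

# Case 1: B != D. The trailing dims (4 vs 2) are incompatible.
x = np.zeros((2, 3, 4))                  # (B, T, D)
per_batch = np.arange(2.0)               # shape (B,), not a per-feature bias
try:
    x + per_batch
    raised = False
except ValueError:
    raised = True                        # shapes cannot be broadcast together

# Case 2: B == D. The addition succeeds, but along the wrong axis.
x_sq = np.zeros((4, 3, 4))               # B == D == 4
per_batch_sq = np.arange(4.0)
y_wrong = x_sq + per_batch_sq                        # varies along the last axis
y_intended = x_sq + per_batch_sq[:, None, None]      # varies along the batch axis
```

Indexing with [:, None, None] gives the array shape (B, 1, 1), which is one way to state the intended axis explicitly so the result does not depend on B and D being different.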

Math

y_{i, t, d} = x_{i, t, d} + b_{d}, where i indexes the batch, t the token, and d the feature
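
The formula can be read as three nested loops. The reference version below is a sketch for checking understanding, not a recommended implementation. The helper name bias_add_loops is hypothetical, and the batch index is called i to keep it distinct from the bias b.

```python
import numpy as np

def bias_add_loops(x, b):
    """Elementwise reference: y[i, t, d] = x[i, t, d] + b[d]."""
    B, T, D = x.shape
    y = np.empty_like(x)
    for i in range(B):          # batch index
        for t in range(T):      # token index
            for d in range(D):  # feature index
                y[i, t, d] = x[i, t, d] + b[d]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4))   # illustrative sizes
b = rng.standard_normal(4)

matches = np.allclose(bias_add_loops(x, b), x + b)
```

The broadcasted x + b computes the same values in a single vectorized call, which is why the loop form is useful only as a specification.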
