In every transformer block you write x + b where x has shape (B, T, D) and b has shape (D,). The fact that this just works is the entire point of NumPy broadcasting — and it is the foundation for every vectorized formula you write later in this track.
Implement: def bias_add(x, b) returning x + b.
Shapes:
x: (B, T, D), a batch of T-token sequences with D-dim features
b: (D,), a per-feature bias
returns: (B, T, D)
Why this works: NumPy aligns shapes from the right. (B, T, D) and (D,) align as (B, T, D) vs (_, _, D). Missing left dims are treated as size 1 and stretched. Trailing dim D matches in both, so the operation is legal.
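The right-alignment rule can be checked directly. This is a minimal sketch with made-up sizes (B=2, T=3, D=4 are illustrative, not part of the problem); it uses np.broadcast_shapes, which is available in NumPy 1.20 and later, to predict the result shape without allocating data.

```python
import numpy as np

B, T, D = 2, 3, 4  # illustrative sizes, chosen only for this sketch
x = np.arange(B * T * D, dtype=np.float64).reshape(B, T, D)
b = np.array([10.0, 20.0, 30.0, 40.0])  # shape (D,)

# Predict the broadcast shape from the two shapes alone.
predicted = np.broadcast_shapes(x.shape, b.shape)

out = x + b
out_shape = out.shape

# Every (batch, token) position receives the same bias vector.
same_everywhere = all(
    np.array_equal(out[i, j], x[i, j] + b)
    for i in range(B)
    for j in range(T)
)
```

The predicted shape and the actual result shape agree, and the same D-vector is added at each of the B * T positions.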
You can also write x + b[None, None, :] to make the broadcast explicit — b[None, None, :] has shape (1, 1, D), which broadcasts identically.
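To confirm that the implicit and explicit forms agree, here is a small sketch on random data (the sizes and the seed are arbitrary assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.standard_normal((2, 3, 4))  # (B, T, D) with illustrative sizes
b = rng.standard_normal(4)          # (D,)

b_explicit = b[None, None, :]       # insert two leading axes of size 1
explicit_shape = b_explicit.shape

implicit = x + b           # NumPy pads the missing left dims itself
explicit = x + b_explicit  # the same padding, written out by hand

results_match = np.array_equal(implicit, explicit)
```

Writing the broadcast out by hand costs nothing at runtime, since b[None, None, :] is a view rather than a copy, and it documents which axis the bias belongs to.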
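The trap: x + b where b has shape (B,) does not broadcast in general. (B, T, D) aligned with (_, _, B) requires D == B, and when B happens to equal D the addition succeeds silently but applies the bias along the feature axis instead of the batch axis. Bias on the wrong axis is one of the most common shape bugs in ML code; getting comfortable with broadcasting alignment is how you avoid it.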
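Both failure modes of the wrong-axis bias can be reproduced in a few lines. This sketch uses illustrative sizes; the names bad_b and per_batch are hypothetical and chosen only for the example.

```python
import numpy as np

# Case 1: B != D, so NumPy rejects the addition outright.
x = np.zeros((2, 3, 4))  # (B, T, D) = (2, 3, 4)
bad_b = np.ones(2)       # shape (B,), meant for the batch axis
try:
    x + bad_b
    raised = False
except ValueError:
    raised = True

# Case 2: B == D, so the addition succeeds but targets the wrong axis.
x_sq = np.zeros((4, 3, 4))                  # B and D are both 4
per_batch = np.array([1.0, 2.0, 3.0, 4.0])  # intended: one value per batch
silent = x_sq + per_batch                   # lands on the last (feature) axis

# Making the axis explicit gives the per-batch result that was intended.
intended = x_sq + per_batch[:, None, None]  # shape (4, 1, 1)
```

The second case is the dangerous one: no exception is raised, so only a shape-aware test or an explicit index such as per_batch[:, None, None] reveals the mistake.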
import numpy as np

def bias_add(x, b):
    # x: (B, T, D), b: (D,). b broadcasts across the leading B and T axes.
    return x + b