Parameter Count Sweep

Vectorize the standard transformer parameter-count formula (param-count-transformer, problem #109) over arrays of architectural configs.

Implement: def param_count_sweep(d_model, n_layers, vocab, n_heads=None) where:

d_model, n_layers, vocab are 1-D integer arrays of shape (N,) — paired per config.
n_heads is unused (kept for API compatibility with the scalar version; the formula doesn't actually need it).

Return shape (N,) of int64 — parameter count per config.

Per config (GPT-style, untied embeddings): the per-layer block (attention + FFN with 4× expansion) contributes 12 · d² parameters; embeddings plus a separate lm_head contribute 2 · V · d. See the math reference below for the closed form.

Why int64 inputs? d * d * L for a 7B-class config is 4096^2 * 32 ≈ 5e8 — fits in int32, but d^2 * L for a 70B-class config (d=8192, L=80) is 5.4e9 — overflows int32 silently. Cast to int64 upfront.

Math

N_{i} = 12 \cdot d_{i}^{2} \cdot L_{i} + 2 \cdot V_{i} \cdot d_{i}

Asked at

Vectorize the standard transformer parameter-count formula (param-count-transformer, problem #109) over arrays of architectural configs.

Implement: def param_count_sweep(d_model, n_layers, vocab, n_heads=None) where:

d_model, n_layers, vocab are 1-D integer arrays of shape (N,) — paired per config.
n_heads is unused (kept for API compatibility with the scalar version; the formula doesn't actually need it).

Return shape (N,) of int64 — parameter count per config.

Math

N_{i} = 12 \cdot d_{i}^{2} \cdot L_{i} + 2 \cdot V_{i} \cdot d_{i}

Asked at

229. Parameter Count Sweep

229. Parameter Count Sweep