Given the architectural hyperparameters of a decoder-only Transformer, compute the total number of trainable parameters.
Signature: def transformer_params(d_model: int, n_layers: int, vocab_size: int, n_heads: int = None) -> int
Assumptions (standard GPT-style):
- Attention: each layer has 4 projection matrices (Q, K, V, O), each d_model x d_model.
- MLP: each layer has two matrices, d x 4d and 4d x d.
- Embeddings and lm_head are not tied; count both as vocab_size * d_model.
- Biases, layer norms, and positional embeddings are ignored (the formula below omits them).

Formula:
- Attention: 4 * d^2 (Q, K, V, O)
- MLP: 8 * d^2 (two matrices of size d x 4d)
- Embeddings + lm_head: 2 * V * d
- Total: 12 * d^2 * L + 2 * V * d

Example: For GPT-2 small (d=768, L=12, V=50257), the formula yields about 162M parameters; the commonly cited ~124M figure for GPT-2 small assumes the embedding and lm_head weights are tied, so V * d is counted only once.
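A minimal sketch of the counting function under the assumptions above (untied embeddings, no biases, layer norms, or positional embeddings). The GPT-2 small check at the bottom is illustrative:

```python
def transformer_params(d_model: int, n_layers: int, vocab_size: int, n_heads: int = None) -> int:
    """Count trainable parameters of a GPT-style decoder-only Transformer.

    Assumes untied embeddings/lm_head and ignores biases, layer norms, and
    positional embeddings, matching the formula above. n_heads is accepted
    for signature compatibility but does not change the count: the per-head
    Q/K/V/O projections concatenate back into d_model x d_model matrices.
    """
    attn = 4 * d_model * d_model            # Q, K, V, O projections
    mlp = 2 * d_model * (4 * d_model)       # up-projection d -> 4d and down-projection 4d -> d
    per_layer = attn + mlp                  # 12 * d_model^2
    embeddings = 2 * vocab_size * d_model   # token embedding + untied lm_head
    return n_layers * per_layer + embeddings


# GPT-2 small with untied embeddings: 12 * 768^2 * 12 + 2 * 50257 * 768 = 162,129,408
print(transformer_params(768, 12, 50257))  # ~162M
```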