
Transformer Parameter Count (Easy)

Given the architectural hyperparameters of a decoder-only Transformer, compute the total number of trainable parameters.

Signature: def transformer_params(d_model: int, n_layers: int, vocab_size: int, n_heads: int | None = None) -> int

Note: n_heads does not change the total. Splitting d_model across heads (head_dim = d_model / n_heads) leaves the combined Q, K, V, O weights at 4 * d_model^2 regardless of the split.

Assumptions (standard GPT-style):

  • Each block has multi-head self-attention with separate Q, K, V, and output projections (each d_model x d_model).
  • The FFN uses a 4x hidden expansion: two linear layers of shape d x 4d and 4d x d.
  • Token embedding and lm_head are not tied — count both as vocab_size * d_model.
  • Ignore biases and LayerNorm parameters (negligible vs. matmuls).
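
A quick self-check of the per-layer count, as a minimal sketch assuming PyTorch is available (the module layout below is illustrative, not the required solution):

    import torch.nn as nn

    d = 768  # illustrative d_model

    # One GPT-style block under the assumptions above:
    # four d x d attention projections plus the 4x-expanded FFN.
    block = nn.ModuleDict({
        "q": nn.Linear(d, d, bias=False),
        "k": nn.Linear(d, d, bias=False),
        "v": nn.Linear(d, d, bias=False),
        "o": nn.Linear(d, d, bias=False),
        "ffn_up": nn.Linear(d, 4 * d, bias=False),
        "ffn_down": nn.Linear(4 * d, d, bias=False),
    })

    n_params = sum(p.numel() for p in block.parameters())
    assert n_params == 12 * d * d  # 4*d^2 attention + 8*d^2 FFN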

Formula (writing d = d_model, L = n_layers, V = vocab_size):

  • Attention per layer: 4 * d^2 (Q, K, V, O)
  • FFN per layer: 8 * d^2 (two matrices, d x 4d and 4d x d, each 4 * d^2)
  • Embeddings + lm_head: 2 * V * d
  • Total: 12 * d^2 * L + 2 * V * d
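
A direct implementation of the formula, as a sketch in plain Python (no numpy needed; n_heads is accepted only to match the signature and is unused, per the note above):

    def transformer_params(d_model: int, n_layers: int, vocab_size: int,
                           n_heads: int | None = None) -> int:
        # Per layer: 4*d^2 for attention (Q, K, V, O) + 8*d^2 for the FFN.
        per_layer = 12 * d_model * d_model
        # Untied token embedding and lm_head: vocab_size * d_model each.
        embeddings = 2 * vocab_size * d_model
        return n_layers * per_layer + embeddings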

Example: For GPT-2 small's hyperparameters (d=768, L=12, V=50257), the formula gives 12 * 768^2 * 12 + 2 * 50257 * 768 = 162,129,408, i.e. ~162M. (The commonly quoted ~124M for GPT-2 small comes from tying the embedding and lm_head, which counts V * d only once.)
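
Checking that example with the sketch above:

    >>> transformer_params(768, 12, 50257)
    162129408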

Test cases: GPT-2 small, tiny model, GPT-3 13B-ish (Premium).