
Transformer Parameter Count (Easy)

Given the architectural hyperparameters of a decoder-only Transformer, compute the total number of trainable parameters.

Signature: def transformer_params(d_model: int, n_layers: int, vocab_size: int, n_heads: int | None = None) -> int

Note: n_heads does not change the total. Splitting d_model across heads (head_dim = d_model / n_heads) leaves the combined Q, K, V, O weights at 4 * d_model^2 regardless of the split.

Assumptions (standard GPT-style):

  • Each block has multi-head self-attention with separate Q, K, V, and output projections (each d_model x d_model).
  • The FFN uses a 4x hidden expansion: two linear layers of shape d x 4d and 4d x d.
  • Token embedding and lm_head are not tied — count both as vocab_size * d_model.
  • Ignore biases and LayerNorm parameters (negligible vs. matmuls).
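
A quick self-check of the per-layer count, as a minimal sketch assuming PyTorch is available (the module layout below is illustrative, not the required solution):

    import torch.nn as nn

    d = 768  # illustrative d_model

    # One GPT-style block under the assumptions above:
    # four d x d attention projections plus the 4x-expanded FFN.
    block = nn.ModuleDict({
        "q": nn.Linear(d, d, bias=False),
        "k": nn.Linear(d, d, bias=False),
        "v": nn.Linear(d, d, bias=False),
        "o": nn.Linear(d, d, bias=False),
        "ffn_up": nn.Linear(d, 4 * d, bias=False),
        "ffn_down": nn.Linear(4 * d, d, bias=False),
    })

    n_params = sum(p.numel() for p in block.parameters())
    assert n_params == 12 * d * d  # 4*d^2 attention + 8*d^2 FFN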

Formula (writing d = d_model, L = n_layers, V = vocab_size):

  • Attention per layer: 4 * d^2 (Q, K, V, O)
  • FFN per layer: 8 * d^2 (two matrices, d x 4d and 4d x d, each 4 * d^2)
  • Embeddings + lm_head: 2 * V * d
  • Total: 12 * d^2 * L + 2 * V * d
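
A direct implementation of the formula, as a sketch in plain Python (no numpy needed; n_heads is accepted only to match the signature and is unused, per the note above):

    def transformer_params(d_model: int, n_layers: int, vocab_size: int,
                           n_heads: int | None = None) -> int:
        # Per layer: 4*d^2 for attention (Q, K, V, O) + 8*d^2 for the FFN.
        per_layer = 12 * d_model * d_model
        # Untied token embedding and lm_head: vocab_size * d_model each.
        embeddings = 2 * vocab_size * d_model
        return n_layers * per_layer + embeddings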

Example: For GPT-2 small's hyperparameters (d=768, L=12, V=50257), the formula gives 12 * 768^2 * 12 + 2 * 50257 * 768 = 162,129,408, i.e. ~162M. (The commonly quoted ~124M for GPT-2 small comes from tying the embedding and lm_head, which counts V * d only once.)
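
Checking that example with the sketch above:

    >>> transformer_params(768, 12, 50257)
    162129408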

Test cases: GPT-2 small, tiny model, GPT-3 13B-ish (Premium).