A full Transformer encoder block combines self-attention + FFN with Pre-LN residuals: each sublayer is wrapped so that the input is normalized before the sublayer and the residual is added on top (see the math reference). The block applies, in order, a Pre-LN self-attention sublayer, then a Pre-LN feed-forward sublayer.
The self-attention sublayer projects the normalized input into Q/K/V using Wq, Wk, Wv, runs standard (non-causal) scaled dot-product softmax attention, and applies the output projection Wo before the residual add.
The FFN sublayer is a two-layer MLP d_model → d_ff → d_model with the exact (erf-based) GELU activation in the middle; bias b1 is added before GELU and bias b2 is added at the very end (before the residual).
Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)
x: (seq_len, d_model)Wq, Wk, Wv: (d_model, d_model) — QKV projectionsWo: (d_model, d_model) — output projectionW1: (d_ff, d_model), b1: (d_ff,) — FFN layer 1W2: (d_model, d_ff), b2: (d_model,) — FFN layer 2gamma1, beta1: (d_model,) — LN1 params (before attention)gamma2, beta2: (d_model,) — LN2 params (before FFN)(seq_len, d_model)LayerNorm epsilon: 1e-5. Use the exact GELU (scipy.special.erf), not the tanh approximation.
Math
Related problems
Asked at
import numpy as np
def transformer_encoder_block(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?