A full Transformer encoder block combines a self-attention sublayer and a position-wise feed-forward network (FFN) with Pre-LN residual connections. Each sublayer is wrapped as:
x = x + sublayer(LayerNorm(x))
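For concreteness, a minimal NumPy sketch of this wrapper, assuming the LayerNorm epsilon of 1e-5 and the per-feature gamma/beta scale-and-shift given in the spec below; the `sublayer` argument here is any callable taking and returning a (seq_len, d_model) array:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (row) over the feature dimension, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln_residual(x, sublayer, gamma, beta):
    # Pre-LN: normalize first, apply the sublayer, then add the residual input back.
    return x + sublayer(layer_norm(x, gamma, beta))
```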
The block:
x = x + Attention(LN1(x)) @ Wo
x = x + GELU(LN2(x) @ W1.T + b1) @ W2.T + b2

Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)
x: (seq_len, d_model)
Wq, Wk, Wv: (d_model, d_model) — QKV projections
Wo: (d_model, d_model) — output projection
W1: (d_ff, d_model), b1: (d_ff,) — FFN layer 1
W2: (d_model, d_ff), b2: (d_model,) — FFN layer 2
gamma1, beta1: (d_model,) — LN1 params (before attention)
gamma2, beta2: (d_model,) — LN2 params (before FFN)
Returns: (seq_len, d_model)

Use standard (non-causal) self-attention. GELU: 0.5 * x * (1 + erf(x / sqrt(2))). LayerNorm epsilon: 1e-5.
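A minimal NumPy sketch of the whole block under the spec above. Single-head attention with 1/sqrt(d_model) scaling is an assumption (the spec does not mention heads), and `scipy.special.erf` is used for the exact GELU:

```python
import numpy as np
from scipy.special import erf

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-token normalization over the feature dimension with learned scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def softmax(scores):
    # Row-wise softmax with max subtraction for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    d_model = x.shape[-1]

    # Attention sublayer with Pre-LN residual.
    h = layer_norm(x, gamma1, beta1)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv                  # each (seq_len, d_model)
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V    # non-causal self-attention
    x = x + attn @ Wo

    # Feed-forward sublayer with Pre-LN residual.
    h = layer_norm(x, gamma2, beta2)
    x = x + gelu(h @ W1.T + b1) @ W2.T + b2
    return x
```

With W1 stored as (d_ff, d_model) and W2 as (d_model, d_ff), the transposes in the FFN line keep the shapes flowing (seq_len, d_model) -> (seq_len, d_ff) -> (seq_len, d_model), matching the stated return shape.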