
Transformer Encoder Block (Medium)

A full Transformer encoder block combines a self-attention sublayer and a feed-forward network (FFN) using Pre-LN residual connections. Each sublayer is wrapped as:

x = x + sublayer(LayerNorm(x))
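
In NumPy this wrap is only a few lines; a minimal sketch (pre_ln_residual and its sublayer callable are illustrative names, not part of the required signature; the epsilon matches the 1e-5 specified below):

    import numpy as np

    def pre_ln_residual(x, sublayer, gamma, beta, eps=1e-5):
        # LayerNorm over the last axis: normalize, then scale and shift
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)   # biased variance, standard for LN
        normed = (x - mean) / np.sqrt(var + eps) * gamma + beta
        # Residual: add the sublayer's output back onto the unnormalized input
        return x + sublayer(normed)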

The block:

  1. Pre-LN Self-Attention: x = x + Attention(LN1(x)) @ Wo
  2. Pre-LN FFN: x = x + GELU(LN2(x) @ W1.T + b1) @ W2.T + b2
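
The attention formula itself is not restated here; a minimal sketch of the standard single-head, non-causal self-attention implied by the (d_model, d_model) projection shapes, with the Wo projection left to the caller as in step 1:

    import numpy as np

    def self_attention(x, Wq, Wk, Wv):
        # Project to queries, keys, values: each (seq_len, d_model)
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        # Scaled dot-product scores; with a single head, d_k == d_model
        scores = Q @ K.T / np.sqrt(x.shape[-1])
        # Row-wise softmax, shifted by the row max for numerical stability
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V                    # (seq_len, d_model)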

Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

  • x: (seq_len, d_model)
  • Wq, Wk, Wv: (d_model, d_model) — QKV projections
  • Wo: (d_model, d_model) — output projection
  • W1: (d_ff, d_model), b1: (d_ff,) — FFN layer 1
  • W2: (d_model, d_ff), b2: (d_model,) — FFN layer 2
  • gamma1, beta1: (d_model,) — LN1 params (before attention)
  • gamma2, beta2: (d_model,) — LN2 params (before FFN)
  • Returns: (seq_len, d_model)
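
Before submitting, a quick shape check with randomly generated parameters of the sizes above can catch transposition bugs; a hypothetical harness mirroring the seeded test sizes (the grader's actual RNG and tolerances are unknown, so this verifies shapes only):

    import numpy as np

    rng = np.random.default_rng(42)
    seq_len, d_model, d_ff = 3, 4, 8

    x = rng.standard_normal((seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    W1, b1 = rng.standard_normal((d_ff, d_model)), rng.standard_normal(d_ff)
    W2, b2 = rng.standard_normal((d_model, d_ff)), rng.standard_normal(d_model)
    gamma1 = gamma2 = np.ones(d_model)
    beta1 = beta2 = np.zeros(d_model)

    # transformer_encoder_block is your implementation of the required signature
    out = transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                                    gamma1, beta1, gamma2, beta2)
    assert out.shape == (seq_len, d_model)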

Use standard (non-causal) self-attention. GELU: 0.5 * x * (1 + erf(x / sqrt(2))). LayerNorm epsilon: 1e-5.
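
NumPy has no built-in erf, so the exact GELU needs one from elsewhere; a minimal sketch assuming scipy.special.erf is available in the runner (math.erf also works but is scalar-only):

    import numpy as np
    from scipy.special import erf   # assumption: SciPy is available

    def gelu(x):
        # Exact (erf-based) GELU, per the definition above
        return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))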

Tags: Math
Language: Python (NumPy)

Test Results

  • seed 42 — seq_len=3, d_model=4, d_ff=8
  • seed 1 — seq_len=3, d_model=4, d_ff=8 (hidden, Premium)