TorchedUp

76. Transformer Encoder Block

Medium

A full Transformer encoder block combines self-attention + FFN with Pre-LN residuals: each sublayer is wrapped so that the input is normalized before the sublayer and the residual is added on top (see the math reference). The block applies, in order, a Pre-LN self-attention sublayer, then a Pre-LN feed-forward sublayer.
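The Pre-LN wrapping described above can be sketched as a small helper. This is a minimal illustration, not the reference solution: `layer_norm` and `pre_ln_residual` are hypothetical names, the normalization is assumed to run over the feature axis with population variance, and epsilon is taken as 1e-5 per the problem statement.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (row) over the feature axis, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance (ddof=0), assumed
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def pre_ln_residual(x, sublayer, gamma, beta):
    # Hypothetical helper: normalize first, run the sublayer, add the residual.
    return x + sublayer(layer_norm(x, gamma, beta))
```

A quick sanity check on this sketch: if `sublayer` returns all zeros, `pre_ln_residual` returns `x` unchanged, because the residual path bypasses the normalization.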

The self-attention sublayer projects the normalized input into Q/K/V using Wq, Wk, Wv, runs standard (non-causal) scaled dot-product softmax attention, and applies the output projection Wo before the residual add.
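One possible reading of this sublayer is sketched below. Two things here are assumptions rather than facts from the problem: the attention is treated as single-head (the signature has no head count), so the scale is 1/sqrt(d_model), and the square weights are assumed to use an (out, in) layout, applied as `h @ W.T`. If the intended convention is `h @ W`, drop the transposes. The name `self_attention` is hypothetical.

```python
import numpy as np


def self_attention(h, Wq, Wk, Wv, Wo):
    # h is the already-normalized input, shape (seq_len, d_model).
    # Assumption: weights are (out, in), so projections are h @ W.T.
    d_model = h.shape[-1]
    q, k, v = h @ Wq.T, h @ Wk.T, h @ Wv.T

    # Scaled dot-product scores; single head assumed, so d_k == d_model.
    scores = q @ k.T / np.sqrt(d_model)

    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # No causal mask: every position attends to every other position.
    return (weights @ v) @ Wo.T
```

With `Wq` and `Wk` set to zero, all scores are equal and each output row becomes the mean of the value rows, which is a convenient check that the softmax runs over the key axis.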

The FFN sublayer is a two-layer MLP d_model → d_ff → d_model with the exact (erf-based) GELU activation in the middle; bias b1 is added before GELU and bias b2 is added at the very end (before the residual).
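The FFN path might look like the sketch below. The transposes follow from the listed shapes, since `W1` is (d_ff, d_model) and `W2` is (d_model, d_ff). The helper names `gelu_exact` and `ffn` are hypothetical, and the fallback to `math.erf` is only there so the sketch runs without SciPy installed; both compute the same error function.

```python
import numpy as np

try:
    from scipy.special import erf  # the erf the problem statement names
except ImportError:  # fallback so the sketch still runs without SciPy
    import math
    erf = np.vectorize(math.erf)


def gelu_exact(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))


def ffn(h, W1, b1, W2, b2):
    # h: (seq_len, d_model); W1: (d_ff, d_model); W2: (d_model, d_ff).
    hidden = gelu_exact(h @ W1.T + b1)  # b1 added before GELU -> (seq_len, d_ff)
    return hidden @ W2.T + b2           # b2 added last -> (seq_len, d_model)
```

The residual add is deliberately left out of `ffn`, so the function returns only the sublayer output.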

Signature: def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

  • x: (seq_len, d_model)
  • Wq, Wk, Wv: (d_model, d_model) — QKV projections
  • Wo: (d_model, d_model) — output projection
  • W1: (d_ff, d_model), b1: (d_ff,) — FFN layer 1
  • W2: (d_model, d_ff), b2: (d_model,) — FFN layer 2
  • gamma1, beta1: (d_model,) — LN1 params (before attention)
  • gamma2, beta2: (d_model,) — LN2 params (before FFN)
  • Returns: (seq_len, d_model)

LayerNorm epsilon: 1e-5. Use the exact GELU (scipy.special.erf), not the tanh approximation.
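To see why the choice of GELU variant matters, the sketch below compares the exact form against the common tanh approximation on a grid of inputs. The function names and the grid are illustrative assumptions; the `math.erf` fallback is only there so the snippet runs without SciPy.

```python
import numpy as np

try:
    from scipy.special import erf
except ImportError:  # fallback so the sketch still runs without SciPy
    import math
    erf = np.vectorize(math.erf)


def gelu_exact(z):
    # Exact (erf-based) GELU, the variant the problem asks for.
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))


def gelu_tanh(z):
    # Widely used tanh approximation, shown only for comparison.
    inner = np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)
    return 0.5 * z * (1.0 + np.tanh(inner))


z = np.linspace(-4.0, 4.0, 801)
max_gap = np.max(np.abs(gelu_exact(z) - gelu_tanh(z)))
print(f"max |exact - tanh| on [-4, 4]: {max_gap:.2e}")
```

The gap is small but nonzero, so a solution built on the tanh form could fail a comparison that uses a tight tolerance.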

Math

$$x = x + \mathrm{Attn}(\mathrm{LN}_1(x)); \qquad x = x + \mathrm{FFN}(\mathrm{LN}_2(x))$$
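Putting the two update steps together, one possible end-to-end sketch follows. It is not the site's reference solution. It assumes single-head attention and an (out, in) weight layout applied as `h @ W.T` throughout; the FFN shapes force that layout for `W1` and `W2`, while for the square matrices it is a guess. `_layer_norm` and `_softmax` are hypothetical helpers.

```python
import numpy as np

try:
    from scipy.special import erf
except ImportError:  # fallback so the sketch still runs without SciPy
    import math
    erf = np.vectorize(math.erf)


def _layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def _softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    d_model = x.shape[-1]

    # Step 1: x = x + Attn(LN1(x)), single head assumed.
    h = _layer_norm(x, gamma1, beta1)
    q, k, v = h @ Wq.T, h @ Wk.T, h @ Wv.T
    attn = _softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = x + attn @ Wo.T

    # Step 2: x = x + FFN(LN2(x)), exact GELU between the two layers.
    h = _layer_norm(x, gamma2, beta2)
    z = h @ W1.T + b1
    hidden = 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))
    return x + hidden @ W2.T + b2
```

With `Wo`, `W2`, and `b2` all zero, both sublayers contribute nothing and the block returns its input, which gives a cheap check on the residual wiring.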

Related problems

  • Transformer Encoder Block (PyTorch) (medium, PyTorch)


NumPy

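import numpy as np


def transformer_encoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    # Starter stub: replace `pass` with your implementation.
    pass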
