TorchedUp

81. Transformer Decoder Block

Hard

A GPT-style Transformer decoder block has two sublayers (Pre-LayerNorm):

  1. Masked (causal) self-attention — each position can only attend to earlier positions
  2. Feed-forward network (FFN) — two linear layers with GeLU activation

Both sublayers use Pre-LN: LayerNorm is applied to the sublayer's input (the residual stream) before the sublayer runs, and the sublayer's output is added back to the un-normalized input (residual connection).
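The Pre-LN pattern can be sketched as a small wrapper. This is a minimal illustration, not the reference solution: the helper names `layer_norm` and `pre_ln_residual` are hypothetical, and `eps=1e-5` is an assumed value that the problem statement does not specify.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (token) over the feature dimension, then scale and shift.
    # eps is an assumed value; the problem statement does not give one.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln_residual(x, sublayer, gamma, beta):
    # Pre-LN: normalize the input, run the sublayer on the normalized values,
    # then add the result to the un-normalized input.
    return x + sublayer(layer_norm(x, gamma, beta))

x = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 6.0]])
gamma, beta = np.ones(4), np.zeros(4)
normed = layer_norm(x, gamma, beta)
# A sublayer that outputs zeros leaves x unchanged, which shows the residual path.
out = pre_ln_residual(x, lambda h: 0.0 * h, gamma, beta)
```

Because the residual path bypasses LayerNorm, the input passes straight through whenever the sublayer contributes nothing.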

Signature: def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2)

  • x: (seq, d) — input sequence
  • Wq, Wk, Wv, Wo: (d, d) — query/key/value/output projection weights
  • W1: (d_ff, d), b1: (d_ff,) — FFN first layer
  • W2: (d, d_ff), b2: (d,) — FFN second layer
  • gamma1, beta1: (d,) — LN params for self-attention
  • gamma2, beta2: (d,) — LN params for FFN
  • Returns: (seq, d)

Causal mask: position i can attend to positions 0..i only. Before the softmax, attention scores strictly above the diagonal (key position j > query position i) are set to -1e9.
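A minimal sketch of the masking step, assuming a square score matrix whose rows index query positions and whose columns index key positions. The function name `causal_softmax` is hypothetical.

```python
import numpy as np

def causal_softmax(scores):
    # scores: (seq, seq); row i is the query position, column j the key position.
    seq = scores.shape[0]
    # True strictly above the diagonal, i.e. where j > i (future positions).
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    masked = np.where(future, -1e9, scores)
    # Numerically stable row-wise softmax.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# With all-zero scores, each row spreads its weight evenly over positions 0..i.
weights = causal_softmax(np.zeros((3, 3)))
```

For the all-zero input above, row 0 puts all of its weight on position 0, row 1 splits it evenly over positions 0 and 1, and row 2 splits it evenly over all three positions.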

GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
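The exact (erf-based) GeLU above might be written as follows. NumPy itself does not provide `erf`, so this sketch vectorizes `math.erf` from the standard library; `scipy.special.erf` would be an alternative if SciPy is available.

```python
import math
import numpy as np

# Element-wise erf built from the standard library (an illustrative choice).
_erf = np.vectorize(math.erf, otypes=[float])

def gelu(x):
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    x = np.asarray(x, dtype=float)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

values = gelu(np.array([-1.0, 0.0, 1.0]))
```

This is the exact form, not the tanh approximation, so results may differ slightly from implementations that use the approximation.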

Math

$$x' = x + \mathrm{CausalAttn}(\mathrm{LN}_1(x)), \qquad \mathrm{out} = x' + \mathrm{FFN}(\mathrm{LN}_2(x'))$$
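One way the two equations could be composed in NumPy is sketched below. It is not the site's reference solution and rests on several assumptions the problem statement does not confirm: single-head attention, scores scaled by sqrt(d), weights stored as (out, in) and therefore applied as `h @ W.T` (consistent with W1 having shape (d_ff, d)), and `eps=1e-5` in LayerNorm. The helper names are hypothetical.

```python
import math
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):  # eps is an assumed value
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def causal_attention(h, Wq, Wk, Wv, Wo):
    # Assumptions: single head, (out, in) weight layout, 1/sqrt(d) scaling.
    seq, d = h.shape
    q, k, v = h @ Wq.T, h @ Wk.T, h @ Wv.T
    scores = q @ k.T / math.sqrt(d)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo.T

def ffn(h, W1, b1, W2, b2):
    # W1: (d_ff, d), W2: (d, d_ff)
    return gelu(h @ W1.T + b1) @ W2.T + b2

def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    # x' = x + CausalAttn(LN1(x))
    x1 = x + causal_attention(layer_norm(x, gamma1, beta1), Wq, Wk, Wv, Wo)
    # out = x' + FFN(LN2(x'))
    return x1 + ffn(layer_norm(x1, gamma2, beta2), W1, b1, W2, b2)

# Illustrative random inputs (not test data from the problem).
rng = np.random.default_rng(0)
seq, d, d_ff = 4, 8, 16
x = rng.normal(size=(seq, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1, b1 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d)
params = (Wq, Wk, Wv, Wo, W1, b1, W2, b2,
          np.ones(d), np.zeros(d), np.ones(d), np.zeros(d))
out = transformer_decoder_block(x, *params)

# Causality check: perturbing the last token must not change earlier outputs.
x_changed = x.copy()
x_changed[-1, 0] += 1.0
out_changed = transformer_decoder_block(x_changed, *params)
```

The final check perturbs only the last token. Since LayerNorm and the FFN act on each position independently and the attention is causal, the outputs at earlier positions should be unaffected.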

Related problems

  • Transformer Decoder Block (PyTorch) (Hard, PyTorch)

NumPy

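import numpy as np

def transformer_decoder_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2,
                              gamma1, beta1, gamma2, beta2):
    # Starter stub: implement the Pre-LN decoder block described above.
    pass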
