Implement a mini encoder-decoder Transformer for sequence-to-sequence tasks (like translation).
Signature: `def mini_transformer(src, tgt, weights)`
- src: (S, d_model) — source sequence
- tgt: (T, d_model) — target sequence (the full target is provided, i.e. teacher forcing)
- weights: dict with all weight matrices (see below)
- Returns: (T, d_model) — decoder output

Encoder layer:
- Self-attention over src (no mask)
- Feed-forward network
- Output: memory

Decoder layer:
- Self-attention over tgt (causal mask: upper-triangular -1e9)
- Cross-attention over memory
- Feed-forward network

For n_heads heads with head dim d_h = d_model // n_heads:
```python
Q = x @ Wq.T
K = x @ Wk.T
V = x @ Wv.T
# Split into heads: reshape (S, d) → (n_heads, S, d_h)
# Per head: scores = Q_h @ K_h.T / sqrt(d_h)  [+ optional mask]
# attn = softmax(scores) @ V_h
# Concatenate heads → (S, d) → @ Wo.T
```
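To make the head-splitting and masking concrete, here is a NumPy sketch of the steps above. The helper name `multi_head_attention` and the optional additive `mask` argument are illustrative conveniences, not part of the required signature.

```python
import numpy as np

def multi_head_attention(x_q, x_kv, Wq, Wk, Wv, Wo, n_heads, mask=None):
    """Multi-head attention (NumPy sketch of the steps above).

    x_q:  (L_q, d)   queries come from this sequence
    x_kv: (L_kv, d)  keys/values come from this sequence
    Wq/Wk/Wv/Wo: (d, d) projections, applied as x @ W.T
    mask: optional (L_q, L_kv) additive mask (e.g. -1e9 above the diagonal)
    """
    L_q, d = x_q.shape
    L_kv = x_kv.shape[0]
    d_h = d // n_heads

    Q = x_q @ Wq.T                       # (L_q, d)
    K = x_kv @ Wk.T                      # (L_kv, d)
    V = x_kv @ Wv.T                      # (L_kv, d)

    # Split into heads: (L, d) → (n_heads, L, d_h)
    Q = Q.reshape(L_q, n_heads, d_h).transpose(1, 0, 2)
    K = K.reshape(L_kv, n_heads, d_h).transpose(1, 0, 2)
    V = V.reshape(L_kv, n_heads, d_h).transpose(1, 0, 2)

    # Scaled dot-product scores per head: (n_heads, L_q, L_kv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)
    if mask is not None:
        scores = scores + mask           # broadcast over heads

    # Row-wise softmax, numerically stabilised
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    out = attn @ V                       # (n_heads, L_q, d_h)
    # Concatenate heads back to (L_q, d), then output projection
    out = out.transpose(1, 0, 2).reshape(L_q, d)
    return out @ Wo.T
```

The same helper covers both self-attention (x_q and x_kv are the same sequence) and cross-attention (x_q from the decoder, x_kv from the encoder memory).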
LayerNorm(x, gamma, beta) = gamma * (x - mean) / sqrt(var + 1e-5) + beta
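The formula maps directly to a small helper; the function name `layer_norm` is illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the feature dimension, matching the formula above."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```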
| Key | Shape | Description |
|-----|-------|-------------|
| n_heads | int | number of attention heads |
| enc_Wq/Wk/Wv/Wo | (d, d) | encoder self-attention |
| enc_ln1_g/b, enc_ln2_g/b | (d,) | encoder layer norms |
| enc_W1 | (d_ff, d) | encoder FFN layer 1 weight |
| enc_b1 | (d_ff,) | encoder FFN layer 1 bias |
| enc_W2 | (d, d_ff) | encoder FFN layer 2 weight |
| enc_b2 | (d,) | encoder FFN layer 2 bias |
| dec_self_Wq/Wk/Wv/Wo | (d, d) | decoder masked self-attention |
| dec_cross_Wq/Wk/Wv/Wo | (d, d) | decoder cross-attention |
| dec_ln1_g/b, dec_ln2_g/b, dec_ln3_g/b | (d,) | decoder layer norms |
| dec_W1 | (d_ff, d) | decoder FFN layer 1 weight |
| dec_b1 | (d_ff,) | decoder FFN layer 1 bias |
| dec_W2 | (d, d_ff) | decoder FFN layer 2 weight |
| dec_b2 | (d,) | decoder FFN layer 2 bias |
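Putting the pieces together, here is one possible sketch of `mini_transformer` using the weight keys above and the `multi_head_attention` and `layer_norm` helpers from earlier. It assumes post-norm residual connections, x = LayerNorm(x + sublayer(x)), and a ReLU feed-forward network; the statement does not pin down either choice, so treat these as assumptions.

```python
import numpy as np

def mini_transformer(src, tgt, weights):
    """Sketch of a 1-layer encoder/decoder forward pass (NumPy).

    Assumes post-norm residuals and a ReLU FFN (not specified by the
    statement). Reuses multi_head_attention and layer_norm from above.
    """
    w = weights
    n_heads = w["n_heads"]
    T = tgt.shape[0]

    # ----- Encoder layer: self-attention over src (no mask) → memory -----
    attn = multi_head_attention(src, src,
                                w["enc_Wq"], w["enc_Wk"], w["enc_Wv"], w["enc_Wo"],
                                n_heads)
    x = layer_norm(src + attn, w["enc_ln1_g"], w["enc_ln1_b"])
    ffn = np.maximum(0.0, x @ w["enc_W1"].T + w["enc_b1"]) @ w["enc_W2"].T + w["enc_b2"]
    memory = layer_norm(x + ffn, w["enc_ln2_g"], w["enc_ln2_b"])

    # ----- Decoder layer -----
    # Masked self-attention over tgt (causal mask: -1e9 above the diagonal)
    causal = np.triu(np.full((T, T), -1e9), k=1)
    attn = multi_head_attention(tgt, tgt,
                                w["dec_self_Wq"], w["dec_self_Wk"],
                                w["dec_self_Wv"], w["dec_self_Wo"],
                                n_heads, mask=causal)
    y = layer_norm(tgt + attn, w["dec_ln1_g"], w["dec_ln1_b"])

    # Cross-attention: queries from the decoder, keys/values from memory
    attn = multi_head_attention(y, memory,
                                w["dec_cross_Wq"], w["dec_cross_Wk"],
                                w["dec_cross_Wv"], w["dec_cross_Wo"],
                                n_heads)
    y = layer_norm(y + attn, w["dec_ln2_g"], w["dec_ln2_b"])

    # Feed-forward, then final add & norm → (T, d_model)
    ffn = np.maximum(0.0, y @ w["dec_W1"].T + w["dec_b1"]) @ w["dec_W2"].T + w["dec_b2"]
    return layer_norm(y + ffn, w["dec_ln3_g"], w["dec_ln3_b"])
```

With random weights of the shapes listed in the table, the output should come back with shape (T, d_model).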