Implement a single AdamW optimizer parameter update — the weight-decay variant used by virtually every modern transformer trainer (HuggingFace, PyTorch torch.optim.AdamW, Megatron, etc.).
Signature: def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01) -> (theta_new, m_new, v_new)
The update:
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    theta_new = theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)
Return the tuple (theta_new, m_new, v_new). The harness verifies theta_new (the first element).
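The update rule above can be sketched in NumPy roughly as follows. This is one possible implementation rather than the site's reference solution. It assumes t is a 1-based step count (the bias correction divides by zero at t = 0) and that the inputs are array-like. The sample values theta0 and grad0 are made up for illustration.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. Assumes t >= 1 (1-based step count)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.asarray(grad, dtype=float)
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    # Exponential moving averages of the gradient and the squared gradient.
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for moments that start at zero.
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    # Adaptive step plus decoupled decay; the decay term is not divided
    # by the adaptive denominator.
    theta_new = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta_new, m_new, v_new

# Illustrative first step (t = 1) from zero-initialised moments.
theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -0.5])
theta1, m1, v1 = adamw_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)
print(theta1)  # approximately [0.99899, -1.99898]
```

At t = 1 the bias-corrected ratio m_hat / sqrt(v_hat) is essentially sign(grad), so each weight moves by about lr, plus the small decay term lr * weight_decay * theta.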
The naive way to add weight decay to Adam is to fold an L2 term weight_decay * theta into the gradient before the moment updates. That is wrong: the adaptive denominator sqrt(v_hat) + eps then divides the decay term, so parameters with large historical gradients receive less regularization than parameters with small ones — the opposite of what you want.
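To make the problem with the folded-in L2 term concrete, here is a small sketch of the coupled variant. The function name adam_l2_step and the hand-picked second-moment values are assumptions for illustration only. The current gradient and the first moment are set to zero, so the whole step comes from the decay term.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
    """Adam with L2 folded into the gradient (the coupled variant)."""
    g = grad + weight_decay * theta  # decay enters the moment estimates
    m_new = beta1 * m + (1 - beta1) * g
    v_new = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m_new, v_new

# Two weights with the same value but different gradient history:
# index 0 has seen large gradients (v = 100), index 1 small ones (v = 0.01).
theta = np.array([1.0, 1.0])
v_hist = np.array([100.0, 0.01])
zeros = np.zeros(2)

# Zero current gradient and zero first moment: the step is pure decay.
theta_new, _, _ = adam_l2_step(theta, zeros, zeros, v_hist, t=1000)
shrink = theta - theta_new
print(shrink)  # the large-history weight shrinks far less than the other
```

Both weights have the same value, yet the one with the larger second-moment estimate is decayed by roughly a factor of 100 less, because the decay term is divided by sqrt(v_hat) + eps.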
Loshchilov & Hutter (2017) decouple the decay: weight decay is applied to theta directly, outside the adaptive denominator, so every parameter is shrunk by the same multiplicative factor (1 - lr * weight_decay) regardless of its gradient history. With weight_decay=0, AdamW reduces exactly to Adam.
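The uniform shrinkage can be read off by regrouping the update rule. The symbols here are shorthand introduced for this note: \eta for lr, \lambda for weight_decay, \epsilon for eps.

```latex
\theta_{\text{new}}
  = \theta - \eta\left(\frac{\hat m}{\sqrt{\hat v} + \epsilon} + \lambda\,\theta\right)
  = (1 - \eta\lambda)\,\theta \;-\; \eta\,\frac{\hat m}{\sqrt{\hat v} + \epsilon}
```

The factor (1 - \eta\lambda) does not involve \hat m or \hat v, so it is identical for every parameter. With the default arguments it is 1 - 10^{-3} \cdot 10^{-2} = 0.99999. Setting \lambda = 0 leaves only the second term, which is the plain Adam step.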
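import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Starter stub: compute and return (theta_new, m_new, v_new).
    pass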