AdamW (Decoupled Weight Decay)

Implement a single AdamW optimizer parameter update — the weight-decay variant used by virtually every modern transformer trainer (HuggingFace, PyTorch torch.optim.AdamW, Megatron, etc.).

Signature: def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01) -> (theta_new, m_new, v_new)

The update:

m_new = beta1 * m + (1 - beta1) * grad
v_new = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m_new / (1 - beta1 ** t)
v_hat = v_new / (1 - beta2 ** t)
theta_new = theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)

Return the tuple (theta_new, m_new, v_new). The harness verifies theta_new (the first element).
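The five update lines above translate almost directly into Python. The following is a minimal sketch rather than a reference solution: it assumes t is a 1-based step count (t=0 would make the bias-correction denominators zero) and uses `** 0.5` in place of sqrt so the same code accepts plain floats or NumPy arrays. The sample values at the bottom are made up for illustration.

```python
def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. Assumes t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for moments that start at zero.
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    # `** 0.5` stands in for sqrt so floats and NumPy arrays both work.
    adam_term = m_hat / (v_hat ** 0.5 + eps)
    # Decoupled decay: added outside the adaptive rescaling.
    theta_new = theta - lr * (adam_term + weight_decay * theta)
    return theta_new, m_new, v_new


# Illustrative first step from zero-initialised moments (values are made up).
theta_new, m_new, v_new = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta_new, m_new, v_new)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so theta moves by roughly lr * (1 + weight_decay * theta) here, landing near 0.99899.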

AdamW vs "Adam + L2"

The naive way to add weight decay to Adam is to fold an L2 term weight_decay * theta into the gradient before the moment updates. That is wrong: the adaptive denominator sqrt(v_hat) + eps then divides the decay term, so parameters with large historical gradients receive less regularization than parameters with small ones — the opposite of what you want.

Loshchilov & Hutter (2017) decouple the decay: weight decay is applied to theta directly, outside the adaptive rescaling, so every parameter is shrunk by the same multiplicative factor (1 - lr * weight_decay) regardless of its gradient history. With weight_decay=0, AdamW reduces exactly to Adam.
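The difference can be made concrete with a small experiment. The sketch below is illustrative only: adam_l2_step is a hypothetical helper (not part of the problem statement) that folds weight_decay * theta into the gradient, and the scenario is deliberately contrived, with a zero current gradient and a hand-picked second-moment state v standing in for "large" and "small" gradient history.

```python
def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    theta_new = theta - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * theta)
    return theta_new, m_new, v_new


def adam_l2_step(theta, grad, m, v, t, weight_decay=0.01, **kwargs):
    """Hypothetical "Adam + L2" baseline: decay is folded into the gradient."""
    l2_grad = grad + weight_decay * theta
    return adamw_step(theta, l2_grad, m, v, t, weight_decay=0.0, **kwargs)


# Same weight, zero current gradient, different (made-up) gradient histories.
theta, grad, m, t = 1.0, 0.0, 0.0, 10_000
shrink = {}
for label, v in (("large_v", 100.0), ("small_v", 1e-4)):
    l2_new, _, _ = adam_l2_step(theta, grad, m, v, t)
    w_new, _, _ = adamw_step(theta, grad, m, v, t)
    # Record how far each variant pulled theta toward zero.
    shrink[label] = (theta - l2_new, theta - w_new)
    print(label, "adam+l2:", shrink[label][0], "adamw:", shrink[label][1])
```

Under these assumptions the Adam + L2 shrinkage differs by about three orders of magnitude between the two histories (on the order of 1e-7 versus 1e-4), while AdamW shrinks both weights by the same amount, about lr * weight_decay * theta = 1e-5.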

Math

\theta_{t} = \theta_{t-1} - \eta \left( \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon} + \lambda \theta_{t-1} \right)
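For completeness, the bias-corrected moments used in this update can be written in the same notation, followed by the update regrouped to expose the decay factor. This is a sketch under a few labelling assumptions: g_t denotes the gradient at step t (a symbol not used in the original text), η corresponds to lr, λ to weight_decay, and β_1, β_2 to beta1, beta2.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, &
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t}, \\
\theta_t &= (1 - \eta\lambda)\,\theta_{t-1}
            - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.
\end{aligned}
```

The last line is the same update with the two θ_{t-1} terms collected: the weight is first scaled by (1 - ηλ), and the adaptive Adam step is then subtracted.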
