TorchedUp

212. AdamW (Decoupled Weight Decay)

Medium

Implement a single AdamW optimizer parameter update — the weight-decay variant used by virtually every modern transformer trainer (HuggingFace, PyTorch torch.optim.AdamW, Megatron, etc.).

Signature: def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01) -> (theta_new, m_new, v_new)

The update:

  1. m_new = beta1 * m + (1 - beta1) * grad
  2. v_new = beta2 * v + (1 - beta2) * grad ** 2
  3. m_hat = m_new / (1 - beta1 ** t)
  4. v_hat = v_new / (1 - beta2 ** t)
  5. theta_new = theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)

Return the tuple (theta_new, m_new, v_new). The harness verifies theta_new (the first element).
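
The five steps above map almost line for line onto NumPy. The following is a minimal sketch, not the site's reference solution. It assumes theta, grad, m, and v are scalars or arrays of the same shape, and that t is the 1-based step count (t = 0 would divide by zero in the bias correction).

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; returns (theta_new, m_new, v_new)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.asarray(grad, dtype=float)

    # Steps 1-2: exponential moving averages of the gradient and its square.
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2

    # Steps 3-4: bias correction (assumes t starts at 1).
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)

    # Step 5: adaptive step plus decoupled decay on the pre-update theta.
    theta_new = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta_new, m_new, v_new


# Usage: first step from zero-initialized moments.
theta = np.array([1.0, -2.0])
grad = np.array([0.1, 0.3])
theta_new, m_new, v_new = adamw_step(theta, grad, m=np.zeros(2), v=np.zeros(2), t=1)
```

On the first step from zero moments, the bias-corrected ratio m_hat / sqrt(v_hat) is close to sign(grad), so each parameter moves by roughly lr against its gradient, plus the decay term.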

AdamW vs "Adam + L2"

The naive way to add weight decay to Adam is to fold an L2 term weight_decay * theta into the gradient before the moment updates. That is wrong: the adaptive denominator sqrt(v_hat) + eps then divides the decay term, so parameters with large historical gradients receive less regularization than parameters with small ones — the opposite of what you want.
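
The difference can be seen in a small experiment. This is an illustrative sketch under toy assumptions, not a benchmark: two parameters start at the same value, their loss gradients alternate in sign (so they average to zero) and differ only in scale, and the helper names adam_step and run are made up for this example.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam step (no weight decay)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


def run(coupled, steps=1000, lr=1e-3, weight_decay=0.1):
    # Two parameters that start equal; only their gradient scales differ.
    theta = np.array([1.0, 1.0])
    scale = np.array([10.0, 0.1])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        # Toy loss gradient: alternating sign, so it averages to zero.
        grad = scale if t % 2 else -scale
        if coupled:
            # Adam + L2: decay is folded into the gradient and gets rescaled.
            theta, m, v = adam_step(theta, grad + weight_decay * theta, m, v, t, lr)
        else:
            # AdamW: decay acts on the pre-update theta, outside the rescaling.
            decay = lr * weight_decay * theta
            theta, m, v = adam_step(theta, grad, m, v, t, lr)
            theta = theta - decay
    return theta


theta_l2 = run(coupled=True)
theta_w = run(coupled=False)
print("Adam + L2:", theta_l2)  # large-gradient parameter shrinks far less
print("AdamW:    ", theta_w)   # both parameters end at nearly the same value
```

In this setup the only systematic pull on the parameters is the decay, so any gap between the two entries of theta_l2 comes from the adaptive denominator rescaling the L2 term.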

Loshchilov & Hutter (2017) decouple the decay: the term weight_decay * theta is applied to theta directly, outside the adaptive rescaling, so every parameter is shrunk by the same multiplicative factor (1 - lr * weight_decay) regardless of its gradient history. With weight_decay=0, AdamW reduces exactly to Adam.

Math

$$\theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right)$$
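
Regrouping the terms makes the decoupling explicit. This is only an algebraic rearrangement of the update, with η standing for lr and λ for weight_decay. The symbol g_t for the raw gradient (grad in the signature) is notation assumed here for the contrast with Adam + L2.

```latex
% AdamW: the decay factor multiplies theta and never passes through the moments.
\theta_t = (1 - \eta\lambda)\,\theta_{t-1}
           - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

% Adam + L2: the decay is added to the gradient first, so it enters
% both \hat{m}_t and \hat{v}_t and is divided by \sqrt{\hat{v}_t} + \epsilon.
g_t \leftarrow g_t + \lambda\,\theta_{t-1}
```

In the first form the shrink factor does not depend on the moment estimates, and setting λ = 0 leaves the plain Adam step.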

NumPy

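import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Return (theta_new, m_new, v_new) following steps 1-5 above.
    pass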
