TorchedUp

70. Mixed Precision Training Step

Medium

Mixed precision training runs the forward and backward passes in fp16 (faster, less GPU memory) but performs the optimizer update in fp32, so that small weight updates are not lost to fp16 rounding.

The key challenge: fp16 has a narrow dynamic range. Its smallest positive value is about 6e-8 and its largest finite value is 65504. Gradients far below that lower limit get flushed to exactly 0 ("underflow"), killing learning. The fix is loss scaling: multiply the loss by a large constant before backward, then divide the gradients by the same constant before the optimizer step.
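To make the underflow concrete, here is a small NumPy sketch. The gradient value 1e-8 and the scale 1024 are illustrative choices, not values from the problem statement.

```python
import numpy as np

tiny_grad = np.float32(1e-8)     # illustrative gradient, representable in fp32
loss_scale = np.float32(1024.0)  # illustrative scale factor (a power of two)

# Direct cast: 1e-8 is below half of fp16's smallest positive value (~6e-8),
# so it rounds to exactly 0.
unscaled_fp16 = np.float16(tiny_grad)

# Scaled cast: 1e-8 * 1024 is about 1e-5, which fp16 can still represent.
scaled_fp16 = np.float16(tiny_grad * loss_scale)

# Unscale in fp32 to recover an approximation of the original gradient.
recovered = np.float32(scaled_fp16) / loss_scale

print("fp16 without scaling:", unscaled_fp16)
print("fp16 with scaling:   ", scaled_fp16)
print("recovered in fp32:   ", recovered)
```

Without scaling the gradient is lost entirely; with scaling it survives the trip through fp16 with only a small rounding error.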

Steps:

  1. Cast activations to fp16, run forward pass.
  2. Multiply the loss by loss_scale before calling backward, so the resulting gradients are also scaled by loss_scale (this lifts tiny gradients out of fp16's underflow region).
  3. After backward, inspect the scaled gradient for inf/nan. If any entry is non-finite, the gradient overflowed fp16: skip the optimizer update for this step (return the parameters unchanged).
  4. Otherwise, divide the gradients by loss_scale to recover the true (unscaled) gradient.
  5. Apply a standard fp32 SGD update with the unscaled gradient.
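The steps above can be sketched in NumPy as follows. This is one possible reading, not the site's reference solution: the function name `mixed_precision_step_sketch` is hypothetical, and the sketch assumes the overflow check is done on `grad_fp32 * loss_scale` computed in fp32. A stricter variant would cast the scaled gradient to fp16 before checking.

```python
import numpy as np

def mixed_precision_step_sketch(params_fp32, grad_fp32, loss_scale, lr):
    """Hypothetical sketch of one loss-scaled SGD step (not a reference solution)."""
    params = np.asarray(params_fp32, dtype=np.float32)
    grad = np.asarray(grad_fp32, dtype=np.float32)

    # Re-create the scaled gradient that a scaled backward pass would produce.
    # Overflow warnings are silenced because overflow is detected explicitly below.
    with np.errstate(over="ignore", invalid="ignore"):
        scaled_grad = grad * np.float32(loss_scale)

    # Assumption: the inf/nan check runs on the fp32 scaled gradient.
    if not np.all(np.isfinite(scaled_grad)):
        return params.copy(), True  # skip the update, parameters unchanged

    # Unscale, then apply plain SGD in fp32.
    unscaled_grad = scaled_grad / np.float32(loss_scale)
    new_params = params - np.float32(lr) * unscaled_grad
    return new_params, False

# Normal step: the update is applied.
new_params, skipped = mixed_precision_step_sketch(
    np.array([1.0, 1.0]), np.array([0.5, -2.0]), loss_scale=1024.0, lr=0.1
)
print(new_params, skipped)

# Non-finite gradient: the update is skipped.
same_params, skipped_inf = mixed_precision_step_sketch(
    np.array([1.0, 1.0]), np.array([np.inf, 0.0]), loss_scale=1024.0, lr=0.1
)
print(same_params, skipped_inf)
```

Returning a copy on the skip path keeps the caller's array from being aliased, which is a design choice of this sketch rather than a requirement stated in the problem.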

Signature: def mixed_precision_step(params_fp32, grad_fp32, loss_scale, lr)

  • params_fp32: (N,) current fp32 parameters
  • grad_fp32: (N,) gradients in fp32 (before overflow check)
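  • loss_scale: float, the loss scale factor (to check for overflow, multiply grad_fp32 by loss_scale and test the result for inf/nan)
  • lr: float, the learning rate
  • Returns: (new_params, skipped), where new_params is the updated parameters and skipped is a bool (True if the update was skipped due to inf/nan)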
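The parameter list says to check overflow by scaling the gradient, but it does not say in which precision. The sketch below, with illustrative values that are not from the problem, shows that the same scaled gradient can be finite in fp32 yet overflow once cast to fp16, so the choice affects when a step is skipped.

```python
import numpy as np

grad_fp32 = np.array([1.0, 100.0], dtype=np.float32)  # illustrative gradient
loss_scale = np.float32(1024.0)                       # illustrative scale

# Scaled gradient in fp32: [1024, 102400], both finite.
scaled_fp32 = grad_fp32 * loss_scale

# Cast to fp16: 102400 exceeds fp16's largest finite value (65504) and becomes inf.
with np.errstate(over="ignore"):
    scaled_fp16 = scaled_fp32.astype(np.float16)

overflow_in_fp32 = not np.all(np.isfinite(scaled_fp32))
overflow_in_fp16 = not np.all(np.isfinite(scaled_fp16))

print("overflow detected in fp32:", overflow_in_fp32)
print("overflow detected in fp16:", overflow_in_fp16)
```

Here `np.isfinite` returns False for both inf and nan, so a single `np.all(np.isfinite(...))` test covers both failure modes.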

Math

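$$\tilde{g} = g \cdot s \quad \text{(check for inf/nan)}$$

$$\theta \leftarrow \theta - \eta \cdot \frac{\tilde{g}}{s} = \theta - \eta \cdot g$$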

NumPy

import numpy as np


def mixed_precision_step(params_fp32, grad_fp32, loss_scale, lr):
    pass
