GRPO Loss (PyTorch)

Implement the GRPO (Group Relative Policy Optimization) loss in PyTorch.

Signature: def grpo_loss(rewards: torch.Tensor, log_probs: torch.Tensor, beta_kl: float, kl: torch.Tensor) -> torch.Tensor

The rule: you may NOT call any high-level loss wrapper. Implement the z-score and the policy-gradient surrogate yourself.

Allowed primitives: .mean(), .std(...), basic arithmetic.

Formula:

advantages = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
loss       = -mean(advantages * log_probs) + beta_kl * mean(kl)
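The two formula lines above map almost one-to-one onto tensor operations. Below is a minimal sketch of one possible implementation, assuming `rewards`, `log_probs`, and `kl` are 1-D tensors of the same length (one entry per sample in the group). The example values are made up for illustration and are not part of the problem.

```python
import torch


def grpo_loss(rewards: torch.Tensor, log_probs: torch.Tensor,
              beta_kl: float, kl: torch.Tensor) -> torch.Tensor:
    # Z-score the rewards within the group; unbiased=False selects the
    # population std (divisor n), as written in the formula.
    advantages = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    # Policy-gradient surrogate plus the KL penalty, each averaged over the group.
    return -(advantages * log_probs).mean() + beta_kl * kl.mean()


# Hypothetical group of G = 4 samples (illustrative values only).
rewards = torch.tensor([1.0, 0.0, 0.5, 2.0])
log_probs = torch.tensor([-0.7, -1.2, -0.9, -0.4], requires_grad=True)
kl = torch.tensor([0.02, 0.01, 0.03, 0.02])

loss = grpo_loss(rewards, log_probs, beta_kl=0.1, kl=kl)
loss.backward()  # gradient w.r.t. log_probs is -advantages / G
```

Because the advantages are centered, the gradients with respect to `log_probs` sum to roughly zero: samples with above-average reward are pushed up and the rest are pushed down.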

Critical PyTorch detail: PyTorch's .std() defaults to unbiased=True (Bessel-corrected, divisor n-1), whereas NumPy's .std() defaults to ddof=0 (divisor n). To match the NumPy reference and the formula above, you must pass unbiased=False. Omitting it is an easy bug to introduce when porting GRPO code from NumPy.
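To see how much the default matters, the sketch below compares the two divisors on a hypothetical group of four rewards (values chosen only for illustration). The two standard deviations differ by a factor of sqrt(n / (n - 1)), so the gap is largest for small groups.

```python
import math
import torch

rewards = torch.tensor([1.0, 0.0, 0.5, 2.0])  # hypothetical group, G = 4
n = rewards.numel()

std_default = rewards.std()                   # unbiased=True: divisor n - 1
std_population = rewards.std(unbiased=False)  # divisor n, as the formula requires

# The ratio equals sqrt(n / (n - 1)), about 1.155 for n = 4, so relying on
# the default would shrink every advantage by roughly 13%.
ratio = (std_default / std_population).item()
print(ratio, math.sqrt(n / (n - 1)))
```

For a group of 64 samples the same ratio is about 1.008, which is why the mismatch can go unnoticed until outputs are compared against a NumPy reference on small inputs.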

Math

L_{\mathrm{GRPO}} = -\frac{1}{G} \sum_{i} A_{i} \log \pi(y_{i} \mid x) + \beta \, \overline{D_{KL}}, \qquad A_{i} = \frac{r_{i} - \bar{r}}{\sigma_{r} + \varepsilon}

Related problems

GRPO Loss (hard, NumPy)
