GRPO Loss — TorchedUp

Implement the GRPO (Group Relative Policy Optimization) loss used in DeepSeek-R1-style training. For a group of responses sampled for the same prompt, the advantage is the within-group z-score of the rewards, so no value network is needed.

Signature: def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray, beta_kl: float, kl: np.ndarray) -> float

rewards, log_probs, and kl are 1D arrays of shape (group_size,); beta_kl is a scalar. Compute:

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -mean(advantages * log_probs) + beta_kl * mean(kl)

Return a Python float.
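
A minimal NumPy sketch of the two lines above. It assumes rewards.std() means NumPy's default population standard deviation (ddof=0), since the statement does not say otherwise. The float64 casts are an added precaution, not part of the specification.

```python
import numpy as np


def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray,
              beta_kl: float, kl: np.ndarray) -> float:
    # Cast defensively so integer rewards or float32 inputs behave the same
    # (an assumption of this sketch, not a stated requirement).
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    kl = np.asarray(kl, dtype=np.float64)

    # Within-group z-score; std() uses ddof=0 by default.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Policy term plus KL penalty, each averaged over the group.
    policy_term = -np.mean(advantages * log_probs)
    kl_term = beta_kl * np.mean(kl)

    # np.mean returns a NumPy scalar; convert to a built-in float.
    return float(policy_term + kl_term)
```

For example, grpo_loss(np.array([1.0, 0.0]), np.array([-1.0, -2.0]), 0.5, np.array([0.1, 0.3])) gives approximately -0.4: the advantages are close to [1, -1], so the policy term is about -0.5 and the KL term is 0.5 * 0.2 = 0.1.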

Math

L_{\mathrm{GRPO}} = -\frac{1}{G} \sum_{i=1}^{G} A_i \log \pi(y_i \mid x) + \beta \, \overline{D_{\mathrm{KL}}}, \qquad A_i = \frac{r_i - \bar{r}}{\sigma_r + \varepsilon}
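
To see what the ε in the denominator does, the sketch below evaluates the advantage on its own. When every response in the group earns the same reward, the standard deviation is 0 and the z-score would otherwise be 0/0. With ε the advantages come out exactly zero, so the loss reduces to the KL term. The helper name group_advantages is hypothetical and used only for illustration.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    # Hypothetical helper: within-group z-score with population std (ddof=0).
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Mixed rewards: advantages are centered on zero and close to +/-1 here.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Tied rewards: numerator is exactly 0, denominator is eps, so no NaN appears.
tied = group_advantages([2.0, 2.0, 2.0, 2.0])

print(mixed)
print(tied)
```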



156. GRPO Loss
