TorchedUp

156. GRPO Loss

Hard

Implement the GRPO loss used in DeepSeek-R1-style training. For a group of responses sampled for the same prompt, each response's advantage is the within-group z-score of its reward, so no value network is needed.

Signature: def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray, beta_kl: float, kl: np.ndarray) -> float

All inputs are 1D arrays of shape (group_size,). Compute:

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -mean(advantages * log_probs) + beta_kl * mean(kl)

Return a Python float.
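The two formulas above can be sketched directly in NumPy. This is a minimal sketch rather than the site's reference solution; it assumes the three arrays have the same length, and the `np.asarray` conversions are an added convenience that the problem statement does not require.

```python
import numpy as np


def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray,
              beta_kl: float, kl: np.ndarray) -> float:
    # Assumption: inputs may arrive as lists, so coerce them to float arrays.
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    kl = np.asarray(kl, dtype=np.float64)

    # Within-group z-score; np.std defaults to the population std (ddof=0).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Policy-gradient term plus the weighted mean KL penalty.
    policy_term = -np.mean(advantages * log_probs)
    kl_term = beta_kl * np.mean(kl)
    return float(policy_term + kl_term)
```

When every reward in the group is identical, the numerator is zero, so the advantages vanish and only the KL term remains; the `1e-8` in the denominator keeps that case from dividing by zero.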

Math

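$$\mathcal{L}_{\mathrm{GRPO}} = -\frac{1}{G}\sum_{i} A_i \log \pi(y_i \mid x) + \beta D_{\mathrm{KL}}, \qquad A_i = \frac{r_i - \bar{r}}{\sigma_r + \varepsilon}$$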
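To make the advantage term concrete, here is a small worked example. The reward values are made up for illustration. With rewards of 1, 0, 0, 1 the group mean is 0.5 and the population standard deviation is 0.5, so each advantage comes out very close to +1 or -1.

```python
import numpy as np

# Hypothetical rewards for a group of four sampled responses.
rewards = np.array([1.0, 0.0, 0.0, 1.0])

# Same normalization as in the problem statement (population std, ddof=0).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Each entry is within 1e-6 of [1, -1, -1, 1], and the advantages average to zero.
print(advantages)
print(advantages.mean())
```

Because the advantages are centered within the group, responses that beat the group average have their log-probabilities pushed up and the rest are pushed down, without any learned baseline.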

Related problems

  • GRPO Loss (PyTorch), hard, PyTorch

NumPy

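import numpy as np

def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray, beta_kl: float, kl: np.ndarray) -> float:
    # Your implementation goes here.
    pass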
