Implement the GRPO loss used in DeepSeek-R1 style training. For a group of responses sampled to the same prompt, the advantage is the within-group z-score of rewards — no value network needed.
Signature: def grpo_loss(rewards: np.ndarray, log_probs: np.ndarray, beta_kl: float, kl: np.ndarray) -> float
All inputs are 1D arrays of shape (group_size,). Compute:
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -mean(advantages * log_probs) + beta_kl * mean(kl)
Return a Python float.
Math
Related problems
Asked at
import numpy as np
def grpo_loss(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?