Implement the PPO clipped objective (the inner loss minimized during each PPO epoch).
Signature: def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray, advantages: np.ndarray, eps: float = 0.2) -> float
The importance ratio is recovered by exponentiating the difference between the new and old log-probabilities. Compute the unclipped surrogate (ratio · advantage) and the clipped surrogate (the ratio clamped to [1 - eps, 1 + eps], then multiplied by the advantage). Take the elementwise minimum of the two, which is the pessimistic lower bound, average it over all elements (do not reduce along just one axis), and negate the result so that it is a loss to minimize. See the math reference below.
Return a single Python float.
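The steps described above can be sketched as follows. This is a minimal illustration, not the site's reference solution: it assumes the three arrays share a shape (or broadcast against each other) and casts inputs to float64, a choice made here that the problem statement does not specify.

```python
import numpy as np


def ppo_clip_loss(
    log_probs_new: np.ndarray,
    log_probs_old: np.ndarray,
    advantages: np.ndarray,
    eps: float = 0.2,
) -> float:
    # Assumption: inputs are array-likes of matching (or broadcastable) shape.
    log_probs_new = np.asarray(log_probs_new, dtype=np.float64)
    log_probs_old = np.asarray(log_probs_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)

    # Importance ratio pi_new / pi_old, recovered from the log-prob difference.
    ratio = np.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogates.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Pessimistic lower bound, averaged over every element, then negated.
    surrogate = np.minimum(unclipped, clipped)
    return float(-np.mean(surrogate))


# Usage: ratios of 1.5, 0.5 and 1.0 with advantages 1, 1 and -2.
example_loss = ppo_clip_loss(
    np.log(np.array([1.5, 0.5, 1.0])),
    np.zeros(3),
    np.array([1.0, 1.0, -2.0]),
)
print(example_loss)  # approximately 0.1
```

In the usage example only the first ratio is affected by clipping: 1.5 is clamped to 1.2, so the per-element minima are 1.2, 0.5 and -2.0, whose mean is -0.1.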
Math
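The formula below is reconstructed from the description in the problem statement, since the original math reference did not survive on this page. The symbols r_i (importance ratio), A_i (advantage), ε (eps), and N (total number of elements) are notation chosen here for illustration.

```latex
\[
r_i = \exp\!\left(\log \pi_{\text{new}}(a_i \mid s_i) - \log \pi_{\text{old}}(a_i \mid s_i)\right)
\]
\[
L^{\text{CLIP}} = -\frac{1}{N} \sum_{i=1}^{N}
  \min\!\Big( r_i A_i,\; \operatorname{clip}\big(r_i,\, 1 - \epsilon,\, 1 + \epsilon\big)\, A_i \Big)
\]
```

Here log π_new and log π_old correspond to log_probs_new and log_probs_old, and the sum runs over every element of the input arrays.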
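import numpy as np

def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray, advantages: np.ndarray, eps: float = 0.2) -> float:
    # Starter stub: replace with your implementation.
    pass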