Implement the PPO clipped objective (the inner loss minimized during each PPO epoch).
Signature: def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray, advantages: np.ndarray, eps: float = 0.2) -> float
Let r = exp(log_probs_new - log_probs_old). Then the loss is
L = -mean( min( r * advantages, clip(r, 1 - eps, 1 + eps) * advantages ) )
Return a single Python float. The leading minus sign turns the maximized surrogate objective into a loss, since optimizers minimize.
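A minimal sketch of one possible solution, assuming the three arrays share a common shape (or broadcast elementwise) and that NumPy is available:

import numpy as np

def ppo_clip_loss(log_probs_new: np.ndarray,
                  log_probs_old: np.ndarray,
                  advantages: np.ndarray,
                  eps: float = 0.2) -> float:
    # Probability ratio r = pi_new / pi_old, computed stably in log space.
    r = np.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = r * advantages
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * advantages
    # Take the elementwise minimum (the pessimistic bound), average,
    # and negate so that minimizing the loss maximizes the surrogate.
    return float(-np.mean(np.minimum(unclipped, clipped)))

For example, ppo_clip_loss(np.log([0.3, 0.6]), np.log([0.25, 0.7]), np.array([1.0, -0.5])) returns a float; the elementwise min means large advantageous ratio increases beyond 1 + eps earn no extra credit, while harmful moves are never clipped into looking better.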