TorchedUp

154. PPO Clip Objective

Hard

Implement the PPO clipped objective (the inner loss minimized during each PPO epoch).

Signature: def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray, advantages: np.ndarray, eps: float = 0.2) -> float

Recover the importance ratio by exponentiating the difference between the new and old log-probabilities: ratio = exp(log_probs_new - log_probs_old). Compute the unclipped surrogate (ratio · advantage) and the clipped surrogate (the ratio clamped to [1 - eps, 1 + eps], then multiplied by the advantage). Take the elementwise minimum of the two, which is the pessimistic lower bound, average over all elements (do not reduce along just one axis), and negate the result so that it is a loss to minimize. See the math reference below.

Return a single Python float.
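The steps described above can be sketched in NumPy as follows. This is a minimal illustration under the stated signature, not the site's reference solution; the `np.asarray` conversions are an added assumption so that plain Python lists are also accepted.

```python
import numpy as np


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps: float = 0.2) -> float:
    # Assumption: inputs share a shape (or broadcast); lists are converted to arrays.
    log_probs_new = np.asarray(log_probs_new, dtype=np.float64)
    log_probs_old = np.asarray(log_probs_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)

    # Importance ratio r = pi_new / pi_old, recovered from log-probabilities.
    ratio = np.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogates.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Pessimistic bound: elementwise minimum, averaged over every element.
    surrogate = np.minimum(unclipped, clipped)

    # Negate so the result is a loss to minimize; return a plain Python float.
    return float(-np.mean(surrogate))


# Small example: ratios are 1.5, 0.5 and 1.0; advantages are 1, -1 and 2.
old = np.log(np.array([0.5, 0.5, 0.5]))
new = np.log(np.array([0.75, 0.25, 0.5]))
adv = np.array([1.0, -1.0, 2.0])
loss = ppo_clip_loss(new, old, adv, eps=0.2)  # approximately -0.8
```

In the example, the first ratio is clipped down to 1.2 (positive advantage), the second is clipped up to 0.8 (negative advantage), and the third is untouched, so the per-element minima are 1.2, -0.8 and 2.0, whose mean is 0.8.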

Math

$$L^{CLIP} = -\mathbb{E}\left[\min\left(r_t A_t,\ \operatorname{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right]$$
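As a reading aid, the minimum inside the expectation can be split by the sign of the advantage. The case analysis below is a sketch derived from the formula itself, with r_t the importance ratio and A_t the advantage; it is not part of the original problem statement.

```latex
\min\bigl(r_t A_t,\ \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\, A_t\bigr)
=
\begin{cases}
\min(r_t,\ 1+\epsilon)\, A_t, & A_t \ge 0,\\[4pt]
\max(r_t,\ 1-\epsilon)\, A_t, & A_t < 0.
\end{cases}
```

For example, with \epsilon = 0.2, r_t = 1.5 and A_t = 1 the term is 1.2, while with r_t = 0.5 and A_t = -1 it is -0.8. In both cases the ratio has already moved past the clip range in the direction the advantage favors, so the term no longer depends on r_t and contributes no gradient.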

Related problems

  • PPO Clip Objective (PyTorch) · Hard · PyTorch

NumPy

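import numpy as np


def ppo_clip_loss(log_probs_new: np.ndarray, log_probs_old: np.ndarray, advantages: np.ndarray, eps: float = 0.2) -> float:
    # Starter stub: replace with your implementation.
    pass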
