Implement the PPO clip objective in PyTorch.
Signature: def ppo_clip_loss(log_probs_new: torch.Tensor, log_probs_old: torch.Tensor, advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor
The rule: you may NOT call F.relu, F.clip, or any high-level objective wrapper. Implement clip and min yourself with primitives.
Allowed primitives: .exp(), .clamp, torch.minimum, .mean(), basic arithmetic.
The PPO clip objective works with the per-sample importance ratio between the new and old policies, computed from the log-prob difference as exp(log_probs_new - log_probs_old), and with the same ratio clipped to [1 - eps, 1 + eps]. It multiplies each version by the advantage, takes the element-wise minimum of the two products, averages over all elements (batch, time, and any extra axes), and negates the result, so that minimizing the loss maximizes the clipped surrogate objective.
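In symbols, with notation introduced here for illustration (not taken from the original problem): l_i^new and l_i^old are the new and old log-probs of element i, A_i is its advantage, and N is the total number of elements across all axes. The loss described above is:

```latex
\begin{aligned}
r_i &= \exp\left(\ell^{\text{new}}_i - \ell^{\text{old}}_i\right) \\
\operatorname{clip}(x, a, b) &= \min\left(\max(x, a),\, b\right) \\
L(\epsilon) &= -\frac{1}{N}\sum_{i=1}^{N} \min\left(r_i A_i,\ \operatorname{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\right)
\end{aligned}
```

For A_i > 0 the minimum stops rewarding the ratio once r_i > 1 + eps; for A_i < 0 it stops once r_i < 1 - eps. In both cases the clipped term is selected and that element contributes no gradient.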
PyTorch idioms vs the NumPy version:
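np.clip(arr, lo, hi) becomes arr.clamp(lo, hi) (or arr.clamp(min=lo, max=hi)).
np.minimum(a, b) becomes torch.minimum(a, b): the element-wise minimum of two tensors. Don't confuse it with a.min(dim=...), which is a reduction that returns a (values, indices) named tuple. a.min(b) also computes an element-wise minimum (it is an overload equivalent to torch.minimum), but torch.minimum states the intent more clearly.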
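A small sketch contrasting the NumPy calls with their PyTorch counterparts. The toy arrays and bounds are made up for illustration:

```python
import numpy as np
import torch

arr = np.array([0.5, 1.0, 1.5])
t = torch.tensor([0.5, 1.0, 1.5])

# np.clip(arr, lo, hi) -> tensor.clamp(lo, hi); the keyword form is equivalent.
np_clipped = np.clip(arr, 0.8, 1.2)
pt_clipped = t.clamp(0.8, 1.2)
pt_clipped_kw = t.clamp(min=0.8, max=1.2)

# np.minimum(a, b) -> torch.minimum(a, b): element-wise, same shape as the inputs.
a = torch.tensor([1.0, 5.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])
elementwise = torch.minimum(a, b)  # tensor([1., 4., 3.])

# a.min(dim=0) is a reduction: it returns a named tuple with .values and .indices.
reduced = a.min(dim=0)
print(reduced.values, reduced.indices)  # tensor(1.) tensor(0)
```

Only the element-wise form belongs in the PPO loss; the reduction would collapse the batch before the mean is taken.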
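import torch

def ppo_clip_loss(log_probs_new: torch.Tensor, log_probs_old: torch.Tensor, advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    # TODO: implement using only the allowed primitives listed above.
    pass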
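One possible way to fill in the stub, sketched under the rules above (only .exp(), .clamp, torch.minimum, .mean() and arithmetic). It is not the site's reference solution. It assumes all three tensors share the same shape and does not handle broadcasting or masking of padded time steps. The example tensors at the bottom are made up for illustration:

```python
import torch


def ppo_clip_loss(
    log_probs_new: torch.Tensor,
    log_probs_old: torch.Tensor,
    advantages: torch.Tensor,
    eps: float = 0.2,
) -> torch.Tensor:
    # Importance ratio pi_new / pi_old, recovered from the log-prob difference.
    ratio = (log_probs_new - log_probs_old).exp()
    # Unclipped and clipped surrogate terms.
    surr_unclipped = ratio * advantages
    surr_clipped = ratio.clamp(1.0 - eps, 1.0 + eps) * advantages
    # Element-wise minimum, averaged over every element (all axes),
    # negated so that minimizing the loss maximizes the surrogate.
    return -torch.minimum(surr_unclipped, surr_clipped).mean()


# Toy example: one in-range sample, one clipped above (A > 0), one clipped below (A < 0).
lp_new = torch.tensor([0.0, 0.5, -0.5], requires_grad=True)
lp_old = torch.zeros(3)
adv = torch.tensor([1.0, 1.0, -1.0])

loss = ppo_clip_loss(lp_new, lp_old, adv)
loss.backward()
print(loss.item())  # about -0.4667, i.e. -(1.0 + 1.2 - 0.8) / 3
# The two clipped samples receive zero gradient; only the in-range one is updated.
print(lp_new.grad)
```

In a training loop log_probs_old is normally recorded when the data is collected, without gradient tracking, so only log_probs_new carries gradients into the policy.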