Implement the Bradley-Terry preference loss used to train reward models in RLHF.
Signature: def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float
Both inputs are 1D reward scores from the model. The loss is:
L = -mean( log_sigmoid(r_chosen - r_rejected) )
Use a numerically stable log_sigmoid, computed piecewise as log_sigmoid(x) = min(x, 0) - log1p(exp(-|x|)), so that exp is never evaluated on a large positive argument.
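As a brief check of why the piecewise form equals log_sigmoid, the identity can be worked out for the two signs of x. This is a derivation sketch; the symbol \sigma for the logistic sigmoid is notation introduced here, not taken from the problem statement.

```latex
\begin{aligned}
\log\sigma(x) &= -\log\bigl(1 + e^{-x}\bigr) \\
x \ge 0:\quad \log\sigma(x) &= 0 - \log\bigl(1 + e^{-|x|}\bigr) \\
x < 0:\quad \log\sigma(x) &= \log\frac{e^{x}}{1 + e^{x}}
  = x - \log\bigl(1 + e^{-|x|}\bigr) \\
\text{both cases:}\quad \log\sigma(x) &= \min(x, 0) - \log\bigl(1 + e^{-|x|}\bigr)
\end{aligned}
```

In both branches the exponent is -|x| <= 0, so the exponential stays in (0, 1] and cannot overflow.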
Return a Python float.
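One possible way to put the pieces above together is sketched below. It is not the site's reference solution; the helper name log_sigmoid and the cast to float64 are assumptions made for illustration, and the inputs are assumed to have equal length.

```python
import numpy as np


def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # Hypothetical helper: stable log(sigmoid(x)) via the piecewise identity
    # min(x, 0) - log1p(exp(-|x|)); the exponent is never positive.
    x = np.asarray(x, dtype=np.float64)
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))


def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Assumes both arrays are 1D and paired element by element.
    diff = np.asarray(r_chosen, dtype=np.float64) - np.asarray(r_rejected, dtype=np.float64)
    # Negative mean log-likelihood that each chosen response beats its rejected one.
    return float(-np.mean(log_sigmoid(diff)))
```

When the two rewards in every pair are equal, the difference is 0 and the loss reduces to log(2); a large negative margin gives a loss close to the margin's magnitude rather than an overflow.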
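import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Starter stub: implement the loss described above.
    pass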