Implement the Bradley-Terry preference loss used to train reward models in RLHF.
Signature: def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float
Both inputs are 1D arrays of scalar reward scores produced by the model for the chosen and rejected responses. The loss is:
L = -mean( log_sigmoid(r_chosen - r_rejected) )
Use a numerically stable log_sigmoid, i.e. log_sigmoid(x) = min(x, 0) - log1p(exp(-|x|)), which avoids overflow in exp for large-magnitude inputs.
Return a Python float.
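A minimal NumPy sketch of the spec above. The stable log_sigmoid helper implements the piecewise identity given in the problem; the input values in the usage snippet are illustrative only.

```python
import numpy as np

def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # Numerically stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|)).
    # The -|x| argument keeps exp from overflowing for large |x|.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Bradley-Terry preference loss: -mean(log sigmoid(r_chosen - r_rejected)).
    # The margin is positive when the model ranks the chosen response higher.
    margin = r_chosen - r_rejected
    return float(-np.mean(log_sigmoid(margin)))

# Illustrative usage (values chosen arbitrarily):
r_c = np.array([2.0, 0.5, -1.0])
r_r = np.array([1.0, 1.5, -2.0])
print(reward_model_loss(r_c, r_r))  # ~0.6466
```

Note the explicit float(...) cast: np.mean returns a NumPy scalar (np.float64), and the signature asks for a plain Python float.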