TorchedUp

157. Bradley-Terry Reward Loss

Medium

Implement the Bradley-Terry preference loss used to train reward models in RLHF.

Signature: def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float

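Both inputs are 1D arrays of reward scores from the model, paired element-wise: r_chosen[i] and r_rejected[i] score the two responses of the same comparison. The loss is: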

L = -mean( log_sigmoid(r_chosen - r_rejected) )

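Use a numerically stable log_sigmoid, computed piecewise as log_sigmoid(x) = min(x, 0) - log1p(exp(-|x|)). This form only ever exponentiates a non-positive number, so it cannot overflow.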
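To illustrate the stability point, here is a minimal sketch of the piecewise log_sigmoid next to a naive version. The helper names `log_sigmoid` and `naive_log_sigmoid` are illustrative and not part of the required signature.

```python
import numpy as np

def log_sigmoid(x: np.ndarray) -> np.ndarray:
    """Stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|))."""
    x = np.asarray(x, dtype=np.float64)
    # exp() only sees -|x| <= 0, so it stays in [0, 1] and cannot overflow.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def naive_log_sigmoid(x: np.ndarray) -> np.ndarray:
    """Direct translation of log(1 / (1 + exp(-x))); breaks for very negative x."""
    x = np.asarray(x, dtype=np.float64)
    # Suppress the overflow / divide-by-zero warnings this version triggers.
    with np.errstate(over="ignore", divide="ignore"):
        return np.log(1.0 / (1.0 + np.exp(-x)))

x = np.array([-1000.0, 0.0, 1000.0])
stable = log_sigmoid(x)        # finite for every entry
naive = naive_log_sigmoid(x)   # first entry is -inf because exp(1000) overflows
```

For x = -1000 the stable version returns -1000, while the naive version returns -inf, which would turn the mean loss into inf.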

Return a Python float.
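Putting the pieces together, one possible implementation is sketched below. It assumes both inputs have the same length and casts them to float64; the site's reference solution may differ.

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Reward margin for each preference pair.
    diff = np.asarray(r_chosen, dtype=np.float64) - np.asarray(r_rejected, dtype=np.float64)
    # Stable log(sigmoid(diff)) = min(diff, 0) - log1p(exp(-|diff|)).
    log_sig = np.minimum(diff, 0.0) - np.log1p(np.exp(-np.abs(diff)))
    # Negative mean log-likelihood, converted from a NumPy scalar to a Python float.
    return float(-np.mean(log_sig))

# Two pairs with margins 0 and 2.
loss = reward_model_loss(np.array([1.0, 2.0]), np.array([1.0, 0.0]))
```

With margins 0 and 2, the loss is (log 2 + log(1 + e^-2)) / 2, roughly 0.41. A margin of 0 contributes log 2, the loss of a model that cannot tell the two responses apart.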

Math

$$\mathcal{L}_{RM} = -\mathbb{E}_{(y_c, y_r)}\left[\log \sigma\big(r_\theta(y_c) - r_\theta(y_r)\big)\right]$$
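For context, this loss is the negative log-likelihood of the standard Bradley-Terry model, in which the probability that the chosen response beats the rejected one depends only on the reward difference. The derivation below reuses the symbols from the formula above; the preference relation symbol is added for illustration.

```latex
\begin{aligned}
P(y_c \succ y_r)
  &= \frac{\exp\big(r_\theta(y_c)\big)}{\exp\big(r_\theta(y_c)\big) + \exp\big(r_\theta(y_r)\big)} \\
  &= \frac{1}{1 + \exp\big(-(r_\theta(y_c) - r_\theta(y_r))\big)} \\
  &= \sigma\big(r_\theta(y_c) - r_\theta(y_r)\big), \\[4pt]
\mathcal{L}_{RM}
  &= -\mathbb{E}_{(y_c, y_r)}\big[\log P(y_c \succ y_r)\big]
   = -\mathbb{E}_{(y_c, y_r)}\big[\log \sigma\big(r_\theta(y_c) - r_\theta(y_r)\big)\big].
\end{aligned}
```

Replacing the expectation with an average over the batch gives the mean form stated in the problem.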

NumPy

import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Your implementation goes here.
    pass
