
Bradley-Terry Reward Model Loss (Medium)

Implement the Bradley-Terry preference loss used to train reward models in RLHF.

Signature: def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float

Both inputs are 1D reward scores from the model. The loss is:

L = -mean( log_sigmoid(r_chosen - r_rejected) )

Use a numerically stable log_sigmoid: log_sigmoid(x) = min(x, 0) - log1p(exp(-|x|)).

Return a Python float.
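
A minimal sketch of one possible solution (the helper name log_sigmoid is illustrative, not part of the required signature):

import numpy as np

def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # Stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|)).
    # exp is only evaluated at non-positive arguments, so it never overflows.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Bradley-Terry preference loss: -mean(log sigmoid(r_chosen - r_rejected)).
    return float(-np.mean(log_sigmoid(r_chosen - r_rejected)))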


Test Results

- tied -> log(2)
- chosen >> rejected
- rejected wins (bad model)
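
As a sanity check against these cases (assuming the sketch above): tied rewards give a margin of 0 and log_sigmoid(0) = log(1/2), so the loss is log(2) ≈ 0.6931; a large positive margin drives the loss toward 0; a large negative margin makes the loss approximately |margin|.

import numpy as np

r = np.zeros(4)
print(reward_model_loss(r, r))        # ≈ 0.6931 (log 2): tied rewards
print(reward_model_loss(r + 100, r))  # ≈ 0.0: chosen scored far above rejected
print(reward_model_loss(r, r + 100))  # ≈ 100.0: rejected wins, large loss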