Implement the Bradley-Terry preference loss used to train reward models in RLHF.
Signature: def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float
Both inputs are 1D arrays of scalar reward scores produced by the model for the chosen and rejected responses. The loss is:
L = -mean( log_sigmoid(r_chosen - r_rejected) )
Use a numerically stable log_sigmoid, i.e. log_sigmoid(x) = min(x, 0) - log1p(exp(-|x|)), which avoids overflow in exp for large-magnitude inputs.
Return a Python float.
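A minimal NumPy sketch of the spec above. The stable log_sigmoid helper implements the piecewise identity given in the problem; the input values in the usage snippet are illustrative only.

```python
import numpy as np

def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # Numerically stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|)).
    # The -|x| argument keeps exp from overflowing for large |x|.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # Bradley-Terry preference loss: -mean(log sigmoid(r_chosen - r_rejected)).
    # The margin is positive when the model ranks the chosen response higher.
    margin = r_chosen - r_rejected
    return float(-np.mean(log_sigmoid(margin)))

# Illustrative usage (values chosen arbitrarily):
r_c = np.array([2.0, 0.5, -1.0])
r_r = np.array([1.0, 1.5, -2.0])
print(reward_model_loss(r_c, r_r))  # ~0.6466
```

Note the explicit float(...) cast: np.mean returns a NumPy scalar (np.float64), and the signature asks for a plain Python float.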