Backprop: RoPE rotation

Hand-derive the gradient of L = sum(RoPE(x, theta)) w.r.t. the input x. RoPE pairs adjacent feature dims and rotates each pair by an angle.

Forward: Group x into pairs along the last axis (d must be even). For pair index i (covering features 2i and 2i+1) with angle theta_i:

y[..., 2i]   = x[..., 2i]   * cos(theta_i) - x[..., 2i+1] * sin(theta_i)
y[..., 2i+1] = x[..., 2i]   * sin(theta_i) + x[..., 2i+1] * cos(theta_i)

Implement:

rope_forward(x, theta) -> y of the same shape as x
rope_backward(x, theta) -> dL/dx of the same shape

theta has shape (d // 2,) — one angle per pair, broadcast across leading dims.
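
A minimal sketch of the forward pass, assuming NumPy arrays (the problem statement does not name a library). It follows the two pair equations above directly; the shape checks are an addition for illustration, not part of the stated interface.

```python
import numpy as np

def rope_forward(x, theta):
    """Rotate each adjacent feature pair (2i, 2i+1) of x by theta[i]."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last axis of x must have even length")
    if theta.shape != (x.shape[-1] // 2,):
        raise ValueError("theta must have shape (d // 2,)")
    cos, sin = np.cos(theta), np.sin(theta)
    x_even = x[..., 0::2]  # features 0, 2, 4, ...
    x_odd = x[..., 1::2]   # features 1, 3, 5, ...
    y = np.empty_like(x)
    # theta has shape (d // 2,), so it broadcasts over all leading dims.
    y[..., 0::2] = x_even * cos - x_odd * sin
    y[..., 1::2] = x_even * sin + x_odd * cos
    return y

# Rotating the pair (1, 0) by 90 degrees gives (0, 1) up to rounding.
y = rope_forward(np.array([[1.0, 0.0]]), np.array([np.pi / 2]))
```

Slicing with strides `0::2` and `1::2` avoids reshaping into pairs and keeps the output layout identical to the input.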

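Each pair is transformed linearly, so its Jacobian is the 2x2 rotation matrix itself and the gradient does not depend on the values of x (rope_backward uses x only for its shape). With upstream dL/dy = ones (because L = sum(y)), the transpose of the rotation acts on the upstream gradient, so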

dL/dx[..., 2i]   = cos(theta_i) + sin(theta_i)
dL/dx[..., 2i+1] = cos(theta_i) - sin(theta_i)

(transposed rotation = inverse rotation by -theta, applied to the all-ones upstream).
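
The closed form above can be checked against a finite-difference estimate. The sketch below assumes NumPy; `numerical_grad` is a hypothetical helper introduced only for this check, and the forward pass is repeated so the block runs on its own.

```python
import numpy as np

def rope_forward(x, theta):
    # Pairwise rotation, exactly as in the forward definition.
    cos, sin = np.cos(theta), np.sin(theta)
    y = np.empty_like(x)
    y[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    y[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return y

def rope_backward(x, theta):
    """dL/dx for L = sum(rope_forward(x, theta)); only x's shape is used."""
    x = np.asarray(x, dtype=float)
    cos, sin = np.cos(theta), np.sin(theta)
    grad = np.empty_like(x)
    grad[..., 0::2] = cos + sin  # (d // 2,) broadcasts over leading dims
    grad[..., 1::2] = cos - sin
    return grad

def numerical_grad(x, theta, eps=1e-6):
    """Central-difference estimate of dL/dx, one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        loss_plus = rope_forward(x + step, theta).sum()
        loss_minus = rope_forward(x - step, theta).sum()
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
theta = rng.normal(size=2)
max_err = np.abs(rope_backward(x, theta) - numerical_grad(x, theta)).max()
```

Because L is linear in x, the central difference is exact up to floating-point rounding, so `max_err` should be tiny; a large value would point to a sign error in the transpose.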

Math

R_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \qquad \frac{\partial L}{\partial x} = R_{\theta}^{\top} \mathbf{1}
