Hand-derive the gradient of L = sum(RoPE(x, theta)) w.r.t. the input x. RoPE pairs adjacent feature dims and rotates each pair by an angle.
Forward: Group x into pairs along the last axis (d must be even). For pair index i (covering features 2i and 2i+1) with angle theta_i:
y[..., 2i] = x[..., 2i] * cos(theta_i) - x[..., 2i+1] * sin(theta_i)
y[..., 2i+1] = x[..., 2i] * sin(theta_i) + x[..., 2i+1] * cos(theta_i)
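The two forward equations map directly onto strided slicing in NumPy. The sketch below is one possible way to write it, not a reference solution; the name rope_forward_sketch and the even-d check are assumptions added here for illustration.

```python
import numpy as np

def rope_forward_sketch(x, theta):
    """Rotate each adjacent feature pair (2i, 2i+1) of x by theta[i].

    Hypothetical helper for illustration; x has shape (..., d) with d even,
    theta has shape (d // 2,).
    """
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last dimension d must be even")
    cos, sin = np.cos(theta), np.sin(theta)
    # Even-indexed features are the first element of each pair,
    # odd-indexed features the second.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    y = np.empty_like(x)
    y[..., 0::2] = x_even * cos - x_odd * sin
    y[..., 1::2] = x_even * sin + x_odd * cos
    return y

# Pair 0 = (1, 0) rotated by pi/2, pair 1 = (0, 2) rotated by pi.
x = np.array([[1.0, 0.0, 0.0, 2.0]])
theta = np.array([np.pi / 2, np.pi])
y = rope_forward_sketch(x, theta)
# y is approximately [[0, 1, 0, -2]] (up to floating-point rounding)
```

Because theta has shape (d // 2,) and each slice x[..., 0::2] has last dimension d // 2, broadcasting across the leading dims happens without any explicit reshape.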
Implement:
rope_forward(x, theta) -> y of the same shape as x
rope_backward(x, theta) -> dL/dx of the same shape
theta has shape (d // 2,): one angle per pair, broadcast across leading dims.
The Jacobian of a 2D rotation is the rotation matrix itself. With upstream dL/dy = ones (because L = sum(y)), the rotation's transpose acts on the upstream gradient, so
dL/dx[..., 2i] = cos(theta_i) + sin(theta_i)
dL/dx[..., 2i+1] = cos(theta_i) - sin(theta_i)
(transposed rotation = inverse rotation by -theta, applied to the all-ones upstream).
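The closed form above can be checked against central finite differences. The following is a minimal sketch under the problem's assumptions (L = sum(y), so the upstream gradient is all ones); the names rope_forward_sketch, rope_backward_sketch, and numerical_grad are hypothetical helpers introduced here, not part of the problem's API.

```python
import numpy as np

def rope_forward_sketch(x, theta):
    # Forward rotation of each (2i, 2i+1) pair by theta[i].
    cos, sin = np.cos(theta), np.sin(theta)
    y = np.empty_like(x)
    y[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    y[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return y

def rope_backward_sketch(x, theta):
    # dL/dx for L = sum(RoPE(x, theta)). The map is linear in x, so the
    # gradient depends only on theta; x is used for its shape and dtype.
    cos, sin = np.cos(theta), np.sin(theta)
    grad = np.empty_like(x)
    grad[..., 0::2] = cos + sin
    grad[..., 1::2] = cos - sin
    return grad

def numerical_grad(x, theta, eps=1e-6):
    # Central differences on L = sum(y), one input element at a time.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += eps
        x_minus[idx] -= eps
        grad[idx] = (rope_forward_sketch(x_plus, theta).sum()
                     - rope_forward_sketch(x_minus, theta).sum()) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4))           # leading dims (2, 3), d = 4
theta = rng.uniform(0.0, 2 * np.pi, size=2)  # one angle per pair
analytic = rope_backward_sketch(x, theta)
numeric = numerical_grad(x, theta)
max_abs_err = np.max(np.abs(analytic - numeric))
```

Since the forward pass is linear in x, the central difference is exact up to floating-point rounding, so max_abs_err should be far below typical gradient-check tolerances.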
import numpy as np

def rope_forward(x, theta):
    # x: (..., d) with d even; theta: (d // 2,). Return y with x's shape.
    pass