TorchedUp

211. Backprop: RoPE rotation

Medium

Hand-derive the gradient of L = sum(RoPE(x, theta)) w.r.t. the input x. RoPE pairs adjacent feature dims and rotates each pair by an angle.

Forward: Group x into pairs along the last axis (d must be even). For pair index i (covering features 2i and 2i+1) with angle theta_i:

y[..., 2i]   = x[..., 2i]   * cos(theta_i) - x[..., 2i+1] * sin(theta_i)
y[..., 2i+1] = x[..., 2i]   * sin(theta_i) + x[..., 2i+1] * cos(theta_i)
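The forward formulas above can be sketched in NumPy with strided slicing, where `0::2` selects the even features (2i) and `1::2` the odd features (2i+1). This is one possible implementation under the problem's stated shapes, not a reference solution; the even-dimension check and the small usage example are additions for illustration.

```python
import numpy as np

def rope_forward(x, theta):
    """Rotate each adjacent feature pair (2i, 2i+1) of x by theta[i]."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last dimension d must be even")
    cos, sin = np.cos(theta), np.sin(theta)  # each of shape (d // 2,)
    x_even = x[..., 0::2]                    # features 2i
    x_odd = x[..., 1::2]                     # features 2i+1
    y = np.empty_like(x)
    y[..., 0::2] = x_even * cos - x_odd * sin
    y[..., 1::2] = x_even * sin + x_odd * cos
    return y

# Rotating the pair (1, 0) by 90 degrees should give (0, 1) up to rounding.
y = rope_forward(np.array([1.0, 0.0]), np.array([np.pi / 2]))
```

Writing into a preallocated array keeps the output in the same interleaved layout as the input, so no reshape or stack is needed afterwards.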

Implement:

  • rope_forward(x, theta) -> y of the same shape as x
  • rope_backward(x, theta) -> dL/dx of the same shape

theta has shape (d // 2,) — one angle per pair, broadcast across leading dims.
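To make the broadcasting concrete, the short sketch below checks the shapes involved. The sizes `batch`, `seq`, and `d` and the angle values are arbitrary choices for illustration, not part of the problem.

```python
import numpy as np

batch, seq, d = 2, 3, 8                 # hypothetical sizes
x = np.zeros((batch, seq, d))
theta = np.linspace(0.0, 1.0, d // 2)   # arbitrary angles, one per pair

x_even = x[..., 0::2]   # even features of every pair: shape (2, 3, 4)
cos = np.cos(theta)     # shape (4,)
prod = x_even * cos     # (4,) broadcasts over the leading (2, 3) dims

shapes = (x_even.shape, cos.shape, prod.shape)
```

Because the angle axis is the trailing axis of both operands, no explicit reshape of theta is needed.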

The Jacobian of a 2D rotation with respect to its input is the rotation matrix itself, so backprop applies the rotation's transpose to the upstream gradient. With upstream dL/dy = ones (because L = sum(y)), this gives

dL/dx[..., 2i]   = cos(theta_i) + sin(theta_i)
dL/dx[..., 2i+1] = cos(theta_i) - sin(theta_i)

(The transpose of a rotation is its inverse, i.e. a rotation by -theta_i, applied here to the all-ones upstream gradient.)
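A quick way to gain confidence in the hand-derived gradient is to compare it against central finite differences. The sketch below assumes a straightforward NumPy forward pass; `numerical_grad`, the random seed, and the test shapes are illustrative helpers and choices, not part of the problem's API.

```python
import numpy as np

def rope_forward(x, theta):
    """Rotate each adjacent feature pair (2i, 2i+1) of x by theta[i]."""
    cos, sin = np.cos(theta), np.sin(theta)
    y = np.empty_like(x)
    y[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    y[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return y

def rope_backward(x, theta):
    """dL/dx for L = sum(rope_forward(x, theta)); only x's shape is used."""
    x = np.asarray(x, dtype=float)
    cos, sin = np.cos(theta), np.sin(theta)
    grad = np.empty_like(x)
    grad[..., 0::2] = cos + sin   # derivative w.r.t. feature 2i
    grad[..., 1::2] = cos - sin   # derivative w.r.t. feature 2i+1
    return grad

def numerical_grad(x, theta, eps=1e-6):
    """Central finite-difference estimate of d sum(rope_forward) / dx."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        plus = rope_forward(x + step, theta).sum()
        minus = rope_forward(x - step, theta).sum()
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
theta = rng.uniform(0.0, 2.0 * np.pi, size=2)
max_err = np.abs(rope_backward(x, theta) - numerical_grad(x, theta)).max()
```

Since the forward pass is linear in x, the gradient does not depend on the values of x at all, which is why `rope_backward` only reads its shape.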

Math

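$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \qquad \frac{\partial L}{\partial x} = R_\theta^{\top} \mathbf{1}$$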
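As a worked check for a single pair, expanding the transposed rotation applied to the all-ones vector recovers the two component formulas for dL/dx[..., 2i] and dL/dx[..., 2i+1]:

$$R_\theta^{\top} \mathbf{1} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta + \sin\theta \\ \cos\theta - \sin\theta \end{pmatrix}$$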

NumPy

import numpy as np

def rope_forward(x, theta):
    # x: (..., d) with d even; theta: (d // 2,), one angle per pair
    pass
