Hand-derive the gradient of L = sum(W[indices]) w.r.t. the embedding table W.
Forward: y = W[indices] — gather rows of W according to the integer index array indices. L = sum(y).
Implement:
embedding_forward(W, indices) -> y of shape indices.shape + (embed_dim,)embedding_backward(W, indices) -> dL/dW of shape (vocab_size, embed_dim)The trick: indices may repeat. If token i appears k times in indices, then W[i] contributes k copies to the sum, so dL/dW[i] = k * ones(embed_dim). Indices that don't appear get zero rows.
This is also called scatter-add in autograd land — the backward of a gather is a scatter that accumulates duplicates.
Math
Related problems
Asked at
import numpy as np
def embedding_forward(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?