Backprop: Embedding lookup

Hand-derive the gradient of L = sum(W[indices]) w.r.t. the embedding table W.

Forward: y = W[indices] — gather rows of W according to the integer index array indices. L = sum(y).

Implement:

embedding_forward(W, indices) -> y of shape indices.shape + (embed_dim,)
embedding_backward(W, indices) -> dL/dW of shape (vocab_size, embed_dim)

The trick: indices may repeat. If token i appears k times in indices, then W[i] contributes k copies to the sum, so dL/dW[i] = k * ones(embed_dim). Indices that don't appear get zero rows.

This is also called scatter-add in autograd land — the backward of a gather is a scatter that accumulates duplicates.

Math

y_{n, d} = W_{i_{n}, d}, \frac{\partial L}{\partial W _{v, d}} = # {n : i_{n} = v}

208. Backprop: Embedding lookup

208. Backprop: Embedding lookup