Speculative Decoding Speedup

Speculative decoding drafts k tokens with a small model, then verifies them in one target-model forward pass. Compute the expected speedup vs. plain autoregressive decoding.

Signature: def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float

alpha is the per-token acceptance probability (0..1)
draft_cost_ratio is the draft-model cost as a fraction of one target-model step
k is the number of speculative tokens drafted per round

Expected accepted tokens per round:

E[accepts] = alpha * (1 - alpha**k) / (1 - alpha) if alpha < 1 else k.
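The closed form above can be read as a finite geometric series. The sketch below assumes each drafted token is accepted independently with probability alpha and that nothing after the first rejection counts; the problem statement does not spell this out, but the formula is consistent with it.

```latex
\mathbb{E}[\text{accepts}]
  = \sum_{i=1}^{k} \Pr(\text{at least } i \text{ tokens accepted})
  = \sum_{i=1}^{k} \alpha^{i}
  = \frac{\alpha \, (1 - \alpha^{k})}{1 - \alpha}, \qquad 0 \le \alpha < 1
```

For alpha = 1 every term of the sum equals 1, which gives the special case E[accepts] = k.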

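Speedup: E[accepts] / (1 + k * draft_cost_ratio). The denominator is the cost of one round in units of a target-model step: one verification pass plus k draft steps.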

Example:

alpha=0.8, draft_cost_ratio=0.1, k=4 → E[accepts] = 0.8 * (1 - 0.4096) / 0.2 = 2.3616, speedup = 2.3616 / (1 + 4 * 0.1) = 2.3616 / 1.4 ≈ 1.687.
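One possible implementation of the signature, following the two formulas as stated, is sketched below. The input validation and the error messages are assumptions added for illustration; the problem statement does not specify how out-of-range inputs should be handled.

```python
def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float:
    """Expected speedup of speculative decoding over plain autoregressive decoding."""
    # Assumption: reject out-of-range inputs (not required by the problem statement).
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0 or draft_cost_ratio < 0.0:
        raise ValueError("k and draft_cost_ratio must be non-negative")

    # Expected accepted tokens per round; the closed form divides by (1 - alpha),
    # so alpha == 1 is handled separately and yields k.
    if alpha < 1.0:
        expected_accepts = alpha * (1.0 - alpha**k) / (1.0 - alpha)
    else:
        expected_accepts = float(k)

    # Cost of one round in target-step units: one verification pass plus k draft steps.
    round_cost = 1.0 + k * draft_cost_ratio
    return expected_accepts / round_cost


if __name__ == "__main__":
    # Reproduces the worked example: alpha=0.8, draft_cost_ratio=0.1, k=4.
    print(round(spec_decoding_speedup(0.8, 0.1, 4), 3))  # 1.687
```

With a free draft model (draft_cost_ratio=0) and alpha=1, the function returns k, the upper bound under this formula.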

Math

S = \frac{\alpha (1 - \alpha^{k}) / (1 - \alpha)}{1 + k \cdot c_{\text{draft}}}
