Speculative decoding drafts k tokens with a small model, then verifies them in one target-model forward pass. Compute the expected speedup vs. plain autoregressive decoding.
Signature: def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float
alpha is the per-token acceptance probability (0..1)draft_cost_ratio is the draft-model cost as a fraction of one target-model stepk is the number of speculative tokens drafted per roundExpected accepted tokens per round:
E[accepts] = alpha * (1 - alpha**k) / (1 - alpha) if alpha < 1 else k.
Speedup: E[accepts] / (1 + k * draft_cost_ratio).
Example:
alpha=0.8, draft_cost_ratio=0.1, k=4 → E[acc] = 0.8*(1-0.4096)/0.2 = 2.3616, speedup = 2.3616 / 1.4 ≈ 1.687.Math
Asked at
Test Results