TorchedUp
Difficulty: Medium

Speculative Decoding — Effective Speedup

Speculative decoding drafts k tokens with a small model, then verifies them in one target-model forward pass. Compute the expected speedup vs. plain autoregressive decoding.

Signature: def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float

  • alpha is the per-token acceptance probability (0..1)
  • draft_cost_ratio is the cost of one draft-model step as a fraction of one target-model step
  • k is the number of speculative tokens drafted per round

Expected accepted tokens per round:

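E[accepts] = alpha * (1 - alpha**k) / (1 - alpha) if alpha < 1 else k.

This is the geometric sum alpha + alpha**2 + ... + alpha**k: draft token i is accepted only if all i tokens up to and including it are accepted, which has probability alpha**i when acceptances are independent.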
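As a sanity check on the formula above, the sketch below compares the closed form with a direct sum of alpha**i for i = 1..k. The helper names expected_accepts and expected_accepts_by_sum are illustrative and not part of the problem statement.

```python
def expected_accepts(alpha: float, k: int) -> float:
    """Closed-form expected number of accepted draft tokens per round."""
    if alpha >= 1.0:
        # The closed form divides by zero at alpha == 1; every token is accepted.
        return float(k)
    return alpha * (1.0 - alpha**k) / (1.0 - alpha)


def expected_accepts_by_sum(alpha: float, k: int) -> float:
    """Same quantity written as the explicit sum alpha**1 + ... + alpha**k."""
    return sum(alpha**i for i in range(1, k + 1))


closed = expected_accepts(0.8, 4)
summed = expected_accepts_by_sum(0.8, 4)
print(closed, summed)  # both are approximately 2.3616
```

The two agree up to floating-point rounding, and the alpha == 1 branch matches the sum as well, since every term is then 1.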

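Speedup: E[accepts] / (1 + k * draft_cost_ratio). The denominator is the cost of one round in units of target-model steps: one verification pass plus k draft steps. Plain autoregressive decoding produces one token per target-model step, so this ratio is the speedup.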

Example:

  • alpha=0.8, draft_cost_ratio=0.1, k=4 → E[acc] = 0.8*(1-0.4096)/0.2 = 2.3616, speedup = 2.3616 / 1.4 ≈ 1.687.
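One possible implementation, following the two formulas in the statement directly. The input validation is an assumption added for illustration; the problem statement does not say how out-of-range arguments should be handled.

```python
def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float:
    """Expected speedup of speculative decoding over plain autoregressive decoding."""
    # Assumed validation (not specified by the problem statement).
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0 or draft_cost_ratio < 0.0:
        raise ValueError("k and draft_cost_ratio must be non-negative")

    # Expected accepted draft tokens per round; alpha == 1 is handled
    # separately because the closed form divides by (1 - alpha).
    if alpha < 1.0:
        expected = alpha * (1.0 - alpha**k) / (1.0 - alpha)
    else:
        expected = float(k)

    # Cost of one round in target-step units: one verification pass
    # plus k draft-model steps.
    round_cost = 1.0 + k * draft_cost_ratio
    return expected / round_cost


print(round(spec_decoding_speedup(0.8, 0.1, 4), 3))  # 1.687
```

With a perfect draft (alpha = 1) the same call returns k / (1 + k * draft_cost_ratio), for example 4 / 1.4 for k = 4 and draft_cost_ratio = 0.1.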
