TorchedUp

147. Speculative Decoding Speedup

Medium

Speculative decoding drafts k tokens with a small model, then verifies them in one target-model forward pass. Compute the expected speedup vs. plain autoregressive decoding.

Signature: def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float

  • alpha is the per-token acceptance probability (0..1)
  • draft_cost_ratio is the draft-model cost as a fraction of one target-model step
  • k is the number of speculative tokens drafted per round

Expected accepted tokens per round:

E[accepts] = alpha * (1 - alpha**k) / (1 - alpha) if alpha < 1 else k.
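The closed form above matches the geometric sum alpha + alpha**2 + ... + alpha**k. One reading, stated here as an assumption rather than something the problem spells out, is that draft token i only counts if every earlier draft token was also accepted, each independently with probability alpha. A minimal sketch that checks the closed form against the explicit sum (both helper names are hypothetical, not part of the problem):

```python
def expected_accepts_closed_form(alpha: float, k: int) -> float:
    # Closed form from the problem statement; alpha == 1 accepts all k tokens.
    if alpha >= 1.0:
        return float(k)
    return alpha * (1.0 - alpha**k) / (1.0 - alpha)


def expected_accepts_sum(alpha: float, k: int) -> float:
    # Explicit sum: P(first i drafts all accepted) = alpha**i, for i = 1..k.
    # Assumes independent per-token acceptance (an illustrative assumption).
    return sum(alpha**i for i in range(1, k + 1))
```

For alpha=0.8 and k=4 both helpers should agree up to floating-point rounding.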

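Speedup: E[accepts] / (1 + k * draft_cost_ratio). The denominator is the cost of one round measured in target-model steps: one verification pass plus k draft steps.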

Example:

  • alpha=0.8, draft_cost_ratio=0.1, k=4 → E[acc] = 0.8*(1-0.4096)/0.2 = 2.3616, speedup = 2.3616 / 1.4 ≈ 1.687.
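The two formulas and the worked example can be combined into a short sketch. This is one possible implementation under the definitions given in the statement, not the site's reference solution; it uses plain Python floats and does no input validation.

```python
def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float:
    # Expected accepted tokens per round (closed form from the statement).
    if alpha < 1.0:
        expected_accepts = alpha * (1.0 - alpha**k) / (1.0 - alpha)
    else:
        expected_accepts = float(k)
    # Cost of one round in target-model steps: 1 verify pass + k draft steps.
    round_cost = 1.0 + k * draft_cost_ratio
    return expected_accepts / round_cost


# Worked example from the statement: alpha=0.8, draft_cost_ratio=0.1, k=4.
print(round(spec_decoding_speedup(0.8, 0.1, 4), 3))  # 1.687
```

Under this formula a value below 1.0 means the round costs more than it yields, for instance when alpha is small and draft_cost_ratio is large.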

Math

$$S = \frac{\alpha\,(1 - \alpha^{k}) / (1 - \alpha)}{1 + k \cdot c_{\text{draft}}}$$


NumPy

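import numpy as np

def spec_decoding_speedup(alpha: float, draft_cost_ratio: float, k: int) -> float:
    # Starter stub: fill in the speedup formula from the problem statement.
    pass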
