Unigram Language Model Tokenizer (Hard)

Implement the Viterbi decoding step of the Unigram Language Model tokenizer (SentencePiece, XLNet, mBART). Given a vocabulary of subword tokens with log-probabilities, find the tokenization with the highest total log-probability.

Signature: def unigram_tokenize(text, vocab_scores)

  • text: string to tokenize
  • vocab_scores: dict mapping token string → log-probability (float, ≤ 0)
  • Returns: list of token strings (best Viterbi path), or ['[UNK]'] if no valid path exists
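
For example, with a small made-up vocabulary (the tokens and scores below are purely illustrative, not taken from any trained model), the expected behavior is:

vocab_scores = {
    'hello': -1.0, 'he': -2.5, 'llo': -3.0,   # hypothetical log-probabilities
    'l': -4.0, 'o': -4.0,
}

unigram_tokenize('hello', vocab_scores)   # -> ['hello']   (-1.0 beats 'he' + 'llo' = -5.5)
unigram_tokenize('xyz', vocab_scores)     # -> ['[UNK]']   (no token matches any substring)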

Algorithm: Viterbi Dynamic Programming

Here dp[i] holds the best total log-probability of any segmentation of text[:i], and backtrack[i] remembers which token produced it:

def unigram_tokenize(text, vocab_scores):
    n = len(text)
    NEG_INF = float('-inf')
    dp = [NEG_INF] * (n + 1)        # dp[i] = best log-prob of tokenizing text[:i]
    dp[0] = 0.0                     # log-prob of the empty prefix is 0
    backtrack = [None] * (n + 1)    # backtrack[i] = (j, token) on the best path to i

    for i in range(1, n + 1):
        for token, log_prob in vocab_scores.items():
            j = i - len(token)
            if j >= 0 and text[j:i] == token and dp[j] > NEG_INF:
                candidate = dp[j] + log_prob
                if candidate > dp[i]:
                    dp[i] = candidate
                    backtrack[i] = (j, token)

    if dp[n] == NEG_INF:            # no valid segmentation exists
        return ['[UNK]']

    # Reconstruct the best tokenization by following backtrack pointers
    tokens = []
    i = n
    while i > 0:
        j, token = backtrack[i]
        tokens.append(token)
        i = j
    return tokens[::-1]
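
To make the recurrence concrete, here is a hand trace on "abc" with an illustrative vocabulary (scores invented for this sketch; "abc" itself is deliberately missing, so a split is forced):

vocab_scores = {'a': -2.0, 'ab': -2.0, 'bc': -2.5, 'c': -3.0}   # hypothetical scores

# dp[0] = 0.0                                   (empty prefix)
# dp[1] = dp[0] + score('a')  = -2.0
# dp[2] = dp[0] + score('ab') = -2.0
# dp[3] = max(dp[1] + score('bc'), dp[2] + score('c'))
#       = max(-4.5, -5.0) = -4.5                best move: 'bc' ending at 3, from j = 1
#
# Backtracking: 3 -> ('bc', j=1) -> ('a', j=0), so the result is ['a', 'bc'].
unigram_tokenize('abc', vocab_scores)   # -> ['a', 'bc']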

Unigram vs BPE vs WordPiece

|           | BPE           | WordPiece      | Unigram              |
|-----------|---------------|----------------|----------------------|
| Matching  | Greedy merges | Greedy longest | Viterbi global best  |
| Objective | Frequency     | Likelihood     | Maximize log-prob    |
| Used by   | GPT/Llama     | BERT           | SentencePiece/XLNet  |
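
The practical difference shows up when the longest matching token is not on the best-scoring path. The sketch below contrasts a WordPiece-style greedy longest-match pass with the Viterbi search above, using hypothetical scores (the greedy helper is simplified for illustration and omits WordPiece details such as the ## continuation prefix):

def greedy_longest_match(text, vocab_scores):
    # WordPiece-style: repeatedly take the longest token that matches at the cursor,
    # ignoring scores entirely.
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab_scores:
                tokens.append(piece)
                i += length
                break
        else:
            return ['[UNK]']
    return tokens

vocab_scores = {'ab': -1.0, 'a': -1.5, 'bc': -1.5, 'c': -5.0}   # hypothetical scores
greedy_longest_match('abc', vocab_scores)   # -> ['ab', 'c']   total log-prob = -6.0
unigram_tokenize('abc', vocab_scores)       # -> ['a', 'bc']   total log-prob = -3.0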

Test Results

○"hello" — whole word has best log-prob
○"lower" — lower score beats lo+wer
○"abc" — forced split: a+bc beats ab+c
○"xyz" — no tokens match → [UNK]
○"running" — "running" whole word beats "run"+"ning"🔒 Premium