Implement the Viterbi decoding step of the Unigram Language Model tokenizer (used by SentencePiece, XLNet, and mBART). Given a vocabulary of subword tokens with log-probabilities, find the tokenization whose total log-probability is highest.
Signature: `def unigram_tokenize(text, vocab_scores)`
- `text`: string to tokenize
- `vocab_scores`: dict mapping token string → log-probability (float, ≤ 0)
- Returns: the best tokenization as a list of tokens, or `['[UNK]']` if no valid path exists
```python
import math

def unigram_tokenize(text, vocab_scores):
    n = len(text)
    dp = [0.0] + [-math.inf] * n    # dp[i] = best log-prob of text[:i]; empty prefix = 0.0
    backtrack = [None] * (n + 1)    # backtrack[i] = (j, token) of the best last token
    for i in range(1, n + 1):
        for token, log_prob in vocab_scores.items():
            j = i - len(token)
            if j >= 0 and text[j:i] == token and dp[j] > -math.inf:
                candidate = dp[j] + log_prob
                if candidate > dp[i]:
                    dp[i] = candidate
                    backtrack[i] = (j, token)
    if dp[n] == -math.inf:          # no valid path: fall back to [UNK]
        return ['[UNK]']
    tokens = []                     # reconstruct best tokenization by following backtrack pointers
    i = n
    while i > 0:
        j, token = backtrack[i]
        tokens.append(token)
        i = j
    return tokens[::-1]
```
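A quick sanity check. The vocabulary and scores below are made up for illustration, not from any real model:

```python
# Toy vocabulary; scores are made-up log-probabilities for illustration only
vocab = {'un': -1.0, 'i': -2.0, 'gram': -1.5, 'unigram': -6.0}
print(unigram_tokenize('unigram', vocab))   # ['un', 'i', 'gram']: total -4.5 beats 'unigram' at -6.0
print(unigram_tokenize('xyz', vocab))       # ['[UNK]']: no valid segmentation exists
```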
| | BPE | WordPiece | Unigram |
|--|-----|-----------|---------|
| Matching | Greedy merges | Greedy longest | Viterbi global best |
| Objective | Frequency | Likelihood | Maximize log-prob |
| Used by | GPT/Llama | BERT | SentencePiece/XLNet |
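To make the "greedy vs. global" row concrete, here is a minimal sketch of greedy longest-match decoding in the WordPiece style (a simplification: real WordPiece also uses `##` continuation prefixes), run against the same toy vocabulary as above. The helper `greedy_longest_match` is hypothetical, written only for this comparison:

```python
def greedy_longest_match(text, vocab_scores):
    # Hypothetical helper: repeatedly take the longest vocab token at the current position
    tokens, i = [], 0
    while i < len(text):
        match = None
        for k in range(len(text), i, -1):   # try the longest span first
            if text[i:k] in vocab_scores:
                match = text[i:k]
                break
        if match is None:
            return ['[UNK]']
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {'un': -1.0, 'i': -2.0, 'gram': -1.5, 'unigram': -6.0}
print(greedy_longest_match('unigram', vocab))  # ['unigram']: locally longest match wins
print(unigram_tokenize('unigram', vocab))      # ['un', 'i', 'gram']: globally best (-4.5 > -6.0)
```

Greedy matching commits to the longest token at each step; Viterbi considers every segmentation and can prefer several short, high-probability tokens over one long, low-probability one.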