Unigram Language Model Tokenizer (Hard)

Implement the Viterbi decoding step of the Unigram Language Model tokenizer (SentencePiece, XLNet, mBART). Given a vocabulary of subword tokens with log-probabilities, find the tokenization with the highest total log-probability.

Signature: def unigram_tokenize(text, vocab_scores)

  • text: string to tokenize
  • vocab_scores: dict mapping token string → log-probability (float, ≤ 0)
  • Returns: list of token strings (best Viterbi path), or ['[UNK]'] if no valid path exists
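
For example, with a small made-up vocabulary (the tokens and scores below are purely illustrative, not taken from any trained model), the expected behavior is:

vocab_scores = {
    'hello': -1.0, 'he': -2.5, 'llo': -3.0,   # hypothetical log-probabilities
    'l': -4.0, 'o': -4.0,
}

unigram_tokenize('hello', vocab_scores)   # -> ['hello']   (-1.0 beats 'he' + 'llo' = -5.5)
unigram_tokenize('xyz', vocab_scores)     # -> ['[UNK]']   (no token matches any substring)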

Algorithm: Viterbi Dynamic Programming

Here dp[i] holds the best total log-probability of any segmentation of text[:i], and backtrack[i] remembers which token produced it:

def unigram_tokenize(text, vocab_scores):
    n = len(text)
    NEG_INF = float('-inf')
    dp = [NEG_INF] * (n + 1)        # dp[i] = best log-prob of tokenizing text[:i]
    dp[0] = 0.0                     # log-prob of the empty prefix is 0
    backtrack = [None] * (n + 1)    # backtrack[i] = (j, token) on the best path to i

    for i in range(1, n + 1):
        for token, log_prob in vocab_scores.items():
            j = i - len(token)
            if j >= 0 and text[j:i] == token and dp[j] > NEG_INF:
                candidate = dp[j] + log_prob
                if candidate > dp[i]:
                    dp[i] = candidate
                    backtrack[i] = (j, token)

    if dp[n] == NEG_INF:            # no valid segmentation exists
        return ['[UNK]']

    # Reconstruct the best tokenization by following backtrack pointers
    tokens = []
    i = n
    while i > 0:
        j, token = backtrack[i]
        tokens.append(token)
        i = j
    return tokens[::-1]
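
To make the recurrence concrete, here is a hand trace on "abc" with an illustrative vocabulary (scores invented for this sketch; "abc" itself is deliberately missing, so a split is forced):

vocab_scores = {'a': -2.0, 'ab': -2.0, 'bc': -2.5, 'c': -3.0}   # hypothetical scores

# dp[0] = 0.0                                   (empty prefix)
# dp[1] = dp[0] + score('a')  = -2.0
# dp[2] = dp[0] + score('ab') = -2.0
# dp[3] = max(dp[1] + score('bc'), dp[2] + score('c'))
#       = max(-4.5, -5.0) = -4.5                best move: 'bc' ending at 3, from j = 1
#
# Backtracking: 3 -> ('bc', j=1) -> ('a', j=0), so the result is ['a', 'bc'].
unigram_tokenize('abc', vocab_scores)   # -> ['a', 'bc']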

Unigram vs BPE vs WordPiece

|           | BPE           | WordPiece      | Unigram              |
|-----------|---------------|----------------|----------------------|
| Matching  | Greedy merges | Greedy longest | Viterbi global best  |
| Objective | Frequency     | Likelihood     | Maximize log-prob    |
| Used by   | GPT/Llama     | BERT           | SentencePiece/XLNet  |
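
The practical difference shows up when the longest matching token is not on the best-scoring path. The sketch below contrasts a WordPiece-style greedy longest-match pass with the Viterbi search above, using hypothetical scores (the greedy helper is simplified for illustration and omits WordPiece details such as the ## continuation prefix):

def greedy_longest_match(text, vocab_scores):
    # WordPiece-style: repeatedly take the longest token that matches at the cursor,
    # ignoring scores entirely.
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab_scores:
                tokens.append(piece)
                i += length
                break
        else:
            return ['[UNK]']
    return tokens

vocab_scores = {'ab': -1.0, 'a': -1.5, 'bc': -1.5, 'c': -5.0}   # hypothetical scores
greedy_longest_match('abc', vocab_scores)   # -> ['ab', 'c']   total log-prob = -6.0
unigram_tokenize('abc', vocab_scores)       # -> ['a', 'bc']   total log-prob = -3.0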

Test Results

○"hello" — whole word has best log-prob
○"lower" — lower score beats lo+wer
○"abc" — forced split: a+bc beats ab+c
○"xyz" — no tokens match → [UNK]
○"running" — "running" whole word beats "run"+"ning"🔒 Premium