
WordPiece Tokenization (BERT-style)

Implement the WordPiece tokenization algorithm used by BERT, DistilBERT, and many other models. WordPiece splits words into subword tokens using greedy longest-match from left to right.

Signature: def wordpiece_tokenize(word, vocab)

  • word: string to tokenize (a single word)
  • vocab: list of vocabulary tokens (strings)
  • Returns: list of token strings, or ['[UNK]'] if no valid split exists

Algorithm

def wordpiece_tokenize(word, vocab):
    vocab = set(vocab)  # set for O(1) membership checks
    start = 0
    tokens = []
    while start < len(word):
        end = len(word)
        found = False
        # Greedy: try the longest remaining substring first, shrinking from the right
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = '##' + substr  # continuation-subword prefix
            if substr in vocab:
                tokens.append(substr)
                start = end
                found = True
                break
            end -= 1
        if not found:
            return ['[UNK]']  # no vocabulary entry matches at this position
    return tokens
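
With a toy vocabulary (assumed here purely for illustration), the classic BERT example splits like this:

vocab = ['un', '##want', '##ed']
# start=0: tries 'unwanted', 'unwante', ... and stops at 'un'
# start=2: tries '##wanted', '##wante', ... and stops at '##want'
# start=6: tries '##ed' and matches immediately
wordpiece_tokenize('unwanted', vocab)  # ['un', '##want', '##ed']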

Key Rules

  • The first subword is unprefixed: "run" matches "run" in vocab
  • Continuation subwords get the ## prefix: "ning" matches "##ning"
  • Greedy longest-match: always try the longest possible substring first (see the sketch after this list)
  • If any position can't be matched, return ['[UNK]']
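
Greediness matters when several valid splits exist. With a toy vocabulary (again an assumption for illustration) containing both a long and a short prefix, the longest match wins even though another split would also work:

vocab = ['hell', 'he', '##llo', '##o']
wordpiece_tokenize('hello', vocab)  # ['hell', '##o'], not ['he', '##llo']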

WordPiece vs BPE

|             | BPE                             | WordPiece                           |
|-------------|---------------------------------|-------------------------------------|
| Build vocab | Bottom-up: merge most frequent  | Bottom-up: merge to maximize likelihood |
| Tokenize    | Apply learned merges in order   | Greedy longest-match from left      |
| Used by     | GPT, Llama                      | BERT, DistilBERT                    |
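
For contrast, here is a minimal sketch of BPE's tokenization step, applying a hypothetical list of learned merges in order (this is not WordPiece; it is included only to make the table's second row concrete):

def bpe_tokenize(word, merges):
    # Start from single characters, then apply each merge rule in order
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tokens

bpe_tokenize('runs', [('r', 'u'), ('ru', 'n')])  # ['run', 's']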

Test Results

○"unwanted" is in vocab → single token
○"unwanted" not in vocab → split into subwords
○"runs" split as run + ##s
○"xbc" — x not in vocab → [UNK]
○"hello" — longest match finds "he" + "##llo"🔒 Premium