Implement the WordPiece tokenization algorithm used by BERT, DistilBERT, and many other models. WordPiece splits words into subword tokens using greedy longest-match from left to right.
Signature: def wordpiece_tokenize(word, vocab)
Parameters:

- `word`: string to tokenize (a single word)
- `vocab`: list of vocabulary tokens (strings)

Returns a list of subword tokens, or `['[UNK]']` if no valid split exists.

Reference implementation:

```python
def wordpiece_tokenize(word, vocab):
    start = 0
    tokens = []
    while start < len(word):
        # Greedily try the longest remaining substring first, shrinking from the right.
        end = len(word)
        found = False
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = '##' + substr  # continuation subword prefix
            if substr in vocab:
                tokens.append(substr)
                start = end
                found = True
                break
            end -= 1
        if not found:
            # No vocabulary token matches at this position, so the word has no valid split.
            return ['[UNK]']
    return tokens
```
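A quick sanity check with a toy vocabulary (the vocabulary below is assumed for illustration only; a real BERT vocab has roughly 30k entries):

```python
# Toy vocabulary for illustration; membership checks are O(1) with a set.
toy_vocab = {'run', '##ning', '##ner', 'jump', '[UNK]'}

print(wordpiece_tokenize('running', toy_vocab))  # ['run', '##ning']
print(wordpiece_tokenize('runner', toy_vocab))   # ['run', '##ner']
print(wordpiece_tokenize('swim', toy_vocab))     # ['[UNK]'] -- no valid split
```

Passing the vocabulary as a set rather than a list keeps the `in` check constant-time; the function body is identical either way.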
"run" matches "run" in vocab## prefix: "ning" matches "##ning"['[UNK]']| | BPE | WordPiece | |--|-----|-----------| | Build vocab | Bottom-up: merge most frequent | Bottom-up: merge to maximize likelihood | | Tokenize | Apply learned merges in order | Greedy longest-match from left | | Used by | GPT, Llama | BERT, DistilBERT |