
WordPiece Tokenization (BERT-style)

Implement the WordPiece tokenization algorithm used by BERT, DistilBERT, and many other models. WordPiece splits words into subword tokens using greedy longest-match from left to right.

Signature: def wordpiece_tokenize(word, vocab)

  • word: string to tokenize (a single word)
  • vocab: list of vocabulary tokens (strings)
  • Returns: list of token strings, or ['[UNK]'] if no valid split exists

Algorithm

def wordpiece_tokenize(word, vocab):
    vocab = set(vocab)  # set for O(1) membership checks
    start = 0
    tokens = []
    while start < len(word):
        end = len(word)
        found = False
        # Greedy: try the longest remaining substring first, shrinking from the right
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = '##' + substr  # continuation-subword prefix
            if substr in vocab:
                tokens.append(substr)
                start = end
                found = True
                break
            end -= 1
        if not found:
            return ['[UNK]']  # no vocabulary entry matches at this position
    return tokens
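
With a toy vocabulary (assumed here purely for illustration), the classic BERT example splits like this:

vocab = ['un', '##want', '##ed']
# start=0: tries 'unwanted', 'unwante', ... and stops at 'un'
# start=2: tries '##wanted', '##wante', ... and stops at '##want'
# start=6: tries '##ed' and matches immediately
wordpiece_tokenize('unwanted', vocab)  # ['un', '##want', '##ed']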

Key Rules

  • The first subword is unprefixed: "run" matches "run" in vocab
  • Continuation subwords get the ## prefix: "ning" matches "##ning"
  • Greedy longest-match: always try the longest possible substring first (see the sketch after this list)
  • If any position can't be matched, return ['[UNK]']
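
Greediness matters when several valid splits exist. With a toy vocabulary (again an assumption for illustration) containing both a long and a short prefix, the longest match wins even though another split would also work:

vocab = ['hell', 'he', '##llo', '##o']
wordpiece_tokenize('hello', vocab)  # ['hell', '##o'], not ['he', '##llo']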

WordPiece vs BPE

|             | BPE                             | WordPiece                           |
|-------------|---------------------------------|-------------------------------------|
| Build vocab | Bottom-up: merge most frequent  | Bottom-up: merge to maximize likelihood |
| Tokenize    | Apply learned merges in order   | Greedy longest-match from left      |
| Used by     | GPT, Llama                      | BERT, DistilBERT                    |
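
For contrast, here is a minimal sketch of BPE's tokenization step, applying a hypothetical list of learned merges in order (this is not WordPiece; it is included only to make the table's second row concrete):

def bpe_tokenize(word, merges):
    # Start from single characters, then apply each merge rule in order
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tokens

bpe_tokenize('runs', [('r', 'u'), ('ru', 'n')])  # ['run', 's']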

Test Results

○"unwanted" is in vocab → single token
○"unwanted" not in vocab → split into subwords
○"runs" split as run + ##s
○"xbc" — x not in vocab → [UNK]
○"hello" — longest match finds "he" + "##llo"🔒 Premium