TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

95. WordPiece Tokenization (BERT-style)

Medium

Implement the WordPiece tokenization algorithm used by BERT, DistilBERT, and many other models. WordPiece splits words into subword tokens using greedy longest-match from left to right.

Signature: def wordpiece_tokenize(word, vocab)

  • word: string to tokenize (a single word)
  • vocab: list of vocabulary tokens (strings)
  • Returns: list of token strings, or ['[UNK]'] if no valid split exists

Algorithm

start = 0
tokens = []
while start < len(word):
    end = len(word)
    found = False
    while start < end:
        substr = word[start:end]
        if start > 0:
            substr = '##' + substr  # continuation subword prefix
        if substr in vocab:
            tokens.append(substr)
            start = end
            found = True
            break
        end -= 1
    if not found:
        return ['[UNK]']
return tokens

Key Rules

  • The first subword is unprefixed: "run" matches "run" in vocab
  • Continuation subwords get the ## prefix: "ning" matches "##ning"
  • Greedy longest-match: always try the longest possible substring first
  • If any position can't be matched, return ['[UNK]']

WordPiece vs BPE

| | BPE | WordPiece | |--|-----|-----------| | Build vocab | Bottom-up: merge most frequent | Bottom-up: merge to maximize likelihood | | Tokenize | Apply learned merges in order | Greedy longest-match from left | | Used by | GPT, Llama | BERT, DistilBERT |

Asked at

NumPy

import numpy as np

 

def wordpiece_tokenize(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?