TorchedUp
ProblemsPremium
TorchedUp
BPE Tokenization (Apply Merge Rules)Easy
ProblemsPremium

BPE Tokenization (Apply Merge Rules)

Implement the inference step of Byte-Pair Encoding: given a pre-learned list of merge rules, apply them to tokenize a new string.

Signature: def bpe_apply(text, merges)

  • text: string to tokenize
  • merges: list of [a, b] pairs (learned merge rules, in priority order)
  • Returns: list of token strings

Algorithm

  1. Initialize: split text into individual characters: tokens = list(text)
  2. Apply merges in order: for each merge (a, b):
    • Scan tokens left-to-right
    • Wherever tokens[i] == a and tokens[i+1] == b, merge into a+b
    • (greedy left-to-right, replace as you go)
  3. Return the final token list
tokens = list("hello")  # ['h','e','l','l','o']
apply ('h','e') → ['he','l','l','o']
apply ('l','l') → ['he','ll','o']
apply ('he','ll') → ['hell','o']
apply ('hell','o') → ['hello']

BPE vs GPT Tokenizer

The real GPT tokenizer applies merges to byte sequences and handles special tokens/regex pre-splitting. This problem focuses on the core merge application algorithm, which is the same in all BPE variants.

Asked at

Python (numpy)0/3 runs today

Test Results

○"hello" with merges that build it up → ["hello"]
○"aaab" — second merge (aa,b) can't apply because aa is not adjacent to b
○"abcabc" — merges build abc twice then merge the pair
○"xyz" with no applicable merges → individual chars
○"aab" — greedy left-to-right: first aa merges before trying ab🔒 Premium
Advertisement