Implement the inference step of Byte-Pair Encoding: given a pre-learned list of merge rules, apply them to tokenize a new string.
Signature: def bpe_apply(text, merges)
Arguments:
text: string to tokenize
merges: list of [a, b] pairs (learned merge rules, in priority order)

Algorithm:
1. Split text into individual characters: tokens = list(text)
2. For each merge rule (a, b), in priority order: scan tokens left-to-right; wherever tokens[i] == a and tokens[i+1] == b, replace the pair with the single token a+b
3. Return the final token list

Example:
tokens = list("hello") # ['h','e','l','l','o']
apply ('h','e') → ['he','l','l','o']
apply ('l','l') → ['he','ll','o']
apply ('he','ll') → ['hell','o']
apply ('hell','o') → ['hello']
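The walkthrough above can be sketched directly: one pass over the merge rules, and for each rule a single left-to-right scan that collapses matching adjacent pairs. This is a minimal reference sketch, not the only valid implementation (a production tokenizer would pick the highest-priority applicable merge repeatedly rather than apply each rule once, but for this problem's per-rule scan the logic below matches the example trace):

```python
def bpe_apply(text, merges):
    # Start from individual characters.
    tokens = list(text)
    # Apply each learned merge rule in priority order.
    for a, b in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge an adjacent (a, b) pair into the single token a+b.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

Running it on the example reproduces the trace: bpe_apply("hello", [["h","e"],["l","l"],["he","ll"],["hell","o"]]) yields ['hello'].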
The real GPT tokenizer applies merges to byte sequences and handles special tokens/regex pre-splitting. This problem focuses on the core merge application algorithm, which is the same in all BPE variants.