Implement the inference step of Byte-Pair Encoding: given a pre-learned list of merge rules, apply them to tokenize a new string.
Signature: def bpe_apply(text, merges)
Arguments:
text: string to tokenize
merges: list of [a, b] pairs (learned merge rules, in priority order)

Algorithm:
1. Split text into individual characters: tokens = list(text)
2. For each merge rule (a, b), in priority order: scan tokens left-to-right; wherever tokens[i] == a and tokens[i+1] == b, replace the pair with the single token a+b
3. Return the final token list

Example:
tokens = list("hello") # ['h','e','l','l','o']
apply ('h','e') → ['he','l','l','o']
apply ('l','l') → ['he','ll','o']
apply ('he','ll') → ['hell','o']
apply ('hell','o') → ['hello']
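The walkthrough above can be sketched directly: one pass over the merge rules, and for each rule a single left-to-right scan that collapses matching adjacent pairs. This is a minimal reference sketch, not the only valid implementation (a production tokenizer would pick the highest-priority applicable merge repeatedly rather than apply each rule once, but for this problem's per-rule scan the logic below matches the example trace):

```python
def bpe_apply(text, merges):
    # Start from individual characters.
    tokens = list(text)
    # Apply each learned merge rule in priority order.
    for a, b in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge an adjacent (a, b) pair into the single token a+b.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

Running it on the example reproduces the trace: bpe_apply("hello", [["h","e"],["l","l"],["he","ll"],["hell","o"]]) yields ['hello'].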
The real GPT tokenizer applies merges to byte sequences and handles special tokens/regex pre-splitting. This problem focuses on the core merge application algorithm, which is the same in all BPE variants.