TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

94. BPE Tokenization (Apply Merge Rules)

Easy

Implement the inference step of Byte-Pair Encoding: given a pre-learned list of merge rules, apply them to tokenize a new string.

Signature: def bpe_apply(text, merges)

  • text: string to tokenize
  • merges: list of [a, b] pairs (learned merge rules, in priority order)
  • Returns: list of token strings

Algorithm

  1. Initialize: split text into individual characters: tokens = list(text)
  2. Apply merges in order: for each merge (a, b):
    • Scan tokens left-to-right
    • Wherever tokens[i] == a and tokens[i+1] == b, merge into a+b
    • (greedy left-to-right, replace as you go)
  3. Return the final token list
tokens = list("hello")  # ['h','e','l','l','o']
apply ('h','e') → ['he','l','l','o']
apply ('l','l') → ['he','ll','o']
apply ('he','ll') → ['hell','o']
apply ('hell','o') → ['hello']

BPE vs GPT Tokenizer

The real GPT tokenizer applies merges to byte sequences and handles special tokens/regex pre-splitting. This problem focuses on the core merge application algorithm, which is the same in all BPE variants.

Asked at

NumPy

import numpy as np

 

def bpe_apply(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?