
Byte-Pair Encoding (BPE) (Medium)

BPE is the tokenization algorithm used by GPT-2, GPT-3, LLaMA, and Mistral. Starting from a character-level vocabulary, it iteratively merges the most frequent adjacent pair of tokens until the vocabulary reaches the desired size.
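
For intuition, here is a minimal sketch of how a single merge rewrites one word; the word and the chosen pair are illustrative only.

    word = ["l", "o", "w", "e", "r", "</w>"]  # "lower" split into characters + marker
    pair = ("e", "r")                         # suppose this pair is the most frequent

    # Replace each adjacent occurrence of the pair with its concatenation.
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1

    print(merged)  # ['l', 'o', 'w', 'er', '</w>']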

Signature: def bpe_train(corpus, num_merges)

  • corpus: dict mapping word → count, e.g. {"low": 5, "lower": 2, "newest": 6}
  • num_merges: number of BPE merge operations to perform
  • Returns: list of merge rules as tuples, e.g. [("e", "s"), ("es", "t"), ...]
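
As an illustration, a call on the corpus above might look as follows. The exact merge list depends on the tie-breaking rule; the output shown is what the reference sketch at the end of this page produces.

    corpus = {"low": 5, "lower": 2, "newest": 6}
    print(bpe_train(corpus, 3))
    # [('w', 'e'), ('l', 'o'), ('e', 'we')]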

Algorithm:

  1. Initialize vocabulary: split each word into characters + </w> end-of-word marker
  2. Count pair frequencies across all words (weighted by word count)
  3. Find the most frequent adjacent pair (break ties alphabetically)
  4. Merge that pair everywhere in the vocabulary
  5. Repeat until num_merges merges are done
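
Putting the five steps together, here is a minimal reference sketch in plain Python. It assumes that "alphabetically" means Python's default tuple comparison on the pair of token strings, and that merging stops early once no adjacent pairs remain; both are reasonable readings the hidden tests may refine.

    from collections import Counter

    def bpe_train(corpus, num_merges):
        # Step 1: each word becomes a tuple of characters plus the </w> marker.
        vocab = {tuple(word) + ("</w>",): count for word, count in corpus.items()}

        merges = []
        for _ in range(num_merges):
            # Step 2: count adjacent-pair frequencies, weighted by word count.
            pairs = Counter()
            for symbols, count in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += count
            if not pairs:  # every word is a single token; nothing left to merge
                break

            # Step 3: highest count wins; ties go to the alphabetically first pair.
            best = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            merges.append(best)

            # Step 4: replace every adjacent occurrence of the pair in every word.
            new_vocab = Counter()
            for symbols, count in vocab.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_vocab[tuple(merged)] += count
            vocab = new_vocab

        # Step 5: the loop body runs num_merges times (or until exhausted).
        return merges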

Test Cases

  • Classic corpus, 5 merges
  • Small corpus, 3 merges
  • Single word — merges until exhausted (Premium)