TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

30. Byte-Pair Encoding (BPE)

Medium

BPE is the tokenization algorithm used by GPT-2, GPT-3, LLaMA, and Mistral. Starting from a character-level vocabulary, it iteratively merges the most frequent adjacent pair of tokens until the vocabulary reaches the desired size.

Signature: def bpe_train(corpus, num_merges)

  • corpus: dict mapping word → count, e.g. {"low": 5, "lower": 2, "newest": 6}
  • num_merges: number of BPE merge operations to perform
  • Returns: list of merge rules as tuples, e.g. [("e", "s"), ("es", "t"), ...]

Algorithm:

  1. Initialize vocabulary: split each word into characters + </w> end-of-word marker
  2. Count pair frequencies across all words (weighted by word count)
  3. Find the most frequent adjacent pair (break ties alphabetically)
  4. Merge that pair everywhere in the vocabulary
  5. Repeat until num_merges merges are done

Math

BPE merges the most frequent pair (a,b) at each step:freq(a,b)=word w∑​count(w)⋅occurrences of (a,b) in tokenized(w)Merge: replace all (a b)→ab in vocabulary

Asked at

NumPy

import numpy as np

 

def bpe_train(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?