Byte-Pair Encoding (BPE) — TorchedUp

Problems Premium

Byte-Pair Encoding (BPE)Medium

Problems Premium

Byte-Pair Encoding (BPE)

BPE is the tokenization algorithm used by GPT-2, GPT-3, LLaMA, and Mistral. Starting from a character-level vocabulary, it iteratively merges the most frequent adjacent pair of tokens until the vocabulary reaches the desired size.

Signature: def bpe_train(corpus, num_merges)

corpus: dict mapping word → count, e.g. {"low": 5, "lower": 2, "newest": 6}
num_merges: number of BPE merge operations to perform
Returns: list of merge rules as tuples, e.g. [("e", "s"), ("es", "t"), ...]

Algorithm:

Initialize vocabulary: split each word into characters + </w> end-of-word marker
Count pair frequencies across all words (weighted by word count)
Find the most frequent adjacent pair (break ties alphabetically)
Merge that pair everywhere in the vocabulary
Repeat until num_merges merges are done

Math

Asked at

Python (numpy)0/3 runs today

Test Results

○Classic corpus, 5 merges

○Small corpus, 3 merges

○Single word — merges until exhausted🔒 Premium

Advertisement