BPE (Byte Pair Encoding) is the subword tokenization algorithm used by GPT-2, GPT-3, LLaMA, and Mistral. Starting from a character-level vocabulary, it iteratively merges the most frequent adjacent pair of tokens until the vocabulary reaches the desired size.
Signature: def bpe_train(corpus, num_merges)
Parameters:
- corpus: dict mapping word → count, e.g. {"low": 5, "lower": 2, "newest": 6}
- num_merges: number of BPE merge operations to perform

Returns: the list of learned merges in the order they were applied, e.g. [("e", "s"), ("es", "t"), ...]

Algorithm:
1. Append the </w> end-of-word marker to each word and split it into characters.
2. Count every adjacent pair of symbols across the corpus, weighted by each word's count.
3. Merge the most frequent pair wherever it occurs and record it.
4. Repeat from step 2 until num_merges merges are done.
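A minimal sketch of one way to implement this signature in plain Python, following the steps above. The internal representation (words as tuples of symbols) is an implementation choice, not part of the problem statement, and ties between equally frequent pairs are broken arbitrarily by max(); production tokenizers pin down a deterministic tie-break.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # Step 1: split each word into characters plus the </w> end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is a single token; nothing left to merge
        # Step 3: pick the most frequent pair (ties broken arbitrarily) and record it.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Usage with the example corpus above:
# bpe_train({"low": 5, "lower": 2, "newest": 6}, num_merges=10)
```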