Byte Pair Encoding (BPE) is a data compression technique and also a subword tokenisation algorithm used primarily in natural language processing (NLP) and machine learning.
It works by iteratively merging the most frequent pair of consecutive bytes (or characters) into a single new symbol that does not yet appear in the data.
Algorithm
Let the initial vocabulary be the set of all individual characters, e.g. {A, B, C, D, …, a, b, c, d, …}
Repeat:
- Choose the two symbols that are most frequently adjacent in the training corpus (say ‘A’, ‘B’)
- Add a new merged symbol ‘AB’ to the vocabulary
- Replace every adjacent ‘A’ ‘B’ in the corpus with ‘AB’
until k merges have been done
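To make the loop concrete, here is a minimal Python sketch of the training procedure. It assumes the corpus has already been split into a list of words; the function name learn_bpe and the toy word list are illustrative, not taken from any particular library.

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn k BPE merges from a toy corpus given as a list of words."""
    # Each word starts out as a tuple of single-character symbols;
    # the Counter records how often each word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []

    for _ in range(k):
        # Count how often each pair of adjacent symbols occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Choose the two symbols that are most frequently adjacent ...
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # ... and replace every adjacent occurrence with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab

    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest", "newest", "newest"], k=5)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('n', 'e')]
```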
Pros
BPE has several advantages:
- Variable-Length Encoding: BPE allows for variable-length encoding, which means that tokens can represent different lengths of text, capturing both common sequences and rare words efficiently.
- Data Compression: By merging frequent pairs, BPE shortens the encoded text, so a corpus can be represented with far fewer tokens, which is especially useful in scenarios where memory or storage space is limited.
- Subword Representations: BPE generates subword units, enabling the model to handle out-of-vocabulary words and improving generalisation, as the sketch below illustrates.
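For example, the merges produced by the learn_bpe sketch above can segment a word that never occurred in the training corpus into known subwords; segment is again an illustrative helper, not a library function.

```python
def segment(word, merges):
    """Segment a word using merges learned by learn_bpe, applied in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# "slowest" never appears in the toy corpus, yet it still decomposes
# into subwords the model has seen:
print(segment("slowest", merges))  # e.g. ['s', 'low', 'est']
```

Applying the merges in the order they were learned mirrors how BPE tokenisers typically encode new text, so rare or unseen words fall back to shorter, known subwords rather than a single unknown-word token.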
BPE has been widely adopted in various NLP tasks and is used in popular models like GPT (Generative Pre-trained Transformer), while BERT (Bidirectional Encoder Representations from Transformers) uses the closely related WordPiece algorithm.