Byte Pair Encoding (BPE) is a data compression technique and also a subword tokenisation algorithm used primarily in natural language processing (NLP) and machine learning.
It works by iteratively merging the most frequent pair of consecutive bytes (or characters) into a single new symbol that does not yet appear in the data.
Algorithm
Let the initial vocabulary be the set of all individual characters, e.g. {A, B, C, D, …, a, b, c, d, …}
Repeat:
- Choose the two symbols that are most frequently adjacent in the training corpus (say ‘A’, ‘B’)
- Add a new merged symbol ‘AB’ to the vocabulary
- Replace every adjacent ‘A’ ‘B’ in the corpus with ‘AB’
until k merges have been done
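To make the loop concrete, here is a minimal Python sketch of the training procedure. It assumes the corpus has already been split into a list of words; the function name learn_bpe and the toy word list are illustrative, not taken from any particular library.

```python
from collections import Counter

def learn_bpe(words, k):
    """Learn k BPE merges from a toy corpus given as a list of words."""
    # Each word starts out as a tuple of single-character symbols;
    # the Counter records how often each word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []

    for _ in range(k):
        # Count how often each pair of adjacent symbols occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Choose the two symbols that are most frequently adjacent ...
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # ... and replace every adjacent occurrence with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab

    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest", "newest", "newest"], k=5)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('n', 'e')]
```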
Pros
BPE has several advantages:
- Variable-Length Encoding: BPE allows for variable-length encoding, which means that tokens can represent different lengths of text, capturing both common sequences and rare words efficiently.
- Data Compression: By merging frequent pairs, BPE shortens the encoded text, so a corpus can be represented with far fewer tokens, which is especially useful in scenarios where memory or storage space is limited.
- Subword Representations: BPE generates subword units, enabling the model to handle out-of-vocabulary words and improving generalisation, as the sketch below illustrates.
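For example, the merges produced by the learn_bpe sketch above can segment a word that never occurred in the training corpus into known subwords; segment is again an illustrative helper, not a library function.

```python
def segment(word, merges):
    """Segment a word using merges learned by learn_bpe, applied in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the pair into one subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# "slowest" never appears in the toy corpus, yet it still decomposes
# into subwords the model has seen:
print(segment("slowest", merges))  # e.g. ['s', 'low', 'est']
```

Applying the merges in the order they were learned mirrors how BPE tokenisers typically encode new text, so rare or unseen words fall back to shorter, known subwords rather than a single unknown-word token.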
BPE has been widely adopted in various NLP tasks and is used in popular models like GPT (Generative Pre-trained Transformer), while BERT (Bidirectional Encoder Representations from Transformers) uses the closely related WordPiece algorithm.