Byte Pair Encoding (BPE) was originally a data compression technique and is now widely used as a subword tokenisation algorithm in natural language processing (NLP) and machine learning.

It works by iteratively merging the most frequent pair of adjacent symbols (bytes or characters) into a new symbol that is added to the vocabulary.

Algorithm

Let the initial vocabulary be the set of all individual characters: {A, B, C, D, …, a, b, c, d, …}

Repeat until k merges have been performed:

  1. Choose the two symbols that are most frequently adjacent in the training corpus (say ‘A’, ‘B’)
  2. Add a new merged symbol ‘AB’ to the vocabulary
  3. Replace every adjacent ‘A’ ‘B’ in the corpus with ‘AB’ (a sketch of the full procedure follows)
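A minimal sketch of this training loop in Python, assuming a toy corpus given as a plain list of words (the name learn_bpe and the corpus are illustrative; production implementations, such as Sennrich et al.'s, operate on word-frequency tables with end-of-word markers):

    from collections import Counter

    def learn_bpe(corpus, k):
        """Learn up to k BPE merges from a list of words (toy corpus)."""
        # Represent each distinct word as a tuple of symbols (initially characters).
        words = Counter(tuple(word) for word in corpus)
        merges = []
        for _ in range(k):
            # Step 1: count every adjacent symbol pair across the corpus.
            pairs = Counter()
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break  # nothing left to merge
            best = max(pairs, key=pairs.get)  # most frequently adjacent pair
            # Step 2: record the merged symbol in the vocabulary (merge list).
            merges.append(best)
            merged = best[0] + best[1]
            # Step 3: replace every adjacent occurrence of the pair in the corpus.
            new_words = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return merges

    # Toy usage: 'es' and then 'est' are the most frequent pairs, so they merge first.
    print(learn_bpe(["lowest", "newest", "widest", "low", "new"], 3))
    # [('e', 's'), ('es', 't'), ('l', 'o')]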

Pros

BPE has several advantages:

  • Variable-Length Encoding: tokens can span different lengths of text, so frequent sequences are captured as single tokens while rare words fall back to shorter subwords, keeping the encoding efficient at both extremes.
  • Data Compression: by merging frequent pairs, BPE keeps the vocabulary far smaller than a word-level vocabulary while producing shorter sequences than character-level encoding, which is useful when memory or storage is limited.
  • Subword Representations: BPE generates subword units, enabling the model to handle out-of-vocabulary words and improving generalisation (see the sketch after this list).
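To make the last point concrete: a word never seen in training can still be encoded by replaying the learned merges, in the order they were learned, over its characters. A sketch, reusing the illustrative merge list from the learn_bpe example above:

    def apply_bpe(word, merges):
        """Segment a word by replaying learned merges in learned order."""
        symbols = list(word)
        for a, b in merges:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols

    # 'latest' never appeared in the toy corpus, yet its common suffix is covered:
    print(apply_bpe("latest", [("e", "s"), ("es", "t"), ("l", "o")]))
    # ['l', 'a', 't', 'est']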

BPE has been widely adopted in NLP and is used in popular models such as GPT (Generative Pre-trained Transformer); BERT (Bidirectional Encoder Representations from Transformers) uses the closely related WordPiece algorithm.

Example