
Tokenization - Byte Pair Encoding, WordPiece & SentencePiece
Tokenization
Tokenization is the preprocessing step of breaking raw text (sentences) into smaller units called tokens, such as words, characters, or subwords.
Word-Level Tokenization

Text is split by whitespace or punctuation.
Each unique word gets its own ‘ID’ in the vocabulary.

e.g., “The player is playing”, becomes:
[“The”, “player”, “is”, “playing”]
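
A minimal sketch of word-level tokenization in Python; the regular expression below is an illustrative choice that keeps punctuation as separate tokens:

```python
import re

def word_tokenize(text):
    # Each word becomes one token; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("The player is playing."))
# ['The', 'player', 'is', 'playing', '.']
```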

Issues

  • Out-Of-Vocabulary (OOV): If the model has seen only “player” during training and at runtime encounters “players” or “playful”, it maps them to an unknown token (e.g., [UNK]).
  • Vocabulary Explosion: If every full word gets its own ID, the vocabulary grows very large, since every inflection and spelling variant needs a separate entry.
  • No Shared Meaning: The model treats “running” and “run” as completely unrelated symbols, even though they share the same root.
Character-Level Tokenization

Text is split into individual letters or symbols.

e.g., “Apple”, becomes:
[“A”, “p”, “p”, “l”, “e”]
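
In Python this is simply the list of characters of the string:

```python
print(list("Apple"))
# ['A', 'p', 'p', 'l', 'e']
```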

Issues

  • Loss of Semantics: Individual characters like “a” or “p” carry no inherent meaning.
    • The model has to work much harder to learn that the sequence “A-p-p-l-e” represents a fruit.
  • Long Sequences: A single sentence becomes a massive string of tokens.
  • Computationally Expensive: Longer sequences require more memory and processing power to calculate relationships between tokens.

Sub-Word Tokenization

Sub-word tokenization uses a fixed-size vocabulary that can represent any word by breaking it into meaningful sub-units.

e.g., “unhappiness”, becomes:
[“un”, “happi”, “ness”]

Benefits

  • Small vocabulary size.
  • No OOV (any word can be decomposed into known sub-units, falling back to characters if needed).
  • Better generalization.

Comparison: Word-Level, Character-Level & Sub-Word Tokenization

| Feature             | Word-Level | Char-Level | Sub-Word               |
|---------------------+------------+------------+------------------------|
| Vocab Size           | Very Large | Small      | Medium (Fixed)         |
| OOV Problem           | Severe     | None       | None                   |
| Meaning (Per Token)   | High       | Low        | High (Morpheme-based)  |
| Sequence Length       | Small      | Very High  | Balanced               |
Byte Pair Encoding

Originally a data compression algorithm, Byte Pair Encoding (BPE) was adapted for NLP to iteratively merge the most frequent adjacent pairs of characters into new tokens.

BPE Algorithm

  1. Initialization: Prepare a corpus and define a target vocabulary size V. 
Break every word into individual characters (plus a special end-of-word symbol </w>).
  2. Frequency Count: Count all adjacent pairs of symbols in the corpus.
  3. Merge: Identify the pair (a, b) with the highest frequency and merge it into a single new symbol ab.
  4. Repeat: Update the corpus with the new symbol and repeat until the vocabulary reaches size V or no more merges are possible.
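
A minimal sketch of this training loop in Python; the word frequencies mirror the worked example that follows, and ties between equally frequent pairs are broken arbitrarily:

```python
from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Each word is split into characters plus the end-of-word symbol '</w>'.
corpus = {
    tuple("low") + ("</w>",): 5,
    tuple("lower") + ("</w>",): 2,
    tuple("newest") + ("</w>",): 6,
    tuple("widest") + ("</w>",): 3,
    tuple("highest") + ("</w>",): 2,
}

merges = []
for _ in range(5):  # in practice: merge until the vocabulary reaches size V
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```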

Example

--- Initial Word Corpus ---
Word                  Count
------------------  -------
l o w </w>                5
l o w e r </w>            2
n e w e s t </w>          6
w i d e s t </w>          3
h i g h e s t </w>        2

Iteration 0
| Token   |   Frequency |
|---------+-------------|
| e       |          19 |
| </w>    |          18 |
| w       |          16 |
| s       |          11 |
| t       |          11 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('e', 's')

Iteration 1
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| es      |          11 |
| t       |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('es', 't')

Iteration 2
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| est     |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('est', '</w>')

Iteration 3
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('l', 'o')

Iteration 4
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| lo      |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |
Action: Merge ('lo', 'w')

Iteration 5
| Token   |   Frequency |
|---------+-------------|
| est</w> |          11 |
| w       |           9 |
| e       |           8 |
| </w>    |           7 |
| low     |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |

OOV Handling Example
Original OOV: 'lowest'
BPE Segmented: ['low', 'est</w>']
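
A minimal sketch of this segmentation step: the learned merges (here hard-coded to match the training run above) are applied in order to the characters of the unseen word:

```python
def bpe_segment(word, merges):
    # Start from characters + end-of-word marker, then apply merges in order.
    symbols = list(word) + ["</w>"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(bpe_segment("lowest", merges))
# ['low', 'est</w>']
```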

OpenAI Tokenizer

Figures: example tokens and their corresponding token IDs (images/natural_language_processing/tokenization/tokens.png, images/natural_language_processing/tokenization/token_ids.png).

Source: https://platform.openai.com/tokenizer

WordPiece

WordPiece merges the pair that maximizes the likelihood of the training data.
It chooses the pair \((s_1, s_2)\) that maximizes:

\[\text{Score} = \frac{P(s_1, s_2)}{P(s_1)P(s_2)}\]
  • \(P(s_1, s_2)\): probability of the pair appearing together.
  • \(P(s_1)P(s_2)\): the product of the probabilities of each symbol appearing independently.
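
A minimal sketch of computing this score from raw counts (the toy corpus is an illustrative assumption; the normalizing totals only scale every score by the same constant, so counts suffice for ranking):

```python
from collections import Counter

def wordpiece_scores(corpus):
    # corpus: word (as a tuple of symbols) -> frequency
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # score(a, b) = count(a, b) / (count(a) * count(b))
    return {p: c / (symbol_counts[p[0]] * symbol_counts[p[1]])
            for p, c in pair_counts.items()}

corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
# ('g', 's') 0.05
```

Note how ('g', 's') beats much more frequent pairs such as ('u', 'g'): WordPiece favours merging symbols that almost always occur together over symbols that are merely frequent.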

e.g., Is “unfriendly” in the vocabulary? No.
Are “un” + “##friend” + “##ly” available? Yes.
Here “##” marks a continuation of the previous token, i.e., a piece that does not start a new word.

Note: WordPiece is used in BERT.
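
For reference, this behaviour can be inspected with a pretrained BERT tokenizer via the Hugging Face transformers library (the exact split of any given word depends on the learned vocabulary):

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with BERT; pieces prefixed with '##'
# continue the previous token instead of starting a new word.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("unfriendly"))
print(tok.tokenize("The player is playing"))
```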

SentencePiece

SentencePiece is not just another merge algorithm but a complete tokenization framework (with built-in support for subword regularization) that treats the input as a raw stream of characters, including spaces.

Key Innovations:

  • Space as a Symbol:
    • It treats whitespace as a normal symbol, replacing each space with the meta symbol “▁” (U+2581), which looks like an underscore.
    • This makes it “lossless”; you can reconstruct the exact original string from the tokens.
  • Algorithm Agnostic:
    • It can implement both BPE and Unigram Language Model (a probabilistic approach that removes tokens that least impact the overall likelihood of the corpus).
  • No Pre-tokenization:
    • Unlike BPE/WordPiece, which often require a preliminary step to split text by whitespace, SentencePiece works directly on raw Unicode strings.
    • This is vital for languages like Chinese or Japanese that do not use spaces between words.
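
A minimal sketch using the sentencepiece Python package; the corpus file, vocabulary size, and model type are illustrative assumptions:

```python
# pip install sentencepiece
import sentencepiece as spm

# Train directly on raw text -- no pre-tokenization or whitespace splitting.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=8000, model_type="unigram",  # "bpe" is also supported
)

sp = spm.SentencePieceProcessor()
sp.load("spm_demo.model")

pieces = sp.encode_as_pieces("The player is playing")
print(pieces)                    # pieces that followed a space start with '▁'
print(sp.decode_pieces(pieces))  # losslessly reconstructs the original string
```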