
Tokenization - Byte Pair Encoding, WordPiece & SentencePiece
Tokenization
Tokenization is the preprocessing step of breaking raw text (sentences) into smaller units called tokens, such as words, characters, or subwords.
Word-Level Tokenization

Text is split by whitespace or punctuation.
Each unique word gets its own ‘ID’ in the vocabulary.

e.g., “The player is playing”, becomes:
[“The”, “player”, “is”, “playing”]
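
A minimal sketch of word-level tokenization in Python; the regular expression below is an illustrative choice that keeps punctuation as separate tokens:

```python
import re

def word_tokenize(text):
    # Each word becomes one token; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("The player is playing."))
# ['The', 'player', 'is', 'playing', '.']
```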

Issues

  • Out-Of-Vocabulary (OOV): If the model has seen only “player” during training and at runtime encounters “players” or “playful”, it maps them to an unknown token (e.g., [UNK]).
  • Vocabulary Explosion: If every full word gets its own ID, the vocabulary grows very large, since every inflection and spelling variant needs a separate entry.
  • No Shared Meaning: The model treats “running” and “run” as completely unrelated symbols, even though they share the same root.
Character-Level Tokenization

Text is split into individual letters or symbols.

e.g., “Apple”, becomes:
[“A”, “p”, “p”, “l”, “e”]
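
In Python this is simply the list of characters of the string:

```python
print(list("Apple"))
# ['A', 'p', 'p', 'l', 'e']
```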

Issues

  • Loss of Semantics: Individual characters like “a” or “p” carry no inherent meaning.
    • The model has to work much harder to learn that the sequence “A-p-p-l-e” represents a fruit.
  • Long Sequences: A single sentence becomes a massive string of tokens.
  • Computationally Expensive: Longer sequences require more memory and processing power to calculate relationships between tokens.

Sub-Word Tokenization

Sub-word tokenization uses a fixed-size vocabulary that can represent any word by breaking it into meaningful sub-units.

e.g., “unhappiness”, becomes:
[“un”, “happi”, “ness”]

Benefits

  • Small vocabulary size.
  • No OOV (any word can be decomposed into known sub-units, falling back to characters if needed).
  • Better generalization.

Comparison: Word-Level, Character-Level & Sub-Word Tokenization

| Feature             | Word-Level | Char-Level | Sub-Word               |
|---------------------+------------+------------+------------------------|
| Vocab Size           | Very Large | Small      | Medium (Fixed)         |
| OOV Problem           | Severe     | None       | None                   |
| Meaning (Per Token)   | High       | Low        | High (Morpheme-based)  |
| Sequence Length       | Small      | Very High  | Balanced               |
Byte Pair Encoding

Originally a data compression algorithm, Byte Pair Encoding (BPE) was adapted for NLP to iteratively merge the most frequent adjacent pairs of characters into new tokens.

BPE Algorithm

  1. Initialization: Prepare a corpus and define a target vocabulary size V. 
Break every word into individual characters (plus a special end-of-word symbol </w>).
  2. Frequency Count: Count all adjacent pairs of symbols in the corpus.
  3. Merge: Identify the pair (a, b) with the highest frequency and merge it into a single new symbol ab.
  4. Repeat: Update the corpus with the new symbol and repeat until the vocabulary reaches size V or no more merges are possible.
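
A minimal sketch of this training loop in Python; the word frequencies mirror the worked example that follows, and ties between equally frequent pairs are broken arbitrarily:

```python
from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Each word is split into characters plus the end-of-word symbol '</w>'.
corpus = {
    tuple("low") + ("</w>",): 5,
    tuple("lower") + ("</w>",): 2,
    tuple("newest") + ("</w>",): 6,
    tuple("widest") + ("</w>",): 3,
    tuple("highest") + ("</w>",): 2,
}

merges = []
for _ in range(5):  # in practice: merge until the vocabulary reaches size V
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```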

Example

--- Initial Word Corpus ---
Word                  Count
------------------  -------
l o w </w>                5
l o w e r </w>            2
n e w e s t </w>          6
w i d e s t </w>          3
h i g h e s t </w>        2

Iteration 0
| Token   |   Frequency |
|---------+-------------|
| e       |          19 |
| </w>    |          18 |
| w       |          16 |
| s       |          11 |
| t       |          11 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('e', 's')

Iteration 1
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| es      |          11 |
| t       |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('es', 't')

Iteration 2
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| est     |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('est', '</w>')

Iteration 3
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('l', 'o')

Iteration 4
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| lo      |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |
Action: Merge ('lo', 'w')

Iteration 5
| Token   |   Frequency |
|---------+-------------|
| est</w> |          11 |
| w       |           9 |
| e       |           8 |
| </w>    |           7 |
| low     |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |

OOV Handling Example
Original OOV: 'lowest'
BPE Segmented: ['low', 'est</w>']
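
A minimal sketch of this segmentation step: the learned merges (here hard-coded to match the training run above) are applied in order to the characters of the unseen word:

```python
def bpe_segment(word, merges):
    # Start from characters + end-of-word marker, then apply merges in order.
    symbols = list(word) + ["</w>"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(bpe_segment("lowest", merges))
# ['low', 'est</w>']
```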

OpenAI Tokenizer

Figures: example tokens and their corresponding token IDs (images/natural_language_processing/tokenization/tokens.png, images/natural_language_processing/tokenization/token_ids.png).

Source: https://platform.openai.com/tokenizer

WordPiece

WordPiece merges the pair that maximizes the likelihood of the training data.
It chooses the pair \((s_1, s_2)\) that maximizes:

\[\text{Score} = \frac{P(s_1, s_2)}{P(s_1)P(s_2)}\]
  • \(P(s_1, s_2)\): probability of the pair appearing together.
  • \(P(s_1)P(s_2)\): the product of the probabilities of each symbol appearing independently.
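
A minimal sketch of computing this score from raw counts (the toy corpus is an illustrative assumption; the normalizing totals only scale every score by the same constant, so counts suffice for ranking):

```python
from collections import Counter

def wordpiece_scores(corpus):
    # corpus: word (as a tuple of symbols) -> frequency
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # score(a, b) = count(a, b) / (count(a) * count(b))
    return {p: c / (symbol_counts[p[0]] * symbol_counts[p[1]])
            for p, c in pair_counts.items()}

corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
# ('g', 's') 0.05
```

Note how ('g', 's') beats much more frequent pairs such as ('u', 'g'): WordPiece favours merging symbols that almost always occur together over symbols that are merely frequent.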

e.g., Is “unfriendly” in the vocabulary? No.
Are “un” + “##friend” + “##ly” available? Yes.
Here “##” marks a continuation of the previous token, i.e., a piece that does not start a new word.

Note: WordPiece is used in BERT.
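
For reference, this behaviour can be inspected with a pretrained BERT tokenizer via the Hugging Face transformers library (the exact split of any given word depends on the learned vocabulary):

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with BERT; pieces prefixed with '##'
# continue the previous token instead of starting a new word.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("unfriendly"))
print(tok.tokenize("The player is playing"))
```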

SentencePiece

SentencePiece is not just another merge algorithm but a complete tokenization framework (with built-in support for subword regularization) that treats the input as a raw stream of characters, including spaces.

Key Innovations:

  • Space as a Symbol:
    • It treats whitespace as a normal symbol, replacing each space with the meta symbol “▁” (U+2581), which looks like an underscore.
    • This makes it “lossless”; you can reconstruct the exact original string from the tokens.
  • Algorithm Agnostic:
    • It can implement both BPE and Unigram Language Model (a probabilistic approach that removes tokens that least impact the overall likelihood of the corpus).
  • No Pre-tokenization:
    • Unlike BPE/WordPiece, which often require a preliminary step to split text by whitespace, SentencePiece works directly on raw Unicode strings.
    • This is vital for languages like Chinese or Japanese that do not use spaces between words.
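
A minimal sketch using the sentencepiece Python package; the corpus file, vocabulary size, and model type are illustrative assumptions:

```python
# pip install sentencepiece
import sentencepiece as spm

# Train directly on raw text -- no pre-tokenization or whitespace splitting.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=8000, model_type="unigram",  # "bpe" is also supported
)

sp = spm.SentencePieceProcessor()
sp.load("spm_demo.model")

pieces = sp.encode_as_pieces("The player is playing")
print(pieces)                    # pieces that followed a space start with '▁'
print(sp.decode_pieces(pieces))  # losslessly reconstructs the original string
```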