Introduction
Tokens are the basic building blocks of large language models like GPT-3. When we provide text input to these models, they break down the text into smaller chunks called tokens. The model then uses these tokens to understand the meaning of the text and generate relevant outputs. In this comprehensive blog post, we will dive deep into tokens – how they work, different types of tokens, their key benefits, and more.
What are Tokens?
Tokens are the smallest units that hold meaning in a piece of text for large language models. They can be words, subwords, characters, or even whole sentences. Tokens allow the model to ingest variable-length text input and convert it into standardized chunks that are easy to process.
For example, let’s take the sentence – “The quick brown fox jumps over the lazy dog”. This sentence contains 9 words. However, a language model like GPT-3 will break it down into smaller tokens. There are different tokenization strategies, which we will cover in detail later. With subword tokenization, this sentence might be converted into the following tokens:
[“The”, “quick”, “brown”, “fox”, “jump”, “s”, “over”, “the”, “lazy”, “dog”]
So the sentence is split into 10 tokens instead of 9 words. The token “jumps” is further broken down into “jump” and “s”.
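If you want to see how a real tokenizer handles this sentence, here is a minimal sketch using the open-source tiktoken library (our own choice of library for illustration; any BPE tokenizer would do). Note that the actual splits produced by GPT-3’s tokenizer may differ from the illustrative list above.

```python
# Minimal sketch: tokenize the example sentence with a GPT-2/GPT-3 style
# byte-level BPE tokenizer. Requires `pip install tiktoken`; the exact splits
# and counts depend on the encoding used.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "The quick brown fox jumps over the lazy dog"

token_ids = enc.encode(text)                        # integer token ids
tokens = [enc.decode([tid]) for tid in token_ids]   # text piece for each id

print(len(token_ids))   # number of tokens
print(tokens)           # the pieces the sentence was split into
```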
Why Tokens?
The reason large language models use tokens is that they allow the model to build a vocabulary that can handle any text input, no matter how rare or uncommon the words are.
Some key benefits of using tokens include:
- Standardization – Tokens convert variable-length text into standardized chunks that models can process uniformly.
- Open vocabulary with a fixed size – Models can only store a limited number of entries in their vocabulary. Subword tokens let them represent any input, including words never seen whole, while keeping the vocabulary size fixed.
- Understanding morphology – Subword tokens expose word structure, such as jumps = jump + s, which helps models relate inflected forms of the same word.
- Memory efficiency – The model stores one embedding per vocabulary entry, so a compact subword vocabulary keeps the embedding table far smaller than one entry per possible word would require.
Overall, tokens provide a solid foundation for language models to understand and generate human-like text.
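To make the open-vocabulary benefit concrete, here is a small sketch using a Hugging Face tokenizer (the library, model name, and example word are our own illustrative choices, and the exact pieces you get may differ):

```python
# Sketch: a common word stays whole, while a rare or invented word is broken
# into known subword pieces instead of becoming an unknown token.
# Requires `pip install transformers`.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("jumps"))         # likely a single token
print(tokenizer.tokenize("hyperjumping"))  # likely several '##'-prefixed pieces
```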
Types of Tokens
There are a few different types of tokenization strategies used by modern language models:
1. Word Tokens
This is the most basic tokenization where we break text down into distinct words. In our example sentence, the word tokens would be:
[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”]
The main limitation is that rare or unseen words fall outside the vocabulary and end up mapped to a generic “unknown” token, so the model loses their meaning entirely.
2. Character Tokens
Here we break text down character-by-character. For example:
[“T”, “h”, “e”, “ ”, “q”, “u”, “i”, “c”, “k”, “ ”, “b”, “r”, “o”, “w”, “n”, “ ”, “f”, “o”, “x”, “ ”, “j”, “u”, “m”, “p”, “s”, …]
Character tokens allow models to handle any input at all, but sequences become much longer and each individual token carries very little meaning on its own.
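For a quick comparison of these two granularities, a few lines of plain Python (a toy illustration, not a production tokenizer) show how the same sentence splits:

```python
# Toy comparison of word-level vs character-level splitting.
# Real tokenizers handle punctuation, casing, and whitespace more carefully.
text = "The quick brown fox jumps over the lazy dog"

word_tokens = text.split()   # naive whitespace split
char_tokens = list(text)     # every character, spaces included

print(len(word_tokens), word_tokens)   # 9 word tokens
print(len(char_tokens))                # 43 character tokens
```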
3. Subword Tokens
This is the most commonly used tokenization. Here text is broken into subwords and common character combinations. For our example, the subwords would be:
[“The”, “quick”, “brown”, “fox”, “jump”, “s”, “over”, “the”, “lazy”, “dog”]
Subwords balance vocabulary size against sequence length while retaining most word-level meaning, which is why they work best for language models.
4. SentencePiece Tokens
SentencePiece is another subword tokenizer, but it works directly on raw text (treating whitespace as just another symbol) and learns its vocabulary from the statistical frequency of character sequences. Common chunks of text get their own tokens, while rarer words are broken down further.
5. Byte-Pair Encoding (BPE) Tokens
BPE starts with character-level tokens and iteratively merges the most frequent adjacent pair into a new token, repeating until the vocabulary reaches a target size. It gives a good balance between vocabulary size and retained meaning.
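To make the merge idea concrete, here is a toy sketch of the BPE training loop on a tiny made-up corpus (the corpus, the number of merges, and the helper names are all illustrative; real implementations usually work on bytes, use far larger data, and store the learned merges for later use):

```python
# Toy BPE training loop: repeatedly find the most frequent adjacent pair of
# symbols in the corpus and merge it into a single new symbol.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word so the chosen pair becomes one merged symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

# Each word is written as space-separated symbols, with its corpus frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):                              # learn 5 merges
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"merge {step + 1}: {best}")

print(corpus)   # words are now built from merged subword symbols
```

On this toy corpus the first merges are pairs like e+s and then es+t, which is exactly how frequent word endings end up as single tokens.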
So in summary, subword schemes such as BPE and SentencePiece are the most common today, as they balance these tradeoffs effectively for language models.
How Tokenization Works
Now let’s walk through how the process of tokenization actually works (a short end-to-end sketch follows this list):
- The language model starts by ingesting raw text input from the user.
- This input is then passed to the tokenization module.
- The tokenization module applies rules and algorithms like subword tokenization to break down the text.
- This produces a sequence of smaller tokens.
This token sequence is mapped to integer ids, and the model’s embedding layer converts each id into a numeric vector, also called an embedding.
- The encoded tokens are then used by the model for processing and generation of outputs.
So in a nutshell, tokenization acts as a preprocessing step to prepare raw text for easier consumption by the downstream model.
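Here is a minimal end-to-end sketch of that pipeline using a Hugging Face tokenizer and a PyTorch embedding table (the library choices, the model name, and the 768-dimensional embedding size are our assumptions for illustration, not the exact GPT-3 setup):

```python
# Sketch of the pipeline: raw text -> token ids -> embeddings.
# Requires `pip install torch transformers`.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The raw text goes through the tokenization module and comes out as token ids.
text = "The quick brown fox jumps over the lazy dog"
token_ids = tokenizer.encode(text)

# Each token id is looked up in an embedding table to get a numeric vector.
embedding_table = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                                     embedding_dim=768)
embeddings = embedding_table(torch.tensor(token_ids))

# These vectors are what the transformer layers actually consume.
print(len(token_ids))      # number of tokens
print(embeddings.shape)    # (number of tokens, 768)
```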
Tokenization in Popular Models
Different language models employ slightly different tokenization strategies:
- GPT-3 – Uses byte-level BPE (shared with GPT-2) with a vocabulary of roughly 50,000 tokens
- BERT – Uses WordPiece tokenization with a vocabulary of roughly 30,000 tokens
- T5 – Uses a SentencePiece model with a 32,000-token vocabulary
- PaLM – Uses a SentencePiece vocabulary of 256,000 tokens
As we can see, subword tokenizers based on BPE, WordPiece, and SentencePiece dominate today’s large models, with vocabularies typically ranging from roughly 30,000 tokens up to a few hundred thousand. You can check these numbers yourself, as sketched below.
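A rough way to verify the vocabulary sizes (the library names and model identifiers are our assumptions; the exact numbers depend on the specific checkpoints you load):

```python
# Print vocabulary sizes of a few public tokenizers.
# Requires `pip install tiktoken transformers`.
import tiktoken
from transformers import AutoTokenizer

print(tiktoken.get_encoding("gpt2").n_vocab)                          # ~50k, GPT-2/GPT-3 style BPE
print(AutoTokenizer.from_pretrained("bert-base-uncased").vocab_size)  # ~30k, WordPiece
print(AutoTokenizer.from_pretrained("t5-small").vocab_size)           # ~32k, SentencePiece
```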
Impact on Model Performance
The choice of tokenization approach has an impact on the model’s capabilities and performance on different tasks:
- Vocabulary coverage – Subword tokens allow models to cover very large vocabularies so they can understand rare words.
- Context understanding – Longer tokens like words retain context better compared to individual characters.
- Morphological competence – Subword tokens help understand word forms better.
- Memory usage – The size of the embedding table grows with the vocabulary, so a compact subword vocabulary needs far less memory than one entry for every possible word.
- Learning speed – A smaller vocabulary also means a smaller embedding table and output layer, which makes training faster than with word-level vocabularies.
So subword tokenization offers the best of both worlds – broad vocabulary coverage and efficient learning. That’s why it is preferred in most state-of-the-art models today.
Emerging Trends
Some interesting developments in tokenization include:
- Adaptive tokenization – The tokenization can be tuned dynamically based on the current text instead of using fixed rules.
- Multilingual tokenization – Having a shared vocabulary for multiple languages through joint subword learning.
- Contextual tokenization – Creating tokens that incorporate surrounding context using contextualized representations.
- Learned tokenization – End-to-end learning of optimal tokenization for a dataset instead of predefined rules.
Conclusion
To summarize, tokens are the basic building blocks large language models use to ingest text input. Breaking text into smaller tokens lets models normalize variable-length text, keep the vocabulary at a fixed, manageable size, and stay memory efficient. Subword tokenization offers the best of both vocabulary coverage and retention of meaning. Tokenization remains an active area of research, with innovations in adaptive, contextual, and learned tokenization. As language models continue to grow in size and capability, tokenization will remain a crucial step in feeding them text.