Natural Language Processing

Tokenization

Breaking text into smaller units (tokens) such as words, subwords, or characters.

In-depth explanation

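Tokenization is typically the first step in an NLP pipeline: it converts raw text into a sequence of tokens that a model can process. Word tokenization splits text on whitespace and, usually, punctuation. Subword tokenization (for example BPE or WordPiece) breaks rare or unseen words into smaller known pieces, so it handles out-of-vocabulary words better. Each token is then typically mapped to a numerical ID from a fixed vocabulary. Tokenization choices affect vocabulary size, sequence length, and how well a model copes with rare words, and therefore its overall performance.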
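The word-level path described above (split, then map each token to an ID) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library does it: the function names `word_tokenize`, `build_vocab`, and `encode`, the regular expression, and the `<unk>` placeholder for unknown tokens are all assumptions made for this example.

```python
import re

def word_tokenize(text):
    # Match runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Assign IDs in order of first appearance; ID 0 is reserved for unknown tokens.
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Tokens missing from the vocabulary fall back to the <unk> ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = word_tokenize("Hello world, hello tokens!")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(tokens)  # ['Hello', 'world', ',', 'hello', 'tokens', '!']
print(ids)     # [1, 2, 3, 4, 5, 6]
```

Note that "Hello" and "hello" receive different IDs here because the sketch does not lowercase its input, and that any word not seen when the vocabulary was built collapses to `<unk>`. That second limitation is what subword tokenization is meant to reduce.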

Examples

Word: "Hello world" → ["Hello", "world"]
Subword: "unhappiness" → ["un", "happiness"] (the exact split depends on the tokenizer's learned vocabulary)
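One way to produce a subword split like this is greedy longest-match lookup against a fixed vocabulary, which is roughly how WordPiece-style tokenizers segment a word at inference time. The sketch below is a simplified illustration under stated assumptions: `subword_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is hand-written rather than learned, and the "##" continuation prefix used by real WordPiece vocabularies is omitted. BPE reaches its splits differently, by applying learned merge rules.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    # Greedy longest-match: at each position, take the longest vocabulary
    # entry that matches, then continue from where it ends.
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hand-written toy vocabulary (an assumption; real vocabularies are learned from data).
toy_vocab = {"un", "happiness", "happy", "ness", "token", "ization"}

print(subword_tokenize("unhappiness", toy_vocab))   # ['un', 'happiness']
print(subword_tokenize("tokenization", toy_vocab))  # ['token', 'ization']
print(subword_tokenize("xyz", toy_vocab))           # ['<unk>']
```

With a different vocabulary the same word would split differently, which is why the outputs of two tokenizers are generally not interchangeable.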
