Tokenization

In-depth explanation

Tokenization is the first step in NLP pipelines, converting raw text into a sequence of tokens that models can process. Word tokenization splits on spaces; subword tokenization (like BPE, WordPiece) handles unknown words better. Each token is then typically mapped to a numerical ID. Tokenization choices significantly impact model performance.

Examples

"Hello world" → ["Hello", "world"]

Subword: "unhappiness" → ["un", "happiness"]

Related terms

Token

More in Natural Language Processing

BERT

Bidirectional Encoder Representations from Transformers, a pre-trained language model for NLP tasks.

Named Entity Recognition (NER)

Identifying and classifying named entities in text into categories like person, organization, location.

Natural Language Processing (NLP)

The field of AI focused on enabling computers to understand, interpret, and generate human language.

Sentiment Analysis

Determining the emotional tone or opinion expressed in text, typically positive, negative, or neutral.

Word Embedding

Dense vector representations of words that capture semantic meaning and relationships.