
Vision Transformer

A Vision Transformer (ViT) is a model architecture that applies the transformer, originally developed for natural language processing, to image classification by splitting an image into patches and treating those patches as a sequence of tokens.

In-depth explanation

The Vision Transformer (ViT) applies the transformer architecture, originally designed for natural language processing, to image data. Introduced by Dosovitskiy et al. in the 2020 paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,' ViT treats an image as a sequence of patches, much as NLP models treat text as a sequence of tokens. Historically, convolutional neural networks (CNNs) dominated image processing because they efficiently capture spatial hierarchies and local patterns. The transformer takes a different approach: its self-attention mechanism models relationships between all parts of the input sequence (here, image patches) regardless of how far apart they are.

In a ViT, the input image is divided into non-overlapping patches, and each patch is flattened and linearly projected into an embedding vector, giving a sequence of patch embeddings. Positional embeddings are added to these vectors to tell the model where each patch sat in the original image. Self-attention then lets the model weigh every patch against every other patch, so it can capture global dependencies without the locality bias inherent in CNNs.

ViTs have performed strongly on image classification benchmarks, often surpassing the accuracy of CNNs when trained on large datasets. A key advantage is scalability: because global attention builds in few assumptions about image structure, ViTs benefit significantly from large-scale data and can model relationships across the whole image that local filters capture only indirectly. The main drawbacks follow from the same design: ViTs need extensive computational resources and large datasets to train effectively.

These trade-offs have prompted research into hybrid models that combine CNNs and transformers to leverage the strengths of both architectures. The impact of ViTs extends beyond image classification to tasks such as object detection, segmentation, and video analysis, and this adaptability underscores their importance in advancing computer vision.
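The patch-to-token pipeline described above can be illustrated with a small NumPy sketch. It is not the reference implementation: the weights are random rather than learned, the embedding width (64) and the helper names `patchify` and `self_attention` are assumptions chosen for illustration, and it uses a single attention head with no class token, layer normalization, or MLP block. It only shows how a 224x224 RGB image becomes 196 tokens of 16x16 patches and how one attention step relates every patch to every other.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # patch-to-patch similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # stand-in for a real image
patch_size, dim = 16, 64           # dim is an illustrative choice

# 1. Patches: (224 / 16)^2 = 196 patches, each 16 * 16 * 3 = 768 values.
patches = patchify(image, patch_size)

# 2. Linear embedding plus positional embeddings (random here, learned in practice).
w_embed = rng.normal(0, 0.02, (patches.shape[1], dim))
pos_embed = rng.normal(0, 0.02, (patches.shape[0], dim))
tokens = patches @ w_embed + pos_embed

# 3. One self-attention step: every patch attends to every other patch.
w_q, w_k, w_v = (rng.normal(0, 0.02, (dim, dim)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, tokens.shape, out.shape, weights.shape)
```

The attention weight matrix has one row and one column per patch (196 x 196 here), so its size grows with the square of the number of patches. That is one reason the computational cost rises quickly with image resolution or smaller patch sizes.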

Examples

ImageNet Classification: Vision Transformers pre-trained on large datasets have matched or exceeded strong CNN baselines on the ImageNet benchmark, demonstrating their effectiveness in large-scale image classification.
Fine-tuning for Medical Imaging: ViTs have been fine-tuned for specific tasks like detecting anomalies in medical images, showcasing their adaptability to various domains.
Hybrid Models: Researchers have developed hybrid models that integrate ViTs with CNNs to improve efficiency and performance, especially on smaller datasets.
