Photonics Dictionary

vision-transformer networks

Vision transformer (ViT) networks are neural network architectures that apply the Transformer architecture, originally developed for natural language processing, to visual data. Unlike traditional convolutional neural networks (CNNs), which have long dominated computer vision, ViTs rely on self-attention mechanisms and multi-layer perceptrons (MLPs) to process image patches directly.

Key features of vision transformer networks include:

Patch embedding: Input images are divided into fixed-size patches (e.g., 16 × 16 pixels), each of which is flattened and linearly projected into an embedding vector; learnable position embeddings are added so that spatial layout is not lost.
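As a concrete illustration, here is a minimal patch-embedding sketch in PyTorch (not part of the original entry; the sizes and names are illustrative, matching the common ViT-Base defaults). It uses the standard trick that a convolution whose kernel size equals its stride is equivalent to flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each one.
    A Conv2d with kernel_size == stride == patch_size is equivalent
    to flattening each patch and applying a shared linear layer."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768): one token per patch

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```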

Transformer encoder: The embedded patches are then processed by a stack of Transformer encoder layers. Each layer pairs multi-head self-attention, which captures relationships between patches, with an MLP that applies a non-linear transformation to each token.
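A hedged sketch of a single pre-norm encoder layer of this kind follows; the dimensions correspond to the ViT-Base configuration but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One pre-norm Transformer encoder layer: multi-head self-attention
    followed by an MLP, each wrapped in a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                  # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention + residual
        x = x + self.mlp(self.norm2(x))                    # MLP + residual
        return x

x = torch.randn(1, 197, 768)    # 196 patch tokens + 1 class token
print(EncoderLayer()(x).shape)  # torch.Size([1, 197, 768])
```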

Global context: Because every patch can attend to every other patch, ViTs capture global context from the first layer onward, which helps model relationships between distant image regions.
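This is visible directly in the attention weights: for a sequence of N tokens, self-attention produces an N × N weight matrix, so no patch is out of reach of any other. A small illustrative check in PyTorch (the shapes are the only point here):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
x = torch.randn(1, 197, 768)  # 196 patch tokens + 1 class token
_, weights = attn(x, x, x, need_weights=True, average_attn_weights=True)
print(weights.shape)  # torch.Size([1, 197, 197]): every token attends to every token
```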

Classification head: For image classification, a learnable class token is typically prepended to the patch sequence, and its final representation is passed through a small head (often a single linear layer) on top of the Transformer encoder to produce the prediction.
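Putting the pieces together, here is a minimal end-to-end sketch under the same illustrative assumptions, with PyTorch's built-in TransformerEncoder standing in for the hand-rolled layer above:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding, class token,
    position embeddings, Transformer encoder, and a linear head."""
    def __init__(self, img_size=224, patch_size=16, dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, patch_size, patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, int(dim * 4),
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                   # (B, 3, 224, 224)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # prepend class token
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))                # predict from class token

logits = MiniViT(depth=2)(torch.randn(1, 3, 224, 224))  # shallow depth for speed
print(logits.shape)  # torch.Size([1, 1000])
```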

Vision transformer networks have matched or surpassed CNNs on standard image classification benchmarks, particularly when pre-trained on large datasets, establishing them as a strong alternative to traditional CNNs for computer vision applications.