Vision Transformer (ViT) Paper Note
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer)
Published Year: 2021
Paper URL
What
- CNNs dominated image classification tasks until recently. This paper shows that this reliance on CNNs is not necessary: a transformer applied directly to a sequence of image patches can perform very well on image classification.
- Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs when pre-trained on large datasets, while requiring substantially less computational cost to train.
Why
- Transformers are doing very well on NLP tasks. What if we apply transformers to computer vision?
How
- Transformer Architecture
 - Algorithm
- Patch embedding. Reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches.
- $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is a 1D learnable position embedding which is added to the patch embeddings.
- Multi-Head Self-Attention (MSA)
 - For each input $z \in \mathbb{R}^{N \times D}$, each head computes queries, keys, and values: $[q, k, v] = z\,U_{qkv}$, where $U_{qkv} \in \mathbb{R}^{D \times 3D_h}$.
 - Attention weights and output: $A = \mathrm{softmax}(qk^{\top}/\sqrt{D_h})$ and $\mathrm{SA}(z) = Av$; the $k$ head outputs are concatenated and projected: $\mathrm{MSA}(z) = [\mathrm{SA}_1(z); \ldots; \mathrm{SA}_k(z)]\,U_{msa}$.
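The patch-embedding step above can be sketched in numpy. This is a minimal illustration with toy sizes I chose for the example (H = W = 32, P = 8, D = 64, random weights), not the paper's configuration, and it omits the [class] token:

```python
import numpy as np

# Toy sizes chosen for illustration (not the paper's configuration).
rng = np.random.default_rng(0)
H = W = 32; C = 3; P = 8; D = 64

x = rng.standard_normal((H, W, C))           # input image
N = (H // P) * (W // P)                      # N = HW / P^2 = 16 patches

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = (x.reshape(H // P, P, W // P, P, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(N, P * P * C))         # (N, P^2 * C)

E = rng.standard_normal((P * P * C, D)) * 0.02   # linear patch projection
E_pos = rng.standard_normal((N, D)) * 0.02       # 1D learnable position embedding
z0 = patches @ E + E_pos                         # (N, D) token sequence
print(z0.shape)  # (16, 64)
```

The reshape/transpose pair is one common way to extract non-overlapping patches; an equivalent alternative is a strided convolution with kernel size and stride P.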
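The MSA equations can likewise be sketched for a single layer. Again a toy-sized sketch with random weights (N = 16 tokens, D = 64, k = 4 heads, so $D_h = 16$), not a full ViT block (no LayerNorm, residual, or MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, k = 16, 64, 4          # tokens, embedding dim, number of heads
Dh = D // k                  # per-head dimension
z = rng.standard_normal((N, D))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

U_qkv = rng.standard_normal((k, D, 3 * Dh)) * 0.02   # per-head qkv projection
U_msa = rng.standard_normal((k * Dh, D)) * 0.02      # output projection
heads = []
for h in range(k):
    q, key, v = np.split(z @ U_qkv[h], 3, axis=-1)   # each (N, Dh)
    A = softmax(q @ key.T / np.sqrt(Dh))             # (N, N), rows sum to 1
    heads.append(A @ v)                              # SA(z) = A v
out = np.concatenate(heads, axis=-1) @ U_msa         # MSA(z), shape (N, D)
print(out.shape)  # (16, 64)
```

Each attention matrix A is global: every token attends to every other token, which is the contrast with a CNN's local receptive field discussed below.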
And?
- Inductive Bias
- Compared to CNNs, ViT has less image-specific inductive bias.
- In CNNs, locality, 2D neighborhood structure, and translation equivariance are baked into each layer throughout the whole model.
- In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global.
- This means ViT has far fewer built-in assumptions about image structure than a CNN; it must learn these spatial relations from data, which is why large-scale pre-training matters.
- About CNNs
- CNNs are very good at detecting local features, but they do not explicitly consider the relative positioning of those features (e.g., the parts of a face).
 - The main difference between ViT and CNNs is the receptive field: self-attention is global from the first layer, while a CNN's receptive field grows gradually with depth.

Comments