Vision Transformer(ViT) Paper Note

Kai-Jie Lin

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer)

Published Year: 2021
Paper URL

What

  • CNNs have dominated image classification tasks. This paper shows that this reliance on CNNs is not necessary: a pure transformer applied to sequences of image patches can perform very well on image classification.
  • Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs when pre-trained on large datasets, while requiring substantially less computational cost to train.

Why

  • Transformers are doing very well on NLP tasks. What happens if we apply a transformer to computer vision?

How

  • Transformer Architecture
    ![](https://i.imgur.com/1PNfjqR.png =600x)
  • Algorithm
    • ![](https://i.imgur.com/dMU1tvL.png =600x)
    • Patch embedding. Reshape the image $x\in\mathbb{R}^{H \times W \times C}$ into 2D patches $x_{p}\in\mathbb{R}^{N\times(P^2\cdot C)}$, where $(H,W)$ is the resolution of the original image, $C$ is the number of channels, $(P,P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches. The input sequence is $[x_{class};\, x_{p}^{1}E;\, x_{p}^{2}E;\, \dots;\, x_{p}^{N}E]$.
    • $E_{pos}$ is a 1D learnable position embedding which is added to the patch embeddings.
    • Multihead Self-Attention(MSA)
      • For each input $z\in \mathbb{R}^{N \times D}$
        • $[q,k,v] = zU_{qkv},\quad U_{qkv}\in\mathbb{R}^{D\times 3D_{h}}$
        • $A = \mathrm{softmax}(qk^\top / \sqrt{D_{h}}),\quad A\in\mathbb{R}^{N\times N}$
        • $SA(z) = Av$
        • $MSA(z) = [SA_{1}(z);SA_{2}(z);\dots;SA_{k}(z)]\,U_{msa},\quad U_{msa}\in\mathbb{R}^{k \cdot D_{h}\times D}$
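The patch-embedding step above can be sketched in NumPy. This is an illustrative toy, not the paper's code: the projection $E$, the class token, and $E_{pos}$ are random or zero stand-ins for learned parameters.

```python
import numpy as np

def patchify(x, P):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
x = rng.standard_normal((224, 224, 3))            # input image
P, D = 16, 768                                    # patch size, embedding dim
patches = patchify(x, P)                          # N = 224*224 / 16^2 = 196
E = rng.standard_normal((P * P * 3, D)) * 0.02    # patch projection (learned in practice)
x_class = np.zeros((1, D))                        # [class] token (learned in practice)
E_pos = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02  # 1D position embedding
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
print(patches.shape, z0.shape)                    # (196, 768) (197, 768)
```

Note how the 16x16 patch size gives exactly 196 tokens for a 224x224 image, plus one class token.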
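The MSA equations above can likewise be sketched with NumPy; here $U_{qkv}$ and $U_{msa}$ are random stand-ins for learned weights, and a single unbatched sequence is assumed.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, U_qkv_heads, U_msa):
    """z: (N, D); U_qkv_heads: k matrices of shape (D, 3*Dh); U_msa: (k*Dh, D)."""
    heads = []
    for U_qkv in U_qkv_heads:
        q, k, v = np.split(z @ U_qkv, 3, axis=-1)      # each (N, Dh)
        A = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, N) attention weights
        heads.append(A @ v)                            # SA(z) = Av
    return np.concatenate(heads, axis=-1) @ U_msa      # concat heads, project to (N, D)

rng = np.random.default_rng(0)
N, D, k = 197, 768, 12            # ViT-Base-like sizes: 197 tokens, 12 heads
Dh = D // k
z = rng.standard_normal((N, D))
U_qkv_heads = [rng.standard_normal((D, 3 * Dh)) * 0.02 for _ in range(k)]
U_msa = rng.standard_normal((k * Dh, D)) * 0.02
out = msa(z, U_qkv_heads, U_msa)
print(out.shape)                  # (197, 768)
```

Each head attends over all 197 tokens at once, which is the global-receptive-field property contrasted with CNN locality in the next section.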

And?

  • Inductive Bias
    • Compared to CNNs, ViT has less image-specific inductive bias.
    • In CNNs, locality, 2D neighborhood structure, and translation equivariance are baked into each layer throughout the whole model.
    • In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global.
    • This means ViT must learn these spatial relations from data rather than having them built in, which gives it more flexibility but makes large-scale pre-training important.