An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer)
Published Year: 2021 · Paper URL
What
- CNNs have dominated image classification tasks until recently. This paper shows that this reliance on CNNs is not necessary: a pure transformer applied to a sequence of image patches can perform very well on image classification.
- Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs when pre-trained on large datasets, while requiring substantially less computational cost to train.
Why
- Transformers are doing very well on NLP tasks. What if we apply a transformer to computer vision?
How
- Transformer Architecture
- Algorithm
- Patch embedding. Reshape the image \(x\in\mathbb{R}^{H \times W \times C}\) into 2D patches \(x_{p}\in\mathbb{R}^{N\times(P^2\cdot C)}\), where \((H,W)\) is the resolution of the original image, \(C\) is the number of channels, \((P,P)\) is the resolution of each image patch, and \(N = HW/P^2\) is the resulting number of patches. Each patch is projected with a learnable matrix \(E\in\mathbb{R}^{(P^2\cdot C)\times D}\) and a learnable \([class]\) token is prepended: \([x_{class};x_{p}^{1}E;x_{p}^{2}E;...;x_{p}^{N}E]\) (see the first sketch after this list).
- \(E_{pos}\in\mathbb{R}^{(N+1)\times D}\) is a 1D learnable position embedding which is added to the patch embeddings, giving the transformer input \(z_{0} = [x_{class};x_{p}^{1}E;...;x_{p}^{N}E] + E_{pos}\).
- Multihead Self-Attention (MSA) (see the second sketch after this list)
- For each input \(z\in \mathbb{R}^{N \times D}\)
- \([q,k,v] = zU_{qkv},\quad U_{qkv}\in\mathbb{R}^{D\times 3D_{h}}\)
- \(A = softmax(qk^\top / \sqrt{D_{h}}),\quad A\in\mathbb{R}^{N\times N}\)
- \(SA(z) = Av.\)
- \(MSA(z) = [SA_{1}(z);SA_{2}(z);...;SA_{k}(z)]U_{msa},\quad U_{msa}\in\mathbb{R}^{k \cdot D_{h}\times D}\)
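- Below is a minimal NumPy sketch of the patch-embedding step above. The image size, patch size, and embedding width are illustrative assumptions, and random arrays stand in for the learned parameters \(E\), the \([class]\) token, and \(E_{pos}\):

```python
# Minimal NumPy sketch of ViT patch embedding (illustrative sizes,
# random stand-ins for the learned parameters E, x_class, E_pos).
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 224, 224, 3        # image resolution and channels (assumed)
P = 16                       # patch resolution (P, P)
D = 768                      # latent embedding size (assumed)
N = (H * W) // (P * P)       # number of patches: 196

x = rng.standard_normal((H, W, C))

# Reshape x in R^{H x W x C} into N flattened patches x_p in R^{N x (P^2 * C)}
patches = (x.reshape(H // P, P, W // P, P, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(N, P * P * C))

E = rng.standard_normal((P * P * C, D)) * 0.02   # learnable projection E
x_class = rng.standard_normal((1, D)) * 0.02     # learnable [class] token
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # 1D learnable position embedding

# z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
print(z0.shape)   # (197, 768)
```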
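- And a matching sketch of one MSA layer, implementing the \(U_{qkv}\), \(A\), and \(U_{msa}\) formulas above on a randomly generated input \(z\); the head count and weight shapes are assumptions, not the paper's exact configuration:

```python
# Minimal NumPy sketch of one multihead self-attention (MSA) layer,
# following the formulas above (random weights stand in for U_qkv, U_msa).
import numpy as np

rng = np.random.default_rng(0)

N, D = 197, 768              # sequence length (patches + [class]) and width
k = 12                       # number of heads (assumed)
Dh = D // k                  # per-head dimension D_h

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, U_qkv, U_msa):
    heads = []
    for h in range(k):
        qkv = z @ U_qkv[h]                    # [q, k, v] = z U_qkv, shape (N, 3*Dh)
        q, kk, v = np.split(qkv, 3, axis=-1)
        A = softmax(q @ kk.T / np.sqrt(Dh))   # A = softmax(q k^T / sqrt(D_h)), (N, N)
        heads.append(A @ v)                   # SA_h(z) = A v
    # MSA(z) = [SA_1(z); ...; SA_k(z)] U_msa
    return np.concatenate(heads, axis=-1) @ U_msa

z = rng.standard_normal((N, D))
U_qkv = rng.standard_normal((k, D, 3 * Dh)) * 0.02
U_msa = rng.standard_normal((k * Dh, D)) * 0.02
print(msa(z, U_qkv, U_msa).shape)   # (197, 768)
```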
And?
- Inductive Bias
- Compared to CNNs, ViT has less image-specific inductive bias.
- In CNNs, locality, 2D neighborhood structure, and translation equivariance are baked into each layer throughout the whole model.
- In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global.
- As a result, ViT must learn these spatial relations from the data itself, which is why it needs large-scale pre-training to outperform CNNs.
- About CNNs
- CNNs are very good at detecting local, partial features, but they do not explicitly model the relative positioning of those features (e.g., the parts of a face).
- The main difference between ViT and CNNs is the visual field: self-attention is global from the very first layer, whereas a CNN's receptive field grows only gradually with depth.