Vision Transformer(ViT) Paper Note

Kai-Jie Lin

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformer)

Published Year: 2021
Paper URL

What

  • CNNs have dominated image classification tasks. This paper shows that this reliance on CNNs is not necessary: a pure transformer applied to sequences of image patches can perform very well on image classification.
  • Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs when pre-trained on large datasets, while requiring substantially less computational cost to train.

Why

  • Transformers are doing very well on NLP tasks. What happens if we apply a transformer to computer vision?

How

  • Transformer Architecture
    ![](https://i.imgur.com/1PNfjqR.png =600x)
  • Algorithm
    • ![](https://i.imgur.com/dMU1tvL.png =600x)
    • Patch embedding. Reshape the image $x\in\mathbb{R}^{H \times W \times C}$ into 2D patches $x_{p}\in\mathbb{R}^{N\times(P^2\cdot C)}$, where $(H,W)$ is the resolution of the original image, $C$ is the number of channels, $(P,P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches. The input sequence is $[x_{class};\, x_{p}^{1}E;\, x_{p}^{2}E;\, \dots;\, x_{p}^{N}E]$.
    • $E_{pos}$ is a 1D learnable position embedding which is added to the patch embeddings.
    • Multihead Self-Attention(MSA)
      • For each input $z\in \mathbb{R}^{N \times D}$
        • $[q,k,v] = zU_{qkv},\quad U_{qkv}\in\mathbb{R}^{D\times 3D_{h}}$
        • $A = \mathrm{softmax}(qk^\top / \sqrt{D_{h}}),\quad A\in\mathbb{R}^{N\times N}$
        • $SA(z) = Av$
        • $MSA(z) = [SA_{1}(z);SA_{2}(z);\dots;SA_{k}(z)]\,U_{msa},\quad U_{msa}\in\mathbb{R}^{k \cdot D_{h}\times D}$
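The patch-embedding step above can be sketched in NumPy. This is an illustrative toy, not the paper's code: the projection $E$, the class token, and $E_{pos}$ are random or zero stand-ins for learned parameters.

```python
import numpy as np

def patchify(x, P):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
x = rng.standard_normal((224, 224, 3))            # input image
P, D = 16, 768                                    # patch size, embedding dim
patches = patchify(x, P)                          # N = 224*224 / 16^2 = 196
E = rng.standard_normal((P * P * 3, D)) * 0.02    # patch projection (learned in practice)
x_class = np.zeros((1, D))                        # [class] token (learned in practice)
E_pos = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02  # 1D position embedding
z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
print(patches.shape, z0.shape)                    # (196, 768) (197, 768)
```

Note how the 16x16 patch size gives exactly 196 tokens for a 224x224 image, plus one class token.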
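The MSA equations above can likewise be sketched with NumPy; here $U_{qkv}$ and $U_{msa}$ are random stand-ins for learned weights, and a single unbatched sequence is assumed.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, U_qkv_heads, U_msa):
    """z: (N, D); U_qkv_heads: k matrices of shape (D, 3*Dh); U_msa: (k*Dh, D)."""
    heads = []
    for U_qkv in U_qkv_heads:
        q, k, v = np.split(z @ U_qkv, 3, axis=-1)      # each (N, Dh)
        A = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, N) attention weights
        heads.append(A @ v)                            # SA(z) = Av
    return np.concatenate(heads, axis=-1) @ U_msa      # concat heads, project to (N, D)

rng = np.random.default_rng(0)
N, D, k = 197, 768, 12            # ViT-Base-like sizes: 197 tokens, 12 heads
Dh = D // k
z = rng.standard_normal((N, D))
U_qkv_heads = [rng.standard_normal((D, 3 * Dh)) * 0.02 for _ in range(k)]
U_msa = rng.standard_normal((k * Dh, D)) * 0.02
out = msa(z, U_qkv_heads, U_msa)
print(out.shape)                  # (197, 768)
```

Each head attends over all 197 tokens at once, which is the global-receptive-field property contrasted with CNN locality in the next section.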

And?

  • Inductive Bias
    • Compared to CNNs, ViT has less image-specific inductive bias.
    • In CNNs, locality, 2D neighborhood structure, and translation equivariance are baked into each layer throughout the whole model.
    • In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global.
    • This means ViT must learn these spatial relations from data rather than having them built in, which gives it more flexibility but makes large-scale pre-training important.