DeiT Paper Note

Kai-Jie Lin Lv3

Training data-efficient image transformers & distillation through attention(DeiT)

Published Year : 2020
Paper URL

What

This paper introduces a new model called DeiT. The main difference between DeiT and ViT is a new distillation procedure based on a distillation token. DeiT shows that a model containing no convolutional layer can achieve competitive results against the state of the art on ImageNet with no extra training data and fewer parameters.

How

  • Overall Architecture
  • Class token
    • The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class.
  • What does the distillation token do?
    • The distillation token is similar to the class token, except that at the output of the network its objective is to reproduce the label predicted by a teacher model instead of the true label. (The teacher is a well-performing classifier; it can be a CNN or a transformer. We will discuss which one is better later.)
    • Soft distillation
      • It minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.
      • Let $Z_{t}$ be the logits of the teacher model and $Z_{s}$ the logits of the student model. $\tau$ is the temperature for the distillation, $\lambda$ the coefficient balancing the KL divergence loss ($\mathrm{KL}$) and the cross-entropy ($\mathcal{L}_{CE}$) on ground-truth labels $y$, and $\psi$ the softmax function. The distillation objective is:
$$\mathcal{L}_{\mathrm{global}} = (1-\lambda)\mathcal{L}_{CE}(\psi(Z_{s}),y) + \lambda \tau^2 \mathrm{KL}(\psi(Z_{s}/\tau), \psi(Z_{t}/\tau)).$$
  • Hard distillation
    • The authors introduce a new distillation variant called hard distillation.
    • Let $y_{t} = \mathrm{argmax}_{c} Z_{t}(c)$ be the hard decision of the teacher; the objective associated with this hard-label distillation is:
$$\mathcal{L}_{\mathrm{global}}^{\mathrm{hardDistill}} = \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y)+\frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y_{t}).$$
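The token mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration of how the class and distillation tokens are prepended to the patch tokens before the first transformer layer; the function name and shapes are my own choices (a 224x224 image with 16x16 patches gives 196 patch tokens of dimension 768), not code from the paper.

```python
import numpy as np

def prepend_tokens(patch_tokens, cls_token, dist_token):
    # Prepend the (learnable) class and distillation tokens to the patch
    # sequence before the first transformer layer.
    # Shapes: patch_tokens (N, D); cls_token, dist_token (D,).
    return np.vstack([cls_token[None, :], dist_token[None, :], patch_tokens])

# 196 patch tokens of dimension 768, plus the two extra tokens.
seq = prepend_tokens(np.zeros((196, 768)), np.ones(768), np.full(768, 2.0))
print(seq.shape)  # (198, 768)
```

Both extra tokens attend to the patches through the same self-attention layers; only their output objectives differ.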
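The two objectives above can be written out as a small NumPy sketch. This is an illustrative single-example implementation under my own assumptions (function names, `tau`/`lam` defaults, and the teacher-to-student KL direction follow the usual distillation convention), not the paper's training code.

```python
import numpy as np

def softmax(z, tau=1.0):
    # Numerically stable temperature-scaled softmax psi(z / tau).
    z = z / tau
    e = np.exp(z - np.max(z))
    return e / e.sum()

def soft_distill_loss(z_s, z_t, y, tau=3.0, lam=0.1):
    # (1 - lam) * CE(psi(z_s), y) + lam * tau^2 * KL between the
    # teacher and student softmaxes, both taken at temperature tau.
    ce = -np.log(softmax(z_s)[y] + 1e-12)
    q_t = softmax(z_t, tau)
    q_s = softmax(z_s, tau)
    kl = np.sum(q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12)))
    return (1 - lam) * ce + lam * tau ** 2 * kl

def hard_distill_loss(z_s, z_t, y):
    # The teacher's argmax y_t acts as a second hard label,
    # weighted equally with the ground-truth label y.
    y_t = int(np.argmax(z_t))
    p_s = softmax(z_s)
    return 0.5 * (-np.log(p_s[y] + 1e-12)) + 0.5 * (-np.log(p_s[y_t] + 1e-12))
```

When the teacher and student logits agree, the KL term vanishes and hard distillation reduces to plain cross-entropy, which matches the formulas above.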

Experiment

  • Variants of the DeiT model.
    ![](https://imgur.com/LYPf4pF.png =200x)
  • Which teacher is better?
    • Using a CNN as the teacher model works better than using a transformer, perhaps because the distilled transformer inherits the inductive bias of the CNN teacher.
      ![](https://imgur.com/XJ55VNm.png =300x)
  • Which distillation method is better?
    • Hard distillation outperforms soft distillation for transformers, even when using only a class token. The classifier using both the class and distillation tokens is significantly better than either the class or the distillation classifier alone.
      ![](https://imgur.com/VCJDVpi.png =500x)
  • Agreement with the teacher & inductive bias?
    • This question is difficult to answer formally. The figure below shows that the distilled model is more correlated with the convnet than with a transformer learned from scratch.
      ![](https://imgur.com/ch3H8ac.png =200x)

And?

  • Comparison with transformer-based models on different transfer learning tasks with ImageNet pre-training.
    ![](https://imgur.com/04d66bz.png =300x)
  • I am interested in inductive bias and what the model has learned. I think it is a hard problem but a cool direction to study.