# Training data-efficient image transformers & distillation through attention (DeiT)
Published Year: 2020 · Paper URL
## What
This paper introduces a new model called DeiT. The main difference between DeiT and ViT is a new distillation procedure based on a distillation token. DeiT shows that a transformer containing no convolutional layers can achieve competitive results against the state of the art on ImageNet with no external data and fewer parameters.
## How
- Overall Architecture
 - DeiT (architecture figure)
 - ViT (architecture figure)
 - You can see that the main difference between the two is that DeiT adds one extra distillation token compared with ViT.
 - Class token
- The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class.
 
 - What does the distillation token do?
- The distillation token is similar to the class token, except that at the output of the network its objective is to reproduce the label predicted by a teacher model instead of the true label. (The teacher model is a well-performing classifier. It can be a CNN or a transformer; we will discuss which one is better later.)
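Below is a minimal PyTorch sketch (not the official DeiT code) of how the class and distillation tokens could be prepended to the patch tokens and read out by two separate linear heads. It uses `torch.nn.TransformerEncoder` as a stand-in for the actual ViT blocks, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DeiTWithTokens(nn.Module):
    """Illustrative sketch only: class + distillation tokens prepended to patch tokens."""

    def __init__(self, embed_dim=768, num_patches=196, num_classes=1000, depth=12, num_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))    # learnable class token
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)   # stand-in for the ViT blocks
        self.head = nn.Linear(embed_dim, num_classes)        # read out from the class token
        self.head_dist = nn.Linear(embed_dim, num_classes)   # read out from the distillation token

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), i.e. the embedded image patches
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1) + self.pos_embed
        x = self.blocks(x)
        # Two separate linear classifiers on the two extra tokens.
        return self.head(x[:, 0]), self.head_dist(x[:, 1])
```

During training, `head` would be supervised with the ground-truth label and `head_dist` with the teacher's prediction.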
 - Soft distillation
- It minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.
 - Let \(Z_{t}\) be the logits of the teacher model and \(Z_{s}\) the logits of the student model. \(\tau\) is the temperature for the distillation, \(\lambda\) the coefficient balancing the KL divergence loss (\(\mathrm{KL}\)) and the cross-entropy (\(\mathcal{L}_{CE}\)) on ground-truth labels \(y\), and \(\psi\) the softmax function. The distillation objective is: \[ \mathcal{L}_{global} = (1-\lambda)\mathcal{L}_{CE}(\psi(Z_{s}),y) + \lambda \tau^2 \mathrm{KL}(\psi(Z_{s}/\tau), \psi(Z_{t}/\tau)). \]
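A minimal PyTorch sketch of this soft distillation objective, matching the formula above; the function name and the default values of `tau` and `lam` are placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, targets, tau=3.0, lam=0.5):
    """Sketch of the soft distillation objective above; tau and lam are placeholders."""
    ce = F.cross_entropy(student_logits, targets)           # L_CE(psi(Z_s), y)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),        # softened student distribution (log-probs)
        F.log_softmax(teacher_logits / tau, dim=-1),        # softened teacher distribution (log-probs)
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - lam) * ce + lam * (tau ** 2) * kl
```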
 
 - Hard distillation
- The authors introduced a new distillation method called hard distillation.
 - Let \(y_{t} = \mathrm{argmax}_{c}Z_{t}(c)\) be the hard decision of the teacher. The objective associated with this hard-label distillation is: \[ \mathcal{L}_{global}^{\mathrm{hardDistill}} = \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y)+\frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y_{t}) \]
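A matching PyTorch sketch of this hard-label objective (illustrative only). Note that in DeiT itself the term with the true label supervises the class-token head while the term with the teacher's label supervises the distillation-token head; the sketch below applies both terms to a single set of logits, as in the formula.

```python
import torch.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, targets):
    """Sketch of the hard-label distillation objective above."""
    teacher_labels = teacher_logits.argmax(dim=-1)                    # y_t = argmax_c Z_t(c)
    loss_true = F.cross_entropy(student_logits, targets)              # CE against the ground truth y
    loss_teacher = F.cross_entropy(student_logits, teacher_labels)    # CE against the teacher's decision
    return 0.5 * (loss_true + loss_teacher)
```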

## Experiment

 - Variants of the DeiT model: DeiT-Ti, DeiT-S, and DeiT-B, which differ in embedding dimension, number of heads, and parameter count.

 - Which teacher is better?
- Using a CNN as the teacher gives better results than using a transformer teacher, perhaps because the distilled transformer inherits the convnet's inductive bias through distillation.

 
 - Which distillation method is better?
- Hard distillation outperforms soft distillation for transformers, even when using only a class token. The joint classifier that uses both the class and distillation tokens is significantly better than either the class classifier or the distillation classifier on its own.
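At test time the paper fuses the two heads by adding their softmax outputs; here is a minimal sketch of that late fusion, reusing the two logit tensors returned by the token sketch earlier (names are illustrative).

```python
import torch

@torch.no_grad()
def fused_prediction(class_logits, dist_logits):
    """Average the softmax outputs of the two heads and take the argmax."""
    probs = 0.5 * (class_logits.softmax(dim=-1) + dist_logits.softmax(dim=-1))
    return probs.argmax(dim=-1)
```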

 
 - Agreement with the teacher & inductive bias?
- This question is difficult to answer formally. Judging from the disagreement analysis in the paper, the distilled model is more correlated with the convnet than a transformer learned from scratch is.
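The paper quantifies this correlation via the rate at which two classifiers make different decisions; below is a rough sketch of such a measurement, assuming each model maps a batch of images directly to logits (all names here are illustrative).

```python
import torch

@torch.no_grad()
def disagreement_rate(model_a, model_b, loader, device="cpu"):
    """Fraction of examples on which the two models make different top-1 decisions."""
    differ, total = 0, 0
    for images, _ in loader:
        images = images.to(device)
        pred_a = model_a(images).argmax(dim=-1)
        pred_b = model_b(images).argmax(dim=-1)
        differ += (pred_a != pred_b).sum().item()
        total += images.shape[0]
    return differ / total
```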

 
 
## And?
- Comparison with other transformer-based models on different transfer learning tasks with ImageNet pre-training.

 - I am interested in inductive bias and what the model has learned. I think it is a hard problem but a cool topic to study.