Training data-efficient image transformers & distillation through attention (DeiT)
Published Year: 2020 · Paper URL
What
This paper introduces a new model called DeiT. The main difference between DeiT and ViT is a new distillation procedure based on a distillation token. DeiT shows that a transformer containing no convolutional layers can achieve competitive results against the state of the art on ImageNet, with no external training data and fewer parameters.
How
- Overall Architecture
- DeiT
- ViT
- The main difference between the two is that DeiT has one extra token, the distillation token, which ViT does not have.
- Class token
- The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class.
- What does the distillation token do?
- The distillation token is similar to the class token, except that its objective at the output of the network is to reproduce the label predicted by a teacher model instead of the ground-truth label. (The teacher is a well-performing classifier; it can be a CNN or a transformer. We will discuss which one is better later.)
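Below is a minimal sketch of this token mechanism (not the official DeiT implementation; names such as `DeiTSketch`, `embed_dim`, and `num_patches` are illustrative assumptions): the class and distillation tokens are prepended to the patch tokens, run through the transformer, and read out by two separate linear heads.

```python
import torch
import torch.nn as nn

class DeiTSketch(nn.Module):
    def __init__(self, embed_dim=192, num_patches=196, num_classes=1000, depth=2, num_heads=3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))    # class token
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)       # trained on the true label
        self.head_dist = nn.Linear(embed_dim, num_classes)  # trained to reproduce the teacher's label

    def forward(self, patch_tokens):  # patch_tokens: (B, num_patches, embed_dim)
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.dist_token.expand(B, -1, -1),
                            patch_tokens], dim=1) + self.pos_embed
        out = self.blocks(tokens)
        return self.head(out[:, 0]), self.head_dist(out[:, 1])  # (class logits, distillation logits)

# usage: 196 patches, e.g. a 224x224 image split into 16x16 patches
model = DeiTSketch()
cls_logits, dist_logits = model(torch.randn(2, 196, 192))
```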
- Soft distillation
- It minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.
- Let \(Z_{t}\) be the logits of the teacher model and \(Z_{s}\) the logits of the student model. \(\tau\) is the temperature for the distillation, \(\lambda\) the coefficient balancing the KL divergence loss (\(\mathrm{KL}\)) and the cross-entropy (\(\mathcal{L}_{CE}\)) on ground-truth labels \(y\), and \(\psi\) the softmax function. The distillation objective is: \[ \mathcal{L}_{global} = (1-\lambda)\mathcal{L}_{CE}(\psi(Z_{s}),y) + \lambda \tau^2 \mathrm{KL}(\psi(Z_{s}/\tau), \psi(Z_{t}/\tau)). \]
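As a rough PyTorch sketch of this objective (the values of `tau` and `lam` below are placeholders, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(z_s, z_t, y, tau=3.0, lam=0.1):
    """z_s: student logits (B, C); z_t: teacher logits (B, C); y: ground-truth labels (B,)."""
    ce = F.cross_entropy(z_s, y)  # cross-entropy on the true labels
    # KL divergence between the temperature-softened teacher and student
    # distributions, scaled by tau^2 (standard knowledge-distillation form).
    kl = F.kl_div(
        F.log_softmax(z_s / tau, dim=-1),
        F.softmax(z_t / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    return (1 - lam) * ce + lam * kl
```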
- Hard distillation
- The authors introduced a new distillation method called hard distillation.
- Let \(y_{t} = \operatorname{argmax}_{c} Z_{t}(c)\) be the hard decision of the teacher; the objective associated with this hard-label distillation is: \[ \mathcal{L}_{global}^{\mathrm{hardDistill}} = \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y)+\frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y_{t}) \]
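A corresponding PyTorch sketch of the hard-label objective (same assumptions as the soft-distillation sketch above; in the DeiT setup with a distillation token, the first term is applied to the class-head logits and the second to the distillation-head logits):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(z_s, z_t, y):
    """z_s: student logits (B, C); z_t: teacher logits (B, C); y: ground-truth labels (B,)."""
    y_t = z_t.argmax(dim=-1)  # hard decision of the teacher
    # equal weighting of the true-label term and the teacher-label term
    return 0.5 * F.cross_entropy(z_s, y) + 0.5 * F.cross_entropy(z_s, y_t)
```

## Experiment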
- Variants of the DeiT model (DeiT-Ti, DeiT-S, DeiT-B).
- Which teacher is better?
- Using a CNN as the teacher gives better results than using a transformer, probably because the student transformer inherits the convnet's inductive bias through distillation.
- Which distillation method is better?
- Hard distillation outperforms soft distillation for transformers, even when using only a class token. The classifier using both the class and distillation tokens is significantly better than either the class classifier or the distillation classifier on its own.
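As a small sketch of the joint classifier mentioned above: at test time the paper adds the softmax outputs of the two heads to make the final prediction (tensor names here are illustrative).

```python
import torch

def fused_prediction(cls_logits, dist_logits):
    """cls_logits, dist_logits: (B, C) logits from the class and distillation heads."""
    probs = cls_logits.softmax(dim=-1) + dist_logits.softmax(dim=-1)  # late fusion of the two heads
    return probs.argmax(dim=-1)  # predicted class index per sample
```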
- Agreement with the teacher & inductive bias?
- This question is difficult to answer formally. From the paper's disagreement analysis, the distilled model's predictions are more correlated with the convnet's than with those of a transformer learned from scratch.
And?
- The paper also compares DeiT with other transformer-based models on different transfer learning tasks, using ImageNet pre-training.
- I am interested in the inductive-bias question and in what the model has actually learned. I think it is a hard problem, but a cool direction to study.