# Training data-efficient image transformers & distillation through attention (DeiT)
Published Year: 2020 · Paper URL
## What
This paper introduces a new model called DeiT. The main difference between DeiT and ViT is a new distillation procedure based on a distillation token. DeiT shows that a transformer containing no convolutional layers can achieve competitive results against the state of the art on ImageNet with no external data and fewer parameters.
## How
- Overall Architecture
 - DeiT (architecture figure)
 - ViT (architecture figure)
 - You can see that the main difference between the two is that DeiT adds one extra distillation token compared with ViT.
 - Class token
- The class token is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class.
 
 - What does the distillation token do?
- The distillation token is similar to the class token, except that at the output of the network its objective is to reproduce the label predicted by a teacher model instead of the true label. (The teacher model is a well-performing classifier. It can be a CNN or a transformer; we will discuss which one is better later.)
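Below is a minimal PyTorch sketch (not the official DeiT code) of how the class and distillation tokens could be prepended to the patch tokens and read out by two separate linear heads. It uses `torch.nn.TransformerEncoder` as a stand-in for the actual ViT blocks, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DeiTWithTokens(nn.Module):
    """Illustrative sketch only: class + distillation tokens prepended to patch tokens."""

    def __init__(self, embed_dim=768, num_patches=196, num_classes=1000, depth=12, num_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))    # learnable class token
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)   # stand-in for the ViT blocks
        self.head = nn.Linear(embed_dim, num_classes)        # read out from the class token
        self.head_dist = nn.Linear(embed_dim, num_classes)   # read out from the distillation token

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), i.e. the embedded image patches
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1) + self.pos_embed
        x = self.blocks(x)
        # Two separate linear classifiers on the two extra tokens.
        return self.head(x[:, 0]), self.head_dist(x[:, 1])
```

During training, `head` would be supervised with the ground-truth label and `head_dist` with the teacher's prediction.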
 - Soft distillation
- It minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.
 - Let \(Z_{t}\) be the logits of the teacher model and \(Z_{s}\) the logits of the student model. \(\tau\) is the temperature for the distillation, \(\lambda\) the coefficient balancing the KL divergence loss (\(\mathrm{KL}\)) and the cross-entropy (\(\mathcal{L}_{CE}\)) on ground-truth labels \(y\), and \(\psi\) the softmax function. The distillation objective is: \[ \mathcal{L}_{global} = (1-\lambda)\mathcal{L}_{CE}(\psi(Z_{s}),y) + \lambda \tau^2 \mathrm{KL}(\psi(Z_{s}/\tau), \psi(Z_{t}/\tau)). \]
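A minimal PyTorch sketch of this soft distillation objective, matching the formula above; the function name and the default values of `tau` and `lam` are placeholders, not the paper's settings.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, targets, tau=3.0, lam=0.5):
    """Sketch of the soft distillation objective above; tau and lam are placeholders."""
    ce = F.cross_entropy(student_logits, targets)           # L_CE(psi(Z_s), y)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),        # softened student distribution (log-probs)
        F.log_softmax(teacher_logits / tau, dim=-1),        # softened teacher distribution (log-probs)
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - lam) * ce + lam * (tau ** 2) * kl
```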
 
 - Hard distillation
- The authors introduced a new distillation method called hard distillation.
 - Let \(y_{t} = \mathrm{argmax}_{c}Z_{t}(c)\) be the hard decision of the teacher. The objective associated with this hard-label distillation is: \[ \mathcal{L}_{global}^{\mathrm{hardDistill}} = \frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y)+\frac{1}{2}\mathcal{L}_{CE}(\psi(Z_{s}), y_{t}) \]
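A matching PyTorch sketch of this hard-label objective (illustrative only). Note that in DeiT itself the term with the true label supervises the class-token head while the term with the teacher's label supervises the distillation-token head; the sketch below applies both terms to a single set of logits, as in the formula.

```python
import torch.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, targets):
    """Sketch of the hard-label distillation objective above."""
    teacher_labels = teacher_logits.argmax(dim=-1)                    # y_t = argmax_c Z_t(c)
    loss_true = F.cross_entropy(student_logits, targets)              # CE against the ground truth y
    loss_teacher = F.cross_entropy(student_logits, teacher_labels)    # CE against the teacher's decision
    return 0.5 * (loss_true + loss_teacher)
```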

## Experiment

 - Variants of the DeiT model: DeiT-Ti, DeiT-S, and DeiT-B, which differ in embedding dimension, number of heads, and parameter count.

 - Which teacher is better?
- Using a CNN as the teacher gives better results than using a transformer teacher, perhaps because the distilled transformer inherits the convnet's inductive bias through distillation.

 
 - Which distillation method is better?
- Hard distillation outperforms soft distillation for transformers, even when using only a class token. The joint classifier that uses both the class and distillation tokens is significantly better than either the class classifier or the distillation classifier on its own.
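At test time the paper fuses the two heads by adding their softmax outputs; here is a minimal sketch of that late fusion, reusing the two logit tensors returned by the token sketch earlier (names are illustrative).

```python
import torch

@torch.no_grad()
def fused_prediction(class_logits, dist_logits):
    """Average the softmax outputs of the two heads and take the argmax."""
    probs = 0.5 * (class_logits.softmax(dim=-1) + dist_logits.softmax(dim=-1))
    return probs.argmax(dim=-1)
```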

 
 - Agreement with the teacher & inductive bias?
- This question is difficult to answer formally. Judging from the disagreement analysis in the paper, the distilled model is more correlated with the convnet than a transformer learned from scratch is.
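The paper quantifies this correlation via the rate at which two classifiers make different decisions; below is a rough sketch of such a measurement, assuming each model maps a batch of images directly to logits (all names here are illustrative).

```python
import torch

@torch.no_grad()
def disagreement_rate(model_a, model_b, loader, device="cpu"):
    """Fraction of examples on which the two models make different top-1 decisions."""
    differ, total = 0, 0
    for images, _ in loader:
        images = images.to(device)
        pred_a = model_a(images).argmax(dim=-1)
        pred_b = model_b(images).argmax(dim=-1)
        differ += (pred_a != pred_b).sum().item()
        total += images.shape[0]
    return differ / total
```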

 
 
## And?
- Comparison with other transformer-based models on different transfer learning tasks with ImageNet pre-training.

 - I am interested in inductive bias and what the model has learned. I think it is a hard problem but a cool topic to study.