Review: Deep Mutual Learning

Nguyễn Thành Nam
3 min read · Nov 4, 2020


Link to paper: https://arxiv.org/pdf/1706.00384.pdf

In this paper, the authors propose a deep learning method that can produce lightweight and effective models.

Idea

The idea of this paper is closely related to model distillation, a method where a large, powerful network is used as a teacher to transfer knowledge to a small student network.

Model distillation

Instead of using a large, powerful teacher, Deep Mutual Learning (DML) trains a cohort of student networks simultaneously on the same data, and the students also learn from each other (two-way knowledge transfer).

Deep Mutual Learning

Method

Assuming there are two networks in the cohort, the authors formulate the idea by having each network minimize two loss functions: the objective loss (which depends on the task the network aims to solve) and the Kullback-Leibler (KL) divergence between the two networks' predictions. Minimizing the KL divergence means that Network 1 learns to match the predictions of Network 2.

The two loss functions: the objective loss (cross-entropy in this example) and the KL divergence
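
Since the equation images don't carry over here, below is my own LaTeX transcription of the two terms for Network 1 (for a classification task with N samples and M classes; the notation may differ slightly from the paper):

```latex
% Objective (cross-entropy) loss of Network 1
L_{C_1} = -\sum_{i=1}^{N} \sum_{m=1}^{M} I(y_i, m)\,\log\!\big(p_1^{m}(x_i)\big)

% KL divergence from Network 2's prediction to Network 1's prediction
D_{KL}(p_2 \,\|\, p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^{m}(x_i)\,\log\frac{p_2^{m}(x_i)}{p_1^{m}(x_i)}
```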

Each network's overall loss is the sum of its objective loss and the KL divergence.

Overall losses. The first equation is the overall loss of Network 1, the second is that of Network 2
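
To make this concrete, here is a simplified PyTorch-style sketch of one training step for a two-network cohort, where each network's overall loss is its cross-entropy plus the KL divergence toward the other network's prediction. This is my own illustration, not the authors' code; in particular, the paper updates the two networks alternately, whereas this sketch computes both updates from the same forward pass.

```python
import torch.nn.functional as F

def dml_step(net1, net2, opt1, opt2, x, y):
    """One simplified DML training step for a cohort of two networks.

    Each network minimizes its own cross-entropy plus the KL divergence
    from the other network's predicted distribution to its own.
    """
    logits1, logits2 = net1(x), net2(x)

    # Network 1's loss: L_C1 + D_KL(p2 || p1); p2 is treated as a constant target.
    loss1 = F.cross_entropy(logits1, y) + F.kl_div(
        F.log_softmax(logits1, dim=1),
        F.softmax(logits2, dim=1).detach(),
        reduction="batchmean",
    )
    # Network 2's loss: L_C2 + D_KL(p1 || p2); p1 is treated as a constant target.
    loss2 = F.cross_entropy(logits2, y) + F.kl_div(
        F.log_softmax(logits2, dim=1),
        F.softmax(logits1, dim=1).detach(),
        reduction="batchmean",
    )

    opt1.zero_grad()
    loss1.backward()
    opt1.step()

    opt2.zero_grad()
    loss2.backward()
    opt2.step()
    return loss1.item(), loss2.item()
```

In a full training loop, net1 and net2 would be, for example, two MobileNets trained on the same mini-batches. Larger cohorts just add one KL term per peer (if I recall correctly, the paper averages these terms over the other networks).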

Experiments and Results

Experiments were run on ImageNet, CIFAR-100, and CIFAR-10 for the image classification task, and on Market-1501 for the person re-identification task. The authors applied Deep Mutual Learning to ResNet-32, MobileNet, InceptionV1, and WRN-28-10 in these experiments.

On the CIFAR datasets, DML slightly increased the networks' accuracies (all the values in the “DML-Ind” column are positive).

The same result was observed on ImageNet.

Top-1 Accuracy on ImageNet

On Market-1501, DML also increases the performance of MobileNet compared to independent learning. I noticed that the authors state in the paper:

It can also be seen that our DML approach using two MobileNets significantly outperforms prior state-of-the-art deep re-id methods.

But according to Table 3, a MobileNet pretrained on ImageNet had already outperformed the previous re-id methods without using DML, so the authors' statement seems a bit unfair to me.

Table 4 compares model distillation (the previous method) and DML. It shows that a better result can be obtained without using a large, powerful pretrained teacher.

DML vs model distillation

Figure 3 shows that increasing the number of networks (the cohort size) yielded better results. It was also better to make a combined prediction than to average the predictions of the individual networks. The combined prediction was made by “matching based on concatenated feature of all members”, which I don't fully understand.

Conclusion

The proposed method successfully improved the networks' performance and outperformed the previous method, model distillation. When training with Deep Mutual Learning, making a combined prediction produces even better performance.

Note:

The authors also explain why Deep Mutual Learning works, along with a few more things, but those parts are beyond my understanding. So check out the paper if you want to see more :D
