1
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, Jeff Dean. 2016/01/19, Ming-Han Yang.
2
Outline: Abstract, Introduction, Distillation, Experiments (MNIST, Speech Recognition, JFT dataset), Discussion
3
Abstract
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators [1] have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy, and we develop this approach further using a different compression technique. Experiments: MNIST, Android voice search, JFT dataset.
Notes: For most machine learning tasks, the simplest way to improve performance is to train several models on the same training set and average their predictions. But besides producing a cumbersome model, this approach is too computationally expensive if we want to serve a large number of users. Caruana et al. showed that it is possible to compress the knowledge of an ensemble into a single model; this paper extends that idea and proposes a model compression technique.
[1] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. KDD '06, pages 535-541, New York, NY, USA, ACM.
4
Introduction
For cumbersome models that learn to discriminate between a large number of classes, the normal training objective is to maximize the average log probability of the correct answer. Side effect: the trained model assigns probabilities to all of the incorrect answers, and even when these probabilities are very small, some of them are much larger than others. The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize. Example: BMW, garbage truck, carrot.
An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as "soft targets" for training the small model. For this transfer stage, we could use the same training set or a separate "transfer" set.
Notes: A cumbersome model usually learns to classify a large number of classes, and training maximizes the average log probability of the correct answer. A side effect is that the model assigns small probabilities to all the incorrect answers, and some of these are much larger than others. The authors argue that these incorrect-answer probabilities carry a lot of information: an image of a BMW has only a small chance of being mistaken for a garbage truck, but that chance is still much larger than the chance of being mistaken for a carrot. It is widely accepted that a model's objective function should reflect the user's true goal as closely as possible, yet models are usually trained to optimize performance on the training set, while the real goal is to classify new, unseen data well. This is achievable if we know the right way to improve generalization, but that information is usually unavailable. The authors argue that if the knowledge of the large model can be distilled into a small model, the small model can acquire the same generalization ability. To transfer that ability, the class probabilities produced by the large model are used as soft targets for training the small model, as described next.
5
Distillation
Neural networks typically produce class probabilities by using a "softmax" output layer that converts the logit, $z_i$, computed for each class into a probability, $q_i$, by comparing $z_i$ with the other logits:
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where $T$ is a temperature that is normally set to 1. Using a higher value for $T$ produces a softer probability distribution over classes.
In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set, using a soft target distribution for each case in the transfer set produced by running the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
Notes: In its simplest form, distillation transfers knowledge to the distilled model through a transfer set; the cumbersome model, run with a high softmax temperature, produces a soft target for every case in the transfer set. Training steps: 1) prepare a transfer set; 2) run the transfer set through the trained cumbersome model with its softmax temperature raised, obtaining soft class probabilities; 3) train the small model on the transfer set with these soft probabilities, using the same softmax temperature as the cumbersome model; after training, set the temperature back to 1.
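To make the transfer step concrete, here is a minimal sketch of training a small model on soft targets produced at a high temperature, assuming PyTorch; `teacher`, `student`, and `transfer_loader` are hypothetical placeholders, not anything defined in the paper.

```python
# Minimal sketch of distillation with a temperature-scaled softmax (PyTorch).
# `teacher`, `student`, and `transfer_loader` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def soft_targets(logits, T):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T)
    return F.softmax(logits / T, dim=-1)

T = 20.0  # high temperature used by both models during transfer
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

for x, _ in transfer_loader:
    with torch.no_grad():
        q_teacher = soft_targets(teacher(x), T)        # soft targets from the cumbersome model
    log_q_student = F.log_softmax(student(x) / T, dim=-1)
    # cross-entropy between the teacher's soft distribution and the student's
    loss = -(q_teacher * log_q_student).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# After training, the distilled model is used with temperature 1.
```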
6
Experiments on MNIST (1/2)
Large net: 2 hidden layers of 1200 ReLU units, 60,000 training cases, 67 test errors. The net was strongly regularized using dropout and weight constraints, and the input images were jittered by up to two pixels in any direction.
Small net: 2 hidden layers of 800 ReLU units, 60,000 training cases, no regularization, 146 test errors.
If the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors.
With 300 or more units per layer, all temperatures above 8 gave fairly similar results; with only 30 units per layer, temperatures in the range 2.5-4 worked significantly better than higher or lower temperatures.
Notes: Dropout can be viewed as an ensemble of an exponentially large number of models that share weights.
7
Experiments on MNIST (2/2)
We then tried omitting all examples of the digit 3 from the transfer set: 206 test errors, of which 133 are among the 1010 3s in the test set. The learned bias for class 3 is too low; after raising this bias, the distilled model makes 109 errors, of which 14 are on 3s. So with the right bias, the distilled model gets 98.6% of the test 3s correct despite never having seen a 3 during training.
If the transfer set contains only the 7s and 8s from the training set, the distilled model makes 47.3% test errors, but when the biases for 7 and 8 are reduced by 7.6 to optimize test performance, this falls to 13.2% test errors.
Notes: With 300 or more units per layer in the small model, any temperature above 8 gives similar results, but with only 30 units per layer, temperatures of 2.5-4 work best. The 3s were then removed from the transfer set, so the digit 3 is data the small model has never seen.
8
Experiments on Speech Recognition
Baseline: a DNN with 8 hidden layers of 2560 ReLU units each and a softmax output layer over the HMM state labels. The input is 26 frames of 40 Mel-scaled filter-bank coefficients with a 10ms advance per frame, and we predict the HMM state of the 21st frame. The total number of parameters is about 85M. Training data: Android voice search, about 2000 hours, roughly 700M training examples.
For the distillation we tried temperatures of [1, 2, 5, 10] and used a relative weight of 0.5 on the cross-entropy for the hard targets, where bold font indicates the best value that was used for Table 1.
Notes: They trained 10 models with the same settings as the baseline, each with randomly initialized parameters, and averaged their predictions; this ensemble gives a clear improvement over a single model. They also tried giving different subsets of the training data to different models but saw no significant gain, so they kept the simpler setup. The table shows that the proposed distillation method does extract more useful information from the training data than training with hard labels alone: more than 80% of the frame-classification improvement achieved by the ensemble is transferred to the distilled model. They also note that a related distillation method had already been proposed by Microsoft researchers in 2014, but that work used a temperature of 1 and a large unlabeled dataset, its best distilled model closed only about 28% of the error-rate gap between the large and small models, and both of its models were trained with hard labels.
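As a sketch of the objective described above (a distillation term plus a relative weight of 0.5 on the hard-target cross-entropy), assuming PyTorch; `student_logits`, `teacher_logits`, and `labels` are hypothetical tensors.

```python
# Sketch of the combined objective used in the speech experiments:
# distillation term plus a weighted cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, T=2.0, hard_weight=0.5):
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # soft term; the paper scales it by T^2 so its gradients match the hard term
    soft_loss = -(soft_teacher * log_soft_student).sum(dim=-1).mean() * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return soft_loss + hard_weight * hard_loss
```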
9
Training ensembles of specialists on very big datasets
In this section we give an example of such a dataset, and we show how learning specialist models that each focus on a different confusable subset of the classes can reduce the total amount of computation required to learn an ensemble. We also describe how the overfitting these specialists are prone to may be prevented by using soft targets.
JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels. The baseline model for JFT was a deep convolutional neural network that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.
Notes: Training an ensemble of models is a simple way to exploit parallel computation, but at test time it requires too much computation, and even though training parallelizes easily, the cost becomes enormous when every model in the ensemble is large and the training data is also very large. The following explains how learning specialist models reduces this computation, and how soft targets prevent overfitting. The JFT dataset is an internal Google dataset with 100M labeled images and 15,000 labels; the baseline is a deep CNN trained for about six months with asynchronous SGD.
10
Specialist Models
When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many "specialist" models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom). The softmax of this type of specialist can be made much smaller by combining all of the classes it does not care about into a single dustbin class, as sketched below.
Notes: When the number of classes is very large, we can train one generalist model on the whole dataset plus many specialist models, each trained on easily confused classes (for example, different kinds of mushroom). With this approach the specialist's softmax does not need to cover the full set of classes.
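A small illustrative sketch of how a specialist's targets might be remapped so that everything outside its confusable subset falls into one dustbin class; the function and class ids here are hypothetical, not from the paper.

```python
# Collapse all classes a specialist does not care about into one dustbin class.
def make_specialist_mapper(specialist_classes):
    ordered = sorted(specialist_classes)
    index = {c: i for i, c in enumerate(ordered)}   # compact ids for the specialist's own classes
    dustbin = len(ordered)                          # one extra id for everything else
    return lambda label: index.get(label, dustbin)

# Example: a specialist responsible for classes {3, 17, 42}
mapper = make_specialist_mapper({3, 17, 42})
assert mapper(17) == 1       # one of its own classes
assert mapper(999) == 3      # any other class goes to the dustbin
```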
11
Assigning classes to specialists
In order to derive groupings of object categories for the specialists, we decided to focus on categories that our full network often confuses. In particular, we apply a clustering algorithm to the covariance matrix of the predictions of our generalist model, so that a set of classes $S_m$ that are often predicted together will be used as targets for one of our specialist models, $m$. We applied an on-line version of the K-means algorithm to the columns of the covariance matrix and obtained reasonable clusters.
Notes: To decide which classes go to which specialist, we start from the categories that the generalist network confuses most often. We could compute a confusion matrix and cluster it, but there is a simpler approach that needs no labels to build the clusters: apply an online K-means to the columns of the covariance matrix of the generalist's predictions. Several other clustering methods gave similar results. A sketch follows.
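A sketch of this grouping step, under the assumption that scikit-learn's MiniBatchKMeans stands in for the on-line K-means mentioned above; `generalist_probs` is a hypothetical array of generalist predictions.

```python
# Cluster the columns of the covariance matrix of the generalist's predictions
# so that classes that are often predicted together end up in the same group.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def assign_classes_to_specialists(generalist_probs, num_specialists):
    # generalist_probs: (num_examples, num_classes) predicted probabilities
    cov = np.cov(generalist_probs, rowvar=False)     # (num_classes, num_classes)
    km = MiniBatchKMeans(n_clusters=num_specialists, random_state=0)
    cluster_of_class = km.fit_predict(cov.T)         # one row per class (column of cov)
    return [np.where(cluster_of_class == m)[0] for m in range(num_specialists)]
```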
12
Performing inference with ensembles of specialists
We wanted to see how well ensembles containing specialists performed. In addition to the specialist models, we always have a generalist model so that we can deal with classes for which we have no specialists and so that we can decide which specialists to use. Given an input image x, we do top-one classification in two steps:
Step 1: For each test case, we find the $n$ most probable classes according to the generalist model. Call this set of classes $k$ (we used $n = 1$).
Step 2: We then take all the specialist models, $m$, whose special subset of confusable classes, $S_m$, has a non-empty intersection with $k$, and call this the active set of specialists $A_k$ (note that this set may be empty). We then find the full probability distribution $\mathbf{q}$ over all the classes that minimizes
$$KL(\mathbf{p}^g, \mathbf{q}) + \sum_{m \in A_k} KL(\mathbf{p}^m, \mathbf{q})$$
Notes: Given an input image x, there are two steps: for each test case, use the generalist model to find the n most probable classes; then take the specialist models whose class subsets intersect this set and call them the active set A_k (it may be empty, i.e. no relevant specialist exists); finally, minimize the objective above over all specialists m in A_k, as sketched below.
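One simple way to solve the minimization above is gradient descent on the logits of q. The sketch below assumes PyTorch, that p_g is the generalist's distribution over all classes, and that each specialist supplies its distribution over its own classes plus a final dustbin entry; all names are hypothetical and the paper does not prescribe this particular solver.

```python
# Find q minimizing KL(p_g, q) + sum over active specialists of KL(p_m, q),
# where each specialist's distribution covers its own classes plus a dustbin.
import torch

def kl(p, q, eps=1e-12):
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return torch.sum(p * (p.log() - q.log()))

def combine_with_specialists(p_g, specialists, steps=200, lr=0.1):
    # specialists: list of (probs_m, class_ids_m); probs_m[-1] is the dustbin entry
    logits = p_g.clamp_min(1e-6).log().clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        q = torch.softmax(logits, dim=-1)
        loss = kl(p_g, q)
        for probs_m, class_ids in specialists:
            q_sub = q[class_ids]
            # the dustbin probability is the mass q assigns outside the specialist's classes
            q_m = torch.cat([q_sub, (1 - q_sub.sum()).clamp_min(0).unsqueeze(0)])
            loss = loss + kl(probs_m, q_m)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=-1)
```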
13
Performing inference with ensembles of specialists
Table 3 shows the absolute test accuracy for the baseline system and the baseline system combined with the specialist models.
14
Soft Targets as Regularizers
One of our main claims about using soft targets instead of hard targets is that a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target. Table 5 shows that with only 3% of the data (about 20M examples), training the baseline model with hard targets leads to severe overfitting (we did early stopping, as the accuracy drops sharply after reaching 44.5%), whereas the same model trained with soft targets is able to recover almost all the information in the full training set (about 2% shy). This shows that soft targets are a very effective way of communicating the regularities discovered by a model trained on all of the data to another model.
Notes: We argue that a great deal of useful information is hidden in the soft probabilities. Table 5 shows that when only 3% of the data is used for training, the model trained with hard labels overfits badly, whereas the model trained with soft labels does not. Soft probabilities are thus an effective way of preventing overfitting.
15
Relationship to Mixtures of Experts
Drawbacks of mixtures of experts: each expert receives its own weighting for every training case, these weights keep changing and depend on all the other experts; and the gating network has to compare all the experts before it can decide which expert's output to use.
It is much easier to parallelize the training of multiple specialists. We first train a generalist model and then use the confusion matrix to define the subsets that the specialists are trained on. Once these subsets have been defined, the specialists can be trained entirely independently. At test time we can use the predictions from the generalist model to decide which specialists are relevant, and only these specialists need to be run.
Notes: A mixture of experts can be thought of as finding the best answer among many experts: for a given input, every expert gives its opinion (every expert must be run) and the results are then combined into a final answer. This makes its training hard to parallelize, and together with the drawbacks above explains why mixtures of experts are rarely used even though they could be useful. With multiple specialists, training parallelizes easily: first train one model, then use the confusion matrix to define which subset each specialist is responsible for; once the subsets are defined, the specialists are trained completely independently. At test time, the generalist model decides which specialists are relevant, and only those specialists need to be run.
16
Discussion (1/2)
We have shown that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model. On MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, we have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size, which is far easier to deploy.
Notes: Distillation works very well for transferring knowledge. On the handwritten-digit task it performs remarkably well even when the transfer set is missing one of the classes. For the Android voice search task, an ensemble of deep neural network acoustic models, once distilled into a single model, comes very close in performance, and the smaller model is also much easier to deploy.
17
Discussion (2/2)
For really big neural networks, it can be infeasible even to train a full ensemble, but we have shown that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster. We have not yet shown that we can distill the knowledge in the specialists back into the single large net.
Notes: For very large networks, training a full ensemble is infeasible, but the experiments show that adding a large number of specialist networks improves significantly on the single large network alone. However, the work has not yet shown how to distill the knowledge in the specialists back into the single large net.