Distilling the Knowledge in a Neural Network

Presentation transcript:

Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Presented by Ming-Han Yang, 2016/01/19.

Outline: Abstract, Introduction, Distillation, Experiments (MNIST, Speech Recognition, JFT dataset), Discussion

Abstract A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators [1] have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy, and we develop this approach further using a different compression technique. Datasets: MNIST, Android voice search, JFT dataset.
For most machine learning methods, the simplest way to improve performance is to train many models on the same training set and average their predictions. Besides producing a cumbersome model, however, this approach is too computationally expensive to serve to a large number of users. Caruana et al. showed that the knowledge in an ensemble can be compressed into a single model; this paper extends that idea with a different compression technique.
[1] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. KDD '06, pages 535–541, New York, NY, USA, 2006. ACM.

Introduction For cumbersome models that learn to discriminate between a large number of classes, the normal training objective is to maximize the average log probability of the correct answer. Side effect: the trained model assigns probabilities to all of the incorrect answers, and even when these probabilities are very small, some of them are much larger than others. The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize. Ex: BMW, garbage truck, carrot. An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as "soft targets" for training the small model. For this transfer stage, we could use the same training set or a separate "transfer" set.
Cumbersome models typically learn to discriminate among a large number of classes, and the training objective is to maximize the average log probability of the correct answer. A side effect is that the trained model assigns probabilities to all of the incorrect answers; even though these probabilities are very small, some are much larger than others. The authors argue that these probabilities on incorrect answers carry a lot of information: for example, an image of a BMW has only a tiny chance of being mistaken for a garbage truck, but that chance is still much larger than the chance of being mistaken for a carrot. Everyone agrees that the objective function should reflect the user's true goal as closely as possible, yet models are usually trained to optimize performance on the training set, while the real goal is to classify new data well. That is possible if we know the right way to improve the model's generalization, but this information is usually unavailable. The authors argue that if the knowledge of a large model can be distilled into a small model, the small model can generalize in the same way. To transfer the generalization ability of the cumbersome model to a small model, the class probabilities produced by the cumbersome model are used as soft targets for training the small model, as described next.

Distillation Neural networks typically produce class probabilities by using a "softmax" output layer that converts the logit $z_i$ computed for each class into a probability $q_i$ by comparing $z_i$ with the other logits: $q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$, where $T$ is a temperature that is normally set to 1. Using a higher value for $T$ produces a softer probability distribution over classes. In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set, using for each case a soft target distribution produced by running the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
In its simplest form, distillation transfers knowledge to the distilled model through a transfer set: the cumbersome model, with a high softmax temperature, produces a soft target distribution for every case in the transfer set. Training steps: 1) prepare a transfer set; 2) feed it through the trained cumbersome model with the softmax temperature raised, to obtain soft class probabilities; 3) train the small model on the transfer set with these soft probabilities, using the same high temperature in its softmax; after training, the temperature is set back to 1.
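The temperature softmax is easy to see in code. A minimal numpy sketch, assuming a 3-class teacher with made-up logits (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softened class probabilities q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example: soft targets from hypothetical teacher logits
teacher_logits = np.array([5.0, 2.0, -1.0])             # one transfer-set case, 3 classes
print(softmax_with_temperature(teacher_logits, T=1))    # nearly one-hot: ~[0.95, 0.047, 0.002]
print(softmax_with_temperature(teacher_logits, T=20))   # much softer: ~[0.38, 0.33, 0.29]
```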

Experiments on MNIST (1/2) In addition, the input images were jittered by up to two pixels in any direction. Large net: 2 hidden layers of 1200 ReLU units, 60,000 training cases → 67 test errors; strongly regularized using dropout and weight constraints. Small net: 2 hidden layers of 800 ReLU units, 60,000 training cases, no regularization → 146 test errors. If the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors. With 300 or more units per hidden layer, all temperatures above 8 gave fairly similar results; with the layers reduced to 30 units, temperatures of 2.5–4 worked significantly better than higher or lower temperatures.
Dropout can be viewed as training an ensemble of an exponential number of models that share weights.
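Matching the soft targets is just an extra cross-entropy term computed at the high temperature. A hedged numpy sketch of one way to combine it with the usual hard-label loss; the alpha weighting and all names are illustrative, and the T² factor follows the paper's advice to keep the soft-term gradients at a comparable magnitude:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=20.0, alpha=0.5):
    """Cross-entropy against the teacher's soft targets (at temperature T)
    plus cross-entropy against the one-hot hard label (at temperature 1).
    The soft term is scaled by T**2 so its gradient magnitude stays
    comparable when T is large, as the paper recommends."""
    soft_targets = softmax(teacher_logits, T)
    student_soft = softmax(student_logits, T)
    student_hard = softmax(student_logits, 1.0)
    soft_ce = -np.sum(soft_targets * np.log(student_soft + 1e-12))
    hard_ce = -np.log(student_hard[hard_label] + 1e-12)
    return (1 - alpha) * (T ** 2) * soft_ce + alpha * hard_ce
```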

Experiments on MNIST (2/2) We then tried omitting all examples of the digit 3 from the transfer set: 206 test errors, of which 133 were on the 1,010 threes in the test set, because the learned bias for class 3 is too low. Raising that bias → 109 errors, of which 14 are on 3s. So with the right bias, the distilled model gets 98.6% of the test 3s correct despite never having seen a 3 during training. If the transfer set contains only the 7s and 8s from the training set, the distilled model makes 47.3% test errors, but when the biases for 7 and 8 are reduced by 7.6 to optimize test performance, this falls to 13.2% test errors.
If the small model has 300 or more units per layer, all temperatures above 8 give similar results, but with only 30 units per layer, temperatures of 2.5–4 work best. They then removed all examples of the digit 3 from the transfer set, so 3 is a class the small model has never seen.
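The bias fix is nothing more than adding a constant to the missing class's logit at test time. A tiny illustrative sketch (the offset value and names are made up; the paper tunes the bias on overall test performance):

```python
import numpy as np

def predict_with_bias_fix(logits, class_idx=3, offset=3.5):
    """Add a constant to the logit of the class that was absent from the
    transfer set, then take the arg-max. The offset here is illustrative."""
    adjusted = np.asarray(logits, dtype=float).copy()
    adjusted[..., class_idx] += offset
    return adjusted.argmax(axis=-1)
```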

Experiments on Speech Recognition Architecture: 8 hidden layers of 2560 ReLU units and a softmax layer with 14,000 labels; about 85M parameters in total. The input is 26 frames of 40 Mel-scaled filter-bank coefficients with a 10 ms advance per frame, and we predict the HMM state of the 21st frame. Training data: Android voice search, about 2000 hours (700M training examples). For the distillation we tried temperatures of [1, 2, 5, 10] and used a relative weight of 0.5 on the cross-entropy for the hard targets, where bold font indicates the best value that was used for Table 1.
They trained 10 models with the same settings as the baseline, with randomly initialized parameters, and averaged their predictions; this gave a significant improvement over a single model. They also tried giving different subsets of the training data to different models, but this brought no significant gain, so they kept the simpler approach. The table shows that distillation does extract more useful information from the training data than training with hard labels: the distilled model retains more than 80% of the frame-classification improvement achieved by the 10-model ensemble. They also note that a related approach was published by Microsoft in 2014, but that work used a distillation temperature of 1 and a large unlabeled dataset, and its best distilled model only closed about 28% of the accuracy gap between the large and small models, both of which were trained with hard labels.
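The ensemble teacher described in the notes is simply the average of the individual models' predicted distributions. A minimal sketch under that assumption (names and shapes are illustrative):

```python
import numpy as np

def ensemble_soft_targets(per_model_probs):
    """Average the per-class probability distributions of several models.
    per_model_probs: list of arrays, each of shape (num_frames, num_classes)."""
    return np.mean(np.stack(per_model_probs, axis=0), axis=0)
```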

Training ensembles of specialists on very big datasets In this section we give an example of such a dataset, and we show how learning specialist models that each focus on a different confusable subset of the classes can reduce the total amount of computation required to learn an ensemble. We also describe how the resulting overfitting may be prevented by using soft targets. JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels. The baseline model for JFT was a deep convolutional neural network that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.
Training an ensemble is a simple way to exploit parallel computation, but it is too expensive at test time; and even though training is easy to parallelize, if every model in the ensemble is large and there is a huge amount of training data, the computation becomes even more daunting. The next slides show how learning specialist models reduces this computation and how soft targets prevent overfitting. JFT is an internal Google dataset with 100 million labeled images and 15,000 labels; the baseline is a deep CNN trained for about six months with asynchronous SGD on a large number of cores.

Specialist Models When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many "specialist" models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom). The softmax of this type of specialist can be made much smaller by combining all of the classes it does not care about into a single dustbin class.
When the number of classes is very large, we can train one generalist model on the full dataset and many specialist models, each trained on data from an easily confused subset of classes (for example, different kinds of mushroom). With this approach, a specialist's softmax does not need to cover the entire set of classes.
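A small sketch of how a specialist's reduced label space might be built, assuming a known set of confusable class ids; the helper is hypothetical, not from the paper:

```python
def remap_to_specialist_label(full_label, specialist_classes):
    """Map a label from the full label space to a specialist's space:
    classes the specialist cares about keep their own index, and every
    other class is collapsed into a single 'dustbin' class."""
    specialist_classes = sorted(specialist_classes)
    index = {c: i for i, c in enumerate(specialist_classes)}
    dustbin = len(specialist_classes)
    return index.get(full_label, dustbin)
```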

Assign classes to specialists In order to derive groupings of object categories for the specialists, we decided to focus on categories that our full network often confuses. In particular, we apply a clustering algorithm to the covariance matrix of the predictions of our generalist model, so that a set of classes $S_m$ that are often predicted together will be used as the targets for one of our specialist models, $m$. We applied an on-line version of the K-means algorithm to the columns of the covariance matrix and obtained reasonable clusters.
To decide which classes go to which specialist, they start from the categories that the generalist network confuses most often. Although a confusion matrix could be computed and used for clustering, they found a simpler method that does not need labels to build the clusters: an online version of K-means applied to the columns of the covariance matrix of the generalist's predictions yields reasonable clusters. Other clustering algorithms gave similar results.
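A hedged sketch of the grouping step, assuming a matrix of generalist predictions is available; scikit-learn's MiniBatchKMeans stands in for the paper's online K-means, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def assign_classes_to_specialists(pred_probs, num_specialists):
    """pred_probs: (num_examples, num_classes) generalist predictions.
    Cluster the columns of the prediction covariance matrix so that classes
    that tend to be predicted together land in the same specialist subset S_m."""
    cov = np.cov(pred_probs, rowvar=False)            # (num_classes, num_classes)
    km = MiniBatchKMeans(n_clusters=num_specialists, random_state=0)
    cluster_ids = km.fit_predict(cov)                 # one cluster id per class (column)
    return [np.where(cluster_ids == m)[0] for m in range(num_specialists)]
```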

Performing inference with ensembles of specialists We wanted to see how well ensembles containing specialists performed. In addition to the specialist models, we always have a generalist model so that we can deal with classes for which we have no specialists and so that we can decide which specialists to use. Given an input image x, we do top-one classification in two steps: 1) For each test case, we find the $n$ most probable classes according to the generalist model; call this set of classes $k$ (we used $n = 1$). 2) We then take all the specialist models $m$ whose special subset of confusable classes $S_m$ has a non-empty intersection with $k$, and call this the active set of specialists $A_k$ (note that this set may be empty). We then find the full probability distribution $\mathbf{q}$ over all the classes that minimizes $KL(\mathbf{p}^g, \mathbf{q}) + \sum_{m \in A_k} KL(\mathbf{p}^m, \mathbf{q})$.
Given an input image x, there are two steps: for each test case the generalist model gives the n most probable classes; then every specialist whose confusable subset intersects this set is added to the active set A_k (A_k may be empty, i.e. no specialist applies). Over the generalist and all specialists m in A_k, q is found by minimizing the expression above.
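A hedged sketch of the two inference steps and of minimizing the KL objective by gradient descent on q's logits. For simplicity it assumes the specialist distributions have already been mapped back onto the full label space (in the paper each specialist also has a dustbin class), and all names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def specialist_inference(gen_probs, specialist_subsets, specialist_probs,
                         n=1, steps=200, lr=0.5):
    """Two-step top-one classification with an ensemble of specialists.
    gen_probs: generalist distribution over all classes.
    specialist_subsets[m]: the class subset S_m of specialist m.
    specialist_probs[m]: specialist m's distribution, assumed here to be
    already expanded onto the full label space."""
    # Step 1: the n most probable classes k according to the generalist.
    k = set(np.argsort(gen_probs)[-n:])
    # Step 2: active set A_k = specialists whose subset intersects k.
    active = [m for m, S_m in enumerate(specialist_subsets) if k & set(S_m)]
    # Find q minimizing KL(p_g, q) + sum_{m in A_k} KL(p_m, q) by gradient
    # descent on q's logits; the gradient of each KL term is (q - p).
    targets = [gen_probs] + [specialist_probs[m] for m in active]
    z = np.log(gen_probs + 1e-12)          # initialize at the generalist
    for _ in range(steps):
        q = softmax(z)
        grad = len(targets) * q - np.sum(targets, axis=0)
        z -= lr * grad
    return int(np.argmax(softmax(z)))
```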

Performing inference with ensembles of specialists Table 3 shows the absolute test accuracy for the baseline system and the baseline system combined with the specialist models.

Soft Targets as Regularizers One of our main claims about using soft targets instead of hard targets is that a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target. Table 5 shows that with only 3% of the data (about 20M examples), training the baseline model with hard targets leads to severe overfitting (we did early stopping, as the accuracy drops sharply after reaching 44.5%), whereas the same model trained with soft targets is able to recover almost all the information in the full training set (about 2% shy). This shows that soft targets are a very effective way of communicating the regularities discovered by a model trained on all of the data to another model.
A lot of useful information is hidden in the soft probabilities. Table 5 shows that when only 3% of the data is used for training, the model trained with hard labels overfits badly, while the model trained with soft targets does not; soft probabilities are thus an effective way to avoid overfitting.

Relationship to Mixtures of Experts 每個專家對於training set都有各自的權重,這個權重會一直改變,而且會跟所有其他的專家有關係 Gating network需要比較所有其他的專家,才能決定要採用哪個專家的結果 It is much easier to parallelize the training of multiple specialists. We first train a generalist model and then use the confusion matrix to define the subsets that the specialists are trained on. Once these subsets have been defined the specialists can be trained entirely independently. At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run. Mixutre of experts可以想像成在許多專家中找到一個最好的答案。例如針對某個問題,每個專家都發表自己的意見後(每個專家都要run),再整合出一個最好的 這個特色反而造成它的訓練很難平行化 --- mixtures of experts的缺點: 每個專家對於training set都有各自的權重,這個權重會一直改變,而且會跟所有其他的專家有關係 Gating network需要比較所有其他的專家,才能決定要採用哪個專家的結果 這些缺點導致mixture of experts很少被使用,雖然它可能很有用 ==== 但是如果採用multiple specialists的方法的話,就能夠簡單的平行化。首先訓練一個模型,然後使用confusion matrix來定義每個專家負責哪些子集合 子集合定義好了之後,這些專家各自獨立地下去訓練 測試的時候,只要從模型來決定哪些專家是相關的,然後只需要哪些專家需要run

Discussion (1/2) We have shown that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model. On MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, we have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size, which is far easier to deploy.
Distillation works very well for transferring knowledge. On the handwritten-digit task it still performs remarkably well even when the transfer set lacks examples of one of the classes. On Android voice search, nearly all of the improvement from an ensemble of deep acoustic models can be distilled into a single model of the same size, which is also much easier to deploy.

Discussion (2/2) For really big neural networks, it can be infeasible even to train a full ensemble, but we have shown that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster. We have not yet shown that we can distill the knowledge in the specialists back into the single large net.
For very large networks, training a full ensemble can be infeasible, but the experiments show that the performance of a single very large net trained for a long time can still be significantly improved by learning a large number of specialist nets. The authors have not yet shown that the knowledge in the specialists can be distilled back into the single large net.