A Survey of Multitask Learning 2015/09/22 Ming-Han Yang
Outline: An overview of multitask learning; The history of multitask learning. Multitask learning embodies the idea of collective intelligence: tasks learn jointly and help one another.
What is Multitask learning? Multitask learning (MTL) is a machine learning technique that aims at improving the generalization performance of a learning task by jointly learning multiple related tasks. The key to the successful application of MTL is that the tasks need to be related. Here related does not mean the tasks are similar. Instead, it means at some level of abstraction these tasks share part of the representation. If the tasks are indeed similar, learning them together can help transfer knowledge among tasks since it effectively increases the amount of training data for each task. Speaker note (from Deng and Yu's textbook): MTL improves generalization by learning multiple related tasks together. The key to applying it successfully is that the jointly trained tasks must be related, where "related" does not mean "similar"; rather, it means the tasks share part of their representation at some level of abstraction. If the tasks are indeed similar, joint learning lets them pass information to one another and effectively increases the training data available to each task. D. Yu and L. Deng (2014). "Automatic speech recognition - a deep learning approach", Springer, 219-220.
The SDM 2012 tutorial slides note that multi-task learning differs from single-task learning during training: in the multi-task setting, the tasks are trained together so as to capture the intrinsic relatedness among them. Jiayu Zhou, Jianhui Chen and Jieping Ye, Multi-Task Learning: Theory, Algorithms, and Applications, SIAM International Conference on Data Mining, 2012
Learning Methods The same SDM 2012 tutorial slides note that multi-task learning subsumes multi-label learning, which in turn subsumes multi-class learning. Some people also regard multi-task learning as a subset of transfer learning, but that view appears in informal write-ups rather than published papers, so it carries less authority. Jiayu Zhou, Jianhui Chen and Jieping Ye, Multi-Task Learning: Theory, Algorithms, and Applications, SIAM International Conference on Data Mining, 2012
How to do Multitask learning? Multi-task learning is a technique wherein a primary learning task is solved jointly with additional related tasks using a shared input representation. If these secondary tasks are chosen well, the shared structure serves to improve generalization of the model, and its accuracy on an unseen test set. In multi-task learning, the key aspect is choosing appropriate secondary tasks for the network to learn. When choosing secondary tasks for multi-task learning, one should select a task that is related to the primary task, but gives more information about the structure of the problem. Speaker note (from the Microsoft ICASSP 2013 paper): MTL shares a representation between the primary task and additional related auxiliary tasks. If the auxiliary tasks are chosen well, the shared structure improves the model's generalization and its accuracy on an unseen test set. The key question in MTL is how to choose appropriate auxiliary tasks: they should be related to the primary task while providing additional information about the structure of the problem. In this paper the primary task is phone recognition, and three auxiliary tasks are evaluated: predicting the current phone label, predicting the left and right state labels, and predicting the left and right phone labels; predicting the left and right phone labels works best. M. L. Seltzer and J. Droppo (2013). Multi-task learning in deep neural networks for improved phoneme recognition, ICASSP.
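To make the shared-representation idea concrete, here is a minimal PyTorch-style sketch, not the paper's actual model: a primary head and an auxiliary head branch off the same shared hidden layers, and training minimizes the primary loss plus a weighted auxiliary loss. The layer sizes, class counts, and the auxiliary-loss weight aux_weight are illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of multi-task learning with a shared
# representation: a primary head (e.g. phone labels) and an auxiliary head
# (e.g. context labels) share the same hidden layers.
import torch
import torch.nn as nn

class SharedRepresentationMTL(nn.Module):
    def __init__(self, input_dim, hidden_dim, primary_classes, auxiliary_classes):
        super().__init__()
        # Hidden layers shared by all tasks
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Task-specific output layers
        self.primary_head = nn.Linear(hidden_dim, primary_classes)
        self.auxiliary_head = nn.Linear(hidden_dim, auxiliary_classes)

    def forward(self, x):
        h = self.shared(x)
        return self.primary_head(h), self.auxiliary_head(h)

# Illustrative sizes, not the paper's configuration
model = SharedRepresentationMTL(input_dim=440, hidden_dim=1024,
                                primary_classes=183, auxiliary_classes=183)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(features, primary_labels, auxiliary_labels, aux_weight=0.3):
    optimizer.zero_grad()
    primary_logits, auxiliary_logits = model(features)
    # Joint objective: primary loss plus a weighted auxiliary loss
    loss = criterion(primary_logits, primary_labels) \
         + aux_weight * criterion(auxiliary_logits, auxiliary_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time only the primary head would be used; the auxiliary head exists solely to shape the shared representation during training.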
The Beginning of Multitask learning Multitask learning has many names and incarnations including learning-to-learn, meta-learning, lifelong learning, and inductive transfer. [1] J. Baxter. Learning internal representations. In Proceedings of the International ACM Workshop on Computational Learning Theory, 1995. [2] S. Thrun and L.Y. Pratt. Learning to Learn. Kluwer Academic, 1997. [3] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997. [4] S. Thrun. Is learning the n-th thing any easier than learning the first?, NIPS, 1995. Early implementations of multitask learning primarily investigated neural network or nearest neighbor learners [1][3][4]. In addition to neural approaches, Bayesian methods have been explored that implement multitask learning by assuming dependencies between the various models and tasks [5][6]. [5] T. Heskes. Solving a huge number of similar tasks: A combination of multi-task learning and a hierarchical Bayesian approach. ICML, 1998. [6] T. Heskes. Empirical Bayes for learning to learn. ICML, 2004. Speaker note: papers with multi-task ideas began to appear around 1993, and in 1997 two works consolidated the field (Learning to Learn and Multitask Learning); these are the two everyone cites. Multi-task learning has now been studied for about 20 years. Single-task learning ignores the connections between tasks, yet real-world learning tasks are usually closely intertwined, e.g., multi-label image classification or face recognition; such problems can be decomposed into multiple sub-tasks, and the strength of multi-task learning is that it can uncover the relations among these sub-tasks while still distinguishing their differences. Multitask learning goes by many names, such as learning to learn, meta-learning, lifelong learning, and inductive transfer. The earlier works [1][3][4] used neural networks or nearest neighbor learners; besides neural approaches, Bayesian methods have also been used, which assume dependencies between the various models and tasks [5][6]. T. Jebara (2011). Multitask Sparsity via Maximum Entropy Discrimination. Journal of Machine Learning Research, (12):75-110.
1997 Multitask learning (1) Multitask Learning is an approach to inductive transfer that improves learning for one task by using the information contained in the training signals of other related tasks. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. A task will be learned better if we can leverage the information contained in the training signals of other related tasks during learning. Speaker note: this grew out of Caruana's PhD thesis and is the earliest MTL paper; multi-task work on neural networks usually cites it. MTL is a form of inductive transfer that improves one task by using the information in the training signals of other related tasks; the tasks are learned in parallel with a shared representation, so what each task learns can help the other tasks learn better. R. Caruana (1997). Multitask learning. Machine Learning, 28(1), 41–75.
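One generic way to write the joint training objective implied by this shared-representation setup (a sketch in modern notation, not Caruana's own formulas): with shared parameters theta_s, task-specific parameters theta_t, and task weights lambda_t,

```latex
\min_{\theta_s,\;\{\theta_t\}_{t=1}^{T}}\;
  \sum_{t=1}^{T} \lambda_t \sum_{i=1}^{N_t}
  \ell\!\left(y_i^{(t)},\, f_t\!\left(x_i^{(t)};\, \theta_s,\, \theta_t\right)\right)
```

Because every per-task loss backpropagates through the same theta_s (the shared hidden layers), the training signals of all tasks jointly shape the common representation.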
1997 Multitask learning (2) This paper reviews prior work on MTL, presents new evidence that MTL in backprop nets discovers task relatedness without the need of supervisory signals. We present an algorithm and results for multitask learning with case-based methods like k-nearest neighbor and kernel regression, and sketch an algorithm for multitask learning in decision trees. Speaker note: the paper reviews earlier MTL work and shows that MTL in neural networks can discover task relatedness on its own, without being told; it also presents MTL algorithms for k-nearest neighbor and kernel regression, and sketches how MTL can be applied to decision trees. Figure 2. Multitask Backpropagation (MTL) of four tasks with the same inputs. R. Caruana (1997). Multitask learning. Machine Learning, 28(1), 41–75.
1997 Multitask learning (3): Learning Rate in Backprop MTL Usually better performance is obtained in backprop MTL when all tasks learn at similar rates and reach best performance at roughly the same time. If the main task trains long before the extra tasks, it cannot benefit from what has not yet been learned for the extra tasks. If the main task trains long after the extra tasks, it cannot shape what is learned for the extra tasks. Moreover, if the extra tasks begin to overtrain, they may cause the main task to overtrain too because of the overlap in hidden layer representation. Speaker note: the key issue is the per-task learning rate in backprop MTL. Performance is usually best when all tasks learn at similar rates and reach their best performance at roughly the same time. If the main task finishes training long before the extra tasks, it cannot benefit from what the extra tasks have not yet learned; if it trains long after them, it can no longer shape what they learn. And if the extra tasks start to overtrain, the main task may overtrain as well, because the tasks share the hidden-layer representation. A simple remedy is to start every task with the same learning rate and train once; then lower the learning rate of whichever tasks converge fastest and train again, repeating a few times until all tasks converge at roughly the same time. R. Caruana (1997). Multitask learning. Machine Learning, 28(1), 41–75.
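The rate-balancing loop described in the note could be sketched as follows. This is an illustrative outline, not Caruana's exact procedure; train_and_get_convergence_epoch is a hypothetical helper that trains the shared network with the given per-task learning rates and reports the epoch at which each task's validation error stopped improving, and the 0.8 threshold and decay factor are assumptions.

```python
# Sketch: equalize per-task convergence times by repeatedly lowering the
# learning rate of whichever tasks converge earliest.
# `train_and_get_convergence_epoch` is a hypothetical helper (see lead-in).
def balance_task_learning_rates(tasks, initial_lr=0.01, decay=0.5, rounds=5):
    lrs = {task: initial_lr for task in tasks}
    for _ in range(rounds):
        convergence_epoch = train_and_get_convergence_epoch(tasks, lrs)
        latest = max(convergence_epoch.values())
        for task in tasks:
            # Tasks that finish well before the slowest one get a smaller rate,
            # so that all tasks reach their best performance at about the same time.
            if convergence_epoch[task] < 0.8 * latest:
                lrs[task] *= decay
    return lrs
```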
1997 Learning to learn Given a family of tasks, training experience for each of these tasks, and a family of performance measures (e.g., one for each task), an algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks. Put differently, a learning algorithm whose performance does not depend on the number of learning tasks, which hence would not benefit from the presence of other learning tasks, is not said to learn to learn. For an algorithm to fit this definition, some kind of transfer must occur between multiple tasks that must have a positive impact on expected task-performance. Speaker note: this book defines learning to learn. Given a family of related tasks, training experience for each task, and a way to measure performance, an algorithm learns to learn if its performance on each task improves with experience and with the number of tasks; conversely, an algorithm whose performance does not depend on the number of tasks, and hence does not benefit from the other tasks, does not learn to learn. For an algorithm to fit the definition, some transfer must take place between tasks, and that transfer must have a positive effect on expected performance. Example: face recognition. Unless all faces look alike, a model trained to recognize one person cannot be used directly to recognize another. In practice, however, all face-recognition tasks can be assumed to share certain invariances, such as the same person under different expressions, different head poses, or different lighting directions; if this invariance information can be shared across the learning tasks, recognition accuracy improves. S. Thrun and L. Pratt (1997). Learning to Learn. Norwell, MA, USA: Kluwer.
2004 Regularized multi–task learning (1) Past empirical work has shown that learning multiple related tasks from data simultaneously can be advantageous in terms of predictive performance relative to learning these tasks independently. In this paper we present an approach to multi–task learning based on the minimization of regularization functionals similar to existing ones, such as the one for Support Vector Machines (SVMs), that have been successfully used in the past for single–task learning. Our approach allows to model the relation between tasks in terms of a novel kernel function that uses a task–coupling parameter. Speaker note: multi-task work based on SVMs generally cites this paper. Past experience shows that learning multiple related tasks from data simultaneously works better than learning each task separately. The paper formulates MTL as the minimization of a regularization functional and, taking SVMs as the example, derives a multi-task SVM, connecting MTL with single-task SVM learning and giving the detailed solution procedure; the experiments confirm the advantage of the multi-task SVM. The paper's key assumption is that the decision hyperplanes of all tasks share a central hyperplane and are obtained from it by task-specific offsets; the offset together with the central hyperplane determines each task's hyperplane. Task relatedness is modeled through a novel kernel function with a task-coupling parameter. T. Evgeniou and M. Pontil (2004). Regularized multi–task learning, In Proc. of the 10th SIGKDD Int'l Conf. on Knowledge discovery and data mining
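In simplified notation, the shared-hyperplane-plus-offset assumption and the resulting regularized objective look roughly like the following (a sketch; see the paper for the exact constants and the dual/kernel form):

```latex
w_t = w_0 + v_t, \qquad t = 1,\dots,T
\min_{w_0,\,\{v_t\},\,\{\xi_{it}\}} \;
  \sum_{t=1}^{T}\sum_{i=1}^{m} \xi_{it}
  \;+\; \frac{\lambda_1}{T}\sum_{t=1}^{T}\lVert v_t\rVert^{2}
  \;+\; \lambda_2\,\lVert w_0\rVert^{2}
\text{s.t.}\quad y_{it}\,\bigl(w_t\cdot x_{it}\bigr)\;\ge\; 1-\xi_{it}, \qquad \xi_{it}\ge 0 .
```

Here w_0 is the shared central hyperplane and v_t the task-specific offset; lambda_1 and lambda_2 control how strongly the tasks are coupled (a large lambda_1 forces the offsets to be small, pulling every task toward the common model).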
2004 Regularized multi–task learning (2) When there are relations between the tasks to learn, it can be advantageous to learn all tasks simultaneously instead of following the more traditional approach of learning each task independently of the others. There has been a lot of experimental work showing the benefits of such multi–task learning relative to individual task learning when tasks are related, see [*]. [*] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi–task learning. JMLR, 4: 83–99, 2003. [*] R. Caruana. Multi–Task Learning. Machine Learning, 28, p. 41–75, 1997. [*] T. Heskes. Empirical Bayes for learning to learn. Proceedings of ICML–2000, ed. Langley, P., pp. 367–374, 2000. [*] S. Thrun and L. Pratt. Learning to Learn. Kluwer Academic Publishers, 1997. In this paper we develop methods for multi–task learning that are natural extensions of existing kernel based learning methods for single task learning, such as Support Vector Machines (SVMs). To the best of our knowledge, this is the first generalization of regularization–based methods from single–task to multi–task learning. T. Evgeniou and M. Pontil (2004). Regularized multi–task learning, In Proc. of the 10th SIGKDD Int'l Conf. on Knowledge discovery and data mining
2004 Regularized multi–task learning (3) A statistical learning theory based approach to multi–task learning has been developed in [1-3]. [1] J. Baxter. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling. Machine Learning, 28, pp. 7–39, 1997. [2] J. Baxter. A Model for Inductive Bias Learning. Journal of Artificial Intelligence Research, 12, p. 149–198, 2000. [3] S. Ben-David and R. Schuller. Exploiting Task Relatedness for Multiple Task Learning, COLT, 2003. The problem of multi–task learning has been also studied in the statistics literature [4-5]. [4] L. Breiman and J.H. Friedman. Predicting Multivariate Responses in Multiple Linear Regression. Royal Statistical Society Series B, 1998. [5] P.J. Brown and J.V. Zidek. Adaptive Multivariate Ridge Regression. The Annals of Statistics, Vol. 8, No. 1, p. 64–74, 1980. Finally, a number of approaches for learning multiple tasks or for learning to learn are Bayesian, where a probability model capturing the relations between the different tasks is estimated simultaneously with the models' parameters for each of the individual tasks [6-9]. [6] G.M. Allenby and P.E. Rossi. Marketing Models of Consumer Heterogeneity. Journal of Econometrics, 89, p. 57–78, 1999. [7] N. Arora, G.M. Allenby, and J. Ginter. A Hierarchical Bayes Model of Primary and Secondary Demand. Marketing Science, 17, 1, p. 29–44, 1998. [8] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi–task learning, JMLR, 4: 83–99, 2003. [9] T. Heskes. Empirical Bayes for learning to learn. Proceedings of ICML–2000, ed. Langley, P., pp. 367–374, 2000. Speaker note: so what is the VC dimension, essentially? It captures the idea of degrees of freedom: how many features w the hypotheses can use and how many hypotheses H there are, which together determine the classification capacity d_VC. In other words, d_VC is essentially the capacity of the hypothesis set H, and can be thought of as roughly proportional to the number of features or hypotheses. T. Evgeniou and M. Pontil (2004). Regularized multi–task learning, In Proc. of the 10th SIGKDD Int'l Conf. on Knowledge discovery and data mining
2008 Convex multitask feature learning We study the problem of learning data representations that are common across multiple related supervised learning tasks. This is a problem of interest in many research areas. In this paper, we present a novel method for learning sparse representations common across many supervised learning tasks. In particular, we develop a novel non-convex multi-task generalization of the 1-norm regularization known to provide sparse variable selection in the single-task case. Our method learns a few features common across the tasks using a novel regularizer which both couples the tasks and enforces sparsity. For example, in computer vision the problem of detecting a specific object in images is treated as a single supervised learning task. Images of different objects may share a number of features that are different from the pixel representation of images. A. Argyriou, T. Evgeniou and M. Pontil. Convex multitask feature learning. Machine Learning, 73(3):243-272, 2008.
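The shared-sparsity idea can be written roughly as the following optimization (a sketch; the paper additionally constrains the feature transform U to be orthogonal and shows how the resulting non-convex problem can be reformulated as an equivalent convex one):

```latex
\min_{U,\,A}\;
  \sum_{t=1}^{T}\sum_{i=1}^{m}
  L\!\left(y_{ti},\, \langle a_t,\, U^{\top} x_{ti}\rangle\right)
  \;+\; \gamma\,\lVert A\rVert_{2,1}^{2},
\qquad
\lVert A\rVert_{2,1} \;=\; \sum_{j=1}^{d}\Bigl(\sum_{t=1}^{T} a_{jt}^{2}\Bigr)^{1/2}
```

Here a_t is the t-th column of the coefficient matrix A. The (2,1)-norm sums the 2-norms of the rows of A, so it drives entire rows (learned features) to zero for all tasks at once, which is the multi-task analogue of the 1-norm's variable selection in the single-task case.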
2008 Clustered multi-task learning: A convex formulation In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. L. Jacob, F. Bach, and J. Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008
2010 Multi-Task Learning for Boosting with Application to Web Search Ranking Multi-task learning algorithms aim to improve the performance of several learning tasks through shared models. Previous work focussed primarily on neural networks, k-nearest neighbors [1] and support vector machines [2]. In this paper, we introduce a novel multi-task learning algorithm for gradient boosting. [1] R. Caruana. Multitask learning. In Machine Learning, pages 41–75, 1997. [2] T. Evgeniou and M. Pontil. Regularized multi–task learning. In KDD, pages 109–117, 2004. Figure 1: (Multitask 𝜖-boosting) A layout of 4 ranking tasks that are learned jointly. The four countries symbolize the different ranking functions that need to be learned, where 𝛽1, ..., 𝛽4 are the parameter vectors that store the specifics of each individual task. The various tasks interact through the joint model, symbolized as a globe with parameter vector 𝛽0. O. Chappelle, P. Shivaswamy and S. Vadrevu, Multi-Task Learning for Boosting with Application to Web Search Ranking, ACM, 2010.
2011 Multitask sparsity via maximum entropy discrimination A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. Most machine learning approaches take a single-task perspective where one large homogeneous repository of uniformly collected iid (independent and identically distributed) samples is given and labeled consistently. A more realistic, multitask learning approach is to combine data from multiple smaller sources and synergistically leverage heterogeneous labeling or annotation efforts. The framework covers feature selection, kernel selection, adaptive pooling and graphical model structure. Speaker note (JMLR 2011): the MTL framework here is designed for discriminative classification and regression. Most machine learning methods take a single-task view, assuming the samples are i.i.d. and consistently labeled within one large homogeneous repository; a more realistic multitask approach combines data from several smaller sources and jointly leverages their heterogeneous labeling efforts. This can be read as a fairly comprehensive summary paper: it discusses four settings, namely feature selection, kernel selection, adaptive pooling, and graphical model structure, and introduces four multitask learning methods in detail. On i.i.d.: it stands for independent and identically distributed, meaning the samples come from the same distribution and are mutually independent. "Independent" in practice means that drawing the next observation does not change the sampling method or the sampled population based on the observations already drawn. For example, in a lottery where drawn tickets are not returned, whether earlier draws won affects the (conditional) probability for later draws; if tickets are returned, or each person draws from a separate box, earlier draws do not affect later ones, and the results are independent. T. Jebara (2011). Multitask Sparsity via Maximum Entropy Discrimination. Journal of Machine Learning Research, (12):75-110.
2012 Learning task grouping and overlap in multi-task learning (1) The key aspect in all multi-task learning methods is the introduction of an inductive bias in the joint hypothesis space of all tasks that reflects our prior beliefs about task relatedness structure. Assumptions that task parameters lie close to each other in some geometric sense [1], or parameters share a common prior [2][3][4], or they lie in a low dimensional subspace [1] or on a manifold [5], are some examples of introducing an inductive bias in the hope of achieving better generalization. [1] A. Argyriou, T. Evgeniou and M. Pontil. Convex multitask feature learning. Machine Learning, 73(3):243-272, 2008. [2] Yu, Kai, Tresp, Volker, and Schwaighofer, Anton. Learning Gaussian Processes from Multiple Tasks. In ICML, 2005. [3] Lee, S.I., Chatalbashev, V., Vickrey, D., and Koller, D. Learning a meta-level prior for feature relevance from multiple related tasks. In ICML, 2007. [4] Daumé III, Hal. Bayesian Multitask Learning with Latent Hierarchies. In UAI, 2009. [5] Agarwal, Arvind, Daumé III, Hal, and Gerber, Samuel. Learning Multiple Tasks using Manifold Regularization. In NIPS, 2010. A major challenge in multi-task learning is how to selectively screen the sharing of information so that unrelated tasks do not end up influencing each other. Speaker note: the key point shared by all MTL methods is an inductive bias in the joint hypothesis space of the tasks that reflects our prior beliefs about how the tasks are related; intuitively, related tasks share a common basis, and each task is adapted from it via its own offset. Different works make different assumptions about the task parameters: [1] assumes the task parameters lie close to one another in a low-dimensional subspace, while [2][3][4] assume the parameters share a common prior. The biggest challenge is how to selectively screen the sharing of information so that unrelated tasks do not end up influencing each other. A. Kumar, H. Daumé III (2012). Learning Task Grouping and Overlap in Multi-Task Learning, the 29th International Conference on Machine Learning.
2012 Learning task grouping and overlap in multi-task learning (2) Sharing information between two unrelated tasks can worsen the performance of both tasks. This phenomenon is also known as negative transfer. We propose a framework for multi-task learning that enables one to selectively share the information across the tasks. We assume that each task parameter vector is a linear combination of a finite number of underlying basis tasks. Our model is based on the assumption that task parameters within a group lie in a low dimensional subspace but allows the tasks in different groups to overlap with each other in one or more bases. Speaker note: sharing information between two unrelated tasks can be worse than training them separately; this phenomenon is called negative transfer. The proposed MTL framework lets tasks share information selectively: each task's parameter vector is assumed to be a linear combination of a finite number of underlying basis tasks. The model assumes that the parameters of tasks within a group lie in a low-dimensional subspace, while tasks in different groups may overlap in one or more bases. A. Kumar, H. Daumé III (2012). Learning Task Grouping and Overlap in Multi-Task Learning, the 29th International Conference on Machine Learning.
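The grouping-and-overlap assumption can be sketched as a matrix factorization of the task parameters, roughly of the following form (simplified notation; the exact regularizers and constants follow the paper):

```latex
W = L\,S, \qquad L \in \mathbb{R}^{d\times k},\;\; S \in \mathbb{R}^{k\times T},\;\; k \ll T
\min_{L,\,S}\;
  \sum_{t=1}^{T}\sum_{i=1}^{N_t}
  \ell\!\left(y_i^{(t)},\, (L\,s_t)^{\top} x_i^{(t)}\right)
  \;+\; \mu\,\lVert S\rVert_{1}
  \;+\; \lambda\,\lVert L\rVert_{F}^{2}
```

Each task's parameter vector w_t = L s_t is a sparse combination (column s_t) of the k latent basis tasks stored in L; tasks whose sparsity patterns in S coincide form a group, and tasks whose patterns partially overlap share one or more bases without being forced into the same group.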
2015 Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition It is well-known in machine learning that multitask learning (MTL) can help improve the generalization performance of individual learning tasks if the tasks being trained in parallel are related, especially when the amount of training data is relatively small. In this paper, we investigate the estimation of triphone acoustic models in parallel with the estimation of trigrapheme acoustic models under the MTL framework using a deep neural network (DNN). Speaker note: this paper applies multitask learning with a DNN, splitting the work into a triphone modeling task and a trigrapheme modeling task. D. Chen, C. Leung, Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition, ICASSP, 2015.
THANK YOU!