Ensemble Learning (集成学习)
Outline: What is Machine Learning? Background and Motivation; The AdaBoost Algorithm; How and Why AdaBoost Works; AdaBoost for Face Detection.
What is Machine Learning? Machine learning is one of the core research areas of artificial intelligence. Classic definition: using experience to improve a system's own performance. As the field has developed, it mainly deals with intelligent data analysis. A typical task: build a predictive model from existing data and then use it to make predictions on new data.
The Importance of Machine Learning — Machine learning supports many fields: bioinformatics, computational finance, molecular biology, planetary geology, industrial process control, robotics, remote-sensing information processing, information security, and more. Scientists at NASA's JPL wrote in Science (September 2001) that machine learning is playing an ever larger supporting role throughout the scientific research process and that the field will see steady and rapid development in the coming years.
Occam's Razor (also spelled Ockham's Razor) was proposed by William of Occam (c. 1285-1349), a 14th-century logician and Franciscan friar. The principle states "entities should not be multiplied beyond necessity", i.e., the principle of simple effectiveness. As he put it in his Commentary on the Sentences (Book 2, Question 15): do not waste more to do what can be done just as well with less.
Motivation — Generalization ability is a fundamental concern of machine learning: it characterizes how well a learning system applies to new events. The stronger the generalization ability, the better; improving it is the perpetual pursuit of machine learning.
Motivation — In machine learning, it is difficult to build a single high-performance classifier directly. However, if we can find a series of weaker classifiers and combine them, we may obtain a better classifier. Everyday democratic decision-making partly exploits this idea: in a presidential election each voter casts a ballot based on their own considerations, yet the president chosen by the majority is arguably better than one appointed by a single person.
Ensemble Learning: Motivation — Ensemble learning feeds the input to multiple learners and then combines their results in some way. Each of these learners is accordingly called a "weak learner". Ensemble learning was originally also called the "Committee Voting Method" because it resembles a voting process. Note the distinction between a learner and a classifier: a learner refers to any learning problem, while a classifier refers to classification (usually restricted to two classes).
Ensemble Learning: Illustration — [Figure: an input vector x is fed to classifiers h1(x), h2(x), ..., hN(x); their outputs are combined, e.g., as Σ αi hi(x), to produce the ensemble output.]
Ensemble Learning — Ensemble learning is a machine learning paradigm that uses multiple learners to solve the same problem. Because it can effectively improve the generalization ability of a learning system, ensemble learning has become a hot research topic in the international machine learning community and was named "the first of the four current research directions in machine learning" [T.G. Dietterich, AIMag97].
Example: Weather Forecast — [Figure: the predictions of five forecasters (1-5) are compared with reality, with X marking errors; each forecaster errs on different days, so combining the five corrects most of the mistakes.]
Intuitions — Majority vote: suppose we have 5 completely independent classifiers, each with 70% accuracy. A majority vote (at least 3 of the 5 correct, i.e., the cases of exactly 3, 4, or 5 correct votes) is then right with probability $\binom{5}{3}(0.7)^3(0.3)^2 + \binom{5}{4}(0.7)^4(0.3) + (0.7)^5 \approx 83.7\%$. With 101 such classifiers, majority-vote accuracy reaches about 99.9%.
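These figures can be checked with a short calculation. A minimal sketch (the function name and the use of Python's `math.comb` are my own choices, not from the slides) that sums the binomial probabilities of a correct majority:

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right answer."""
    need = n_classifiers // 2 + 1  # votes needed for a strict majority
    return sum(comb(n_classifiers, k) * p**k * (1 - p)**(n_classifiers - k)
               for k in range(need, n_classifiers + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.999
```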
How to Build a Good Ensemble [A. Krogh & J. Vedelsby, NIPS94] — [Figure: voting ensembles of three individual classifiers, each with 33.3% accuracy; depending on whether the individuals err on the same or on different examples, the ensemble accuracy can stay at 33.3% or even drop to 0%.] The individuals must be diverse, and their accuracy must not be too low: the more accurate and the more diverse the individual learners are, the better the ensemble.
Selective Ensemble — If an ensemble of multiple learners is better than a single learner, are more learners always better? More individuals mean a larger computational cost at prediction time, because more individual predictions must be computed, and a larger storage cost, because more individuals must be stored. Moreover, as the number of individuals grows, diversity among them becomes harder and harder to obtain. [A. Krogh & J. Vedelsby, NIPS94]
Selective Ensemble — When a set of individual learners is available, selecting some of them to build the ensemble may be better than using all of them. This motivated Selective Ensemble and the proof of the "Many Could be Better Than All" theorem, both in classification and in regression. Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: many could be better than all. Artificial Intelligence, 2002, 137(1-2): 239-263.
The selective-ensemble idea is general: using multiple individual solutions to a problem and selecting among them can give better results, and this idea can be applied in many other fields. The basic selection principle: choose individuals with high utility and large diversity.
Ensemble Learning: How to Construct It? — The approach is to change the training set. An ordinary learning algorithm produces different learners from different training sets, so different learners can be constructed by varying the training set and then combined.
Weighted Sampling: Discussion — By assigning different weights to the training data, each learner is in effect made to focus on a particular part of the training set, which fits the original democratic-voting idea. Intuitively, since each learner focuses on part of the training set, many such learners together should cover most of it, and by choosing the weights of the weighted average cleverly we can obtain better learning performance.
Covering the Sample Space with Multiple Learners — If the learners keep covering the same region over and over, nothing is gained; they should cover different parts of the sample space.
Resampling Techniques for Classifier Design — Resampling techniques for classifier design are also known as arcing (adaptive reweighting and combining). The main idea is to construct multiple classifiers from the same training sample set and then combine them into a single classifier in some way. The main methods are the bagging algorithm and the boosting algorithm.
The Bagging Algorithm — Basic idea: draw training examples with replacement, so that each base classifier gets a training set about the same size as the original but different from the others, and therefore the trained base classifiers differ; this is the simplest and most intuitive ensemble method based on manipulating the training set. Procedure: independently and randomly draw n' examples (n' <= n) from the original data set D of size n to form a bootstrap data set; repeat to produce multiple independent bootstrap data sets; train a "component classifier" on each bootstrap data set; the final classification is decided by a vote over the component classifiers' individual decisions. A minimal sketch is given below.
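The procedure can be sketched in a few lines. This is only an illustration, not the implementation behind the slides: the base learner (scikit-learn's DecisionTreeClassifier), the number of component classifiers, and the assumption of NumPy arrays with labels in {-1, +1} are my own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # base learner chosen for illustration

def bagging_fit(X, y, n_estimators=25, random_state=0):
    """Train one component classifier on each bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final label = majority vote of the component classifiers (labels in {-1, +1})."""
    votes = np.stack([m.predict(X) for m in models])  # shape (n_estimators, n_samples)
    return np.sign(votes.sum(axis=0))
```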
The Boosting Algorithm — Training procedure for a two-class problem with three component classifiers: randomly select n1 samples from the original sample set D of size n to form D1, and train a classifier C1 on D1; from D - D1, build a set D2 containing equal numbers of samples that C1 classifies correctly and incorrectly, and train a classifier C2 on D2; collect all samples in D - D1 - D2 on which C1 and C2 disagree into D3, and train a classifier C3 on it.
Classification with the Boosted Classifiers — To classify a new sample x: if C1 and C2 agree, assign x that class; otherwise take C3's result as the class of x. [Figure: original sample set → component classifiers → combined classifier.]
Steps of the Boosting Algorithm — First assign every training example the same weight and train the first base classifier; test it on the training set and increase the weights of the misclassified examples (in practice, the weights of correctly classified examples are decreased); train a second base classifier on the reweighted training set; repeat this process until a sufficiently good learner is obtained.
Comparison of the Bagging and Boosting Algorithms
AdaBoost & Its Applications Overview
AdaBoost: Introduction — AdaBoost = Adaptive + Boosting: a learning algorithm for building a strong classifier out of a lot of weaker ones.
AdaBoost Concept — [Figure: weak classifiers, each only slightly better than random, are combined into a strong classifier.]
Weak Classifiers — Each weak classifier learns by considering one simple feature, and the T most beneficial features for classification should be selected. Key questions: how to define the features? how to select the beneficial features? how to train the weak classifiers? how to manage (weight) the training samples? how to associate a weight with each weak classifier?
The Strong Classifier — How good will the strong classifier built from these weak classifiers be?
The AdaBoost Algorithm
Given: training examples $(x_1, y_1), \ldots, (x_m, y_m)$ with $x_i \in X$ and $y_i \in \{-1, +1\}$.
Initialization: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
For $t = 1, \ldots, T$:
Find the classifier $h_t : X \to \{-1, +1\}$ that minimizes the error with respect to $D_t$, i.e., $h_t = \arg\min_{h_j} \epsilon_j$ with $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$.
Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.
Update the distribution: $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i)) / Z_t$, where $Z_t$ is a normalization factor.
Output the final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
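For concreteness, here is a minimal sketch of the algorithm above in Python. The interface is hypothetical: `weak_learner(X, y, D)` is assumed to return a classifier `h` (callable on feature arrays) trained to minimize the weighted error under the distribution `D`.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost sketch: X of shape (n, d), labels y in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/n
    classifiers, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)              # h_t minimizing the weighted error
        pred = h(X)
        eps = D[pred != y].sum()               # epsilon_t
        if eps >= 0.5:                         # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D *= np.exp(-alpha * y * pred)         # reweight the examples
        D /= D.sum()                           # normalize by Z_t
        classifiers.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, classifiers)))
```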
Boosting illustration Weak Classifier 1
Boosting illustration Weights Increased
Boosting illustration Weak Classifier 2
Boosting illustration Weights Increased
Boosting illustration Weak Classifier 3
Boosting illustration Final classifier is a combination of weak classifiers
The AdaBoost Algorithm — What goal does AdaBoost want to reach? The particular choices made in the algorithm above (the classifier weight $\alpha_t$ and the distribution update $D_{t+1}$) are goal dependent.
Goal — Final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$. AdaBoost is built to minimize the exponential loss $J(H) = E\!\left[e^{-yH(x)}\right]$; minimizing the exponential loss also tends to maximize the margin $yH(x)$.
Minimizing the Exponential Loss — Define $H_t(x) = \sum_{s=1}^{t} \alpha_s h_s(x)$, so that $H(x) = \mathrm{sign}(H_T(x))$, and let $w_i^{(t)} = e^{-y_i H_{t-1}(x_i)}$ be the (unnormalized) weight of example $i$ at time $t$; at time 1 the weights are uniform. Then the loss to be minimized at time $t$ is
$J(\alpha_t, h_t) = \sum_i e^{-y_i\left(H_{t-1}(x_i) + \alpha_t h_t(x_i)\right)} = \sum_i w_i^{(t)} e^{-\alpha_t y_i h_t(x_i)} = e^{-\alpha_t}\!\!\sum_{i:\,y_i = h_t(x_i)}\!\! w_i^{(t)} + e^{\alpha_t}\!\!\sum_{i:\,y_i \neq h_t(x_i)}\!\! w_i^{(t)} = \left(e^{\alpha_t} - e^{-\alpha_t}\right)\sum_i w_i^{(t)}\,[y_i \neq h_t(x_i)] + e^{-\alpha_t}\sum_i w_i^{(t)}.$
Minimizing over $h_t$: for any $\alpha_t > 0$, $J$ is smallest when the weighted error $\epsilon_t = \sum_i D_t(i)\,[y_i \neq h_t(x_i)]$ (with $D_t(i) \propto w_i^{(t)}$) is smallest, so the weak learner is trained to minimize the weighted error; the decrease of the loss is maximized by the $h_t$ with the smallest $\epsilon_t$.
Minimizing over $\alpha_t$: setting $\partial J / \partial \alpha_t = 0$ gives $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.
From time $t$ to time $t+1$: $w_i^{(t+1)} = w_i^{(t)} e^{-\alpha_t y_i h_t(x_i)}$, which after normalization is exactly the update $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t$ used in the algorithm; at time 1, $D_1(i) = 1/m$. Each round of AdaBoost is therefore a greedy, stage-wise minimization of the exponential loss.
The Task of Face Detection Many slides adapted from P. Viola
Basic Idea — Slide a window across the image and evaluate a face model at every location.
Image Features — Rectangle filters: [Figure: Haar-like rectangle filters; a feature's value is the sum of the pixels in the white region minus the sum of the pixels in the shaded region.]
Size of the Feature Space — How many possible rectangle features are there in a 24x24 detection region? [Figure: the rectangle filter types: two-rectangle (A, B), three-rectangle (C), and four-rectangle (D) features.] A counting sketch follows below.
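As a rough sanity check on the size of this feature space, the sketch below counts placements of five Haar-like shapes (horizontally and vertically split two-rectangle features, two three-rectangle features, and one four-rectangle feature) at every position and integer scale inside a 24x24 window. The exact total depends on which shapes are counted; this count gives about 160,000, while Viola and Jones report a somewhat larger figure.

```python
def count_rectangle_features(window=24):
    """Count Haar-like feature placements inside a window x window region.
    Each shape is a grid of cw x ch equal cells; the feature is slid over every
    position and stretched to every size whose width/height are multiples of cw/ch."""
    shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]  # (cells wide, cells tall)
    total = 0
    for cw, ch in shapes:
        for w in range(cw, window + 1, cw):
            for h in range(ch, window + 1, ch):
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_rectangle_features(24))  # 162336
```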
Feature Selection — Out of all these possible rectangle features, which ones are good for face detection?
Feature Selection — Can we create a good classifier using just a small subset of all possible features? How do we select such a subset?
Computing the Integral Image — The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive. It can be computed in a single pass through the image.
Computing the Sum within a Rectangle — Let A, B, C, and D be the values of the integral image at the corners of a rectangle (D top-left, B top-right, C bottom-left, A bottom-right). The sum of the original image values inside the rectangle is then A - B - C + D: only 3 additions are required for any size of rectangle. See the sketch below.
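A small sketch of both operations, assuming a NumPy image; the inclusive row/column index convention is my own choice.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of the original image over rows top..bottom and columns left..right
    (inclusive), from four corner lookups of the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```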
Scaling — The integral image lets us evaluate rectangle features of any size in constant time, so no image scaling is necessary: scale the rectangular features instead of the image.
Boosting Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier A weak learner need only do better than chance Training consists of multiple boosting rounds During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners “Hardness” is captured by weights attached to training examples Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.
The AdaBoost Algorithm (recap) — Given training examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, initialize $D_1(i) = 1/m$; for $t = 1, \ldots, T$: find the classifier $h_t$ that minimizes the weighted error $\epsilon_t$ with respect to $D_t$, weight it by $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and update the distribution $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t$; output the final classifier $H(x) = \mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right)$.
Weak Learners for Face Detection — What base learner is appropriate for face detection in this algorithm?
Weak Learners for Face Detection — Each weak classifier thresholds a single rectangle feature: $h_j(x) = 1$ if $p_j f_j(x) < p_j \theta_j$ and $0$ otherwise, where $x$ is a 24x24 sub-window, $f_j$ is the value of the rectangle feature, $\theta_j$ is the threshold, and $p_j$ is a parity indicating the direction of the inequality.
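A sketch of such a weak classifier and of a brute-force search for its threshold and parity. The {-1, +1} output (instead of the {1, 0} in the formula above) and the helper names are my own choices, so the pieces plug into the AdaBoost sketch earlier.

```python
import numpy as np

def stump_classify(feature_values, threshold, parity):
    """h(x) = +1 if parity * f(x) < parity * threshold, else -1."""
    return np.where(parity * feature_values < parity * threshold, 1, -1)

def best_stump(feature_values, y, D):
    """Search thresholds and parities for the lowest weighted error under D."""
    best_thr, best_parity, best_err = None, None, np.inf
    for thr in np.unique(feature_values):
        for parity in (+1, -1):
            pred = stump_classify(feature_values, thr, parity)
            err = D[pred != y].sum()
            if err < best_err:
                best_thr, best_parity, best_err = thr, parity, err
    return best_thr, best_parity, best_err
```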
Boosting — The training set contains face and non-face examples, initially all with equal weight. For each round of boosting: evaluate each rectangle filter on each example; select the best threshold for each filter; select the best filter/threshold combination; reweight the examples. Computational complexity of learning: O(MNK) for M rounds, N examples, and K features.
Features Selected by Boosting — First two features selected by boosting: [figure]. This two-feature combination can already yield a 100% detection rate with a 50% false positive rate.
ROC Curve for a 200-Feature Classifier — A 200-feature classifier can yield a 95% detection rate at a false positive rate of 1 in 14084. Not good enough! To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.
Attentional Cascade — We start with simple classifiers that reject many of the negative sub-windows while detecting almost all positive sub-windows. A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on; a negative outcome at any point leads to the immediate rejection of the sub-window. [Figure: image sub-window → Classifier 1 → Classifier 2 → Classifier 3 → face; a negative response at any stage sends the sub-window to non-face.] A minimal evaluation sketch follows below.
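The evaluation logic of the cascade is simple enough to sketch directly; `stages` (a list of strong-classifier/threshold pairs) is an assumed representation, not the authors' data structure.

```python
def cascade_classify(window, stages):
    """Attentional cascade: a window is a face only if it passes every stage.
    Any negative response rejects it immediately, so most non-face windows exit early."""
    for strong_classifier, threshold in stages:
        if strong_classifier(window) < threshold:
            return False   # NON-FACE: rejected at this stage
    return True            # FACE: survived all stages
```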
Attentional Cascade — Chain classifiers that are progressively more complex and have lower false positive rates. [Figure: per-stage ROC curve (% detection vs. % false positives) and the cascade diagram.]
Detection Rate and False Positive Rate for Chained Classifiers — The detection rate and the false positive rate of the cascade are obtained by multiplying the respective rates of the individual stages. For example, a detection rate of 0.9 and a false positive rate on the order of $10^{-6}$ can be achieved by a 10-stage cascade in which each stage has a detection rate of 0.99 ($0.99^{10} \approx 0.9$) and a false positive rate of about 0.30 ($0.3^{10} \approx 6 \times 10^{-6}$).
Training the Cascade — Set target detection and false positive rates for each stage. Keep adding features to the current stage until its target rates have been met; the AdaBoost threshold must be lowered to maximize detection (as opposed to minimizing total classification error), with the rates measured on a validation set. If the overall false positive rate is not yet low enough, add another stage, using the false positives of the current stage as the negative training examples for the next stage. A sketch of this loop is given below.
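A sketch of this training loop, under the assumption of a helper `train_stage(pos, neg, f_stage, d_stage)` that builds one AdaBoost stage (adding features and lowering its threshold until the per-stage targets are met on a validation set) and returns the stage together with the negatives it still accepts; all names and the simple product-based stopping rule are illustrative.

```python
def train_cascade(pos, neg, f_target, f_stage, d_stage, train_stage):
    """Grow cascade stages until the estimated overall false positive rate
    drops below f_target or no negative examples remain."""
    stages, overall_fp = [], 1.0
    while overall_fp > f_target and neg:
        stage, remaining_negatives = train_stage(pos, neg, f_stage, d_stage)
        stages.append(stage)
        overall_fp *= f_stage        # overall rate is the product of stage rates
        neg = remaining_negatives    # false positives become the next stage's negatives
    return stages
```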
ROC Curves: Cascaded Classifier vs. Monolithic Classifier — There is little difference between the two in terms of accuracy, but a big difference in terms of speed: the cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.
The Implemented System — Training data: 5000 faces, all frontal and rescaled to 24x24 pixels, and 300 million non-face sub-windows drawn from 9500 non-face images. The faces are normalized for scale and translation and cover many variations: across individuals, illumination, and pose.
Structure of the Detector Cascade — Successively more complex classifiers are combined in a cascade; the final detector has 38 stages with a total of 6060 features. [Figure: all sub-windows enter stage 1; each stage either rejects the sub-window or passes it on, and only sub-windows surviving all 38 stages are declared faces.]
Structure of the Detector Cascade — Stage 1 uses 2 features and rejects about 50% of the non-faces while detecting 100% of the faces; stage 2 uses 10 features and rejects about 80% of the remaining non-faces while detecting 100% of the faces; the later stages use 25, 50, and more features, chosen by the training algorithm.
Speed of the Final Detector — On a 700 MHz Pentium III processor, the face detector processes a 384x288 pixel image in about 0.067 seconds (roughly 15 Hz), about 15 times faster than previous detectors of comparable accuracy (Rowley et al., 1998). On the test set, an average of only 8 features are evaluated per window.
Image Processing — During training, all example sub-windows were variance-normalized to minimize the effect of different lighting conditions; detection windows are variance-normalized as well. The required mean and variance can be computed from the integral image and the integral image of the squared image; a sketch follows below.
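A sketch of how the window statistics can be read off the two integral images, reusing the `rect_sum` helper from the integral-image sketch above; the interface is illustrative.

```python
def window_mean_var(ii, ii_sq, top, left, size):
    """Mean and variance of a size x size sub-window, using
    var = E[x^2] - (E[x])^2 with sums taken from the integral image (ii)
    and the integral image of the squared image (ii_sq)."""
    n = size * size
    s = rect_sum(ii, top, left, top + size - 1, left + size - 1)
    sq = rect_sum(ii_sq, top, left, top + size - 1, left + size - 1)
    mean = s / n
    return mean, sq / n - mean ** 2

# Variance normalization divides the (mean-subtracted) window by sqrt(var);
# in practice this can be folded into the feature values instead of the pixels.
```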
Scanning the Detector — Scaling is achieved by scaling the detector itself rather than the image; good detection results are obtained with a scale factor of 1.25. The detector is also scanned across locations: subsequent locations are obtained by shifting the window by $[s\Delta]$ pixels, where $s$ is the current scale and $\Delta$ is the step size; results for $\Delta = 1.0$ and $\Delta = 1.5$ were reported. A scanning sketch is given below.
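A sketch of the multi-scale scan; `evaluate_cascade(x, y, size)` is an assumed hook that evaluates the (rescaled) cascade on the sub-window at (x, y), and the loop bounds are illustrative.

```python
def scan_image(width, height, evaluate_cascade, base_size=24,
               scale_factor=1.25, step=1.0):
    """Scan a detector of growing size across the image: the window (not the
    image) is scaled, and at scale s the window is shifted by round(s * step) pixels."""
    detections, scale = [], 1.0
    size = base_size
    while size <= min(width, height):
        shift = max(1, round(scale * step))
        for y in range(0, height - size + 1, shift):
            for x in range(0, width - size + 1, shift):
                if evaluate_cascade(x, y, size):
                    detections.append((x, y, size))
        scale *= scale_factor
        size = round(base_size * scale)
    return detections
```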
Merging Multiple Detections
ROC Curves for Face Detection
Output of Face Detector on Test Images
Other Detection Tasks — Facial feature localization, profile detection, and male vs. female classification.
Other Detection Tasks — [Figures: facial feature localization and profile detection results.]
Other Detection Tasks Male vs. Female
Conclusions — How AdaBoost works; why AdaBoost works; AdaBoost for face detection: rectangle features, integral images for fast feature computation, boosting for feature selection, and an attentional cascade for fast rejection of negative windows.