Ensemble Learning (集成学习)


Outline: What is Machine Learning? Background and Motivation. The AdaBoost Algorithm. How and Why AdaBoost Works. AdaBoost for Face Detection.

What is Machine Learning? Machine learning is one of the core research areas of artificial intelligence. Classic definition: improving a system's own performance through experience. As the field has developed, its main focus has become intelligent data analysis. Typical task: build a predictive model from existing data, then use it to make predictions on new data.

The Importance of Machine Learning Machine learning supports many fields: bioinformatics, computational finance, molecular biology, planetary geology, industrial process control, robotics, remote sensing information processing, information security, and more. Scientists at NASA's JPL wrote in Science (September 2001) that machine learning is playing an ever larger supporting role across the whole process of scientific research, and that the field will see steady and rapid development in the coming years.

Occam's Razor (also spelled Ockham's Razor) was proposed by William of Occam (c. 1285-1349), a 14th-century logician and Franciscan friar. The principle says "entities should not be multiplied beyond necessity", i.e., the principle of simplicity and effectiveness. As he wrote in his commentary on the Sentences (Book 2, Question 15): it is futile to do with more what can be done equally well with less.


Motivation Generalization ability is a fundamental concern of machine learning. Generalization ability characterizes how well a learning system applies to new, unseen events. The stronger the generalization ability, the better; improving it is machine learning's perpetual pursuit.

Motivation In machine learning, directly building a single high-performance classifier is very difficult. However, if we can find a series of weaker classifiers and combine them, we may obtain a better classifier. Everyday democratic decision-making partly exploits this idea: in a presidential election, each voter casts a ballot based on his or her own considerations, yet the president chosen by the majority should be better than one appointed by a single person.

[Ensemble Learning: Motivation] Ensemble learning feeds the input to multiple learners and then combines their results by some method. Each of these learners is correspondingly called a "weak learner". Ensemble learning was originally also called the "Committee Voting Method", because it resembles a voting process. Note the distinction between a learner and a classifier: "learner" refers to any learning problem, while "classifier" refers to classification (usually restricted to two classes).

[Ensemble Learning: Illustration] (Figure: an input vector x is fed to N classifiers h1(x), h2(x), ..., hn(x); their outputs are combined, e.g., as a weighted sum Σ αi·hi(x), to produce the ensemble output.)

Ensemble Learning Ensemble learning is a machine learning paradigm that uses multiple learners to solve the same problem. Because it can effectively improve the generalization ability of a learning system, it has become a hot topic in the international machine learning community, ranked "first among the four current research directions in machine learning" [T.G. Dietterich, AIMag97].

Example: Weather Forecast (Figure: the true weather, "Reality", is compared with forecasters 1 through 5, each of whom makes errors, marked X, on different days, and with their combination, which cancels out many of the individual mistakes.)

Intuitions Majority vote. Suppose we have 5 completely independent classifiers. If the accuracy of each is 70%, the majority vote achieves 10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + (0.7^5) ≈ 83.7% accuracy. With 101 such classifiers, the majority vote accuracy reaches 99.9%. Explanation of the 83.7% formula: a majority decision needs at least 3 of the 5 classifiers to be correct, so the three terms correspond to exactly 3, 4, and 5 correct votes, i.e., the C(5,3), C(5,4), and C(5,5) cases.
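As a quick check of these numbers, here is a small Python snippet (an illustration, not from the slides) that computes the majority-vote accuracy of n independent classifiers, each correct with probability p:

```python
from math import comb

def majority_vote_accuracy(n: int, p: float) -> float:
    """Probability that strictly more than half of n independent classifiers are correct."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.999 (essentially 1)
```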

[How to Build a Good Ensemble] (Figure: two voting examples, each comparing the desired output with three individual learners of accuracy 33.3%; in one arrangement the ensemble obtained by voting also has accuracy 33.3%, in the other the ensemble accuracy drops to 0%.) The individuals must be diverse, and individual accuracy must not be too low: the more accurate and the more diverse the individual learners, the better the ensemble. [A. Krogh & J. Vedelsby, NIPS94]

Selective Ensemble Since an ensemble of multiple learners is better than a single learner, are more learners always better? More individuals mean a larger computational cost at prediction time, because more individual predictions must be computed, and a larger storage cost, because more individuals must be kept. Moreover, as the number of individuals grows, it becomes harder and harder to maintain diversity among them. [A. Krogh & J. Vedelsby, NIPS94]

Selective Ensemble When a set of individual learners is available, selecting a subset of them for the ensemble may be better than using all of them. This led to the proposal of the Selective Ensemble and the proof of the "Many Could be Better Than All" theorem, in both classification and regression. Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: many could be better than all. Artificial Intelligence, 2002, 137(1-2): 239-263.

The selective-ensemble idea is general: using multiple individuals and then selecting among them can yield better results. (Figure: a problem and its many candidate individual solutions.) The idea can be applied in many other fields. Basic selection principle: pick individuals that are highly useful and mutually diverse.

[Ensemble Learning: How to Construct It?] The approach is to vary the training set. A typical learning algorithm produces different learners from different training sets, so we can construct different learners by varying the training set and then combine them.

[Weighted Sampling: Discussion] By assigning different weights to the training data, each learner in effect focuses on some part of the training set, which matches the original democratic-voting intuition. Intuitively, if each learner attends to one part of the training set, many such learners together should cover most of it, and with a clever choice of the weights in the weighted average we can obtain a better overall learner.

[Covering the Sample Space with Multiple Learners] If the learners keep covering the same region over and over, nothing is gained.

[Resampling Techniques for Classifier Design] Resampling techniques for classifier design are also known as arcing (adaptive reweighting and combining). The main idea is to construct multiple classifiers from the same training sample set and then combine them into a single classifier in some way. The main methods are the bagging algorithm and the boosting algorithm.

[The Bagging Algorithm] Basic idea: sample the training set with replacement so that each base classifier gets its own training set, roughly the same size as the original but with different contents, and train a different base classifier on each. This is the simplest and most intuitive ensemble method based on manipulating the training set. Procedure: from the original data set D of size n, independently and randomly draw n' samples (n' <= n) with replacement to form a bootstrap data set; repeat this process to produce multiple independent bootstrap sets; train one "component classifier" on each bootstrap set; the final classification is decided by a vote over the component classifiers' individual decisions.
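A minimal sketch of bagging in Python, assuming scikit-learn's DecisionTreeClassifier as the base learner and integer class labels; the function names and parameters are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagging_fit(X, y, n_estimators=25, random_state=0):
    """Train one component classifier per bootstrap sample of the training set."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the component classifiers' decisions."""
    votes = np.stack([m.predict(X) for m in models])  # shape (n_estimators, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```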

[The Boosting Algorithm] Training procedure for a 2-class problem with 3 component classifiers: Randomly select n1 samples from the original sample set D of size n to form D1, and train a classifier C1 on D1. From D - D1, select samples to form D2 such that half of them are correctly classified by C1 and half are misclassified by C1, and train a classifier C2 on D2. Let D3 consist of all samples in D - D1 - D2 on which C1 and C2 disagree, and train a classifier C3 on D3.

[Classification with the Boosted Classifiers] To classify a new sample x: if C1 and C2 give the same result, assign x to that class; otherwise take C3's result as the class of x. (Figure: original sample set, component classifiers, combined classifier.)

[Boosting Algorithm Steps] Boosting: first assign every training example the same weight, then train the first base classifier and test it on the training set; increase the weights of the misclassified examples (in the actual algorithm, the weights of the correctly classified examples are decreased), train the second base classifier on the reweighted training set, and repeat this process until a sufficiently good learner is obtained.

[Comparison of the Bagging and Boosting Algorithms]

AdaBoost & Its Applications Overview

AdaBoost Introduction AdaBoost = Adaptive Boosting: a learning algorithm for building a strong classifier out of a lot of weaker ones.

AdaBoost Concept (Figure: a set of weak classifiers, each only slightly better than random guessing, are combined into a strong classifier.)

The Weak Classifiers Each weak classifier learns by considering one simple feature, and the T features most beneficial for classification should be selected. Key questions: How do we define the features? How do we select the beneficial features? How do we train the weak classifiers? How do we manage (weight) the training samples? How do we associate a weight with each weak classifier?

The Strong Classifier How good will the strong classifier built from these weak classifiers, each only slightly better than random, turn out to be?

The AdaBoost Algorithm
Given: training examples (x_1, y_1), ..., (x_m, y_m) with labels y_i in {-1, +1}.
Initialization: D_1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
Find the classifier h_t that minimizes the error with respect to D_t, i.e., h_t = argmin_h ε_h, where ε_h = Σ_i D_t(i)·[y_i ≠ h(x_i)]; write ε_t for the error of the chosen h_t (require ε_t < 1/2).
Weight the classifier: α_t = (1/2)·ln((1 - ε_t)/ε_t).
Update the distribution: D_{t+1}(i) = D_t(i)·exp(-α_t·y_i·h_t(x_i)) / Z_t, where Z_t normalizes D_{t+1} to sum to 1.
Output the final classifier: H(x) = sign(Σ_t α_t·h_t(x)).
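The algorithm above can be sketched compactly in Python/NumPy. Decision stumps (single-feature threshold classifiers) are assumed here as the weak learners purely for illustration, and labels are assumed to be in {-1, +1}:

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, parity) stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for parity in (+1, -1):
                pred = np.where(parity * X[:, j] < parity * thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, parity)
    return best

def adaboost_fit(X, y, T=20):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, j, thr, parity = fit_stump(X, y, D)
        err = min(max(err, 1e-12), 1 - 1e-12)  # keep alpha finite
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
        pred = np.where(parity * X[:, j] < parity * thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # distribution update
        D = D / D.sum()                        # normalization by Z_t
        ensemble.append((alpha, j, thr, parity))
    return ensemble

def adaboost_predict(ensemble, X):
    score = np.zeros(len(X))
    for alpha, j, thr, parity in ensemble:
        score += alpha * np.where(parity * X[:, j] < parity * thr, 1, -1)
    return np.sign(score)                      # H(x) = sign(sum_t alpha_t h_t(x))
```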

Boosting illustration Weak Classifier 1

Boosting illustration Weights Increased

Boosting illustration Weak Classifier 2

Boosting illustration Weights Increased

Boosting illustration Weak Classifier 3

Boosting illustration Final classifier is a combination of weak classifiers

The AdaBoost Algorithm What goal does AdaBoost want to reach?

The AdaBoost Algorithm What goal does AdaBoost want to reach? The steps of the algorithm, i.e., which classifier to pick, how to weight it, and how to update the distribution, are goal dependent.

Goal Final classifier: H(x) = sign(f(x)), where f(x) = Σ_t α_t·h_t(x). Minimize the exponential loss Σ_i exp(-y_i·f(x_i)).

Goal Final classifier: H(x) = sign(f(x)) with f(x) = Σ_t α_t·h_t(x). Minimize the exponential loss Σ_i exp(-y_i·f(x_i)); equivalently, push the margin y·f(x) (often written y·H(x)) to be large, since examples with a large positive margin contribute little to the loss.

Minimizing the Exponential Loss (the derivation these slides step through)
Final classifier: H(x) = sign(f_T(x)), where f_T(x) = Σ_{t=1..T} α_t·h_t(x).
Define f_t(x) = Σ_{s=1..t} α_s·h_s(x), so that f_t(x) = f_{t-1}(x) + α_t·h_t(x), with f_0(x) = 0.
The exponential loss after round t is
L_t = Σ_i exp(-y_i·f_t(x_i)) = Σ_i exp(-y_i·f_{t-1}(x_i))·exp(-α_t·y_i·h_t(x_i)).
Define the distribution D_t(i) ∝ exp(-y_i·f_{t-1}(x_i)), normalized to sum to 1. Then, up to a constant factor,
L_t ∝ Σ_i D_t(i)·exp(-α_t·y_i·h_t(x_i)) = exp(-α_t)·Σ_{i: h_t(x_i)=y_i} D_t(i) + exp(α_t)·Σ_{i: h_t(x_i)≠y_i} D_t(i) = exp(-α_t)·(1 - ε_t) + exp(α_t)·ε_t,
where ε_t = Σ_i D_t(i)·[y_i ≠ h_t(x_i)] is the weighted error of h_t.
For any fixed α_t > 0, this is minimized when the weighted accuracy Σ_{i: h_t(x_i)=y_i} D_t(i) is maximized, i.e., when h_t is chosen to minimize the weighted error ε_t.
Setting the derivative with respect to α_t to zero gives α_t = (1/2)·ln((1 - ε_t)/ε_t).
Finally, comparing the definition of the distribution at time t and at time t+1:
D_{t+1}(i) ∝ exp(-y_i·f_t(x_i)) = exp(-y_i·f_{t-1}(x_i))·exp(-α_t·y_i·h_t(x_i)) ∝ D_t(i)·exp(-α_t·y_i·h_t(x_i)),
which, after normalization by Z_t, is exactly the AdaBoost distribution update; at time 1 the distribution is uniform, D_1(i) = 1/m.
So the three design choices of AdaBoost (the greedy choice of h_t, the classifier weight α_t, and the distribution update) all follow from greedily minimizing the exponential loss.

The Task of Face Detection Many slides adapted from P. Viola

Basic Idea Slide a window across the image and evaluate a face model at every location.

Image Features Rectangle filters

Image Features Rectangle filters: a feature's value is the sum of the pixels in the white rectangles minus the sum of the pixels in the shaded rectangles.

Size of Feature Space How many possible rectangle features are there for a 24x24 detection region? (Figure: the rectangle filter types, labeled A, B, C, and D.)

Feature Selection How many possible rectangle features are there for a 24x24 detection region? What features are good for face detection? (Figure: the rectangle filter types A, B, C, D.)

Feature Selection Can we create a good classifier using just a small subset of all the possible features? How can we select such a subset?

Integral Images The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive.

Computing the Integral Image The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive. It can be computed quickly in one pass through the image using the recurrences s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y), where i is the input image, s is a cumulative sum, and ii is the integral image.
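As an illustration (the helper name is mine, not from the slides), the integral image can be computed with two cumulative sums in NumPy, which gives the same result as the one-pass recurrence above:

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """ii[y, x] = sum of img[0..y, 0..x], i.e., all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)
```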

Computing the Sum within a Rectangle Using the integral image, the sum of the pixels inside any rectangle can be obtained from the integral-image values at its four corners; only 3 additions/subtractions are required for any size of rectangle! (Figure: a rectangle with its corner points labeled A, B, C, D.)
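A sketch of that corner lookup, assuming the integral_image helper above, 0-indexed (row, column) coordinates, and inclusive rectangle bounds; the corner handling here is my own convention and may not match the letters in the slide's figure:

```python
import numpy as np

def rect_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> float:
    """Sum of pixels in the rectangle rows top..bottom, columns left..right (inclusive)."""
    total = float(ii[bottom, right])
    if top > 0:
        total -= ii[top - 1, right]      # strip above the rectangle
    if left > 0:
        total -= ii[bottom, left - 1]    # strip to the left of the rectangle
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]   # corner block was subtracted twice; add it back
    return total
```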

Scaling The integral image enables us to evaluate rectangle features of any size in constant time. Therefore, no image scaling is necessary: scale the rectangular features instead! (Figure: the detector evaluated at several scales.)

Boosting Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier. A weak learner need only do better than chance. Training consists of multiple boosting rounds; during each round, we select a weak learner that does well on examples that were hard for the previous weak learners, where "hardness" is captured by weights attached to the training examples. Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

The AdaBoost Algorithm (review) Find the weak classifier h_t that minimizes the weighted error ε_t with respect to D_t; weight it by α_t = (1/2)·ln((1 - ε_t)/ε_t); update the distribution D_{t+1}(i) ∝ D_t(i)·exp(-α_t·y_i·h_t(x_i)); output the final classifier H(x) = sign(Σ_t α_t·h_t(x)).

Weak Learners for Face Detection What base learner is appropriate for face detection?

Weak Learners for Face Detection Each weak learner looks at a single rectangle feature f computed on the detection window x and thresholds its value: h(x) = 1 if p·f(x) < p·θ, and 0 otherwise, where θ is the threshold and p is a parity (±1) indicating the direction of the inequality.
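A minimal sketch of such a weak classifier, reusing rect_sum from above; the two-rectangle feature is only an example of how a rectangle filter's value might be computed, and the names are mine:

```python
def two_rect_feature(ii, top, left, height, width):
    """Example rectangle feature: sum of the left half minus sum of the right half."""
    left_sum = rect_sum(ii, top, left, top + height - 1, left + width - 1)
    right_sum = rect_sum(ii, top, left + width, top + height - 1, left + 2 * width - 1)
    return left_sum - right_sum

def weak_classify(feature_value, theta, parity):
    """h(x) = 1 if parity * f(x) < parity * theta, else 0."""
    return 1 if parity * feature_value < parity * theta else 0
```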

Boosting The training set contains face and non-face examples, initially with equal weights. For each round of boosting: evaluate each rectangle filter on each example; select the best threshold for each filter; select the best filter/threshold combination; reweight the examples. Computational complexity of learning: O(MNK), for M rounds, N examples, and K features.
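One way to implement the "select the best threshold for each filter" step is a single sweep over the examples sorted by feature value, keeping running sums of positive and negative weights; the sketch below is an illustration of that idea, with names and boundary conventions of my own choosing:

```python
import numpy as np

def best_threshold(fvals, y, w):
    """Return (weighted error, threshold, parity) for one rectangle filter.
    fvals: feature values; y: labels in {0, 1} (1 = face); w: example weights."""
    order = np.argsort(fvals)
    fvals, y, w = fvals[order], y[order], w[order]
    total_pos, total_neg = w[y == 1].sum(), w[y == 0].sum()
    seen_pos = seen_neg = 0.0
    best = (np.inf, None, None)
    for i in range(len(fvals)):
        # error if everything with value below fvals[i] is labeled face (parity +1),
        # and the symmetric case (parity -1)
        err_plus = seen_neg + (total_pos - seen_pos)
        err_minus = seen_pos + (total_neg - seen_neg)
        err, parity = min((err_plus, +1), (err_minus, -1))
        if err < best[0]:
            best = (err, fvals[i], parity)
        if y[i] == 1:
            seen_pos += w[i]
        else:
            seen_neg += w[i]
    return best
```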

Features Selected by Boosting The first two features selected by boosting (figure): this feature combination can yield a 100% detection rate and a 50% false positive rate.

ROC Curve for the 200-Feature Classifier A 200-feature classifier can yield a 95% detection rate at a false positive rate of 1 in 14084. Not good enough! To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.

Attentional Cascade We start with simple classifiers that reject many of the negative sub-windows while detecting almost all positive sub-windows. A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window. (Figure: an image sub-window passes through Classifier 1, Classifier 2, Classifier 3, ...; a T response moves it to the next classifier and ultimately to "face", while an F response at any stage sends it to "non-face".)

Attentional Cascade Chain classifiers that are progressively more complex and have lower false positive rates. (Figure: an ROC curve, % detection vs. % false positives, alongside the cascade diagram.)

Detection Rate and False Positive Rate for Chained Classifiers The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages. A detection rate of 0.9 and a false positive rate on the order of 10^-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.99^10 ≈ 0.9) and a false positive rate of about 0.30 (0.3^10 ≈ 6×10^-6).
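A quick check of those numbers (simple arithmetic under the stated per-stage rates):

```python
stages = 10
stage_detection_rate = 0.99
stage_false_positive_rate = 0.30

print(stage_detection_rate ** stages)       # ~0.904  -> overall detection rate
print(stage_false_positive_rate ** stages)  # ~5.9e-6 -> overall false positive rate
```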

Training the Cascade Set target detection and false positive rates for each stage. Keep adding features to the current stage until its target rates have been met; the AdaBoost threshold needs to be lowered to maximize detection (as opposed to minimizing the total classification error). Test on a validation set. If the overall false positive rate is not yet low enough, add another stage. Use the false positives from the current stage as the negative training examples for the next stage.
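The training loop can be summarized in Python-style pseudocode. Everything here is a sketch of the procedure described above: train_adaboost_stage, lower_threshold_for_detection, evaluate, and cascade_accepts are assumed helper functions, not code from the slides or the original paper:

```python
def train_cascade(positives, negatives, validation,
                  stage_detection_target=0.99, stage_fp_target=0.30,
                  overall_fp_target=1e-6):
    cascade, overall_fp = [], 1.0
    while overall_fp > overall_fp_target and negatives:
        n_features, stage = 0, None
        while True:
            n_features += 1
            # keep adding features (boosting rounds) until this stage meets its targets
            stage = train_adaboost_stage(positives, negatives, n_features)
            lower_threshold_for_detection(stage, positives, stage_detection_target)
            detection, fp = evaluate(stage, validation)
            if detection >= stage_detection_target and fp <= stage_fp_target:
                break
        cascade.append(stage)
        overall_fp *= fp
        # false positives of the current cascade become negatives for the next stage
        negatives = [x for x in negatives if cascade_accepts(cascade, x)]
    return cascade
```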


ROC Curves: Cascaded Classifier vs. Monolithic Classifier

ROC Curves: Cascaded Classifier vs. Monolithic Classifier There is little difference between the two in terms of accuracy, but a big difference in terms of speed: the cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.

The Implemented System Training data: 5000 faces, all frontal and rescaled to 24x24 pixels; 300 million non-face sub-windows drawn from 9500 non-face images. The faces are normalized for scale and translation, and show many variations: across individuals, illumination, and pose.

Structure of the Detector Cascade Successively more complex classifiers are combined in a cascade; the 38 stages include a total of 6060 features. (Figure: all sub-windows enter stage 1; each stage either passes a sub-window on (T) or rejects it (F); only sub-windows that survive all 38 stages are declared faces.)

Structure of the Detector Cascade Stage 1: 2 features, rejects 50% of non-faces while detecting 100% of faces. Stage 2: 10 features, rejects 80% of non-faces while detecting 100% of faces. Later stages use 25 features, 50 features, and so on, chosen by the algorithm. (Figure: the same 38-stage cascade diagram.)

Speed of the Final Detector On a 700 MHz Pentium III processor, the face detector can process a 384x288 pixel image in about 0.067 seconds (about 15 Hz), roughly 15 times faster than previous detectors of comparable accuracy (Rowley et al., 1998). An average of 8 features are evaluated per window on the test set.

Image Processing Training: all example sub-windows were variance-normalized to minimize the effect of different lighting conditions. Detection: sub-windows are variance-normalized as well. The normalization can be computed using integral images: the mean comes from the integral image of the image, and the sum of squared pixels (hence the variance, σ² = (1/N)·Σx² - μ²) from the integral image of the squared image.

Scanning the Detector Scaling is achieved by scaling the detector itself rather than the image; good detection results are obtained with a scaling factor of 1.25. The detector is also scanned across locations: subsequent locations are obtained by shifting the window by Δ·s pixels, where s is the current scale and Δ is the step size. Results for Δ = 1.0 and Δ = 1.5 were reported.

Merging Multiple Detections

ROC Curves for Face Detection

Output of Face Detector on Test Images

Other Detection Tasks Facial feature localization, profile detection, and male vs. female classification.

Other Detection Tasks Facial feature localization; profile detection.

Other Detection Tasks Male vs. Female

Conclusions How AdaBoost works; why AdaBoost works; AdaBoost for face detection: rectangle features, integral images for fast feature computation, boosting for feature selection, and an attentional cascade for fast rejection of negative windows.