Lecture 3 : Classification (1)

Lecture 3 : Classification (1)
Prof. 楊立偉, wyang@ntu.edu.tw
These slides are adapted from the slides accompanying the book Introduction to Information Retrieval, Ch. 13-14.

Recent Progress in Machine Learning (1)
Supervised learning: algorithms and models; computational complexity and scale; big data.
Top 10 Data Mining Algorithms (IEEE ICDM 2006):
1 C4.5 and Beyond (decision trees and rules)
2 k-Means (clustering)
3 Support Vector Machines (SVM) (classification)
4 Apriori (association analysis)
5 Expectation Maximization (EM) (maximum likelihood estimation)
6 PageRank (link analysis)
7 AdaBoost (adaptive boosting)
8 k-Nearest Neighbor (kNN) (classification)
9 Naive Bayes (NB) (classification)
10 Classification and Regression Trees (CART) (decision trees)

Recent Progress in Machine Learning (2)
"What Artificial Intelligence Can and Can't Do Right Now" by Andrew Ng, Harvard Business Review, Nov. 2016.
From training to automation: learning to reproduce input-to-output mappings from examples; goal-trial-reward loops in which agents compete against one another.
Applications: face recognition, loan approval, targeted advertising, speech recognition, machine translation, autonomous driving.

Recent Progress in Machine Learning (3)
From Neural Networks (NN) to Deep Learning (DL): backpropagation and the gradient descent algorithm.

Recurrent Neural Networks (RNN) Ref: Recurrent neural network based language model. 2010

Long Short-Term Memory (LSTM) RNN Ref: http://193.6.4.39/~czap/letoltes/IS14/IS2014/PDF/AUTHOR/IS141304.PDF, 2014

Convolutional Neural Networks (CNN) Ref: ImageNet classification with deep convolutional neural networks. 2012

Text Classification

A text classification task: Email spam filtering
From: '''' <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
How would you write a program that would automatically detect and delete this type of message?

Formal definition of TC: Training
Given:
- A document space X. Documents are represented in this space, typically some type of high-dimensional space.
- A fixed set of classes C = {c1, c2, . . . , cJ}. The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
- A training set D of labeled documents, where each labeled document <d, c> ∈ X × C.
Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C
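To make this setup concrete, here is a minimal Python sketch with illustrative types of our own (none of these names come from the slides): documents are points in X, classes come from the fixed set C, and a learning method turns a training set of <d, c> pairs into a classifier γ.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Document = Dict[str, float]                   # a point in the document space X (sparse term vector)
Class = str                                   # a label from the fixed class set C
TrainingSet = List[Tuple[Document, Class]]    # labeled documents <d, c> in X x C
Classifier = Callable[[Document], Class]      # the learned classifier gamma : X -> C

def learn(training_set: TrainingSet) -> Classifier:
    """Placeholder learning method: always predict the majority class.
    Any real method (Naive Bayes, kNN, SVM) fits this same signature."""
    majority = Counter(c for _, c in training_set).most_common(1)[0][0]
    return lambda d: majority
```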

Formal definition of TC: Application/Testing
Given: a description d ∈ X of a document. Determine: γ(d) ∈ C, that is, the class that is most appropriate for d.

Topic classification
Some classification schemes are hierarchical, and a document may belong to more than one class.

Example
Classes (grouped by area): AI (ML, Planning), Programming (Semantics, Garb.Coll.), HCI (Multimedia, GUI).
Training data: ML: learning, intelligence, algorithm, reinforcement, network, ...; Planning: planning, temporal, reasoning, plan, language, ...; Semantics: programming, semantics, language, proof, ...; Garb.Coll.: garbage, collection, memory, optimization, region, ...
Test data: "planning language proof intelligence"
(ML: machine learning, HCI: human computer interface)

More examples of TC applications: Labeling
Assign labels to each document. Labels are most often topics, such as Yahoo-style categories, e.g., "finance," "sports," "news>world>asia>business". Labels may be languages or genres, e.g., "English", "Chinese", "French"; or "editorials", "movie-reviews", "news". Labels may be sentiments, e.g., "like", "hate", "neutral". Labels may be domain-specific binary decisions (does the document belong to a given domain or not), e.g., "interesting-to-me" vs. "not-interesting-to-me", "spam" vs. "not-spam", "contains adult language" vs. "doesn't".

More examples of TC Applications 新聞自動分類:將每日新聞自動歸類重新編排 專利自動分類:將申請的專利給予適當的類別以利分派審查 1999案件自動分類:將民眾抱怨快速分派給正確的單位做處理ˋ 用文件分類預測股價走勢:出些某些文件,之後七日內該股票之 之平均收盤價走高或走低(看漲文件及看跌文件)

Classification Methods (1)
Manual classification. Used by Yahoo, ODP, PubMed. Very accurate when the job is done by experts (it relies on domain and classification experts, so it is precise). Consistent when the problem size and team are small; when the volume of data is large, human judgment becomes subjective and inconsistent. Difficult and expensive to scale, so we need automatic methods for classification. (Note: ODP = Open Directory Project.)

Classification Methods (2)
Rule-based systems. Google Alerts is an example of rule-based classification. Assign a category if the document contains a given boolean combination of words, e.g. ("cultural creativity" | "cultural-creative") → the Culture category. Accuracy is often very high if a rule has been carefully refined over time by a subject expert, but building and maintaining these rules is cumbersome and expensive: a realistic rule looks more like ("cultural creativity" | "cultural-creative" | "film" | "crafts" | "art" | ...) → Culture and requires a great deal of enumeration and exclusion, as sketched below.
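A tiny sketch of such a boolean rule in Python; the keyword list is hypothetical and merely stands in for the hand-maintained rules described above (this is not Google Alerts or any real product's code).

```python
# Hypothetical rule: assign the Culture category if the document contains
# any keyword from a hand-maintained list; otherwise fall through to Other.
CULTURE_KEYWORDS = {"cultural creativity", "cultural-creative", "film", "crafts", "art"}

def rule_classify(text: str) -> str:
    text = text.lower()
    if any(keyword in text for keyword in CULTURE_KEYWORDS):
        return "Culture"
    return "Other"

print(rule_classify("A new film about traditional crafts opens this week."))  # -> Culture
```

In practice the keyword list keeps growing as the expert adds synonyms and exclusions, which is exactly the maintenance cost noted above.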

Classification Methods (3)
Statistical/probabilistic systems. Text classification as a learning problem: (i) supervised learning of the classification function γ, and (ii) its application to classifying new documents. Examples: Naive Bayes (simple, common method), k-Nearest Neighbors (simple, powerful), support vector machines (newer, more powerful). No free lunch: these require hand-classified training data, but the data can be built up (and refined) by non-experts.

Naïve Bayes

Overview
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It builds a generative model that approximates how the data is produced. It uses the prior probability of each category, i.e., the probability given no information about the item. Categorization produces a posterior (conditional) probability distribution over the possible categories given a description of the item. The trick: expand the conditional probability into probabilities that can actually be estimated, then compute them.

Posterior (conditional) probability
Example: the probability that a student is female (B) given that the student is 20 years old (A) is P(B|A) = 21/35 = 0.6, or, using the formula, P(B|A) = P(A∩B) / P(A) = 0.42/0.7 = 0.6.
Gender \ Age | 20 years old | not 20 | total
female       | 21           | 9      | 30
male         | 14           | 6      | 20
total        | 35           | 15     | 50

Bayes' Rule
The basic Bayes theorem is obtained by expanding each conditional (posterior) probability and rearranging terms: the probability that document X belongs to class C is P(C|X) = P(X|C) P(C) / P(X), where P(X), the probability of the given document X, is a constant.

Naive Bayes Classifiers (1)
Task: classify a new instance D, described by a tuple of attribute values D = <x1, x2, …, xn>, into one of the classes cj ∈ C. Naive Bayes classification finds the "best" class, the most likely or maximum a posteriori (MAP) class cMAP:
cMAP = argmax_{cj ∈ C} P(cj | x1, …, xn) = argmax_{cj ∈ C} P(x1, …, xn | cj) P(cj), since P(D) is constant.

Naive Bayes Classifiers (2)
P(cj) can be estimated from the frequency of classes in the training examples. P(x1,x2,…,xn|cj) could only be estimated directly if a very large number of training examples were available, so instead we apply the Naive Bayes conditional independence assumption.

Naive Bayes Assumption
Conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj), i.e. P(x1, …, xn | cj) = P(x1|cj) P(x2|cj) … P(xn|cj).
(Figure: a class node Flu with attribute nodes X1…X5 = fever, sinus, cough, runny nose, muscle-ache.)

Exercise
Classes (grouped by area): AI (ML, Planning), Programming (Semantics, Garb.Coll.), HCI (Multimedia, GUI). Training set sizes: AI 10 documents, Programming 6 documents, HCI 4 documents.
Training data term counts:
AI: learning 8, intelligence 4, algorithm 3, reinforcement 2, network 7, planning 6, temporal 2, reasoning 3, plan 7, language 6 (total 8+4+3+2+7+6+2+3+7+6 = 48 tokens)
Programming: programming 4, semantics 5, language 8, proof 3, garbage 2, collection 4, memory 8, optimization 5, region 3 (total 4+5+8+3+2+4+8+5+3 = 42 tokens)
Test data: "planning language proof intelligence"
(ML: machine learning, HCI: human computer interface)

Exercise (cont.)
Priors P(cj): P(AI) = 10/20, P(Programming) = 6/20.
Conditionals P(xk|cj): P(intelligence|AI) = 4/48, P(planning|AI) = 6/48, P(language|AI) = 6/48, P(language|Programming) = 8/42, P(proof|Programming) = 3/42.
Without smoothing, and using a sum (without logs) instead of a product as a rough approximation:
AI: 10/20 * (4/48 + 6/48 + 6/48) = 0.167 ← classified as AI
Programming: 6/20 * (8/42 + 3/42) = 0.078
(Note that "proof" never occurs in the AI training data, so a strict product would be zero without smoothing; that is why the additive approximation is used here. The sketch below reproduces these numbers.)
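A minimal sketch that reproduces the additive approximation above; the dictionaries and variable names are ours, only the counts come from the exercise.

```python
# Term counts and priors from the exercise; names below are illustrative.
priors = {"AI": 10 / 20, "Programming": 6 / 20}
counts = {
    "AI": {"learning": 8, "intelligence": 4, "algorithm": 3, "reinforcement": 2,
           "network": 7, "planning": 6, "temporal": 2, "reasoning": 3,
           "plan": 7, "language": 6},
    "Programming": {"programming": 4, "semantics": 5, "language": 8, "proof": 3,
                    "garbage": 2, "collection": 4, "memory": 8,
                    "optimization": 5, "region": 3},
}
totals = {c: sum(tc.values()) for c, tc in counts.items()}   # AI: 48, Programming: 42

test_doc = ["planning", "language", "proof", "intelligence"]

# Additive approximation used on the slide (no smoothing, sum instead of product).
for c in priors:
    score = priors[c] * sum(counts[c].get(w, 0) / totals[c] for w in test_doc)
    print(c, score)   # AI ~0.167, Programming ~0.079 (the slide truncates this to 0.078)
```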

Naive Bayes: Learning
From the training corpus, first extract the Vocabulary (all distinct words). Then calculate the required P(cj) and P(xk|cj) terms (compute up front whatever can be computed):
For each cj in C do
  docsj ← the subset of documents whose target class is cj
  P(cj) ← |docsj| / (total number of documents)
  Concatenate all docsj into a single document Textj
  n ← total number of word occurrences in Textj
  For each word xk in Vocabulary
    nk ← number of occurrences of xk in Textj
    P(xk|cj) ← (nk + 1) / (n + |Vocabulary|)
The +1 / +|Vocabulary| adjustment is smoothing, to avoid over-fitting (avoid zero probabilities); a runnable sketch follows.
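A compact Python sketch of the learning procedure above, assuming documents arrive as (token list, label) pairs; the function and variable names are ours.

```python
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class_label) pairs.
    Returns (vocab, priors P(c), smoothed conditionals P(x|c))."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)
    priors, cond = {}, {}
    for c, doc_list in class_docs.items():
        priors[c] = len(doc_list) / len(docs)
        text_c = Counter(w for tokens in doc_list for w in tokens)  # concatenated Text_j
        n = sum(text_c.values())
        # add-one smoothing: P(x|c) = (n_k + 1) / (n + |Vocabulary|)
        cond[c] = {w: (text_c[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, cond
```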

Naive Bayes: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return cNB, where cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
(take the class with the highest probability; multiply the conditional probabilities of the words that actually occur)

Smoothing to avoid over-fitting
If a document contains a term x which never appears in the category c, then P(x|c) will be zero, and the whole product will be zero too. Add-one smoothing avoids these zeros:
Before: P(xk|cj) = nk / n
After:  P(xk|cj) = (nk + 1) / (n + |Vocabulary|)
where nk is the number of occurrences of xk in the concatenated text of class cj and n is the total number of tokens in that text.

Naive Bayes: Training

Naive Bayes: Testing
Convert the product of probabilities into a sum of logs (which avoids floating-point underflow); a classification sketch follows.
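A sketch of the testing step that pairs with the train_nb sketch above: it scores each class by log P(c) plus the sum of log P(x|c) over in-vocabulary tokens and returns the argmax class (again, all names are ours).

```python
import math

def classify_nb(tokens, vocab, priors, cond):
    """Return the class maximizing log P(c) + sum of log P(x|c) over known tokens."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in vocab:                      # ignore out-of-vocabulary words
                score += math.log(cond[c][w])   # logs turn the product into a sum
        scores[c] = score
    return max(scores, key=scores.get)
```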

Naïve Bayes : discussion

Violation of NB Assumptions
Conditional independence: can we really assume that the occurrences of two words are independent events? The issue is similar to the one in the vector space model (VSM): is each pair of term axes in the vector space really orthogonal? Conclusion: Naive Bayes can work well even though the conditional independence assumptions are badly violated, because classification is about predicting the correct class, not about accurately estimating the probabilities.

NB with Feature Selection (1)
Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more. If the features (the characteristics that represent a document or class) are chosen well, feature selection reduces training time (training time for some methods is quadratic or worse in the number of features) and can improve generalization (performance), because it eliminates noise features and avoids over-fitting.

NB with Feature Selection (2)
Two ideas beyond TF-IDF (two criteria for judging how good a feature is):
Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (the chi-square test, χ²)
Information theory: how much information does the value of one categorical variable give you about the value of another? (mutual information, MI)
They are similar, but χ² measures confidence in the association (based on the available statistics), while MI measures the extent of the association (assuming perfect knowledge of the probabilities).

Naive Bayes is not so naive
Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97). It is more robust to nonrelevant features than some more complex learning methods, and more robust to concept drift (the definition of a class changing over time). It is better than methods like decision trees when we have many equally important features. It is a good, dependable baseline for text classification (but not the best). It is optimal if the independence assumptions hold (never true for text, but true for some domains). It is very fast: learning takes one pass over the data, and testing is linear in the number of attributes and the document collection size. It has low storage requirements.

Naïve Bayes : evaluation

Evaluation on Reuters

Example: The Reuters collection

A Reuters document

Evaluating classification
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances). It is easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set). Measures: precision, recall, F1, classification accuracy.

Precision P and recall R
                        in the class         not in the class
assigned to the class   TP (true positive)   FP (false positive)
not assigned            FN (false negative)  TN (true negative)
(this contingency table is also known as the confusion matrix)
P = TP / (TP + FP)
R = TP / (TP + FN)

A combined measure: F
F1 allows us to trade off precision against recall. It is the harmonic mean of P and R: F1 = 2PR / (P + R), i.e. 1/F1 = ( 1/P + 1/R ) / 2. A small computation sketch follows.
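A few lines computing P, R and F1 from confusion-matrix counts; the counts in the example call are invented purely for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """P, R and their harmonic mean F1 from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f1(tp=40, fp=10, fn=20))  # approximately (0.8, 0.667, 0.727)
```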

Averaging: Micro vs. Macro
We now have an evaluation measure (F1) for one class, but we also want a single number that measures the aggregate performance over all classes in the collection.
Macroaveraging: compute F1 for each of the |C| classes, then average these |C| numbers.
Microaveraging: compute TP, FP, FN for each of the |C| classes, sum these counts (e.g., all the TPs to get the aggregate TP), then compute F1 from the aggregate TP, FP, FN.

Exercise
                          Class A   Class B
# of records              800       200
accuracy                  80%       60%
macro-average accuracy: (80% + 60%) / 2 = 70%
micro-average accuracy: (800*80% + 200*60%) / (800 + 200) = 76%
If classes A and B contain the same number of records, the macro average equals the micro average; the micro average can be viewed as an average weighted by class size. A short averaging sketch follows.
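A couple of lines reproducing the macro and micro averages in this exercise (the class names and dictionaries are just illustrative):

```python
records = {"A": 800, "B": 200}
accuracy = {"A": 0.80, "B": 0.60}

macro = sum(accuracy.values()) / len(accuracy)                                   # (0.80 + 0.60) / 2 = 0.70
micro = sum(records[c] * accuracy[c] for c in records) / sum(records.values())   # 0.76

print(macro, micro)
```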

Naive Bayes vs. other methods

Alternative measurement
Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system. When class sizes differ greatly, classification accuracy is easily distorted. For example, if Class A has 800 records and Class B has 200, a classifier that always guesses A already has an accuracy of at least 80%. In such cases you should still examine the precision and recall of each class.

Case study : WebKB Experiment
Classify webpages from CS departments into: student, faculty, course, project. Train on ~5,000 hand-labeled web pages from Cornell, Washington, U. Texas, and Wisconsin. Crawl and classify a new site (CMU) for testing. Results:

NB Model Comparison

Case study : Apache SpamAssassin
A Naive Bayes classifier for spam filtering, widely used in spam filters; there are many derived variants. Classic Naive Bayes is superior when appropriately used. Many email filters use NB classifiers, but they also mix in many other techniques: black hole lists (blacklists), etc.

Naïve Bayes on spam email

kNN : K Nearest Neighbors

Vector space classification
As before, the training set is a set of documents, each labeled with its class. In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space. Premise 1: documents in the same class form a contiguous region (documents of the same class lie close together in the space). Premise 2: documents from different classes don't overlap. We define lines, surfaces, and hypersurfaces to divide the regions (between different classes we can find a separating line or (hyper)plane).

Classes in a Vector Space Government Science Arts

Test document = Government? Is the similarity hypothesis true in general? (Figure: Government, Science and Arts regions in the vector space with a test document.)

k Nearest Neighbor Classification
To classify document d into class c: define the k-neighborhood N as the k nearest neighbors of d; count the number i of documents in N that belong to c; estimate P(c|d) as i/k; choose as the class argmax_c P(c|d) (= the majority class). The trick: pick out the k nearest neighbors and take a vote, then label the document with whichever class most of them belong to, as sketched below.
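A self-contained sketch of kNN voting over sparse document vectors with cosine similarity; the representation and names are our own choices, not prescribed by the slides.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, training_set, k=5):
    """training_set: list of (vector, label). Vote among the k most similar documents."""
    neighbors = sorted(training_set, key=lambda ex: cosine(d, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```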

Example: k = 5 (5NN). P(Science | test doc)? P(Government | test doc)? (Figure: Government, Science and Arts regions with the test document and its five nearest neighbors.)

Nearest-Neighbor Learning Algorithm
Learning is just storing the representations of the training examples in D. For a testing instance x: compute the similarity between x and all examples in D (compare document x with every training document), then assign x the category of the most similar example in D (this decides the class of x). The method does not explicitly compute a generalization or category prototype. It is also called: case-based learning, memory-based learning, lazy learning.

kNN algorithm

Time complexity of kNN
kNN test time is proportional to the size of the training set: the larger the training set, the longer it takes to classify a test document. kNN is inefficient for very large training sets.

kNN : discussion

kNN classification
kNN classification is a vector space classification method. It is also very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and others. If you need to get a pretty accurate classifier up and running in a short time, and you don't care that much about efficiency, use kNN.

Discussion of kNN (1)
No training is necessary, but the linear preprocessing of documents is as expensive as training Naive Bayes; we always preprocess the training set, so in reality the training time of kNN is linear. kNN is very accurate if the training set is large; kNN is close to optimal (ref. Cover and Hart, Nearest neighbor pattern classification, 1967). But kNN can be very inaccurate if the training set is small, and kNN scores are hard to convert to probabilities.

Discussion of kNN (2)
Using only the closest example (k = 1) to determine the categorization is subject to errors due to: a single atypical example, or noise (i.e., an error) in the category label of a single training example. A more robust alternative is to find the k most similar examples and return the majority category of these k examples (choosing k neighbors and taking a majority vote is the safer approach). The value of k is typically odd to avoid ties; 3 and 5 are the most common.

Nearest Neighbor with Inverted Index
Finding nearest neighbors naively requires a linear search through the |D| documents in the collection. But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents: let the top-k retrieved results vote. Using an inverted index for this retrieval makes the implementation much more efficient, as in the sketch below.
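A rough sketch of kNN via retrieval: index the training documents in an inverted index, run the test document as a query, and let the top-k scored documents vote. The scoring here is a plain term-overlap count for brevity; a real system would use tf-idf weighted cosine scores. All names are ours.

```python
from collections import Counter, defaultdict

def build_index(training_set):
    """training_set: list of (tokens, label). Returns (inverted index, label list)."""
    index = defaultdict(list)                 # term -> list of doc ids (postings)
    for doc_id, (tokens, _) in enumerate(training_set):
        for term in set(tokens):
            index[term].append(doc_id)
    return index, [label for _, label in training_set]

def knn_via_index(query_tokens, index, labels, k=3):
    scores = Counter()
    for term in query_tokens:                 # only postings of query terms are touched
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    if not scores:                            # no query term matches any training doc
        return None
    top_k = [doc_id for doc_id, _ in scores.most_common(k)]
    return Counter(labels[d] for d in top_k).most_common(1)[0][0]
```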

Support Vector Machine (SVM)

Support Vector Machine (1)
A kind of large-margin classifier, the product of intensive machine-learning research over the last two decades aimed at improving classifier effectiveness. It is a vector-space-based machine-learning method whose goal is to find a decision boundary between two classes that is maximally far from any point in the training data, possibly discounting some points as outliers or noise.

Support Vector Machine (2)
Two-class training data; the decision boundary is a linear separator; the criterion is being maximally far away from any data point, which determines the classifier margin; the position of the linear separator is defined by the support vectors.

Why maximize the margin?
Points near the decision surface represent uncertain classification decisions. A classifier with a large margin makes more certain classification decisions, which reduces the effect of measurement errors and document variation.

Linear programming
Find a, b, c such that ax + by ≥ c for red points and ax + by ≤ c for green points. Solving for the coefficients determines the separating line/plane.

Which hyperplane?
In general, there are lots of possible solutions for a, b, c (many separators exist), so which hyperplane should we pick?

Which hyperplane?
There are lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one, while other methods find an optimal separating hyperplane. Which points should influence optimality? All points (use every point to solve for the optimum): linear regression. Only the "difficult points" close to the decision boundary: support vector machines (SVM).

Which hyperplane?
If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model is decreased.

Formalize an SVM with algebra
Hyperplane: an n-dimensional generalization of a plane. Decision hyperplane: given a normal vector w (the weight vector), which is perpendicular to the hyperplane, and an intercept b, all points x on the hyperplane satisfy wT x + b = 0. The training points of the two classes that lie closest to the hyperplane (the support vectors) individually satisfy wT x + b = +1 and wT x + b = -1.

Geometric Margin
The distance from an example x to the separator is r = y (wT x + b) / |w|. The examples closest to the hyperplane are the support vectors. The margin ρ of the separator is the width of separation between the support vectors of the two classes.
Derivation of r: the segment x' - x (from x to its projection x' on the boundary) is perpendicular to the decision boundary, so it is parallel to w. The unit vector in that direction is w/|w|, so the segment is r w/|w| and x' = x - r w/|w|. Since x' lies on the boundary, it satisfies wT x' + b = 0, i.e. wT (x - r w/|w|) + b = 0. Recalling that |w| = sqrt(wT w) and solving for r gives r = y (wT x + b) / |w|.

Linear Support Vector Machine
Hyperplane: wT x + b = 0. For support vectors xa and xb on the two margin hyperplanes, wT xa + b = +1 and wT xb + b = -1. Subtracting gives wT(xa - xb) = 2, so the margin, i.e. the separation ||xa - xb|| measured along w, is ρ = 2 / ||w||.

Soft Margin Classification
If the training set is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples, i.e., the classifier is allowed to make some errors. (Figure: two points on the wrong side of their margin, with slacks ξi and ξj.)

SVM Resources SVM Light, Cornell University http://svmlight.joachims.org/ libSVM, National Taiwan University http://www.csie.ntu.edu.tw/~cjlin/libsvm/ Reference http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/libsvm.pdf http://ntu.csie.org/~piaip/svm/svm_tutorial.html

Reference
SVM tutorial (in Chinese): http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM1.pdf
An introduction to Support Vector Machines: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM2.pdf
An introduction to Support Vector Machines: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM3.pdf

Classification with SVMs
Given a new point, score its projection onto the hyperplane: compute score = wT x + b and set a confidence threshold t (i.e., measure which side the point falls on and how far, then apply a threshold). Score > t: yes. Score < -t: no. Otherwise: don't know. A small scoring sketch follows.
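A minimal sketch of this thresholded scoring rule; w, b and t are assumed to come from an already-trained linear SVM, and the numbers in the example call are made up for illustration.

```python
def svm_decide(x, w, b, t):
    """Thresholded linear SVM decision: score = w·x + b, compared against ±t."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"

print(svm_decide(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1, t=0.3))  # score = 0.1 -> "don't know"
```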

SVM : discussion

Multiclass SVMs
SVMs are inherently two-class classifiers. The most common technique in practice: build |C| one-versus-rest classifiers (commonly referred to as "one-versus-all" or OVA classification), and choose the class which classifies the test data with the greatest margin. Another strategy: build a set of one-versus-one classifiers and choose the class that is selected by the most classifiers; this involves building |C|(|C| - 1)/2 classifiers and deciding by a binary decision tree or majority vote.

Exercise: simulating multiclass classification with binary classifiers. Suppose there are three classes: politics, entertainment, technology.
One-versus-rest: build three classifiers (politics vs. rest, entertainment vs. rest, technology vs. rest) and take the class the test point falls closest to (greatest margin) as the result; with n classes, n classifiers are needed.
One-versus-one: build one classifier for every pair of classes and decide the result by total score (voting). For example: politics vs. entertainment → politics; politics vs. technology → technology; entertainment vs. technology → technology. Under the scoring scheme the final ranking is technology > politics > entertainment, so the result is technology. With n classes, C(n, 2) = n(n-1)/2 classifiers are needed. A small voting sketch follows.
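A few lines of one-versus-one voting for the three-class example above; the pairwise winners are taken straight from the exercise, whereas in a real system each would come from a trained binary SVM.

```python
from collections import Counter

pairwise_winners = {
    ("politics", "entertainment"): "politics",
    ("politics", "technology"): "technology",
    ("entertainment", "technology"): "technology",
}
votes = Counter(pairwise_winners.values())
print(votes.most_common())   # [('technology', 2), ('politics', 1)] -> technology wins

n = 3
print(n * (n - 1) // 2)      # number of one-versus-one classifiers needed: 3
```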

Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great. But what are we going to do if the dataset is just too hard (the classes cannot be separated)? How about mapping the data to a higher-dimensional space (finding a mapping into a different space)? (Figure: 1-D data along an x axis that is not separable becomes separable when plotted in the (x, x²) plane.)

Non-linear SVMs: Feature spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
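A toy illustration of this idea: 1-D points that are not linearly separable become separable after a hand-picked mapping φ(x) = (x, x²). The data points and the threshold 5.5 are invented for the example.

```python
points = [(-3, "A"), (-2, "B"), (-1, "B"), (1, "B"), (2, "B"), (3, "A")]

def phi(x):
    """Map a 1-D point into the higher-dimensional feature space (x, x^2)."""
    return (x, x * x)

# In the mapped space a horizontal line (second coordinate = 5.5) separates the classes.
for x, label in points:
    x1, x2 = phi(x)
    predicted = "A" if x2 > 5.5 else "B"
    print((x1, x2), label, predicted)   # predictions match the labels
```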

SVM is good for Text Classification
Documents are zero along almost all axes, and most document pairs are very far apart (i.e., not strictly orthogonal, but they share only very common words and a few scattered others); most coordinates of a document vector are 0, so documents lie far apart in the space. Virtually all document sets are separable, for almost any classification; documents rarely get mixed together, which is why linear classifiers are quite successful in this domain.

Issues in Text Classification

(1) What kind of classifier to use
Document classification is useful for many commercial applications. What kind of classifier should you use? How much training data do you have (how much can you prepare)? None; very little; quite a lot; or a huge amount that keeps growing.

If you have no labeled training data
Try a hand-written rules solution: If (Baseball OR Basketball) then categorize as Sport. In practice the rules get a lot bigger than this (there are usually many of them). With careful crafting (human tuning on development data) performance is high, but the amount of work required is huge: constant manual inspection and maintenance.

If you have fairly little data?
Naive Bayes should do well in such circumstances (Ng and Jordan, NIPS 2002); it is suited to the case where only a few training documents are available. The practical answer is to get more labeled data as soon as you can. How do you get people to be willing to label data for you? Think about how to exploit data already on the web, e.g., email folders, bookmarks, social tagging.

If you have a huge amount of data?
Great in theory for doing accurate classification, but expensive methods like SVMs (training time) or kNN (test time) become quite impractical; computationally heavy algorithms no longer fit. Try Naive Bayes again.

(2) Large and difficult categories
Classification is easy for a small number of well-separated categories, but accurate classification over large sets of closely related classes is inherently difficult. Examples: web directories (e.g., the Yahoo! Directory consists of over 200,000 categories; the Open Directory Project), library classification schemes (Library of Congress). Classifier combination is a useful technique: voting among multiple classifiers, or a hybrid automatic/manual solution.

(3) Other techniques
Try differentially weighting the contributions from different document zones: upweighting title words helps (Cohen & Singer, 1996); upweighting the first sentence of each paragraph helps (Murata, 1999); upweighting sentences that contain title words helps (Ko et al., 2002); summarization as feature selection for text categorization, i.e., run automatic summarization first (Kolcz, Prabakarmurthi, and Kolita, CIKM 2001).

(4) The problem of concept drift
Categories change over time. Example: "president of the united states". In 1999, "clinton" is a great feature; in 2002, "clinton" is a bad feature (he is no longer president). One measure of a text classification system is how well it protects against concept drift: how often do you need to review the training data and retrain?

Dumais et al. 1998 : Reuters - Accuracy

Results for Kernels (Joachims 1998) NB should be over 80!!

Yang & Liu: SVM vs. Other Methods

Text Classification : conclusion
Choose an approach: do no classification; do it all manually; do it all with an automatic classifier (mistakes have a cost, and catching them costs something too); or do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases. Commonly the last method is the most cost-efficient and is the one adopted.

Discussions