Classification Rule - Decision Tree

Classification Rule - Decision Tree
鄒明城

起源樹狀結構以大量運用於computer science，例如資料結構、資料庫索引、complier中，但直到近10年來才開始運用於知識的發掘與表達 1984年Breiman等所發表的Classification and Regression Tree一書，使得decision tree開始於統計界獲得認同 1986年Quinlan於Machine Learning Journal所發表Induction of decision tree文章，並介紹了ID3演算法，開啟了日後在data mining 領域上的後續研究後續有C4.5, CART, CHAID等演算法提出

近來，常被用來與類神經網路作比較，它最大的特點是；可以很明確的表達出rule，解釋推論的前因後果，是一種white box model
可以以自然語言表達出來，，更可以很方便的轉成SQL語言，有利於從關連式資料庫中建立Data mining的機制屬於監督式分類的一種，它必須包含有一組訓練資料，並且事先為每一資料做分類，經過學習後產生正確的描述或模式，然後再通過一組測試資料來作驗證，可用於未來的分類與預測，

ID3 (Iterative Dichotomiser 3) 原理
以top-down 的方式，所有的training data由root node開始，以data structure中建tree的方式來建立它藉由可供區別的屬性，以遞迴的方式不斷切割訓練樣本成為具有相同性質的群組，至於區別屬性的選擇則以資訊理論中的Entropy(火商)或是資訊增益(Information Gain)方式來衡量，具有越大區別性的屬性優先被選出，整個過程不斷重複直到每筆記錄均已歸屬為某一分類

How to split a node (attribute)
建立評量函數(Goodness Function)，透過評量函數找出最適宜切割的屬性欄位 Goodness Fuction 種類 Information Gain Gini Index Inference Power

Information Gain 評估選擇具有最高information gain 的欄位做為tree node 切割的依據
P：所有資料樣本數 m 個類別在P中，P(P1,p2,p3,p4…pm)，每個有 pi的object Expected Info 為

Information Gain (cont’d)
The expected information required for the tree with A as root is obtained as the weighted average

Information gain by braching on A is Gain(A) = I(p1, p2…pm) - E(A) example I(p,n) = -9/14 log2 *9/14 - 5/14 log2 5/14 = 0.94 p1=2 n1=3, I(p1,n1) = 0.97 p2=4, n2=0, I(p2,n2)= 0 p3=3, n3=2, I(p3,n3)= 0.97 E(外觀) = 5/14*I(p1,n1) + 4/14*I(p2,n2) + 5/14*I(p3,n3) = 0.694 Gain(外觀) = = 0.246

同理可得 Gain(溫度) = 0.029 Gain(溼度) ﹦0.151 Gain(風力) = 0.048 故選擇外觀做為第一個切割的屬性其餘再以外觀的各個值做為root，藉由SQL選出選出對應者，再以同樣的方式找出各node之最大information gain 欄位，如此不斷遞迴建構下去，直到leaf node即結束

Gini Index If a data set T contains examples from n classes, gini index, gini(T) is defined as where pj is the relative frequency of class j in T If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data contains examples from n classes, the gini index ginisplit(T)is defined as

Gini Index 具有最低ginisplit(T)值的屬性欄位，優先被選來做為node 的切割

Node split Split node 的方法並沒有絕對的好與壞，有學者研究使用adaptive 方式，動態的來找尋最適當的分割法來分割各個node 不重要的field甚至不會出現在tree 中

如何使用Decision Tree 直接間接根據資料的屬性直接導入decision tree中作測試
由root至leaf追蹤最後所到達leaf的分類標籤，即完成分類的預測間接將decision轉換成分類法則追蹤每一條由root至leaf的路徑即可構成一條rule 建立成易於人類觀察的IF THEN rule，並且可以建立成為專家系統的知識庫

優點可產生易於了解的法則可用於Rule-Oriented domain 分類計算簡潔，十分適合於大量分類的場合
易於轉換成自然語言以及SQL 可用於Rule-Oriented domain 分類計算簡潔，十分適合於大量分類的場合具有處理continuous and categorical 變數的能力，統計以及類神經對於categorical變數的處理處理能力較差可找出較重要性的欄位可與類神經連用，做為類神經訓練網路之用

弱點較不適用於連續值或時序列資料的預估當所欲分類的類別太多時容易出錯訓練時的計算成本很高

Classification Rule - Decision Tree

Similar presentations

Presentation on theme: "Classification Rule - Decision Tree"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

Classification Rule - Decision Tree

Similar presentations

Presentation on theme: "Classification Rule - Decision Tree"— Presentation transcript:

Similar presentations

About project

反馈