Published by Benedict Kelly. Modified 6 years ago.
1
Some Effective Techniques for Naive Bayes Text Classification
Advisor: Dr. Hsu
Presenter: Ai-Chen Liao
Authors: Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng
TKDE. Page(s):
2
Outline
Motivation
Objective
About Naïve Bayes
Method
  A per-document length normalization approach
  Weight-enhancing method
Experimental Result
Conclusion
Personal Opinions
3
Motivation Although naïve Bayes is quite effective in various data mining tasks, it gives disappointing results in automatic text classification. From observing how naïve Bayes behaves on natural language text, we found a serious problem in the parameter estimation process, which causes poor results in the text classification domain.
4
Objective To propose methods that address these parameter estimation problems and improve naïve Bayes text classification.
5
About Naive Bayes
Multivariate Bernoulli naïve Bayes: A document is represented as a binary feature vector indicating whether each word is present or absent. This model cannot make use of term frequencies in documents.
Multinomial model: Two serious problems: (1) rough parameter estimation (2) handling rare categories.
Naive Bayes is very efficient, easy to implement, and can be compared with other learning algorithms. However, traditional naïve Bayes has not performed as well as other statistical methods such as SVM, boosting, or nearest-neighbor classifiers, so the aim is to improve it.
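As a minimal sketch of how the two event models see the same document, the snippet below builds a binary (Bernoulli-style) vector and a count (multinomial-style) vector. The vocabulary, the sample tokens, and the helper names `bernoulli_vector` and `multinomial_vector` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def bernoulli_vector(tokens, vocabulary):
    """Binary features: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def multinomial_vector(tokens, vocabulary):
    """Count features: number of occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["bayes", "naive", "poisson", "text"]
tokens = ["naive", "bayes", "text", "text", "text"]

binary = bernoulli_vector(tokens, vocabulary)    # presence/absence only
counts = multinomial_vector(tokens, vocabulary)  # keeps term frequencies
```

The binary vector records that "text" occurs but not that it occurs three times, which is the information the Bernoulli model cannot use.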
6
About Naive Bayes
7
Method ─ Multivariate Poisson Model for Text Classification
λ denotes the average number of times an event occurs within a given interval. Assume that a document is generated by a multivariate Poisson model.
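To make the role of λ concrete, here is a hedged sketch that scores a document under per-class Poisson parameters. It assumes one independent Poisson variable per term (the usual naive Bayes independence assumption). The class names, λ values, and function names are made up for illustration and are not the authors' implementation.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(X = k) for a Poisson variable with mean lam > 0.

    Uses lgamma(k + 1) in place of log(k!).
    """
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def document_log_likelihood(term_counts, lambdas):
    """Sum of per-term Poisson log-probabilities (terms assumed independent)."""
    return sum(poisson_log_pmf(k, lam) for k, lam in zip(term_counts, lambdas))

# Hypothetical per-class Poisson means for a two-term vocabulary.
lambdas_sports = [2.0, 0.1]
lambdas_finance = [0.1, 2.0]

doc = [3, 0]  # term 1 occurs three times, term 2 never
score_sports = document_log_likelihood(doc, lambdas_sports)
score_finance = document_log_likelihood(doc, lambdas_finance)
```

The document scores higher under the class whose λ for the first term is large, since three occurrences are far more probable with a mean of 2.0 than with a mean of 0.1.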
8
Method ─ A per-document length normalization approach
Assume that a document is generated by a multivariate Poisson model. The terms within each document are normalized according to that document's length.
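The following sketch contrasts a pooled estimate, which treats all training documents as one long document, with an average of per-document length-normalized frequencies. It is a simplified illustration under stated assumptions: there is no smoothing, every document is non-empty, and the function names and toy documents are hypothetical. The exact normalization used by the authors may differ.

```python
def pooled_estimate(docs, term):
    """Pooled estimate: concatenate all documents, then take the term's share."""
    total_term = sum(doc.count(term) for doc in docs)
    total_len = sum(len(doc) for doc in docs)
    return total_term / total_len

def per_document_estimate(docs, term):
    """Average of length-normalized term frequencies, one value per document."""
    return sum(doc.count(term) / len(doc) for doc in docs) / len(docs)

# Two toy documents of one class, each containing "goal" exactly once.
docs = [
    ["goal", "match"],        # short document: 2 tokens
    ["goal"] + ["match"] * 7,  # long document: 8 tokens
]

pooled = pooled_estimate(docs, "goal")         # 2 / 10
per_doc = per_document_estimate(docs, "goal")  # (1/2 + 1/8) / 2
```

In the pooled estimate the long document dominates because it contributes most of the tokens, whereas the per-document average gives each document equal weight regardless of its length.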
9
Method ─ Feature Weighting Scheme
10
Experimental Results
DS1: Reuters-21578 (consists of 21,578 news articles)
DS2: 20Newsgroups (consists of 19,997 Usenet articles collected from 20 different newsgroups)
11
Experimental Results [figure: four entries annotated "high"]
12
Experimental Results
13
Conclusion We propose a Poisson naive Bayes text classification model with a weight-enhancing method. We suggest per-document term frequency normalization to estimate the Poisson parameter, whereas the traditional multinomial classifier estimates its parameters by treating all the training documents as a single huge training document.
14
Personal Opinions Advantage Drawback Application …
Text classification…