Some Effective Techniques for Naive Bayes Text Classification Advisor: Dr. Hsu Presenter: Ai-Chen Liao Authors: Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. TKDE, 2006, pp. 1457-1466
Outline: Motivation; Objective; About Naïve Bayes; Method (a per-document length normalization approach, weight-enhancing method); Experimental Results; Conclusion; Personal Opinions
Motivation While naïve Bayes is quite effective in various data mining tasks, it gives disappointing results in automatic text classification. Observing how naïve Bayes behaves on natural language text, the authors found a serious problem in the parameter estimation process, which causes poor results in the text classification domain.
Objective To propose methods that address these problems.
About Naive Bayes Multivariate Bernoulli naïve Bayes: a document is represented as a binary feature vector indicating whether each word is present or absent, so the model cannot use term frequencies in documents. Multinomial model: two serious problems, (1) rough parameter estimation and (2) handling of rare categories. Naive Bayes is very efficient, easy to implement, and can be compared with other learning algorithms, but traditional naïve Bayes does not perform as well as other statistical methods such as SVM, boosting, or nearest-neighbor classifiers, so we hope to improve it.
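The two document representations above can be illustrated with a minimal sketch. The function names, the toy vocabulary, and the sample sentence are hypothetical and do not come from the paper; the sketch only shows that the Bernoulli view discards term frequency while the count view keeps it.

```python
from collections import Counter

def bernoulli_features(tokens, vocabulary):
    """Binary vector: 1 if the word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def multinomial_features(tokens, vocabulary):
    """Count vector: number of times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = ["bayes", "naive", "text"]
tokens = "naive bayes text text text".split()

binary = bernoulli_features(tokens, vocabulary)    # presence/absence only
counts = multinomial_features(tokens, vocabulary)  # keeps term frequencies
```

Here "text" occurs three times, but the binary vector records it the same way as a word that occurs once.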
Method ─ Multivariate Poisson Model for Text Classification λ denotes the average number of times a given event occurs within a specified interval. A document is assumed to be generated by a multivariate Poisson model.
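As a rough illustration of scoring a document under a multivariate Poisson model, the sketch below uses the standard Poisson probability P(X = k) = e^(-λ) λ^k / k! and sums the log-probabilities over terms. It assumes the terms are independent given the class (the usual naive Bayes assumption); the function names, class labels, and λ values are hypothetical and are not taken from the paper.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(X = k) for a Poisson variable with mean lam > 0."""
    # lgamma(k + 1) equals log(k!) and also accepts non-integer k.
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def document_log_likelihood(term_counts, lambdas):
    """Sum of per-term Poisson log-probabilities (terms assumed independent)."""
    return sum(poisson_log_pmf(k, lam) for k, lam in zip(term_counts, lambdas))

def classify(term_counts, class_lambdas, class_log_priors):
    """Return the class with the highest log prior + log likelihood."""
    scores = {
        c: class_log_priors[c] + document_log_likelihood(term_counts, lams)
        for c, lams in class_lambdas.items()
    }
    return max(scores, key=scores.get)

# Hypothetical per-class Poisson means for a two-term vocabulary.
class_lambdas = {"sports": [3.0, 0.5], "politics": [0.5, 3.0]}
class_log_priors = {"sports": math.log(0.5), "politics": math.log(0.5)}

predicted = classify([4, 0], class_lambdas, class_log_priors)
```

Working in log space avoids underflow when many terms are multiplied together.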
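Method ─ A per-document length normalization approach A document is assumed to be generated by a multivariate Poisson model. The term frequencies within each document are normalized according to that document's length.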
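The effect of normalizing per document can be sketched as follows. The exact normalization formula used in the paper is not given on this slide, so the sketch assumes the simplest variant: divide each term count by its document's length, then average over documents. For contrast it also computes a pooled estimate that treats all training documents as one large document. The function names and the toy counts are hypothetical.

```python
def per_document_estimate(docs):
    """Average of per-document relative frequencies.

    Each document contributes equally, regardless of its length.
    Assumes every document is non-empty.
    """
    n_terms = len(docs[0])
    totals = [0.0] * n_terms
    for counts in docs:
        length = sum(counts)
        for i, c in enumerate(counts):
            totals[i] += c / length
    return [t / len(docs) for t in totals]

def pooled_estimate(docs):
    """Relative frequencies after concatenating all documents into one."""
    n_terms = len(docs[0])
    term_totals = [sum(d[i] for d in docs) for i in range(n_terms)]
    grand_total = sum(term_totals)
    return [t / grand_total for t in term_totals]

# Hypothetical term-count vectors: one short and one long document.
docs = [[1, 1], [90, 10]]

per_doc = per_document_estimate(docs)
pooled = pooled_estimate(docs)
```

With these toy counts the long document dominates the pooled estimate for the first term (about 0.89), while the per-document estimate weights both documents equally (0.7).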
Method ─ Feature Weighting Scheme
Experimental Results DS1: Reuters21578 (21,578 news articles). DS2: 20Newsgroups (19,997 Usenet articles collected from 20 different newsgroups).
Conclusion We propose a Poisson naive Bayes text classification model with a weight-enhancing method. We suggest per-document term frequency normalization to estimate the Poisson parameter, whereas the traditional multinomial classifier estimates its parameters by treating all the training documents as a single huge training document.
Personal Opinions Advantage Drawback Application … Text classification…