Prof. Willie Yang (楊立偉), Department of Information Management, Taiwan Tech (台灣科大資管系) © Copyright 2015 by Willie Yang


1 Homework 1: TF-IDF
Prof. Willie Yang (楊立偉), wyang@ntu.edu.tw

2 Chinese Keyword Extraction
Chinese keyword extraction is fundamental to many applications. There are two major approaches: one requires word segmentation first (需先斷詞), and the other requires no word segmentation (不需先斷詞).

3 N-gram approach
No word segmentation is required (不需先斷詞).
The keywords are a subset of the n-grams.
How do we select the proper n-grams as keywords? Candidate measures include tf-idf, chi-square (卡方), mutual information, information gain, maximum entropy, and others.
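To make "the keywords are a subset of the n-grams" concrete, here is a minimal sketch of character n-gram enumeration. The function name char_ngrams and its parameters are illustrative assumptions, not part of the slides.

```python
def char_ngrams(text, n_min=2, n_max=6):
    """Yield every character n-gram of length n_min..n_max, shortest first."""
    for n in range(n_min, n_max + 1):
        # A string of length L has L - n + 1 n-grams (none if L < n).
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


# Candidate keywords for a short headline fragment, using 2- and 3-grams.
candidates = list(char_ngrams("林書豪得分", 2, 3))
print(candidates)
```

Every possible keyword of the chosen lengths appears among the candidates, so the remaining work is ranking and filtering them.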

4 N-gram approach with tf-idf
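Enumerate n-grams, for example, for n = 2 to 6.
Compute tf and idf.
Sort by tf-idf in descending order.
Remove non-keywords (移除非關鍵詞者). Ex. candidates containing digits or special characters are not counted.
Remove sub-keywords (移除子關鍵詞). Ex. remove 林書 and 書豪, and keep only 林書豪.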
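The steps above can be sketched as follows. The slides do not fix the exact formulas, so several choices here are assumptions: tf is the total count of an n-gram in one topic's documents, df is the number of documents in the whole corpus that contain it, idf is log(N / df), a non-keyword is any candidate that is not made up entirely of CJK ideographs, and a sub-keyword is a candidate contained in a longer candidate with the same tf. All function and variable names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: valid candidates consist only of CJK ideographs,
# which rejects digits, punctuation, Latin letters, and whitespace.
CJK_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")


def ngrams(text, n_min, n_max):
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def extract_keywords(topic_docs, all_docs, n_min=2, n_max=6, top_k=50):
    """Return (keyword, tf, df, tfidf) rows sorted by tf-idf, descending.

    Assumes topic_docs is a subset of all_docs, so every df is at least 1.
    """
    tf = Counter()
    for doc in topic_docs:
        tf.update(ngrams(doc, n_min, n_max))
    df = Counter()
    for doc in all_docs:
        df.update(set(ngrams(doc, n_min, n_max)))  # count each doc once
    n_docs = len(all_docs)

    # Score candidates, skipping non-keywords.
    rows = [(g, c, df[g], c * math.log(n_docs / df[g]))
            for g, c in tf.items() if CJK_ONLY.match(g)]

    # Assumed sub-keyword rule: drop g when a longer candidate contains it
    # and has the same tf. This quadratic scan is fine for a sketch only.
    kept = [r for r in rows
            if not any(r[0] != o[0] and r[0] in o[0] and r[1] == o[1]
                       for o in rows)]
    kept.sort(key=lambda r: (-r[3], r[0]))
    return kept[:top_k]


# Tiny made-up corpus, for illustration only.
topic_docs = ["林書豪得分", "林書豪助攻"]
all_docs = topic_docs + ["股市上漲", "股市下跌", "兩岸交流", "健康飲食"]
keywords = extract_keywords(topic_docs, all_docs)
for rank, (kw, tf_, df_, score) in enumerate(keywords, start=1):
    print(rank, kw, tf_, df_, round(score, 4))
```

On a real news corpus the sub-keyword scan should be replaced by something faster, for example checking only the candidates that extend a given n-gram by one character.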

5 Demonstration
Use a news corpus.
Use different topics.

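6 Requirements (1)
Implement seven topics: entertainment (影劇娛樂), sports (運動), cross-strait affairs (兩岸), finance (財經), health (保健), politics (政治), and society (社會).
List the top 50 keywords for each topic.
Enumerate terms of 2 to 8 characters.
Remove candidates that contain digits or special characters, and remove sub-keywords.
List the rank, keyword, tf, df, and tf-idf.
Store the results, in order, in a single Excel file.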
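The required columns (rank, keyword, tf, df, tf-idf) can be written out with a sketch like the one below. It writes CSV, which Excel can open, rather than a native Excel workbook; that substitution, the function name write_keyword_table, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def write_keyword_table(rows, out):
    """Write (keyword, tf, df, tfidf) rows, already sorted, with a rank column."""
    writer = csv.writer(out)
    writer.writerow(["rank", "keyword", "tf", "df", "tf-idf"])
    for rank, (keyword, tf, df, tfidf) in enumerate(rows, start=1):
        writer.writerow([rank, keyword, tf, df, f"{tfidf:.4f}"])


# Made-up rows, only to show the layout.
sample_rows = [("林書豪", 2, 2, 2.1972), ("得分", 1, 1, 1.7918)]
buffer = io.StringIO()
write_keyword_table(sample_rows, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

To produce a file instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="", encoding="utf-8-sig"); the byte-order mark written by utf-8-sig helps Excel detect the encoding of the Chinese keywords.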

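7 Requirements (2)
Present in groups of 1 to 4 students.
Any programming language may be used.
Present on stage in two weeks (live run of 2 topics + code review).
Submit the Excel file and the source code, packed into one compressed archive named with the student IDs (學號): ke2015_hw1_學號_學號…zip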

8 Discussion
"Basic algorithms and rich corpus can do a great job."
Use the keywords to tag every original document (as document features that represent the document).
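One simple reading of tagging documents with the extracted keywords is sketched below: a document's tags are the keywords that occur in it, and its feature vector is the count of each keyword. Both function names and the sample inputs are assumptions, not part of the slides.

```python
def tag_document(text, keywords):
    """Return the keywords that occur in text, in keyword-list order."""
    return [kw for kw in keywords if kw in text]


def to_feature_vector(text, keywords):
    """Return one occurrence count per keyword, aligned with the keyword list."""
    return [text.count(kw) for kw in keywords]


# Illustrative keyword list and document.
topic_keywords = ["林書豪", "股市", "得分"]
document = "林書豪得分破紀錄"
tags = tag_document(document, topic_keywords)
vector = to_feature_vector(document, topic_keywords)
print(tags, vector)
```

Vectors built this way share one dimension per keyword, so documents can be compared or fed to a classifier without any word segmentation.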

