Download presentation
Presentation is loading. Please wait.
1
Keyword extraction from Suffix Array
宋政隆
2
Outline Semi-infinite string (sistring) Suffix Tree
PATRICIA Tree (PAT-Tree) Suffix Array (PAT Array) 2019/9/2 COSCUP 2006, NTU
3
陳總統:限國民黨三個月改黨徽 【記者游文寶/桃園縣報導】 繼政變說之後,陳水扁總統昨晚又拋出新的選戰議題,他指過去黨國不分,他要求國民黨在三個月內修改黨徽,否則泛綠在立法院過半後,由他們來修改國徽法。 國民黨立即反駁,表示國民黨的黨徽有其歷史及意義,絕不可能更改,而且國徽與黨徽也有明顯不同。國民黨表示,陳總統只是在創造選戰議題,如果陳總統有膽,現在就可以改掉中華民國的國徽,另設台灣共和國國徽。 陳水扁昨晚到桃園縣平鎮市助選,指柔性政變的整個關鍵,就在於有人把中華民國等同於國民黨,把中華民國國軍等同於國民黨黨軍;經過政黨輪替後,中華民國主權屬於兩千三百萬台灣人民,國軍也是屬於台灣人民。 他表示,但國民黨、親民黨、新黨仍將政黨組織留在軍中,他奉勸正值一百一十歲生日的國民黨,早日放棄黨國一家的舊思維和封建思想,摒棄黨國主義;因為黨非國家,國家也非黨,不要再想以黨指揮國家,他要求泛藍趕快裁撤與軍隊相關的黨組織。 陳水扁說,國民黨過去執政五十年,霸占國家人民財產,應該還財於民。接著他出示國民黨的黨徽及國徽圖卡,讓現場民眾辨認,並拿出奧運會旗、警徽、各軍種總司令旗為例,說明國民黨黨徽、國徽分不清楚。 他說黨徽、國徽既然那麼容易搞錯,他要求國民黨在三個月內修改黨徽,避免魚目混珠,以作區隔;如果國民黨不改,沒關係,在十二月十一日讓民進黨成為國會穩定多數,「我們在明年二月一日新國會成立以後,再來修改國徽法。」 2019/9/2 COSCUP 2006, NTU
4
Frequent terms 黨,22 徽,16 民黨,12 國,21 國民黨,11 在,8 民,8 他,7 國徽,7 2019/9/2
COSCUP 2006, NTU
5
semi-infinite strings (sistring)
Example Text Once upon a time, in a far away land … sistring 1 Once upon a time … sistring 2 nce upon a time … sistring 8 on a time, in a … sistring 11 a time, in a far … sistring 22 a far away land … Compare sistrings 22 < 11 < 2 < 8 < 1 2019/9/2 COSCUP 2006, NTU
6
Suffix tree If txt=t1t2...ti...tn is a string, then Ti=titi+1...tn is the suffix of txt that starts at position i, e.g. 2019/9/2 COSCUP 2006, NTU
7
Suffix tree 2019/9/2 COSCUP 2006, NTU
8
PAT-tree PATRICIA tree PATRICIA + semi-infinite strings
a digital tree where the individual bits of the keys are used to decide on the branching each internal node indicates which bit of the query is used for branching absolute bit position a count of the number of bits to skip PATRICIA + semi-infinite strings a text T with n basic units u1 u2 … un u1 u2 … un …, u2 u3 … un …, u3 u4 … un …, … an end to the left but none to the right store the starting positions of semi-infinite strings in a text using PATRICIA 2019/9/2 COSCUP 2006, NTU
9
PAT-tree Example 1 2 2 Text 01100100010111 …
sistring … sistring … sistring … sistring … sistring … sistring … sistring … sistring 4 1 3 2 1 2 2 4 3 3 2 5 1 : external node sistring (integer displacement) total displacement of the bit to be inspected 1 1 1 1 1 1 2 1 2 1 3 2 : internal node skip counter & pointer 2019/9/2 COSCUP 2006, NTU
10
PAT/Suffix Tree Applications
Prefix Search Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time). Longest Repeated Substring the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root. Longest Common Substring The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it. 2019/9/2 COSCUP 2006, NTU
11
Prefix searching idea every subtree of the PAT tree has all the sistrings with a given prefix. Search: proportional to the query length exhaust the prefix or up to external node. Search for the prefix “10100” and its answer 2019/9/2 COSCUP 2006, NTU
12
Longest Repetition Searching
the match between two different positions of a text where this match is the longest in the entire text, e.g., the tallest internal node gives a pair of sistrings that match for the greatest number of characters Text sistring sistring sistring sistring sistring sistring sistring sistring 1 2 2 3 4 3 2 7 5 6 5 3 1 4 8 2019/9/2 COSCUP 2006, NTU
13
“Most Significant” or “Most Frequent” Matching
the most frequently occurring strings within the text database, e.g., the most frequent trigram find the most frequent trigram find the largest subtree at a distance 3 characters from root 1 the tallest internal node gives a pair of sistrings that match for the greatest number of characters 2 2 3 4 3 2 i.e., 1, 2, 3 are the same for sistrings and 7 5 6 5 3 1 4 8 2019/9/2 COSCUP 2006, NTU
14
Suffix Array for Chinese keyword extraction
Array of sistring Simple Construction Treat the whole text as a string (character array) Move char by char, and add to the suffix array Sort the array 2019/9/2 COSCUP 2006, NTU
15
Example 我要去上學校來上學 要去上學校來上學 去上學校來上學 上學校來上學 學校來上學 校來上學 來上學 上學 學 2019/9/2
COSCUP 2006, NTU
16
Example 我要去上學校來上學 1 -> 上, 2 2 -> 上學, 2 上學 上學校來上學 來上學 去上學校來上學 學
2019/9/2 COSCUP 2006, NTU
17
Example: banana Before sort After sort Common Prefix Word 2019/9/2
COSCUP 2006, NTU
18
Suffix Array Construction Difficulty
Quick Sort is slow Comparison between strings, not BITs It’s hard to record every phases/words that freq > 2 (common prefix word) in linear time It’s easy to record the most frequency word/ longest repetition substring 2019/9/2 COSCUP 2006, NTU
19
Reference Ted Pedersen, N-gram Statistics Package, L. F. Chien, "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", Proceedings of the ACM SIGIR International Conference on Information Retrieval, 1997. Bill Frakes, "New Indices for Text: PAT trees and PAT arrays", Information Retrieval Data Structures & Algorithms, 1992. 2019/9/2 COSCUP 2006, NTU
20
Experiment results and Online demo
Implementation Experiment results and Online demo
21
Suffix (Array) Testing Environment
CPU PIII-500 Memory 1GB Language C Encoding Utf8 UTF8 Chinese chars is 3 bytes/word where Unicode Chinese chars is 2 bytes/word in general 2019/9/2 COSCUP 2006, NTU
22
Program structure (initial version)
Struct node to store word_freq struct WORD_FREQ { char *word; unsigned int length; unsigned int freq; }; Functions Standard C Library built-in qsort(); int validUTF8ChineseWord (const char*); char * moveforward (const char *, const char *); int lookupWordIndex (WORD_FREQ *, const char*, const int); … 2019/9/2 COSCUP 2006, NTU
23
Suffix (Array) Previous results
Reversion # of words Process time(sec) Memory used r13 82,364 1,857 21,060K r15 99,072 73 6,832K r22 64,044 10 6,836K r32 48,679 4 6,736K Testing corpus 15,000 lines Utf8 size 913,034KB 2019/9/2 COSCUP 2006, NTU
24
Program structure (nowadays)
Struct node to store word_freq struct WORD_FREQ { char *word; char *wend; // point to the end of word unsigned int length; unsigned int freq; }; Functions …. 2019/9/2 COSCUP 2006, NTU
25
Comparison of suffix/comKTS
corpus-s (25000 lines, < 1mb) PAT-Tree CPU time 00:07:37 (457sec) Suffix CPU time 00:00:33 (33sec) corpus-m ( lines, 1.5mb) PAT-Tree CPU time ??? Suffix CPU time 03:26:57 (12417sec) 2019/9/2 COSCUP 2006, NTU
26
Program Flow Record every valid word (UTF8) address by pointer (char *) moveforward until EOF Quicksort by word prefix 我要去上學校來上學 要去上學校來上學 去上學校來上學 上學校來上學 學校來上學 校來上學 來上學 上學 學 2019/9/2 COSCUP 2006, NTU
27
Program Flow Record every valid word (UTF8) address by pointer (char *) moveforward until EOF Quicksort by word prefix Compare every adjacent word, if has common prefix Get WORD_FREQ index from lookupWordIndex() Insert to WORD_FREQ Quicksort by frequency Output 我要去上學校來上學 上學 上學校來上學 來上學 去上學校來上學 學 學校來上學 校來上學 要去上學校來上學 2019/9/2 COSCUP 2006, NTU
28
Features Prefix/Postfix filtering (may increase process time) 李遠哲院,4
李遠哲院長,4 遠哲院長,4 2019/9/2 COSCUP 2006, NTU
29
Suffix (Array) Results
corpus # of lines Utf8 size(kb) # of words Process time(sec) Qsort time (sec) Insert word time Memory used 1MB 15,000 913,034 48,679 4 0.859 0.843 6,736K (6M) 5MB 124,999 9,021,705 482,847 30 12.476 10.906 55,588K (55M) 10MB 249,999 17,309,906 912,978 62 26.148 23.453 103M 20MB 499,999 34,336,644 1,789,545 147 57.742 54.375 203M 40MB 967,382 66,135,317 3,402,792 342 391M 2019/9/2 COSCUP 2006, NTU
30
Future plan Applications Algorithm libtabe?
Any Apps that needs term extraction Algorithm Memory usage Faster execution time 2019/9/2 COSCUP 2006, NTU
31
Demo Time 2019/9/2 COSCUP 2006, NTU
32
Thank you
33
Questions?
Similar presentations