Keyword extraction from Suffix Array

Slides:



Advertisements
Similar presentations
Exercise 1 EECS, Peking University Exercise in Query Processing.
Advertisements

TEM-4 Reading Comprehension 黄山学院 程汕姗. TEM-4 试卷题型及分值、时间分配 1. 听写 15 分 15mins 2. 听力 30 题 15 分 15mins 3. 完形填空 20 题 10 分 15mins 4. 语法和词汇 30 题 15 分 15mins 5.
《互联网运营管理》系列课程 觉浅网 荣誉出品
基本概論 Basic concepts.
Time Objectives By the end of this chapter, you will be able to
CHAPTER 9 虛擬記憶體管理 9.2 分頁需求 9.3 寫入時複製 9.4 分頁替換 9.5 欄的配置法則 9.6 輾轉現象
2011计算机类教研活动 陈国久.
Classification of Web Query Intent Using Encyclopedia 基于百科知识的查询意图获取
Memory Pool ACM Yanqing Peng.
手持裝置應用系統之設計 與未來發展 黃有評 大同大學 資訊工程系.
資料庫設計 Database Design.
Welcome Welcome to my class Welcome to my class!.
Leftmost Longest Regular Expression Matching in Reconfigurable Logic
Operating System CPU Scheduing - 3 Monday, August 11, 2008.
A TIME-FREQUENCY ADAPTIVE SIGNAL MODEL-BASED APPROACH FOR PARAMETRIC ECG COMPRESSION 14th European Signal Processing Conference (EUSIPCO 2006), Florence,
CH1 Number Systems and Conversion
Tree(樹) 什麼是「樹」? 「樹」的範例 「樹」的定義 「樹」的表示法.
關聯式資料庫.
Manifold Learning Kai Yang
樹狀結構 陳怡芬 2018/11/16 北一女中資訊專題研究.
不断变迁的闪存行业形势 Memory has changed, especially serial - from a low cost, low pin count, slow memory to an advanced, high performance memory solution to save.
Chapter 6 Graph Chang Chi-Chung
Creating Animated Apps (I) 靜宜大學資管系 楊子青
第十章 基于立体视觉的深度估计.
EndNote X6 Advance your Research and Publish Instantly
圖論 (Graph Theory) B 電機四 大鳥 B 電機四 酋長 B 電機四 炫大
Data Structure(資料結構) 授課老師: 蕭志明 助理教授 Ext:6779
第4章 网络互联与广域网 4.1 网络互联概述 4.2 网络互联设备 4.3 广域网 4.4 ISDN 4.5 DDN
Time Objectives By the end of this chapter, you will be able to
中国科学技术大学计算机系 陈香兰(0512- ) Spring 2011
第三章 项目设定.
This Is English 3 双向视频文稿.
Time Objectives By the end of this chapter, you will be able to
重點 資料結構之選定會影響演算法 選擇對的資料結構讓您上天堂 程式.
Formal Pivot to both Language and Intelligence in Science
第4章(1) 空间数据库 —数据库理论基础 北京建筑工程学院 王文宇.
String Matching Michael Tsai 2012/06/04.
樹 2 Michael Tsai 2013/3/26.
職業 Random Slide Show Menu
Chapter 5 Recursion.
Version Control System Based DSNs
105-1 Data Structure Exam /12/27.
EndNote X6 進階 Advance your Research and Publish Instantly
Maintaining Frequent Itemsets over High-Speed Data Streams
Total Review of Data Structures
软件工程 第四章 软件设计 软件过程设计技术与工具.
3.5 Region Filling Region Filling is a process of “coloring in” a definite image area or region. 2019/4/19.
Speaker: Liu Yu-Jiun Date: 2009/4/29
软件设计任务 从工程管理的角度来看,软件设计分两步完成。 概要设计,将软件需求转化为数据结构和软件的系统结构。
計算機程式 授課教師:廖婉君教授 第六單元 Arrays
中国科学技术大学计算机系 陈香兰 2013Fall 第七讲 存储器管理 中国科学技术大学计算机系 陈香兰 2013Fall.
VRP工具or-tools调研 王羚宇
Review and Analysis of the Usage of Degree Adverbs
Hashing Michael Tsai 2013/06/04.
通信工程专业英语 Lesson 13 Phase-Locked Loops 第13课 锁相环
计算机问题求解 – 论题 算法方法 2016年11月28日.
Inter-band calibration for atmosphere
String Matching Michael Tsai 2013/05/28.
第10章 存储器接口 罗文坚 中国科大 计算机学院
BiCuts: A fast packet classification algorithm using bit-level cutting
计算机问题求解 – 论题1-5 - 数据与数据结构 2018年10月16日.
唐常杰 四川大学计算机学院 计算机科学技术系
赵才荣 同济大学,电子与信息工程学院,智信馆410室
Hashing Michael Tsai 2017/4/25.
國立東華大學課程設計與潛能開發學系張德勝
Introduction to Computer Security and Cryptography
Significant Figures 有效數字
Hybrid fractal zerotree wavelet image coding
When using opening and closing presentation slides, use the masterbrand logo at the correct size and in the right position. This slide meets both needs.
Presentation transcript:

Keyword extraction from Suffix Array 宋政隆 (clsung@)

Outline Semi-infinite string (sistring) Suffix Tree PATRICIA Tree (PAT-Tree) Suffix Array (PAT Array) 2019/9/2 COSCUP 2006, NTU

http://udn.com/NEWS/NATIONAL/NATS1/2361442.shtml 陳總統:限國民黨三個月改黨徽      【記者游文寶/桃園縣報導】 繼政變說之後,陳水扁總統昨晚又拋出新的選戰議題,他指過去黨國不分,他要求國民黨在三個月內修改黨徽,否則泛綠在立法院過半後,由他們來修改國徽法。 國民黨立即反駁,表示國民黨的黨徽有其歷史及意義,絕不可能更改,而且國徽與黨徽也有明顯不同。國民黨表示,陳總統只是在創造選戰議題,如果陳總統有膽,現在就可以改掉中華民國的國徽,另設台灣共和國國徽。 陳水扁昨晚到桃園縣平鎮市助選,指柔性政變的整個關鍵,就在於有人把中華民國等同於國民黨,把中華民國國軍等同於國民黨黨軍;經過政黨輪替後,中華民國主權屬於兩千三百萬台灣人民,國軍也是屬於台灣人民。 他表示,但國民黨、親民黨、新黨仍將政黨組織留在軍中,他奉勸正值一百一十歲生日的國民黨,早日放棄黨國一家的舊思維和封建思想,摒棄黨國主義;因為黨非國家,國家也非黨,不要再想以黨指揮國家,他要求泛藍趕快裁撤與軍隊相關的黨組織。 陳水扁說,國民黨過去執政五十年,霸占國家人民財產,應該還財於民。接著他出示國民黨的黨徽及國徽圖卡,讓現場民眾辨認,並拿出奧運會旗、警徽、各軍種總司令旗為例,說明國民黨黨徽、國徽分不清楚。 他說黨徽、國徽既然那麼容易搞錯,他要求國民黨在三個月內修改黨徽,避免魚目混珠,以作區隔;如果國民黨不改,沒關係,在十二月十一日讓民進黨成為國會穩定多數,「我們在明年二月一日新國會成立以後,再來修改國徽法。」 2019/9/2 COSCUP 2006, NTU

Frequent terms 黨,22 徽,16 民黨,12 國,21 國民黨,11 在,8 民,8 他,7 國徽,7 2019/9/2 COSCUP 2006, NTU

semi-infinite strings (sistring) Example Text Once upon a time, in a far away land … sistring 1 Once upon a time … sistring 2 nce upon a time … sistring 8 on a time, in a … sistring 11 a time, in a far … sistring 22 a far away land … Compare sistrings 22 < 11 < 2 < 8 < 1 2019/9/2 COSCUP 2006, NTU

Suffix tree If txt=t1t2...ti...tn is a string, then Ti=titi+1...tn is the suffix of txt that starts at position i, e.g. 2019/9/2 COSCUP 2006, NTU

Suffix tree 2019/9/2 COSCUP 2006, NTU

PAT-tree PATRICIA tree PATRICIA + semi-infinite strings a digital tree where the individual bits of the keys are used to decide on the branching each internal node indicates which bit of the query is used for branching absolute bit position a count of the number of bits to skip PATRICIA + semi-infinite strings a text T with n basic units u1 u2 … un u1 u2 … un …, u2 u3 … un …, u3 u4 … un …, … an end to the left but none to the right store the starting positions of semi-infinite strings in a text using PATRICIA 2019/9/2 COSCUP 2006, NTU

PAT-tree Example 1 2 2 Text 01100100010111 … sistring 1 01100100010111 … sistring 2 1100100010111 … sistring 3 100100010111 … sistring 4 00100010111 … sistring 5 0100010111 … sistring 6 100010111 … sistring 7 00010111 … sistring 8 0010111 ... 4 1 3 2 1 2 2 4 3 3 2 5 1 : external node sistring (integer displacement) total displacement of the bit to be inspected 1 1 1 1 1 1 2 1 2 1 3 2 : internal node skip counter & pointer 2019/9/2 COSCUP 2006, NTU

PAT/Suffix Tree Applications Prefix Search Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time). Longest Repeated Substring the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root. Longest Common Substring The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it. 2019/9/2 COSCUP 2006, NTU

Prefix searching idea every subtree of the PAT tree has all the sistrings with a given prefix. Search: proportional to the query length exhaust the prefix or up to external node. Search for the prefix “10100” and its answer 2019/9/2 COSCUP 2006, NTU

Longest Repetition Searching the match between two different positions of a text where this match is the longest in the entire text, e.g., 0 1 1 0 0 1 0 0 0 1 0 1 1 1 the tallest internal node gives a pair of sistrings that match for the greatest number of characters Text 01100100010111 sistring 1 01100100010111 sistring 2 1100100010111 sistring 3 100100010111 sistring 4 00100010111 sistring 5 0100010111 sistring 6 100010111 sistring 7 00010111 sistring 8 0010111 1 2 2 3 4 3 2 7 5 6 5 3 1 4 8 2019/9/2 COSCUP 2006, NTU

“Most Significant” or “Most Frequent” Matching the most frequently occurring strings within the text database, e.g., the most frequent trigram find the most frequent trigram find the largest subtree at a distance 3 characters from root 1 the tallest internal node gives a pair of sistrings that match for the greatest number of characters 2 2 3 4 3 2 i.e., 1, 2, 3 are the same for sistrings 100100010111 and 100010111 7 5 6 5 3 1 4 8 2019/9/2 COSCUP 2006, NTU

Suffix Array for Chinese keyword extraction Array of sistring Simple Construction Treat the whole text as a string (character array) Move char by char, and add to the suffix array Sort the array 2019/9/2 COSCUP 2006, NTU

Example 我要去上學校來上學 要去上學校來上學 去上學校來上學 上學校來上學 學校來上學 校來上學 來上學 上學 學 2019/9/2 COSCUP 2006, NTU

Example 我要去上學校來上學 1 -> 上, 2 2 -> 上學, 2 上學 上學校來上學 來上學 去上學校來上學 學 2019/9/2 COSCUP 2006, NTU

Example: banana Before sort After sort Common Prefix Word 2019/9/2 COSCUP 2006, NTU

Suffix Array Construction Difficulty Quick Sort is slow Comparison between strings, not BITs It’s hard to record every phases/words that freq > 2 (common prefix word) in linear time It’s easy to record the most frequency word/ longest repetition substring 2019/9/2 COSCUP 2006, NTU

Reference Ted Pedersen, N-gram Statistics Package, 2003. http://www.d.umn.edu/~tpederse/nsp.html L. F. Chien, "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", Proceedings of the ACM SIGIR International Conference on Information Retrieval, 1997. Bill Frakes, "New Indices for Text: PAT trees and PAT arrays", Information Retrieval Data Structures & Algorithms, 1992. 2019/9/2 COSCUP 2006, NTU

Experiment results and Online demo Implementation Experiment results and Online demo

Suffix (Array) Testing Environment CPU PIII-500 Memory 1GB Language C Encoding Utf8 UTF8 Chinese chars is 3 bytes/word where Unicode Chinese chars is 2 bytes/word in general 2019/9/2 COSCUP 2006, NTU

Program structure (initial version) Struct node to store word_freq struct WORD_FREQ { char *word; unsigned int length; unsigned int freq; }; Functions Standard C Library built-in qsort(); int validUTF8ChineseWord (const char*); char * moveforward (const char *, const char *); int lookupWordIndex (WORD_FREQ *, const char*, const int); … 2019/9/2 COSCUP 2006, NTU

Suffix (Array) Previous results Reversion # of words Process time(sec) Memory used r13 82,364 1,857 21,060K r15 99,072 73 6,832K r22 64,044 10 6,836K r32 48,679 4 6,736K Testing corpus 15,000 lines Utf8 size 913,034KB 2019/9/2 COSCUP 2006, NTU

Program structure (nowadays) Struct node to store word_freq struct WORD_FREQ { char *word; char *wend; // point to the end of word unsigned int length; unsigned int freq; }; Functions …. 2019/9/2 COSCUP 2006, NTU

Comparison of suffix/comKTS corpus-s (25000 lines, < 1mb) PAT-Tree CPU time 00:07:37 (457sec) Suffix CPU time 00:00:33 (33sec) corpus-m (250000 lines, 1.5mb) PAT-Tree CPU time ??? Suffix CPU time 03:26:57 (12417sec) 2019/9/2 COSCUP 2006, NTU

Program Flow Record every valid word (UTF8) address by pointer (char *) moveforward until EOF Quicksort by word prefix 我要去上學校來上學 要去上學校來上學 去上學校來上學 上學校來上學 學校來上學 校來上學 來上學 上學 學 2019/9/2 COSCUP 2006, NTU

Program Flow Record every valid word (UTF8) address by pointer (char *) moveforward until EOF Quicksort by word prefix Compare every adjacent word, if has common prefix Get WORD_FREQ index from lookupWordIndex() Insert to WORD_FREQ Quicksort by frequency Output 我要去上學校來上學 上學 上學校來上學 來上學 去上學校來上學 學 學校來上學 校來上學 要去上學校來上學 2019/9/2 COSCUP 2006, NTU

Features Prefix/Postfix filtering (may increase process time) 李遠哲院,4 李遠哲院長,4 遠哲院長,4 2019/9/2 COSCUP 2006, NTU

Suffix (Array) Results corpus # of lines Utf8 size(kb) # of words Process time(sec) Qsort time (sec) Insert word time Memory used 1MB 15,000 913,034 48,679 4 0.859 0.843 6,736K (6M) 5MB 124,999 9,021,705 482,847 30 12.476 10.906 55,588K (55M) 10MB 249,999 17,309,906 912,978 62 26.148 23.453 103M 20MB 499,999 34,336,644 1,789,545 147 57.742 54.375 203M 40MB 967,382 66,135,317 3,402,792 342 121.164 126.914 391M 2019/9/2 COSCUP 2006, NTU

Future plan Applications Algorithm libtabe? Any Apps that needs term extraction Algorithm Memory usage Faster execution time 2019/9/2 COSCUP 2006, NTU

Demo Time http://yasa.newzilla.org/ 2019/9/2 COSCUP 2006, NTU

Thank you

Questions?