National Taiwan Normal University Text Database Yuen-Hsien Tseng / 曾元顯 Digital Media Center National Taiwan Normal University Text Fuzzy Match Keyword/Key Phrase Extraction Fuzzy Match Applications: Music Retrieval Term Association Analysis Text Categorization Text Summarization, Clustering, and Its Applications
Values of This Course Information access service is highly profitable Text database techniques are its core technologies Share By jiyanjiang on 2006-01-03
Text Fuzzy Match Applications Levels of Text Fuzzy Match Text search, text highlight duplicate detection, record linkage Levels of Text Fuzzy Match Phonic: homonyms, transliteration match Symbolic: most common type of text match Semantic: knowledge bases are often needed
Text Fuzzy Match – Phonic (En) For English, use Soundex code a hashing system for English words used by the United States Census Bureau by Robert C. Russell (U.S. patent 1,261,167 on April 2, 1918) Soundex Algorithm begin with the first letter of a word followed by a three-digit (1-6) code that represents the (first three) remaining consonants If the word is too short to generate 3 numbers, 0 is added as needed. If the generated code is longer than 3 numbers, the extra are thrown away.
Text Fuzzy Match – Phonic (En) Examples: perl -e "use Text::Soundex; print soundex('anything')" Ans: A535 References: http://www.highprogrammer.com/alan/numbers/soundex.html http://searches.rootsweb.com/cgi-bin/Genea/soundex.sh Applications: Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI, SCI, EI)
Text Fuzzy Match – Phonic (Ch) For Chinese, convert each character into its phonograms or pinyins match their phonograms or pinyins instead of their characters Applications Transliteration match:克麗斯汀 vs 克莉思汀 Traditional/simplified character match Example: D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸頭姿" 1 Ans: <font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Ans: 國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫
Text Match (Naïve) Simple brute force (naive) algorithm: O(nm) in its worse case Text: T[n], where n is the text length Pattern: P[m], where m is the pattern length Naïve Algorithm: for (i=0; T[i] != '\0'; i++) { for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; if (P[j] == '\0') found a match } Example: looking for pattern "nano" in text "banananobano" Each row represents an iteration of the outer loop, with each character in the row representing the result of a comparison (X if the comparison was unequal) 0 1 2 3 4 5 6 7 8 9 10 11 T: b a n a n a n o b a n o i=0: X i=1: X i=2: n a n X i=3: X i=4: n a n o i=5: X i=6: n X i=7: X i=8: X i=9: n X i=10: X
Text Match (Knuth-Morris-Pratt) Knuth-Morris-Pratt Algorithm Skipping outer iterations in the above naïve algorithm If you've found a partial match of j characters starting at position i, you know what's in positions T[i]...T[i+j-1]. i=2: n a n i=3: n a n o <= no need to do this comparison i=4: n a n o <= just skip to this step Skipping inner iterations i=4: n a n o <= n has been compared at i=2, skip this n further Complexity: O(m+n) guaranteed Reference: http://www.ics.uci.edu/~eppstein/161/960227.html
Fuzzy Text Match Become popular after Prof. Sun Wu (吳昇) published his papers since 1992 complexity O(kn), k: number of allowed errors But his algorithm still is not good enough for music retrieval (as noted by Ian Witten in their papers about music retrieval) Reference: Sun Wu and Udi Manber, “Agrep—A Fast Approximate Pattern-Matching Tool” Usenix Winter 1992 - 被引用 140 次 Sun Wu and Udi Manber, ‘‘Fast text searching: allowing errors,’’ Communications of the ACM, Vol. 35 , Issue 10 (Oct. 1992) pp: 83 - 91 – cited 71 times in ACM portal Baeza-Yates, R.A. and Gonnet, G.H. A new approach to text searching. Commun. ACM 35, 10 (1992). Preliminary version appeared in Proceedings of the 12th Annual ACM-SIGIR Conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175.
Dynamic Programming for Fuzzy Match DP algorithm: 比較兩字串,允許字元的插入、刪除、代換 將兩字串在二維矩陣中展開(如下頁範例),求累計距離 d[i, j] = min( d[i-1, j] + w(A[i], 0), d[i-1, j-1] + w(A[i], B[j]), d[i, j-1] + w(0, B[j]) ) d[i, j] : accumulated distance between sequence A and B w(A[i], B[j]) : cost of substituting A[i]with B[j] w(A[i], 0) : cost of inserting A[i] w(0, B[j]) : cost of deleting B[j] d[0, 0] = 0 d[i, 0] = d[i-1, 0] + w(A[i], 0) , 1<= i <= m d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n Complexity: O(n * m), m : length of A, n : length of B (i-1,j-1) (i-1, j) (i, j-1) ( i, j )
Basic Dynamic Programming: Examples 假設文件B=adecdecf,查詢詞A=adec,則 d[4,4]為0,最小,從d[4,4]倒推回d[3,3],再倒推回d[2,2],再倒推回d[1,1],一路都有最小值0,可知match 的位置在最開頭 adec 但假若match的字串adec在B的中間,就沒有預期的最小值0 很多人都不知道如何改進,因此認為DP只能從頭開始比起,尤其是一些音樂檢索系統,只能允許旋律從頭比起,限制很大 改進辦法,在第一列插入全部都是0的列,如下頁
Improved Dynamic Programming 在第一列插入全部都是0的列,其效果猶如允許從任何位置開始比較,亦即,初始值 從 d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n 改成 d[0, j] = 0 , 1<= j <= n 最後一列的值,範圍為0到m,定義一函數: 其中 min 是最後一列的最小值 函數d(m)的範圍在0到1之間,其與m的關係如下圖,因此,取門檻值如下: if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); } if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { … } #The No Match threshold Reference: Daniel Lopresti and Jiangying Zhou, “Retrieval Strategies for Noisy Text,” Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, April 15-17, 1996, pp. 255-269.
應用範例(1/3) 目前我實做的 fuzzy highlight 規格: 底下的 Perl 範例程式,第一個參數是待標示的文件(字串),第二個參數是查詢詞(可多個,用「,」或「,」分開即可),第三個參數如果不為 0 ,則表示要做同音比對。 D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸投資" Result:<font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\>perl -s fuzzymatch5.pl “國際商業股份公司” “傷葉古份” 1 Result:國際<font color=red>商業股份</font>公司 D:\>perl -s fuzzymatch5.pl “這是CD-R光碟片” “cdr 胱跌騙” 1 Result:這是<font color=red>CD-R光碟片</font> D:\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam Tseng" Result:My name is '<font color=red>sam Tseng</font>'
應用範例(2/3) 中文同音、多詞 highlight: D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,生瘟" 1 Result:國內大型電腦集團<font color=red>赴大陸投資</font><font color=red>升溫</font> D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Result:國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫 中英文模糊 highlight: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization集團" Result:<font color=red>國內computer集團</font>赴大陸投資升溫
應用範例(3/3) 英文的同音 high light: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" Result:國內computer集團赴大陸投資升溫 錯太多,無法 high light,要改用同音比對 D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" 1 Result:<font color=red>國內computer</font>集團赴大陸投資升溫
Conclusions (1) We have shown a fuzzy match tool that is capable of matching strings with some edit distance regardless of spaces and capitalization insensitive to plurals and tense allowing homonym match converting the edit distance into a similarity score highlighting similar query string in the text Google segments the query string into terms and highlight terms in the text Applications: Music Retrieval Yuen-Hsien Tseng, "Content-Based Retrieval for Music Collections," ACM SIGIR '99, Aug. 15-19, Berkeley, U.S.A., 1999, pp.176-182. Duplicate detection Yuen-Hsien Tseng and William John Teahan, "Verifying a Chinese Collection for Text Categorization," ACM SIGIR '04, July 25 - 29, U.K., 2004, pp.556-557. 曾元顯, "分類不一致對文件自動分類效果的影響", 「大學圖書館」, 第 9 卷第 1 期, 2005 年 3 月, 頁2-19.
Conclusions (2) However, its complexity is O(mn) in time and O(n) in space limited to applications of only small text sizes Do we have better solutions? Yes, index the texts to be searched first and then match them with the query string based on some retrieval models In this way, we also are able to deal with semantic text match to some degree In short, we need information retrieval (IR) techniques