National Taiwan Normal University

Slides:

Advertisements

Similar presentations

allow v. wrong adj. What’s wrong? midnight n. look through guess v. deal n. big deal work out 允许；准许有毛病；错误的哪儿不舒服？午夜；子夜快速查看；浏览猜测；估计协议；交易重要的事.

Advertisements

第五章動詞動詞用來表示一種動作動詞有及物與不及物之分，及物動詞之後需要受詞，有的動詞甚至需要兩個受詞：一個直接受詞，一個間接受詞

Time Objectives By the end of this chapter, you will be able to

激励发现推动创新利用Web of Science数据库发现人文社会科学领域研究的新视角汤森路透知识产权与科技集团.

台灣民間部門在長期照護中角色的趨勢 4th International Symposium on Chinese Elderly

信息检索中效率问题的研究报告人：赵江华北京大学计算机科学与技术系网络与分布式系统实验室 2002年4月21日.

当代中国流行文化与对外汉语教学 Contemporary Chinese Popular Culture and Teaching Chinese as a Foreign Language Yuhong Sun 孙玉红.

個人簡介施再繁台大電機所計算機組博士.

Teaching the Chinese Copula 是 for CSL Purposes

How can we become good leamers

手持裝置應用系統之設計與未來發展黃有評大同大學資訊工程系.

AIS Project hanyu Stage 6 Writing Skills

Operating System CPU Scheduing - 3 Monday, August 11, 2008.

Text Segmentation for Chinese Spell Checking

Been During the Vacation？

生物資訊 bioinformatics 林育慶.

Manifold Learning Kai Yang

優質教育基金研究計劃研討會: 經驗分享 - 透過Web 2.0推動高小程度探究式專題研習的協作教學模式

出死入生 Cross Over from Death to Life

Zn 模式匹配与KMP算法 Zn

計算方法設計與分析 Design and Analysis of Algorithms 唐傳義

Department of Computer Science & Information Engineering

词汇语义资源在中文关系抽取中的应用报告人：钱龙华刘丹丹胡亚楠钱龙华周国栋

Chinese IIAB (IIA +IIB)

Chinese IIAB (IIA +IIB)

Guide to Freshman Life Prepared by Sam Wu.

課務組 Curriculum Section

Time Objectives By the end of this chapter, you will be able to

Chinese 101 University of Puget Sound

Word-Entity Duet Representations for Document Ranking

现代信息检索 Modern Information Retrieval

智泉國際事業有限公司(iGroup Taiwan)

“掌握科技文献助力科学研究” ——大学生信息素质培训与交流龚芙蓉武汉大学图书馆龚芙蓉武汉大学图书馆.

國立政治大學資訊科學研究所知識系統實驗室研究生：鄭雍瑋指導教授：劉吉軒博士中華民國九十五年六月三十日

子博弈完美Nash均衡我们知道，一个博弈可以有多于一个的Nash均衡。在某些情况下，我们可以按照“子博弈完美”的要求，把不符合这个要求的均衡去掉。扩展型博弈G的一部分g叫做一个子博弈，如果g包含某个节点和它所有的后继点，并且一个G的信息集或者和g不相交，或者整个含于g。一个Nash均衡称为子博弈完美的，如果它在每.

Time Objectives By the end of this chapter, you will be able to

重點資料結構之選定會影響演算法選擇對的資料結構讓您上天堂程式.

Lesson 44:Popular Sayings

String Matching Michael Tsai 2012/06/04.

一般論文的格式註：這裡指的是一般 journal papers 和 conference papers 的格式。

BEd(Special Education)

第二讲计算机信息检索概述主要内容：一信息检索的基本概念二电子资源的概念与类型三计算机信息检索系统四计算机检索技术.

Towards Emotional Awareness in Software Development Teams

服務於中國研究的網絡基礎設施 A Cyberinfrastructure for Historical China Studies

Chinese IAB (IA +IB) 11 Weather and Internet Module (L21-L22)

介面使用說明飛資得知識服務.

Sensor Networks: Applications and Services

Guide to a successful PowerPoint design – simple is best

Common Qs Regarding Earnings

Review and Analysis of the Usage of Degree Adverbs

中美图书馆之间合作的过去、现在和未来 Sino-U. S

OvidSP Introduction Flexible. Innovative. Precise.

OCLC Firstsearch 全國版推廣教育訓練課程

中考英语阅读理解完成句子命题与备考宝鸡市教育局教研室任军利

String Matching Michael Tsai 2013/05/28.

Course 10 削減與搜尋 Prune and Search

Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨

ACM Digital Library 進階利用與實作郭珮琪主講

More About Auto-encoder

My Country 我的国家.

钱炘祺一种面向实体浏览中属性融合的人机交互的设计与实现 Designing Human-Computer Interaction of Property Consolidation for Entity Browsing 钱炘祺

Selecting Reading Materials

參考資料：林秋燕曾元顯卜小蝶，Chap. 1、3 Chowdhury，Chap.9

自主练悟 ①(2017·桂林市联考)To them, life is a competition — they have to do _______ (good) than their peers to be happy. ②(2017·菏泽市模拟)People who forgive.

Introduction to Computer Security and Cryptography

以碎形正交基底和時間情境圖為基礎進行之視訊檢索 Video retrieval based on fractal orthogonal bases and temporal graph 阿凡達研究生：張敏倫指導教授：蔣依吾博士國立中山大學資訊工程學系.

高考英语作文指导福建省教研室姚瑞兰.

Section 1 Basic concepts of web page

Presentation transcript:

National Taiwan Normal University Text Database Yuen-Hsien Tseng / 曾元顯 Digital Media Center National Taiwan Normal University Text Fuzzy Match Keyword/Key Phrase Extraction Fuzzy Match Applications: Music Retrieval Term Association Analysis Text Categorization Text Summarization, Clustering, and Its Applications

Values of This Course Information access service is highly profitable Text database techniques are its core technologies Share By jiyanjiang on 2006-01-03

Text Fuzzy Match Applications Levels of Text Fuzzy Match Text search, text highlight duplicate detection, record linkage Levels of Text Fuzzy Match Phonic: homonyms, transliteration match Symbolic: most common type of text match Semantic: knowledge bases are often needed

Text Fuzzy Match – Phonic (En) For English, use Soundex code a hashing system for English words used by the United States Census Bureau by Robert C. Russell (U.S. patent 1,261,167 on April 2, 1918) Soundex Algorithm begin with the first letter of a word followed by a three-digit (1-6) code that represents the (first three) remaining consonants If the word is too short to generate 3 numbers, 0 is added as needed. If the generated code is longer than 3 numbers, the extra are thrown away.

Text Fuzzy Match – Phonic (En) Examples: perl -e "use Text::Soundex; print soundex('anything')" Ans: A535 References: http://www.highprogrammer.com/alan/numbers/soundex.html http://searches.rootsweb.com/cgi-bin/Genea/soundex.sh Applications: Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI, SCI, EI)

Text Fuzzy Match – Phonic (Ch) For Chinese, convert each character into its phonograms or pinyins match their phonograms or pinyins instead of their characters Applications Transliteration match:克麗斯汀 vs 克莉思汀 Traditional/simplified character match Example: D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸頭姿" 1 Ans: <font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Ans: 國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫

Text Match (Naïve) Simple brute force (naive) algorithm: O(nm) in its worse case Text: T[n], where n is the text length Pattern: P[m], where m is the pattern length Naïve Algorithm: for (i=0; T[i] != '\0'; i++) { for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; if (P[j] == '\0') found a match } Example: looking for pattern "nano" in text "banananobano" Each row represents an iteration of the outer loop, with each character in the row representing the result of a comparison (X if the comparison was unequal) 0 1 2 3 4 5 6 7 8 9 10 11 T: b a n a n a n o b a n o i=0: X i=1: X i=2: n a n X i=3: X i=4: n a n o i=5: X i=6: n X i=7: X i=8: X i=9: n X i=10: X

Text Match (Knuth-Morris-Pratt) Knuth-Morris-Pratt Algorithm Skipping outer iterations in the above naïve algorithm If you've found a partial match of j characters starting at position i, you know what's in positions T[i]...T[i+j-1]. i=2: n a n i=3: n a n o <= no need to do this comparison i=4: n a n o <= just skip to this step Skipping inner iterations i=4: n a n o <= n has been compared at i=2, skip this n further Complexity: O(m+n) guaranteed Reference: http://www.ics.uci.edu/~eppstein/161/960227.html

Fuzzy Text Match Become popular after Prof. Sun Wu (吳昇) published his papers since 1992 complexity O(kn), k: number of allowed errors But his algorithm still is not good enough for music retrieval (as noted by Ian Witten in their papers about music retrieval) Reference: Sun Wu and Udi Manber, “Agrep—A Fast Approximate Pattern-Matching Tool” Usenix Winter 1992 - 被引用 140 次 Sun Wu and Udi Manber, ‘‘Fast text searching: allowing errors,’’ Communications of the ACM, Vol. 35 , Issue 10 (Oct. 1992) pp: 83 - 91 – cited 71 times in ACM portal Baeza-Yates, R.A. and Gonnet, G.H. A new approach to text searching. Commun. ACM 35, 10 (1992). Preliminary version appeared in Proceedings of the 12th Annual ACM-SIGIR Conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175.

Dynamic Programming for Fuzzy Match DP algorithm: 比較兩字串，允許字元的插入、刪除、代換將兩字串在二維矩陣中展開（如下頁範例），求累計距離 d[i, j] = min( d[i-1, j] + w(A[i], 0), d[i-1, j-1] + w(A[i], B[j]), d[i, j-1] + w(0, B[j]) ) d[i, j] : accumulated distance between sequence A and B w(A[i], B[j]) : cost of substituting A[i]with B[j] w(A[i], 0) : cost of inserting A[i] w(0, B[j]) : cost of deleting B[j] d[0, 0] = 0 d[i, 0] = d[i-1, 0] + w(A[i], 0) , 1<= i <= m d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n Complexity: O(n * m), m : length of A, n : length of B (i-1,j-1) (i-1, j) (i, j-1) ( i, j )

Basic Dynamic Programming: Examples 假設文件B=adecdecf，查詢詞A=adec，則 d[4,4]為0，最小，從d[4,4]倒推回d[3,3]，再倒推回d[2,2]，再倒推回d[1,1],一路都有最小值0,可知match 的位置在最開頭 adec 但假若match的字串adec在B的中間，就沒有預期的最小值0 很多人都不知道如何改進，因此認為DP只能從頭開始比起，尤其是一些音樂檢索系統，只能允許旋律從頭比起，限制很大改進辦法，在第一列插入全部都是0的列，如下頁

Improved Dynamic Programming 在第一列插入全部都是0的列，其效果猶如允許從任何位置開始比較，亦即，初始值從 d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n 改成 d[0, j] = 0 , 1<= j <= n 最後一列的值，範圍為0到m，定義一函數：其中 min 是最後一列的最小值函數d(m)的範圍在0到1之間，其與m的關係如下圖，因此，取門檻值如下： if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); } if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { … } #The No Match threshold Reference: Daniel Lopresti and Jiangying Zhou, “Retrieval Strategies for Noisy Text,” Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, April 15-17, 1996, pp. 255-269.

應用範例（1/3）目前我實做的 fuzzy highlight 規格：底下的 Perl 範例程式，第一個參數是待標示的文件（字串），第二個參數是查詢詞（可多個，用「，」或「,」分開即可），第三個參數如果不為 0 ,則表示要做同音比對。 D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸投資" Result:<font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\>perl -s fuzzymatch5.pl “國際商業股份公司” “傷葉古份” 1 Result:國際<font color=red>商業股份</font>公司 D:\>perl -s fuzzymatch5.pl “這是CD-R光碟片” “cdr 胱跌騙” 1 Result:這是<font color=red>CD-R光碟片</font> D:\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam Tseng" Result:My name is '<font color=red>sam Tseng</font>'

應用範例（2/3）中文同音、多詞 highlight： D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,生瘟" 1 Result:國內大型電腦集團<font color=red>赴大陸投資</font><font color=red>升溫</font> D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Result:國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫中英文模糊 highlight： D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization集團" Result:<font color=red>國內computer集團</font>赴大陸投資升溫

應用範例（3/3）英文的同音 high light： D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" Result:國內computer集團赴大陸投資升溫錯太多，無法 high light，要改用同音比對 D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" 1 Result:<font color=red>國內computer</font>集團赴大陸投資升溫

Conclusions (1) We have shown a fuzzy match tool that is capable of matching strings with some edit distance regardless of spaces and capitalization insensitive to plurals and tense allowing homonym match converting the edit distance into a similarity score highlighting similar query string in the text Google segments the query string into terms and highlight terms in the text Applications: Music Retrieval Yuen-Hsien Tseng, "Content-Based Retrieval for Music Collections," ACM SIGIR '99, Aug. 15-19, Berkeley, U.S.A., 1999, pp.176-182. Duplicate detection Yuen-Hsien Tseng and William John Teahan, "Verifying a Chinese Collection for Text Categorization," ACM SIGIR '04, July 25 - 29, U.K., 2004, pp.556-557. 曾元顯, "分類不一致對文件自動分類效果的影響", 「大學圖書館」, 第 9 卷第 1 期, 2005 年 3 月, 頁2-19.

Conclusions (2) However, its complexity is O(mn) in time and O(n) in space limited to applications of only small text sizes Do we have better solutions? Yes, index the texts to be searched first and then match them with the query string based on some retrieval models In this way, we also are able to deal with semantic text match to some degree In short, we need information retrieval (IR) techniques