National Taiwan Normal University

Slides:



Advertisements
Similar presentations
allow v. wrong adj. What’s wrong? midnight n. look through guess v. deal n. big deal work out 允许;准许 有毛病;错误的 哪儿不舒服? 午夜;子夜 快速查看;浏览 猜测;估计 协议;交易 重要的事.
Advertisements

第五章 動詞 動詞用來表示一種動作 動詞有及物與不及物之分,及物動詞之後需要受詞,有的動詞甚至需要兩個受詞:一個直接受詞,一個間接受詞
Time Objectives By the end of this chapter, you will be able to
激励发现 推动创新 利用Web of Science数据库发现人文社会科学领域研究的新视角 汤森路透知识产权与科技集团.
台灣民間部門在長期照護中角色的趨勢 4th International Symposium on Chinese Elderly
信息检索中效率问题的研究 报告人:赵江华 北京大学计算机科学与技术系 网络与分布式系统实验室 2002年4月21日.
当代中国流行文化与对外汉语教学 Contemporary Chinese Popular Culture and Teaching Chinese as a Foreign Language Yuhong Sun 孙玉红.
個人簡介 施再繁 台大電機所計算機組博士.
Teaching the Chinese Copula 是 for CSL Purposes
How can we become good leamers
手持裝置應用系統之設計 與未來發展 黃有評 大同大學 資訊工程系.
AIS Project hanyu Stage 6 Writing Skills
Operating System CPU Scheduing - 3 Monday, August 11, 2008.
Text Segmentation for Chinese Spell Checking
Been During the Vacation?
生物資訊 bioinformatics 林育慶.
Title Layout.
Manifold Learning Kai Yang
優質教育基金研究計劃研討會: 經驗分享 - 透過Web 2.0推動高小程度 探究式專題研習的協作教學模式
出死入生 Cross Over from Death to Life
Zn 模式匹配与KMP算法 Zn
計算方法設計與分析 Design and Analysis of Algorithms 唐傳義
Department of Computer Science & Information Engineering
词汇语义资源在中文关系抽取中的应用 报告人:钱龙华 刘丹丹 胡亚楠 钱龙华 周国栋
Chinese IIAB (IIA +IIB)
Chinese IIAB (IIA +IIB)
Guide to Freshman Life Prepared by Sam Wu.
課務組 Curriculum Section
Time Objectives By the end of this chapter, you will be able to
Chinese 101 University of Puget Sound
Word-Entity Duet Representations for Document Ranking
现代信息检索 Modern Information Retrieval
智泉國際事業有限公司(iGroup Taiwan)
“掌握科技文献 助力科学研究” ——大学生信息素质培训与交流 龚芙蓉 武汉大学图书馆 龚芙蓉 武汉大学图书馆.
國立政治大學 資訊科學研究所 知識系統實驗室 研究生: 鄭雍瑋 指導教授: 劉吉軒 博士 中華民國九十五年六月三十日
子博弈完美Nash均衡 我们知道,一个博弈可以有多于一个的Nash均衡。在某些情况下,我们可以按照“子博弈完美”的要求,把不符合这个要求的均衡去掉。 扩展型博弈G的一部分g叫做一个子博弈,如果g包含某个节点和它所有的后继点,并且一个G的信息集或者和g不相交,或者整个含于g。 一个Nash均衡称为子博弈完美的,如果它在每.
Time Objectives By the end of this chapter, you will be able to
重點 資料結構之選定會影響演算法 選擇對的資料結構讓您上天堂 程式.
Lesson 44:Popular Sayings
String Matching Michael Tsai 2012/06/04.
一般論文的格式 註:這裡指的是一般 journal papers 和 conference papers 的格式。
BEd(Special Education)
第二讲 计算机信息检索概述 主要内容: 一 信息检索的基本概念 二 电子资源的概念与类型 三 计算机信息检索系统 四 计算机检索技术.
Towards Emotional Awareness in Software Development Teams
服務於中國研究的網絡基礎設施 A Cyberinfrastructure for Historical China Studies
Chinese IAB (IA +IB) 11 Weather and Internet Module (L21-L22)
介面使用說明 飛資得知識服務.
Sensor Networks: Applications and Services
Guide to a successful PowerPoint design – simple is best
Common Qs Regarding Earnings
Review and Analysis of the Usage of Degree Adverbs
中美图书馆之间合作的过去、现在和未来 Sino-U. S
OvidSP Introduction Flexible. Innovative. Precise.
OCLC Firstsearch 全國版推廣教育訓練課程
中考英语阅读理解 完成句子命题与备考 宝鸡市教育局教研室 任军利
String Matching Michael Tsai 2013/05/28.
Course 10 削減與搜尋 Prune and Search
Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨
ACM Digital Library 進階利用與實作 郭珮琪主講
More About Auto-encoder
My Country 我 的 国 家.
钱炘祺 一种面向实体浏览中属性融合的人机交互的设计与实现 Designing Human-Computer Interaction of Property Consolidation for Entity Browsing 钱炘祺
Selecting Reading Materials
參考資料: 林秋燕 曾元顯 卜小蝶,Chap. 1、3 Chowdhury,Chap.9
自主练悟 ①(2017·桂林市联考)To them, life is a competition — they have to do _______ (good) than their peers to be happy. ②(2017·菏泽市模拟)People who forgive.
Introduction to Computer Security and Cryptography
以碎形正交基底和時間情境圖為基礎進行之視訊檢索 Video retrieval based on fractal orthogonal bases and temporal graph 阿凡達 研究生:張敏倫 指導教授:蔣依吾博士 國立中山大學資訊工程學系.
高考英语作文指导 福建省教研室 姚瑞兰.
Section 1 Basic concepts of web page
Presentation transcript:

National Taiwan Normal University Text Database Yuen-Hsien Tseng / 曾元顯 Digital Media Center National Taiwan Normal University Text Fuzzy Match Keyword/Key Phrase Extraction Fuzzy Match Applications: Music Retrieval Term Association Analysis Text Categorization Text Summarization, Clustering, and Its Applications

Values of This Course Information access service is highly profitable Text database techniques are its core technologies Share By jiyanjiang on 2006-01-03

Text Fuzzy Match Applications Levels of Text Fuzzy Match Text search, text highlight duplicate detection, record linkage Levels of Text Fuzzy Match Phonic: homonyms, transliteration match Symbolic: most common type of text match Semantic: knowledge bases are often needed

Text Fuzzy Match – Phonic (En) For English, use Soundex code a hashing system for English words used by the United States Census Bureau by Robert C. Russell (U.S. patent 1,261,167 on April 2, 1918) Soundex Algorithm begin with the first letter of a word followed by a three-digit (1-6) code that represents the (first three) remaining consonants If the word is too short to generate 3 numbers, 0 is added as needed. If the generated code is longer than 3 numbers, the extra are thrown away.

Text Fuzzy Match – Phonic (En) Examples: perl -e "use Text::Soundex; print soundex('anything')" Ans: A535 References: http://www.highprogrammer.com/alan/numbers/soundex.html http://searches.rootsweb.com/cgi-bin/Genea/soundex.sh Applications: Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI, SCI, EI)

Text Fuzzy Match – Phonic (Ch) For Chinese, convert each character into its phonograms or pinyins match their phonograms or pinyins instead of their characters Applications Transliteration match:克麗斯汀 vs 克莉思汀 Traditional/simplified character match Example: D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸頭姿" 1 Ans: <font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Ans: 國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫

Text Match (Naïve) Simple brute force (naive) algorithm: O(nm) in its worse case Text: T[n], where n is the text length Pattern: P[m], where m is the pattern length Naïve Algorithm: for (i=0; T[i] != '\0'; i++) { for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; if (P[j] == '\0') found a match } Example: looking for pattern "nano" in text "banananobano" Each row represents an iteration of the outer loop, with each character in the row representing the result of a comparison (X if the comparison was unequal) 0 1 2 3 4 5 6 7 8 9 10 11 T: b a n a n a n o b a n o i=0: X i=1: X i=2: n a n X i=3: X i=4: n a n o i=5: X i=6: n X i=7: X i=8: X i=9: n X i=10: X

Text Match (Knuth-Morris-Pratt) Knuth-Morris-Pratt Algorithm Skipping outer iterations in the above naïve algorithm If you've found a partial match of j characters starting at position i, you know what's in positions T[i]...T[i+j-1]. i=2: n a n i=3: n a n o <= no need to do this comparison i=4: n a n o <= just skip to this step Skipping inner iterations i=4: n a n o <= n has been compared at i=2, skip this n further Complexity: O(m+n) guaranteed Reference: http://www.ics.uci.edu/~eppstein/161/960227.html

Fuzzy Text Match Become popular after Prof. Sun Wu (吳昇) published his papers since 1992 complexity O(kn), k: number of allowed errors But his algorithm still is not good enough for music retrieval (as noted by Ian Witten in their papers about music retrieval) Reference: Sun Wu and Udi Manber, “Agrep—A Fast Approximate Pattern-Matching Tool” Usenix Winter 1992 - 被引用 140 次 Sun Wu and Udi Manber, ‘‘Fast text searching: allowing errors,’’ Communications of the ACM, Vol. 35 , Issue 10 (Oct. 1992) pp: 83 - 91 – cited 71 times in ACM portal Baeza-Yates, R.A. and Gonnet, G.H. A new approach to text searching. Commun. ACM 35, 10 (1992). Preliminary version appeared in Proceedings of the 12th Annual ACM-SIGIR Conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175.

Dynamic Programming for Fuzzy Match DP algorithm: 比較兩字串,允許字元的插入、刪除、代換 將兩字串在二維矩陣中展開(如下頁範例),求累計距離 d[i, j] = min( d[i-1, j] + w(A[i], 0), d[i-1, j-1] + w(A[i], B[j]), d[i, j-1] + w(0, B[j]) ) d[i, j] : accumulated distance between sequence A and B w(A[i], B[j]) : cost of substituting A[i]with B[j] w(A[i], 0) : cost of inserting A[i] w(0, B[j]) : cost of deleting B[j] d[0, 0] = 0 d[i, 0] = d[i-1, 0] + w(A[i], 0) , 1<= i <= m d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n Complexity: O(n * m), m : length of A, n : length of B (i-1,j-1) (i-1, j) (i, j-1) ( i, j )

Basic Dynamic Programming: Examples 假設文件B=adecdecf,查詢詞A=adec,則 d[4,4]為0,最小,從d[4,4]倒推回d[3,3],再倒推回d[2,2],再倒推回d[1,1],一路都有最小值0,可知match 的位置在最開頭 adec 但假若match的字串adec在B的中間,就沒有預期的最小值0 很多人都不知道如何改進,因此認為DP只能從頭開始比起,尤其是一些音樂檢索系統,只能允許旋律從頭比起,限制很大 改進辦法,在第一列插入全部都是0的列,如下頁

Improved Dynamic Programming 在第一列插入全部都是0的列,其效果猶如允許從任何位置開始比較,亦即,初始值 從 d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n 改成 d[0, j] = 0 , 1<= j <= n 最後一列的值,範圍為0到m,定義一函數: 其中 min 是最後一列的最小值 函數d(m)的範圍在0到1之間,其與m的關係如下圖,因此,取門檻值如下: if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); } if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { … } #The No Match threshold Reference: Daniel Lopresti and Jiangying Zhou, “Retrieval Strategies for Noisy Text,” Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, April 15-17, 1996, pp. 255-269.

應用範例(1/3) 目前我實做的 fuzzy highlight 規格: 底下的 Perl 範例程式,第一個參數是待標示的文件(字串),第二個參數是查詢詞(可多個,用「,」或「,」分開即可),第三個參數如果不為 0 ,則表示要做同音比對。 D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸投資" Result:<font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\>perl -s fuzzymatch5.pl “國際商業股份公司” “傷葉古份” 1 Result:國際<font color=red>商業股份</font>公司 D:\>perl -s fuzzymatch5.pl “這是CD-R光碟片” “cdr 胱跌騙” 1 Result:這是<font color=red>CD-R光碟片</font> D:\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam Tseng" Result:My name is '<font color=red>sam Tseng</font>'

應用範例(2/3) 中文同音、多詞 highlight: D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,生瘟" 1 Result:國內大型電腦集團<font color=red>赴大陸投資</font><font color=red>升溫</font> D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Result:國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫 中英文模糊 highlight: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization集團" Result:<font color=red>國內computer集團</font>赴大陸投資升溫

應用範例(3/3) 英文的同音 high light: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" Result:國內computer集團赴大陸投資升溫 錯太多,無法 high light,要改用同音比對 D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" 1 Result:<font color=red>國內computer</font>集團赴大陸投資升溫

Conclusions (1) We have shown a fuzzy match tool that is capable of matching strings with some edit distance regardless of spaces and capitalization insensitive to plurals and tense allowing homonym match converting the edit distance into a similarity score highlighting similar query string in the text Google segments the query string into terms and highlight terms in the text Applications: Music Retrieval Yuen-Hsien Tseng, "Content-Based Retrieval for Music Collections," ACM SIGIR '99, Aug. 15-19, Berkeley, U.S.A., 1999, pp.176-182. Duplicate detection Yuen-Hsien Tseng and William John Teahan, "Verifying a Chinese Collection for Text Categorization," ACM SIGIR '04, July 25 - 29, U.K., 2004, pp.556-557. 曾元顯, "分類不一致對文件自動分類效果的影響", 「大學圖書館」, 第 9 卷第 1 期, 2005 年 3 月, 頁2-19.

Conclusions (2) However, its complexity is O(mn) in time and O(n) in space limited to applications of only small text sizes Do we have better solutions? Yes, index the texts to be searched first and then match them with the query string based on some retrieval models In this way, we also are able to deal with semantic text match to some degree In short, we need information retrieval (IR) techniques