Text Segmentation for Chinese Spell Checking


Journal of the American Society for Information Science, 1999. Kin Hong Lee, Mau Kit Michael Ng, Department of Computer Science and Engineering, The Chinese University of Hong Kong.

Abstract
Chinese text has no natural delimiters, such as spaces, between words (meaningful sequences of characters). Every Chinese character that is input must be a valid ideograph, but a sequence of valid characters may still make no sense. Example: “時間” mistyped as “時閒”. Both 時 and 閒 are correct characters, but the character sequence is not a correct word.
Notes: Unlike English, Chinese has no spaces to separate words. Every Chinese character has its own meaning, but that meaning can change inside a sentence. In “時閒”, typed in place of “時間”, each character is meaningful on its own, yet the two together form a misspelling.

In Chinese spell checking, it cannot be assumed that texts are free of errors. There are four main kinds of errors:
1. Misuse of characters due to the same or similar sounds (按步就班 → 按部就班).
2. Misuse of characters due to similar shapes (茶, 荼).
3. Misuse of characters due to similar meanings (名『符』其實, 名『副』其實).
4. Typing errors related to Chinese input methods.
Notes: In Chinese spell checking, the text cannot be assumed to be error-free. Errors fall into four categories: 1. same pronunciation, 2. similar shape, 3. characters misused because of similar meaning, 4. errors caused by the input method.

The Segmentation Process and System Interaction Model
When a piece of text is to be handled, it is first divided into sentences, with punctuation marks used as delimiters. Some sentences may contain symbols, alphabetic characters, and numerals. These characters are skipped without checking and serve as unnatural delimiters that further divide sentences into phrases. To reduce false alarms, occurrences of the 200 most frequently used single-character words, such as 的 (of), 一 (one), 是 (is), 不 (not), 有 (have), 在 (in), and 個 (unit, quantity), should not be considered suspected errors.
Notes: The first step is to cut the text into sentences at punctuation marks. Symbols, letters, and numerals inside a sentence are skipped without checking and act as extra delimiters that split the sentence into phrases. To reduce false alarms, the 200 most commonly used single-character words are excluded.
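The preprocessing described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it merges the sentence and phrase steps by treating every non-Chinese character (punctuation, symbols, letters, numerals) as a delimiter, it assumes "Chinese character" means the basic CJK Unified Ideographs block, and `FREQUENT_SINGLE_CHARS` is a hypothetical stand-in for the 200-word list that contains only the seven characters quoted on the slide.

```python
import re

# Hypothetical stand-in for the 200 most frequent single-character words;
# only the seven characters quoted on the slide are listed here.
FREQUENT_SINGLE_CHARS = set("的一是不有在個")

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_into_phrases(text):
    """Return maximal runs of Chinese characters; everything else is a delimiter."""
    return CHINESE_RUN.findall(text)


def suspect_single_chars(segmentation):
    """Keep single-character tokens that are not on the frequent-word list."""
    return [w for w in segmentation
            if len(w) == 1 and w not in FREQUENT_SINGLE_CHARS]


print(split_into_phrases("誰都不知道他的確實用途,2023年ABC公司。"))
print(suspect_single_chars(["他", "的", "確實", "用途"]))
```

With the full 200-word list, 他 would probably be filtered out as well; it is flagged here only because the stand-in list is so short.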

Unlike text analysis for translation or semantic analysis, a spelling checker does not always need to find a unique segmentation solution.
Notes: Spell checking differs from translation or semantic analysis in that a single, unique segmentation is sometimes unnecessary.

In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed. To make the method more suitable for spell checking, user interaction is also introduced into the system. Based on the user's response, the segmentation can be refined to fit the user's interpretation, and unknown words can be learned by the system during the spell checking process.
Notes: The article proposes a segmentation method (BOC) based on word usage frequency. User interaction is added to make it better suited to spell checking. The segmentation is refined according to the user's response, and the system learns unknown words as spell checking proceeds.

Block-of-Combinations (BOC) Segmentation Method
Example: “誰都不知道他的確實用途”. The segmentation may be deduced from so-called word formation power: the word formation power of the character 實 is higher than that of the character 的, so it is more likely that the character sequence 確實 is a word.
Notes: Segmentation is dictionary-based. Characters with higher word formation power carry more weight, so the result is 的 / 確實 rather than 的確 / 實.

“誰都不知道他的確實用途” means "No one knows its real use". When the last character is changed from “途” to “嗎”, the sentence becomes “誰都不知道他的確實用嗎”, meaning "Does no one know that it is really practical?"
Notes: Changing a single character (途 to 嗎) changes how the sentence is segmented.

Recall that a semiword is a one-character word that is seldom used as a word. In the proposed BOC segmentation method, the single-character-word function U is defined in terms of the following quantities:
f: occurrence frequency of the character as a single-character word.
f_CUT: threshold frequency below which characters are considered semiwords.
f_SAT: threshold frequency above which characters often appear as single-character words.
Notes: A semiword is a single character that is rarely used as a word by itself. For example, when “家庭” is split into “家” and “庭”, the character “庭” is a semiword. Characters below f_CUT are treated as semiwords (they tend to appear when the segmentation is wrong). Characters above f_SAT are treated as genuine single-character words (for example 的 and 是).

The score of a segmentation is defined as $\text{Score-S} = \sum_j \bigl(1 - U(f_j)\bigr)$, where j ranges over the single characters appearing in the segmentation and $f_j$ is the frequency with which character j occurs as a single-character word. The best segmentation is the one with the smallest Score-S.
Notes: The lower the score, the better the segmentation.
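The scoring rule can be illustrated with a short Python sketch. The exact form of U does not appear in this transcript, so the piecewise-linear shape used here (0 at or below f_CUT, 1 at or above f_SAT, linear in between) is an assumption made purely for illustration. The frequencies and thresholds are also hypothetical values, not figures from the paper.

```python
def single_char_word_u(f, f_cut, f_sat):
    """ASSUMED shape of U: 0 for semiwords, 1 above f_sat, linear in between."""
    if f <= f_cut:
        return 0.0
    if f >= f_sat:
        return 1.0
    return (f - f_cut) / (f_sat - f_cut)


def score_s(segmentation, single_freq, f_cut, f_sat):
    """Score-S: sum of (1 - U(f_j)) over the single characters in a segmentation."""
    return sum(
        1.0 - single_char_word_u(single_freq.get(w, 0), f_cut, f_sat)
        for w in segmentation
        if len(w) == 1
    )


def best_segmentation(candidates, single_freq, f_cut, f_sat):
    """Pick the candidate with the smallest Score-S."""
    return min(candidates, key=lambda seg: score_s(seg, single_freq, f_cut, f_sat))


# Hypothetical single-character-word frequencies and thresholds.
freq = {"他": 3000, "的": 5000, "實": 20}
candidates = [
    ["他", "的", "確實", "用途"],
    ["他", "的確", "實", "用途"],
]
print(best_segmentation(candidates, freq, f_cut=50, f_sat=1000))
```

Under these made-up numbers the second candidate is penalized because 實 falls below the cut-off and counts as a semiword, so the first candidate wins.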

Heuristic for Finding the Best Segmentation
Theoretically, any sequence of Chinese characters can form a word if unknown words are also considered. For a phrase of length 1, the maximum number of different segmentations is 1. For a phrase of L characters, the maximum grows with L.
Notes: In theory, any sequence of Chinese characters can form a word once unknown words are allowed. (1) A phrase of length L = 1 (a single character) has at most 1 segmentation. (2) A phrase of L characters has a maximum that depends on L.

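Thus, the maximum number of segmentations for a phrase of L characters is $2^{L-1}$, because each of the L − 1 gaps between adjacent characters either is or is not a word boundary.
Notes: Segmenting a phrase of length L exhaustively therefore takes a very large number of iterations, since the number of segmentations grows exponentially. As the chart shows, the maximum number of segmentations increases exponentially with phrase length, and the number of combinations can blow up. It is also observed that, although Chinese has dependency phenomena, they can be resolved by considering a few adjacent characters.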
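The exponential growth is easy to check by brute force. The sketch below enumerates every way of cutting a phrase when any character sequence may count as a word, and confirms that the count is $2^{L-1}$ for small L. The function name is hypothetical and the enumeration is for illustration only; it is exactly the exhaustive search that a practical segmenter has to avoid.

```python
def all_segmentations(phrase):
    """Yield every split of phrase into contiguous, non-empty pieces."""
    if not phrase:
        yield []
        return
    for i in range(1, len(phrase) + 1):
        for rest in all_segmentations(phrase[i:]):
            yield [phrase[:i]] + rest


for length in range(1, 9):
    count = sum(1 for _ in all_segmentations("字" * length))
    print(length, count, 2 ** (length - 1))
```

Each added character doubles the count, which is why the method restricts attention to a bounded look-ahead instead of the whole phrase.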

It is observed that, although there are long-distance dependency phenomena in Chinese, most ambiguities can be resolved by considering a few adjacent characters. Instead of considering all the combinations of a long phrase at one time, the segmentation process considers the text under a sliding window.
Notes: Most dependencies in Chinese can be handled by looking at a few neighboring characters. Rather than evaluating every combination of a long phrase at once, the method evaluates only what falls under the moving window.

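In each iteration, the process looks ahead several characters and generates combinations to choose the best solution. A Terminator is the starting position of the words that follow the words considered in the current iteration.
Notes: Each iteration takes the next few characters, generates the combinations, and picks the best one. Because several interpretations may exist, a common ending position may not be found unless longer runs of consecutive words are considered. The point where an iteration stops is the starting position of a word, so the text can be cut there and the next iteration begins from that point. For example, with m = 5 the window reaches “構”, but “構成” is a word, so the next iteration starts from “漢”.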

Maximum Number of Combinations in Each Iteration
The lengths of all the words are Max_W or less, and P is the first character considered in the current iteration.
Example: “發展中國家庭電器換取外匯”. The corresponding words in the dictionary are “發展中國家”, “發展”, “中國”, “國家”, “家庭電器”, “家庭”, “電器”, “換取”, and “外匯”, with Max_W = 5. That is, if a word of six or more characters is encountered, the word will be chosen as the result of that iteration.
Notes: All words considered have length Max_W or less, and P is the first character considered in the current iteration. A word of six or more characters, if encountered, is taken directly as the result of that iteration.
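To make the idea of a block of combinations concrete, the sketch below lists every segmentation of the example phrase in which multi-character pieces must be dictionary words of length at most Max_W, with single characters always allowed as a fallback. It is a simplified illustration under stated assumptions: it processes the whole phrase at once rather than iterating with a sliding window and Terminators, and it ranks candidates by a plain count of single characters instead of Score-S. The function names are hypothetical.

```python
DICTIONARY = {"發展中國家", "發展", "中國", "國家", "家庭電器",
              "家庭", "電器", "換取", "外匯"}


def dictionary_segmentations(phrase, dictionary, max_w=5):
    """All splits whose multi-character pieces are dictionary words (<= max_w)."""
    if not phrase:
        return [[]]
    results = []
    for length in range(1, min(max_w, len(phrase)) + 1):
        head = phrase[:length]
        # A lone character is always allowed; longer pieces must be known words.
        if length == 1 or head in dictionary:
            for rest in dictionary_segmentations(phrase[length:], dictionary, max_w):
                results.append([head] + rest)
    return results


def count_single_chars(segmentation):
    """Crude stand-in for Score-S: every single character costs 1."""
    return sum(1 for w in segmentation if len(w) == 1)


combos = dictionary_segmentations("發展中國家庭電器換取外匯", DICTIONARY)
fewest = min(count_single_chars(seg) for seg in combos)
for seg in combos:
    if count_single_chars(seg) == fewest:
        print(" / ".join(seg))
```

Candidates that begin with 發展中國家 leave 庭 stranded as a single character, so they rank below the readings that begin with 發展 / 中國.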

The larger the maximum length, the more combinations have to be considered, and hence the more computation time is needed. As the algorithm uses adjacent multicharacter words to resolve ambiguities, at least two more characters have to be considered. The preferable value of Max_W should be at least 5.
Notes: The longer the maximum length, the more time is needed, so Max_W should be kept as small as possible. Because the algorithm relies on neighboring words to resolve ambiguity, at least two more characters must be taken into account. A good Max_W is at least 5, because Chinese words almost never consist of six or more characters.

Conclusions
The paper used a total of 100 ambiguities. Of these, 68 can be solved correctly by both methods, and 5 cannot be solved correctly by either method. Of the remaining 27 ambiguities, 19 can only be solved by BOC, while 8 can only be solved by Forward Maximum Match (FMM).
Notes: Results on accuracy. In total, the proposed method solves 87 of the 100 ambiguities (68 + 19), while FMM solves only 76 (68 + 8).
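For readers unfamiliar with the baseline, Forward Maximum Match is commonly implemented as a greedy left-to-right scan that always takes the longest dictionary word starting at the current position. The sketch below is that textbook form, not necessarily the exact variant evaluated in the paper. The dictionary is the one from the earlier example slide, and `max_len` is a hypothetical parameter.

```python
DICTIONARY = {"發展中國家", "發展", "中國", "國家", "家庭電器",
              "家庭", "電器", "換取", "外匯"}


def forward_maximum_match(phrase, dictionary, max_len=5):
    """Greedy segmentation: at each position take the longest known word."""
    result = []
    i = 0
    while i < len(phrase):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(phrase) - i), 0, -1):
            candidate = phrase[i:i + length]
            if length == 1 or candidate in dictionary:
                result.append(candidate)
                i += length
                break
    return result


print(" / ".join(forward_maximum_match("發展中國家庭電器換取外匯", DICTIONARY)))
```

On this phrase the greedy scan commits to 發展中國家 and leaves 庭 as a lone character. This is the kind of ambiguity that a method comparing several combinations can resolve and a single greedy pass cannot.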

Speed
Notes: The two methods run at about the same speed, but the proposed method is more accurate.