Text Segmentation for Chinese Spell Checking

Slides:

Advertisements

Similar presentations

Chapter 2 Combinatorial Analysis 主講人 : 虞台文. Content Basic Procedure for Probability Calculation Counting – Ordered Samples with Replacement – Ordered.

Advertisements

期末考试作文讲解 % 的同学赞成住校 30% 的学生反对住校 1. 有利于培养我们良好的学习和生活习惯； 1. 学生住校不利于了解外界信息； 2 可与老师及同学充分交流有利于共同进步。 2. 和家人交流少。在寄宿制高中，大部分学生住校，但仍有一部分学生选择走读。你校就就此开展了一次问卷调查，主题为.

考研英语复试口语准备考研英语口语复试. 考研英语复试口语准备服装谦虚、微笑、自信态度积极乐观沉稳.

《互联网运营管理》系列课程觉浅网荣誉出品

黄国文中山大学通用型英语人才培养中的语言学教学黄国文中山大学

Time Objectives By the end of this chapter, you will be able to

中四升學講座中五 2007年12月8日.

第十九课旅行.

CHIN 3010: reading & writing

即兴中文讲演比赛 On-Site Speech 新型比赛项目

雅思大作文的结构 Presented by: 总统秘书王富贵.

BRIEF GUIDELINE FOR AUTHOR PREPARING PAPER FOR PUBLICATION

B型肝炎帶原之肝細胞癌患者接受肝動脈栓塞治療後血液中DNA之定量分析

視聽資料之定義視聽資料非書資料多媒體資料.

Chapter 8 Liner Regression and Correlation 第八章直线回归和相关

Chaoping Li, Zhejiang University

AIS Project hanyu Stage 6 Writing Skills

Welcome Welcome to my class Welcome to my class!.

Leftmost Longest Regular Expression Matching in Reconfigurable Logic

Euler’s method of construction of the Exponential function

Applications of Digital Signal Processing

指導教授：許子衡教授報告學生：翁偉傑 Qiangyuan Yu , Geert Heijenk

Population proportion and sample proportion

考试与考生 --不对等与对等邹申上海外国语大学

Differential Equations (DE)

Hui-Ju Chuang University of Hawaii-Manoa

Unit title: 买东西 - Shopping

圖形溝通大師 Microsoft Visio 2003

Department of Computer Science & Information Engineering

创建型设计模式.

Unit 2 Key points summary.

Time Objectives By the end of this chapter, you will be able to

Area of interaction focus

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi

第14章竞争市场上的企业上海杉达学院国贸系.

實驗1 Streaking isolation of bacteria 細菌劃線分離

Outrigger Optimization for Super Tall Structures Under Multiple Constraints 多约束条件下超高结构伸臂系统优化.

Time Objectives By the end of this chapter, you will be able to

作者: DALE GOODHUE 來源: JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION

Formal Pivot to both Language and Intelligence in Science

Lesson 44:Popular Sayings

Chapter 3 Nationality Objectives:

第十五课：在医院看病.

職業 Random Slide Show Menu

How to be lucky 初中基础 (2356期 ) 6版.

Interesting or inspiring sequences

Version Control System Based DSNs

研究技巧與論文撰寫方法中央大學資管系陳彥良.

高性能计算与天文技术联合实验室智能与计算学部天津大学

Mechanics Exercise Class Ⅰ

Maintaining Frequent Itemsets over High-Speed Data Streams

Guide to a successful PowerPoint design – simple is best

Interesting or inspiring sequences

Safety science and engineering department

中国科学技术大学计算机系陈香兰 2013Fall 第七讲存储器管理中国科学技术大学计算机系陈香兰 2013Fall.

虚拟仪器 virtual instrument

关联词 Writing.

Review and Analysis of the Usage of Degree Adverbs

從 ER 到 Logical Schema ──兼談Schema Integration

Simple Regression (簡單迴歸分析)

中考英语阅读理解完成句子命题与备考宝鸡市教育局教研室任军利

高考应试作文写作训练 5. 正反观点对比.

Outline Overview of this paper Motivation and Initialization

Efficient Query Relaxation for Complex Relationship Search on Graph Data 李舒馨

名词从句(2).

动词不定式（6）.

2012 程式設計比賽 Openfind 天使帝國 v2.0 (蓋亞的紋章).

2 Number Systems, Operations, and Codes

MGT 213 System Management Server的昨天，今天和明天

Hybrid fractal zerotree wavelet image coding

Presentation transcript:

Text Segmentation for Chinese Spell Checking JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE’ 1999 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. Kin Hong Lee Mau Kit Michael Ng Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract Introduction Overview Chinese text has no natural delimiters such as spaces between words, which are meaningful sequences of characters. Every Chinese character input must be a valid ideograph, but the sequence of Chinese characters may not make sense. (EX : “時間“ is mistyped as ”時閒” , 時and閒are correct characters , although the character sequence is not a correct word.) EXPERIMENTS 中文不像英文有空白分割且每個中文字都有他的意義，但如果放在句子中意思又可能不太一樣時間打錯為時閒，但時、閒兩個字都有他的意思，但如果把兩個字放在一起那他就是錯字 1 / 17

There are four main kinds of errors. Introduction Overview In Chinese spell checking, it cannot be assumed that texts are free of errors. There are four main kinds of errors. Misuses of characters due to same or similar sounds (按步就班  按部就班) Misuses of characters due to similar shapes (茶,荼) Misuses of characters due to similar meanings (名『符』其實，名『副』其實) Typing errors related to Chinese input methods. EXPERIMENTS 在中國拼寫檢查，它不能被假定文本有沒有錯誤。細部切割可以分為4種錯誤 1.相同拼音 2.相似字形 3.因為相似的辭意而字符的誤用 4.輸入法所造成的錯誤 2 / 17

The Segmentation Process and System Interaction Model Introduction Overview When a piece of text is to be handled, it is ﬁrst divided into sentences. Punctuation marks are used as delimiters to separate sentences. Some of the sentences may contain symbols, alphabetic symbols, and numerals. These types of characters are skipped without checking, and are used as unnatural delimiters to further divide sentences into phrases. To reduce false alarms, occurrences of the ﬁrst 200 most frequently used single-character words such as 的(of), 一(one), 是(is), 不(not), 有(have), 在(in), 個(unit, quantity) should not be considered as suspected errors. EXPERIMENTS 分割過程和系統的互動模型第一步是先將句字做切割標點符號做分段有些句子包含符號，字母符號和數字。這些類型的字符被跳過而不檢查，並作為異常的符號，可以進一步劃分成句短語。為了減少誤報，他們將最多使用的200個常用字剔除 3 / 17

Introduction Overview Unlike text analysis for translation or semantic analysis, sometimes it is not necessary for a spelling checker to ﬁnd a unique segmentation solution. EXPERIMENTS 錯字校正不像文字翻譯或語意，有時候不需要為了拼字檢查而做獨特的細分解決方案 4 / 17

Introduction Overview In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed. To make the method more suitable for spell checking, user interaction is also introduced into the system.Based on the user’s response, the segmentation can be refined to fit the user’s interpretation, and unknown words can also be learned by the system during the spell checking process. EXPERIMENTS 在這篇文章中，提出了一種基於詞的使用頻率（ BOC）的分割方法。為了使該方法更適合於拼寫檢查，與用戶互動也被引入到系統中。根據用戶的回應，分割可以細化，以適應用戶的解釋；未知的句子，可由系統中的拼寫檢查過程而得知。 5 / 17

Block-of-Combinations (BOC) Segmentation Method Introduction Overview “誰都不知道他的確實用途” EXPERIMENTS This may be deduced from so-called word formation power, in which the word formation power of the character is higher than that of the character 的, and so it is more likely that the character sequence 確實 is a word. 根據字典做斷詞法字符構成的詞有比較高的權重，所以分割出來會是的"確實"而不是"的確"實 6 / 17

“誰都不知道他的確實用嗎” the last character is changed from “途” to “嗎” 7 / 17 Introduction Overview “誰都不知道他的確實用嗎” the last character is changed from “途” to “嗎” (no one knows its real use) (Does no one know that it is really practical?) EXPERIMENTS 但當修改一個字"途">"嗎" 段的詞就不一樣了 7 / 17

Introduction Overview Recall that a semiword is a one-character word that is seldom used as a word. In the BOC segmentation method proposed, single- character-word function U is defined as follows: EXPERIMENTS f : occurrence frequency of the character as a single-character word 𝑓 𝐶𝑈𝑇 : threshold frequency below the range in which the characters are considered as semiwords 𝑓 𝑆𝐴𝑇 : threshold frequency above which the characters often appear a single-character words semiword是指很少被用到的一個字詞，例如 “家庭”  被斷開為 “家” “庭” ，而這個 “庭”字就是在所提出的 BOC 分割方法，單字符單詞函數U被定義如下： f：作為單字詞的出現頻率 𝑓_𝐶𝑈𝑇 ：低於閥值頻率範圍，被認定為semiwords (切錯時出現) 𝑓_𝑆𝐴𝑇 ：高於閥值頻率範圍，被認定為它確實是一個單一字詞 (的、是 ) 8 / 17

Introduction Overview The score of a segmentation is defined as : The best segmentation is the one with the smallest Score-S. EXPERIMENTS Score-S = ∑( 1 – U( 𝑓 𝑗 ) ) j : single character appearing in the segmentation. 這邊有一個計算分數的公式: 𝑓 𝑗 是那個單一字詞出現的頻率好的切割分數會是越低越好 9 / 17

Heuristic for Finding the Best Segmentation Introduction Overview Theoretically, any sequence of Chinese characters can form a word, if unknown words are also considered. For a phrase of length 1, the maximum number of different segmentation is 1. For a phrase of L characters, the maximum number of segmentations is: EXPERIMENTS 啟發式的尋找最佳的分割從理論上說，中國字符的任何序列都能形成一個單詞，如果不知道的話，也算。因此， ( 1 )如果一個短句長度為 L = 1 (即一個字), 最大可分的數量為1 。 ( 2 )當一個短句有L個字符時，那最大數量為 10 / 17

Introduction Overview Thus, the maximum number of segmentations for a phrase of L characters is: EXPERIMENTS 因此，如果長度為L的短語直接下去切那需要做的迭帶次數會非常高分割的數量呈指數增長。根據圖表示，分割的最大數量，是根據短語的長度以指數增加，並且有組合的數量可能會爆掉。可以觀察到，也有中文依賴性現象,可以通過考慮幾個相鄰字符解決。 11 / 17

Introduction Overview It is observed that although there are long-distance dependency phenomena in Chinese, most of the ambiguities can be solved by considering a few adjacent characters. Instead of considering all the combinations of a long phrase at one time, the segmentation process considers text under a sliding window. EXPERIMENTS 透過觀察發現到，也有中文相依現象，可以通過考慮幾個相鄰字符解決。不是考慮所有長語的組合在同一時間，而是根據移動窗口的地方來做考慮 12 / 17

Introduction Overview In each iteration, the process looks ahead several characters and generates combinations to choose the best solution. A Terminator is the starting position of the words that follow the words considered in the current iteration. EXPERIMENTS 在每次迭代中，用前面幾個字符，並產生組合選擇最佳的解決方案。因為可能有幾個不同的解釋，如果不考慮長的連續的詞，那可能無法找到一個共同結束的位置迭代停止的地方就是詞的起始位置，可以把它切開在下次做迭代假設m=5所以會取到 “構”，但 “構成”是一個詞，所以下次開始就是從 “漢”開始 13 / 17

Maximum Number of Combinations in Each Iteration Introduction Overview The length of all the words are 𝑀𝑎𝑥 𝑊 or less, and P is the ﬁrst character considered in the current iteration. EX : “發展中國家庭電器換取外匯” The corresponding words in the dictionary are “發展中國家”、“發展”、 “中國”、 “國家”、 “家庭電器” 、 “家庭” 、 “電器” 、 “換取”、 “外匯” 𝑀𝑎𝑥 𝑊 = 5 That is, if a word of six or more characters is encountered, the word will be chosen as the result of that iteration. EXPERIMENTS 最大跌代組合所有的話的長度〖𝑀𝑎𝑥〗 _𝑊 or 少，而P是在當前迭代認為網絡第一個字符。也就是說，如果遇到的六個或更多個字符的單詞，該單詞將被選擇作為該次迭代的結果。 14 / 17

Introduction Overview The larger the maximum length, the more combinations have to be considered, and hence, the more computation time is needed. As the algorithm uses adjacent multicharacter words for solving ambiguities, at least two more characters have to be considered. EXPERIMENTS 句子長度用長，需要的的耗費時間越久所以必須要竟可能得讓Max(W)小在這個演算法，利用鄰近的word來解決字義的歧異，因此必須考慮到兩個字符以上比較好的Max(w)至少要到5，因為中文字幾乎沒要到6個字符以上組成一個詞的然後用PDF範例做講解 the preferable value of 𝑀𝑎𝑥 𝑊 should be at least 5 15 / 17

Conclusions Introduction Overview In this paper used a total of 100 ambiguities. Among the 100 ambiguities, 68 of them can be solved correctly by both methods, and 5 of them cannot be solved correctly by both methods. Among the remaining 27 ambiguities, 19 of them can only be solved by BOC, while 8 of them can only be solved by Forward Maximum Match. EXPERIMENTS 結果精確性它們使用100個有歧異的句子 68個用兩種方法可以正確做出矯正 5種兩個都不行在剩餘的27組歧義句子，其中19組可以由BOC解決，而8組FMM可以解決。他們提出方法可以解決87 FMM只有76 16 / 17

Speed 17 / 17 Introduction Overview EXPERIMENTS 速度兩個方法其實是差不多快但是他們的方法，準確度比較高 17 / 17