Text Segmentation for Chinese Spell Checking JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE’ 1999 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. Kin Hong Lee Mau Kit Michael Ng Department of Computer Science and Engineering, The Chinese University of Hong Kong
Abstract Introduction Overview Chinese text has no natural delimiters such as spaces between words, which are meaningful sequences of characters. Every Chinese character input must be a valid ideograph, but the sequence of Chinese characters may not make sense. (EX : “時間“ is mistyped as ”時閒” , 時and閒are correct characters , although the character sequence is not a correct word.) EXPERIMENTS 中文不像英文有空白分割 且每個中文字都有他的意義,但如果放在句子中意思又可能不太一樣 時間打錯為時閒,但時、閒兩個字都有他的意思,但如果把兩個字放在一起那他就是錯字 1 / 17
There are four main kinds of errors. Introduction Overview In Chinese spell checking, it cannot be assumed that texts are free of errors. There are four main kinds of errors. Misuses of characters due to same or similar sounds (按步就班 按部就班) Misuses of characters due to similar shapes (茶,荼) Misuses of characters due to similar meanings (名『符』其實,名『副』其實) Typing errors related to Chinese input methods. EXPERIMENTS 在中國拼寫檢查,它不能被假定文本有沒有錯誤。 細部切割可以分為4種錯誤 1.相同拼音 2.相似字形 3.因為相似的辭意而字符的誤用 4.輸入法所造成的錯誤 2 / 17
The Segmentation Process and System Interaction Model Introduction Overview When a piece of text is to be handled, it is first divided into sentences. Punctuation marks are used as delimiters to separate sentences. Some of the sentences may contain symbols, alphabetic symbols, and numerals. These types of characters are skipped without checking, and are used as unnatural delimiters to further divide sentences into phrases. To reduce false alarms, occurrences of the first 200 most frequently used single-character words such as 的(of), 一(one), 是(is), 不(not), 有(have), 在(in), 個(unit, quantity) should not be considered as suspected errors. EXPERIMENTS 分割過程和系統的互動模型 第一步是先將句字做切割 標點符號做分段 有些句子包含符號,字母符號和數字。 這些類型的字符被跳過而不檢查,並作為異常的符號,可以進一步劃分成句短語。 為了減少誤報,他們將最多使用的200個常用字剔除 3 / 17
Introduction Overview Unlike text analysis for translation or semantic analysis, sometimes it is not necessary for a spelling checker to find a unique segmentation solution. EXPERIMENTS 錯字校正不像文字翻譯或語意,有時候不需要為了拼字檢查而做獨特的細分解決方案 4 / 17
Introduction Overview In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed. To make the method more suitable for spell checking, user interaction is also introduced into the system.Based on the user’s response, the segmentation can be refined to fit the user’s interpretation, and unknown words can also be learned by the system during the spell checking process. EXPERIMENTS 在這篇文章中,提出了一種基於詞的使用頻率( BOC)的分割方法。 為了使該方法更適合於拼寫檢查,與用戶互動也被引入到系統中。 根據用戶的回應,分割可以細化,以適應用戶的解釋;未知的句子,可由系統中的拼寫檢查過程而得知。 5 / 17
Block-of-Combinations (BOC) Segmentation Method Introduction Overview “誰都不知道他的確實用途” EXPERIMENTS This may be deduced from so-called word formation power, in which the word formation power of the character is higher than that of the character 的, and so it is more likely that the character sequence 確實 is a word. 根據字典做斷詞法 字符構成的詞有比較高的權重,所以分割出來會是 的"確實"而不是"的確"實 6 / 17
“誰都不知道他的確實用嗎” the last character is changed from “途” to “嗎” 7 / 17 Introduction Overview “誰都不知道他的確實用嗎” the last character is changed from “途” to “嗎” (no one knows its real use) (Does no one know that it is really practical?) EXPERIMENTS 但當修改一個字"途">"嗎" 段的詞就不一樣了 7 / 17
Introduction Overview Recall that a semiword is a one-character word that is seldom used as a word. In the BOC segmentation method proposed, single- character-word function U is defined as follows: EXPERIMENTS f : occurrence frequency of the character as a single-character word 𝑓 𝐶𝑈𝑇 : threshold frequency below the range in which the characters are considered as semiwords 𝑓 𝑆𝐴𝑇 : threshold frequency above which the characters often appear a single-character words semiword是指很少被用到的一個字詞,例如 “家庭” 被斷開為 “家” “庭” ,而這個 “庭”字就是 在所提出的 BOC 分割方法,單字符單詞函數U被定義如下: f:作為單字詞的出現頻率 𝑓_𝐶𝑈𝑇 :低於閥值頻率範圍,被認定為semiwords (切錯時出現) 𝑓_𝑆𝐴𝑇 : 高於閥值頻率範圍,被認定為它確實是一個單一字詞 (的、是 ) 8 / 17
Introduction Overview The score of a segmentation is defined as : The best segmentation is the one with the smallest Score-S. EXPERIMENTS Score-S = ∑( 1 – U( 𝑓 𝑗 ) ) j : single character appearing in the segmentation. 這邊有一個計算分數的公式: 𝑓 𝑗 是那個單一字詞出現的頻率 好的切割分數會是越低越好 9 / 17
Heuristic for Finding the Best Segmentation Introduction Overview Theoretically, any sequence of Chinese characters can form a word, if unknown words are also considered. For a phrase of length 1, the maximum number of different segmentation is 1. For a phrase of L characters, the maximum number of segmentations is: EXPERIMENTS 啟發式的尋找最佳的分割 從理論上說,中國字符的任何序列都能形成一個單詞,如果不知道的話,也算。 因此, ( 1 )如果一個短句長度為 L = 1 (即一個字), 最大可分的數量為1 。 ( 2 )當一個短句有L個字符時,那最大數量為 10 / 17
Introduction Overview Thus, the maximum number of segmentations for a phrase of L characters is: EXPERIMENTS 因此,如果長度為L的短語直接下去切那需要做的迭帶次數會非常高 分割的數量呈指數增長。 根據圖表示,分割的最大數量,是根據短語的長度以指數增加,並且有組合的數量可能會爆掉。 可以觀察到,也有中文依賴性現象,可以通過考慮幾個相鄰字符解決。 11 / 17
Introduction Overview It is observed that although there are long-distance dependency phenomena in Chinese, most of the ambiguities can be solved by considering a few adjacent characters. Instead of considering all the combinations of a long phrase at one time, the segmentation process considers text under a sliding window. EXPERIMENTS 透過觀察發現到,也有中文相依現象,可以通過考慮幾個相鄰字符解決。 不是考慮所有長語的組合在同一時間,而是根據移動窗口的地方來做考慮 12 / 17
Introduction Overview In each iteration, the process looks ahead several characters and generates combinations to choose the best solution. A Terminator is the starting position of the words that follow the words considered in the current iteration. EXPERIMENTS 在每次迭代中,用前面幾個字符,並產生組合選擇最佳的解決方案。 因為可能有幾個不同的解釋,如果不考慮長的連續的詞,那可能無法找到一個共同結束的位置 迭代停止的地方就是詞的起始位置,可以把它切開在下次做迭代 假設m=5所以會取到 “構”,但 “構成”是一個詞,所以下次開始就是從 “漢”開始 13 / 17
Maximum Number of Combinations in Each Iteration Introduction Overview The length of all the words are 𝑀𝑎𝑥 𝑊 or less, and P is the first character considered in the current iteration. EX : “發展中國家庭電器換取外匯” The corresponding words in the dictionary are “發展中國家”、“發展”、 “中國”、 “國家”、 “家庭電器” 、 “家庭” 、 “電器” 、 “換取”、 “外匯” 𝑀𝑎𝑥 𝑊 = 5 That is, if a word of six or more characters is encountered, the word will be chosen as the result of that iteration. EXPERIMENTS 最大跌代組合 所有的話的長度〖𝑀𝑎𝑥〗 _𝑊 or 少,而P是在當前迭代認為網絡第一個字符。 也就是說,如果遇到的六個或更多個字符的單詞,該單詞將被選擇作為該次迭代的結果。 14 / 17
Introduction Overview The larger the maximum length, the more combinations have to be considered, and hence, the more computation time is needed. As the algorithm uses adjacent multicharacter words for solving ambiguities, at least two more characters have to be considered. EXPERIMENTS 句子長度用長,需要的的耗費時間越久 所以必須要竟可能得讓Max(W)小 在這個演算法,利用鄰近的word來解決字義的歧異,因此必須考慮到兩個字符以上 比較好的Max(w)至少要到5,因為中文字幾乎沒要到6個字符以上組成一個詞的 然後用PDF範例做講解 the preferable value of 𝑀𝑎𝑥 𝑊 should be at least 5 15 / 17
Conclusions Introduction Overview In this paper used a total of 100 ambiguities. Among the 100 ambiguities, 68 of them can be solved correctly by both methods, and 5 of them cannot be solved correctly by both methods. Among the remaining 27 ambiguities, 19 of them can only be solved by BOC, while 8 of them can only be solved by Forward Maximum Match. EXPERIMENTS 結果 精確性 它們使用100個有歧異的句子 68個用兩種方法可以正確做出矯正 5種兩個都不行 在剩餘的27組歧義句子,其中19組可以由BOC解決,而8組FMM可以解決。 他們提出方法可以解決87 FMM只有76 16 / 17
Speed 17 / 17 Introduction Overview EXPERIMENTS 速度兩個方法其實是差不多快 但是他們的方法,準確度比較高 17 / 17