Text Segmentation for Chinese Spell Checking
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1999
Kin Hong Lee, Mau Kit Michael Ng
Department of Computer Science and Engineering, The Chinese University of Hong Kong

Abstract
Chinese text has no natural delimiters, such as spaces, between words, where a word is a meaningful sequence of characters. Every Chinese character that is input must be a valid ideograph, but a sequence of Chinese characters may still not make sense. For example, when "時間" is mistyped as "時閒", both 時 and 閒 are correct characters, but the character sequence is not a correct word.

There are four main kinds of errors.
In Chinese spell checking, it cannot be assumed that texts are free of errors. There are four main kinds of errors:
1. Misuse of characters due to the same or similar sounds (按步就班 for 按部就班)
2. Misuse of characters due to similar shapes (茶, 荼)
3. Misuse of characters due to similar meanings (名『符』其實, 名『副』其實)
4. Typing errors related to Chinese input methods

The Segmentation Process and System Interaction Model
When a piece of text is to be handled, it is first divided into sentences, using punctuation marks as delimiters. Some sentences may contain symbols, alphabetic characters, and numerals. These characters are skipped without checking and serve as unnatural delimiters that further divide sentences into phrases. To reduce false alarms, occurrences of the 200 most frequently used single-character words, such as 的 (of), 一 (one), 是 (is), 不 (not), 有 (have), 在 (in), and 個 (unit, quantity), should not be treated as suspected errors.
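The preprocessing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the punctuation set, the use of the basic CJK Unified Ideographs range as the definition of a Chinese character, and the names `split_phrases` and `suspicious_singles` are all assumptions, and `FREQUENT_SINGLE` holds only the seven example characters rather than the full list of 200.

```python
import re

# Assumption: a small illustrative set of full-width punctuation marks.
SENTENCE_DELIMS = "。!?;,、:"
# Assumption: the basic CJK Unified Ideographs block stands in for "Chinese characters".
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")
# Assumption: only the seven examples from the text, not the real top-200 list.
FREQUENT_SINGLE = set("的一是不有在個")


def split_phrases(text):
    """Split text into sentences at punctuation, then into runs of Chinese characters."""
    sentences = re.split("[" + re.escape(SENTENCE_DELIMS) + "]", text)
    phrases = []
    for sentence in sentences:
        # Symbols, letters, and numerals are skipped and act as extra delimiters.
        phrases.extend(CHINESE_RUN.findall(sentence))
    return phrases


def suspicious_singles(words):
    """Return single-character words that are not on the frequent-word list."""
    return [w for w in words if len(w) == 1 and w not in FREQUENT_SINGLE]


print(split_phrases("我有3個ABC蘋果,他沒有。"))
print(suspicious_singles(["他", "的", "確實", "用", "途"]))
```

Skipping non-Chinese characters with a single regular expression keeps the two delimiter kinds (punctuation and unnatural delimiters) visible as separate steps, which mirrors the order in which the text introduces them.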

Unlike text analysis for translation or semantic analysis, a spelling checker does not always need to find a unique segmentation solution.

In this article, a Block-of-Combinations (BOC) segmentation method based on frequency of word usage is proposed. To make the method more suitable for spell checking, user interaction is also introduced into the system. Based on the user's response, the segmentation can be refined to fit the user's interpretation, and unknown words can be learned by the system during the spell checking process.

Block-of-Combinations (BOC) Segmentation Method
"誰都不知道他的確實用途"
The segmentation of this sentence may be deduced from so-called word formation power: the word formation power of the character 確 is higher than that of the character 的, so it is more likely that the character sequence 確實 is a word.

"誰都不知道他的確實用嗎"
Here the last character is changed from "途" to "嗎", and the reading of the sentence changes with it:
"誰都不知道他的確實用途" (No one knows its real use.)
"誰都不知道他的確實用嗎" (Does no one know that it is really practical?)

Recall that a semiword is a one-character word that is seldom used as a word. In the proposed BOC segmentation method, the single-character-word function U is defined in terms of the following quantities:
$f$: occurrence frequency of the character as a single-character word
$f_{CUT}$: threshold frequency below which characters are considered semiwords
$f_{SAT}$: threshold frequency above which characters often appear as single-character words

The score of a segmentation is defined as:
$$\text{Score-S} = \sum_{j} \bigl(1 - U(f_j)\bigr)$$
where j ranges over the single characters appearing in the segmentation. The best segmentation is the one with the smallest Score-S.
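To make the scoring concrete, here is a small sketch. The exact form of U is not reproduced in these slides, so the piecewise-linear shape below (0 at or below the cut threshold, 1 at or above the saturation threshold, linear in between) is purely an assumption, as are the function names and every frequency value in the example.

```python
def u(f, f_cut, f_sat):
    """ASSUMED shape of U: 0 at or below f_cut, 1 at or above f_sat, linear between."""
    if f <= f_cut:
        return 0.0
    if f >= f_sat:
        return 1.0
    return (f - f_cut) / (f_sat - f_cut)


def score_s(segmentation, single_freq, f_cut, f_sat):
    """Sum (1 - U(f_j)) over the single-character words j of a segmentation."""
    return sum(
        1.0 - u(single_freq.get(word, 0), f_cut, f_sat)
        for word in segmentation
        if len(word) == 1
    )


# Hypothetical single-character-word frequencies, chosen only for illustration.
freq = {"他": 525, "的": 5000, "途": 10}
seg_a = ["他", "的", "確實", "用途"]
seg_b = ["他", "的確", "實用", "途"]

for seg in (seg_a, seg_b):
    print(seg, score_s(seg, freq, f_cut=50, f_sat=1000))
```

Under these made-up numbers the segmentation that leaves the rarely standalone character 途 on its own receives the larger score, so the other segmentation would be preferred.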

Heuristic for Finding the Best Segmentation
Theoretically, any sequence of Chinese characters can form a word if unknown words are also considered. For a phrase of length 1, the maximum number of different segmentations is 1. For a phrase of L characters, either the whole phrase is one word, or the first word has some length k < L and the remaining L - k characters are segmented in turn, so the maximum number of segmentations N(L) satisfies:
$$N(L) = 1 + \sum_{k=1}^{L-1} N(L-k), \qquad N(1) = 1$$

Thus, the maximum number of segmentations for a phrase of L characters is:
$$N(L) = 2^{L-1}$$
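The growth in the number of segmentations can be checked by brute force. A phrase of L characters has L - 1 gaps between characters, and each gap either is or is not a word boundary, which gives 2^(L-1) possibilities. The sketch below assumes every contiguous run of characters may be a word; `all_segmentations` is a hypothetical helper name.

```python
def all_segmentations(phrase):
    """Enumerate every way to cut phrase into contiguous, non-empty pieces."""
    if not phrase:
        return [[]]
    result = []
    for k in range(1, len(phrase) + 1):
        head = phrase[:k]
        # Recurse on what remains after taking a first word of length k.
        for rest in all_segmentations(phrase[k:]):
            result.append([head] + rest)
    return result


for length in range(1, 9):
    count = len(all_segmentations("x" * length))
    print(length, count, 2 ** (length - 1))
```

The exponential growth is the reason an exhaustive search over a long phrase is impractical.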

It is observed that although there are long-distance dependency phenomena in Chinese, most ambiguities can be resolved by considering a few adjacent characters. Instead of considering all the combinations of a long phrase at one time, the segmentation process considers text under a sliding window.

In each iteration, the process looks ahead several characters and generates combinations to choose the best solution. A Terminator is the starting position of the words that follow the words considered in the current iteration.

Maximum Number of Combinations in Each Iteration
The length of every word is $Max_W$ or less, and P is the first character considered in the current iteration.
Example: "發展中國家庭電器換取外匯"
The corresponding words in the dictionary are "發展中國家", "發展", "中國", "國家", "家庭電器", "家庭", "電器", "換取", and "外匯".
$Max_W$ = 5
That is, if a word of six or more characters is encountered, the word will be chosen as the result of that iteration.
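To see how many combinations even this short example produces, the candidate segmentations can be enumerated directly. This is a brute-force sketch for illustration only, not the block-by-block procedure of the BOC method; it assumes that any single character may stand alone and that multicharacter words must appear in the dictionary listed above. The name `dict_segmentations` is hypothetical.

```python
DICTIONARY = {
    "發展中國家", "發展", "中國", "國家",
    "家庭電器", "家庭", "電器", "換取", "外匯",
}
MAX_W = 5  # maximum word length considered, as in the example


def dict_segmentations(phrase, dictionary, max_w):
    """Enumerate segmentations using dictionary words and single characters."""
    if not phrase:
        return [[]]
    result = []
    for k in range(1, min(max_w, len(phrase)) + 1):
        head = phrase[:k]
        # Assumption: a single character is always allowed as a word.
        if k == 1 or head in dictionary:
            for rest in dict_segmentations(phrase[k:], dictionary, max_w):
                result.append([head] + rest)
    return result


candidates = dict_segmentations("發展中國家庭電器換取外匯", DICTIONARY, MAX_W)
print(len(candidates))
print(["發展", "中國", "家庭電器", "換取", "外匯"] in candidates)
```

Both the reading that starts with 發展中國家 and the reading that starts with 發展 / 中國 appear among the candidates, which is the kind of ambiguity a scoring function has to resolve.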

The larger the maximum length, the more combinations have to be considered, and hence the more computation time is needed. As the algorithm uses adjacent multicharacter words for solving ambiguities, at least two more characters have to be considered.
The preferable value of $Max_W$ should be at least 5.

Conclusions
The evaluation in this paper used a total of 100 ambiguities. Of these, 68 can be solved correctly by both methods, and 5 cannot be solved correctly by either method. Of the remaining 27 ambiguities, 19 can be solved only by BOC, while 8 can be solved only by Forward Maximum Match. In total, BOC solves 87 of the 100 ambiguities and Forward Maximum Match solves 76.
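For readers unfamiliar with the baseline, Forward Maximum Match is commonly described as a greedy left-to-right scan that always takes the longest dictionary word available at the current position. The sketch below follows that common description and is not taken from the paper; the dictionary is the toy word list from the 發展中國家庭電器換取外匯 example, and the fallback to a single character when nothing matches is an assumption.

```python
DICTIONARY = {
    "發展中國家", "發展", "中國", "國家",
    "家庭電器", "家庭", "電器", "換取", "外匯",
}


def forward_maximum_match(phrase, dictionary, max_w):
    """Greedy longest-match segmentation, scanning from left to right."""
    words = []
    i = 0
    while i < len(phrase):
        # Try the longest candidate first, shrinking down to one character.
        for k in range(min(max_w, len(phrase) - i), 0, -1):
            candidate = phrase[i:i + k]
            # Assumption: an unmatched single character is emitted as-is.
            if k == 1 or candidate in dictionary:
                words.append(candidate)
                i += k
                break
    return words


print(forward_maximum_match("發展中國家庭電器換取外匯", DICTIONARY, 5))
```

On this phrase the greedy scan commits to 發展中國家 and leaves 庭 stranded as a single character, a typical failure mode of longest-match segmentation.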

Speed
However, their method is more accurate.

