Lecture 5 : Chinese Processing


Lecture 5 : Chinese Processing
Prof. 楊立偉 (Willie Yang), wyang@ntu.edu.tw

So Far What We Have
- Document Similarity: "a Bag of Words" Model + Term Weighting (TF-IDF), Vector Space Model (VSM)
- Co-occurrence: Association, Link analysis (Co-citation & Coupling)
- Classification: Naïve Bayes, k Nearest Neighbors, Support Vector Machine

Problems in Chinese Processing "小明日記:今日王叔叔來我家玩媽媽,說我做完 作業後,可以吃點心。然後,王叔叔誇我作業做 的好,於是抱起了我媽,媽叫叔叔小心一點,之 後叔叔又親了我媽媽,也親了我。" "老師批復:拿回家讓你爸看看,是標點符號有問 題,還是你王叔叔和你媽媽有問題。"

Problems in Chinese Processing
- New words 新詞 (out-of-vocabulary, OOV)
  九把刀拍了部新電影叫等一個人咖啡 (the film title 等一個人咖啡 appears in no dictionary)
- Term segmentation 斷詞 and disambiguation 消除歧義
  我國代表現在正面臨很大的壓力
  全台大停電 (all of Taiwan 全台 loses power, or all of NTU 台大 does?)
  不可以營利為目的
  他才能非凡;他才能勝任

What's the Difference in Chinese?
- Algorithms for English are based on the "Term"
  Document → Paragraph → Sentence → Term
  Some expand to the Phrase; some switch to n-grams.
- The major differences in Chinese
  The character space is much larger than in other languages.
  There is no obvious boundary between characters / terms.

What's the Difference in Chinese?
Unit levels
  Chinese: 字(元) Character → 詞 Word → 片語 Phrase → 句子 Sentence → 段落 Paragraph → 文件 Document
  English: Letter → Word → Phrase → Sentence → Paragraph → Document
Statistics
  BIG5: about 5,000 common characters plus about 8,000 less-common ones
  Unicode: about 40,000 Han characters
  Zhuyin 注音: 376 syllables (not counting the four tones)
  CKIP lexicon: about 130,000 words of two or more characters
  Webster's Dictionary: 470,000 entries

Problem in Chinese Processing (1)
Term Segmentation 斷詞 (i.e., competing word boundaries, 搶詞問題)
Example
  我國 代表 現在 正 面臨 很 大 的 壓力
  我 到 達文西 博物館
Solution
  1. Dictionary-based: e.g., longest match first (long-word-first); see the sketch below
  2. Rule-based: e.g., grammar rules, morphological rules, disambiguation rules
  3. Statistical (trained): e.g., word-frequency methods (maximum word-frequency combination)
  4. Classification-based: recast segmentation as a classification problem
Result
  Types 3 and 4 are the mainstream today; accuracy can exceed 90%.
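As a concrete illustration of the dictionary-based long-word-first approach, here is a minimal forward maximum-matching sketch in Python; the toy dictionary is invented for the example.

    # Minimal forward maximum matching: at each position, greedily take the
    # longest dictionary word; fall back to a single character otherwise.
    def forward_maximum_match(text, dictionary, max_len=4):
        tokens, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 0, -1):
                word = text[i:i + n]
                if n == 1 or word in dictionary:  # single char is the fallback
                    tokens.append(word)
                    i += n
                    break
        return tokens

    dictionary = {"我國", "代表", "現在", "面臨", "壓力", "達文西", "博物館"}
    print(forward_maximum_match("我國代表現在正面臨很大的壓力", dictionary))
    # ['我國', '代表', '現在', '正', '面臨', '很', '大', '的', '壓力']

Long-word-first handles 達文西博物館 correctly only because 達文西 is in the dictionary, which is exactly why the OOV problem below breaks dictionary methods.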

Problem in Chinese Processing (2)
Part-of-Speech Tagging 詞性標定
Example
  我國 代表 現在 正 面臨 很 大 的 壓力
  [Figure: a lattice of candidate tags for each word, e.g. Nc, Na, Nd, Neqb, Nv, Dfa, VC, VK, VA, VH, VJ, De, Di, A, D, Da, T; the tagger must pick one tag per word]
Legend: N... noun, V... verb, D... adverb, A... adjective, T... particle

Problem in Chinese Processing (2)
Part-of-Speech Tagging 詞性標定
Solution
  1. Statistical (trained): e.g., Markov probability models
  2. Classification-based: recast POS tagging as a classification problem
Result
  Accuracy can exceed 90%, and many applications can be built on top of it.
  Table: the POS tag set of the Academia Sinica (中研院) Balanced Corpus

Problem in Chinese Processing (3)
Unknown Term 未知詞 (a.k.a. Out-of-Vocabulary, OOV)
Example
  新鮮人倪安東見面簽唱會 歌迷熱情喊凍蒜
  國際運動仲裁庭祕書長瑞伯表示世跆盟可拒仲裁
Solution
  1. Segment first, then handle the unknown spans with morphological rules or n-gram statistical learning
  2. Skip segmentation and process the text directly with a trained statistical model
Result
  Accuracy reaches 70-80% (including POS tagging).

Tool for Chinese Processing (1)
Yahoo 斷章取義 API
  http://tw.developer.yahoo.com/cas/
  Obtain an application key, then call the API (now discontinued)
  Segmentation and POS tagging
  Keyword extraction from articles

Tool for Chinese Processing (2)
eLand ETool
  A complete open API
  Main functions: automatic keywords, automatic summarization, segmentation and POS tagging, sentiment detection
  Trial demo available

Automatic Keyword Extraction 自動關鍵字
Using n-grams, find the longest character strings that most often occur together.
A count threshold defining "most often" must be specified.
Example: for BACDBCDABACD with threshold T=1, the final list is CD:3 and BACD:2, i.e., two keywords are extracted (see the sketch below).
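A minimal sketch of this procedure, assuming the rule "keep a string when it occurs more than T times and is not subsumed by a longer string with the same count":

    # Count every substring, keep those above the threshold, then drop any
    # string contained in a longer kept string with the same frequency.
    from collections import Counter

    def extract_keywords(text, T=1):
        counts = Counter(text[i:i + n]
                         for n in range(2, len(text) + 1)
                         for i in range(len(text) - n + 1))
        frequent = {s: c for s, c in counts.items() if c > T}
        return {s: c for s, c in frequent.items()
                if not any(s in t and c == frequent[t]
                           for t in frequent if len(t) > len(s))}

    print(extract_keywords("BACDBCDABACD"))  # {'CD': 3, 'BACD': 2}

CD survives because it occurs 3 times, more often than the longer BACD (2 times); BA, AC, BAC and ACD are absorbed into BACD, which has the same count.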

Automatic Summarization 自動摘要
Recombine the important sentences, using the "sentence" as the unit:
  score each sentence by the keywords it contains,
  then select a fixed proportion of the top-scoring sentences as the summary (a sketch follows).
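A minimal extractive-summarizer sketch under those rules; the keyword weights are assumed to come from the keyword-extraction step above.

    # Score sentences by the keyword weights they contain, keep the top
    # fraction, and emit the kept sentences in their original order.
    def summarize(sentences, keyword_weights, ratio=0.3):
        scored = [(sum(w for kw, w in keyword_weights.items() if kw in s), i)
                  for i, s in enumerate(sentences)]
        n_keep = max(1, int(len(sentences) * ratio))
        top = sorted(scored, reverse=True)[:n_keep]
        return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]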

HMM Segmentation 斷詞
Hidden Markov Model: a statistical model for describing sequence data.
[Figure: a state-transition diagram with hidden states S0, S1, S2, transition probabilities P00, P01, ..., P22, and observations Ot-1, Ot, Ot+1, Ot+2 emitted with observation probabilities (e.g., a Gaussian distribution).]

HMM Segmentation 斷詞
Application to Chinese segmentation and tagging
  Ex. 缺乏(V) 耐性(N) 是(SHI) 一(N) 項(N) 莫大(A) 的(D) 致命傷(N)
  State: the part of speech; Observation value: the word
  [Figure: an HMM whose state sequence V N SHI N ... emits the words 缺乏 耐性 是 一]

HMM Segmentation 斷詞
Application to Chinese segmentation and tagging: take the path with the highest probability (Viterbi decoding; a sketch follows).
[Figure: a lattice over the characters 缺 乏 耐 性 with candidate tags V, N, A, ...]
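A minimal Viterbi sketch of "take the highest-probability path"; the toy probability tables below are invented for illustration, following the slide's V/N/SHI example.

    # Viterbi decoding: keep, for every state, the best path ending there.
    def viterbi(obs, states, start_p, trans_p, emit_p):
        V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
        for o in obs[1:]:
            V.append({s: max(((p * trans_p[prev][s] * emit_p[s].get(o, 0.0),
                               path + [s])
                              for prev, (p, path) in V[-1].items()),
                             key=lambda t: t[0])
                      for s in states})
        return max(V[-1].values(), key=lambda t: t[0])

    states = ("V", "N", "SHI")
    start_p = {"V": 0.5, "N": 0.4, "SHI": 0.1}
    trans_p = {"V": {"V": 0.1, "N": 0.8, "SHI": 0.1},
               "N": {"V": 0.3, "N": 0.3, "SHI": 0.4},
               "SHI": {"V": 0.2, "N": 0.7, "SHI": 0.1}}
    emit_p = {"V": {"缺乏": 0.6}, "N": {"耐性": 0.5, "一": 0.3}, "SHI": {"是": 0.9}}
    prob, path = viterbi(["缺乏", "耐性", "是", "一"], states, start_p, trans_p, emit_p)
    print(path)  # ['V', 'N', 'SHI', 'N']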

Sentiment Detection 情緒判別
A bag-of-words approach weighted with Okapi BM25.
[Figure: the document's term set is matched against a positive-term lexicon (一流, 公道, 引人入勝, 方便, 主流, 叫好, 卓越, ...) and a negative-term lexicon (引誘, 太過, 出錯, 失常, 白目, 丟臉, 劣質, ...); the BM25 weight of each matched term uses the document length and the average document length.]
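A hedged sketch of such lexicon scoring: each matched term contributes its Okapi BM25 weight, positively or negatively. The k1 and b constants are the usual BM25 defaults; the idf values are assumed given.

    def bm25_weight(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
        # Okapi BM25 weight of one term in one document.
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

    def sentiment_score(term_freqs, doc_len, avg_doc_len, pos_terms, neg_terms, idf):
        score = 0.0
        for term, tf in term_freqs.items():
            if term in pos_terms:
                score += bm25_weight(tf, doc_len, avg_doc_len, idf.get(term, 1.0))
            elif term in neg_terms:
                score -= bm25_weight(tf, doc_len, avg_doc_len, idf.get(term, 1.0))
        return score  # > 0 leans positive, < 0 leans negative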

Sentiment Detection 情緒判別: Associated Attitude
Build a lexicon of attitude words associated with each aspect.
Example: "…這家代理商的服務一點也不周全…"
  The aspect 服務 (service) is associated with attitude words such as 親切(1.0), 周全(1.0), 敷衍(-1.0), 傲慢(-1.0).
  Attitude reversal 態度反轉: the negation 一點也不 ("not at all") flips 周全(1.0) to negative.

Discussions

Chinese Text Retrieval without Using a Dictionary (Chen et al., SIGIR '97)
Segmentation: break a string of characters into words.
Chinese characters and words
- Most Chinese words consist of two characters (趙元任 / Yuen Ren Chao): 26.7% unigrams, 69.8% bigrams, 2.7% trigrams
- (現代漢語頻率辭典, Beijing): 5% unigrams, 75% bigrams, 14% trigrams, 6% others (Liu)
Word segmentation approaches
- statistical methods, e.g., mutual information statistics
- rule-based methods, e.g., morphological rules, longest-match rules, ...
- hybrid methods
(Slides in this section by Hsin-Hsi Chen.)

Indexing Techniques
- Unigram Indexing: break a sequence of Chinese characters into individual ones and regard each character as an indexing unit. (GB2312-80: 6763 characters)
- Bigram Indexing: regard all adjacent pairs of hanzi characters in the text as indexing terms.
- Trigram Indexing: regard every consecutive sequence of three hanzi characters as an indexing term.
(Sketches of all three units follow.)
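A minimal sketch producing the three indexing units for a Chinese string; the sample text is borrowed from a later slide.

    # Character n-grams: n=1 unigram, n=2 bigram, n=3 trigram indexing units.
    def char_ngrams(text, n):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    s = "莎拉波娃居住在美国"
    print(char_ngrams(s, 1))  # every individual character
    print(char_ngrams(s, 2))  # all adjacent pairs
    print(char_ngrams(s, 3))  # all consecutive triples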


Indexing Techniques (Continued)
Statistical Indexing
- Collect the occurrence frequency, within the collection, of every Chinese character occurring at least once.
- Collect the occurrence frequency, within the collection, of every Chinese bigram occurring at least once.
- Compute the mutual information of every Chinese bigram:
    I(x,y) = log2( p(x,y) / (p(x) * p(y)) )
           = log2( (f(x,y)/N) / ((f(x)/N) * (f(y)/N)) )
           = log2( f(x,y) * N / (f(x) * f(y)) )
  Strongly related: much larger value. Not related: close to 0. Negatively related: negative.
- If x and y always co-occur, p(x,y) = p(x), so
    I(x,y) = log2( p(x) / (p(x) * p(y)) ) = log2( 1 / p(y) ) >> 0
- If x and y are independent, p(x,y) = p(x) * p(y), so
    I(x,y) = log2( p(x|y) / p(x) ) = log2(1) = 0

f(c1): the occurrence frequency of the first Chinese character of a bigram
f(c2): the occurrence frequency of the second Chinese character
f(c1c2): the occurrence frequency of the bigram
I(c1,c2): mutual information
  I(c1,c2) >> 0: c1 and c2 have a strong relationship
  I(c1,c2) ~ 0: c1 and c2 have no relationship
  I(c1,c2) << 0: c1 and c2 have a complementary relationship
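A minimal sketch of computing I(c1,c2) from raw corpus counts with the formula above; treating the corpus as one long character string is a simplifying assumption.

    import math
    from collections import Counter

    def bigram_mutual_information(corpus):
        # f(c): character frequencies; f(c1c2): adjacent-pair frequencies.
        char_f = Counter(corpus)
        bigram_f = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
        N = len(corpus)
        # I(c1,c2) = log2( f(c1c2) * N / (f(c1) * f(c2)) )
        return {bg: math.log2(f * N / (char_f[bg[0]] * char_f[bg[1]]))
                for bg, f in bigram_f.items()}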


Segmentation as Classification
Tag every character as B (begin of word), I (inside), E (end), or S (single-character word):
  我國 代表 現在 正 面臨 很大 的 壓力
  B E  B E  B E  S  B E  B E  S  B E
  九把刀 不 同意
  B I E  S  B E

Training data: input features and the target
  Features: the surrounding characters C-2, C-1, C1, C2 with their tags T-2, T-1, T1, T2; the target field 目標欄位 is the tag T0 of the current character C0.
  Example, for C0 = 代 in 我國代表現在…: C-2 = 我 (T-2 = B), C-1 = 國 (T-1 = E), C1 = 表, C2 = 現, and the target T0 = B (see the sketch below).
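A sketch of turning a tagged sentence into such training rows. The slide's table also lists T1/T2, but only the already-assigned left-context tags are available at prediction time, so this sketch uses just T-2 and T-1; "_" is an invented padding symbol.

    def make_rows(chars, tags):
        c = ["_", "_"] + list(chars) + ["_", "_"]  # pad the context window
        t = ["_", "_"] + list(tags) + ["_", "_"]
        rows = []
        for i in range(2, len(c) - 2):
            rows.append({"C-2": c[i - 2], "T-2": t[i - 2],
                         "C-1": c[i - 1], "T-1": t[i - 1],
                         "C1": c[i + 1], "C2": c[i + 2],
                         "C0": c[i], "T0": t[i]})  # T0 is the target
        return rows

    print(make_rows("我國代表現在", "BEBEBE")[2])
    # {'C-2': '我', 'T-2': 'B', 'C-1': '國', 'T-1': 'E',
    #  'C1': '表', 'C2': '現', 'C0': '代', 'T0': 'B'}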

from "Chinese Word Segmentation by Classification of Characters", 2005

Appendix : Other Language Issues
Prof. 楊立偉 (Willie Yang), wyang@ntu.edu.tw

Tokenization
Input: "Friends, Romans and Countrymen"
Output: tokens; each token is a candidate for further processing.
Normalization and language processing: which information do we discard (or keep) at this very first stage?
Indexing-time and query-time (analysis) processing must be consistent.

Tokenization
Issues in tokenization:
  Finland's capital → Finland? Finlands? Finland's?
  Hewlett-Packard → Hewlett and Packard as two tokens?
  San Francisco: one token or two? How do you decide it is one token?

Numbers
  3/12/91   Mar. 12, 1991   55 B.C.   B-52   My PGP key is 324a3df234cb23e   100.2.86.144
Often, don't index numbers as text, but they are often very useful mixed with text, e.g., the product model Nikon D700.
(One answer is using n-grams.)

Tokenization: Language issues
French: L'ensemble → one token or two? L? L'? Le? We want l'ensemble to match with un ensemble.
German noun compounds are not segmented:
  Lebensversicherungsgesellschaftsangestellter ('life insurance company employee')

Tokenization: language issues
Chinese and Japanese have no spaces between words:
  莎拉波娃现在居住在美国东南部的佛罗里达。
  A unique tokenization is not always guaranteed (the segmentation 斷詞 problem).
Further complicated in Japanese, where multiple alphabets are intermingled 混合使用 and dates/amounts come in multiple formats:
  フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  (Katakana 片假名, Hiragana 平假名, Kanji 漢字, Romaji 羅馬拼音)

Normalization
Need to "normalize" terms in the indexed text as well as the query terms into the same form: we want to match U.S.A. and USA.
Indexing-time and query-time (analysis) processing must be consistent.
An alternative is to keep multiple tokenizations: mixed language processing and an n-gram approach.

Normalization: other languages
Accents: résumé vs. resume.
The most important criterion: even in languages that standardly have accents, users often do not type them. How would you present them in the final result?
German: Tuebingen vs. Tübingen should be equivalent.
7月30日 vs. 7/30

Case folding
Reduce all letters to lower case.
Exceptions: upper case in mid-sentence? e.g., General Motors; Fed vs. fed; SAIL vs. sail.
One approach is to lower-case everything during analysis while presenting results in their original form.

Stop words
With a stop list, you exclude the commonest words from the dictionary entirely.
Intuition:
  They have little semantic content: the, a, and, to, be
  They take a lot of space: ~30% of postings for the top 30
But the trend is away from doing this, because you need them for:
  Phrase queries: "King of Denmark"
  Various song titles, etc.: "Let it be", "To be or not to be"
  "Relational" queries: "flights to London"

Thesauri and soundex
Handle synonyms 同義字 and homonyms 同音字 with hand-constructed equivalence classes, e.g., car = automobile, color = colour.
Rewrite terms to form equivalence classes. Principle: two ways, at indexing time or at query time?
(1) Index the equivalences: when a document contains automobile, index it under car as well (usually also vice versa).
(2) Expand the query: when a query contains automobile, look under car as well (see the sketch below).
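A minimal sketch of approach (2), query-time expansion; the tiny equivalence table is invented for the example.

    # Expand each query term into its hand-constructed equivalence class.
    EQUIV = {"car": {"car", "automobile"}, "color": {"color", "colour"}}

    def expand_query(terms):
        expanded = set()
        for t in terms:
            expanded |= EQUIV.get(t, {t})
        return expanded

    print(expand_query(["car", "rental"]))  # {'car', 'automobile', 'rental'}

Indexing-time expansion (1) enlarges the index but keeps queries fast; query-time expansion (2) keeps the index small at the cost of broader queries.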

Soundex
A traditional class of heuristics to expand a query into phonetic equivalents.
Language specific, mainly for names, e.g., chebyshev → tchebycheff.

Lemmatization
Reduce inflectional/variant forms to the base form, e.g.,
  am, are, is → be
  car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color
Lemmatization implies doing "proper" reduction to the dictionary headword form.

Stemming
Reduce terms to their "roots" before indexing. "Stemming" suggests crude affix chopping and is language dependent.
  e.g., automate(s), automatic, automation all reduced to automat.
  Stemmed:  for exampl compress and compress ar both accept as equival to compress
  Original: for example compressed and compression are both accepted as equivalent to compress.

Porter's algorithm
The commonest algorithm for stemming English; results suggest it is at least as good as other stemming options.
Conventions + 5 phases of reductions:
  phases are applied sequentially
  each phase consists of a set of compound/conditional rules
  sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter
  sses → ss
  ies → i
  ational → ate
  tional → tion
Weight-of-word sensitive rules, e.g., (m>1) EMENT → (null):
  replacement → replac
  cement → cement (unchanged: the remaining stem "c" would be too short, m ≤ 1)
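These rules can be checked directly against NLTK's PorterStemmer, a readily available implementation (assuming NLTK is installed):

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    for w in ["caresses", "ponies", "replacement", "cement"]:
        print(w, "->", ps.stem(w))
    # caresses -> caress
    # ponies -> poni
    # replacement -> replac
    # cement -> cement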

Other stemmers
Other stemmers exist, e.g., the Lovins stemmer:
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  Single-pass, longest-suffix removal (about 250 rules); computationally expensive.
Do stemming and other normalizations help?
  They help recall for some queries (more relevant documents are found), e.g., a search for 意大利 also finds 義大利.
  But they harm precision on others (wrong documents are retrieved), e.g., a search for Steve Jobs also matches steve's job.