Published by Mervin Harper. Modified 6 years ago.
1
The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0
Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu Institute of Information Science, Academia Sinica ROCLING XVI September
2
Agenda
Motivation
Previous work
Named entity categories
Annotating process
Conclusion and future work
3
Motivation
Named entity recognition (NER) helps with the unknown proper noun problem and supports text mining.
A tagged corpus is useful for building NER systems and can be used for evaluation.
4
Related work
Multilingual Entity Task (MET-2)
  Japanese, Simplified Chinese, and Spanish
  Includes person, organization, and location
The Automatic Content Extraction (ACE) program
  English, Simplified Chinese, and Arabic
  Includes person, organization, location, GPE, facility, etc.
Information Retrieval and Extraction Exercise (IREX)
  Japanese
  Includes person, organization, location, artifact, etc.
Shared tasks of CoNLL 2002 and 2003
  English, German, Dutch, and Spanish
5
Named entity categories
Person name - PER
Location name - LOC
Organization name - ORG
6
Named entity categories
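A person name represents a real person.
  行政院發言人 [陳其邁<PER>] 上午指出 ("Executive Yuan spokesperson Chen Chi-mai pointed out this morning")
A location name should denote a place's geographical position.
  [台北縣<LOC>] 三百多萬人口 ("Taipei County's population of over three million")
An organization is an entity that has the ability to execute plans and projects.
  因此會爭取與 [經濟部<ORG>] 63個駐外機構合作 ("and will therefore seek cooperation with the 63 overseas offices of the Ministry of Economic Affairs")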
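The bracketed notation used in these examples, [surface<TAG>], can be read mechanically. The following is a minimal Python sketch, assuming that annotations are never nested and that tags consist only of uppercase letters. The function name parse_tagged and the span layout are illustrative choices, not part of CNEC.

```python
import re
from typing import List, Tuple

# Matches one bracketed annotation such as [陳其邁<PER>].
# Assumption: no nesting, and the tag is a run of uppercase letters.
ANNOTATION = re.compile(r"\[([^\[\]<>]+)<([A-Z]+)>\]")

Span = Tuple[int, int, str, str]  # (start, end, surface, tag), end exclusive


def parse_tagged(sentence: str) -> Tuple[str, List[Span]]:
    """Return the untagged text and the entity spans found in `sentence`.

    Offsets index into the returned plain text, not into the tagged input.
    """
    plain_parts: List[str] = []
    entities: List[Span] = []
    cursor = 0      # position in the tagged input
    plain_len = 0   # length of the plain text emitted so far
    for match in ANNOTATION.finditer(sentence):
        before = sentence[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        surface, tag = match.group(1), match.group(2)
        entities.append((plain_len, plain_len + len(surface), surface, tag))
        plain_parts.append(surface)
        plain_len += len(surface)
        cursor = match.end()
    plain_parts.append(sentence[cursor:])
    return "".join(plain_parts), entities


plain, entities = parse_tagged("行政院發言人 [陳其邁<PER>] 上午指出")
print(plain)     # 行政院發言人 陳其邁 上午指出
print(entities)  # [(7, 10, '陳其邁', 'PER')]
```

Keeping offsets relative to the plain text makes the spans usable against the untagged sentence, which is what an NER system would see as input.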
7
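What is a “named entity (NE)”?
Our definition: named entities should be proper nouns.
A word that does not refer to a unique, existing entity should not be annotated.
Examples:
  任憑 被告 和 辯護律師 巧言飾辯 (X) ("no matter how the defendant and the defense lawyer argue with clever words")
  原來是幾個外勞坐在 公園 裡喝酒談天 (X) ("it turned out to be several foreign workers sitting in the park, drinking and chatting")
  國內新聞部 立刻進行處理 (X) ("the domestic news department handled it immediately")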
8
How to determine the boundary and category of a named entity?
Two main ways to determine NEs in a sentence:
By inner feature (the literal meaning of the name itself)
  吳舜文 從年少開始 ("Wu Shun-wen, from a young age")
By outer feature (the context)
  射手座的 景心潔 笑說 ("景心潔, a Sagittarius, said with a laugh")
Named entities that can only be found by understanding the whole sentence or paragraph are ignored.
  思科 也積極參與 ("Cisco is also actively participating")
9
Confusion between Location and Organization
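Confusion (borrowing) between location and organization names is always present.
  總統府 上午仍然低調回應 ("the Presidential Office still responded in a low-key manner this morning")
  台北市政府同意 總統府 前的集會遊行 ("the Taipei City Government approved the rally and march in front of the Presidential Office")
Location as Organization (LAO) and Organization as Location (OAL) are proposed to solve this problem.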
10
Location as Organization (LAO) and Organization as Location (OAL)
LAO and OAL are used to mark the borrowing between LOC and ORG.
  [總統府<ORG>] 上午仍然低調回應
  台北市政府同意 [總統府<OAL>] 前的集會遊行
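One way to read the two borrowing tags is as a pair: what the name literally denotes, and how it is used in context. The following Python sketch encodes that reading. The table TAG_ROLES and the helper names are illustrative assumptions, not part of the CNEC specification. Such a mapping could be useful when collapsing the five tags back to the three basic categories for comparison with other corpora.

```python
from typing import Dict, Tuple

# tag -> (literal type of the name, type it is used as in context)
# Assumption: LAO and OAL are interpreted exactly as their names suggest.
TAG_ROLES: Dict[str, Tuple[str, str]] = {
    "PER": ("PER", "PER"),
    "LOC": ("LOC", "LOC"),
    "ORG": ("ORG", "ORG"),
    "LAO": ("LOC", "ORG"),  # location name used as an organization
    "OAL": ("ORG", "LOC"),  # organization name used as a location
}


def literal_type(tag: str) -> str:
    """Category of the name itself, ignoring how the sentence uses it."""
    return TAG_ROLES[tag][0]


def contextual_type(tag: str) -> str:
    """Category implied by the role the name plays in the sentence."""
    return TAG_ROLES[tag][1]


# 總統府 tagged OAL: an organization name standing for a place.
print(literal_type("OAL"), contextual_type("OAL"))  # ORG LOC
```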
11
Should adjacent named entities be combined?
Whether to combine adjacent NEs depends on whether the merged NEs carry a specific meaning of their own.
  民進黨高雄市議會黨團 於廿一日假性投票 ("the DPP caucus of the Kaohsiung City Council held a mock vote on the 21st")
Two policies, maximum and minimum semantic unit matching, may be applied in NE tagging.
12
Maximum semantic unit matching
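If the merged NEs carry a specific meaning of their own, they should be tagged with maximum semantic unit matching.
  [行政院原住民族委員會<ORG>] 副主委浦忠成昨天指出 ("Pu Chung-cheng, deputy minister of the Council of Indigenous Peoples, Executive Yuan, pointed out yesterday")
  [瑞芳區漁會鼻頭辦事處<ORG>] 主任戴清松說 ("戴清松, director of the Bitou Office of the Ruifang District Fishermen's Association, said")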
13
Minimum semantic unit matching
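Conversely, NEs are tagged with the minimum semantic unit matching policy if they carry no specific meaning once combined.
  咖啡園就位於 [雲林<LOC>] [古坑<LOC>] 的 [荷苞山<LOC>] ("the coffee plantation is located on Hebao Mountain in Gukeng, Yunlin")
  吳舜文則是 [江蘇<LOC>] [常州<LOC>] 紡織世家吳鏡淵先生之女 ("Wu Shun-wen is the daughter of Mr. 吳鏡淵, of a textile family in Changzhou, Jiangsu")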
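The choice between the two policies is a semantic judgment made by the annotator. Purely as an illustration, the Python sketch below applies the result of such judgments mechanically. It assumes a hypothetical lookup table known_units that lists merged names already judged to carry a specific meaning, and it assumes the input spans are sorted by position. Neither the table nor the function merge_adjacent comes from the CNEC tooling.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, surface, tag), end exclusive


def merge_adjacent(spans: List[Span], known_units: Dict[str, str]) -> List[Span]:
    """Merge runs of directly adjacent spans whose concatenation is a known unit.

    Runs that do not form a known unit are left as minimum units.
    When several merges are possible, the longest one wins.
    """
    merged: List[Span] = []
    i = 0
    while i < len(spans):
        best = spans[i]
        best_j = i
        start, end, surface, _ = spans[i]
        j = i + 1
        # Extend the run while the next span starts exactly where this one ends.
        while j < len(spans) and spans[j][0] == end:
            end = spans[j][1]
            surface += spans[j][2]
            if surface in known_units:
                best = (start, end, surface, known_units[surface])
                best_j = j
            j += 1
        merged.append(best)
        i = best_j + 1
    return merged


# Hypothetical judgment: this merged name is a meaningful unit of its own.
known = {"行政院原住民族委員會": "ORG"}

org_parts = [(0, 3, "行政院", "ORG"), (3, 10, "原住民族委員會", "ORG")]
print(merge_adjacent(org_parts, known))   # one merged ORG span

loc_parts = [(0, 2, "雲林", "LOC"), (2, 4, "古坑", "LOC")]
print(merge_adjacent(loc_parts, known))   # left as two LOC spans
```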
14
Compound word problem
Some named entities, such as location names, may be included in a larger compound word.
  西班牙海鮮飯 ("Spanish seafood rice", i.e. paella)
  蘇澳冷泉 ("Su'ao cold spring")
We can insert “的” between a possible location name and the remaining words to test such cases.
  西班牙 的 海鮮飯 (X)
  蘇澳 的 冷泉 (O)
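The insertion test can be prepared automatically, although judging the result remains a human decision. The Python sketch below only builds the test phrase for an annotator to read. The function name de_test_phrase is an illustrative assumption, and the sketch assumes the candidate location name sits at the start of the compound, as in both examples.

```python
def de_test_phrase(compound: str, candidate: str) -> str:
    """Insert 的 between a candidate location name and the rest of a compound.

    The function does not decide whether the phrase is acceptable;
    it only produces the string an annotator would judge (O or X).
    """
    if not candidate or not compound.startswith(candidate):
        raise ValueError("candidate must be a non-empty prefix of the compound")
    rest = compound[len(candidate):]
    if not rest:
        raise ValueError("nothing follows the candidate, so there is nothing to test")
    return f"{candidate} 的 {rest}"


print(de_test_phrase("蘇澳冷泉", "蘇澳"))      # 蘇澳 的 冷泉
print(de_test_phrase("西班牙海鮮飯", "西班牙"))  # 西班牙 的 海鮮飯
```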
15
Annotating Process
We collected over a million sentences from UDN and China Times, covering the period from Dec 2002 to Dec 2003, as raw data.
All the data is recorded in XML.
16
Fig. 1. Original corpus
17
Annotating Process (cont.)
Some high school students were selected as annotators and received basic training before they performed the annotations.
The training process included:
  An introduction to NER and a related course.
  A qualifying test to select participants for the tagging task.
  Training on the annotation criteria and the operation of the tagging tool.
The participants were then divided into three groups. Each group received 21,000 sentences for tagging.
18
Fig. 2. Tagging tool
19
Fig. 3. Tagged corpus.
20
Difficult problem - DIFF
DIFF is designed to identify problems such as abbreviations, cross-language loanwords, and nested, ambiguous, or poorly defined named entities.
DIFF is an essential functional tag for identifying ambiguous cases.
We can use DIFF to ensure the quality of the corpus.
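One plausible way to use DIFF for quality control is to set aside every sentence that contains a DIFF span for later review, leaving a subset with only settled annotations. The Python sketch below shows this under the assumption that annotations follow the bracketed [surface<TAG>] notation from the slides. The function name split_by_diff is illustrative and not part of the CNEC tooling.

```python
import re
from typing import Iterable, List, Tuple

# Matches a bracketed span whose tag is DIFF, e.g. [國親<DIFF>].
DIFF_SPAN = re.compile(r"\[[^\[\]<>]+<DIFF>\]")


def split_by_diff(sentences: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate sentences into (clean, needs_review).

    A sentence needs review if it contains at least one DIFF span.
    """
    clean: List[str] = []
    needs_review: List[str] = []
    for sentence in sentences:
        if DIFF_SPAN.search(sentence):
            needs_review.append(sentence)
        else:
            clean.append(sentence)
    return clean, needs_review


clean, needs_review = split_by_diff([
    "[台北縣<LOC>] 三百多萬人口",
    "[哈利波特<DIFF>] 養了一隻貓頭鷹嘿美",
])
print(len(clean), len(needs_review))  # 1 1
```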
21
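Some examples of DIFF
Nicknames
  [小炳<DIFF>] 疼愛的女兒 [央央<DIFF>] ("小炳's beloved daughter 央央")
Incomplete Chinese person names
  [蒨蓉<DIFF>] 從小就是個恰北北 ("蒨蓉 has been feisty since childhood")
  [陳總統<DIFF>] 上午前往日本北海道遊玩 ("President Chen went on a trip to Hokkaido, Japan, this morning")
Foreigners’ names
  [哈利波特<DIFF>] 養了一隻貓頭鷹嘿美 ("Harry Potter keeps an owl, Hedwig")
World place names
  [Missouri<DIFF>] 州的 [St. Louis City<DIFF>] ("St. Louis City in the state of Missouri")
Location and organization abbreviations
  就不止是 [台中縣市<DIFF>] ("it is not just Taichung County and City")
  倒是前天宣稱民進黨與黑金掛勾的 [國親<DIFF>] 兩黨 ("the KMT and PFP, the two parties that claimed the day before yesterday that the DPP was tied to 'black gold'")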
22
Conclusion and Future work
We define the criteria for Chinese NE tagging and design a standard tagging procedure for CNEC annotation.
Several issues in NE tagging are considered.
A functional tag, “DIFF”, is proposed to handle ambiguity and ensure the quality and consistency of the annotations.
Resolving the ambiguous entities labeled as DIFF is left as future work for the next version of CNEC.
23
Thank you