The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0

Presentation transcript:

The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0
Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu
Institute of Information Science, Academia Sinica
ROCLING XVI, September 3, 2004

Agenda
Motivation
Previous work
Named entity categories
Annotating process
Conclusion and future work

Motivation
Named entity recognition (NER) helps with the unknown proper noun problem and supports text mining.
A tagged corpus is useful for building NER systems and for evaluating them.

Related work
Multilingual Entity Task (MET-2): Japanese, Simplified Chinese, and Spanish. Includes person, organization, and location.
The Automatic Content Extraction program (ACE): English, Simplified Chinese, and Arabic. Includes person, organization, location, GPE, facility, etc.
Information Retrieval and Extraction Exercise (IREX): Japanese. Includes person, organization, location, artifact, etc.
Shared tasks of CoNLL 2002 and 2003: English, German, Dutch, and Spanish.

Named entity categories
Person name: PER
Location name: LOC
Organization name: ORG

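Named entity categories
A person name refers to a real person: 行政院發言人 [陳其邁<PER>] 上午指出 ("Executive Yuan spokesman [Chen Chi-mai] said this morning").
A location name denotes a place's geographical position: [台北縣<LOC>] 三百多萬人口 ("[Taipei County], with a population of over three million").
An organization is a body that is able to execute plans and projects: 因此會爭取與 [經濟部<ORG>] 63個駐外機構合作 ("and will therefore seek cooperation with the 63 overseas offices of [the Ministry of Economic Affairs]").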
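The bracketed notation used in these examples, [surface<TAG>], can be read mechanically. The following is a minimal sketch, assuming the tagged text looks exactly like the slide examples; the function name `extract_entities` and the regular expression are illustrative and are not part of the CNEC tools.

```python
import re

# Assumed notation, copied from the slide examples: [surface<TAG>]
NE_PATTERN = re.compile(r"\[([^\[\]<>]+)<([A-Z]+)>\]")


def extract_entities(tagged):
    """Return (plain_text, entities).

    Each entity is (start, end, tag, surface), with offsets into plain_text.
    """
    plain_parts = []
    entities = []
    pos = 0      # read position in the tagged string
    offset = 0   # write position in the plain string
    for m in NE_PATTERN.finditer(tagged):
        before = tagged[pos:m.start()]
        plain_parts.append(before)
        offset += len(before)
        surface, tag = m.group(1), m.group(2)
        entities.append((offset, offset + len(surface), tag, surface))
        plain_parts.append(surface)
        offset += len(surface)
        pos = m.end()
    plain_parts.append(tagged[pos:])
    return "".join(plain_parts), entities


plain, ents = extract_entities("行政院發言人 [陳其邁<PER>] 上午指出")
print(plain)  # 行政院發言人 陳其邁 上午指出
print(ents)   # [(7, 10, 'PER', '陳其邁')]
```

Keeping character offsets next to the stripped text makes it easy to compare annotations from different annotators on the same sentence.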

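What is "named entity (NE)"?
Our definition: named entities should be proper nouns. A word that does not refer to a unique, existing entity should not be annotated.
Examples that are not annotated (X):
任憑 被告 和 辯護律師 巧言飾辯 ("however cleverly the defendant and the defense lawyer argue"): 被告 and 辯護律師 are common nouns.
原來是幾個外勞坐在 公園 裡喝酒談天 ("it turned out to be a few migrant workers sitting in the park, drinking and chatting"): 公園 is a common noun.
國內新聞部 立刻進行處理 ("the domestic news department dealt with it immediately"): 國內新聞部 is not a unique entity.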

How to determine the boundary and category of a named entity?
There are two main ways to identify NEs in a sentence:
By inner feature (its literal meaning): 吳舜文 從年少開始 ("吳舜文, from a young age").
By outer feature (the context): 射手座的 景心潔 笑說 ("景心潔, a Sagittarius, said with a smile").
Named entities that can only be found by understanding the whole sentence or paragraph are ignored: 思科 也積極參與 ("Cisco is also actively participating").

Confusion between Location and Organization
Confusion (borrowing) between location and organization names always exists:
總統府 上午仍然低調回應 ("the Presidential Office still responded in a low-key manner this morning")
台北市政府同意 總統府 前的集會遊行 ("the Taipei City Government approved the rally and march in front of the Presidential Office")
Location as Organization (LAO) and Organization as Location (OAL) are proposed to solve this problem.

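Location as Organization (LAO) and Organization as Location (OAL)
LAO and OAL are used to mark the borrowing between LOC and ORG:
[總統府<ORG>] 上午仍然低調回應 ("[the Presidential Office] still responded in a low-key manner this morning")
台北市政府同意 [總統府<OAL>] 前的集會遊行 ("the Taipei City Government approved the rally and march in front of [the Presidential Office]")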
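One way to picture LAO and OAL is as a pair: what the name literally is, and what role it plays in context. The sketch below assumes that reading; the table `TAG_INFO` and the function `collapse` are hypothetical names, and collapsing to three classes is only one possible use (for example, when comparing against a system that outputs PER, LOC, and ORG only).

```python
# (literal category, role in context) for each tag.
# LAO = Location as Organization, OAL = Organization as Location.
TAG_INFO = {
    "PER": ("PER", "PER"),
    "LOC": ("LOC", "LOC"),
    "ORG": ("ORG", "ORG"),
    "LAO": ("LOC", "ORG"),
    "OAL": ("ORG", "LOC"),
}


def collapse(tag, by="role"):
    """Map a five-way tag to PER/LOC/ORG by contextual role or literal category."""
    literal, role = TAG_INFO[tag]
    return role if by == "role" else literal


print(collapse("OAL"))                # LOC: 總統府 used as a place
print(collapse("OAL", by="literal"))  # ORG: 總統府 is literally an organization
```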

Should adjacent named entities be combined?
Whether to combine adjacent NEs depends on whether the merged string carries a specific meaning of its own: 民進黨高雄市議會黨團 於廿一日假性投票 ("the DPP caucus of the Kaohsiung City Council held a mock vote on the 21st").
Two policies, maximum and minimum semantic unit matching, may be applied in NE tagging.

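Maximum semantic unit matching
If the merged NEs carry a specific meaning, they are tagged as a single unit (maximum semantic unit matching):
[行政院原住民族委員會<ORG>] 副主委浦忠成昨天指出 ("浦忠成, vice chairman of [the Council of Indigenous Peoples, Executive Yuan], pointed out yesterday")
[瑞芳區漁會鼻頭辦事處<ORG>] 主任戴清松說 ("戴清松, director of [the Bitou office of the Ruifang District Fishermen's Association], said")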

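Minimum semantic unit matching
Conversely, NEs are tagged under the minimum semantic unit matching policy if combining them yields no specific meaning:
咖啡園就位於 [雲林<LOC>] [古坑<LOC>] 的 [荷苞山<LOC>] ("the coffee plantation is located on [Hebao Mountain] in [Gukeng], [Yunlin]")
吳舜文則是 [江蘇<LOC>] [常州<LOC>] 紡織世家吳鏡淵先生之女 ("吳舜文 is the daughter of Mr. 吳鏡淵, of a textile family in [Changzhou], [Jiangsu]")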
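The two policies can be sketched as a merge step over adjacent entities. This is only an illustration: in the corpus the decision is made by human annotators, so the lexicon `known_units` below is a hypothetical stand-in for their judgement that a merged name has a specific meaning, and splitting 行政院原住民族委員會 into two components is an assumption made for the example.

```python
def merge_adjacent(entities, known_units):
    """Merge runs of adjacent entities whose concatenation is a known unit.

    entities:    list of (surface, tag) pairs that are adjacent in the text
    known_units: dict mapping a merged surface form to its tag
    """
    result = []
    i = 0
    while i < len(entities):
        merged = None
        # Try the longest run first (maximum semantic unit matching).
        for j in range(len(entities), i + 1, -1):
            candidate = "".join(surface for surface, _ in entities[i:j])
            if candidate in known_units:
                merged = (candidate, known_units[candidate], j)
                break
        if merged is not None:
            result.append((merged[0], merged[1]))
            i = merged[2]
        else:
            # No meaningful merge: keep the entity as its own minimum unit.
            result.append(entities[i])
            i += 1
    return result


known_units = {"行政院原住民族委員會": "ORG"}
print(merge_adjacent([("行政院", "ORG"), ("原住民族委員會", "ORG")], known_units))
print(merge_adjacent([("雲林", "LOC"), ("古坑", "LOC")], known_units))
```

The first call yields a single merged ORG entity, while the second leaves 雲林 and 古坑 as separate LOC entities because their concatenation is not in the lexicon.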

Compound word problem
Some named entities, such as location names, may be embedded in a larger compound word: 西班牙海鮮飯 ("Spanish seafood rice", i.e. paella), 蘇澳冷泉 ("Su'ao cold spring").
We can insert "的" between the possible location name and the remaining words to test such cases (X marks a failed test, O a passed one):
西班牙 的 海鮮飯 (X)
蘇澳 的 冷泉 (O)
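The "的" test can be sketched as two small helpers: one builds the paraphrase shown to a human judge, the other applies the verdict. The names `de_test_phrase` and `tag_compound` are hypothetical, and the reading that a passed test means the location is tagged separately while a failed test leaves the compound untagged is an interpretation of the slide, not something it states outright.

```python
def de_test_phrase(candidate_location, rest):
    """Build the paraphrase a human judge is asked to assess."""
    return f"{candidate_location} 的 {rest}"


def tag_compound(candidate_location, rest, passes_test):
    """Apply the judge's verdict (assumed policy, see the note above)."""
    if passes_test:
        # The location stands on its own, so it is tagged as LOC.
        return f"[{candidate_location}<LOC>]{rest}"
    # The location is fused into the compound, so nothing is tagged.
    return candidate_location + rest


print(de_test_phrase("蘇澳", "冷泉"))          # 蘇澳 的 冷泉
print(tag_compound("蘇澳", "冷泉", True))      # [蘇澳<LOC>]冷泉
print(tag_compound("西班牙", "海鮮飯", False))  # 西班牙海鮮飯
```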

Annotating Process
We collected over a million sentences from UDN and China Times, covering December 2002 to December 2003, as raw data.
All the data is stored in XML.

Fig. 1. Original corpus

Annotating Process (cont.)
High school students were selected as annotators and received basic training before performing the annotation.
The training process included:
An introduction to NER and related courses.
A qualifying test to select participants for the tagging task.
Training on the annotation criteria and on the operation of the tagging tool.
The participants were then divided into three groups. Each group received 21,000 sentences for tagging.

Fig. 2. Tagging tool

Fig. 3. Tagged corpus.

Difficult problem - DIFF
DIFF is designed to flag problems such as abbreviations, cross-language loanwords, and nested, ambiguous, or poorly defined named entities.
DIFF is an essential functional tag for identifying ambiguous cases.
We can use DIFF to ensure the quality of the corpus.

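Some examples of DIFF
Nicknames: [小炳<DIFF>] 疼愛的女兒 [央央<DIFF>] ("[小炳]'s beloved daughter [央央]")
Incomplete Chinese person names: [蒨蓉<DIFF>] 從小就是個恰北北 ("[蒨蓉] has been feisty since childhood"); [陳總統<DIFF>] 上午前往日本北海道遊玩 ("[President Chen] left for a trip to Hokkaido, Japan this morning")
Foreigners' names: [哈利波特<DIFF>] 養了一隻貓頭鷹嘿美 ("[Harry Potter] keeps an owl, Hedwig")
World place names: [Missouri<DIFF>] 州的 [St. Louis City<DIFF>] ("[St. Louis City] in the state of [Missouri]")
Location and organization abbreviations: 就不止是 [台中縣市<DIFF>] ("it is not only [Taichung County and City]"); 倒是前天宣稱民進黨與黑金掛勾的 [國親<DIFF>] 兩黨 ("the [KMT and PFP] parties, which claimed the day before yesterday that the DPP was tied to 'black gold' politics")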
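Because DIFF marks cases set aside for later review, a natural first step when using the corpus is to count tags and separate sentences that contain DIFF from those that do not. The following is a minimal sketch under the assumption that sentences use the bracketed [surface<TAG>] notation shown above; `tag_counts` and `split_by_diff` are hypothetical helper names, not CNEC tooling.

```python
import re
from collections import Counter

# Assumed notation, copied from the slide examples: [surface<TAG>]
TAG_PATTERN = re.compile(r"\[[^\[\]<>]+<([A-Z]+)>\]")


def tag_counts(sentences):
    """Count how often each tag occurs across the given sentences."""
    return Counter(tag for s in sentences for tag in TAG_PATTERN.findall(s))


def split_by_diff(sentences):
    """Return (clean, review): sentences without and with a DIFF tag."""
    clean, review = [], []
    for s in sentences:
        if "DIFF" in TAG_PATTERN.findall(s):
            review.append(s)
        else:
            clean.append(s)
    return clean, review


sentences = [
    "[Missouri<DIFF>] 州的 [St. Louis City<DIFF>]",
    "[台北縣<LOC>] 三百多萬人口",
]
print(tag_counts(sentences))
clean, review = split_by_diff(sentences)
print(len(clean), len(review))  # 1 1
```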

Conclusion and Future work
We defined the criteria for Chinese NE tagging and designed a standard tagging procedure for CNEC annotation.
Several issues in NE tagging were considered.
A functional tag, DIFF, is proposed to handle ambiguity and to ensure the quality and consistency of the annotations.
Ambiguous entities, labeled as DIFF, may be addressed in the next version of CNEC.

Thank you