Some discussions on Entity Identification


1 Some discussions on Entity Identification
丁文韬 2019/12/19

2 Outline
- What is entity identification?
  - Identify what?
  - When to identify those entities?
- How to identify them?
  - Learning with supervision
  - Direct matching with knowledge
  - Combined approaches
- Discussion

3 What is entity identification?
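Entity identification/recognition/discovery/extraction/…
- A common misreading: given a knowledge base and a corpus, find every occurrence in the corpus of every entity in the knowledge base?
  - This usually does not hold for general-purpose knowledge bases
- What counts as an "entity occurring in the text"?
  - Identify what?
  - When to identify those entities?
Example:
Jim bought 300 shares of Acme Corp. in 2006.
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.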
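The bracketed sentence on this slide can be produced from character-offset spans. The following is a minimal sketch; the `Span` layout and the `annotate` helper are assumptions made for illustration, not something the slides define.

```python
from typing import List, Tuple

# (start_char, end_char_exclusive, entity_type); a hypothetical span layout.
Span = Tuple[int, int, str]


def annotate(text: str, spans: List[Span]) -> str:
    """Wrap each labeled span as [surface]Type, as in the slide's example."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])           # text before the entity
        pieces.append(f"[{text[start:end]}]{label}")  # the entity itself
        cursor = end
    pieces.append(text[cursor:])                    # trailing text
    return "".join(pieces)


sentence = "Jim bought 300 shares of Acme Corp. in 2006."
spans = [(0, 3, "Person"), (25, 35, "Organization"), (39, 43, "Time")]
annotated = annotate(sentence, spans)
print(annotated)
```

Whether "300 shares" should also be a span is exactly the "identify what?" question raised above; the sketch only renders whatever spans it is given.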

4 What is entity identification?
Identify what?
- Defined by annotations
  - Learn from the supervision which kinds of entities should be identified
  - Does the open world assumption hold?
- Defined by knowledge
  - Recognizing entities of restricted types
  - The inevitable choice when sufficient supervision is lacking
- Defined by knowledge & annotations
  - e.g. OKE Task 2

5 What is entity identification?
When to identify those entities?
- Defined by linguistic restriction(?)
  - “Named” entity recognition: the word “Named” aims to restrict the task to only those entities for which one or many rigid designators stand for the referent
  - “the automotive company created by Henry Ford in 1903” -> “Henry Ford”
  - “the automotive company created by Henry Ford in 1903” -> “Ford Motor Company”
- Defined by annotations
  - Learn from the supervision which kinds of entities should be identified
  - Requires annotators to keep the granularity reasonably consistent
- Just try to identify more
  - The practical approach often taken when identification is treated as a module

6 How to identify them?
- Direct matching with knowledge
  - Matching with a dictionary
    - Exact/inexact matching
    - Learning-based dictionary augmentation
- Learning with supervision
  - Sequential tagging model (CRF)
- Combined approaches

7 Dictionary based identification
- Exact/inexact matching
  - prefix/suffix: “China” -> “People’s Republic of China”
  - acronym: “NJU” -> “Nanjing University”
  - edit distance: “Michael Jordan” -> “Michael I. Jordan”
- Learning-based dictionary augmentation
  - The overview of AutoPhrase (figure)
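The three inexact matching rules listed on this slide can be sketched as below. The helper names, the acronym heuristic (word initials must appear in the mention, and the mention must be a subsequence of the entry) and the `max_edits` threshold are all assumptions chosen so that the slide's three examples match; a real system would tune them.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_subsequence(small: str, big: str) -> bool:
    """True if the characters of `small` appear in `big` in the same order."""
    it = iter(big)
    return all(ch in it for ch in small)


def is_affix(mention: str, entry: str) -> bool:
    m, e = mention.lower(), entry.lower()
    return m != e and (e.startswith(m) or e.endswith(m))


def is_acronym(mention: str, entry: str) -> bool:
    if len(mention) < 2 or not mention.isupper():
        return False
    m = mention.lower()
    initials = "".join(word[0] for word in entry.split()).lower()
    return is_subsequence(initials, m) and is_subsequence(m, entry.lower())


def match(mention: str, dictionary: List[str], max_edits: int = 3) -> List[Tuple[str, str]]:
    """Return (entry, rule) for every dictionary entry the mention matches."""
    hits = []
    for entry in dictionary:
        if mention == entry:
            hits.append((entry, "exact"))
        elif is_affix(mention, entry):
            hits.append((entry, "prefix/suffix"))
        elif is_acronym(mention, entry):
            hits.append((entry, "acronym"))
        elif edit_distance(mention, entry) <= max_edits:
            hits.append((entry, "edit distance"))
    return hits


dictionary = ["People's Republic of China", "Nanjing University", "Michael I. Jordan"]
for mention in ["China", "NJU", "Michael Jordan"]:
    print(mention, "->", match(mention, dictionary))
```

Each rule trades precision for coverage: the looser the rule, the more a confidence estimate for its matches is needed.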

8 Identification as sequential tagging
CRF model
- Y (tagging scheme)
  - BIO/SBIEO (Single, Begin, Intermediate, End, Other)
  - Type (Class)
- X (feature)
  - Discrete feature
  - Fixed word embedding
  - Language model
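To make the two tagging schemes concrete, here is a small sketch that converts token-level entity spans into BIO and SBIEO label sequences. The span layout `(first_token, last_token_inclusive, type)` and the function names are assumptions for illustration; the sentence is the one used earlier in the deck.

```python
from typing import List, Tuple

# (first_token, last_token_inclusive, entity_type); a hypothetical span layout.
TokenSpan = Tuple[int, int, str]


def to_bio(n_tokens: int, spans: List[TokenSpan]) -> List[str]:
    """B- on the first token of an entity, I- on the rest, O elsewhere."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


def to_sbieo(n_tokens: int, spans: List[TokenSpan]) -> List[str]:
    """Like BIO, but single-token entities get S- and last tokens get E-."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        tags[end] = f"E-{label}"
    return tags


tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
spans = [(0, 0, "Person"), (5, 6, "Organization"), (8, 8, "Time")]
bio = to_bio(len(tokens), spans)
sbieo = to_sbieo(len(tokens), spans)
for token, b, s in zip(tokens, bio, sbieo):
    print(f"{token:8} {b:16} {s}")
```

The type suffix on each tag is the "Type (Class)" part of Y: boundary and class are predicted jointly as one label.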

9 Combined approaches
- Dictionary
  - Strengths: high precision, completeness
  - Weaknesses: incomplete, context-independent
- Sequential tagging
  - Strengths: robust, context-dependent
  - Weaknesses: hard to model long dependencies
- Enhancing sequential tagging models by dictionary
  - Chinese NER Using Lattice LSTM (ACL 18)
  - Learning Named Entity Tagger using Domain-Specific Dictionary (EMNLP 18)

10 Chinese NER Using Lattice LSTM
- Segmentation and NER are interrelated
- But a Segmentation -> NER pipeline may propagate errors
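One way to avoid committing to a single segmentation is to hand the tagger every dictionary word that matches the character sequence, which is roughly the input a lattice model consumes. The sketch below only enumerates those matches; it does not implement the LSTM. The lexicon, the `max_len` cutoff and the function name are assumptions for illustration.

```python
from typing import List, Set, Tuple


def lexicon_matches(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, word) for every multi-character lexicon word in `chars`."""
    matches = []
    for start in range(len(chars)):
        longest_end = min(len(chars), start + max_len)
        for end in range(start + 2, longest_end + 1):
            word = chars[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# "Nanjing Yangtze River Bridge"; the toy lexicon below is hypothetical.
sentence = "南京市长江大桥"
lexicon = {
    "南京",      # Nanjing
    "南京市",    # Nanjing City
    "市长",      # mayor
    "长江",      # Yangtze River
    "大桥",      # bridge
    "长江大桥",  # Yangtze River Bridge
}
matches = lexicon_matches(sentence, lexicon)
for start, end, word in matches:
    print(start, end, word)
```

The overlapping matches "市长" (mayor) and "长江" (Yangtze River) show the ambiguity: a segmenter that picks the wrong one passes its error on to NER, whereas keeping both lets context decide.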

11 Learning Named Entity Tagger using Domain-Specific Dictionary
- Domain-specific: without large amounts of manually-labeled training data
- Distant supervision (matching with a dictionary)
- Fuzzy-LSTM-CRF for distant supervision
  - Unknown type
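One reading of a "fuzzy" CRF objective under distant supervision is sketched below: a token whose type is unknown keeps every label as possible, and training maximizes the probability mass of all label paths consistent with those per-token label sets, rather than of a single gold path. This is an assumption about the general idea, not the paper's exact formulation. The scores are made-up numbers rather than LSTM outputs, and the sum is computed by brute-force enumeration instead of the forward algorithm to keep the sketch short.

```python
import itertools
import math
from typing import List, Sequence, Set

LABELS = ["O", "B", "I"]  # toy label set for this sketch


def path_score(path: Sequence[int],
               emissions: List[List[float]],
               transitions: List[List[float]]) -> float:
    """Unnormalized CRF score: emission scores plus transition scores."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def fuzzy_log_likelihood(emissions: List[List[float]],
                         transitions: List[List[float]],
                         allowed: List[Set[int]]) -> float:
    """log of (mass of paths consistent with `allowed`) / (mass of all paths)."""
    n_labels = len(transitions)
    consistent = 0.0
    total = 0.0
    for path in itertools.product(range(n_labels), repeat=len(emissions)):
        weight = math.exp(path_score(path, emissions, transitions))
        total += weight
        if all(label in allowed[t] for t, label in enumerate(path)):
            consistent += weight
    return math.log(consistent) - math.log(total)


# Made-up scores for three tokens and three labels.
emissions = [[0.1, 2.0, -1.0], [0.3, 0.2, 1.5], [1.8, 0.0, 0.4]]
transitions = [[0.5, 0.2, -2.0], [0.0, -0.5, 1.0], [0.3, 0.1, 0.6]]

every_label = {0, 1, 2}
# Token 0 matched as B, token 1 has unknown type, token 2 matched nothing (O).
distant = [{1}, every_label, {0}]
# Fully specified labels B, I, O: this reduces to the ordinary CRF likelihood.
single_path = [{1}, {2}, {0}]

fuzzy = fuzzy_log_likelihood(emissions, transitions, distant)
strict = fuzzy_log_likelihood(emissions, transitions, single_path)
print(f"fuzzy={fuzzy:.4f} strict={strict:.4f}")
```

With singleton label sets the objective is the usual CRF log-likelihood, so the fuzzy variant is a relaxation that simply declines to penalize labels the dictionary cannot rule out.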

12 Learning Named Entity Tagger using Domain-Specific Dictionary
“Tie-or-Break” tagging scheme
The connection between two adjacent tokens is labeled as:
- Tie, when the two tokens are matched to the same entity;
- Unknown, if at least one of the tokens belongs to an unknown-typed high-quality phrase;
- Break, otherwise.
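The three rules above translate directly into a labeling function over the gaps between adjacent tokens. The sketch below applies them in the order listed; the span layout and the choice of "300 shares" as an unknown-typed phrase are hypothetical, picked only to exercise all three labels.

```python
from typing import List, Optional, Tuple

# (first_token, last_token_inclusive); a hypothetical span layout.
TokenRange = Tuple[int, int]


def tie_or_break(n_tokens: int,
                 typed_spans: List[TokenRange],
                 unknown_spans: List[TokenRange]) -> List[str]:
    """Label each of the n_tokens - 1 gaps as Tie, Unknown or Break."""
    entity_of: List[Optional[int]] = [None] * n_tokens
    for entity_id, (start, end) in enumerate(typed_spans):
        for i in range(start, end + 1):
            entity_of[i] = entity_id

    in_unknown = [False] * n_tokens
    for start, end in unknown_spans:
        for i in range(start, end + 1):
            in_unknown[i] = True

    labels = []
    for i in range(n_tokens - 1):
        if entity_of[i] is not None and entity_of[i] == entity_of[i + 1]:
            labels.append("Tie")        # both tokens matched to the same entity
        elif in_unknown[i] or in_unknown[i + 1]:
            labels.append("Unknown")    # touches an unknown-typed phrase
        else:
            labels.append("Break")
    return labels


tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
typed_spans = [(0, 0), (5, 6), (8, 8)]   # Jim, Acme Corp., 2006
unknown_spans = [(2, 3)]                 # "300 shares", assumed unknown-typed
gap_labels = tie_or_break(len(tokens), typed_spans, unknown_spans)
for left, right, label in zip(tokens, tokens[1:], gap_labels):
    print(f"{left} | {right}: {label}")
```

Labeling gaps rather than tokens means an unmatched token never forces a wrong "O": its neighbors only record where a boundary is known, possible, or undecided.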

13 Learning Named Entity Tagger using Domain-Specific Dictionary
Results

14 Discussion
Entity Identification = Triggering + Boundary finding + Classification?
- SynTime: pipelined triggering & boundary finding
- PTime: confidence should be taken into account when using a dictionary

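15 Discussion
Entity Identification = Triggering + Boundary finding + Classification?
- Supervision from purely manual annotation is accurate, but may be incomplete
- Distant supervision from an (augmented) dictionary has accuracy problems
- Can (should) annotation be provided together with a confidence?
  - Compute on the dev set the quality of each combination of conditions
  - Compute a confidence per token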
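A minimal sketch of the dev-set idea: group dictionary matches by the combination of conditions under which they fired, and use the precision of each group on the dev set as its confidence. The condition names (dictionary source, match rule) and all counts below are made up for illustration; the slides do not specify which conditions to use.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Conditions = Tuple[str, str]  # (dictionary source, match rule); hypothetical


def confidence_by_condition(dev_matches: Iterable[Tuple[Conditions, bool]]) -> Dict[Conditions, float]:
    """Precision on the dev set for each combination of conditions."""
    counts = defaultdict(lambda: [0, 0])  # conditions -> [correct, total]
    for conditions, is_correct in dev_matches:
        counts[conditions][1] += 1
        if is_correct:
            counts[conditions][0] += 1
    return {cond: correct / total for cond, (correct, total) in counts.items()}


# Each item: the conditions under which a match fired, and whether the
# dev-set annotation agrees with it. All values are invented.
dev_matches = (
    [(("standard", "exact"), True)] * 3
    + [(("standard", "exact"), False)] * 1
    + [(("augmented", "exact"), True)] * 1
    + [(("augmented", "exact"), False)] * 1
    + [(("augmented", "edit distance"), True)] * 1
    + [(("augmented", "edit distance"), False)] * 3
)
confidence = confidence_by_condition(dev_matches)
for conditions, value in confidence.items():
    print(conditions, value)
```

A per-token confidence could then be taken from the group of the match that covers the token, and used to weight that token's distant label during training.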

16 Discussion
Entity Identification = Triggering + Boundary finding + Classification?
- Explicit syntax should provide some help (Segmentation, Syntax Tree)

17 Discussion
Entity Identification = Triggering + Boundary finding + Classification?
- Classification
  - How exactly does classification help the first two steps?
  - Put differently, what is the benefit of doing NER for several types at once?
  - Does classification really help?
  - Conjecture: it only helps with processing inside a single phrase
    - “the automotive company created by Henry Ford in 1903” -> “Henry Ford”
    - “the automotive company created by Henry Ford in 1903” -> “Ford Motor Company”

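18 Discussion
For entity identification tasks with insufficient supervision:
- Look for standard dictionaries and transferable annotations (e.g. anchor text)
- Augment the dictionary with confidence scores
  - Estimate confidence for non-standard information tier by tier
- Build a multi-task (?) Fuzzy-CRF model
  - Learn segmentation and entity identification jointly
  - Represent segmentation as labels between tokens
- Consider performing entity identification per class
  - Treat merging the recognition results of multiple classes as a separate problem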

19 References
- Khaled Shaalan: A Survey of Arabic Named Entity Recognition and Classification. Computational Linguistics 40(2) (2014)
- Wei Shen, Jianyong Wang, Jiawei Han: Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Trans. Knowl. Data Eng. 27(2) (2015)
- Xiaoshi Zhong, Aixin Sun, Erik Cambria: Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules. ACL (1) 2017
- Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han: Automated Phrase Mining from Massive Text Corpora. IEEE Trans. Knowl. Data Eng. 30(10) (2018)
- Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, Jiawei Han: Learning Named Entity Tagger using Domain-Specific Dictionary. EMNLP 2018
- Yue Zhang, Jie Yang: Chinese NER Using Lattice LSTM. ACL (1) 2018

20 Thanks for listening
Q & A

