Mapping Adjective + Noun to Specific Types in a Knowledge Base. 王远, 2018/11/22

Candidate Type generation

Diagram labels: Wikipedia Pages; Adj + Noun; DBpedia Entity; dbo:wikiPageID; DBpedia Ontology; YAGO Ontology; SDType.

Adj + Noun: 1,836,620; Adj: 75,867; Noun: 70,140

Test dataset:
1. 49 <adj, noun> pairs were extracted from QALD 1-7.
2. Each <adj, noun> pair was annotated with one gold Type, based on the gold-standard SPARQL queries provided by QALD.

Test results:
Top-100   Top-50   Top-20   Top-10   Average
40        35       21       4        28

Result analysis:
1. In 3 <adj, noun> pairs the noun is a compound noun that is not included in WordNet. Examples: Grunge#;#record label, Australian#;#metalcore band.
2. 2 <adj, noun> pairs are not in the resource library (they may have been filtered out, or Wikipedia may not contain this adj + noun). Example: Swedish#;#holiday.
3. For 4 <adj, noun> pairs the gold Type is not in the Top-100. Example: anti-apartheid#;#activist ( ?uri rdf:type text:"anti-apartheid activist" . )

Figure labels: Military conflicts; Given Name.
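The Top-K figures count how many <adj, noun> pairs have their gold Type among the first K ranked candidates. A minimal sketch of that count follows. The names `hits_at_k`, `rankings` and `gold` are hypothetical, and the sample rankings are illustrative toy values, not the QALD evaluation data.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (adjective, noun)


def hits_at_k(rankings: Dict[Pair, List[str]], gold: Dict[Pair, str], k: int) -> int:
    """Count pairs whose gold Type appears among the first k ranked candidates."""
    hits = 0
    for pair, gold_type in gold.items():
        # A pair that is missing from the resource library has no candidates.
        candidates = rankings.get(pair, [])
        if gold_type in candidates[:k]:
            hits += 1
    return hits


# Toy data for illustration only (assumed rankings and gold labels).
rankings = {
    ("English", "engineer"): ["yago:Engineer109615807", "owl:Thing"],
    ("nuclear", "weapon"): ["yago:NuclearWeapons", "yago:NuclearWeapon103834604"],
}
gold = {
    ("English", "engineer"): "yago:Engineer109615807",
    ("nuclear", "weapon"): "yago:NuclearWeapon103834604",
    ("Swedish", "holiday"): "ex:SwedishHoliday",  # not in the library: never a hit
}

for k in (1, 2):
    print(f"Top-{k}: {hits_at_k(rankings, gold, k)} of {len(gold)}")
```

A pair that was filtered out of the resource library simply never scores a hit, which matches how the second failure case in the result analysis is counted.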

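Feature groups: context features of Adj + Noun; co-occurrence features of Adj + Noun and Type; similarity features; context features of Type.

Lexical similarity feature
#5 Lexical similarity between Adj + Noun and the LocalName of the candidate Type

Semantic similarity feature
#6 Semantic similarity between Adj + Noun and the LocalName of the candidate Type

Context features of Type
#7 Level of the candidate Type in the hierarchy
#8 Number of entities of the candidate Type in the knowledge base
#9 PMI between the candidate Type and the other candidate Types
#10 Whether the candidate Type belongs to the DBpedia Ontology or the YAGO Ontology
#11 Whether the LocalName of the candidate Type contains Adj + Noun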
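Features #9 and #11 can be sketched in a few lines. The slides do not say how PMI is estimated, so the sketch assumes it is computed from how often two Types are asserted on the same entity. The function names, the LocalName splitting rule and the toy `entity_types` data are all assumptions made for illustration.

```python
import math
import re
from typing import Dict, Set


def local_name_words(type_iri: str) -> str:
    """Turn e.g. 'yago:NuclearWeapon103834604' into 'nuclear weapon'."""
    local = type_iri.split(":", 1)[-1]
    local = re.sub(r"\d+$", "", local)  # drop a trailing numeric id
    words = re.findall(r"[A-Z][a-z]*|[a-z]+", local)
    return " ".join(w.lower() for w in words)


def contains_adj_noun(type_iri: str, adj: str, noun: str) -> bool:
    """Feature #11: does the LocalName contain the adj + noun phrase?"""
    return f"{adj} {noun}".lower() in local_name_words(type_iri)


def pmi(type_a: str, type_b: str, entity_types: Dict[str, Set[str]]) -> float:
    """Feature #9 (assumed estimate): PMI of two Types over shared entities."""
    n = len(entity_types)
    count_a = sum(1 for ts in entity_types.values() if type_a in ts)
    count_b = sum(1 for ts in entity_types.values() if type_b in ts)
    count_ab = sum(1 for ts in entity_types.values() if type_a in ts and type_b in ts)
    if n == 0 or count_ab == 0:
        return float("-inf")  # the two Types never co-occur
    return math.log((count_ab / n) / ((count_a / n) * (count_b / n)))


# Toy entity-to-Types map (hypothetical entities and Types).
entity_types = {
    "e1": {"T:Engineer", "T:Person"},
    "e2": {"T:Engineer", "T:Person"},
    "e3": {"T:City"},
    "e4": {"T:City"},
}

print(contains_adj_noun("yago:NuclearWeapon103834604", "nuclear", "weapon"))
print(pmi("T:Engineer", "T:Person", entity_types))
```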

Types whose LocalName contains an Adj + Noun
DBpedia: 60/414; Yago: 135,219/369,144

Adj + Noun               Count   Adj         Count   Noun         Count
musical + group          1292    american    1753    people       4494
political + party        880     musical     1388    descent      3011
military + unit          627     military    1230    group        1578
gaelic + footballer      482     political   1198    school       1176
religious + building     459     british     1052    party        1131
archaeological + site    431     defunct     1039    football     793
american + people        340     canadian    942     unit         704
american + football      301     french      893     building     639
human + rights           266     german      881     footballer   638
……

Future work
Generate feature data for each candidate Type
Manually label 100-200 groups of data
Train a binary classifier for filtering and ranking

Thanks

Heiko Paulheim, Christian Bizer. Type Inference on Noisy RDF Data. ISWC 2013.

Motivation
In DBpedia, common reasons for missing type statements are:
-- Missing infoboxes. An article without an infobox is not assigned any type.
-- Too general infoboxes. If an article about an actor uses a person infobox instead of the more specific actor infobox, the instance is assigned the type dbpedia-owl:Person, but not dbpedia-owl:Actor.
-- Wrong infobox mappings. The videogame infobox is mapped to dbpedia-owl:VideoGame, not dbpedia-owl:Game, and dbpedia-owl:VideoGame is not a subclass of dbpedia-owl:Game in the DBpedia ontology.
-- Unclear semantics. dbpedia-owl:College: "college", in British and US English, can denote private secondary schools, universities, or institutions within universities.

Standard RDFS reasoning via entailment rules
-- ?x a ?t1 . ?t1 rdfs:subClassOf ?t2 entails ?x a ?t2
-- ?x ?r ?y . ?r rdfs:domain ?t entails ?x a ?t
-- ?y ?r ?x . ?r rdfs:range ?t entails ?x a ?t
Reasoning seems the straightforward approach to tackle the problem of completing missing types.

The DBpedia dataset contains all types from the infobox types dataset (i.e., DBpedia ontology, schema.org, and UMBEL).
Some DBpedia ontology classes do not have clear semantics.

dbr:Germany
Types: country, award, city, sports team, mountain, stadium, record label, person, and military conflict……

dbpedia:Mze dbpedia-owl:sourceMountain dbpedia:Germany.
dbpedia:XII Corps (United Kingdom) dbpedia-owl:battle dbpedia:Germany.
Note: dbpedia:Mze is the Mže River (姆熱河).

SDType: an approach for inducing types which is tolerant with respect to erroneous and noisy data.
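The three RDFS entailment rules from the Motivation slide can be applied mechanically, which is how a single noisy statement ends up giving dbpedia:Germany an implausible type. The sketch below is a naive fixed-point loop over a small triple set. The range and subclass declarations in the sample data are assumptions made for illustration, and `infer_types` is a hypothetical helper, not part of SDType.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Apply the subClassOf, domain and range rules until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for s, p, o in facts if p == SUBCLASS}
        domain = {(s, o) for s, p, o in facts if p == DOMAIN}
        range_ = {(s, o) for s, p, o in facts if p == RANGE}
        new: Set[Triple] = set()
        for s, p, o in facts:
            if p == RDF_TYPE:
                # ?x a ?t1 . ?t1 rdfs:subClassOf ?t2  entails  ?x a ?t2
                new |= {(s, RDF_TYPE, t2) for t1, t2 in subclass if t1 == o}
            # ?x ?r ?y . ?r rdfs:domain ?t  entails  ?x a ?t
            new |= {(s, RDF_TYPE, t) for r, t in domain if r == p}
            # ?y ?r ?x . ?r rdfs:range ?t  entails  ?x a ?t
            new |= {(o, RDF_TYPE, t) for r, t in range_ if r == p}
        if not new <= facts:
            facts |= new
            changed = True
    return facts - set(triples)


triples = {
    # The noisy statement quoted on the slide.
    ("dbpedia:Mze", "dbpedia-owl:sourceMountain", "dbpedia:Germany"),
    # Assumed schema statements, for illustration only.
    ("dbpedia-owl:sourceMountain", RANGE, "dbpedia-owl:Mountain"),
    ("dbpedia-owl:Mountain", SUBCLASS, "dbpedia-owl:NaturalPlace"),
}

for triple in sorted(infer_types(triples)):
    print(triple)
```

Under these assumptions, one wrong link is enough for the reasoner to type Germany as a mountain, which is the kind of error a noise-tolerant approach is meant to avoid.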

Evaluation
Random samples of 10,000 instances from DBpedia and OpenCyc.
Using only ingoing properties. In DBpedia, outgoing properties and types are generated in the same step, so the correct type can be trivially predicted from outgoing properties.
The reason for that is that DBpedia, with its stronger focus on coverage than on correctness, contains more faulty statements. When more links are present, the influence of each individual statement is reduced, which allows for correcting errors.

Evaluation
Of all 550,048 untyped resources in DBpedia, the classifier identifies 519,900 (94.5%) as typeable. Types were generated for those resources and evaluated manually on a sample of 100 random resources. 91.8%

Estimating Type Completeness in DBpedia
DBpedia types are at most 63.7% complete, with at least 2.7 million missing type statements (while YAGO types, which can be assessed accordingly, are at most 53.3% complete).

Candidate Type generation
Diagram labels: Wikipedia Pages; Adj + Noun; DBpedia Entity; dbo:wikiPageID; DBpedia Ontology; YAGO Ontology; rdf:type; SDType.

Thanks

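Pipeline (diagram): preprocessing, recognition, filtering

Preprocessing
Wikipedia Pages
WikiExtractor
Remove tables, images, links and markup
Extract the abstract section

Recognition
StanfordNLP syntactic parsing: identify NPs
WordNet vocabulary: identify compound adjectives / compound nouns in the NPs
StanfordNLP POS tagging: identify Adj + Noun in the NPs

Filtering
Adj is an ordinal
Adj/Noun contains special characters
Adj is a comparative or superlative
Adj + Noun is an entity, or Noun is a proper noun
Adj + Noun frequency ≤ 5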

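Extracting adj + noun from Wikipedia text
Version: 2018/10; size: 15.9G
Extract Adj + NP (where the NP contains only nouns)

Filter out adjectives that are numeric ordinals
Filter out cases where the adj/noun contains special characters
  adj + noun: 13,309,280; adj: 834,028; noun: 2,436,532
Filter out comparatives and superlatives
  adj + noun: 12,819,292; adj: 831,202; noun: 2,361,237
Filter out adj + nouns (more than one noun)
  adj + noun: 8,095,419; adj: 747,043; noun: 396,040
Filter out cases where adj + noun is an entity or the noun is a proper noun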
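The filtering rules above can be sketched as a predicate over POS-tagged (adjective, noun) pairs. The sketch assumes the tokens already carry Penn Treebank tags (JJ, JJR, JJS, NN, NNS, NNP, NNPS), as a tagger such as Stanford NLP would produce. The definition of "special characters" (anything other than letters and internal hyphens) and the names `keep_pair` and `filter_pairs` are assumptions. The entity check is omitted because it needs a knowledge base lookup. The frequency threshold follows the "frequency ≤ 5" rule from the pipeline slide.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

Tagged = Tuple[str, str]  # (token, Penn Treebank POS tag)

ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$", re.IGNORECASE)
PLAIN_WORD = re.compile(r"^[A-Za-z]+(-[A-Za-z]+)*$")  # assumed "no special characters"


def keep_pair(adj: Tagged, noun: Tagged) -> bool:
    """Return True if the (adj, noun) pair survives the filtering rules."""
    adj_tok, adj_tag = adj
    noun_tok, noun_tag = noun
    if ORDINAL.match(adj_tok):          # numeric ordinals such as "1st"
        return False
    if adj_tag in {"JJR", "JJS"}:       # comparatives and superlatives
        return False
    if noun_tag in {"NNP", "NNPS"}:     # proper nouns
        return False
    if adj_tag != "JJ" or noun_tag not in {"NN", "NNS"}:
        return False
    return bool(PLAIN_WORD.match(adj_tok)) and bool(PLAIN_WORD.match(noun_tok))


def filter_pairs(pairs: Iterable[Tuple[Tagged, Tagged]],
                 min_count: int = 6) -> Dict[Tuple[str, str], int]:
    """Count surviving pairs and drop those seen fewer than min_count times."""
    counts = Counter((a[0], n[0]) for a, n in pairs if keep_pair(a, n))
    return {pair: c for pair, c in counts.items() if c >= min_count}


sample = [(("nuclear", "JJ"), ("weapon", "NN"))] * 6
sample += [(("public", "JJ"), ("library", "NN"))]
sample += [(("bigger", "JJR"), ("city", "NN"))] * 10
print(filter_pairs(sample))
```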

Adj + noun statistics
adj + noun: 8,095,419

Statistics of the adjectives in adj + noun
adj: 747,043
Adjectives shown in the figure: first, other, new, same, second, many, own, several.

Statistics of the adjectives in adj + noun
adj: 747,043

            Adjs in WordNet   Overlap with the adjs in adj + noun
Total       21,557            16,880
adj.all     17,777            13,785
adj.pert    4,379             3,055
adj.ppl     76                40

adj.all ∩ adj.pert: 663; adj.all ∩ adj.ppl: 12
The three categories sum to 22,232; subtracting the 675 adjectives that fall into two categories gives the total of 21,557.
4,677 WordNet adjectives do not appear among the adjectives in adj + noun (21,557 - 16,880 = 4,677).

Statistics of the nouns in adj + noun
noun: 396,040
Nouns shown in the figure: school, system, version, form, people, style, group, state, approach, time, year, years, life, season.

Statistics of the nouns in adj + noun
noun: 396,040

                     Nouns in WordNet   Overlap with the nouns in adj + noun
Total                119188             35629
noun.person          18899              6703
noun.artifact        16381              6655
noun.act             9459               5274
noun.communication   8300               3882
noun.attribute       4802               3255
noun.state           5622               2726
noun.cognition       4429               2465
noun.animal          14324              2351
noun.substance       4639               1949
noun.plant           17809              1614
noun.group           3972               1381
noun.food            3595               1237
noun.location        4907               1194
noun.body            3572               1132
noun.event           1663               1064
noun.object          2303               867
noun.quantity        2031               844
noun.process         1127               665
noun.feeling         773                610
noun.possession      1520               563
noun.time            1689               532
noun.phenomenon      986                416
noun.shape           540                357
noun.relation        679                312
noun.Tops            83                 63
noun.motive          78                 41

Diagram labels: Synset (320), part_of_speech: noun; Synset1, part_of_speech: adjective; attribute (620).

                      WordNet   Overlap
Adjs + [attribute]    698       656
Nouns + [attribute]   606       502

656/698

Thanks

Background
Adj + Noun is also an important part of question understanding. For example, most KBQA systems (e.g. gAnswer) map "adjective + noun" to "special classes".

General approach for Adj + Noun -> special classes
Compute the lexical similarity between the "adj + noun" and the class name.
Example: nuclear weapon; yago:NuclearWeapons; yago:NuclearWeapon103834604

Common lexical similarity methods
Edit distance
Word2Vec
SimHash
Jaro distance
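Of the similarity methods listed, edit distance is the simplest to illustrate. The sketch below normalizes Levenshtein distance into a score in [0, 1] and compares a phrase with a class LocalName. The LocalName splitting rule and the helper names are assumptions, not the method used by gAnswer or by this work.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def class_label(class_iri: str) -> str:
    """Turn 'yago:NuclearWeapon103834604' into 'nuclear weapon' (assumed rule)."""
    local = class_iri.split(":", 1)[-1]
    local = re.sub(r"\d+$", "", local)
    return " ".join(w.lower() for w in re.findall(r"[A-Z][a-z]*|[a-z]+", local))


def lexical_similarity(phrase: str, class_iri: str) -> float:
    """1.0 means identical strings; values near 0 mean very different strings."""
    label = class_label(class_iri)
    longest = max(len(phrase), len(label)) or 1
    return 1.0 - levenshtein(phrase.lower(), label) / longest


for phrase in ("nuclear weapon", "atomic weapon"):
    for cls in ("yago:NuclearWeapons", "yago:NuclearWeapon103834604"):
        print(phrase, cls, round(lexical_similarity(phrase, cls), 3))
```

A purely string-based score like this ranks "nuclear weapon" close to both classes but gives "atomic weapon" a much lower score, even though the two phrases mean the same thing.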

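Motivation
Problems with the general approach
1. When the surface form of the "adj + noun" differs substantially from the class name, the mapping fails. Example: "atomic weapon" cannot be mapped accurately to yago:NuclearWeapons.
2. Relying on lexical similarity alone leads to wrong mappings. Example: public library; yago:PublicLibraries has 6 entities, yago:PublicLibrary107978170 has 262 entities.
3. Context information in the question is hard to exploit:
   Which Greek goddesses dwelt on Mount Olympus?
   Which European countries have a constitutional monarchy?
   Give me all American presidents in the last 20 years.
   Give me all chemical elements.

Context information of classes in the knowledge base
Information between classes
Information between entities and classes

Use Wikipedia to link adj + noun to the entities/classes in the knowledge base
Online retrieval + statistical learning
Offline construction of the resource library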

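Resource library construction strategy
1) Recognition and extraction of adj + noun (from Wikipedia text)
2) Candidate class generation for adj + noun
3) Filtering and re-ranking of candidate classes
4) Expansion of the resource library (using WordNet and PPDB)
Diagram labels: English; engineer; city; yago:Engineer109615807; Class2; Class3; Class4.

Experimental evaluation
Evaluation of the "classifier" used in filtering and re-ranking candidate classes
Evaluation of the resource library:
  Coverage of the library's "adj + noun" on question answering datasets
  Accuracy of the library's "adj + noun -> special classes" on question answering datasets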

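Extracting adj + noun from Wikipedia text
Text corpus: 4,641,892 Wikipedia articles
Tool: Stanford NLP POS
Filtering rules:
1. Filter out cases where the adj is an ordinal
2. Filter out adj + specific nouns
3. Filter out comparative and superlative forms of the adj
4. Filter out cases where adj + noun is an entity
5. Filter out cases where the adj/noun contains special characters
6. Filter out adj + noun with low frequency
7. The adj + noun + noun case is not considered
adjs: 26,693
adj + noun: 288,452
On average, each adj modifies 10.8 nouns.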

Goals
Obtain the nouns that an adj may modify
Map Adj + Noun to classes in the knowledge base (DBpedia)
Diagram labels: English; engineer; city; yago:Engineer109615807; Class2; Class3; Class4; Class5; Class6.

Adj + noun extraction
Corpus source: Wikipedia

2. Candidate class generation for adj + noun
Example: English engineer

Type (first list):
yago:CausalAgent100007347
yago:Colleague109935990
yago:ComputerScientist109951070
yago:ComputerUser109951274
yago:Contestant109613191
yago:Engineer109615807
yago:MilitaryOfficer110317007
……

Type (second list):
yago:Businessperson109882716
yago:Capitalist109609232
yago:CausalAgent100007347
yago:CivilAuthority110541833
yago:Donor110025730
yago:Engineer109615807
yago:Contestant109613191
……

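3. Filtering and re-ranking of candidate classes
Manual labeling + classification

Candidate classes of <English, engineer> and their features
Table columns:
  Co-occurrence count of Adj + Noun and the class
  Level of the candidate class in the ontology class hierarchy
  Number of classes at the candidate class's level
  Lexical similarity between Noun and the class
  Semantic similarity between Noun and the class
  Manual label

Rows (only the values that survive in the transcript are shown):
yago:CausalAgent100007347: 1 ……
yago:Colleague109935990
yago:ComputerScientist109951070
yago:ComputerUser109951274
yago:Contestant109613191
yago:Engineer109615807: 2, 3, 10, 0.81, 0.92
yago:MilitaryOfficer110317007
owl:Thing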
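Step 3 pairs manual labels with a classifier whose scores can both filter and re-rank candidates. The slides do not name a model, so the sketch below assumes logistic regression trained by plain gradient descent on two of the listed features (lexical and semantic similarity). The feature values, labels and candidate names are made-up toy data, not the annotated set.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Probability that a candidate class is a correct mapping."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(xs: List[Sequence[float]], ys: List[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Batch gradient descent on the logistic loss, starting from zero weights."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = score(x, w, b) - y
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


# Toy training data: [lexical similarity, semantic similarity] -> manual label.
train_x = [[0.9, 0.9], [0.8, 0.7], [0.2, 0.3], [0.1, 0.2]]
train_y = [1, 1, 0, 0]
w, b = train_logistic(train_x, train_y)

# Hypothetical candidates for one <adj, noun> pair, re-ranked by score.
candidates = {"candidate_a": [0.85, 0.9], "candidate_b": [0.1, 0.1]}
ranked = sorted(candidates, key=lambda c: score(candidates[c], w, b), reverse=True)
print(ranked)
```

Using one probabilistic score for both steps keeps the design simple: candidates below a threshold are dropped, and the rest are sorted by the same score.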

4. Expansion of the resource library
WordNet and PPDB
Diagram labels: Adj; Adj2; antonymy; Noun1; Noun2; Class1; Class2; Class3; Class4; Class5; Class6.

Thanks