Mapping Adjective + Noun to Specific Types in the Knowledge Base
王远 (Wang Yuan), 2018/11/22
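Candidate Type generation
Pipeline (diagram): Wikipedia Pages -> Adj + Noun -> DBpedia Entity (dbo:wikiPageID) -> types in the DBpedia Ontology / YAGO Ontology (plus SDType)
Resource size: Adj + Noun: 1,836,620; Adj: 75,867; Noun: 70,140

Test dataset:
1. 49 <adj, noun> pairs were extracted from QALD-1 to QALD-7.
2. Each <adj, noun> pair was annotated with one gold Type, based on the gold-standard SPARQL queries provided by QALD.

Test results (out of 49 pairs):
Top-100  Top-50  Top-20  Top-10  Average
40       35      21      4       28

Result analysis (the 49 - 40 = 9 pairs missing from the Top-100):
1. For 3 <adj, noun> pairs, the noun is a compound noun that is not included in WordNet.
   Examples: Grunge#;#record label, Australian#;#metalcore band
2. 2 <adj, noun> pairs are not in the resource library (they may have been filtered out, or this adj + noun may not occur in Wikipedia).
   Example: Swedish#;#holiday
3. For 4 <adj, noun> pairs, the gold Type is not in the Top-100.
   Example: anti-apartheid#;#activist ( ?uri rdf:type text:"anti-apartheid activist" . )
Diagram labels: Military conflicts, Given Name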
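A Top-k figure of this kind can be obtained by checking, for each test pair, whether its gold Type occurs among the first k ranked candidates. The sketch below is only an illustration under that assumption: `hits_at_k`, the `ranked` and `gold` dictionaries, and the `ex:SomeGoldType` label are hypothetical and not taken from the test set.

```python
def hits_at_k(ranked, gold, k):
    """Count the pairs whose gold Type appears among the first k candidates."""
    return sum(1 for pair, g in gold.items() if g in ranked.get(pair, [])[:k])

# Toy, made-up data: ranked candidate Types per <adj, noun> pair.
ranked = {
    ("Swedish", "holiday"): [],  # pair missing from the resource library
    ("English", "engineer"): [
        "yago:CausalAgent100007347",
        "yago:Engineer109615807",
    ],
}
# Hypothetical gold annotations (one gold Type per pair).
gold = {
    ("Swedish", "holiday"): "ex:SomeGoldType",
    ("English", "engineer"): "yago:Engineer109615807",
}

for k in (1, 2):
    print(k, hits_at_k(ranked, gold, k))
```

A pair with an empty candidate list never counts as a hit, which matches the second failure case in the result analysis.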
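Feature groups: context features of Adj + Noun; co-occurrence features of Adj + Noun and Type; similarity features; context features of Type

Lexical similarity feature
#5 Lexical similarity between Adj + Noun and the LocalName of the candidate Type
Semantic similarity feature
#6 Semantic similarity between Adj + Noun and the LocalName of the candidate Type
Context features of Type
#7 Level of the candidate Type in the class hierarchy
#8 Number of entities of the candidate Type in the knowledge base
#9 PMI between the candidate Type and the other candidate Types
#10 Whether the candidate Type belongs to the DBpedia Ontology or the YAGO Ontology
#11 Whether the LocalName of the candidate Type contains Adj + Noun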
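Two of these features are easy to illustrate. The sketch below assumes that feature #11 is a case-insensitive substring test on the LocalName and that the PMI of feature #9 is estimated from how often two Types are assigned to the same entity. The slides do not say how either feature is actually computed, so the function names, the estimation scheme, and the toy data are all assumptions.

```python
import math

def contains_adj_noun(local_name, adj, noun):
    """Feature #11 (assumed form): does the LocalName contain adj + noun?"""
    return (adj + noun).lower() in local_name.lower()

def pmi(type_a, type_b, entity_types):
    """Feature #9 (assumed form): PMI of two Types over per-entity type sets."""
    n = len(entity_types)
    a = sum(1 for ts in entity_types if type_a in ts)
    b = sum(1 for ts in entity_types if type_b in ts)
    ab = sum(1 for ts in entity_types if type_a in ts and type_b in ts)
    if ab == 0:
        return float("-inf")  # the two Types never co-occur
    # log( P(a, b) / (P(a) * P(b)) ) with counts substituted for probabilities
    return math.log((ab * n) / (a * b))

# Toy, made-up type sets of four entities.
toy_entities = [{"A", "B"}, {"A", "B"}, {"C"}, {"C"}]
print(contains_adj_noun("PublicLibrary107978170", "public", "library"))
print(pmi("A", "B", toy_entities))
```

The substring test deliberately ignores tokenization, so it also fires on plural LocalNames such as `PublicLibraries` only when the singular string is a prefix of the plural; irregular plurals would need a lemmatizer.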
Types whose LocalName contains Adj + Noun
DBpedia: 60/414
YAGO: 135,219/369,144

adj + noun              count   adj         count   noun        count
musical + group         1292    american    1753    people      4494
political + party       880     musical     1388    descent     3011
military + unit         627     military    1230    group       1578
gaelic + footballer     482     political   1198    school      1176
religious + building    459     british     1052    party       1131
archaeological + site   431     defunct     1039    football    793
american + people       340     canadian    942     unit        704
american + football     301     french      893     building    639
human + rights          266     german      881     footballer  638
……

Future work
-- Generate feature data for each candidate Type
-- Manually annotate 100~200 groups of data
-- Train a binary classification model for filtering and ranking
Thanks
Heiko Paulheim, Christian Bizer. Type Inference on Noisy RDF Data. ISWC 2013.

Motivation
In DBpedia, common reasons for missing type statements are:
-- Missing infoboxes. An article without an infobox is not assigned any type.
-- Too general infoboxes. If an article about an actor uses a person infobox instead of the more specific actor infobox, the instance is assigned the type dbpedia-owl:Person, but not dbpedia-owl:Actor.
-- Wrong infobox mappings. The videogame infobox is mapped to dbpedia-owl:VideoGame, not dbpedia-owl:Game, and dbpedia-owl:VideoGame is not a subclass of dbpedia-owl:Game in the DBpedia ontology.
-- Unclear semantics. Some DBpedia ontology classes do not have clear semantics, e.g. dbpedia-owl:College: "college", in British and US English, can denote private secondary schools, universities, or institutions within universities.

Standard RDFS reasoning via entailment rules
-- ?x a ?t1 . ?t1 rdfs:subClassOf ?t2 entails ?x a ?t2
-- ?x ?r ?y . ?r rdfs:domain ?t entails ?x a ?t
-- ?y ?r ?x . ?r rdfs:range ?t entails ?x a ?t
Reasoning seems the straightforward approach to tackle the problem of completing missing types.
The DBpedia dataset contains all types from the infobox types dataset (i.e., DBpedia ontology, schema.org, and UMBEL).
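The three entailment rules can be sketched as a small forward-chaining routine over (subject, predicate, object) tuples. This is a minimal illustration, not a full RDFS reasoner: `rdfs_types` is a hypothetical helper, and the two schema triples in the toy graph (the range of dbpedia-owl:sourceMountain and the superclass of dbpedia-owl:Mountain) are assumptions added for the example.

```python
def rdfs_types(triples):
    """Return all (resource, type) pairs entailed by the three rules."""
    sub = {(s, o) for s, p, o in triples if p == "rdfs:subClassOf"}
    dom = {(s, o) for s, p, o in triples if p == "rdfs:domain"}
    rng = {(s, o) for s, p, o in triples if p == "rdfs:range"}
    types = {(s, o) for s, p, o in triples if p == "rdf:type"}

    # Domain rule types the subject, range rule types the object.
    for s, p, o in triples:
        for r, t in dom:
            if p == r:
                types.add((s, t))
        for r, t in rng:
            if p == r:
                types.add((o, t))

    # Subclass rule, applied until no new type statement appears.
    changed = True
    while changed:
        changed = False
        for x, t1 in list(types):
            for c1, c2 in sub:
                if t1 == c1 and (x, c2) not in types:
                    types.add((x, c2))
                    changed = True
    return types

toy_graph = [
    ("dbpedia:Mze", "dbpedia-owl:sourceMountain", "dbpedia:Germany"),
    # Assumed schema triples, for illustration only.
    ("dbpedia-owl:sourceMountain", "rdfs:range", "dbpedia-owl:Mountain"),
    ("dbpedia-owl:Mountain", "rdfs:subClassOf", "dbpedia-owl:NaturalPlace"),
]
inferred = rdfs_types(toy_graph)
print(sorted(inferred))
```

With these inputs a single questionable statement is enough to type dbpedia:Germany as a mountain, which is why plain reasoning is fragile on noisy data.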
Example: dbr:Germany
Types: country, award, city, sports team, mountain, stadium, record label, person, and military conflict ……
Statements behind such types:
-- dbpedia:Mze dbpedia-owl:sourceMountain dbpedia:Germany . (Mze: the Mže River, 姆熱河)
-- dbpedia:XII Corps (United Kingdom) dbpedia-owl:battle dbpedia:Germany .
SDType: an approach for inducing types which is tolerant with respect to erroneous and noisy data.
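One way to tolerate such noise is to let every property of a resource cast a weighted vote for each type instead of applying hard rules. The sketch below follows a reading of the weighted-voting idea in the cited paper: each property votes with the conditional type distribution observed on already-typed resources, and is weighted by how far that distribution deviates from the prior. It is not the reference implementation, and the function name, input layout, and toy data are assumptions.

```python
from collections import Counter, defaultdict

def sdtype_scores(resource_props, typed_resources):
    """Score types for one resource.

    resource_props: properties observed on the resource to be typed.
    typed_resources: list of (set_of_properties, set_of_types) pairs taken
        from resources that already have types.
    """
    n = len(typed_resources)
    type_count = Counter()
    prop_count = Counter()
    joint = defaultdict(Counter)  # joint[p][t]: resources with p and t
    for props, types in typed_resources:
        type_count.update(types)
        for p in props:
            prop_count[p] += 1
            joint[p].update(types)

    prior = {t: c / n for t, c in type_count.items()}
    scores = Counter()
    total_weight = 0.0
    for p in set(resource_props):
        if prop_count[p] == 0:
            continue  # property never seen on a typed resource
        cond = {t: joint[p][t] / prop_count[p] for t in prior}
        # Properties whose type distribution differs from the prior weigh more.
        weight = sum((prior[t] - cond[t]) ** 2 for t in prior)
        total_weight += weight
        for t in prior:
            scores[t] += weight * cond[t]
    if total_weight == 0:
        return {}
    return {t: s / total_weight for t, s in scores.items()}

# Toy, made-up training data.
toy_typed = [
    ({"in:location"}, {"Place"}),
    ({"in:location"}, {"Place"}),
    ({"in:author"}, {"Person"}),
    ({"in:location"}, {"Person"}),  # a noisy statement
]
print(sdtype_scores({"in:location"}, toy_typed))
```

Because the scores are averaged over all properties, one faulty statement lowers the confidence of a type rather than asserting a wrong one outright.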
Evaluation
-- Random samples of 10,000 instances from DBpedia and OpenCyc.
-- Using only ingoing properties. In DBpedia, outgoing properties and types are generated in the same step, so the correct type can be trivially predicted from outgoing properties.
-- The reason for that is that DBpedia, with its stronger focus on coverage than on correctness, contains more faulty statements. When more links are present, the influence of each individual statement is reduced, which allows for correcting errors.
Evaluation
-- From all 550,048 untyped resources in DBpedia, the classifier identifies 519,900 (94.5%) as typeable.
-- Types were generated for those resources and evaluated manually on a sample of 100 random resources: 91.8%.
Estimating Type Completeness in DBpedia
DBpedia types are at most 63.7% complete, with at least 2.7 million missing type statements (while YAGO types, which can be assessed accordingly, are at most 53.3% complete).
Candidate Type generation
Pipeline (diagram): Wikipedia Pages -> Adj + Noun -> DBpedia Entity (dbo:wikiPageID) -> types in the DBpedia Ontology / YAGO Ontology (rdf:type, SDType)
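Read as a procedure, the pipeline links each Wikipedia page to its DBpedia entity and credits the entity's types to every Adj + Noun found on that page. The sketch below assumes exactly that counting scheme, which the diagram only suggests; `candidate_types`, the three input mappings, and the entities `dbr:A` and `dbr:B` are hypothetical.

```python
from collections import Counter, defaultdict

def candidate_types(page_pairs, page_to_entity, entity_types):
    """Count, per (adj, noun) pair, how often each Type co-occurs with it.

    page_pairs: page id -> list of (adj, noun) pairs extracted from the page.
    page_to_entity: page id -> DBpedia entity (the dbo:wikiPageID link).
    entity_types: entity -> set of Types (rdf:type, possibly SDType output).
    """
    counts = defaultdict(Counter)
    for page_id, pairs in page_pairs.items():
        entity = page_to_entity.get(page_id)
        if entity is None:
            continue  # page without a linked entity
        for pair in set(pairs):  # count each pair once per page
            counts[pair].update(entity_types.get(entity, ()))
    return counts

# Toy, made-up inputs.
toy_pages = {1: [("English", "engineer")], 2: [("English", "engineer")]}
toy_links = {1: "dbr:A", 2: "dbr:B"}
toy_types = {
    "dbr:A": {"yago:Engineer109615807", "yago:Contestant109613191"},
    "dbr:B": {"yago:Engineer109615807"},
}
toy_counts = candidate_types(toy_pages, toy_links, toy_types)
print(toy_counts[("English", "engineer")].most_common(100))
```

Calling `most_common(k)` on a pair's counter yields a Top-k candidate list ordered by co-occurrence count.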
Thanks
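Extraction pipeline (diagram): preprocessing -> identification -> filtering

Preprocessing
-- Wikipedia Pages, processed with WikiExtractor
-- Remove tables, images, links, and markup
-- Extract the abstract section
Identification
-- StanfordNLP syntactic parsing: identify NPs
-- StanfordNLP POS tagging: identify Adj + Noun within each NP
-- WordNet vocabulary: identify compound adjectives / compound nouns within each NP
Filtering
-- Adj is an ordinal
-- Adj/Noun contains special characters
-- Adj is a comparative or superlative
-- Adj + Noun is an entity, or Noun is a proper noun
-- Adj + Noun frequency ≤ 5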
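Extracting adj + noun from Wikipedia text
Version: 2018/10; size: 15.9G
1. Extract Adj + NP (where the NP contains only nouns).
2. Filter out adjs that are numeric ordinals.
3. Filter out cases where the adj/noun contains special characters.
   adj + noun: 13,309,280; adj: 834,028; noun: 2,436,532
4. Filter out comparatives and superlatives.
   adj + noun: 12,819,292; adj: 831,202; noun: 2,361,237
5. Filter out adj + nouns (more than one noun).
   adj + noun: 8,095,419; adj: 747,043; noun: 396,040
6. Filter out cases where adj + noun is an entity or the noun is a proper noun.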
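Most of these filters can be expressed directly on POS-tagged tokens. The sketch below assumes Penn Treebank tags (JJ, JJR, JJS, NN, NNS, NNP) on input that has already been tagged, treats "special characters" as anything other than letters and inner hyphens, and applies the frequency cut-off (drop pairs seen 5 times or fewer) from the pipeline diagram. The entity filter is omitted, and all names and the toy sentence are illustrative assumptions.

```python
import re
from collections import Counter

ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$", re.IGNORECASE)  # numeric ordinals
CLEAN = re.compile(r"^[A-Za-z]+(-[A-Za-z]+)*$")  # letters and inner hyphens
NOUN_TAGS = {"NN", "NNS"}  # common nouns only; NNP/NNPS are proper nouns

def extract_pairs(tagged):
    """Extract (adj, noun) pairs from a list of (word, tag) tuples."""
    pairs = []
    for i in range(len(tagged) - 1):
        (adj, adj_tag), (noun, noun_tag) = tagged[i], tagged[i + 1]
        # Accepting only JJ drops comparatives (JJR) and superlatives (JJS).
        if adj_tag != "JJ" or noun_tag not in NOUN_TAGS:
            continue
        next_tag = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        if next_tag.startswith("NN"):
            continue  # adj + noun + noun
        if ORDINAL.match(adj) or not CLEAN.match(adj) or not CLEAN.match(noun):
            continue
        pairs.append((adj.lower(), noun.lower()))
    return pairs

def frequent(pairs, min_count=6):
    """Keep pairs that occur at least min_count times."""
    counts = Counter(pairs)
    return {p: c for p, c in counts.items() if c >= min_count}

# Toy, hand-tagged token sequence.
toy_tagged = [
    ("the", "DT"), ("nuclear", "JJ"), ("weapon", "NN"), ("and", "CC"),
    ("larger", "JJR"), ("city", "NN"), ("1st", "JJ"), ("place", "NN"),
    ("public", "JJ"), ("library", "NN"), ("system", "NN"),
    ("English", "JJ"), ("London", "NNP"),
]
print(extract_pairs(toy_tagged))
```

Only numeric ordinals are rejected here, so spelled-out ordinals such as "first" pass through, in line with the numeric-ordinal wording of step 2.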
adj + noun statistics
adj + noun: 8,095,419
Statistics of adjs in adj + noun
adj: 747,043
Frequent adjs (figure labels): first, other, new, same, second, many, own, several
Statistics of adjs in adj + noun
adj: 747,043

            WordNet adjs   Overlap with adjs in adj + noun
Total       21,557         16,880
adj.all     17,777         13,785
adj.pert    4,379          3,055
adj.ppl     76             40

Sum over the three categories: 22,232. adj.all ∩ adj.pert: 663; adj.all ∩ adj.ppl: 12 (22,232 - 663 - 12 = 21,557).
4,677 WordNet adjs (21,557 - 16,880) do not occur among the adjs in adj + noun.
Statistics of nouns in adj + noun
noun: 396,040
Frequent nouns (figure labels): school, system, version, form, people, style, group, state, approach, time, year, years, life, season
Statistics of nouns in adj + noun
noun: 396,040

                     WordNet nouns   Overlap with nouns in adj + noun
Total                119,188         35,629
noun.person          18,899          6,703
noun.artifact        16,381          6,655
noun.act             9,459           5,274
noun.communication   8,300           3,882
noun.attribute       4,802           3,255
noun.state           5,622           2,726
noun.cognition       4,429           2,465
noun.animal          14,324          2,351
noun.substance       4,639           1,949
noun.plant           17,809          1,614
noun.group           3,972           1,381
noun.food            3,595           1,237
noun.location        4,907           1,194
noun.body            3,572           1,132
noun.event           1,663           1,064
noun.object          2,303           867
noun.quantity        2,031           844
noun.process         1,127           665
noun.feeling         773             610
noun.possession      1,520           563
noun.time            1,689           532
noun.phenomenon      986             416
noun.shape           540             357
noun.relation        679             312
noun.Tops            83              63
noun.motive          78              41
WordNet [attribute] relation

                      WordNet   overlap
Adjs + [attribute]    698       656
Nouns + [attribute]   606       502

656/698
Diagram: Synset (part_of_speech: noun) (320) -- attribute -- Synset1 (part_of_speech: adjective) (620)
Thanks
Background
-- Adj + Noun is also an important part of question understanding. For example, most KBQA question answering systems (e.g. gAnswer) map "adjective + noun" to "special classes".
-- The usual method for Adj + Noun -> special classes: compute the lexical similarity between the "adj + noun" and the class name.
   Example: nuclear weapon -> yago:NuclearWeapons, yago:NuclearWeapon103834604
-- Common lexical similarity methods: edit distance, Word2Vec, SimHash, Jaro distance
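Edit distance, the first method in the list, can be sketched as follows. The normalization step (dropping the prefix and a trailing numeric id, splitting CamelCase) and the length-normalized score are assumptions made for the illustration; the slides do not state how class names are preprocessed.

```python
import re

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize_class(class_name):
    """'yago:NuclearWeapon103834604' -> 'nuclear weapon' (assumed scheme)."""
    local = class_name.split(":", 1)[-1]
    local = re.sub(r"\d+$", "", local)  # drop the trailing numeric id
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local).lower()

def lexical_similarity(phrase, class_name):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    a, b = phrase.lower(), normalize_class(class_name)
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

for cls in ("yago:NuclearWeapons", "yago:NuclearWeapon103834604"):
    print(cls, lexical_similarity("nuclear weapon", cls))
print(lexical_similarity("atomic weapon", "yago:NuclearWeapons"))
```

A purely string-based score like this rates "atomic weapon" against yago:NuclearWeapons well below "nuclear weapon", even though the two phrases mean the same thing.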
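Motivation
Problems with the usual method
-- When the surface form of the "adj + noun" differs substantially from the class name, the mapping fails.
   Example: "atomic weapon" cannot be mapped accurately to yago:NuclearWeapons.
-- Relying on lexical similarity alone leads to wrong mappings.
   Example: public library
   yago:PublicLibraries: 6 entities
   yago:PublicLibrary107978170: 262 entities
-- Context information in the question is hard to exploit.
   Which Greek goddesses dwelt on Mount Olympus?
   Which European countries have a constitutional monarchy?
   Give me all American presidents in the last 20 years.
   Give me all chemical elements.
-- Context information of classes in the knowledge base (not used by the usual method):
   information between classes
   information between entities and classes

Approach
-- Use Wikipedia to link adj + noun to entities/classes in the knowledge base.
-- Online retrieval + statistical learning
-- Offline construction of a resource library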
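Resource library construction strategy
1) Identification and extraction of adj + noun (from Wikipedia text)
2) Candidate class generation for adj + noun
3) Filtering and re-ranking of candidate classes
4) Expansion of the resource library (using WordNet and PPDB)
Diagram labels: English, engineer, city, yago:Engineer109615807, Class2, Class3, Class4

Experimental evaluation
-- Evaluation of the "classifier" used for filtering and re-ranking candidate classes
-- Evaluation of the resource library
   -- Coverage of the library's "adj + noun" on question answering datasets
   -- Accuracy of the library's "adj + noun -> special classes" on question answering datasets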
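Extracting adj + noun from Wikipedia text
Text corpus: 4,641,892 Wikipedia articles
Tool: Stanford NLP POS
Filtering rules:
1. Filter out cases where the adj is an ordinal.
2. Filter out adj + certain specific nouns.
3. Filter out adjs in comparative or superlative form.
4. Filter out cases where adj + noun is an entity.
5. Filter out cases where the adj/noun contains special characters.
6. Filter out adj + noun with low frequency.
7. adj + noun + noun is not considered.
adjs: 26,693; adj + noun: 288,452
On average, each adj modifies 10.8 nouns.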
Goals
-- Obtain the nouns that an adj may modify.
-- Map Adj + Noun to classes in the knowledge base (DBpedia).
Diagram labels: English; engineer, city; yago:Engineer109615807, Class2, Class3, Class4, Class5, Class6
Adj + noun extraction
Corpus source: Wikipedia
2. Candidate class generation for adj + noun
Example: English engineer

Type list (first example):
yago:CausalAgent100007347
yago:Colleague109935990
yago:ComputerScientist109951070
yago:ComputerUser109951274
yago:Contestant109613191
yago:Engineer109615807
yago:MilitaryOfficer110317007
……

Type list (second example):
yago:Businessperson109882716
yago:Capitalist109609232
yago:CausalAgent100007347
yago:CivilAuthority110541833
yago:Donor110025730
yago:Engineer109615807
yago:Contestant109613191
……
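3. Filtering and re-ranking of candidate classes
Manual annotation + classification

Candidate classes of <English, engineer> and their features
Feature columns:
(a) Co-occurrence count of Adj + Noun and the class
(b) Level of the candidate class in the ontology class hierarchy
(c) Number of classes at the candidate class's level
(d) Lexical similarity between Noun and the class
(e) Semantic similarity between Noun and the class
(f) Manual label

Candidate class                    (a)  (b)  (c)  (d)   (e)
yago:CausalAgent100007347          1    ……
yago:Colleague109935990
yago:ComputerScientist109951070
yago:ComputerUser109951274
yago:Contestant109613191
yago:Engineer109615807             2    3    10   0.81  0.92
yago:MilitaryOfficer110317007
owl:Thing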
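The classification step can be illustrated with a small logistic regression trained by stochastic gradient descent, whose output probability doubles as a re-ranking score. This is a sketch under assumptions: the slides do not name a model, only two of the features (lexical and semantic similarity) are used, and the feature values and labels below are made up rather than taken from annotated data.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights; the last weight is the bias."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
            w[-1] -= lr * err
    return w

def score(w, x):
    """Probability that a candidate class is a correct mapping."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

# Made-up training rows: [lexical similarity, semantic similarity].
toy_rows = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
toy_labels = [1, 1, 0, 0]  # 1 = correct class, 0 = incorrect class
weights = train(toy_rows, toy_labels)

# Hypothetical candidates to filter and re-rank.
candidates = {"ex:ClassA": [0.85, 0.85], "ex:ClassB": [0.1, 0.1]}
ranking = sorted(candidates, key=lambda c: score(weights, candidates[c]),
                 reverse=True)
print(ranking)
```

Thresholding the score filters candidates, and sorting by it re-ranks the ones that remain.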
4. Expansion of the resource library
Resources: WordNet and PPDB
Diagram labels: Adj, Adj2 (antonymy); Noun1, Noun2; Class1, Class2, Class3, Class4, Class5, Class6
Thanks