Published by Johan Håkansson. Modified 5 years ago.
1
Wikilinks Data Overview
丁文韬 (Ding Wentao), 施林锋 (Shi Linfeng)
Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia
Authors: Sameer Singh and Andrew McCallum (UMass CS), Google Research, etc.
2
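Overview
Large-scale data labeled via links to Wikipedia
Addresses cross-document entity resolution (coreference)
Uses web pages that contain Wikipedia links as the corpus
Compared with earlier datasets:
  Larger: 9.5 million web pages, 40 million mentions, 3 million entities
  More accurate: the pages are written by individuals, so the link labels are fairly precise
  More realistic: sources are not limited to one genre (such as news), the text is close to real-world usage, and entity types are diverse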
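The labeling idea on this slide can be sketched roughly as follows: mentions on different pages whose links point to the same Wikipedia page are treated as referring to the same entity. This is only an illustrative sketch. The tuple layout, the function name `cluster_by_target`, and the sample mentions are assumptions made up for the example, not the released data or the authors' code.

```python
from collections import defaultdict

# Hypothetical sample: (source page URL, anchor text, Wikipedia target URL).
mentions = [
    ("http://example.com/a", "Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("http://example.org/b", "NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("http://example.net/c", "Paris", "http://en.wikipedia.org/wiki/Paris"),
]


def cluster_by_target(mentions):
    """Group mentions into entity clusters keyed by their Wikipedia target URL."""
    clusters = defaultdict(list)
    for page_url, anchor_text, wiki_url in mentions:
        # Mentions sharing a target URL land in the same cluster,
        # regardless of which page they came from or how they are worded.
        clusters[wiki_url].append((page_url, anchor_text))
    return dict(clusters)


clusters = cluster_by_target(mentions)
for wiki_url, members in clusters.items():
    print(wiki_url, [text for _, text in members])
```

Keying on the link target means differently worded mentions ("Big Apple", "NYC") end up in one cluster without any string matching.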
3
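Construction method
Crawl web pages that contain links to English Wikipedia
Filter: drop pages where more than 70% of the content comes from Wikipedia itself
Release the data
Released fields: URL; Mention (offset, url); …; Token (offset)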
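The filtering step and the record layout on this slide could be sketched as below. Everything here is an assumption for illustration: the class and field names, the unit of the offsets, and measuring "content from Wikipedia" as the fraction of page sentences that occur verbatim in a set of Wikipedia sentences. The slide gives only the 70% threshold and the field names URL, mention offset/url, and token offset.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Mention:
    offset: int  # position of the mention in the page (unit is an assumption)
    url: str     # Wikipedia URL that the mention links to


@dataclass
class Page:
    url: str
    mentions: List[Mention] = field(default_factory=list)
    token_offsets: List[int] = field(default_factory=list)


def wikipedia_fraction(sentences: List[str], wiki_sentences: Set[str]) -> float:
    """Fraction of page sentences that also occur verbatim in Wikipedia."""
    if not sentences:
        return 0.0
    copied = sum(1 for s in sentences if s in wiki_sentences)
    return copied / len(sentences)


def keep_page(sentences: List[str], wiki_sentences: Set[str],
              threshold: float = 0.7) -> bool:
    """Keep a page unless more than `threshold` of it comes from Wikipedia."""
    return wikipedia_fraction(sentences, wiki_sentences) <= threshold


# Hypothetical usage with toy data.
wiki = {"a", "b", "c"}
page = Page(url="http://example.com/a",
            mentions=[Mention(offset=12, url="http://en.wikipedia.org/wiki/Paris")],
            token_offsets=[0, 5, 12])
print(keep_page(["a", "b", "c", "x"], wiki))  # 3 of 4 sentences copied
print(keep_page(["a", "x", "y"], wiki))       # 1 of 3 sentences copied
```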
4
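Statistics
Corpus: 9.5 million web pages
Because these are personal web pages, labeling accuracy is high: a random sample of 100 was entirely correct
Scale:
Distribution: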
5
Type statistics
The paper mentions: people, locations, organizations, general concepts
DBpedia class hierarchy: 774 classes, of which 104 appear in the corpus
6
Rank  Class                   Count    Share     Cumulative
1     Agent                            20.065%
2     Work                             10.289%   30.354%
3     Person                           9.823%    40.177%
4     Location                         9.742%    49.919%
5     Place                            9.701%    59.620%
6     PopulatedPlace                   7.214%    66.834%
7     Organisation                     4.631%    71.465%
8     WrittenWork             997358   3.034%    74.499%
9     MusicalWork             797420   2.426%    76.926%
10    Artist                  649733   1.977%    78.902%
11    Settlement              574820   1.749%    80.651%
12    Software                444810   1.353%    82.005%
13    ArchitecturalStructure  420293   1.279%    83.283%
14    Group                   336067   1.022%    84.306%
15    Athlete                 323761   0.985%    85.291%
16    Region                  295673   0.900%    86.190%
17    Event                   294125   0.895%    87.085%
18    SocietalEvent           287224   0.874%    87.959%
19    Species                 275750   0.839%    88.798%
20    Eukaryote               256947   0.782%    89.580%
21    TimePeriod              249037   0.758%    90.338%
………
32    Building                102975   0.313%    95.158%
52    Satellite               25871    0.079%    98.981%
53    WinterSportPlayer       24796    0.075%    99.056%
54    Plant                   23992    0.073%    99.129%
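The last percentage in each row of the listing above appears to be a running sum of the per-class shares. A minimal sketch that checks this for the seven most frequent classes, using only the share values shown on the slide (the function name `cumulative_shares` is made up for the example):

```python
# Share (%) of the seven most frequent classes, copied from the slide.
shares = [
    ("Agent", 20.065),
    ("Work", 10.289),
    ("Person", 9.823),
    ("Location", 9.742),
    ("Place", 9.701),
    ("PopulatedPlace", 7.214),
    ("Organisation", 4.631),
]


def cumulative_shares(rows):
    """Return (class, running total of shares) with totals rounded to 3 decimals."""
    total = 0.0
    result = []
    for name, share in rows:
        total += share
        result.append((name, round(total, 3)))
    return result


cumulative = cumulative_shares(shares)
for name, running_total in cumulative:
    print(f"{name:<16}{running_total:.3f}%")
```

Later rows on the slide can differ from such a sum in the last digit (for example 76.926% for MusicalWork), presumably because the slide's shares were rounded after the cumulative values were computed.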
7
Thx, Q&A