Extracting Infoboxes from Encyclopedia Websites. Presenter: Xu Bo (徐波).


1 Extracting Infoboxes from Encyclopedia Websites. Presenter: Xu Bo (徐波)

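2 Background
Encyclopedia websites contain rich structured information.
The most important source is the infobox table.
Infoboxes can be converted into knowledge-base formats:
(attribute, value) pair
(subject, predicate, object) tuple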
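As a minimal sketch of the conversion named on this slide, the snippet below turns the (attribute, value) pairs of one infobox into (subject, predicate, object) tuples. The function name and the sample entity are hypothetical and only serve as illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def infobox_to_triples(subject: str, infobox: Dict[str, str]) -> List[Triple]:
    """Turn each (attribute, value) pair into a (subject, predicate, object) tuple."""
    return [(subject, attribute, value) for attribute, value in infobox.items()]


# Hypothetical infobox content, for illustration only.
example = {"founder": "Bill Gates", "number of employees": "12,500"}
triples = infobox_to_triples("ExampleCorp", example)
```

The page title plays the role of the subject and the attribute name becomes the predicate.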

3 However, the number of high-quality pages is insufficient

4 Problem: How can we obtain more knowledge from encyclopedia websites?

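5 Basic idea
Obtain more knowledge from the body text of encyclopedia pages.
Use the correspondence between infobox tables and body text, and apply machine learning to imitate human reasoning and extract more knowledge.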

6 Related papers
F. Wu and D. S. Weld. Autonomously Semantifying Wikipedia. CIKM 2007.
Dustin Lange, Christoph Böhm and Felix Naumann. Extracting Structured Information from Wikipedia Articles to Populate Infoboxes. CIKM 2010.
Sultana et al. Infobox Suggestion for Wikipedia Entities. CIKM 2012.

7 Autonomously Semantifying Wikipedia
2019/2/19

8 Schema Refiner (模板精炼器, template refiner)
Schema refinement
Free editing -> schema drift
Duplicate templates: U.S.County (1428), US County (574), Counties (50), County (19)
Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
Low attribute usage: >15% occurrences
Yuzhi shiyan guadian
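The two refinement steps on this slide (merging duplicate attribute names, then discarding rarely used attributes) could be sketched as follows. The abbreviation table, the function names and the reading of ">15% occurrences" as a strict usage threshold are assumptions made for illustration, not the method of the cited paper.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed abbreviation table; a real refiner would need a richer one.
EXPANSIONS = {"yr": "year", "est": "estimate"}


def normalize(name: str) -> str:
    """Crude canonical form: lowercase, drop punctuation, expand abbreviations."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(EXPANSIONS.get(w, w) for w in words)


def refine_schema(infoboxes: List[Dict[str, str]], min_usage: float = 0.15) -> List[str]:
    """Keep normalized attributes used by more than `min_usage` of the infoboxes."""
    counts: Counter = Counter()
    for box in infoboxes:
        counts.update({normalize(attr) for attr in box})
    total = len(infoboxes)
    return sorted(a for a, c in counts.items() if c / total > min_usage)
```

With this normalization, “Census Yr” and “Census Year” collapse to the same attribute, while “Census Est.” stays distinct; deciding whether those should also merge needs more than string cleanup.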

9 Training Dataset Construction

10 Classifier
Document classifiers (1 per article type)
  List & Category as features
  Naïve Bayes, Maximum Entropy or SVM classifier
  Other: fast heuristic approach, precision 98.5% (with no learning!), recall 68.8%
Sentence classifier (1 per article type x attribute)
  Multi-class, multi-label text classification problem
  Trained on preprocessor output
  Features: bag of words, POS tags
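To make the sentence-classifier idea concrete, here is a deliberately simplified sketch: a single-label Naïve Bayes classifier over bag-of-words features with add-one smoothing. The slide describes a multi-label problem that also uses POS tags, so this is an illustration under reduced assumptions; the class name and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


class NaiveBayesSentenceClassifier:
    """Single-label multinomial Naive Bayes over bag-of-words features."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def fit(self, data: List[Tuple[str, str]]) -> None:
        for sentence, label in data:
            words = sentence.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, sentence: str) -> Optional[str]:
        words = sentence.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            n_words = sum(self.word_counts[label].values())
            score = math.log(count / total)  # log prior
            for w in words:  # add-one smoothed log likelihoods
                score += math.log(
                    (self.word_counts[label][w] + 1) / (n_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences, labeled with the attribute they mention.
clf = NaiveBayesSentenceClassifier()
clf.fit([
    ("the company was founded by alice smith", "founder"),
    ("bob jones founded the firm in 1990", "founder"),
    ("the company has 500 employees", "employees"),
    ("it employs 1200 employees worldwide", "employees"),
])
```

Sentences assigned to an attribute would then be passed on to that attribute's extractor.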

11 Extractor
Conditional Random Fields model [Lafferty 01]
Attribute value extraction: sequential data labeling
One CRF model for each attribute, trained independently
Why good
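Treating extraction as sequence labeling means each token of a sentence is described by a feature set before a CRF assigns it a label. The sketch below only builds such per-token features; the feature names are illustrative assumptions and are not the feature set of the cited paper.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe token i by its own shape and its immediate neighbours."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.replace(",", "").isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = "In 2008 the company had 54,400 employees".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
```

A CRF toolkit would consume one such feature list per sentence together with the per-token labels.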

12 Features

13 Extracting Structured Information from Wikipedia Articles to Populate Infoboxes
CIKM2007 CIKM2010 Given a Wikipedia article containing an incomplete infobox template call, the Infobox Population Problem is to extract as many correct attribute values from the article text as possible.

14 Structure Analysis
Many attributes have a characteristic structure
  number of employees: 12,500 (2003) -> (Number ‘(’ Number ‘)’)
  key people: Samuel J. Palmisano (Chairman, President and CEO)
Multiple values
  Bill Gates, Paul Allen for the founder attribute
Discover a structure that represents most of these values: simple, but powerful enough to split values and to combine value parts
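One way to picture structure discovery is to abstract every attribute value into a coarse pattern and pick the most frequent one. The tokenization rules and names below are assumptions for illustration and are much simpler than what the paper describes.

```python
import re
from collections import Counter
from typing import Iterable, List

# Numbers (with , or . separators), words (optionally ending in "."), or one symbol.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+\.?|\S")


def value_structure(value: str) -> str:
    """Abstract a value into Number / Text placeholders plus literal symbols."""
    parts: List[str] = []
    for tok in TOKEN_RE.findall(value):
        if tok[0].isdigit():
            kind = "Number"
        elif tok[0].isalpha():
            kind = "Text"
        else:
            kind = tok  # structural element such as "(" or ","
        if kind == "Text" and parts and parts[-1] == "Text":
            continue  # merge runs of words into a single Text part
        parts.append(kind)
    return " ".join(parts)


def dominant_structure(values: Iterable[str]) -> str:
    """Return the structure shared by the largest number of values."""
    return Counter(value_structure(v) for v in values).most_common(1)[0][0]
```

For the slide's example, `value_structure("12,500 (2003)")` yields `Number ( Number )`.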

15 Method

16 Training Data Creation
Article Paragraph Filtering
Labeling with Similarity Measure
Labeling Value Parts

17 Article Paragraph Filtering
Many encyclopedia articles have long body text that is unrelated to the infobox
First filter article paragraphs

18 Labeling with Similarity Measure
Label occurrences of infobox attribute values in the article text
Matches need not be exact
Achieves an average occurrence rate of 26.0%, an increase of 23% compared with exact matching
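The slide does not say which similarity measure is used, so the following sketch swaps in `difflib.SequenceMatcher` from the Python standard library. It slides a token window over the sentence and returns the best-matching span above a threshold; the threshold of 0.8 and the function name are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def best_fuzzy_span(value: str, sentence: str, threshold: float = 0.8) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the span most similar to `value`."""
    tokens = sentence.split()
    n = len(value.split())
    best_score, best_span = 0.0, None
    for width in (n - 1, n, n + 1):  # allow the match to be a token shorter or longer
        if width < 1:
            continue
        for start in range(len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            score = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
            if score >= threshold and score > best_score:
                best_score, best_span = score, (start, start + width)
    return best_span
```

For example, `best_fuzzy_span("Bill Gates", "Microsoft was founded by bill gates and Paul Allen")` finds the lowercase mention that an exact string comparison would miss.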

19 Labeling Value Parts
All attribute values are divided into several parts according to the corresponding attribute value structure
Each part of the value structure is labeled separately
E.g.: the value of number of employees in infobox company is 54,400 (2008)
Sentence: “In 2008, the company had 54,400 employees”
On average, searching for value parts increases the rate of found occurrences from 26.0% to 33.9%; an improvement of 30.5%.
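The employee-count example can be traced with a small sketch that splits the infobox value into parts and labels each sentence token by the part it matches. The label names (`PART1`, `PART2`, `O`) and the exact-match rule per part are illustrative assumptions.

```python
import re
from typing import List, Tuple


def label_value_parts(value: str, sentence: str) -> List[Tuple[str, str]]:
    """Label each sentence token with the value part it equals, or 'O'."""
    # Split the value into number and word parts, ignoring structural symbols.
    parts = re.findall(r"\d+(?:,\d+)*|[^\W\d_]+", value)
    labeled = []
    for tok in sentence.split():
        core = tok.strip(".,;:()")  # ignore punctuation attached to the token
        label = "O"
        for idx, part in enumerate(parts, start=1):
            if core == part:
                label = f"PART{idx}"
                break
        labeled.append((tok, label))
    return labeled


labels = label_value_parts("54,400 (2008)", "In 2008, the company had 54,400 employees")
```

Both parts are found even though they appear in a different order and far apart in the sentence, which is why part-wise search raises the occurrence rate.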

20 Value Extractor Creation
(1) Only extractors with a precision of at least 0.75 are selected
(2) CRFsuite with L-BFGS as the feature weight estimation method

21 Attribute Value Extraction
Align value parts
Insert structural elements
Avoid meaningless values
  Optional tokens often have no meaning without related mandatory tokens
E.g.: “IBM’s key people are Sam Palmisano, who serves as CEO, and Mark Loughridge as SVP.”
-> Sam Palmisano (CEO), Mark Loughridge (SVP)
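The final assembly step can be illustrated with a sketch that takes aligned value parts, drops groups whose mandatory part is missing, and inserts the structural elements. The dictionary keys `name` (mandatory) and `role` (optional) are hypothetical names chosen for the key people example.

```python
from typing import Dict, List


def assemble_value(groups: List[Dict[str, str]]) -> str:
    """Render aligned parts as 'Name (Role), Name (Role)'."""
    rendered = []
    for group in groups:
        name = group.get("name")  # mandatory part
        role = group.get("role")  # optional part
        if not name:
            continue  # an optional part alone carries no meaning
        rendered.append(f"{name} ({role})" if role else name)
    return ", ".join(rendered)


value = assemble_value([
    {"name": "Sam Palmisano", "role": "CEO"},
    {"name": "Mark Loughridge", "role": "SVP"},
    {"role": "CFO"},  # dropped: no mandatory name
])
```

The parentheses and commas come from the attribute's value structure rather than from the sentence.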

22 Infobox Suggestion for Wikipedia Entities
Recommend an infobox template for Wikipedia entities that have no infobox

23 STEP 1: Select the training and test sets

24 STEP 2: Select features

25 STEP 3: Voting of the features
category-as-feature produces more accurate results on labeled articles
word-as-feature achieves better accuracy on unlabeled articles
Ensemble learning (jicheng xuexi): random forest
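The slide gives no detail on how the votes are combined, so the snippet below is only a generic weighted-majority sketch: each feature type proposes a template and the template with the highest total weight wins. The feature-type names, the weights and the alphabetical tie-break are assumptions.

```python
from collections import Counter
from typing import Dict, Optional


def vote(predictions: Dict[str, str], weights: Optional[Dict[str, float]] = None) -> str:
    """Combine per-feature-type template predictions by weighted majority."""
    weights = weights or {}
    tally: Counter = Counter()
    for feature_type, template in predictions.items():
        tally[template] += weights.get(feature_type, 1.0)
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(tally), key=tally.__getitem__)


# Hypothetical predictions from three feature types for one article.
predictions = {
    "category": "Infobox company",
    "word": "Infobox person",
    "link": "Infobox company",
}
```

Giving `word` a larger weight for unlabeled articles, for instance `vote(predictions, {"word": 3.0})`, would reflect the observation that word-as-feature is more accurate there.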

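26 My work
Differences between Chinese and English
(1) English has templates, Chinese does not
(2) English does not need word segmentation, Chinese does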

27 Value Extractor Creation
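(a) Parsing
Value    Sentence
北京     制片地区:北京
上海     产地:上海
北京     国家/地区:北京
上海     地区:上海
江苏     制片地区:江苏
广州     是在广州制片
(Values: 北京 Beijing, 上海 Shanghai, 江苏 Jiangsu, 广州 Guangzhou. Field names: 制片地区 production region, 产地 place of origin, 国家/地区 country/region, 地区 region; 是在…制片 "was produced in …".)

Patterns
制片地区:[value]
产地:[value]
国家/地区:[value]
地区:[value]
是在[value]制片

Pattern             Freq.
制片地区:[value]    100
产地:[value]        87
国家/地区:[value]   70
地区:[value]        50
是在[value]制片     3

(c) Verifying & filtering
Pattern             Prec.
制片地区:[value]    1
产地:[value]
国家/地区:[value]   0.99
地区:[value]        0.1

Co-training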
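The pattern pipeline on this slide (parse sentences, generalize them into patterns, count, then verify and filter) might look like the sketch below. The thresholds `min_freq` and `min_precision` are assumptions, and the precision of 产地:[value] is left out because it is missing from the slide transcript.

```python
from collections import Counter
from typing import Dict, List, Tuple


def induce_patterns(pairs: List[Tuple[str, str]]) -> Counter:
    """Replace the known value in each sentence by [value] and count the patterns."""
    return Counter(s.replace(v, "[value]") for v, s in pairs if v in s)


def filter_patterns(freq: Dict[str, int], precision: Dict[str, float],
                    min_freq: int = 5, min_precision: float = 0.9) -> List[str]:
    """Keep patterns that are frequent enough and have a measured, high precision."""
    return [p for p, f in freq.items()
            if f >= min_freq and p in precision and precision[p] >= min_precision]


pairs = [("北京", "制片地区:北京"), ("上海", "产地:上海"), ("北京", "国家/地区:北京"),
         ("上海", "地区:上海"), ("江苏", "制片地区:江苏"), ("广州", "是在广州制片")]
induced = induce_patterns(pairs)

# Frequencies and precisions as listed on the slide.
freq = {"制片地区:[value]": 100, "产地:[value]": 87, "国家/地区:[value]": 70,
        "地区:[value]": 50, "是在[value]制片": 3}
# No precision is given for 产地:[value] in the transcript, so it is omitted here.
precision = {"制片地区:[value]": 1.0, "国家/地区:[value]": 0.99, "地区:[value]": 0.1}
kept = filter_patterns(freq, precision)
```

A frequent but ambiguous pattern such as 地区:[value] is rejected by the precision check, while the rare 是在[value]制片 is rejected by the frequency check.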

28 Experiments

29 Experiments

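30 Baidu provides recommended templates
Basic statistics: 19 top-level categories, 34 subcategories, 371 attributes (278 after deduplication)
Coverage: ( /256498)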

31 Templates after filtering
34 subcategories, 315 attributes
Coverage: 0.835 (214136/256498)

