Extracting Infoboxes from Encyclopedia Websites. Presenter: Xu Bo (徐波).



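Background: Encyclopedia websites hold rich structured information, most importantly in infobox tables. This information can be converted into knowledge-base formats: (attribute, value) pairs or (subject, predicate, object) tuples.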
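
The conversion mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `infobox_to_triples` and the page title "ExampleCorp" are hypothetical, and the two attribute values are the ones quoted later on the Structure Analysis slide.

```python
def infobox_to_triples(subject, infobox):
    """Turn one page's (attribute, value) pairs into (subject, predicate, object) triples."""
    return [(subject, attribute, value) for attribute, value in infobox.items()]

# "ExampleCorp" is a hypothetical page title used only for illustration.
infobox = {
    "number of employees": "12,500 (2003)",
    "key people": "Samuel J. Palmisano (Chairman, President and CEO)",
}
triples = infobox_to_triples("ExampleCorp", infobox)
for triple in triples:
    print(triple)
```

The page title plays the role of the subject, and each infobox attribute becomes a predicate.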

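However, there are not enough high-quality pages.

Problem: How can we obtain more knowledge from encyclopedia websites?

Basic idea: Obtain more knowledge from the body text of encyclopedia pages. Using the correspondence between infobox tables and body text, apply machine learning to mimic how a human reads and extract more knowledge.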

Related papers:
F. Wu and D. S. Weld. Autonomously Semantifying Wikipedia. CIKM 2007.
D. Lange, C. Böhm, and F. Naumann. Extracting Structured Information from Wikipedia Articles to Populate Infoboxes. CIKM 2010.
Sultana et al. Infobox Suggestion for Wikipedia Entities. CIKM 2012.

Autonomously Semantifying Wikipedia 2019/2/19

Schema Refiner (template refiner): Schema refinement. Free editing -> schema drift.
Duplicate templates: U.S.County (1428), US County (574), Counties (50), County (19).
Duplicate attributes: "Census Yr", "Census Estimate Yr", "Census Est.", "Census Year".
Low attribute usage: threshold of >15% occurrences.
Notes: Yuzhi shiyan guadian
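
The two refinement steps on this slide, merging duplicates and dropping rarely used attributes, can be sketched as follows. This is a simplified illustration under stated assumptions: the alias table is written by hand here (the slide does not say how duplicates are detected), the attribute counts are made up, and the function names are hypothetical. Only the template counts and the 15% threshold come from the slide.

```python
from collections import Counter

def merge_duplicates(counts, aliases):
    """Fold the usage counts of alias names into their canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged

def prune_low_usage(attr_article_counts, total_articles, min_share=0.15):
    """Keep attributes that occur in more than min_share of the articles."""
    return {a for a, n in attr_article_counts.items() if n / total_articles > min_share}

# Template counts from the slide; the alias mapping is an assumption.
template_counts = {"U.S.County": 1428, "US County": 574, "Counties": 50, "County": 19}
aliases = {name: "U.S.County" for name in template_counts}
merged = merge_duplicates(template_counts, aliases)
total = merged["U.S.County"]

# Hypothetical per-attribute article counts, for illustration only.
attr_counts = {"Census Year": 900, "mayor": 100}
kept = prune_low_usage(attr_counts, total)
print(total, kept)
```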

Training Dataset Construction

Classifier
Document classifiers (one per article type): list and category membership as features; Naïve Bayes, Maximum Entropy, or SVM classifier. Alternative: a fast heuristic approach, with precision 98.5% (with no learning!) and recall 68.8%.
Sentence classifier (one per article type x attribute): a multi-class, multi-label text classification problem; trained on preprocessor output; features: bag of words, POS tags.

Extractor: Conditional Random Fields model [Lafferty 01]. Attribute value extraction is treated as sequential data labeling. A CRF model is trained for each attribute independently. Why is it a good fit?

Features

Extracting Structured Information from Wikipedia Articles to Populate Infoboxes (CIKM 2010): Given a Wikipedia article containing an incomplete infobox template call, the Infobox Population Problem is to extract as many correct attribute values from the article text as possible.

Structure Analysis: Many attributes have a characteristic structure.
number of employees: 12,500 (2003) -> (Number '(' Number ')')
key people: Samuel J. Palmisano (Chairman, President and CEO)
Multiple values: Bill Gates, Paul Allen for the founder attribute.
Goal: discover a structure that represents most of these values, simple but powerful enough to split values and to combine value parts.
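
One way to picture "a structure that represents most of these values" is to map each value to a coarse token pattern and take the most frequent pattern per attribute. The sketch below is an assumption-laden simplification, not the algorithm from the paper: the token classes (Number, Text, parentheses) and the names `value_structure` and `dominant_structure` are hypothetical, and the value "300" is invented for contrast.

```python
import re
from collections import Counter

# Numbers (with thousands separators), single parentheses, or any other run of characters.
TOKEN = re.compile(r"\d[\d,]*\d|\d|[()]|[^\s()]+")

def value_structure(value):
    """Map an attribute value to a coarse pattern such as "Number '(' Number ')'"."""
    parts = []
    for tok in TOKEN.findall(value):
        if tok[0].isdigit():
            kind = "Number"
        elif tok in ("(", ")"):
            kind = f"'{tok}'"
        else:
            kind = "Text"
        if kind == "Text" and parts and parts[-1] == "Text":
            continue  # collapse consecutive words into one Text part
        parts.append(kind)
    return " ".join(parts)

def dominant_structure(values):
    """Return the structure shared by the largest number of values."""
    return Counter(value_structure(v) for v in values).most_common(1)[0][0]

print(value_structure("12,500 (2003)"))
print(value_structure("Samuel J. Palmisano (Chairman, President and CEO)"))
print(dominant_structure(["12,500 (2003)", "54,400 (2008)", "300"]))
```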

Method

Training Data Creation: (1) Article paragraph filtering. (2) Labeling with a similarity measure. (3) Labeling value parts.

Article Paragraph Filtering: Many encyclopedia pages have long body text that is unrelated to the infobox, so article paragraphs are filtered first.

Labeling with Similarity Measure: Label occurrences of infobox attribute values in the article text. Matches need not be exact. This achieves an average occurrence rate of 26.0%, an increase of 23% compared with exact matching.
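
To illustrate non-exact labeling, the sketch below slides a token window over a sentence and keeps the window most similar to the infobox value. The slide does not name the similarity measure, so this uses Python's `difflib.SequenceMatcher` as a stand-in; the 0.75 threshold, the function name, and the example sentences are assumptions.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(value, sentence, threshold=0.75):
    """Return the token window in `sentence` most similar to `value`, or None."""
    tokens = sentence.split()
    n = len(value.split())
    best_score, best_window = 0.0, None
    for size in (n - 1, n, n + 1):  # allow the mention to be slightly shorter or longer
        if size < 1:
            continue
        for i in range(len(tokens) - size + 1):
            window = " ".join(tokens[i:i + size])
            score = SequenceMatcher(None, value.lower(), window.lower()).ratio()
            if score > best_score:
                best_score, best_window = score, window
    return best_window if best_score >= threshold else None

value = "Samuel J. Palmisano"
print(value in "Sam Palmisano serves as CEO")                   # exact match fails
print(best_fuzzy_match(value, "Sam Palmisano serves as CEO"))   # fuzzy match succeeds
print(best_fuzzy_match(value, "The company sells software"))
```

An exact substring test misses the abbreviated name, while the similarity-based search still finds a labelable occurrence.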

Labeling Value Parts: All attribute values are divided into several parts according to the corresponding attribute value structure, and each part of the value structure is labeled separately.
Example: the value of number of employees in infobox company is 54,400 (2008). Sentence: "In 2008, the company had 54,400 employees."
On average, searching for value parts increases the rate of found occurrences from 26.0% to 33.9%, an improvement of 30.5%.
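
The employee-count example can be traced in code. This is a minimal sketch assuming a very simple splitter (numbers and words) and whitespace tokenization; the label names PART1 and PART2 and both function names are hypothetical.

```python
import re

def split_value_parts(value):
    """Split a value into its number and word parts, dropping structural punctuation."""
    return re.findall(r"\d[\d,]*\d|\d|[^\W\d_]+", value)

def label_parts(sentence, parts):
    """Tag each token with the index of the value part it matches, or "O"."""
    labels = []
    for tok in sentence.split():
        core = tok.strip(",.;()")  # ignore punctuation attached to the token
        tag = f"PART{parts.index(core) + 1}" if core in parts else "O"
        labels.append((tok, tag))
    return labels

value = "54,400 (2008)"
sentence = "In 2008, the company had 54,400 employees"
parts = split_value_parts(value)
labels = label_parts(sentence, parts)
print(value in sentence)  # the whole value never occurs verbatim
print(parts)
print(labels)
```

The full value string does not appear in the sentence, but both of its parts do, which is why part-level labeling finds more occurrences.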

Value Extractor Creation: (1) Only extractors with a precision of at least 0.75 are selected. (2) CRFsuite is used, with L-BFGS as the feature weight estimation method.

Attribute Value Extraction: Align value parts. Insert structural elements. Avoid meaningless values: optional tokens often have no meaning without the related mandatory tokens.
Example: "IBM's key people are Sam Palmisano, who serves as CEO, and Mark Loughridge as SVP." -> Sam Palmisano (CEO), Mark Loughridge (SVP)
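
The final assembly step can be sketched as below, assuming the extractor has already produced aligned (mandatory, optional) part pairs. The pair representation, the function name, and the dangling "CFO" entry are assumptions made for illustration; only the IBM example comes from the slide.

```python
def assemble_value(aligned_parts):
    """Render aligned (mandatory, optional) parts using the 'Text (Text)' structure."""
    rendered = []
    for mandatory, optional in aligned_parts:
        if not mandatory:
            continue  # an optional part without its mandatory part is meaningless
        rendered.append(f"{mandatory} ({optional})" if optional else mandatory)
    return ", ".join(rendered)

aligned = [
    ("Sam Palmisano", "CEO"),
    ("Mark Loughridge", "SVP"),
    (None, "CFO"),  # hypothetical dangling role with no person attached
]
result = assemble_value(aligned)
print(result)
```

The parentheses and commas are the structural elements inserted around the extracted parts, and the dangling role is dropped rather than emitted on its own.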

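Infobox Suggestion for Wikipedia Entities: Recommend an infobox template for Wikipedia entities that have no infobox.

STEP 1: Select the training and test sets

STEP 2: Select features

STEP 3: Voting of the features. Category-as-feature produces more accurate results on labeled articles; word-as-feature achieves better accuracy on unlabeled articles. Notes: ensemble learning (Jicheng xuexi), random forest.

My work: Differences between Chinese and English: (1) English has templates, Chinese does not. (2) English needs no word segmentation, Chinese does.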

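Value Extractor Creation (labels on the slide: (a) parsing, co-training, (c) verifying and filtering)

Sentences (value, sentence):
北京  制片地区:北京
上海  产地:上海
北京  国家/地区:北京
上海  地区:上海
江苏  制片地区:江苏
广州  是在广州制片

Patterns and frequencies:
制片地区:[value]  100
产地:[value]  87
国家/地区:[value]  70
地区:[value]  50
是在[value]制片  3

(c) Verifying and filtering, pattern precision:
制片地区:[value]  1
产地:[value]
国家/地区:[value]  0.99
地区:[value]  0.1

Glosses: 制片地区 = production region, 产地 = place of origin, 国家/地区 = country/region, 地区 = region, 是在[value]制片 = "was produced in [value]"; 北京, 上海, 江苏, 广州 = Beijing, Shanghai, Jiangsu, Guangzhou.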
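
The pattern pipeline on this slide (turn sentences into patterns, count them, then verify precision and filter) can be sketched as follows. This is an illustrative reading of the slide, not the presenter's implementation: the function names, the tiny data sets, and the verification pair with 亚洲 are assumptions, and the co-training step is not modeled.

```python
from collections import Counter

def induce_patterns(labeled_sentences):
    """Replace the known value in each sentence with a [value] slot and count the patterns."""
    counts = Counter()
    for sentence, value in labeled_sentences:
        if value in sentence:
            counts[sentence.replace(value, "[value]", 1)] += 1
    return counts

def apply_pattern(pattern, sentence):
    """Extract the slot filler if the sentence fits the pattern, else None."""
    prefix, suffix = pattern.split("[value]")
    fits = sentence.startswith(prefix) and sentence.endswith(suffix)
    if fits and len(sentence) > len(prefix) + len(suffix):
        return sentence[len(prefix):len(sentence) - len(suffix)]
    return None

def precision(pattern, verification_set):
    """Share of extractions that equal the known value, over sentences the pattern fits."""
    extracted = [(apply_pattern(pattern, s), gold) for s, gold in verification_set]
    hits = [value == gold for value, gold in extracted if value is not None]
    return sum(hits) / len(hits) if hits else 0.0

training = [("制片地区:北京", "北京"), ("制片地区:江苏", "江苏"),
            ("产地:上海", "上海"), ("是在广州制片", "广州")]
counts = induce_patterns(training)

# Hypothetical verification pairs: the second shows "地区" filled by a non-matching value.
verification = [("地区:上海", "上海"), ("地区:亚洲", "北京")]
print(counts.most_common())
print(apply_pattern("是在[value]制片", "是在广州制片"))
print(precision("地区:[value]", verification))
```

A generic pattern such as 地区:[value] fits many sentences but often extracts the wrong thing, which is consistent with the low precision the slide reports for it.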

Experiments

Experiments

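Baidu provides recommended templates. Basic statistics: 19 major categories, 34 subcategories, 371 attributes (278 after deduplication). Coverage: 0.99869 (256163/256498).

Templates after filtering: 34 subcategories, 315 attributes. Coverage: 0.835 (214136/256498).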