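A Survey of the State of Research on Attributes and Their Extraction Techniques. Presenter: 彭德家 (Peng Dejia). Advisor: 瞿裕忠 (Qu Yuzhong).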



Papers: (1) Biperpedia: An Ontology for Search Applications; (2) Attribute Extraction and Scoring: A Probabilistic Approach; (3) ReNoun: Fact Extraction for Nominal Attributes. So far I have found only these three closely related papers.

1 Biperpedia: An Ontology for Search Applications

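Introduction: queries answered with structured data; lack of attributes; synonyms and text patterns. Search engines want to recognize queries and answer them with structured data, so they maintain many high-quality databases. These databases cover entities broadly but hold comparatively few attributes: Freebase, for example, has only about 200 attributes for countries, while thousands are actually of interest. Biperpedia is an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names.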

Example

Biperpedia Extraction

Extraction from web text. Topics to look into: distant supervision and pattern induction.

Attribute classification: Biperpedia categorizes each attribute as numeric (e.g. COFFEE PRODUCTION), atomic-but-textual (e.g. POLICE-CHIEF), non-atomic (e.g. HISTORY), or none of the above.

Text patterns: about 2,500 patterns were extracted. The top 200 cover 99%, and no single pattern covers more than 6%.

Result

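Inspiration: extraction from web text; attribute classification; text patterns. The takeaways for my own experiment are mainly these three points: how to extract from text, how the paper defines attributes, and how it uses text patterns. For a numeric attribute, the definition should include at least a name, range, domain, measurement units, and text templates. Attributes are classified as numeric, textual, or non-atomic. The extractions from the query stream and Freebase are used to train a learner that serves as a high-quality text extractor.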
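The attribute definition listed on the slide above (name, range, domain, measurement units, text templates) can be sketched as a small record type. This is only an illustration: the class name `NumericAttribute`, its field names, the `{entity}`/`{attr}` template slots, and the sample values are all assumptions, not taken from Biperpedia.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NumericAttribute:
    """Hypothetical schema for a numeric attribute (fields follow the slide)."""
    name: str                          # attribute name, e.g. "population"
    domain: str                        # class it applies to, e.g. "country"
    value_range: Tuple[float, float]   # plausible (min, max) values
    units: List[str] = field(default_factory=list)      # measurement units
    templates: List[str] = field(default_factory=list)  # text templates

    def accepts(self, value: float) -> bool:
        """Return True if the value lies inside the declared range."""
        low, high = self.value_range
        return low <= value <= high

    def render(self, entity: str) -> List[str]:
        """Fill the {entity} and {attr} slots of every template."""
        return [t.format(entity=entity, attr=self.name) for t in self.templates]


# Illustrative values only.
population = NumericAttribute(
    name="population",
    domain="country",
    value_range=(0, 2e10),
    units=["people", "million"],
    templates=["the {attr} of {entity} is", "{entity} has a {attr} of"],
)
```

Keeping the templates next to the range and units means a candidate value pulled from text can be checked against the same record that produced the match.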

2 Attribute Extraction and Scoring: A Probabilistic Approach

Introduction. Background: knowledge about attributes (of concepts or entities) plays a critical role in inference. Data sources: web documents, search logs, and existing knowledge bases. Work: methods to derive attributes and to quantify the typicality of attributes with regard to their corresponding concepts.

Example: the end goal is to extract (concept, attribute) pairs and to compute their typicality. P(c|a) denotes how typical concept c is, given attribute a. P(a|c) denotes how typical attribute a is, given concept c.
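To make the two conditional probabilities concrete, here is a minimal sketch that estimates them from raw co-occurrence counts of (concept, attribute) pairs. This plain relative-frequency estimate is an assumption chosen for illustration; the paper's probabilistic model is more elaborate. The function name `typicality` and the toy observations are hypothetical.

```python
from collections import Counter


def typicality(pairs):
    """Estimate P(a|c) and P(c|a) by relative frequency (illustrative only)."""
    pair_counts = Counter(pairs)
    concept_counts = Counter(c for c, _ in pairs)
    attr_counts = Counter(a for _, a in pairs)
    # P(a|c): share of concept c's observations that mention attribute a.
    p_a_given_c = {(c, a): n / concept_counts[c]
                   for (c, a), n in pair_counts.items()}
    # P(c|a): share of attribute a's observations that belong to concept c.
    p_c_given_a = {(c, a): n / attr_counts[a]
                   for (c, a), n in pair_counts.items()}
    return p_a_given_c, p_c_given_a


# Toy observations of (concept, attribute) pairs.
observations = [
    ("country", "population"), ("country", "population"),
    ("country", "capital"), ("city", "population"),
]
p_a_given_c, p_c_given_a = typicality(observations)
```

In this toy data "capital" is seen only with "country", so P(country|capital) is 1, while "population" is shared between "country" and "city".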

Extraction of attributes: there are two approaches, the concept-based approach and the instance-based approach. Both are applied to the web data; for the search log and the structured data, only instance-based attributes are available. A probabilistic knowledge base called Probase guides the attribute extraction.

Two methods. Concept-based (CB) extraction: attributes can be mechanically bound to a concept. Given "the population of a state", a machine can naturally bind the attribute population to the concept state. Instance-based (IB) extraction: IB patterns may lead to the harvesting of higher-quality attributes, e.g. "the population of Washington".
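The difference between the two methods can be sketched with two regular expressions. This is a simplified illustration, not the paper's pattern set: the regexes, the `INSTANCE_TO_CONCEPT` table (standing in for a knowledge base such as Probase), and the function name `extract_pairs` are all assumptions.

```python
import re

# Toy instance -> concept lookup; stands in for a real knowledge base.
INSTANCE_TO_CONCEPT = {"Washington": "state", "China": "country"}

# CB: "the <attribute> of a/an/the <concept>" (concept is a common noun).
CB_PATTERN = re.compile(r"\bthe (\w+) of (?:a|an|the) ([a-z]\w*)")
# IB: "the <attribute> of <Instance>" (instance is a capitalized name).
IB_PATTERN = re.compile(r"\bthe (\w+) of ([A-Z]\w*)")


def extract_pairs(text):
    """Return (concept, attribute, method) triples found in the text."""
    pairs = []
    for attr, concept in CB_PATTERN.findall(text):
        # CB: the attribute binds directly to the concept in the text.
        pairs.append((concept, attr, "CB"))
    for attr, instance in IB_PATTERN.findall(text):
        # IB: the instance must first be mapped to its concept.
        concept = INSTANCE_TO_CONCEPT.get(instance)
        if concept is not None:
            pairs.append((concept, attr, "IB"))
    return pairs
```

Both example phrases from the slide yield the pair (state, population), but the IB route needs the extra instance lookup before the attribute can be attached to a concept.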

Comparison

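Pattern extraction filtering: for the second kind, build a blacklist to remove those words. For the third kind, noun phrases containing "of": (1) filter by capitalization, e.g. "the Bank of China", "the People's Republic of China"; (2) use the knowledge base to determine whether the phrase is an instance, e.g. "the University of Chicago".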
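The filtering rules above can be sketched as a single predicate. The blacklist entries, the `KNOWN_INSTANCES` set (standing in for a knowledge-base lookup), and the function name `keep_candidate` are hypothetical; the slide does not list the actual blacklist.

```python
# Hypothetical blacklist of words that look like attributes but are not.
BLACKLIST = {"lot", "number", "kind", "rest"}

# Stand-in for a knowledge-base lookup of known instance names.
KNOWN_INSTANCES = {"the university of chicago"}


def keep_candidate(phrase, attribute):
    """Return True if 'the <attribute> of ...' survives the filters."""
    # Blacklist filter: drop words that are never real attributes.
    if attribute.lower() in BLACKLIST:
        return False
    # Rule (1): a capitalized head suggests a proper name such as
    # "the Bank of China", not an attribute of China.
    if attribute[:1].isupper():
        return False
    # Rule (2): the whole phrase is itself an instance in the knowledge base.
    if phrase.lower() in KNOWN_INSTANCES:
        return False
    return True
```

Rule (2) catches names that rule (1) misses when the text is lowercased or inconsistently capitalized.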

Typicality scoring: quantify P(a|c) for attribute-concept pairs. Example: a dog is a typical instance of pet, as it is frequently mentioned as a pet and it shares some resemblance to other pet instances.

Typicality scoring: computing typicality from a CB list; computing typicality from an IB list; typicality score aggregation. There is also a disambiguation step in between, which I did not fully understand.
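One plausible reading of the last two steps is sketched below; it is an assumption, not something the slides confirm. The IB score marginalizes over instances, P(a|c) ≈ Σ_i P(a|i)·P(i|c), and the aggregation is a plain weighted average of the CB and IB scores. The disambiguation step is left out, and all names and numbers are illustrative.

```python
def typicality_from_ib(p_attr_given_instance, p_instance_given_concept):
    """Sum P(a|i) * P(i|c) over the instances i of one concept c."""
    scores = {}
    for instance, p_i in p_instance_given_concept.items():
        for attr, p_a in p_attr_given_instance.get(instance, {}).items():
            scores[attr] = scores.get(attr, 0.0) + p_a * p_i
    return scores


def aggregate(score_lists, weights):
    """Weighted average of several attribute -> score dictionaries."""
    total = sum(weights)
    merged = {}
    for scores, weight in zip(score_lists, weights):
        for attr, score in scores.items():
            merged[attr] = merged.get(attr, 0.0) + (weight / total) * score
    return merged


# Toy numbers for the concept "pet".
p_i_given_pet = {"dog": 0.6, "cat": 0.4}
p_a_given_i = {
    "dog": {"breed": 0.5, "age": 0.5},
    "cat": {"breed": 0.25, "age": 0.75},
}
ib_scores = typicality_from_ib(p_a_given_i, p_i_given_pet)
cb_scores = {"breed": 0.3, "age": 0.7}
final_scores = aggregate([cb_scores, ib_scores], weights=[1.0, 1.0])
```

Equal weights are used here only for simplicity; in practice the weights would reflect how much each data source is trusted.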

Evaluation measures and baselines: the unified typicality model (UTM) is compared with two baselines, Marius Pasca's two methods [13]: MWD, which uses Web documents, and MQL, which uses query logs.

Examples

Example

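Inspiration: pattern extraction filtering; evaluation measures and baselines; two methods, concept-based and instance-based. The takeaways are to filter the extracted patterns, to decide how to evaluate the extracted templates, and to approach text material from both the concept-based and the instance-based angle.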

3 Fact Extraction for Nominal Attributes

Introduction. Background: the construction of these knowledge bases is largely manual and does not scale to the long and heavy tail of facts. ReNoun: an open information extraction system that complements previous efforts by focusing on nominal attributes and on the long tail. It extracts triples of the form (S, A, O), where S is the subject, A is the attribute, and O is the object, and it generalizes from a seed set to produce a much larger set of extractions that are then scored. To target the long tail, attributes are ranked by their number of occurrences in a new corpus: the first 218 form the fat head, and the following 60K attributes form the long tail. The focus is on nominal attributes, that is, extracting facts for attributes expressed as noun phrases.
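The fat-head versus long-tail split described above amounts to ranking attributes by corpus frequency and cutting at a fixed rank. A minimal sketch follows; the function name `split_head_tail` and the toy mentions are assumptions, and the cut-off is 2 here rather than the 218 reported on the slide.

```python
from collections import Counter


def split_head_tail(attribute_mentions, head_size):
    """Rank attributes by frequency and split into (fat head, long tail)."""
    ranked = [attr for attr, _ in Counter(attribute_mentions).most_common()]
    return ranked[:head_size], ranked[head_size:]


# Toy corpus mentions of attributes.
mentions = ["population"] * 5 + ["capital"] * 3 + ["police chief"] * 1
head, tail = split_head_tail(mentions, head_size=2)
```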

Example

Four stages: (1) seed fact extraction; (2) extraction pattern generation; (3) candidate generation; (4) scoring.

Seed fact extraction: apply an extraction rule to generate a triple (S, A, O), requiring that (1) A is an attribute in our ontology, and (2) the value of A and the object O corefer to the same real-world

Pattern and candidate fact generation: use the seed facts to learn patterns over dependency parses of text sentences. Generating dependency patterns.

Example

Applying the dependency patterns: each match of a pattern against the corpus indicates the heads of the potential subject, attribute, and object. A triple (S, A, O) is constructed from the attribute and the Freebase entities to which the tokens corresponding to the S and O nodes in the pattern are resolved.

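Scoring extracted facts: based on a pattern's frequency and coherence. Frequency is the number of facts the pattern extracts; the larger the number, the better the pattern. Coherence is how related the facts extracted by the same pattern are. For example, a pattern that extracts ex-wife, boyfriend, and ex-partner is coherent, whereas one that extracts ex-wife, general manager, and subsidiary is not.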
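The two pattern-level signals can be sketched as follows. Frequency is simply the number of extracted facts. Coherence is approximated here as the average pairwise cosine similarity between vectors for the extracted attributes, which is an assumption about how relatedness might be measured. The two-dimensional `VECTORS` are hand-made toy values, not real embeddings, and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pattern_frequency(facts):
    """Number of (S, A, O) facts the pattern extracted."""
    return len(facts)


def pattern_coherence(facts, vectors):
    """Average pairwise similarity of the distinct attributes extracted."""
    attrs = sorted({attr for _, attr, _ in facts})
    pairs = list(combinations(attrs, 2))
    if not pairs:
        return 1.0  # a single attribute is trivially coherent
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy vectors: first axis ~ "personal relation", second ~ "business".
VECTORS = {
    "ex-wife": (1.0, 0.0), "boyfriend": (0.9, 0.1), "ex-partner": (0.95, 0.05),
    "general manager": (0.1, 0.9), "subsidiary": (0.0, 1.0),
}
coherent = [("s1", "ex-wife", "o1"), ("s2", "boyfriend", "o2"),
            ("s3", "ex-partner", "o3")]
mixed = [("s1", "ex-wife", "o1"), ("s2", "general manager", "o2"),
         ("s3", "subsidiary", "o3")]
```

With these toy vectors the first pattern scores close to 1 and the second much lower, matching the contrast drawn on the slide.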

Experimental evaluation: ReNoun is capable of generating a large number of high-quality facts (≥70% precision at 1M), which the scoring method successfully surfaces to the top.

Inspiration: the experimental procedure works like boosting. The workflow of my own experiment should be fairly similar to the overall pipeline of this paper.

References
[1] Rahul Gupta, Alon Y. Halevy, Xuezhi Wang, Steven Euijong Whang, Fei Wu: Biperpedia: An Ontology for Search Applications. PVLDB 7(7): 505-516 (2014)
[2] Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang: Attribute Extraction and Scoring: A Probabilistic Approach. ICDE 2013
[3] Mohamed Yahya, Steven Euijong Whang, Rahul Gupta, Alon Halevy: ReNoun: Fact Extraction for Nominal Attributes. EMNLP 2014

Q&A Thanks