A Survey of Research on Attributes and Attribute Extraction Techniques. Presenter: 彭德家 Advisor: 瞿裕忠
Papers: Biperpedia: An Ontology for Search Applications; Attribute Extraction and Scoring: A Probabilistic Approach; Fact Extraction for Nominal Attributes. So far I have found only three closely related papers.
1 Biperpedia: An Ontology for Search Applications
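Introduction: Answering queries with structured data. Lack of attributes. Synonyms and text patterns. Search engines want to recognize queries and respond with structured data, so they maintain many high-quality databases. These databases cover entities broadly but contain relatively few attributes; for example, Freebase has only 200 attributes for countries, while thousands are actually of interest. Biperpedia is an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names.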
Example
Biperpedia Extraction
Extraction from web text. Background to review: distant supervision and pattern induction.
Attribute classification Biperpedia categorizes each attribute as numeric (e.g. COFFEE PRODUCTION), atomic-but-textual (e.g. POLICE-CHIEF), non-atomic (e.g. HISTORY), or none of the above.
Text Pattern: 2,500 patterns were extracted. The top 200 cover 99%, and no single pattern covers more than 6%.
Result
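Inspiration: Extraction from web text. Attribute classification. Text Pattern. These three points are the main takeaways for my experiment: how to extract from text, and how the paper defines attributes. The definition of a numeric attribute should include at least a name, range, domain, measurement units, and text patterns. Attributes are classified as numeric/textual/non-atomic. Use the extractions from the query stream and Freebase to train a high-quality text extractor.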
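The attribute definition listed above (name, range, domain, measurement units, text patterns) could be captured in a small record type. The sketch below is only one possible layout: the class names, the field names, the reading of "range" as numeric bounds, and the sample values are all assumptions made for illustration, not something taken from Biperpedia.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class AttributeKind(Enum):
    # The four categories named on the "Attribute classification" slide.
    NUMERIC = "numeric"
    ATOMIC_TEXTUAL = "atomic-but-textual"
    NON_ATOMIC = "non-atomic"
    OTHER = "none of the above"


@dataclass
class AttributeDef:
    """Hypothetical record for one attribute definition."""
    name: str                     # e.g. "coffee production"
    domain: str                   # class the attribute applies to, e.g. "country"
    kind: AttributeKind
    # Assumption: "range" is read here as numeric lower/upper bounds.
    value_range: Optional[Tuple[float, float]] = None
    units: List[str] = field(default_factory=list)
    text_patterns: List[str] = field(default_factory=list)

    def accepts(self, value: float, unit: str) -> bool:
        """Check a candidate numeric value against the declared units and range."""
        if self.kind is not AttributeKind.NUMERIC:
            return False
        if self.units and unit not in self.units:
            return False
        if self.value_range is None:
            return True
        low, high = self.value_range
        return low <= value <= high


# Illustrative values only; the bounds, units and pattern are made up.
coffee = AttributeDef(
    name="coffee production",
    domain="country",
    kind=AttributeKind.NUMERIC,
    value_range=(0.0, 1e8),
    units=["tonnes", "bags"],
    text_patterns=["the {attribute} of {entity} is {value} {unit}"],
)
```

A candidate extraction can then be screened with `coffee.accepts(1000.0, "tonnes")` before it is stored.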
2 Attribute Extraction and Scoring: A Probabilistic Approach
Introduction. Background: knowledge about attributes (of concepts or entities) plays a critical role in inference. Data sources: web documents, search logs, and existing knowledge bases. Work: methods to derive attributes, and to quantify the typicality of the attributes with regard to their corresponding concepts.
Example: The end goal is to extract (concept, attribute) pairs and to compute their typicality. P(c|a) denotes how typical concept c is, given attribute a. P(a|c) denotes how typical attribute a is, given concept c.
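To make the two typicality scores concrete, here is a minimal sketch that estimates P(a|c) and P(c|a) as plain relative frequencies over observed (concept, attribute) pairs. This is a simplification assumed for illustration; the paper's actual probabilistic model is more involved, and the function name and toy observations are hypothetical.

```python
from collections import Counter, defaultdict


def typicality(pairs):
    """Estimate P(a|c) and P(c|a) from (concept, attribute) observations.

    Assumption: simple relative frequencies, not the paper's full model.
    """
    pair_counts = Counter(pairs)
    concept_totals = Counter()
    attribute_totals = Counter()
    for (concept, attribute), n in pair_counts.items():
        concept_totals[concept] += n
        attribute_totals[attribute] += n

    p_a_given_c = defaultdict(dict)
    p_c_given_a = defaultdict(dict)
    for (concept, attribute), n in pair_counts.items():
        p_a_given_c[concept][attribute] = n / concept_totals[concept]
        p_c_given_a[attribute][concept] = n / attribute_totals[attribute]
    return p_a_given_c, p_c_given_a


# Toy observations, made up for the example.
observations = [
    ("country", "population"),
    ("country", "population"),
    ("country", "capital"),
    ("city", "population"),
]
p_a_c, p_c_a = typicality(observations)
```

With these toy counts, "population" accounts for two of the three observations of "country", so `p_a_c["country"]["population"]` is 2/3.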
Extraction of attributes: There are two approaches, the concept-based approach and the instance-based approach. Both methods are applied to the web data; for the search log and structured data, only instance-based attributes are available. A probabilistic knowledge base called Probase is used to guide attribute extraction.
Two methods. Concept-based (CB) extraction: attributes can be mechanically bound to a concept; given “the population of a state”, a machine can naturally bind the attribute population to the concept state. Instance-based (IB) extraction: IB patterns may lead to the harvesting of higher-quality attributes, e.g. “the population of Washington”.
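The difference between the two methods can be sketched with a single "the A of X" pattern: if X is a known concept the match is concept-based, and if X is a known instance the attribute is bound to that instance's concept. The regular expression and the tiny lookup tables below are assumptions standing in for the paper's patterns and for Probase.

```python
import re

# Toy stand-ins for a knowledge base such as Probase (hypothetical data).
CONCEPTS = {"state", "country", "company"}
INSTANCE_TO_CONCEPT = {"washington": "state", "china": "country"}

# "the <attribute> of [a|an|the] <head>", with single-word attribute and head.
PATTERN = re.compile(r"\bthe (\w+) of (?:(?:a|an|the) )?(\w+)", re.IGNORECASE)


def extract(sentence):
    """Return (method, concept, attribute) triples found in the sentence."""
    results = []
    for attribute, head in PATTERN.findall(sentence):
        key = head.lower()
        if key in CONCEPTS:
            # Concept-based: the attribute binds directly to the concept.
            results.append(("CB", key, attribute.lower()))
        elif key in INSTANCE_TO_CONCEPT:
            # Instance-based: bind via the instance's concept.
            results.append(("IB", INSTANCE_TO_CONCEPT[key], attribute.lower()))
    return results
```

For example, `extract("the population of a state")` yields a CB triple, while `extract("the population of Washington")` yields an IB triple for the same concept.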
Comparison
Pattern Extraction Filtering: For the second case, build a blacklist to remove those words. For the third case, noun phrases containing "of": (1) use capitalization, e.g. “the Bank of China”, “the People’s Republic of China”; (2) use the knowledge base to check whether the phrase is an instance, e.g. “the University of Chicago”.
Typicality scoring: quantify P(a|c) for attribute-concept pairs. Example: a dog is a typical instance of pet, as it is frequently mentioned as a pet and shares some resemblance to other pet instances.
Typicality scoring: Computing Typicality from a CB List. Computing Typicality from an IB List. Typicality Score Aggregation. There is also a disambiguation step in between, which I did not fully understand.
Evaluation Measures and Baseline: The unified typicality model (UTM) is compared with two baselines, Marius Pasca’s two methods [13]: MWD, using Web documents, and MQL, using query logs.
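The filtering rules above can be sketched as a single predicate over a candidate phrase and its attribute word. The blacklist entries, the knowledge-base contents, and the function name are hypothetical; the slide does not list the actual blacklisted words.

```python
# Hypothetical blacklist; the actual words are not given on the slide.
BLACKLIST = {"kind", "type", "lot", "number"}

# Toy stand-in for a knowledge-base instance lookup (lower-cased phrases).
KNOWN_INSTANCES = {"the university of chicago", "the bank of china"}


def keep_candidate(phrase: str, attribute: str) -> bool:
    """Return True if 'the <attribute> of ...' looks like a real attribute."""
    # Blacklist rule: drop generic words that are not attributes.
    if attribute.lower() in BLACKLIST:
        return False
    # Rule (1): a capitalized head suggests a proper name, e.g. "Bank".
    if attribute[:1].isupper():
        return False
    # Rule (2): the whole phrase is a known instance in the knowledge base.
    if phrase.lower() in KNOWN_INSTANCES:
        return False
    return True
```

Rule (2) catches cases that capitalization alone misses, such as a lower-cased "the university of chicago" in noisy text.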
Examples
Example
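Inspiration: Pattern Extraction Filtering. Evaluation Measures and Baseline. Two methods: concept-based and instance-based. The takeaways are how to filter extracted patterns and how to evaluate the extracted patterns; for text material, extraction can be approached from both the concept side and the instance side.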
3 Fact Extraction for Nominal Attributes
Introduction. Background: the construction of these knowledge bases is largely manual and does not scale to the long and heavy tail of facts. ReNoun: an open information extraction system that complements previous efforts by focusing on nominal attributes and on the long tail. It extracts triples of the form (S, A, O), where S is the subject, A is the attribute, and O is the object, and it generalizes from a seed set to produce a much larger set of extractions that are then scored. For the long tail, attributes are ranked by their number of occurrences in a new corpus: the first 218 are the fat head, and the remaining 60K attributes are the long tail. The focus is on nominal attributes, i.e. extracting facts for attributes expressed as noun phrases.
Example
Four stages Seed fact extraction Extraction pattern generation Candidate generation Scoring
Seed fact extraction: apply an extraction rule to generate a triple (S, A, O), requiring that (1) A is an attribute in our ontology, and (2) the value of A and the object O corefer to the same real-world entity.
Pattern and candidate fact generation use the seed facts to learn patterns over dependency parses of text sentences. Generating dependency patterns
Example
Applying the dependency patterns Each match of a pattern against the corpus will indicate the heads of the potential subject, attribute and object. a triple (S, A, O) is constructed from the attribute and the Freebase entities to which the tokens corresponding to the S and O nodes in the pattern are resolved.
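Scoring extracted facts: based on a pattern's frequency and coherence. Frequency is the number of facts the pattern extracts; the larger the number, the better the pattern. Coherence is how related the facts extracted by the same pattern are. For example, a pattern that extracts ex-wife, boyfriend, and ex-partner is coherent, whereas one that extracts ex-wife, general manager, and subsidiary is not.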
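The two pattern-level signals can be sketched as follows. Frequency is taken as the number of distinct facts a pattern produced, and coherence as the average pairwise cosine similarity between vectors for the attributes it extracted. The two-dimensional vectors are made up for the example, and averaging pairwise cosine similarity is an assumption about how relatedness might be measured, not necessarily the paper's exact formula.

```python
import math
from itertools import combinations

# Hypothetical 2-d attribute vectors, invented for this example.
EMBEDDINGS = {
    "ex-wife": (0.9, 0.1),
    "boyfriend": (0.8, 0.2),
    "ex-partner": (0.85, 0.15),
    "general manager": (0.1, 0.9),
    "subsidiary": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def frequency(facts):
    """Number of distinct (S, A, O) facts extracted by one pattern."""
    return len(set(facts))


def coherence(facts):
    """Average pairwise similarity of the attributes one pattern extracted."""
    attributes = sorted({a for _, a, _ in facts})
    if len(attributes) < 2:
        return 1.0
    sims = [cosine(EMBEDDINGS[a], EMBEDDINGS[b])
            for a, b in combinations(attributes, 2)]
    return sum(sims) / len(sims)


# Placeholder subjects and objects; only the attributes matter here.
personal = [("S1", "ex-wife", "O1"), ("S2", "boyfriend", "O2"),
            ("S3", "ex-partner", "O3")]
mixed = [("S1", "ex-wife", "O1"), ("S4", "general manager", "O4"),
         ("S5", "subsidiary", "O5")]
```

Both toy patterns have the same frequency, but `coherence(personal)` comes out higher than `coherence(mixed)`, which is the distinction the slide's two examples illustrate.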
Experimental Evaluation ReNoun is capable of generating a large number of high quality facts (≥70% precise at 1M), which our scoring method manages to successfully surface to the top.
Inspiration: The procedures of the experiment, like boosting. The workflow of my experiment should be fairly similar to the overall pipeline of this paper.
References
[1] Rahul Gupta, Alon Y. Halevy, Xuezhi Wang, Steven Euijong Whang, Fei Wu: Biperpedia: An Ontology for Search Applications. PVLDB 7(7): 505-516 (2014)
[2] Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang: Attribute Extraction and Scoring: A Probabilistic Approach. ICDE 2013
[3] Mohamed Yahya, Steven Euijong Whang, Rahul Gupta, Alon Halevy: ReNoun: Fact Extraction for Nominal Attributes. EMNLP 2014
Q&A Thanks