Automated Scientific Paper Classification Linlin Jia
Outline Motivation Related Work Problem Setting Basic Idea
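Motivation
  Search and organize papers into the necessary categories according to different needs
    Improving the precision of Web search
    Community Information Management (DBLife / libra / DBRef)
    Personal Information Management
    Paper-reviewer dispatch
    Any application requiring paper organization or selective and adaptive document dispatching
  Mining topic trends and key factors in the evolution of research
  As scientific research develops, new academic conferences keep emerging and a large number of papers are produced worldwide.
  With the development of the Internet, applications such as digital libraries and Community Information Management have appeared.
  A more accurate method for automatically classifying academic papers is urgently needed.
  Topic-oriented automatic acquisition and automatic classification of academic papers is a meaningful research topic for developing and using Web resources and for delivering personalized services.
  Better understanding of users' search needs.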
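Related Work
  Knowledge engineering (1960s)
  Machine learning (since 1990s)
    Naive Bayes
    K-nearest neighbors
    Support vector machines
    Maximum entropy
    Neural networks
    Decision trees
  Similarity measures
    Bag-of-words
    Cosine
    Okapi
  Drawbacks of content-based methods
    Probability-based methods ignore low-probability events; their advantage is consistency.
    Network-based methods are opaque and hard to interpret; their advantage is that they can learn complex non-linear mappings.
    Rule-based methods are limited in describing uncertain events and in keeping rules compatible with one another; their advantage is that they are interpretable and can cover problems that statistical methods cannot solve.
    Purely text-based methods are not well suited to paper classification:
      Unlike web pages, the full text of a paper is not easy to obtain.
      It has been shown that using only a paper's metadata (title, abstract, keywords) can achieve higher accuracy than using the full text.
      However, not every paper has an abstract or keywords.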
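The bag-of-words and cosine measures listed above can be illustrated with a minimal sketch. It assumes naive lower-cased whitespace tokenization and raw term counts; the function names `bag_of_words` and `cosine` are chosen here for illustration and do not come from the slides.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count terms after lower-casing and splitting on whitespace (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)


title_1 = bag_of_words("paper classification")
title_2 = bag_of_words("paper retrieval")
similarity = cosine(title_1, title_2)  # one shared term out of two per title
```

With only titles and abstracts available, such vectors are short, which is one reason the slides look at evidence beyond text content.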
Related Work
  Measures of the relationship between two documents (web pages / papers)
    small1973: co-citation. Papers A and B are associated because they are both cited by C, D, E and F.
    Kessler1963: bibliographic coupling. Citing papers A and B are related because they both cite papers C, D, E and F.
    Amsler1972: Amsler. A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B.
    DeanH1999: Companion Algorithm (extends HITS)
  Only neighboring nodes are considered.
  [Figure: three example citation graphs, over papers A to F, A to F, and A to I]
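The three citation-based relations can be sketched as follows, assuming the citation graph is stored as a dictionary that maps each paper to the set of papers it cites. The representation and the function names are assumptions made for this illustration; the third Amsler condition is implemented in the one direction stated on the slide.

```python
def cited_by(citations, paper):
    """Return the set of papers whose reference lists contain `paper`."""
    return {src for src, refs in citations.items() if paper in refs}


def co_citation(citations, a, b):
    """Number of papers that cite both a and b."""
    return len(cited_by(citations, a) & cited_by(citations, b))


def bibliographic_coupling(citations, a, b):
    """Number of papers cited by both a and b."""
    return len(citations.get(a, set()) & citations.get(b, set()))


def amsler_related(citations, a, b):
    """True if any of the three Amsler conditions holds for (a, b)."""
    if co_citation(citations, a, b) > 0:  # (1) cited by the same paper
        return True
    if bibliographic_coupling(citations, a, b) > 0:  # (2) cite the same paper
        return True
    # (3) a cites some third paper c that cites b
    return any(b in citations.get(c, set()) for c in citations.get(a, set()))


# A and B are both cited by C, D, E and F.
co_cited = {"C": {"A", "B"}, "D": {"A", "B"}, "E": {"A", "B"}, "F": {"A", "B"}}
# A and B both cite C, D, E and F.
coupled = {"A": {"C", "D", "E", "F"}, "B": {"C", "D", "E", "F"}}
# A cites C, and C cites B.
chain = {"A": {"C"}, "C": {"B"}}
```

Here `co_citation(co_cited, "A", "B")` and `bibliographic_coupling(coupled, "A", "B")` both count four shared papers, and `amsler_related(chain, "A", "B")` holds through the third condition only.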
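Related Work
  Hybrid methods
    Fusion of evidence: PMENBM03, CaladoCMZNG
      Combining link-based and content-based methods using a Bayesian network
      CaladoCMZNG: combining the decisions of linkage and text classifiers using a belief network strategy
    Fusion of evidence: JoachimsCT2001
      Studies linear combinations of support vector machine kernel functions representing co-citation and textual information
  1. Link-based methods classify a large number of documents correctly but introduce noise. Content-based methods filter out some of that noise but also remove documents that the link-based methods had classified correctly. When the two kinds of evidence are mixed across different applications and data sets, performance therefore depends on the relative importance of each kind of evidence. The work above treats the two as equally important.
  2. Because of missing information, some attributes have no value; for example, not every paper has an abstract, keywords, or category information for its references.
  3098 papers, 11712 citation links, 76% same class, 24% cross topics
  4. Characteristics of survey papers in classification
  1) Link information is useful when the documents have a high link density and most links are of high quality.
  2) A paper can be assigned to only one category, but as scientific research develops, the sharp boundaries between traditional disciplines have been partly broken down; papers from frontier, interdisciplinary, and cross-cutting fields should belong to several categories (ACM classification).
  3) The classification level is fairly high (levels 1-2).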
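The idea of linearly combining a text kernel with a co-citation kernel can be sketched as below. The slides do not give the exact form used in JoachimsCT2001, so the convex combination with a single `weight` parameter is an assumption, as are the function name and the toy matrices.

```python
def combine_kernels(k_text, k_link, weight):
    """Return weight * k_text + (1 - weight) * k_link, element by element.

    Both kernels are square matrices (lists of rows) over the same documents.
    `weight` is a hypothetical mixing parameter in [0, 1]; 0.5 treats the two
    kinds of evidence as equally important.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * t + (1.0 - weight) * l for t, l in zip(row_t, row_l)]
        for row_t, row_l in zip(k_text, k_link)
    ]


# Toy similarity matrices for two papers (made-up values for illustration).
k_text = [[1.0, 0.2], [0.2, 1.0]]  # textual similarity
k_link = [[1.0, 0.8], [0.8, 1.0]]  # co-citation similarity
k_equal = combine_kernels(k_text, k_link, 0.5)
k_text_heavy = combine_kernels(k_text, k_link, 0.9)
```

Making `weight` explicit reflects the observation that the relative importance of link and content evidence differs between data sets, whereas a fixed 0.5 treats them as the same.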
Related Work
  ZhangGFCFCC2004, ZhangCFFGCC2005: non-linear similarity functions learned through Genetic Programming techniques
  VelosoMCGZ2006: rule-based combination
  Drawbacks of the above methods
    Low precision when the data set has low link density
    Not multi-label; only high-level categories
    Need a large testing set
Problem Setting
  Definition
    C = {c1, c2, c3, …, cn} is a set of predefined categories.
    D = {d1, d2, d3, …, dm} is a set of scientific papers.
    Φ: D×C→{T, F}
  The metadata of the papers is stored in a database.
  The categories are not just symbolic labels; their meaning is available.
  Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available. In particular, metadata such as publication date, document type, and publication source is assumed to be available.
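The setting can be made concrete with a small sketch in which Φ is a function from a (paper, category) pair to a Boolean, so that one paper may be accepted by several categories. The `Paper` and `Category` types, their fields, and the term-overlap rule inside `phi` are placeholders assumed for illustration; they are not the classifier proposed in these slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: Optional[str] = None  # not every paper has an abstract
    keywords: Tuple[str, ...] = ()  # not every paper has keywords
    year: Optional[int] = None      # example of exogenous metadata


@dataclass(frozen=True)
class Category:
    label: str
    description: str  # the category's meaning, not just a symbolic label


def phi(paper, category):
    """Placeholder decision rule: True if paper and category share a term."""
    paper_terms = set(paper.title.lower().split())
    paper_terms |= {k.lower() for k in paper.keywords}
    if paper.abstract:
        paper_terms |= set(paper.abstract.lower().split())
    category_terms = set(category.description.lower().split())
    return bool(paper_terms & category_terms)


def assigned_categories(paper, categories):
    """Labels of every category c for which phi(paper, c) is True."""
    return [c.label for c in categories if phi(paper, c)]


categories = [
    Category("ML", "kernel learning"),
    Category("IR", "retrieval search"),
    Category("DB", "transaction storage"),
]
paper = Paper(title="Kernel methods for text retrieval")
labels = assigned_categories(paper, categories)
```

Because Φ is evaluated per pair, the example paper receives both the "ML" and "IR" labels, which is the multi-label behaviour the problem setting allows.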
Analysis
  Shortcomings of existing work
    Cannot interpret the results
    Do not use network-based machine learning methods
    Need a large data set and high link density
  Extend the source
    Authors with different backgrounds
      Background may be academic or national.
      Authors with different backgrounds name the same problem differently and have different wording habits.
      Image-processing researchers, artificial-intelligence researchers, Web researchers
    Cross topics
    Multi-label
    Topic evolution
    Time factor
Basic Idea
  Ci = <L, Di>
    L: label
    Di: the set of papers classified under L (known papers of user i and other papers in directories named L)
  [Figure: user directories in DBRef; categories c2 to c4 and papers d2 to d5 connected by inner links and outer links]
Basic Idea
  Step 1: extended content-based method
    Extend the text content with CiteSeer to overcome the limitation of a small data set.
  Step 2: extended link-based method
    Add extra links to overcome the limitation of a low-density data set.
  Step 3: combine
Basic Idea
  [Figure: graph over papers A, B, C, D, E and F]
Author Information
  Social network (co-author network)
  How to combine the social network and the citation network?
  Method 1
    Compute the distance (dist) between P1(A,B,C,D,E) and P2(A,C,B,D,E)
    Compute P(ci|dist)
Time Information
  MourãoRA2008: The characteristics of the documents and the classes to which they belong may change over time, since new information is created, new terms are introduced, new fields emerge, and large fields are divided into more specialized sub-fields.
  How to express the effect of the temporal factor?
  Does the temporal factor affect the results of link-based methods?
Citation Text Information
  CiteSeer
    Citation text from papers external to our collection will be added.
    This mitigates the drawback of a small data set and also improves the missing-data situation.
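Location Information
  One word at different locations
    "Experiment" in the abstract: a frequently occurring word that should be deleted
    "Experiment" in the keywords / general terms: the main content of the paper is experimental
  One citation at different locations
    Citing A in the introduction / background section
    Citing A in the experiments section
  Combining the relation between location and category with the semantics of the citation context makes it possible to judge how the category of the cited paper relates to that of the citing paper.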
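One simple way to let the location of a word influence its contribution is to weight each occurrence by the field it appears in. The sketch below assumes whitespace tokenization; the function name and the weight values are hypothetical and are not taken from the slides.

```python
def location_weighted_counts(fields, weights):
    """Sum, per term, the weight of every location where the term occurs.

    `fields` maps a location name (e.g. "abstract") to its text.
    `weights` maps a location name to a hypothetical importance weight;
    locations without a weight contribute nothing.
    """
    scores = {}
    for location, text in fields.items():
        weight = weights.get(location, 0.0)
        for token in text.lower().split():
            scores[token] = scores.get(token, 0.0) + weight
    return scores


fields = {
    "abstract": "experiment results",
    "keywords": "experiment",
}
# Assumed weights: a keyword occurrence counts three times as much as one
# in the abstract.
weights = {"abstract": 1.0, "keywords": 3.0}
scores = location_weighted_counts(fields, weights)
```

The same scheme could be applied to citations by treating the section in which a reference is cited (introduction, background, experiments) as the location.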