Automated Scientific Paper Classification

Presentation transcript:

Automated Scientific Paper Classification Linlin Jia

Outline
- Motivation
- Related Work
- Problem Setting
- Basic Idea

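Motivation
Search and organize papers into the necessary categories according to different needs:
- Improving the precision of Web searching
- Community Information Management (DBLife / Libra / DBRef)
- Personal Information Management
- Paper-reviewer dispatch
- Any application requiring paper organization or selective and adaptive document dispatching
Mining topic trends and key factors in the research evolution process
Notes: As scientific research develops, new academic conferences keep emerging and a large number of papers are produced worldwide. With the growth of the Internet, applications such as digital libraries and Community Information Management have appeared, and they urgently need a more accurate method for classifying academic papers automatically. Studying how to acquire and classify academic papers automatically by topic is a meaningful problem for developing and using Web resources and for providing personalized services. It also helps to understand users' search needs better.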

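Related Work
Knowledge engineering (1960s)
Machine learning (since the 1990s)
- Naive Bayes
- K-nearest neighbors
- Support vector machines
- Maximum entropy
- Neural networks
- Decision trees
Similarity measures
- Bag-of-words
- Cosine
- Okapi
Drawbacks of content-based methods
- Probability-based methods ignore low-probability events; their advantage is consistency.
- Network-based methods are opaque and hard to interpret; their advantage is that they can learn complex nonlinear mappings.
- Rule-based methods are limited in describing uncertain events and in keeping rules compatible with one another; their advantage is that they are interpretable and can cover problems that statistical methods cannot solve.
- Purely text-based methods are not suitable for paper classification: unlike Web pages, the full text of a paper is not easy to obtain.
- It has been shown that using only a paper's metadata (title, abstract, keywords) can achieve higher accuracy than using the full text, but not every paper has an abstract and keywords.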
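
The bag-of-words and cosine measures named on this slide can be illustrated with a minimal sketch. The tokenizer (lowercase, split on whitespace) and the example titles are assumptions made for illustration; they do not come from the presentation.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors."""
    # Counter returns 0 for terms missing from bag_b.
    dot = sum(count * bag_b[term] for term, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical paper titles used only as sample metadata.
title_a = bag_of_words("automated scientific paper classification")
title_b = bag_of_words("scientific paper classification with citations")
title_c = bag_of_words("mobile routing protocol")

sim_ab = cosine_similarity(title_a, title_b)  # three shared terms
sim_ac = cosine_similarity(title_a, title_c)  # no shared terms
```

Titles that share terms score above zero and titles with no common terms score zero, which is why such a measure struggles when only short metadata is available.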

Related Work
Measures of the relationship between two documents (Web pages / papers)
- Kessler1963, bibliographic coupling: papers A and B are related because they cite papers C, D, E and F.
- Small1973, co-citation: papers A and B are associated because they are both cited by C, D, E and F.
- Amsler1972, Amsler: A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B.
- DeanH1999, Companion Algorithm (extends HITS): considers only neighboring nodes.
[Figure: example citation graphs over nodes A to I illustrating the measures above]
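
The three relatedness measures above can be sketched over a small citation graph. The graph below is hypothetical and the function names are chosen for illustration; only the three conditions themselves come from the slide.

```python
def cited_by(citations, paper):
    """Set of papers that cite `paper`."""
    return {src for src, targets in citations.items() if paper in targets}


def co_citation(citations, a, b):
    """Number of papers that cite both a and b."""
    return len(cited_by(citations, a) & cited_by(citations, b))


def bibliographic_coupling(citations, a, b):
    """Number of papers cited by both a and b."""
    return len(citations.get(a, set()) & citations.get(b, set()))


def amsler_related(citations, a, b):
    """True if any of the three conditions listed on the slide holds."""
    if co_citation(citations, a, b) > 0:
        return True
    if bibliographic_coupling(citations, a, b) > 0:
        return True
    # Condition (3): a cites some third paper c that cites b.
    return any(b in citations.get(c, set()) for c in citations.get(a, set()))


# Hypothetical graph: each key cites the papers in its value set.
citations = {
    "C": {"A", "B"}, "D": {"A", "B"}, "E": {"A", "B"}, "F": {"A", "B"},
    "G": {"H"}, "H": {"I"},
}

shared_citers = co_citation(citations, "A", "B")
shared_references = bibliographic_coupling(citations, "C", "D")
```

Condition (3) is directional as written on the slide, so `amsler_related(citations, "G", "I")` and `amsler_related(citations, "I", "G")` need not agree.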

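Related Work
Hybrid methods
- Fusion of evidence (PMENBM03, CaladoCMZNG): combining link-based and content-based methods using a Bayesian network. CaladoCMZNG combines the decisions of linkage and text classifiers using a belief network strategy.
- Fusion of evidence (JoachimsCT2001): studies a linear combination of support vector machine kernel functions representing co-citation and textual information.
Notes:
- Link-based methods can classify a large number of documents correctly but introduce noise; content-based methods can filter out some of that noise but also remove documents that the link-based method had classified correctly. When the two kinds of evidence are mixed across different applications and data sets, performance therefore depends on the relative importance of each kind of evidence. The work above treats the two as equally important.
- Because of missing information, some attributes have no value. For example, not every paper has an abstract, keywords, or category information for its references.
- 3098 papers, 11712 citation links: 76% same class, 24% cross topics.
- Survey papers have their own characteristics in classification.
- Link information is useful when the documents have a high link density and most links are of high quality.
- A paper can be assigned to only one category. As scientific research develops, however, the once-sharp boundaries between traditional disciplines have been partly broken down, and papers from frontier, interdisciplinary and cross-cutting fields should belong to several categories (ACM classification).
- The classification level is relatively high (levels 1-2).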

Related Work
- ZhangGFCFCC2004, ZhangCFFGCC2005: non-linear similarity functions through Genetic Programming techniques
- VelosoMCGZ2006: rule-based combination
Drawbacks of the above methods
- Low precision when the data set has low link density
- Not multi-label
- High-level categories only
- Need a big testing set

Problem Setting
Definition
- C = {c1, c2, c3, …, cn} is a set of predefined categories.
- D = {d1, d2, d3, …, dm} is a set of scientific papers.
- Φ: D × C → {T, F}, where Φ(d, c) = T means that paper d belongs to category c.
The metadata of the papers is stored in a database.
The categories are not just symbolic labels; their meaning is available.
Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available. In particular, metadata such as publication date, document type and publication source is assumed to be available.
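
One way to picture the definition is to store Φ as the set of (paper, category) pairs that map to T. This is a minimal sketch: the class names, field names and sample values are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Paper:
    """Metadata record for one paper; the field names are illustrative."""
    paper_id: str
    title: str
    publication_year: int
    source: str


@dataclass
class Classification:
    """Phi: D x C -> {T, F}, kept as the set of pairs that map to T."""
    categories: set
    assigned: set = field(default_factory=set)

    def assign(self, paper, category):
        """Record that Phi(paper, category) = T."""
        if category not in self.categories:
            raise ValueError(f"unknown category: {category}")
        self.assigned.add((paper.paper_id, category))

    def phi(self, paper, category):
        """Return True when the paper belongs to the category."""
        return (paper.paper_id, category) in self.assigned

    def labels(self, paper):
        """All categories assigned to the paper (possibly more than one)."""
        return {c for (pid, c) in self.assigned if pid == paper.paper_id}


# Hypothetical sample data.
d1 = Paper("d1", "Example paper", 2000, "example venue")
phi_table = Classification(categories={"c1", "c2", "c3"})
phi_table.assign(d1, "c1")
phi_table.assign(d1, "c2")
```

Because Φ is defined on pairs, nothing in the definition prevents one paper from holding several categories, which is what a multi-label setting needs.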

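Analysis
Shortcomings of existing works
- Cannot interpret the results
- Do not use network-based machine learning methods
- Need a big data set and high link density
Extend the source
- Authors with different backgrounds
- Cross topics
- Multi-label
- Topic evolution
- Time factor
Notes: A background may be academic or national. Authors with different backgrounds name the same problem differently and have different wording habits, for example image-processing researchers, artificial-intelligence researchers and Web researchers.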

Basic Idea
Ci = <L, Di>
- L: label
- Di: the set of papers classified under L (known papers of user i and other papers in directories named L)
[Figure: user directories in DBRef (c2, c3, c4) and papers (d2, d3, d4, d5), connected by inner links and outer links]

Basic Idea
- Step 1, extended content-based method: extend the text content with CiteSeer to overcome the limitation of the small data set.
- Step 2, extended link-based method: add extra links to overcome the limitation of the low-density data set.
- Step 3: combine.
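
The slides do not say how Step 3 combines the two kinds of evidence. As one possible reading, the sketch below uses a weighted sum of per-category scores followed by a threshold. The weight `alpha`, the threshold and the score values are all assumptions made for illustration, not the author's method.

```python
def combine_scores(content_scores, link_scores, alpha=0.5):
    """Weighted sum of per-category scores; alpha weights the content evidence."""
    categories = set(content_scores) | set(link_scores)
    return {
        c: alpha * content_scores.get(c, 0.0)
        + (1 - alpha) * link_scores.get(c, 0.0)
        for c in categories
    }


def predict_labels(combined, threshold=0.5):
    """Multi-label decision: keep every category whose score reaches the threshold."""
    return sorted(c for c, score in combined.items() if score >= threshold)


# Hypothetical scores from Step 1 (content) and Step 2 (links) for one paper.
content_scores = {"c1": 0.8, "c2": 0.4, "c3": 0.1}
link_scores = {"c1": 0.6, "c2": 0.7}

combined = combine_scores(content_scores, link_scores, alpha=0.5)
labels = predict_labels(combined, threshold=0.5)
```

Setting `alpha` away from 0.5 lets one kind of evidence count for more than the other, and thresholding instead of taking the single best category allows a paper to receive several labels.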

Basic Idea
[Figure: graph over nodes A, B, C, D, E and F]

Author Information
Social network (co-author network)
How to combine the social network and the citation network?
Method 1
- Compute the distance dist between P1(A,B,C,D,E) and P2(A,C,B,D,E)
- Compute P(ci|dist)

Time Information
MourãoRA2008: The characteristics of the documents and the classes to which they belong may change over time, since new information is created, new terms are introduced, new fields emerge, and large fields are divided into more specialized sub-fields.
How to express the effect of the temporal factor?
Does the temporal factor affect the result of the link-based method?

Citation Text Information
CiteSeer: citation text from papers external to our collection will be added.
Notes: This mitigates the small size of the data set and improves the missing-data situation.

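Location Information
One word at different locations
- "Experiment" in the abstract: a frequently occurring word that should be deleted.
- "Experiment" in the keywords / general terms: the main content of the paper is experiments.
One citation at different locations
- Cite A in the introduction / background section
- Cite A in the experiments section
Notes: Combining the relationship between location and category with the semantics of the citation context makes it possible to judge how the category of the cited paper relates to that of the citing paper.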