Automated Scientific Paper Classification


Automated Scientific Paper Classification Linlin Jia

Outline
- Motivation
- Related Work
- Problem Setting
- Basic Idea

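Motivation
- Search and organize papers into the necessary categories according to different needs
  - Improving the precision of Web searching
  - Community Information Management (DBLife / libra / DBRef)
  - Personal Information Management
  - Paper-reviewer dispatch
  - Any application requiring paper organization or selective and adaptive document dispatching
- Mining topic trends and key factors in the research evolution process
Notes:
- As scientific research develops, new academic conferences keep emerging and a large number of papers are published worldwide.
- With the growth of the Internet, applications such as digital libraries and Community Information Management have appeared.
- A more accurate method for automatically classifying academic papers is urgently needed.
- Topic-oriented automatic acquisition and classification of academic papers is a meaningful problem for developing and using Web resources and for providing personalized services.
- It also helps to better understand users' search needs.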


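Related Work
- Knowledge Engineering (1960s)
- Machine learning (since 1990s)
  - Naive Bayes
  - K-nearest neighbors
  - Support vector machines
  - Maximum entropy
  - Neural networks
  - Decision trees
- Similarity measures
  - Bag-of-words
  - Cosine
  - Okapi
- Drawbacks of content-based methods
Notes:
- Probability-based methods ignore low-probability events; their advantage is consistency.
- Network-based methods are opaque and hard to interpret; their advantage is that they can learn complex non-linear mappings.
- Rule-based methods are limited in describing uncertain events and in keeping rules compatible with one another; their advantage is that they are interpretable and can cover problems that statistical methods cannot solve.
- Purely text-based methods are not well suited to paper classification:
  - Unlike web pages, the full text of a paper is not easy to obtain.
  - Using only a paper's metadata (title, abstract, keywords) has been shown to reach higher accuracy than using the full text.
  - However, not every paper has an abstract and keywords.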
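As a minimal sketch of the bag-of-words and cosine measures listed on this slide, the snippet below compares paper titles as term-count vectors. The tokenizer, the function names, and the sample titles are illustrative assumptions, not part of the original deck.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical titles, used only to exercise the functions.
title_a = bag_of_words("support vector machines for text classification")
title_b = bag_of_words("text classification with naive bayes")
title_c = bag_of_words("routing protocols for vehicular networks")

sim_ab = cosine_similarity(title_a, title_b)  # two shared terms
sim_ac = cosine_similarity(title_a, title_c)  # only "for" is shared
```

Short titles share few terms, so the scores are driven by one or two words. This is one way to see why the slide treats metadata-only text as fragile when abstracts or keywords are missing.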

Related Work
Measures of the relationship between two documents (web pages / papers)
- Co-citation (small1973): papers A and B are associated because they are both cited by papers C, D, E and F.
- Bibliographic coupling (Kessler1963): papers A and B are related because they both cite papers C, D, E and F.
- Amsler (Amsler1972): A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B.
- Companion Algorithm (DeanH1999): extends HITS.
Notes:
- Only neighboring nodes are considered.
[Figures: three example citation graphs over papers A to I, one per measure]
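The three citation measures on this slide can be sketched over a list of (citing, cited) pairs. This is a hedged illustration: the function names and the toy citation list are assumptions, and the Amsler check simply tests the three conditions as worded on the slide.

```python
from collections import defaultdict


def build_index(citations):
    """Return (cites, cited_by) maps built from (citing, cited) pairs."""
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def co_citation(a, b, cited_by):
    """Number of papers that cite both a and b."""
    return len(cited_by[a] & cited_by[b])


def bibliographic_coupling(a, b, cites):
    """Number of papers cited by both a and b."""
    return len(cites[a] & cites[b])


def amsler_related(a, b, cites, cited_by):
    """True if any of the three conditions listed on the slide holds."""
    if cited_by[a] & cited_by[b]:  # (1) cited by the same paper
        return True
    if cites[a] & cites[b]:        # (2) cite the same paper
        return True
    # (3) a cites a third paper c that cites b
    return any(b in cites[c] for c in list(cites[a]))


# Hypothetical citation graph, used only to exercise the functions.
citations = [
    ("C", "A"), ("C", "B"), ("D", "A"), ("D", "B"),
    ("E", "A"), ("E", "B"), ("F", "A"), ("F", "B"),
    ("X", "P"), ("X", "Q"), ("Y", "P"), ("Y", "Q"),
    ("G", "H"), ("H", "I"),
]
cites, cited_by = build_index(citations)
```

With this toy graph, A and B are co-cited by four papers, X and Y are coupled through two shared references, and G is related to I only through the intermediate paper H.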

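Related Work
Hybrid methods
- Fusion of evidence (PMENBM03, CaladoCMZNG): combining link-based and content-based methods using a Bayesian network; CaladoCMZNG combines the decisions of linkage and text classifiers using a belief network strategy.
- Fusion of evidence (JoachimsCT2001): studies linear combinations of support vector machine kernel functions representing co-citation and textual information.
Observations:
1. Link information is useful when the documents have a high link density and most links are of high quality.
2. A paper can be assigned to only one category. As research develops, the once sharp boundaries between traditional disciplines have partly broken down, and papers from frontier, interdisciplinary and cross-cutting fields should belong to several categories (ACM classification).
3. The classification level is relatively high (levels 1-2).
Notes:
1. Link-based methods classify a large number of documents correctly but introduce noise; content-based methods filter out some of that noise but also remove documents that the link-based method had classified correctly. When the two kinds of evidence are mixed across different applications and data sets, performance therefore depends on the relative importance of each kind. The works above treat the two as equally important.
2. Because of missing information, some attributes have no value; for example, not every paper has an abstract, keywords, or category information for its references.
3. Data set: 3098 papers, 11712 citation links; 76% of links stay within the same class and 24% cross topics.
4. Characteristics of survey papers in classification.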
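A linear combination of a textual kernel and a co-citation kernel, as mentioned for JoachimsCT2001, can be sketched on precomputed Gram matrices. The function name, the weight parameter, and the matrix values are illustrative assumptions; this is not the cited authors' implementation. A convex combination of two valid kernels is itself a valid kernel, which is why the weight is restricted to [0, 1].

```python
def combine_kernels(k_text, k_cocitation, weight):
    """Return weight * k_text + (1 - weight) * k_cocitation, entry by entry."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * t + (1.0 - weight) * c for t, c in zip(row_t, row_c)]
        for row_t, row_c in zip(k_text, k_cocitation)
    ]


# Hypothetical 3x3 Gram matrices over three papers.
k_text = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
k_cocitation = [
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.4],
    [0.6, 0.4, 1.0],
]

# weight = 0.5 treats both kinds of evidence as equally important.
k_equal = combine_kernels(k_text, k_cocitation, 0.5)
# weight = 0.9 trusts the text far more than the links.
k_text_heavy = combine_kernels(k_text, k_cocitation, 0.9)
```

Tuning the weight per data set is one way to express that the relative importance of link and content evidence varies between collections.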

Related Work
- ZhangGFCFCC2004, ZhangCFFGCC2005: non-linear similarity functions through Genetic Programming techniques
- VelosoMCGZ2006: rule-based combination
Drawbacks of the above methods
- Low precision when the data set has low link density
- Not multi-label; only high-level categories
- Need a big testing set


Problem Setting
Definition
- C = {c1, c2, c3, …, cn} is a set of predefined categories.
- D = {d1, d2, d3, …, dm} is a set of scientific papers.
- Φ: D × C → {T, F}
- The metadata of the papers is stored in a database.
- The categories are not just symbolic labels; their meaning is available.
- Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available. In particular, metadata such as publication date, document type and publication source is assumed to be available.
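The definition Φ: D × C → {T, F} can be read as a function that is evaluated on every (paper, category) pair, so a single paper may receive several labels. The sketch below is only an illustration: the `Paper` and `Category` types, the keyword-overlap rule standing in for Φ, and the sample data are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Paper:
    paper_id: str
    title: str
    keywords: FrozenSet[str]


@dataclass(frozen=True)
class Category:
    label: str
    description: FrozenSet[str]  # the category's "meaning", here a set of terms


# Type of Φ: takes a paper and a category, returns True or False.
Classifier = Callable[[Paper, Category], bool]


def keyword_overlap(paper, category):
    """Toy stand-in for Φ: True when paper and category share a term."""
    return bool(paper.keywords & category.description)


def assign(papers, categories, phi):
    """Evaluate phi on every (paper, category) pair and collect the labels."""
    return {
        p.paper_id: [c.label for c in categories if phi(p, c)]
        for p in papers
    }


# Hypothetical data.
papers = [
    Paper("d1", "Kernels for citation data", frozenset({"svm", "citation"})),
    Paper("d2", "Multi-path routing", frozenset({"routing"})),
]
categories = [
    Category("Machine Learning", frozenset({"svm", "bayes"})),
    Category("Information Retrieval", frozenset({"citation", "search"})),
    Category("Networks", frozenset({"routing"})),
]
labels = assign(papers, categories, keyword_overlap)
```

Here paper d1 ends up in two categories, which is the multi-label behavior the deck argues for.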


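Analysis
- Shortcomings of existing work
  - Cannot interpret the results
  - Does not use network-based machine learning methods
  - Needs a big data set and high link density
- Extend the source
  - Authors with different backgrounds
  - Cross topics: multi-label
  - Topic evolution: time factor
Notes:
- Background may mean academic field, or it may mean country.
- Authors with different backgrounds name the same problem differently and have different wording habits, for example image-processing researchers, artificial-intelligence researchers and Web researchers.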

Basic Idea
- Ci = <L, Di>
  - L: label
  - Di: a set of papers classified under L (known papers of user i and other papers in directories named L)
[Figure: user directory in DBRef, with nodes d2 to d5 and c2 to c4 and edges labeled "inner link" and "outer link"]
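The figure on this slide does not survive in the transcript, so the following sketch rests on an assumed reading: an "inner link" is a citation whose two endpoints both lie in the paper set Di of a category, and an "outer link" is a citation with exactly one endpoint in Di. The function name and the sample data are hypothetical.

```python
def split_links(category_papers, citations):
    """Split (citing, cited) pairs into inner and outer links for one category.

    Assumed reading: inner = both papers in the category's set,
    outer = exactly one of the two papers in the set.
    """
    inner, outer = [], []
    for citing, cited in citations:
        citing_in = citing in category_papers
        cited_in = cited in category_papers
        if citing_in and cited_in:
            inner.append((citing, cited))
        elif citing_in or cited_in:
            outer.append((citing, cited))
        # Links with no endpoint in the category are ignored.
    return inner, outer


# Hypothetical category Ci = <L, Di> with Di = {d2, d3, d4}.
di = {"d2", "d3", "d4"}
citations = [("d2", "d3"), ("d3", "d4"), ("d4", "d5"), ("d6", "d7")]
inner_links, outer_links = split_links(di, citations)
```

Under this reading, the share of inner links gives a rough signal of how self-contained a user's directory is.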

Basic Idea
- Step 1: extended content-based method. Extend the text content using CiteSeer to overcome the limitation of a small data set.
- Step 2: extended link-based method. Add extra links to overcome the limitation of a low-density data set.
- Step 3: combine.

Basic Idea
[Figure: example graph over papers A to F]

Author Information
- Social network (co-author network)
- How to combine the social network and the citation network?
- Method 1
  - Compute the distance (dist) between P1(A, B, C, D, E) and P2(A, C, B, D, E)
  - Compute P(ci | dist)

Time Information
- MourãoRA2008: The characteristics of the documents and the classes to which they belong may change over time, since new information is created, new terms are introduced, new fields emerge, and large fields are divided into more specialized sub-fields.
- How to express the effect of the temporal factor?
- Does the temporal factor affect the result of the link-based method?

Citation Text Information
- CiteSeer
- Citation text on papers external to our collection will be added.
Notes:
- This mitigates the drawback of a small data set and improves the handling of missing data.

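Location Information
- One word at different locations
  - "Experiment" in the abstract: a frequently occurring word that should be deleted
  - "Experiment" in the keywords / General Terms: the main content of the paper is experimental
- One citation at different locations
  - Cite A in the Introduction / Background section
  - Cite A in the Experiments section
Notes:
- Combining the relationship between location and category with the semantics of the citation context makes it possible to judge how the category of the cited paper relates to that of the current paper.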
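One way to make a word's location matter is to weight each occurrence by the section it appears in and to drop words that are uninformative in a given section. The sketch below is an assumption-laden illustration: the weights, the per-section drop lists, and the function name are hypothetical and not taken from the deck.

```python
from collections import Counter


def weighted_terms(sections, weights, drop):
    """Score terms by the sections they occur in.

    sections: section name -> list of tokens
    weights:  section name -> weight of one occurrence (hypothetical values)
    drop:     section name -> terms to ignore in that section
    """
    scores = Counter()
    for section, tokens in sections.items():
        weight = weights.get(section, 0.0)
        ignored = drop.get(section, set())
        for token in tokens:
            if token not in ignored:
                scores[token] += weight
    return scores


# "experiment" is dropped from the abstract but kept in the keywords.
sections = {
    "abstract": ["experiment", "shows", "classification"],
    "keywords": ["experiment", "classification"],
}
weights = {"abstract": 1.0, "keywords": 2.0}
drop = {"abstract": {"experiment"}}

scores = weighted_terms(sections, weights, drop)
```

The same pattern could be applied to citations by keying on the section in which a paper is cited, although the deck leaves the exact scheme open.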