The Principle of Information Retrieval


The Principle of Information Retrieval Department of Information Management School of Information Engineering Nanjing University of Finance & Economics 2011

II Course Content

5 Evaluation in information retrieval

How to measure user happiness The key utility measure is user happiness. It depends on the relevance of results (effectiveness), the speed of response, the size of the index, and the user interface design, the last of which is independent of the quality of the results. Related notions: utility, success, completeness, satisfaction, worth, value, time, cost, …

5.1 Introduction

The role of evaluation Evaluation plays an important role in information retrieval research. It has always been at the core of IR system development, to the point that an algorithm and the way its effectiveness is evaluated are treated as one and the same (Saracevic, SIGIR 1995).

Origins of IR system evaluation Kent et al. were the first to propose the concepts of precision and recall (initially called "relevance") (Kent, 1955).

Origins of IR system evaluation Cranfield-like evaluation methodology: in the late 1950s and early 1960s, Cranfield proposed an evaluation scheme based on a set of sample queries, a set of standard answers, and a document corpus; it has been called the "grand-daddy" of IR evaluation and established evaluation's central place in IR research. Cranfield is a place name and also the name of a research institute.

Origins of IR system evaluation Gerard Salton and the SMART system: Gerard Salton was the principal developer of SMART. SMART was the first system to provide a research platform on which one could focus on algorithms alone, without worrying about indexing and the like; it also provided evaluation tools that, once the relevance answers were supplied, computed the commonly used metrics.

Origins of IR system evaluation Sparck Jones's book "Information retrieval experiment", which mainly discusses IR experimentation and evaluation.

How to measure information retrieval effectiveness1-2 We need a test collection consisting of three things: a document collection; a test suite of information needs, expressible as queries; and a set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair.

How to measure information retrieval effectiveness2-2 And in this test collection: a document is given a binary classification as either relevant or not relevant; the collection and the suite of information needs have to be of a reasonable size, since results are highly variable over different documents and information needs (at least 50 information needs are usually needed).

Difficulties The difference between the stated information need and the query (relevance is assessed relative to an information need, not a query); the subjectivity of relevance decisions; and the fact that many systems contain parameters that can be adjusted to tune performance, so the correct procedure is to tune on one or more development test collections and report results on the separate test collection.

Difficulties Voorhees estimated that judging the relevance of an 8-million-document collection for a single query topic would take one assessor about nine months of work. TREC therefore introduced the pooling method, which greatly reduces the judging workload while keeping the evaluation results reliable. Its drawback is that only a small number of queries can be handled: even for a small query set, it still takes more than ten assessors one to two months.
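As a rough illustration of pooling (not TREC's actual tooling), the pool for a topic can be formed as the union of the top-k documents from each participating run; only pooled documents go to the assessors, and everything outside the pool is treated as nonrelevant. A minimal sketch with hypothetical run data:

```python
from typing import Dict, List, Set

def build_pool(runs: Dict[str, List[str]], k: int = 100) -> Set[str]:
    """Union of the top-k documents from each system's ranked run."""
    pool: Set[str] = set()
    for system, ranking in runs.items():
        pool.update(ranking[:k])
    return pool

# Hypothetical runs for one query topic: system name -> ranked doc ids.
runs = {
    "system_A": ["d12", "d7", "d3", "d99"],
    "system_B": ["d7", "d42", "d12", "d5"],
}
pool = build_pool(runs, k=3)
# Only documents in `pool` are judged by assessors;
# unpooled documents are treated as nonrelevant.
print(sorted(pool))
```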

5.2 Standard test collections The Cranfield collection Text Retrieval Conference (TREC) GOV2 NII Test Collections for IR Systems (NTCIR) Reuters-21578 and Reuters-RCV1

The Cranfield collection Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries. It allows precise quantitative measures, but it is too small.

Text Retrieval Conference (TREC)1-2 The U.S. National Institute of Standards and Technology (NIST) has run a large IR test-bed evaluation series since 1992, over a range of different test collections. The best-known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999; they comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles).

Text Retrieval Conference (TREC)2-2 TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles Relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation

GOV2 Contains 25 million web pages. GOV2 is one of the largest Web test collections, but it is still more than 3 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.

NII Test Collections for IR Systems (NTCIR) Of similar sizes to the TREC collections, focusing on East Asian languages and cross-language information retrieval.

Reuters-21578 and Reuters-RCV1 The collections most used for text classification. Reuters-21578 is a collection of 21,578 newswire articles; Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents.

5.3 Evaluation of unranked retrieval sets Precision Recall Accuracy F measure

Precision Precision (P) is the fraction of retrieved documents that are relevant

Recall Recall (R) is the fraction of relevant documents that are retrieved

Another way to define P and R

The comparison of P and R Typical web surfers prefer P to R; various professional searchers such as paralegals and intelligence analysts prefer R to P; individuals searching their hard disks also prefer R to P.

The comparison of P and R The two quantities clearly trade off against one another. Recall is a non-decreasing function of the number of documents retrieved: one can always get a recall of 1 (but very low precision) by retrieving all documents for all queries. Precision usually decreases as the number of documents retrieved is increased.

Which is more difficult to measure? Precision Recall

Questioning recall The denominator (the number of relevant documents in the collection) cannot be determined, so a recall measure built on it is impractical in theory. And what is the reason that relevant documents are not retrieved: the quality of the database design or of the user's searching? A problem of indexing or of retrieval? Or does every database have some inherent relevance coefficient?

Questioning precision When the database contains a large number of documents that should be retrievable but cannot be found, is it meaningful for the retrieved documents to be 100% precise? And how do the irrelevant documents in the precision denominator arise: are they produced by the system, by the user's poorly expressed query, or by the user's final selection?

Relative precision

Questioning the relationship between recall and precision1-2 a: relevant and retrieved; b: irrelevant and retrieved; c: relevant and unretrieved; (the remaining cell is irrelevant and unretrieved).

Questioning the relationship between recall and precision2-2 The relationship is generally assumed to be inverse: if recall a/(c+a) increases, c must decrease; c can decrease in two ways (the c boundary moving down or the b boundary moving right), and either way b increases, so precision a/(b+a) decreases, and vice versa. But this argument rests on the assumption that a is a constant. In fact the constant is not a but (c+a): if c decreases, a must increase, so a is a variable. Therefore there is no necessary link between b and c; one can imagine the b and c boundaries both moving toward the edges. Having both b and c equal to 0 is unlikely, but not impossible. In fact, a retrieval system can improve both measures at the same time.

Contingency table
                 Relevant                Not relevant
Retrieved        true positives (tp)     false positives (fp)
Not retrieved    false negatives (fn)    true negatives (tn)

The other way to define P and R P = tp/(tp + fp) R = tp/(tp + fn)
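As a minimal sketch of these two contingency-table formulas (function names and counts are illustrative, not from the slides):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Example: 30 relevant retrieved, 10 irrelevant retrieved, 20 relevant missed.
print(precision(tp=30, fp=10))  # 0.75
print(recall(tp=30, fn=20))     # 0.6
```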

F measure1-4 A single measure that trades off precision versus recall, with weight α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure uses β = 1.
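The formula on this slide does not survive in the transcript; in the standard formulation the weighted F measure is

```latex
F = \frac{1}{\alpha\,\frac{1}{P} + (1-\alpha)\,\frac{1}{R}}
  = \frac{(\beta^{2}+1)\,P\,R}{\beta^{2}P + R},
\qquad \beta^{2} = \frac{1-\alpha}{\alpha},
```

so the balanced case β = 1 (i.e. α = 1/2) reduces to F1 = 2PR/(P + R).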

F measure2-4 Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. It is a (weighted) harmonic mean rather than the simpler arithmetic average.

F measure3-4 The harmonic mean is always less than or equal to the arithmetic and geometric means, and is often quite close to the minimum of the two numbers. This strongly suggests that the arithmetic mean is an unsuitable measure, because it is closer to the maximum of the two values than the harmonic mean is: for example, returning all documents gives recall 1 but precision near 0, for an arithmetic mean of about 0.5 yet an F1 close to 0.

F measure4-4

Accuracy1-2 Accuracy is the fraction of classification decisions that are correct. An information retrieval system can be thought of as a two-class classifier (relevant vs. not relevant). Accuracy = (tp + tn)/(tp + fp + fn + tn)

Accuracy2-2 Often used for evaluating machine learning classification problems, but not an appropriate measure for information retrieval problems: normally over 99.9% of the documents are in the not-relevant category, so one can maximize accuracy by simply deeming all documents nonrelevant to all queries; that is to say, tn may be far too large and dominate the measure.
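A small sketch of why accuracy misleads in IR, using made-up counts in which almost every document is nonrelevant:

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection: 1,000,000 documents, only 50 relevant to the query.
# A system that deems *everything* nonrelevant retrieves nothing useful ...
print(accuracy(tp=0, fp=0, fn=50, tn=999_950))    # 0.99995
# ... yet it slightly outscores a system that actually finds 40 of the 50
# relevant documents at the cost of 60 false positives.
print(accuracy(tp=40, fp=60, fn=10, tn=999_890))  # 0.99993
```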

5.4 Evaluation of ranked retrieval results Precision, recall, and the F measure are set-based measures, computed using unordered sets of documents. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.

The types Precision-recall curve Mean Average Precision (MAP) Precision at k R-precision ROC curve

Precision-recall curve1-2

Precision-recall curve2-2 If the (k + 1)th document retrieved is irrelevant, then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right.

How to evaluate the curve? The more sharply the curve drops and the more jagged it is, the less effective the system; curves that stay higher for longer indicate better systems.

Precision-recall curve with interpolated precision Interpolation can remove the jiggles of the curve. The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r. The justification is that almost anyone would be prepared to look at more documents if it would increase the percentage of the viewed set that were relevant.

Precision-recall curve with 11-point interpolated average precision1-2 This boils the curve's information down to a few numbers. The 11 points are the recall levels 0.0, 0.1, 0.2, …, 1.0. For each recall level, we calculate the arithmetic mean of the interpolated precision at that level over all information needs in the test collection.

Precision-recall curve with 11-point interpolated average precision2-2
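A minimal sketch of interpolated precision and the 11-point average for a single ranked result list (the binary relevance labels are hypothetical; across a test collection the interpolated values at each level are further averaged over all information needs):

```python
def pr_points(relevance, num_relevant):
    """(recall, precision) after each retrieved document (1 = relevant)."""
    points, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / k))
    return points

def interpolated_precision(points, r):
    """Highest precision at any recall level >= r (0 if none reaches r)."""
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_average(points):
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11

# Ranked run over 10 documents; 4 relevant documents exist in total.
run = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
pts = pr_points(run, num_relevant=4)
print(eleven_point_average(pts))
```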

Mean Average Precision (MAP)1-3 The most standard measure in the TREC community: a single-figure measure of quality across recall levels, shown to have especially good discrimination and stability.

Mean Average Precision (MAP)2-3 For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved.

Mean Average Precision (MAP)3-3 The set of relevant documents for an information need q_j ∈ Q is {d_1, …, d_{m_j}}; R_{jk} is the set of ranked retrieval results from the top result until you get to document d_k.
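The MAP formula itself is not reproduced in the transcript; the following is a minimal sketch of the definition just given (query runs and relevance judgments are hypothetical examples):

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute a precision of 0,
    which corresponds to dividing by the total number of relevant documents m_j.
    """
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)   # precision of R_jk
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, judgments):
    """MAP(Q): mean of the per-query average precisions."""
    ap = [average_precision(runs[q], judgments[q]) for q in runs]
    return sum(ap) / len(ap)

runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d6", "d9"]}
judgments = {"q1": {"d1", "d2"}, "q2": {"d6", "d8"}}
print(mean_average_precision(runs, judgments))  # (0.5 + 0.25) / 2 = 0.375
```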

The disadvantage of MAP Fixed recall levels are not chosen; precision is factored in at all recall levels (effectively those at which a relevant document is retrieved). Calculated scores normally vary widely across information needs (roughly 0.1 to 0.7), so a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.

Precision at k What matters is how many good results there are on the first page or the first k pages, especially for a search engine. This method measures precision at fixed low levels of retrieved results, such as 10 or 30 documents. It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage that it is the least stable of the commonly used measures.

R-precision1-2 It requires having a set of known relevant documents of size Rel, from which we calculate the precision of the top Rel documents returned. Like precision at k, R-precision describes only one point on the precision-recall curve. At that point precision and recall are equal (both r/Rel, where r is the number of relevant documents among the top Rel), which is identical to the break-even point of the PR curve.

R-precision2-2 By contrast, even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents; because R-precision adapts to the size of the relevant set, averaging this measure across queries makes more sense.
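A minimal sketch of both single-point measures (the ranking and judgment set are hypothetical):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def r_precision(ranking, relevant):
    """Precision over the top |Rel| results, where Rel is the judged relevant set."""
    return precision_at_k(ranking, relevant, len(relevant))

ranking = ["d4", "d1", "d9", "d2", "d7", "d3"]
relevant = {"d1", "d2", "d3"}                # Rel = 3 known relevant documents
print(precision_at_k(ranking, relevant, 5))  # 2/5 = 0.4
print(r_precision(ranking, relevant))        # top 3 -> 1/3
```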

ROC curve1-3 Receiver Operating Characteristic (ROC). An ROC curve plots the true positive rate (sensitivity, i.e. recall) against the false positive rate (1 − specificity). The false positive rate is given by fp/(fp + tn); specificity is given by tn/(fp + tn).

ROC curve2-3 ROC curve always goes from the bottom left to the top right of the graph For a good system, the graph climbs steeply on the left side
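A minimal sketch of how the ROC points can be traced by sweeping the rank cutoff over a single ranked list (the labels and collection counts are hypothetical; only the retrieved prefix is traced here):

```python
def roc_points(relevance, num_relevant, num_nonrelevant):
    """(false positive rate, true positive rate) after each rank cutoff."""
    tp = fp = 0
    points = [(0.0, 0.0)]
    for rel in relevance:
        if rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_nonrelevant, tp / num_relevant))
    return points

# 10 retrieved documents; the whole collection holds 4 relevant, 96 nonrelevant.
run = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
for fpr, tpr in roc_points(run, num_relevant=4, num_nonrelevant=96):
    print(f"FPR={fpr:.3f}  TPR={tpr:.2f}")
```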

ROC curve3-3

5.5 Assessing relevance

About information needs Information needs must be germane to the documents and are best designed by domain experts. Using random combinations of query terms as an information need is generally not a good idea, because typically they will not resemble the actual distribution of information needs.

About assessment1-2 Assessment is a time-consuming and expensive process involving human beings. For tiny collections like Cranfield, exhaustive judgments of relevance for each query-document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query; the most standard approach is pooling, where relevance is assessed over a subset of the collection formed from the top k documents returned by the participating systems.

About assessment2-2 Humans and their relevance judgments are quite variable. But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these variable humans. We do, however, need to measure the agreement of these assessments.

Kappa statistic1-5 It measures how much agreement there is between relevance judgments. P(A) is the proportion of the times the judges agreed; P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated.

Kappa statistic2-5
                 Judge II: Yes   Judge II: No   Total
Judge I: Yes          300             20          320
Judge I: No            10             70           80
Total                 310             90          400

Kappa statistic3-5
P(A) = (300 + 70)/400 = 370/400 = 0.925
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
κ = (P(A) − P(E)) / (1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776
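The same computation, as a small Python sketch (the pooled-marginal estimate of chance agreement follows the slide):

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa for two judges with binary (relevant / nonrelevant) labels."""
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total
    # Pooled marginals, as on the slide: average the two judges' proportions.
    p_rel = (yes_yes + yes_no + yes_yes + no_yes) / (2 * total)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))  # ~0.776
```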

Kappa statistic4-5 Result: a kappa value above 0.8 is taken as good agreement; a value between 0.67 and 0.8 as fair agreement; a value below 0.67 is seen as a dubious basis for evaluation. If there are more than two judges, it is normal to calculate an average pairwise kappa value.

Kappa statistic5-5 The level of agreement in TREC normally falls in the range of "fair" (0.67–0.8), which argues against requiring more fine-grained relevance labeling. These deviations have in general been found to have little impact on the relative effectiveness ranking of systems, despite the variation in individual assessors' judgments.

About relevance Some dubious hypotheses: the relevance of one document is treated as independent of the relevance of other documents in the collection; assessments are binary and not nuanced; the information need is treated as an absolute, objective decision. So any results are heavily skewed by the choice of collection, queries, and relevance judgment set.

Marginal relevance It asks whether a document still has distinctive usefulness after the user has looked at certain other documents. The most extreme case is duplicate documents – a phenomenon that is actually very common on the Web. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty.

5.6 Evaluation of Search Engines A review of nine popular search engines, Huajun Software Park evaluation center (华军软件园测评中心), 2008-07-21

Main types of search services offered

Average search speed

Display of search results

Search performance test: web page search

Search performance test: image search

Search performance test: audio search

Search performance test: video search

Search performance test: file search

Evaluation of the nine major U.S. search engines

Automatic evaluation based on user behavior analysis A combination of the standard test collection approach and the log analysis approach. It objectively reflects changes in search engines' retrieval quality from the perspective of aggregate user behavior, and provides a fast, semi-real-time platform for evaluating the retrieval effectiveness of the major search engines.

Click concentration A feature derived from query history information. Assumption: when different users have the same search need, their clicks concentrate on one page or a few pages. Feature: the distribution of user clicks. For navigational needs the target page is unique; for informational needs the target pages are diverse.
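The slides do not give a formula for click concentration; one simple, hypothetical way to operationalize it is the share of a query's clicks that go to its single most-clicked result (navigational queries should score near 1, informational ones much lower). A sketch with made-up click-log data:

```python
from collections import Counter

def click_concentration(clicked_urls):
    """Share of clicks that go to the most-clicked result for one query."""
    counts = Counter(clicked_urls)
    return max(counts.values()) / sum(counts.values())

# Hypothetical click logs aggregated per query.
navigational = ["imdb.com"] * 48 + ["wikipedia.org"] * 2           # e.g. "imdb"
informational = (["siteA"] * 10 + ["siteB"] * 9 + ["siteC"] * 8
                 + ["siteD"] * 7 + ["siteE"] * 6)                   # e.g. "movie"
print(click_concentration(navigational))   # 0.96 -- clicks concentrate
print(click_concentration(informational))  # 0.25 -- clicks spread out
```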

Click concentration Different search engines show different user click distributions. Example: the query "movie" (informational).