The Principle of Information Retrieval
Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011
II Course Contents
5 Evaluation in information retrieval
How to measure user happiness
The key utility measure is user happiness:
Relevance of results (effectiveness)
Speed of response
Size of the index
User interface design (independent of the quality of the results)
Utility, success, completeness, satisfaction, worth, value, time, cost, …
5.1 Introduction
The role of evaluation
Evaluation plays an important role in information retrieval research.
Evaluation has always been at the core of information retrieval system research and development, to the point that an algorithm and the way its effectiveness is evaluated are treated as one and the same (Saracevic, SIGIR 1995).
Origins of IR system evaluation
Kent et al. were the first to propose the concepts of Precision and Recall (initially referred to as relevance) (Kent, 1955).
Origins of IR system evaluation: the Cranfield-like evaluation methodology
In the late 1950s and early 1960s, the Cranfield experiments introduced an evaluation methodology based on a set of sample queries, a set of standard answers (relevance judgments), and a document corpus; it is regarded as the "grand-daddy" of IR evaluation.
It established the central role of evaluation in information retrieval research.
Cranfield is both a place name and the name of a research institute.
Origins of IR system evaluation: Gerard Salton and the SMART system
Gerard Salton was the principal developer of the SMART system. SMART was the first platform that let researchers focus solely on retrieval algorithms without worrying about indexing and other infrastructure; it also provided an evaluation facility that, given a set of relevance judgments, computed the standard effectiveness measures.
Origins of IR system evaluation: Sparck Jones's book "Information Retrieval Experiment"
The book mainly discusses IR experiments and their evaluation.
How to measure information retrieval effectiveness (1/2)
We need a test collection consisting of three things:
A document collection
A test suite of information needs, expressible as queries
A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair
How to measure information retrieval effectiveness (2/2)
In this test collection:
A document is given a binary classification as either relevant or not relevant
The collection and the suite of information needs have to be of a reasonable size
Results are highly variable over different documents and information needs, so at least 50 information needs are needed
Difficulties: the difference between the stated information need and the query
Relevance is assessed relative to an information need, not a query
The subjectivity of relevance decisions
Many systems contain various parameters that can be adjusted to tune system performance
The correct procedure is to have one or more development test collections for such tuning
Difficulties
Voorhees estimated that judging the relevance of a collection of 8 million documents for a single query topic would take one assessor about nine months.
TREC introduced the pooling method, which greatly reduces the judging effort while keeping the evaluation results reliable.
Drawback: only a small number of queries can be handled; even for a small query set, more than ten assessors still need one to two months.
5.2 Standard test collections
The Cranfield collection
Text Retrieval Conference (TREC)
GOV2
NII Test Collections for IR Systems (NTCIR)
Reuters-21578 and Reuters-RCV1
The Cranfield collection
Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries.
It allows precise quantitative measures, but it is too small to be representative of modern collections.
Text Retrieval Conference (TREC) (1/2)
The U.S. National Institute of Standards and Technology (NIST) has run a large IR test-bed evaluation series since 1992, over a range of different test collections.
The best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999.
They comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles).
Text Retrieval Conference (TREC) (2/2)
TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles.
Relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation.
GOV2
Contains 25 million web pages crawled from the .gov domain.
GOV2 is one of the largest Web test collections, but it is still more than 3 orders of magnitude smaller than the document collections currently indexed by the large web search companies.
NII Test Collections for IR Systems (NTCIR)
Of similar size to the TREC collections, focusing on East Asian languages and cross-language information retrieval.
Reuters-21578 and Reuters-RCV1
Reuters collections of newswire articles, most used for text classification.
Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents.
5.3 Evaluation of unranked retrieval sets
Precision
Recall
Accuracy
F measure
Precision
Precision (P) is the fraction of retrieved documents that are relevant.
Recall
Recall (R) is the fraction of relevant documents that are retrieved.
Another way to define P and R
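The formulas from the original slide are not reproduced in this transcript; the set-based definitions they presumably showed are the standard ones:

P = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
R = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|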
The comparison of P and R
Typical web surfers prefer P to R
Various professional searchers such as paralegals and intelligence analysts prefer R to P
Individuals searching their hard disks also prefer R to P
The comparison of P and R
The two quantities clearly trade off against one another:
Recall is a non-decreasing function of the number of documents retrieved
One can always get a recall of 1 (but very low precision) by retrieving all documents for all queries
Precision usually decreases as the number of documents retrieved is increased
Which is more difficult to measure?
Precision
Recall
Questioning recall
The denominator (the total number of relevant documents in the collection) cannot be determined, so recall, which is built on it, is impractical in theory.
What are the reasons that relevant documents fail to be retrieved?
Is it the quality of the database design or of the user's searching? Is it a problem of indexing or of retrieval? Does every database have some inherent coefficient of relevance?
Questioning precision
When a database contains many documents that should have been retrieved but were not, is it meaningful that the documents actually retrieved are 100% relevant?
How do the irrelevant documents in the denominator of precision arise?
Are they produced by the system? Caused by the user's poorly expressed query? Or determined by the user's final selection?
Relative precision
Questioning the relationship between recall and precision (1/2)
a: relevant and retrieved
b: irrelevant and retrieved
c: relevant and unretrieved
(remaining region: irrelevant and unretrieved)
Questioning the relationship between recall and precision (2/2)
The two are generally assumed to be inversely related: if recall a/(c+a) increases, c must decrease.
There are two ways c can decrease: the c boundary moves down or the b boundary moves right; either way b increases, so precision a/(b+a) decreases, and vice versa.
But this rests on the assumption that a is a constant.
In fact the constant is not a but (c+a): if c decreases, a must increase, so a is a variable.
Therefore there is no necessary link between b and c; one can imagine the b and c boundaries moving toward the edges at the same time. Having both b and c equal to 0 is unlikely, but not impossible.
In fact, a retrieval system can improve both measures at the same time.
Contingency table

                 Relevant                 Not relevant
Retrieved        true positives (tp)      false positives (fp)
Not retrieved    false negatives (fn)     true negatives (tn)
The other way to define P and R
P = tp/(tp + fp)
R = tp/(tp + fn)
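As a minimal sketch (a hypothetical helper, not part of the original slides), these contingency-table definitions translate directly into code:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and balanced F1 from contingency-table counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 30 documents retrieved, 20 of them relevant, 40 relevant documents missed.
print(precision_recall_f1(tp=20, fp=10, fn=40))  # approximately (0.667, 0.333, 0.444)
```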
F measure (1/4)
A single measure that trades off precision versus recall.
α ∈ [0, 1] and thus β² ∈ [0, ∞]; the default balanced F measure uses β = 1.
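The formula itself does not survive in this transcript; the standard definition (with the balanced case for β = 1) is:

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α
F(β=1) = 2PR / (P + R)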
F measure (2/4)
Values of β < 1 emphasize precision, while values of β > 1 emphasize recall.
It is a harmonic mean rather than the simpler arithmetic average.
F measure (3/4)
The harmonic mean is always less than or equal to the arithmetic and geometric means, and is often quite close to the minimum of the two numbers.
This strongly suggests that the arithmetic mean is an unsuitable measure, because it stays closer to the maximum of the two values than the harmonic mean does.
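A small worked illustration (numbers invented for the example): if a system simply returns every document, then R = 1 while P might be only 0.0001. The arithmetic mean (1 + 0.0001)/2 ≈ 0.5 looks respectable, whereas F1 = 2PR/(P + R) ≈ 0.0002 stays close to the minimum, which is the behaviour we want from a combined measure.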
F measure (4/4)
Accuracy (1/2)
An information retrieval system can be thought of as a two-class classifier; accuracy is the fraction of its classifications that are correct.
Accuracy = (tp + tn)/(tp + fp + fn + tn)
Accuracy (2/2)
Often used for evaluating machine learning classification problems, but not an appropriate measure for information retrieval problems.
Normally over 99.9% of the documents are in the not-relevant category.
A system can therefore maximize accuracy by simply deeming all documents non-relevant to all queries; that is to say, tn can be very large.
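A hypothetical illustration of why this matters: in a collection of 1,000,000 documents with 1,000 relevant to a query, a system that retrieves nothing has tp = 0, fp = 0, fn = 1,000, tn = 999,000, so Accuracy = 999,000/1,000,000 = 0.999 even though its recall is 0.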
5.4 Evaluation of ranked retrieval results
Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents.
In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.
The types
Precision-recall curve
Mean Average Precision (MAP)
Precision at k
R-precision
ROC curve
Precision-recall curve (1/2)
Precision-recall curve (2/2)
If the (k + 1)th document retrieved is irrelevant, then recall is the same as for the top k documents, but precision has dropped.
If it is relevant, then both precision and recall increase, and the curve jags up and to the right.
How to evaluate the curve? The more sharply the curve drops (the more cliff-like it is), the less effective the system.
Precision-recall curve with interpolated precision
Interpolation can remove the jiggles of the curve.
The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r.
The justification is that almost anyone would be prepared to look at more documents if it would increase the percentage of the viewed set that were relevant.
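Written as a formula (using r′ for the higher recall level that the slide calls q):

p_interp(r) = max_{r′ ≥ r} p(r′)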
Precision-recall curve with 11-point interpolated average precision (1/2)
This boils the curve's information down to a few numbers.
The 11 points are the recall levels 0.0, 0.1, 0.2, …, 1.0.
For each recall level, we calculate the arithmetic mean of the interpolated precision at that level over all information needs in the test collection.
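Under the usual convention, the value reported at each recall level r ∈ {0.0, 0.1, …, 1.0} is

(1/N) Σ_{j=1..N} p_interp,j(r)

where N is the number of information needs and p_interp,j is the interpolated precision curve for the j-th information need.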
Precision-recall curve with 11-point interpolated average precision (2/2)
Mean Average Precision (MAP) (1/3)
The most standard measure among the TREC community.
A single-figure measure of quality across recall levels.
It has especially good discrimination and stability.
Mean Average Precision (MAP) (2/3)
For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved.
Mean Average Precision (MAP) (3/3)
The set of relevant documents for an information need q_j ∈ Q is {d_1, …, d_{m_j}}.
R_jk is the set of ranked retrieval results from the top result down to document d_k.
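Putting the pieces together, the standard formula (as in the IIR textbook) is

MAP(Q) = (1/|Q|) Σ_{j=1..|Q|} (1/m_j) Σ_{k=1..m_j} Precision(R_jk)

where a relevant document that is never retrieved contributes a precision of 0.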
The disadvantage of MAP
Fixed recall levels are not chosen; the averaged precision values correspond to all recall levels.
Calculated scores normally vary widely across information needs (e.g., from 0.1 to 0.7).
So a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.
Precision at k
What matters is how many good results there are on the first page or the first k pages, especially for search engines.
This method measures precision at fixed low levels of retrieved results, such as 10 or 30 documents.
It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage that it is the least stable of the commonly used measures.
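A minimal sketch of these ranked measures (hypothetical function names; binary relevance lists in rank order are assumed):

```python
from typing import Sequence

def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Precision over the top k results; rels is a 0/1 relevance list in rank order."""
    return sum(rels[:k]) / k

def average_precision(rels: Sequence[int], num_relevant: int) -> float:
    """Average of precision@rank taken at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query_rels: Sequence[Sequence[int]],
                           rel_counts: Sequence[int]) -> float:
    """MAP over a set of queries (one relevance list and relevant-document count per query)."""
    aps = [average_precision(r, n) for r, n in zip(per_query_rels, rel_counts)]
    return sum(aps) / len(aps)

# Example: relevant documents at ranks 1, 3 and 5, with 4 relevant documents in total.
print(precision_at_k([1, 0, 1, 0, 1], k=5))   # 0.6
print(average_precision([1, 0, 1, 0, 1], 4))  # (1/1 + 2/3 + 3/5) / 4 ≈ 0.567
```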
R-precision (1/2)
It requires having a set of known relevant documents of size Rel, from which we calculate the precision of the top Rel documents returned.
Like Precision at k, R-precision describes only one point on the precision-recall curve.
If r of the top Rel documents are relevant, then R-precision and R-recall are both equal to r/Rel, which is identical to the break-even point of the PR curve.
R-precision (2/2)
But even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents.
Averaging this measure across queries thus makes more sense.
ROC curve (1/3): Receiver Operating Characteristic (ROC)
The ROC curve plots the true positive rate (sensitivity, i.e. recall) against the false positive rate (1 − specificity).
The false positive rate is given by fp/(fp + tn).
Specificity is given by tn/(fp + tn).
ROC curve (2/3)
An ROC curve always goes from the bottom left to the top right of the graph.
For a good system, the graph climbs steeply on the left side.
ROC curve (3/3)
5.5 Assessing relevance
About information needs
Information needs must be germane to the documents.
Information needs are best designed by domain experts.
Using random combinations of query terms as an information need is generally not a good idea, because typically they will not resemble the actual distribution of information needs.
About assessment (1/2)
Assessment is a time-consuming and expensive process involving human beings.
For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained.
For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query.
The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by the systems taking part in the evaluation.
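A minimal sketch of how such a pool might be assembled (toy data structures, not TREC's actual tooling):

```python
def build_pool(runs: dict[str, list[str]], k: int = 100) -> set[str]:
    """Union of the top-k document IDs from each submitted run; only pooled documents are judged."""
    pool: set[str] = set()
    for system_name, ranked_docids in runs.items():
        pool.update(ranked_docids[:k])
    return pool

# Toy example with two runs for one query.
runs = {"systemA": ["d3", "d7", "d1"], "systemB": ["d7", "d9", "d2"]}
print(sorted(build_pool(runs, k=2)))  # ['d3', 'd7', 'd9']
```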
About assessment (2/2)
Humans and their relevance judgments are quite variable.
But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these variable humans.
We do, however, need to measure the agreement of these assessments.
Kappa statistic (1/5)
It measures how much agreement there is between relevance judgments.
P(A) is the proportion of the times the judges agreed.
P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated.
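The statistic itself, as used in the worked example below, is

kappa = (P(A) − P(E)) / (1 − P(E))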
Kappa statistic (2/5)

                 Judge II: Yes   Judge II: No   Total
Judge I: Yes          300             20         320
Judge I: No            10             70          80
Total                 310             90         400
Kappa statistic (3/5)
P(A) = (300 + 70)/400 = 370/400 = 0.925
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
kappa = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776
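A small sketch that reproduces the calculation above (a hypothetical helper, not a standard library call; the pooled-marginal estimate of P(E) follows the slide):

```python
def kappa_two_judges(yes_yes: int, yes_no: int, no_yes: int, no_no: int) -> float:
    """Kappa agreement for two judges making binary relevant/nonrelevant judgments."""
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total
    # Marginals pooled over both judges, as in the slide's calculation of P(E).
    p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * total)
    p_nonrelevant = ((yes_no + no_no) + (no_yes + no_no)) / (2 * total)
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa_two_judges(300, 20, 10, 70), 3))  # 0.776
```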
Kappa statistic (4/5): interpreting the result
A kappa value above 0.8 is taken as good agreement.
A kappa value between 0.67 and 0.8 is taken as fair agreement.
A kappa value below 0.67 is seen as a dubious basis for evaluation.
If there are more than two judges, it is normal to calculate an average pairwise kappa value.
Kappa statistic (5/5)
The level of agreement in TREC normally falls in the range of "fair" (0.67–0.8).
This suggests not requiring more fine-grained relevance labeling.
These deviations have in general been found to have little impact on the relative effectiveness ranking of systems, despite the variation in individual assessors' judgments.
About relevance: some dubious assumptions
The relevance of one document is treated as independent of the relevance of other documents in the collection.
Assessments are binary rather than nuanced.
The information need is treated as an absolute, objective decision.
So any results are heavily skewed by the choice of collection, queries, and relevance judgment set.
Marginal relevance
It means whether a document still has distinctive usefulness after the user has looked at certain other documents.
The most extreme case is duplicate documents, a phenomenon that is actually very common on the Web.
Maximizing marginal relevance requires returning documents that exhibit diversity and novelty.
5.6 Evaluation of Search Engines
An evaluation of nine popular search engines, by the Huajun Software Park (华军软件园) evaluation center
Main types of search services offered
Average search speed
Display of search results
Search performance tests: web page search
Search performance tests: image search
Search performance tests: audio search
Search performance tests: video search
Search performance tests: file search
An evaluation of the nine major U.S. search engines
Automatic evaluation based on user behavior analysis
A combination of the standard test collection approach and query log analysis.
It objectively reflects changes in search engine retrieval quality from the perspective of aggregate user behavior.
It provides a fast, near real-time platform for evaluating the retrieval effectiveness of the major search engines.
Click concentration
A feature derived from the query history.
Assumption: when different users share the same information need, their clicks concentrate on one web page or a few pages.
Feature: the distribution of user clicks.
For navigational needs the target page is unique; for informational needs the target pages are diverse. (A small illustrative sketch follows below.)
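One way such a click-concentration feature could be computed from a query log; this is only an illustrative sketch under an assumed definition (the fraction of clicks going to the single most-clicked result), not necessarily the definition used in the study:

```python
from collections import Counter

def click_concentration(clicked_urls: list[str]) -> float:
    """Fraction of all clicks for one query that go to the single most-clicked URL.
    Values near 1.0 suggest a navigational query; lower values suggest an informational one."""
    counts = Counter(clicked_urls)
    return max(counts.values()) / len(clicked_urls)

# Toy example: most users click the same result for a navigational query.
print(click_concentration(["url1", "url1", "url1", "url2"]))  # 0.75
```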
Click concentration
Different search engines show different user click distributions.
Example: the query "movie" (an informational need).