Using score distributions to compare statistical significance tests for information retrieval evaluation
2019
Journal of the Association for Information Science and Technology
Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. ...
This new method for studying the power of significance tests in Information Retrieval evaluation is formal and innovative. ...
We also thank the anonymous reviewers for their really useful suggestions and comments. ...
doi:10.1002/asi.24203
fatcat:jgohta6wmvfhbm3owh4zdoq42u
Using statistical testing in the evaluation of retrieval experiments
1993
Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '93
A number of different statistical tests are described for determining if differences in performance between retrieval methods are significant. ...
However, one can test this assumption using simple diagnostic plots, and if it is a poor approximation, there are a number of non-parametric alternatives. ...
The value of this statistic can be compared to the F-distribution, which is the distribution of these scores if all retrieval methods are equally effective. ...
doi:10.1145/160688.160758
dblp:conf/sigir/Hull93
fatcat:4tiviqm5pfbpdpulvcadd67f3q
By the power of Grayskull
2014
Proceedings of the 2014 Australasian Document Computing Symposium on - ADCS '14
Information Retrieval evaluation is typically performed using a sample of queries, and a statistical hypothesis test is used to make inferences about the systems' accuracy on the population of queries. ...
Research has shown that the t test is one of a set of tests that provides the greatest statistical power while maintaining acceptable type I error rates, when evaluating with a large sample of queries. ...
Acknowledgement The authors thank Falk Scholer for his comments and advice on Information Retrieval evaluation methods. ...
doi:10.1145/2682862.2682878
dblp:conf/adcs/ParkS14
fatcat:jyvegogm25bcxd2qnoap2eonru
Evaluation Metrics and Evaluation
[chapter]
2018
Clinical Text Mining
First, the scientific base for evaluation of all information retrieval systems, called the Cranfield paradigm, will be described. ...
Statistical significance testing will be presented. This chapter will also discuss manual annotation and inter-annotator agreement, annotation tools such as BRAT and the gold standard. ...
topics are used for the evaluation of information retrieval. ...
doi:10.1007/978-3-319-78503-5_6
fatcat:v5mykkmvhrf4xlzcrpwmoi4sdy
Evaluating the Interest of Revamping Past Search Results
[chapter]
2013
Lecture Notes in Computer Science
Exponential and Zipf distribution as well as Bradford's law are applied to construct simulated document collections suitable for information retrieval evaluation. ...
In this paper we present two contributions: a method to construct simulated document collections suitable for information retrieval evaluation as well as an approach of information retrieval using past ...
Then, we applied the Student's paired sample t-test to test if the difference between the two compared approaches with regards to P@10 was statistically significant. ...
doi:10.1007/978-3-642-40173-2_9
fatcat:r5eyqwcncfhppnblas4qn4w2cy
Measuring the Variability in Effectiveness of a Retrieval System
[chapter]
2010
Lecture Notes in Computer Science
A typical evaluation of a retrieval system involves computing an effectiveness metric, e.g. average precision, for each topic of a test collection and then using the average of the metric, e.g. mean average ...
However, averages do not capture all the important aspects of effectiveness and, used alone, may not be an informative measure of systems' effectiveness. ...
Acknowledgements The authors thank Jun Wang and Jianhan Zhu of UCL and Stephan Robertson of Microsoft Research Cambridge for useful discussion on earlier drafts of this paper.
doi:10.1007/978-3-642-13084-7_7
fatcat:bmz6ae4ui5aq3m4n32s2xr6igu
A comparison of statistical significance tests for information retrieval evaluation
2007
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM '07
Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. ...
For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average ...
The IR researcher should select a significance test that uses the same test statistic as the researcher is using to compare systems. ...
doi:10.1145/1321440.1321528
dblp:conf/cikm/SmuckerAC07
fatcat:yysoebqcxvaxjcuegw3mtdetdm
Combining Multiple Strategies for Effective Monolingual and Cross-Language Retrieval
2004
Information retrieval (Boston)
This paper describes and evaluates different retrieval strategies that are useful for search operations on document collections written in various European languages, namely French, Italian, Spanish and ...
In order to cross language barriers, we propose a combined query translation approach that has resulted in interesting retrieval effectiveness. ...
The author would like to thank the three anonymous referees for their helpful suggestions and remarks. ...
doi:10.1023/b:inrt.0000009443.51912.e7
fatcat:vpdkttg7y5bijmnzlmh5wkar7u
How Significant Is Statistically Significant? The Case Of Audio Music Similarity And Retrieval
2012
Zenodo
Evaluation experiments are the main research tool in Information Retrieval (IR) to determine which systems perform well and which perform poorly for a given task [1]. ...
Thus, observing a statistically significant difference does not mean that the systems really are different, in fact ...
doi:10.5281/zenodo.1418054
fatcat:m5c5dettxbaq3ktscpnp3xsgom
STATISTICAL SIGNIFICANCE IN MULTILINGUAL INFORMATION RETRIEVAL (MLIR) SYSTEM
2012
IOSR Journal of Engineering
Significance tests are often used to estimate the reliability of such comparisons. In this research paper, we revisit the question of how such significance tests should be used. ...
The efficiency of a retrieval system is measured by comparing performance on a regular set of queries in Information Retrieval (IR) and MLIR systems. ...
Test collections are the principal tool used for comparison and evaluation of retrieval systems. ...
doi:10.9790/3021-0204794802
fatcat:furywk2wzzbdbhyyuqzzenq5hu
Investigating the exhaustivity dimension in content-oriented XML element retrieval evaluation
2006
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06
This paper attempts to answer this question through extensive statistical tests to compare the conclusions about system performance that could be made under different assessment scenarios. ...
INEX, the evaluation initiative for content-oriented XML retrieval, has since its establishment defined the relevance of an element according to two graded dimensions, exhaustivity and specificity. ...
Acknowledgements The INEX initiative is an activity of DELOS, a network of excellence for digital libraries. Paul Ogilvie was funded in part by NSF grant IIS-0534345. ...
doi:10.1145/1183614.1183631
dblp:conf/cikm/OgilvieL06
fatcat:fj2ik7me3rdzpm74nzycrdwsha
Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing
[chapter]
2011
Lecture Notes in Computer Science
Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2N2P) test, we statistically compare the frequency of occurrence of a word within a given ...
Furthermore, we give a formal statistical justification for such methods. ...
Our goal is to test whether retrieval speed and effectiveness are substantially affected by pruning using the 2N2P tests, and to compare those tests to the baseline. ...
doi:10.1007/978-3-642-20161-5_54
fatcat:s27nko6ravhdlm2r23bbharfsy
Does degree of work task completion influence retrieval performance?
2010
Proceedings of the American Society for Information Science and Technology
Also, with the exception of full text records and across all document types, both measured at rank 10, no statistically significant correlation is observed with respect to retrieval performance influenced ...
In this contribution we investigate the potential influence between assessors' perceived completion of their work task at hand and their actual assessment of usefulness of the retrieved information. ...
up to rank 30, and statistically significant at nDCG10: when work tasks are perceived 'Not Complete' the usefulness score of the retrieved documents is indeed lower than for tasks felt 'Somewhat Complete ...
doi:10.1002/meet.14504701321
fatcat:qd4poaxlireqffcz45qzhvyjci
User-Centered Measures Vs. System Effectiveness In Finding Similar Songs
2012
Zenodo
We also thank the IMIRSEL in the University of Illinois for providing the MIREX AMS data. ...
Many of these studies used TREC (Text Retrieval Conference) evaluation results to select systems to be evaluated by users and to obtain data on system effectiveness. ...
does not assume normal distribution of tested variables. ...
doi:10.5281/zenodo.1416868
fatcat:lsuwxvvdlra6robk6j6uc7xhiq
A Comparative User Study of Web Search Interfaces: HotMap, Concept Highlighter, and Google
2006
2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06)
We suggest the use of information visualization and interactive visual manipulation as methods for improving the ability of users to evaluate the results of a web search. ...
Users of traditional web search engines commonly find it difficult to evaluate the results of their web searches. ...
For Task 1, the differences in the perceived precision scores proved to be statistically significant; for Task 2, the differences proved to not be statistically significant. ...
doi:10.1109/wi.2006.6
dblp:conf/webi/HoeberY06
fatcat:nfwuiqpx7bhk3pps3a3iyjyq24