2,573 Hits in 2.5 sec

Lessons from surgery and anaesthesia: evaluation of non-technical skills in interventional radiology

Chun L Pang, Salil B Patel, Nicola Pilkington
2015 JRSM Open  
In the medical profession, surgery and anaesthesia are leading the way in identifying human errors that negatively affect patient safety.  ...  This literature review supports the use of standardised assessment tools used in surgery and anaesthesia.  ...  Considering the inter-rater agreement variation (overall ICC < 0.8), NTS assessment in IR should be considered as an addition to existing assessments, rather than an individual 'high stakes' test.  ... 
doi:10.1177/2054270415611834 pmid:26664733 pmcid:PMC4668915 fatcat:lgjr47hwqnfvtep3afaqtvto4q

Relevance dimensions in preference-based IR evaluation

Jinyoung Kim, Gabriella Kazai, Imed Zitouni
2013 Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval - SIGIR '13  
In this paper, we investigate how assessors determine their preference for one list of results over another, with the aim of understanding the role of various relevance dimensions in preference-based evaluation  ...  Evaluation of information retrieval (IR) systems has recently been exploring the use of preference judgments over two search result lists.  ...  In spite of these, relevance has been proven to be a reliable quantity in comparative IR evaluation [14].  ... 
doi:10.1145/2484028.2484168 dblp:conf/sigir/KimKZ13 fatcat:5pxtfbvpire5nh7vxz2zu6g5e4

Relevance & Assessment: Cognitively Motivated Approach toward Assessor-Centric Query-Topic Relevance Model

2018 Acta Polytechnica Hungarica  
However, this presentation proceeds from an assessor-oriented model considering the cognitive aspect and the multidimensionality of relevance; in this sense, it is considered as a multidimensional cognitive  ...  Furthermore, classifying query relevance datasets according to grades of agreement among judgments is useful, as it gives a better overview of the performance of the considered system and the comparison  ...  They observed 45% agreement with TREC relevance. [2] found agreement as high as 65% with the official TREC judgments in an interactive IR experiment.  ... 
doi:10.12700/aph.15.5.2018.5.8 fatcat:7knqqyr37rarrnqv55tdsfjf2q

Augmented Test Collections: A Step in the Right Direction [article]

Laura Hasler, Martin Halvey, Robert Villa
2015 arXiv   pre-print
We propose enhancing test collections used in evaluation with information related to human assessors and their interpretation of the task.  ...  In this position paper we argue that certain aspects of relevance assessment in the evaluation of IR systems are oversimplified and that human assessments represented by qrels should be augmented to take  ...  differences in assessor judgements (i.e. as captured by inter-assessor agreement metrics).  ... 
arXiv:1501.06370v1 fatcat:he46ekn32zeppi7un5yz2gnovy

Better than Their Reputation? On the Reliability of Relevance Assessments with Students [chapter]

Philipp Schaer
2012 Lecture Notes in Computer Science  
In this study we do not focus on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability.  ...  We use the two agreement measures to drop overly unreliable assessments from our data set.  ...  the evaluation systems.  ... 
doi:10.1007/978-3-642-33247-0_14 fatcat:ma24j4hpsndcnpfq2uw7ue6fgy

On the impact of domain expertise on query formulation, relevance assessment and retrieval performance in clinical settings

Lynda Tamine, Cecile Chouquet
2017 Information Processing & Management  
The findings of this study present opportunities for the design of personalized health-related IR systems, and also provide insights about the evaluation of such systems.  ...  This article focuses on the extent to which expertise can impact clinical query formulation, document relevance assessment and retrieval performance in the context of tailoring retrieval models and systems  ...  The second question relates to assessor agreement, which in contrast impacts system rankings and was addressed in early IR work by Lesk and Salton (1968).  ... 
doi:10.1016/j.ipm.2016.11.004 fatcat:ekgnxyvzurcbha35n64jxs7o5a

Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collections Accurately and Affordably [article]

Mucahid Kutlu, Tyler McDonnell, Aashish Sheshadri, Tamer Elsayed, Matthew Lease
2018 arXiv   pre-print
However, crowd assessors may show higher variance in judgment quality than trusted assessors. In this paper, we investigate how to effectively utilize both groups of assessors in partnership.  ...  We specifically investigate how agreement in judging is correlated with three factors: relevance category, document rankings, and topical variance.  ...  By convention, τ = 0.9 is assumed to constitute an acceptable correlation level for reliable IR evaluation [30] . Results are shown in Figure 3 .  ... 
arXiv:1806.00755v3 fatcat:co2we7y3x5c5ldeaucmplfmjmy
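The τ = 0.9 threshold cited in this abstract refers to Kendall's rank correlation between the system orderings induced by two sets of judgments. A minimal sketch of the computation, assuming invented effectiveness scores for five hypothetical systems (tau-a, ignoring ties):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # pair ordered the same way by both score lists
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical MAP scores for five systems under official vs. crowd judgments
official = [0.31, 0.28, 0.45, 0.22, 0.39]
crowd = [0.29, 0.30, 0.44, 0.20, 0.41]
print(kendall_tau(official, crowd))  # → 0.8, below the 0.9 reliability threshold
```

In practice `scipy.stats.kendalltau` (tau-b, which corrects for ties) is the usual choice; the pure-Python version above only illustrates the pairwise-concordance idea.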

Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Parnia Samimi, Sri Devi Ravana
2014 The Scientific World Journal  
One of the crowdsourcing applications in IR is to judge the relevance of query-document pairs.  ...  Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experimentation.  ...  Table 3 summarizes four common methods suggested to calculate the inter-rater agreement between crowdsourcing workers and human assessors for relevance judgment in IR evaluation [20]. Alonso et al.  ... 
doi:10.1155/2014/135641 pmid:24977172 pmcid:PMC4055211 fatcat:qfbavfc45jfmzp4yisqyedx2o4

Evaluating aggregated search pages

Ke Zhou, Ronan Cummins, Mounia Lalmas, Joemon M. Jose
2012 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval - SIGIR '12  
Aggregating search results from a variety of heterogeneous sources or verticals such as news, image and video into a single interface is a popular paradigm in web search.  ...  This paper proposes a general framework for evaluating the quality of aggregated search pages.  ...  Any opinions, findings, and recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsors.  ... 
doi:10.1145/2348283.2348302 dblp:conf/sigir/ZhouCLJ12 fatcat:b6yv5tsoongh5mnelokcfp3jpe

A comparison of user and system query performance predictions

Claudia Hauff, Diane Kelly, Leif Azzopardi
2010 Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10  
The question we consider is whether the predictions of query performance that systems make are in line with the predictions that users make.  ...  Query performance prediction methods are usually applied to estimate the retrieval effectiveness of queries, where the evaluation is largely system-sided.  ...  When it comes to the agreement between assessors at the query suggestion level, there was less agreement; in the topic-level experiments the median agreement reached κ = 0.36.  ... 
doi:10.1145/1871437.1871562 dblp:conf/cikm/HauffKA10 fatcat:ieo43ycyard6xozgwpxsxdfrou
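The κ reported in this entry is Cohen's kappa, chance-corrected agreement between two raters over the same items. A minimal sketch, with made-up binary relevance judgments (the assessor labels below are invented for illustration):

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters labelling the same n items."""
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    # Chance agreement: product of each rater's marginal label proportions
    labels = set(r1) | set(r2)
    expected = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Invented binary relevance judgments from two assessors on eight documents
assessor_1 = [1, 1, 0, 0, 1, 0, 1, 0]
assessor_2 = [1, 0, 0, 0, 1, 1, 1, 0]
print(cohens_kappa(assessor_1, assessor_2))  # → 0.5 (moderate agreement)
```

Raw percent agreement here is 75%, but kappa discounts the 50% agreement expected by chance from these marginals, which is why reported κ values (such as the 0.36 above) are lower than raw agreement percentages.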

RELIABILITY AND VALIDITY OF THE HALO DIGITAL GONIOMETER FOR SHOULDER RANGE OF MOTION IN HEALTHY SUBJECTS

Sarah Correll, Jennifer Field, Heather Hutchinson, Gabby Mickevicius, Amber Fitzsimmons, Betty Smoot
2018 International Journal of Sports Physical Therapy  
The ICCs for agreement, comparing the HALO digital goniometer to the UG ranged from 0.79 to 0.99.  ...  Two evaluators measured each motion twice with each device (HALO and the UG) per shoulder.  ...  agreement.  ... 
pmid:30140564 pmcid:PMC6088125 fatcat:sbmchrf4pney7mqnys7dthg2oy

Effects of Inconsistent Relevance Judgments on Information Retrieval Test Results: A Historical Perspective

Tefko Saracevic
2008 Library Trends  
A historical context for these studies and for IR testing is provided, including an assessment of Lancaster's (1969) evaluation of MEDLARS and its unique place in the history of IR evaluation.  ...  as the gold standard for performance evaluation.  ...  IR evaluation was.  ... 
doi:10.1353/lib.0.0000 fatcat:uv36lpme3va6phzyy5mqvull4u

Quantifying test collection quality based on the consistency of relevance judgements

Falk Scholer, Andrew Turpin, Mark Sanderson
2011 Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11  
Relevance assessments are a key component for test collection-based evaluation of information retrieval systems.  ...  While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by the decisions made on previously seen similar documents  ...  Previous work on assessor consistency considered the level of agreement between system orderings when evaluation measures were calculated based on relevance judgements from different assessors.  ... 
doi:10.1145/2009916.2010057 dblp:conf/sigir/ScholerTS11 fatcat:22o2aqcozfh65dgs4kbybdbj2q

Accurate user directed summarization from existing tools

Mark Sanderson
1998 Proceedings of the seventh international conference on Information and knowledge management - CIKM '98  
The techniques proved to have a wider utility, however, as the summarizer was one of the better performing systems in the SUMMAC evaluation.  ...  The design of this summarizer is presented with a range of evaluations: both those provided by SUMMAC as well as a set of preliminary, more informal, evaluations that examined additional aspects of the  ...  Table 6: Changes in inter-assessor agreement at different rank positions.  ... 
doi:10.1145/288627.288640 dblp:conf/cikm/Sanderson98 fatcat:tdj5wdztkrahnl33wbrng63m6a

Overview of the WiQA Task at CLEF 2006 [chapter]

Valentin Jijkoun, Maarten de Rijke
2007 Lecture Notes in Computer Science  
Going beyond traditional factoid questions, the task considered at WiQA 2006 was, given a source page from Wikipedia, to identify snippets from other Wikipedia pages, possibly in languages different  ...  Our main findings are two-fold: (i) while challenging, the tasks considered at WiQA are do-able, as participants achieved impressive scores measured in terms of yield, mean reciprocal rank, and precision  ...  In this overview we first provide a description of the tasks considered and of the evaluation and assessment procedures (Section 2).  ... 
doi:10.1007/978-3-540-74999-8_33 fatcat:ppqh5s2p7ffi5g5vizutr25wyi
Showing results 1 — 15 out of 2,573 results