User performance versus precision measures for simple search tasks
2006
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06
Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision-based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. ...
Hersh et al. investigated whether batch and user evaluations give the same results for an instance recall task [13]. ...
doi:10.1145/1148170.1148176
dblp:conf/sigir/TurpinS06
fatcat:gbkbtfadhzabhcsfkszsxctryq
Do batch and user evaluations give the same results?
2000
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '00
Our results showed the weighting scheme giving beneficial results in batch studies did not do so with real users. ...
Further analysis did identify other factors predictive of instance recall, including the number of documents saved by the user, document recall, and the number of documents seen by the user. ...
The question of whether these two approaches to evaluation give the same results must ultimately be answered by further experiments that use a larger number of queries and more ...
doi:10.1145/345508.345539
dblp:conf/sigir/HershTPCKSO00
fatcat:d6ybdwpfmjgbbkap5na7u3dtcy
Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach
[article]
2020
arXiv
pre-print
The arbitrator's decision is made with the assistance of two ancillary imaginator models: a wait imaginator and an answer imaginator. ...
The answer imaginator, nevertheless, struggles to predict the answer of the dialogue system and convince the arbitrator that it's a superior choice to answer the users' query immediately. ...
... than a single utterance. This will create a critical dilemma for dialogue systems, in which the system is not sure whether it should wait for further input from the user or simply answer ...
arXiv:2005.13119v1
fatcat:bbf3flqtc5hu5oujxjbc7pcj24
Metric and Relevance Mismatch in Retrieval Evaluation
[chapter]
2009
Lecture Notes in Computer Science
When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks. ...
In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics, and their relationship with user performance on a common web search task. ...
Investigations by Hersh and Turpin found no relationship between MAP and user performance on an instance recall task [8], or a question answering task [18]. ...
doi:10.1007/978-3-642-04769-5_5
fatcat:qg2gvnwkuree7h6pgo4bjh7afa
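As a reference for the two metrics named in this entry, here is a minimal sketch of the standard P@k and DCG@k definitions (illustrative Python, not code from the paper; the log2 rank discount is the common convention):

```python
import math

def precision_at_k(rels, k=1):
    """P@k: fraction of the top-k results that are relevant (rels are 0/1)."""
    return sum(rels[:k]) / k

def dcg_at_k(gains, k=1):
    """DCG@k with a log2 discount: sum over the top k of gain_r / log2(r + 1),
    where r is the 1-based rank (i below is 0-based, hence log2(i + 2))."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

# At k=1 both metrics reduce to the relevance of the top-ranked document,
# which is why DCG@1 and P@1 track each other on binary judgments.
assert precision_at_k([1, 0, 1], k=1) == 1.0
assert dcg_at_k([1, 0, 1], k=1) == 1.0
```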
Using interview data to identify evaluation criteria for interactive, analytical question-answering systems
2007
Journal of the American Society for Information Science and Technology
The purpose of this work is to identify potential evaluation criteria for interactive, analytical question-answering (QA) systems by analyzing evaluative comments made by users of such a system. ...
These data were collected as part of an intensive, three-day evaluation workshop of the High-Quality Interactive Question Answering (HITIQA) system. ...
Acknowledgements This article is based on work supported in part by the Advanced Research and Development Activity (ARDA)'s Advanced Question Answering for Intelligence (AQUAINT) Program under a contract ...
doi:10.1002/asi.20575
fatcat:oz5tqliqsnarlobmmrosfjppzq
BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis
[article]
2019
arXiv
pre-print
... to answer user questions. ...
To show the generality of the approach, the proposed post-training is also applied to some other review-based tasks such as aspect extraction and aspect sentiment classification in aspect-based sentiment ...
Acknowledgments Bing Liu's work was partially supported by the National Science Foundation (NSF IIS 1838770) and by a research gift from Huawei. ...
arXiv:1904.02232v2
fatcat:b62iccqtlbfb3g7znqasscktfq
Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations
2001
Information Processing & Management
Our results showed that Boolean and natural language searching achieved comparable results and that the results from batch evaluations were not comparable to those obtained in experiments with real users ...
We challenged these assumptions in the Text Retrieval Conference (TREC) interactive track, with real users following a consensus protocol to search for an instance recall task. ...
Acknowledgements This study was funded in part by Grant LM-06311 from the US National Library of Medicine. ...
doi:10.1016/s0306-4573(00)00054-6
fatcat:s4fad5xgfzerzasxkqqzrkqobm
Assessing the performance of Olelo, a real-time biomedical question answering application
2017
BioNLP 2017
Question answering (QA) can support physicians and biomedical researchers to find answers to their questions in the scientific literature. ...
In addition to the BioASQ evaluation, we compared our system to other on-line biomedical QA systems in terms of response time and the quality of the answers (system page: http://hpi.de/plattner/olelo). ...
Question answering (QA) is the task of automatically answering questions posed by users (Jurafsky and Martin, 2013). ...
doi:10.18653/v1/w17-2344
dblp:conf/bionlp/NevesEFU17
fatcat:lerk6gynazallo6vamr4coqubq
Increasing cheat robustness of crowdsourcing tasks
2012
Information retrieval (Boston)
In this work, we take a different approach by investigating means of a priori making crowdsourced tasks more resistant against cheaters. ...
Recently, we have seen numerous sophisticated schemes for identifying such workers. Those, however, often require additional resources or introduce artificial limitations to the task. ...
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) ...
doi:10.1007/s10791-011-9181-9
fatcat:gzcsi7e6oranvov7j6qzudx5fy
A Biomedical Question Answering System in BioASQ 2017
2017
BioNLP 2017
Question answering, the identification of short, accurate answers to users' questions, is a longstanding challenge widely studied over the last decades in the open domain. ...
Preliminary results show that our system achieves good, competitive results on both the exact and ideal answer extraction tasks compared with the participating systems. ...
Overall, from the results and analysis on five batches of testing data for BioASQ task 5b, we can conclude that our system is very competitive with the participating systems in both ...
doi:10.18653/v1/w17-2337
dblp:conf/bionlp/SarroutiA17
fatcat:n3qa2pvzhrb6tcsdokyicuqlpq
Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA
[article]
2020
arXiv
pre-print
... whether they are communicated to users through a spoken or visual interface, and contrast effectiveness across modalities. ...
... only evaluates explanations using a visual display, and may erroneously extrapolate conclusions about the most performant explanations to other modalities. ...
We use a random sample of 120 questions from our dataset, which remains the same across all conditions. ...
arXiv:2012.15075v1
fatcat:kbngatjdlfhhhod3on5cpiqnoq
LAReQA: Language-agnostic answer retrieval from a multilingual pool
[article]
2020
arXiv
pre-print
This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation. ...
Interestingly, the embedding baseline that performs the best on LAReQA falls short of competing baselines on zero-shot variants of our task that only target "weak" alignment. ...
We thank Sebastian Ruder and Melvin Johnson for helpful comments on an earlier draft of this paper. We also thank Rattima Nitisaroj for helping us evaluate the quality of our Thai sentence breaking. ...
arXiv:2004.05484v1
fatcat:75ilkbbezzhdvnqm4xzultmldu
Assessing the Cognitive Complexity of Information Needs
2014
Proceedings of the 2014 Australasian Document Computing Symposium on - ADCS '14
Information retrieval systems can be evaluated in laboratory settings through user studies, and through test collections and effectiveness metrics. ...
The goal is to reach a position from which we can determine whether user actions while searching are influenced by the way the information need is expressed, and by the fundamental nature of the information ...
Acknowledgment This work was supported by the Australian Research Council's Discovery Projects Scheme (projects DP110101934 and DP140102655). ...
doi:10.1145/2682862.2682874
dblp:conf/adcs/MoffatBST14
fatcat:pj6x4clz4vabplnzf2luhvdoia
Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams
2016
Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval - SIGIR '16
... yields system comparisons that accurately match the result of batch evaluations. ...
By simulating user interactions with interleaved results on submitted runs to the TREC 2014 tweet timeline generation (TTG) task and the TREC 2015 real-time filtering task, we demonstrate that our methodology ...
Specifically, we simulate user interactions with interleaved results to produce a decision on whether system A is better than system B, and correlate these decisions with the results of batch evaluations ...
doi:10.1145/2911451.2911494
dblp:conf/sigir/QianLR16
fatcat:2rdjuzmmtvehjld62prftfu2lm
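To make the interleaving methodology in this entry concrete, here is a sketch of team-draft interleaving with click-based credit assignment; the snippet does not say which interleaving variant the authors use, so the scheme and all names below are assumptions for illustration:

```python
import random

def team_draft_interleave(run_a, run_b, depth=10):
    """Team-draft interleaving: in each round, systems A and B (in random
    order) each contribute their highest-ranked document not yet selected.
    Returns the interleaved ranking and each document's contributing team."""
    interleaved, team = [], {}
    while len(interleaved) < depth:
        added = False
        for side in random.sample(['A', 'B'], 2):
            run = run_a if side == 'A' else run_b
            doc = next((d for d in run if d not in team), None)
            if doc is not None:
                team[doc] = side
                interleaved.append(doc)
                added = True
        if not added:  # both runs exhausted
            break
    return interleaved[:depth], team

def decide_winner(clicked, team):
    """Credit each simulated click to the team that contributed the document;
    the system with more credit is preferred for this query."""
    credit = {'A': 0, 'B': 0}
    for doc in clicked:
        if doc in team:
            credit[team[doc]] += 1
    return credit
```

Aggregating these per-query A-versus-B decisions and correlating them with batch-metric orderings is the kind of comparison the snippet above describes.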
Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models
[article]
2021
arXiv
pre-print
Comparing different dense embedding models, tri-encoders with one encoder each for the question, the text, and the table increase retrieval performance compared to bi-encoders with one encoder for the question and ...
Open-domain extractive question answering works well on textual data by first retrieving candidate texts and then extracting the answer from those candidates. ...
Acknowledgements We would like to thank Jonathan Herzig and Julian Eisenschlos for taking the time to discuss ideas with us and to give early feedback on experiment results. ...
arXiv:2108.04049v2
fatcat:5ab7umnm2jcm7lpbngi2hxhdyy
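The tri-encoder versus bi-encoder contrast in this entry can be illustrated with a toy dense-retrieval sketch; the random linear maps below stand in for trained transformer encoders, and every name and dimension is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# One stand-in encoder per input type: question, text passage, table.
W_question, W_text, W_table = (rng.normal(size=(DIM, DIM)) for _ in range(3))

def encode(features, W):
    v = W @ features
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

def retrieve(question, texts, tables):
    """Embed the question once, embed each candidate with its own
    modality-specific encoder, and rank all candidates by similarity."""
    q = encode(question, W_question)
    scored = [(float(encode(t, W_text) @ q), 'text', i) for i, t in enumerate(texts)]
    scored += [(float(encode(t, W_table) @ q), 'table', i) for i, t in enumerate(tables)]
    return sorted(scored, reverse=True)

# Toy usage with random feature vectors in place of tokenized inputs.
question = rng.normal(size=DIM)
ranking = retrieve(question,
                   texts=[rng.normal(size=DIM) for _ in range(3)],
                   tables=[rng.normal(size=DIM) for _ in range(2)])
```

A bi-encoder variant would share one candidate encoder across texts and tables; giving each modality its own encoder is the design difference the abstract reports as improving retrieval.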
Showing results 1 — 15 out of 20,618 results