User performance versus precision measures for simple search tasks

Andrew Turpin, Falk Scholer
2006 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06  
Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision-based measures of IR system effectiveness on one-shot query  ...  evaluation do not correlate with user performance on these tasks.  ...  Hersh et al. investigated whether batch and user evaluations give the same results for an instance recall task [13].  ... 
doi:10.1145/1148170.1148176 dblp:conf/sigir/TurpinS06 fatcat:gbkbtfadhzabhcsfkszsxctryq

Do batch and user evaluations give the same results?

William Hersh, Andrew Turpin, Susan Price, Benjamin Chan, Dale Kraemer, Lynetta Sacherek, Daniel Olson
2000 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '00  
Our results showed the weighting scheme giving beneficial results in batch studies did not do so with real users.  ...  Further analysis did identify other factors predictive of instance recall, including number of documents saved by the user, document recall, and number of documents seen by the user.  ...  The ultimate answer to the question of whether these two approaches to evaluation give the same results must ultimately be answered by further experiments that use a larger number of queries and more  ... 
doi:10.1145/345508.345539 dblp:conf/sigir/HershTPCKSO00 fatcat:d6ybdwpfmjgbbkap5na7u3dtcy

Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach [article]

Zehao Lin, Shaobo Cui, Xiaoming Kang, Guodun Li, Feng Ji, Haiqing Chen, Yin Zhang
2020 arXiv   pre-print
The arbitrator's decision is made with the assistance of two ancillary imaginator models: a wait imaginator and an answer imaginator.  ...  The answer imaginator, nevertheless, struggles to predict the answer of the dialogue system and convince the arbitrator that it's a superior choice to answer the users' query immediately.  ...  than a single utterance. This will create a critical dilemma faced by the dialogue systems in which the dialogue system is not sure whether it should wait for the further input of the user or simply answer  ... 
arXiv:2005.13119v1 fatcat:bbf3flqtc5hu5oujxjbc7pcj24

Metric and Relevance Mismatch in Retrieval Evaluation [chapter]

Falk Scholer, Andrew Turpin
2009 Lecture Notes in Computer Science  
When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks.  ...  In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics, and their relationship with user performance on a common web search task.  ...  Investigations by Hersh and Turpin found no relationship between MAP and user performance on an instance recall task [8], or a question answering task [18].  ... 
doi:10.1007/978-3-642-04769-5_5 fatcat:qg2gvnwkuree7h6pgo4bjh7afa

Using interview data to identify evaluation criteria for interactive, analytical question-answering systems

Diane Kelly, Nina Wacholder, Robert Rittman, Ying Sun, Paul Kantor, Sharon Small, Tomek Strzalkowski
2007 Journal of the American Society for Information Science and Technology  
The purpose of this work is to identify potential evaluation criteria for interactive, analytical question-answering (QA) systems by analyzing evaluative comments made by users of such a system.  ...  These data were collected as part of an intensive, three-day evaluation workshop of the High-Quality Interactive Question Answering (HITIQA) system.  ...  Acknowledgements This article is based on work supported in part by the Advanced Research and Development Activity (ARDA)'s Advanced Question Answering for Intelligence (AQUAINT) Program under a contract  ... 
doi:10.1002/asi.20575 fatcat:oz5tqliqsnarlobmmrosfjppzq

BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis [article]

Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
2019 arXiv   pre-print
to answer user questions.  ...  To show the generality of the approach, the proposed post-training is also applied to some other review-based tasks such as aspect extraction and aspect sentiment classification in aspect-based sentiment  ...  Acknowledgments Bing Liu's work was partially supported by the National Science Foundation (NSF IIS 1838770) and by a research gift from Huawei.  ... 
arXiv:1904.02232v2 fatcat:b62iccqtlbfb3g7znqasscktfq

Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations

William Hersh, Andrew Turpin, Susan Price, Dale Kraemer, Daniel Olson, Benjamin Chan, Lynetta Sacherek
2001 Information Processing & Management  
Our results showed that Boolean and natural language searching achieved comparable results and that the results from batch evaluations were not comparable to those obtained in experiments with real users  ...  We challenged these assumptions in the Text Retrieval Conference (TREC) interactive track, with real users following a consensus protocol to search for an instance recall task.  ...  Acknowledgements This study was funded in part by Grant LM-06311 from the US National Library of Medicine.  ... 
doi:10.1016/s0306-4573(00)00054-6 fatcat:s4fad5xgfzerzasxkqqzrkqobm

Assessing the performance of Olelo, a real-time biomedical question answering application

Mariana Neves, Fabian Eckert, Hendrik Folkerts, Matthias Uflacker
2017 BioNLP 2017  
Question answering (QA) can support physicians and biomedical researchers to find answers to their questions in the scientific literature.  ...  In addition to the BioASQ evaluation, we compared our system to other on-line biomedical QA systems in terms of the response time and the quality of the answers.  ...  Introduction Question answering (QA) is the task of automatically answering questions posed by users (Jurafsky and Martin, 2013).  ... 
doi:10.18653/v1/w17-2344 dblp:conf/bionlp/NevesEFU17 fatcat:lerk6gynazallo6vamr4coqubq

Increasing cheat robustness of crowdsourcing tasks

Carsten Eickhoff, Arjen P. de Vries
2012 Information retrieval (Boston)  
In this work, we take a different approach by investigating means of a priori making crowdsourced tasks more resistant against cheaters.  ...  Recently, we have seen numerous sophisticated schemes of identifying such workers. Those, however, often require additional resources or introduce artificial limitations to the task.  ...  Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s)  ... 
doi:10.1007/s10791-011-9181-9 fatcat:gzcsi7e6oranvov7j6qzudx5fy

A Biomedical Question Answering System in BioASQ 2017

Mourad Sarrouti, Said Ouatik El Alaoui
2017 BioNLP 2017  
Question answering, the identification of short accurate answers to users' questions, is a longstanding challenge widely studied over the last decades in the open-domain.  ...  Preliminary results show that our system achieves good and competitive results in both exact and ideal answers extraction tasks as compared with the participating systems.  ...  Overall, from the results and analysis on five batches of testing data of BioASQ task 5b, we can draw a conclusion that our system is very competitive as compared with the participating systems in both  ... 
doi:10.18653/v1/w17-2337 dblp:conf/bionlp/SarroutiA17 fatcat:n3qa2pvzhrb6tcsdokyicuqlpq

Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA [article]

Ana Valeria Gonzalez, Gagan Bansal, Angela Fan, Robin Jia, Yashar Mehdad, Srinivasan Iyer
2020 arXiv   pre-print
., whether they are communicated to users through a spoken or visual interface, and contrast effectiveness across modalities.  ...  only evaluates explanations using a visual display, and may erroneously extrapolate conclusions about the most performant explanations to other modalities.  ...  D Task We use a random sample of 120 questions from our dataset which remains the same across all conditions.  ... 
arXiv:2012.15075v1 fatcat:kbngatjdlfhhhod3on5cpiqnoq

LAReQA: Language-agnostic answer retrieval from a multilingual pool [article]

Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, Yinfei Yang
2020 arXiv   pre-print
This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.  ...  Interestingly, the embedding baseline that performs the best on LAReQA falls short of competing baselines on zero-shot variants of our task that only target "weak" alignment.  ...  We thank Sebastian Ruder and Melvin Johnson for helpful comments on an earlier draft of this paper. We also thank Rattima Nitisaroj for helping us evaluate the quality of our Thai sentence breaking.  ... 
arXiv:2004.05484v1 fatcat:75ilkbbezzhdvnqm4xzultmldu

Assessing the Cognitive Complexity of Information Needs

Alistair Moffat, Peter Bailey, Falk Scholer, Paul Thomas
2014 Proceedings of the 2014 Australasian Document Computing Symposium on - ADCS '14  
Information retrieval systems can be evaluated in laboratory settings through the use of user studies, and through the use of test collections and effectiveness metrics.  ...  The goal is to reach a position from which we can determine whether user actions while searching are influenced by the way the information need is expressed, and by the fundamental nature of the information  ...  Acknowledgment This work was supported by the Australian Research Council's Discovery Projects Scheme (projects DP110101934 and DP140102655).  ... 
doi:10.1145/2682862.2682874 dblp:conf/adcs/MoffatBST14 fatcat:pj6x4clz4vabplnzf2luhvdoia

Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams

Xin Qian, Jimmy Lin, Adam Roegiest
2016 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval - SIGIR '16  
yields system comparisons that accurately match the result of batch evaluations.  ...  By simulating user interactions with interleaved results on submitted runs to the TREC 2014 tweet timeline generation (TTG) task and the TREC 2015 real-time filtering task, we demonstrate that our methodology  ...  Specifically, we simulate user interactions with interleaved results to produce a decision on whether system A is better than system B, and correlate these decisions with the results of batch evaluations  ... 
doi:10.1145/2911451.2911494 dblp:conf/sigir/QianLR16 fatcat:2rdjuzmmtvehjld62prftfu2lm

Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models [article]

Bogdan Kostić, Julian Risch, Timo Möller
2021 arXiv   pre-print
Comparing different dense embedding models, tri-encoders with one encoder for each question, text and table, increase retrieval performance compared to bi-encoders with one encoder for the question and  ...  Open-domain extractive question answering works well on textual data by first retrieving candidate texts and then extracting the answer from those candidates.  ...  Acknowledgements We would like to thank Jonathan Herzig and Julian Eisenschlos for taking the time to discuss ideas with us and to give early feedback on experiment results.  ... 
arXiv:2108.04049v2 fatcat:5ab7umnm2jcm7lpbngi2hxhdyy