Exploring Document Content with XML to Answer Questions

Kenneth C. Litkowski
2005 Text Retrieval Conference  
CL Research participated in the question answering track in TREC 2005, submitting runs for the main task, the document-ranking task, and the relationship task. The tasks were performed using the Knowledge Management System (KMS), which provides a single interface for question answering, text summarization, information extraction, and document exploration. These tasks are based on creating and exploiting an XML representation of the texts in the AQUAINT collection. Question answering is performed directly within KMS, which answers questions either from the collection or from the Internet, projected back onto the collection.
For the main task, we submitted one run; our average per-series score was 0.136, with scores of 0.180 for factoid questions, 0.026 for list questions, and 0.152 for "other" questions. For the document-ranking task, the average precision was 0.2253 and the R-precision was 0.2405. For the relationship task, we submitted two runs, with scores of 0.276 and 0.216; the first of these was the best score on this task. We describe the overall architecture of KMS and how it permits examination of question-answering tasks and strategies, not only within TREC but also in a real-world application in the bioterrorism domain. We also raise some issues concerning the judgments used for evaluating TREC results and their possible relevance in a wider context.

Problem Description

The TREC 2005 QA track used the AQUAINT Corpus of English News Text on two CD-ROMs, about one million newswire documents from the Associated Press Newswire, the New York Times Newswire, and the Xinhua News Agency. These documents were stored with SGML formatting tags (XML compliant).

For the main task of the QA track, participants were provided with 75 targets, primarily names of people, groups, organizations, and events, viewed as entities for which definitional information was to be assembled. For each target, a few factual questions were posed, totaling 362 factoid questions over the 75 targets (e.g., for the target event "Plane clips cable wires in Italian resort", two factoid questions were "When did the accident occur?" and "How many people were killed?"). One or two list questions were also posed for most of the targets (e.g., "Who were on-ground witnesses to the accident?"); there were 93 list questions in all. Finally, for each target, "other" information was to be provided, simulating an attempt to "define" the target. Each target was used as a search query against the AQUAINT corpus; NIST provided the full text of the top 50 documents, along with a list of the top 1000 documents.

Participants were required to answer the 362 factoid questions with a single exact answer, containing no extraneous information and supported by a document in the corpus. A valid answer could be NIL, indicating that there was no answer in the document set; NIST included 17 questions for which no answer exists in the collection. For these factoid questions, NIST evaluators judged whether an answer was correct, inexact, unsupported, or incorrect, and submissions were scored as the percentage of correct answers. For the list questions, participants returned a set of answers (e.g., a list of witnesses); submissions were given F-scores, measuring recall against the possible set of answers and the precision of the answers returned. For the "other" questions, participants also provided a set of answers; these answer sets were likewise scored with an F-score, measuring whether the answer set contained certain "vital" information and how efficiently peripheral information was captured (based on answer lengths).

Participants in the main task were also required to participate in the document-ranking task by submitting up to 1000 documents, ordered by score. Instead of providing an exact answer, participants were required to submit only the identifiers of documents deemed to contain an answer. Document rankings were to be provided for 50 questions, with at least one document for each question.
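Because the AQUAINT documents are stored with XML-compliant SGML tags, they can be read with an ordinary XML parser. The short Python sketch below illustrates this; the tag names (DOC, DOCNO, HEADLINE, TEXT, P) and the sample content are typical of such collections but are assumptions made for illustration, not excerpts from the corpus.

```python
# Minimal sketch: parsing an AQUAINT-style document with the standard
# library. Tag names and document content below are illustrative assumptions.
import xml.etree.ElementTree as ET

sample = """
<DOC>
<DOCNO> SAMPLE-0001 </DOCNO>
<HEADLINE> Plane clips cable wires in Italian resort </HEADLINE>
<TEXT>
<P> A low-flying military jet cut the cable of a gondola at an Italian ski resort. </P>
<P> Officials said several people were killed in the accident. </P>
</TEXT>
</DOC>
"""

doc = ET.fromstring(sample)
docno = doc.findtext("DOCNO", default="").strip()
headline = doc.findtext("HEADLINE", default="").strip()
paragraphs = [p.text.strip() for p in doc.iter("P") if p.text]

print(docno, "|", headline)
print(len(paragraphs), "paragraphs extracted")
```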
Scoring for the document-ranking task used standard measures of recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents were actually relevant). The summary measures are the average precision over all relevant documents and the R-precision, the precision after R documents have been retrieved, where R is the number of relevant documents for the question.

For the relationship task, participants were provided with TREC-like topic statements to set a context, where the topic was specific about the type of relationship being sought (generally, the ability of one entity to influence another, including both the means to influence and the motivation for doing so). Each topic ended with a question that is either a yes/no question, to be understood as a request for evidence supporting the answer, or a direct request for the evidence itself. The system response is a set of information nuggets that provides evidence for the answer. An example is shown in Figure 1 (along with a comparable topic used in DUC 2005). Answers for the relationship task are scored in the same manner as those for the "other" questions of the main task. CL Research submitted one run for the main and document-ranking tasks and two runs for the relationship task.
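As a worked illustration of the two document-ranking measures just described, the sketch below computes average precision and R-precision for a toy ranking; the function names and example data are illustrative, not part of the official evaluation software.

```python
# Sketch of average precision and R-precision over a ranked document list.
def average_precision(ranked_docs, relevant):
    """Average of the precision values at the ranks where relevant docs appear."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def r_precision(ranked_docs, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_docs[:r] if doc_id in relevant) / r

ranking = ["d3", "d7", "d1", "d9", "d4"]       # system's ranked submission
relevant = {"d1", "d4", "d9"}                  # judged relevant documents
print(average_precision(ranking, relevant))    # (1/3 + 2/4 + 3/5) / 3, about 0.478
print(r_precision(ranking, relevant))          # 1/3: only d1 appears in the top 3
```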
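The "other" and relationship answers are scored with a nugget-style F-measure, as noted above. A minimal sketch of such a computation follows; the 100-character allowance per returned nugget and the recall-heavy weighting (beta = 3) follow the commonly cited TREC 2005 settings and should be treated as assumptions rather than a restatement of the official scorer.

```python
# Sketch of a TREC-style nugget F-score: recall over vital nuggets,
# length-based precision with a per-nugget character allowance, and a
# recall-weighted F. Parameter values are assumptions (see lead-in).
def nugget_f_score(vital_returned, vital_total, okay_returned,
                   response_length, beta=3, allowance_per_nugget=100):
    if vital_total == 0:
        return 0.0
    recall = vital_returned / vital_total
    allowance = allowance_per_nugget * (vital_returned + okay_returned)
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# Example: 3 of 5 vital nuggets found, 2 "okay" nuggets, 700-character answer set.
print(round(nugget_f_score(3, 5, 2, 700), 3))
```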