Spoken Document Retrieval for TREC-9 at Cambridge University

Sue E. Johnson, P. Jourlin, Karen Sparck Jones, Philip C. Woodland
2000 Text Retrieval Conference  
This paper presents work done at Cambridge University for the TREC-9 Spoken Document Retrieval (SDR) track. The CU-HTK transcriptions from TREC-8, with a Word Error Rate (WER) of 20.5%, were used in conjunction with stopping, Porter stemming, Okapi-style weighting and query expansion using a contemporaneous corpus of newswire. A windowing/recombination strategy was applied for the case where story boundaries were unknown (SU), giving final Average Precision of 38.8% and 43.0% for the TREC-9 short and terse queries respectively. The corresponding results for the story-boundaries-known (SK) runs were 49.5% and 51.9%. Document expansion was used in the SK runs and was shown to be beneficial for SU as well under certain circumstances. Non-lexical information was also generated which, although not used within the evaluation, should prove useful for enriching the transcriptions in real-world applications. Finally, cross-recogniser experiments again showed that there is little performance degradation as WER increases, and thus SDR now needs new challenges such as integration with video data.
1 Or at least where topic boundaries are not available within the global boundaries of a newscast.
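The results above are reported as Average Precision. As a rough illustration only, the sketch below computes the standard non-interpolated average precision for one query and its mean over several queries. It is assumed, not confirmed by the abstract, that this is the exact variant used in the evaluation; the function names and the toy document identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for a single query.

    ranked_ids: document ids in ranked order (assumed to contain no duplicates).
    relevant_ids: ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document was retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average_precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data, purely illustrative.
query_a = (["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
query_b = (["a", "b"], {"a"})
ap_a = average_precision(*query_a)
map_ab = mean_average_precision([query_a, query_b])
```

Dividing by the total number of relevant documents, rather than by the number retrieved, means a system is penalised for relevant stories it never returns.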
dblp:conf/trec/JohnsonJJW00 fatcat:4p4rs4lfpvfhnosvqi3lw326om