ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Javier Tejedor, Doroteo T. Toledano, Paula Lopez-Otero, Laura Docio-Fernandez, Jorge Proença, Fernando Perdigão, Fernando García-Granada, Emilio Sanchis, Anna Pompili, Alberto Abad
2018 EURASIP Journal on Audio, Speech, and Music Processing  
Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given, as input, an acoustic (spoken) query containing the term of interest. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation, held as part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish
speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of workshop talks, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the submitted systems and make an in-depth analysis based on properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.

In QbE STD, the system is given a speech segment containing the term of interest within a speech data repository, and its purpose is to find similar speech segments within that repository. The speech segment given is the query, and the system outputs other similar segments from the repository, which we will henceforth refer to as utterances. Alternatively, the query can be uttered by the user. This is a highly valuable task for blind users or for devices that lack a text-based input, since the query must consequently be given in another format, such as speech. STD systems are typically composed of three different stages: (1) the audio is decoded into word/subword lattices using an automatic speech recognition (ASR) subsystem trained for the target language (which makes the STD system language-dependent); (2) a term detection subsystem searches for the terms within those word/subword lattices to hypothesize detections; and (3) confidence measures are computed to rank the detections. STD systems are therefore normally language-dependent and require large amounts of training data.
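For concreteness, the following is a minimal toy sketch of those three stages, assuming stage 1 (ASR decoding) has already produced 1-best token sequences with timestamps and posterior probabilities in place of full word/subword lattices. All names here, and the posterior-averaging confidence measure, are illustrative assumptions, not the implementation of any system described in this paper.

```python
# Toy sketch of the three-stage STD pipeline: decoded tokens are assumed
# to come from a language-dependent ASR subsystem (stage 1, not shown).
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (token, start_s, end_s, posterior) -- a degenerate "lattice" (1-best path).
Token = Tuple[str, float, float, float]

@dataclass
class Detection:
    term: str
    utterance_id: str
    start_s: float
    end_s: float
    score: float

def search_terms(decoded: Dict[str, List[Token]],
                 terms: List[str]) -> List[Detection]:
    """Stage 2: hypothesize a detection wherever a term's words appear
    consecutively in an utterance's decoded token sequence."""
    hits = []
    for utt_id, toks in decoded.items():
        words = [t[0] for t in toks]
        for term in terms:
            tw = term.split()
            for i in range(len(words) - len(tw) + 1):
                if words[i:i + len(tw)] == tw:
                    span = toks[i:i + len(tw)]
                    # Stage 3 (confidence): average the token posteriors.
                    score = sum(t[3] for t in span) / len(span)
                    hits.append(Detection(term, utt_id,
                                          span[0][1], span[-1][2], score))
    # Rank detections by confidence, highest first.
    return sorted(hits, key=lambda d: d.score, reverse=True)
```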
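The template matching-based QbE STD systems submitted to this evaluation, by contrast, search directly in the acoustic feature space rather than through an ASR decoder. Below is a minimal sketch of that general idea, assuming MFCC features extracted with librosa and a plain sliding-window dynamic time warping (DTW) search; the feature choice, window hop, and scoring are illustrative assumptions, and real systems commonly refine this with subsequence DTW and phoneme posteriorgram features.

```python
# Minimal sketch of DTW-based template matching for QbE STD.
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load audio and return a (frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def dtw_cost(q, u):
    """Plain DTW between query frames q and utterance-window frames u,
    using cosine distance; returns a length-normalized alignment cost."""
    qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    un = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - qn @ un.T              # pairwise frame distances
    nq, nu = dist.shape
    acc = np.full((nq + 1, nu + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, nu + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[nq, nu] / (nq + nu)

def search(query_path, utterance_path, hop_frames=10):
    """Slide the query template over the utterance; negated DTW cost
    serves as the detection confidence score."""
    q = mfcc_features(query_path)
    u = mfcc_features(utterance_path)
    win = len(q)
    hits = [(start, -dtw_cost(q, u[start:start + win]))
            for start in range(0, max(1, len(u) - win), hop_frames)]
    return sorted(hits, key=lambda h: -h[1])
```

Because this search never decodes the audio into words, it needs no transcribed training data for the target language, which is what makes template matching attractive for language-independent QbE STD.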
doi:10.1186/s13636-018-0125-9