Spontaneous speech and opinion detection: mining call-centre transcripts
Language Resources and Evaluation
Opinion mining on conversational telephone speech tackles two challenges: the robustness of speech transcriptions and the relevance of opinion models. Both challenges are critical in an industrial context such as marketing. This paper jointly addresses these two issues by analyzing the influence of speech transcription errors on the detection of opinions and business concepts. We present both modules: the speech transcription system, which consists of a successful adaptation of a
... l speech transcription system to call-centre data, and the information extraction module, which is based on a semantic modeling of business concepts, opinions and sentiments using complex linguistic rules. Three models of opinions are implemented, based respectively on discourse theory, appraisal theory and marketers' expertise. The influence of speech recognition errors on the information extraction module is evaluated by comparing its outputs on manual versus automatic transcripts. The F-scores obtained are 0.79 for business concept detection, 0.74 for opinion detection and 0.67 for the extraction of relations between opinions and their targets. These results and the in-depth error analysis show that opinion detection based on complex rules is feasible on call-centre transcripts.

A key challenge of speech processing is to give computers the ability to understand human behavior. The input is low-level information provided by audio samples, which can be very hard to process in the context of human-to-human interactions such as phone calls. Some approaches focus on the analysis of the speech signal: acoustic features such as prosody, voice quality or spectral features are used to develop acoustic emotion recognition systems (Clavel and Richard 2011; Devillers et al. 2010). However, information extraction on speech is more generally tackled from the point of view of natural language processing, with methods focusing on named entity detection and information retrieval. Many aspects of this issue have been explored through evaluation campaigns in these two fields, for instance the ESTER2 campaign for named entity detection (Galliano et al. 2009) or the TREC-7 Spoken Document Retrieval (SDR) track (Garofolo et al. 1999).
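A minimal sketch of this kind of evaluation, assuming that the detections produced on the manual transcripts serve as the reference and those produced on the automatic transcripts as the hypothesis. Each detection is reduced to a hypothetical (call id, annotation type, target) tuple and matched by exact equality; this matching criterion and the toy data are illustrative assumptions, not the paper's protocol. The F-score is the harmonic mean of precision and recall.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    reference: Iterable[Hashable], hypothesis: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Score hypothesis detections against reference detections.

    Detections are compared by exact equality; this matching criterion is an
    assumption made for illustration, not the paper's evaluation protocol.
    """
    ref = set(reference)
    hyp = set(hypothesis)
    true_positives = len(ref & hyp)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical detections as (call id, annotation type, target) tuples.
manual = {
    ("call1", "opinion", "price"),
    ("call1", "opinion", "delivery"),
    ("call2", "concept", "invoice"),
    ("call2", "opinion", "staff"),
}
automatic = {
    ("call1", "opinion", "price"),
    ("call2", "concept", "invoice"),
    ("call2", "opinion", "stuff"),  # target misrecognized by the ASR system
}

p, r, f = precision_recall_f1(manual, automatic)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

On these toy sets the script prints `precision=0.67 recall=0.50 F=0.57`: one missed detection and one corrupted target lower both precision and recall, which is how transcription errors propagate into the extraction scores.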
However, such campaigns are mainly based on broadcast news and have not yet tackled information extraction on phone conversations, in which spontaneous speech phenomena are more frequent. Moreover, the performance of speech recognition systems degrades on such data, which makes information extraction more difficult. Other approaches, such as the one described in Olsson et al. (2007), search for keywords directly in the acoustic signal or in phonetic transcriptions. They can offer solutions for handling speech recognition errors but are difficult to use for detecting information subtler than keywords, such as opinions and sentiments.

Alongside these works on speech transcripts, sentiment analysis and opinion mining on texts have been flourishing research fields since 2000. This is mostly due to the emergence of a new type of corpus: the interactive web. Users comment on the products they have bought, review the films they have seen and make their opinions public. These websites usually also provide star ratings, which makes user-comment sites an ideal training corpus. A general overview of sentiment analysis and its evolution can be found in Pang and Lee (2008) and Tang et al. (2009). Several methods are used to distinguish positive from negative opinions. Pang et al. (2002) automatically extract linguistic clues from movie reviews and test three learning methods to classify them. They conclude that, although the results are satisfactory, they are not as good as those obtained on usual text categorization tasks. The clues used by Turney (2002) are bigrams extracted with predefined morpho-syntactic patterns (such as adjective + noun and adverb + verb). Classification accuracy reaches 84% on product reviews and 66% on film reviews. Running experiments, Dave et al. (2003) find that the n-gram length must be tuned to optimize categorization. The longer the C. Clavel et al.
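The pattern-based extraction attributed to Turney (2002) can be sketched as follows, assuming Penn Treebank part-of-speech tags and a hypothetical, hand-tagged input sentence (in practice a POS tagger would supply the tags). Only the two patterns named above are implemented: an adjective followed by a noun, and an adverb followed by a verb. Turney's full pattern table contains further patterns (e.g. adverb + adjective) with constraints on the following word, and the extracted phrases are subsequently scored for semantic orientation; neither step is shown here.

```python
from typing import List, Tuple

# A token is a (word, Penn Treebank tag) pair; the tagging itself is assumed.
TaggedToken = Tuple[str, str]

ADJECTIVES = {"JJ"}
NOUNS = {"NN", "NNS"}
ADVERBS = {"RB", "RBR", "RBS"}
VERBS = {"VB", "VBD", "VBN", "VBG"}


def extract_pattern_bigrams(tagged: List[TaggedToken]) -> List[str]:
    """Return bigrams matching adjective + noun or adverb + verb."""
    phrases = []
    # Slide a window of two consecutive tokens over the sentence.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1 in ADJECTIVES and t2 in NOUNS) or (t1 in ADVERBS and t2 in VERBS):
            phrases.append(f"{w1} {w2}")
    return phrases


# Hypothetical hand-tagged sentence from a customer comment.
sentence = [
    ("The", "DT"), ("staff", "NN"), ("really", "RB"), ("helped", "VBD"),
    ("and", "CC"), ("offered", "VBD"), ("a", "DT"), ("quick", "JJ"),
    ("answer", "NN"),
]
print(extract_pattern_bigrams(sentence))
```

On this sentence the function returns `['really helped', 'quick answer']`. Restricting candidates to such patterns keeps opinion-bearing phrases while discarding bigrams like "the staff" that carry little polarity.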