Towards the use of entropy as a measure for the reliability of automatic MT evaluation metrics

Michal Munk, Dasa Munkova, Lubomir Benko, David Pinto, Vivek Kumar Singh, Aline Villavicencio, Philipp Mayr-Schlegel, Efstathios Stamatatos
2018 Journal of Intelligent & Fuzzy Systems  
The study describes an experiment with different estimations of reliability. Reliability reflects the technical quality of a measurement procedure, such as the automatic evaluation of Machine Translation (MT). It is an indicator of measurement accuracy; in our case, of measuring the accuracy and error rate of MT output with automatic metrics (precision, recall, f-measure, Bleu-n, WER, PER, and CDER). The experiment showed the metrics (Bleu-4 and WER) that reduce the overall reliability of the automatic evaluation of accuracy and error rate using entropy. Based on the results, we can say that the use of entropy for the estimation of reliability brings more accurate results than conventional estimations of reliability (Cronbach's alpha and correlation). MT evaluation based on n-grams or edit distance, using entropy, could offer a new view on lexicon-based metrics in comparison to commonly used ones.
... approaches to MT evaluation, from fully automated quality scoring to manual or human assessment of the quality of MT output. In most evaluation approaches, translation quality is viewed as an optimal compromise between adequacy (the degree of meaning preservation) and fluency (correctness of the target language) [3]. Approaches to manual or human evaluation of MT, such as ranking, scales, error analysis, or post-editing, require human translator knowledge and assess the quality of MT output along the two axes of target language correctness and semantic fidelity [18]. Compared to automatic MT evaluation, which is not only fast and cheap but also reusable and language-independent, manual evaluation is regarded as the most reliable, but it is time- and labor-consuming and not reusable. Papineni et al. [13] stated that manual evaluation is too slow and time-consuming for the development of MT systems, for which ...
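The abstract above names Cronbach's alpha as one of the conventional reliability estimates. As a rough illustration only, the sketch below computes the standard alpha coefficient and the usual "alpha if item deleted" diagnostic. It assumes, hypothetically, that each automatic metric is treated as an item and each translated segment as a case, and that all metrics are oriented the same way (error rates such as WER would have to be reversed first). The function names and the layout of `scores` are illustrative assumptions, not the authors' procedure, and the entropy-based estimate from the study is not reproduced here.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per segment, one column per metric
    (hypothetical layout chosen for this sketch).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two metrics and two segments")
    # Sample variance of each metric (column) across segments.
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of the per-segment sum over all metrics.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(scores):
    """Alpha recomputed with each metric left out in turn.

    A value above the overall alpha suggests that the omitted metric
    lowers the reliability of the whole set.
    """
    k = len(scores[0])
    return [
        cronbach_alpha([[v for i, v in enumerate(row) if i != j] for row in scores])
        for j in range(k)
    ]
```

A metric whose removal raises alpha is a candidate for "reducing the overall reliability" in the conventional sense; the study reports that its entropy-based estimate gave more accurate results than this kind of estimate.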
doi:10.3233/jifs-169505 fatcat:5b5vv5fwrjbfzhaqj2sphq4yti