Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

by Donglin Di, Xianyang Song, Weinan Zhang, Yue Zhang, Fanglin Wang

Published in ACM Transactions on Asian and Low-Resource Language Information Processing by Association for Computing Machinery (ACM).

2022  

Abstract

Using off-the-shelf resources from resource-rich languages to transfer knowledge to low-resource languages has received considerable attention. However, the requirements for a model to achieve reliable performance, including the scale of annotated data needed and an effective framework, lack clear guidance. To address the first question, we empirically investigate the cost-effectiveness of several methods for training intent classification and slot-filling models from scratch in Indonesian (ID) using English data. To address the second, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), which consists of three components, "BiCF Mixing", "Latent Space Refinement", and "Joint Decoder", to overcome the lack of low-resource language dialogue data. BiCF Mixing, based on a word-level alignment strategy, generates code-mixed data using importance-frequency and translating-confidence. Latent Space Refinement then trains a new dialogue understanding model using the code-mixed data and word embedding models. The Joint Decoder, based on a Bidirectional LSTM (BiLSTM) and a Conditional Random Field (CRF), produces the intent classification and slot-filling predictions. We also release a large-scale, fine-labeled Indonesian dialogue dataset (ID-WOZ) and ID-BERT for experiments. BiCF achieves F1 scores of 93.56% and 85.17% on intent classification and slot filling, respectively. Extensive experiments demonstrate that our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data.
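
The abstract does not spell out how BiCF Mixing scores words, so the following is only a minimal sketch of word-level code-mixing under stated assumptions, not the authors' implementation. It assumes that "importance-frequency" can be approximated by a corpus word count and "translating-confidence" by a per-word score stored in a bilingual lexicon. The function `code_mix`, the thresholds `min_freq` and `min_conf`, the toy corpus, and the confidence values are all hypothetical and chosen for illustration.

```python
from collections import Counter


def code_mix(tokens, slots, lexicon, freq, min_freq=2, min_conf=0.5):
    """Swap English tokens for Indonesian ones when both scores pass.

    `lexicon` maps a lowercased English word to (translation, confidence).
    It is a hypothetical stand-in for a word-alignment table.
    `freq` is a corpus word count used as a rough importance proxy.
    """
    mixed = []
    for tok in tokens:
        key = tok.lower()
        entry = lexicon.get(key)
        if entry and freq[key] >= min_freq and entry[1] >= min_conf:
            mixed.append(entry[0])  # frequent and confidently translated
        else:
            mixed.append(tok)       # keep the English token
    # One-to-one replacement keeps slot tags aligned with the tokens.
    return mixed, list(slots)


# Toy English utterances (assumed data, not taken from ID-WOZ).
corpus = [
    ["book", "a", "room", "in", "Jakarta"],
    ["find", "a", "cheap", "room"],
    ["book", "a", "taxi"],
]
freq = Counter(w.lower() for sent in corpus for w in sent)

# Illustrative lexicon; the confidence values are made up.
lexicon = {
    "book": ("pesan", 0.8),
    "room": ("kamar", 0.9),
    "cheap": ("murah", 0.9),
    "taxi": ("taksi", 0.4),
}

mixed, tags = code_mix(corpus[0], ["O", "O", "O", "O", "B-city"], lexicon, freq)
print(mixed, tags)
```

Because each replacement is one word for one word, the slot labels of the English utterance carry over unchanged to the code-mixed utterance, which is what makes word-level mixing convenient for slot filling.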

Type  article-journal
Stage   published
Date   2022-12-15
Language   en
Container Metadata
Not in DOAJ
In Keepers Registry
ISSN-L:  2375-4699