Batak Toba language-Indonesian machine translation with transfer learning using No Language Left Behind
Abstract
This study focuses on neural machine translation (NMT) for a low-resource language (LRL) pair, Batak Toba-Indonesian (bbc↔ind). The Batak Toba language is critically endangered; it is spoken by the Batak, an Indonesian ethnic group. Recent advances in machine translation offer potential solutions, with transfer learning emerging as a promising approach for this language pair. We used a publicly available bbc↔ind parallel corpus from the Hugging Face datasets hub and employed the distilled 600M variant of NLLB-200 as the baseline model. Our models achieved sacreBLEU scores of 37.10 for bbc→ind (+25.67, up from 11.43) and 30.84 for ind→bbc (+25.82, up from 5.02). These results outperform all previous work on bbc↔ind machine translation and demonstrate the validity of our approach.
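The reported gains can be checked, and a corpus-level sacreBLEU score computed, along the lines of the following minimal sketch. It assumes the third-party `sacrebleu` package with its default settings; the abstract does not state the tokenization or other scoring options the authors used. The function names `corpus_sacrebleu` and `gain` are hypothetical and chosen only for illustration.

```python
def corpus_sacrebleu(hypotheses, references):
    """Corpus-level BLEU via the third-party sacrebleu package.

    Assumption: default sacrebleu settings and one reference per
    hypothesis; the abstract does not specify the configuration.
    """
    import sacrebleu  # imported lazily so gain() works without it

    # corpus_bleu expects a list of reference streams, hence [references].
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


def gain(baseline_score, new_score):
    """Absolute improvement in BLEU points, rounded to two decimals."""
    return round(new_score - baseline_score, 2)


# Scores quoted in the abstract: (baseline, after transfer learning).
reported = {
    "bbc->ind": (11.43, 37.10),
    "ind->bbc": (5.02, 30.84),
}

for direction, (baseline, new) in reported.items():
    print(f"{direction}: {baseline:.2f} -> {new:.2f} (+{gain(baseline, new):.2f})")
```

With model outputs and reference translations held as two equally long lists of strings, `corpus_sacrebleu(outputs, refs)` would give a score on the same 0 to 100 scale as the figures above, although it will only be comparable to them if the scoring configuration matches the authors'.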
DOI: http://doi.org/10.11591/ijaas.v13.i4.pp830-839
International Journal of Advances in Applied Sciences (IJAAS)
p-ISSN 2252-8814, e-ISSN 2722-2594
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.