Papers

Peer-reviewed
Jul, 2021

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data (accepted)

ACM Transactions on Asian and Low-Resource Language Information Processing
  • BYAMBADORJ ZOLZAYA
  • Nishimura Ryota
  • Altangerel Ayush
  • Kitaoka Norihide

Language
English
Publishing type
Research paper (scientific journal)

The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Increasing contact between people from different cultures as a result of globalization has also increased the use of the Latin alphabet, and a large amount of transliterated text is now used on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance enhancement methods, which included various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data in our text normalization experiment, our proposed method, which included checking each hypothesis during inference, achieved the lowest word error rate (WER = 13.41%), which was 4.51% lower than that of the conventional SMT method.
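
The abstract names edit distance-based correction, dictionary-based checking and word error rate without giving details. The following is a minimal, generic sketch of those three ideas, not the authors' implementation: the function names (`edit_distance`, `correct_with_dictionary`, `word_error_rate`), the `max_distance` threshold and the tiny example dictionary are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def correct_with_dictionary(hypothesis: str,
                            dictionary: Iterable[str],
                            max_distance: int = 1) -> str:
    """Hypothetical post-processing step for one decoded word.

    Keeps the hypothesis if it is a dictionary word; otherwise swaps in the
    closest dictionary word when it lies within max_distance edits.
    """
    words = list(dictionary)
    if hypothesis in words or not words:
        return hypothesis
    best = min(words, key=lambda w: edit_distance(hypothesis, w))
    if edit_distance(hypothesis, best) <= max_distance:
        return best
    return hypothesis  # nothing close enough: leave the output unchanged


def word_error_rate(references: Sequence[str],
                    hypotheses: Sequence[str]) -> float:
    """WER = word-level edit operations / number of reference words."""
    total_words = sum(len(r.split()) for r in references)
    if total_words == 0:
        raise ValueError("references contain no words")
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / total_words


if __name__ == "__main__":
    vocab = ["сайн", "байна"]  # assumed toy dictionary
    print(correct_with_dictionary("сайм", vocab))
    print(word_error_rate(["сайн байна уу"], ["сайн байна"]))
```

The same `edit_distance` routine serves both purposes here: over characters it ranks dictionary candidates for a single word, and over word lists it counts the errors that WER divides by the reference length.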

Link information
URL
https://web.db.tokushima-u.ac.jp/cgi-bin/edb_browse?EID=376744
