Presentations

Mar 19, 2020

NORMALIZATION OF TRANSLITERATED WORDS USING SEQ2SEQ MODEL WITH SPELL CHECKER

Proceedings of the Association for Natural Language Processing (言語処理学会発表論文集)
  • zolzaya
  • Nishimura Ryota
  • Altangerel Ayush
  • Kitaoka Norihide

Language
Japanese
Presentation type

Mongolian has two writing systems: Classical Mongolian (Uyghur Mongolian) script and Cyrillic. Both are used in Mongolia. The Mongolian People's Republic, as it was then called, first started using a modified Russian Cyrillic alphabet in 1940; it is still in use and is the official writing system. Mongolian Cyrillic has 35 characters. Although Cyrillic is the official script, many people now use the Latin alphabet to write on social media such as Facebook and Twitter. There are no rules for writing transliterated text in the Roman script, so a single word can be written in several different forms. Processing social media text is an important subject in NLP, and a great deal of work in recent years has focused on it. For Mongolian, however, research in this area is lacking, and this is the first study of text normalization for the language. Text normalization is a pre-processing stage for speech and language processing applications. Originally, it meant converting non-standard words in formal text, such as numbers, dates, acronyms, and abbreviations, into their standard forms. The task was later extended to converting informal social media text into formal text. In most research on noisy text normalization, the source and target texts are in the same language. Our case is slightly different: our purpose is to convert noisy transliterated text from social media into formal text. In other words, the source and target texts are in different scripts, Roman and Cyrillic respectively.

Link information
URL
https://web.db.tokushima-u.ac.jp/cgi-bin/edb_browse?EID=373079