Lectures and oral presentations

March 12, 2021

Building low-resource speech recognizer: Transfer learning and data augmentation

Proceedings of the Meeting of the Acoustical Society of Japan (日本音響学会研究発表会講演論文集)
  • NARANGEREL PUREVDORJ
  • 西村 良太
  • Altangerel Ayush
  • 太田 健吾
  • 北岡 教英

Language
English
Conference type

Sequence-to-sequence (S2S) models are now widely used for end-to-end speech processing, especially in automatic speech recognition (ASR). Recently, however, a hybrid attention/CTC architecture that uses self-attention to model temporal context has achieved significantly lower word error rates (WER) [4] than S2S-based systems. Although attention-based encoder-decoder architectures are used in the best-performing end-to-end ASR systems, these approaches cannot be easily adapted to low-resource ASR [3], that is, to languages that lack large, well-annotated speech corpora. Our main goal in this paper is to develop a low-resource Mongolian ASR system. Our training dataset contains 23 hours of continuous Mongolian speech from 217 speakers, far less data than is normally needed to train conventional ASR systems. Low-resource ASR approaches typically rely on data augmentation (DA), which can be implemented at low cost. In this study, we also evaluate several multilingual transfer learning methods that use foreign-language corpora to supplement training on the low-resource target language. We tested multiple DA and multilingual training approaches and compared their effectiveness using character and word error rates (CER/WER).
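The CER and WER metrics mentioned in the abstract are commonly computed as an edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' scoring script. The function names (`edit_distance`, `wer`, `cer`) are hypothetical, and the tokenization is an assumption: words are split on whitespace, and spaces are dropped before counting characters.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes spaces are ignored."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Illustrative strings only; these are not taken from the paper's dataset.
reference = "сайн байна уу"
hypothesis = "сайн байна"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length only.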

Links
URL
https://web.db.tokushima-u.ac.jp/cgi-bin/edb_browse?EID=374238