Paper

Peer-reviewed · Link to full text available · International journal
December 2020

Module Comparison of Transformer-TTS for Speaker Adaptation based on Fine-tuning

Proceedings of APSIPA Annual Summit and Conference (APSIPA-ASC 2020)
  • Katsuki Inoue
  • Sunao Hara
  • Masanobu Abe

Start page
826
End page
830
Language
English
Publication type
Research paper (international conference proceedings)
Publisher
IEEE/APSIPA

End-to-end text-to-speech (TTS) models have achieved remarkable results in recent years. However, these models require a large amount of text and audio data for training. Speaker adaptation methods based on fine-tuning have been proposed for constructing a TTS model from small-scale data. Although these methods can replicate the target speaker's voice quality, the synthesized speech suffers from deletions and/or repetitions of speech. The goal of speaker adaptation is to change the voice quality to that of the target speaker, on the premise that fine-tuning only the necessary modules reduces the amount of data required. In this paper, we clarify the role of each module in Transformer-TTS by leaving it un-updated during fine-tuning. Specifically, we froze the character embedding, the encoder, and the layer predicting the stop token, and omitted the loss function for estimating sentence endings. The experimental results showed the following: (1) fine-tuning the character embedding did not improve the deletion and/or repetition of speech, (2) speech deletion increased when the encoder was not fine-tuned, (3) speech deletion was suppressed when the layer predicting the stop token was not fine-tuned, and (4) speech repetitions occurred frequently at sentence endings when the loss function estimating sentence endings was omitted.
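The module-freezing setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of PyTorch and the submodule names "embedding", "encoder", and "stop_linear" are assumptions standing in for the paper's character embedding, encoder, and stop-token prediction layer.

import torch
import torch.nn as nn

def freeze_modules(model: nn.Module, frozen):
    """Disable gradient updates for the named top-level submodules,
    so fine-tuning adapts only the remaining parameters."""
    for name, module in model.named_children():
        if name in frozen:
            module.requires_grad_(False)  # keep pretrained weights fixed

# Toy stand-in for a pretrained Transformer-TTS model (hypothetical names).
model = nn.ModuleDict({
    "embedding": nn.Embedding(100, 256),   # character embedding
    "encoder": nn.Linear(256, 256),        # encoder (simplified)
    "decoder": nn.Linear(256, 80),         # decoder (simplified)
    "stop_linear": nn.Linear(80, 1),       # layer predicting the stop token
})

# E.g., condition (3) in the abstract: do not fine-tune the stop-token layer.
freeze_modules(model, frozen={"stop_linear"})

# The optimizer receives only the parameters left trainable.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

Freezing via requires_grad_ keeps the pretrained values intact while excluding those parameters from gradient updates, which is what allows the remaining modules to be adapted from small-scale data.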

Links
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000678729400141&DestApp=WOS_CPL
URL
http://www.apsipa.org/proceedings/2020/pdfs/0000826.pdf
IDs
  • ISSN : 2309-9402
  • Web of Science ID : WOS:000678729400141
