Module Comparison of Transformer-TTS for Speaker Adaptation based on Fine-tuning

Proceedings of APSIPA Annual Summit and Conference (APSIPA-ASC 2020)

Katsuki Inoue
Sunao Hara
Masanobu Abe

First page: 826
Last page: 830
Language: English
Publication type: Research paper (international conference proceedings)
Publisher: IEEE/APSIPA

End-to-end text-to-speech (TTS) models have achieved remarkable results in recent times. However, these models require a large amount of text and audio data for training. Speaker adaptation methods based on fine-tuning have been proposed for constructing a TTS model from small-scale data. Although these methods can replicate the target speaker's voice quality, the synthesized speech includes deletions and/or repetitions of speech. The goal of speaker adaptation is to change the voice quality to match the target speaker's, on the premise that adjusting only the necessary modules will reduce the amount of data needed for fine-tuning. In this paper, we clarify the role of each module in Transformer-TTS by excluding it from the update during fine-tuning. Specifically, we froze the character embedding, the encoder, and the layer predicting the stop token, and omitted the loss function for estimating the sentence ending. The experimental results showed the following: (1) fine-tuning the character embedding did not reduce the deletion and/or repetition of speech, (2) speech deletion increased when the encoder was not fine-tuned, (3) speech deletion was suppressed when the layer predicting the stop token was not fine-tuned, and (4) speech repetitions occurred frequently at sentence ends when the loss function for estimating the sentence ending was omitted.
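
The four conditions described in the abstract can be laid out as a small configuration table. The following is only an illustrative sketch in plain Python, not the authors' implementation: the module names (`char_embedding`, `encoder`, `decoder`, `stop_token_layer`), the loss names (`spectrogram_loss`, `stop_token_loss`), and the `Condition` class are all hypothetical labels chosen here. The sketch shows which modules would still be updated and which loss terms would still be used under each condition.

```python
from dataclasses import dataclass

# Hypothetical names for the parts of a Transformer-TTS model.
# The paper's actual implementation may divide the model differently.
MODULES = ("char_embedding", "encoder", "decoder", "stop_token_layer")

# Hypothetical loss terms: one for the acoustic features and one for
# estimating the sentence ending (stop token).
LOSSES = ("spectrogram_loss", "stop_token_loss")


@dataclass(frozen=True)
class Condition:
    """One fine-tuning condition: what is frozen and which loss is omitted."""
    name: str
    frozen: frozenset = frozenset()
    omitted_losses: frozenset = frozenset()

    def trainable(self):
        # Modules whose parameters would still be updated during fine-tuning.
        return [m for m in MODULES if m not in self.frozen]

    def active_losses(self):
        # Loss terms that would still contribute to the training objective.
        return [name for name in LOSSES if name not in self.omitted_losses]


# The four conditions listed in the abstract, in the same order.
CONDITIONS = [
    Condition("freeze character embedding", frozen=frozenset({"char_embedding"})),
    Condition("freeze encoder", frozen=frozenset({"encoder"})),
    Condition("freeze stop-token layer", frozen=frozenset({"stop_token_layer"})),
    Condition("omit sentence-ending loss", omitted_losses=frozenset({"stop_token_loss"})),
]

for cond in CONDITIONS:
    print(cond.name, "| update:", cond.trainable(), "| losses:", cond.active_losses())
```

In a deep-learning framework, freezing a module typically means excluding its parameters from the optimizer update; the table above records only which parts would be excluded under each condition.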

Links

Web of Science: https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000678729400141&DestApp=WOS_CPL
URL: http://www.apsipa.org/proceedings/2020/pdfs/0000826.pdf (link to full text)

Identifiers

ISSN: 2309-9402
Web of Science ID: WOS:000678729400141
