Publications

Selected and recent publications are given below. The complete list of my articles is available on my Google Scholar profile.

Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments

Published in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023

End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.

Recommended citation: Anusha Prakash, S. Umesh and Hema A. Murthy, "Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing Aided Alignments", 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389630. https://ieeexplore.ieee.org/abstract/document/10389630

Exploring the Role of Language Families for Building Indic Speech Synthesisers

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2022

Building end-to-end speech synthesisers for Indian languages is challenging, given the lack of adequate clean training data and multiple grapheme representations across languages. This work explores the importance of training multilingual and multi-speaker text-to-speech (TTS) systems based on language families. The objective is to exploit the phonotactic properties of language families, where small amounts of accurately transcribed data across languages can be pooled together to train TTS systems. These systems can then be adapted to new languages belonging to the same family in extremely low-resource scenarios. TTS systems are trained separately for Indo-Aryan and Dravidian language families, and their performance is compared to that of a combined Indo-Aryan+Dravidian voice. We also investigate the amount of training data required for a language in a multilingual setting. Same-family and cross-family synthesis and adaptation to unseen languages are analysed. The analyses show that language family-wise training of Indic systems is the way forward for the Indian subcontinent, where a large number of languages are spoken.

Recommended citation: Anusha Prakash and Hema A. Murthy, "Exploring the Role of Language Families for Building Indic Speech Synthesisers", in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 734-747, 2023, doi: 10.1109/TASLP.2022.3230453. https://ieeexplore.ieee.org/abstract/document/9992058

Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework

Published in Interspeech, 2020

Building text-to-speech (TTS) synthesisers for Indian languages is a difficult task owing to a large number of active languages. Indian languages can be classified into a finite set of families, prominent among them, Indo-Aryan and Dravidian. The proposed work exploits this property to build a generic TTS system using multiple languages from the same family in an end-to-end framework. Generic systems are quite robust as they are capable of capturing a variety of phonotactics across languages. These systems are then adapted to a new language in the same family using small amounts of adaptation data. Experiments indicate that good quality TTS systems can be built using only 7 minutes of adaptation data. An average degradation mean opinion score of 3.98 is obtained for the adapted TTSes. Extensive analysis of systematic interactions between languages in the generic TTSes is carried out. x-vectors are included as speaker embedding to synthesise text in a particular speaker’s voice. An interesting observation is that the prosody of the target speaker’s voice is preserved. These results are quite promising as they indicate the capability of generic TTSes to handle speaker and language switching seamlessly, along with the ease of adaptation to a new language.

Recommended citation: Anusha Prakash and Hema A. Murthy, "Generic Indic Text-to-Speech Synthesisers with Rapid Adaptation in an End-to-End Framework", in Proc. Interspeech, 2020, pp. 2962-2966, doi: 10.21437/Interspeech.2020-2663. https://www.isca-archive.org/interspeech_2020/prakash20_interspeech.html