Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Tuukka Törö tells us about his research on Finnish speech synthesis. Neural network models, which are trained with large amounts of audio data from varied datasets, enable researchers to analyze speech in new ways.
I am Tuukka Törö. I have been working as a doctoral researcher at the University of Helsinki’s Phonetics and Speech Synthesis Research Group since the beginning of this year. My background is in linguistics and phonetics, and I hold a BA in English studies from the University of Malmö and an MA in Phonetics from the University of Helsinki. After writing my Master’s thesis on controlling speaking styles in speech synthesis, I spent some time working with YLE on AI radio projects where we created synthetic ‘actors’ for radio features.
In my current position, I work on the project Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis, funded by the Academy of Finland. The project’s aim is to develop text-to-speech (TTS) synthesis inspired by the predictive processing theory of human cognition.
While my focus has become more technical, the primary inspiration behind my work is a fascination with how social structures influence speech, from macro-level variation to how people convey social dynamics in specific contexts.
Currently, I am researching macro-level language variation using neural network models built for TTS and speech recognition. Although these models were originally developed for technological applications, they enable us to analyze speech in new ways. Because the models are trained on large amounts of audio, they can be used to model ‘wild’ data of varying quality on a large scale, instead of picking apart specific acoustic features from small, professionally recorded datasets.
Within the Academy project, my aim is to tie sociolinguistic variation together with predictive processing and speech synthesis. Hopefully, in the coming years we will learn something new about how humans perceive social cues in speech and how high-level social predictions can be used to improve speech synthesis.
I often use corpora from Kielipankki such as Samples of Spoken Finnish (SKN), FinSyn (to be available in Kielipankki), and most of all Donate Speech (Lahjoita puhetta). In order to train speech synthesizers that we can control with social variables – such as age, gender, and dialect – we need a large amount of audio data from people with a rich variety of backgrounds. With Finnish being a relatively small language, it is vital to have a concerted effort to build large datasets like the Donate Speech corpus.
Törö, T., Suni, A. and Šimko, J. (2024). Analysis of regional variants in a vast corpus of Finnish spontaneous speech using a large-scale self-supervised model, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024-8
Šimko, J., Törö, T., Vainio, M. and Suni, A. (2023). Prosody under control: Controlling prosody in text-to-speech synthesis by adjustments in latent reference space, Proceedings of the 18th International Congress of Phonetic Sciences, Prague, Czech Republic. http://hdl.handle.net/10138/565382
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers of Social Sciences and Humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.