Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Tamás Grósz tells us about his research on speech technology.
My name is Tamás Grósz and I am a Research Fellow in the Speech Recognition group at the Department of Information and Communications Engineering of Aalto University.
During my PhD years, my research was focused on Speech Technology, specifically on developing new deep-learning-based solutions for Automatic Speech Recognition (ASR). Although my main interest was acoustic modelling, I was also active in other areas. Paralinguistics, in particular, piqued my interest, and I worked on a wide variety of tasks. I regularly participated in the Interspeech ComParE challenges and won several times over the years. Perhaps the most notable of our systems is the one that automatically assesses the condition of patients suffering from Parkinson’s disease. Besides the competitions, I was also part of a project that concentrated on developing a speech-based solution for early detection of mild cognitive impairment. In the last years of my studies, my focus shifted towards silent speech interfaces. I had the pleasure of working with state-of-the-art prototypes and developing new systems that could generate speech from ultrasound tongue movement videos.
After graduation, I joined Mikko Kurimo’s lab as a postdoc, where I had the opportunity to work on other topics, including language modelling and AI explainability. Initially, I worked on subword-based language models for agglutinative languages like Hungarian and Finnish. While working with various models, I noticed the importance of curriculum learning. As a spin-off project, I started investigating different ways of estimating the difficulty of training samples and constructing new curricula for AI models.
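The basic idea of loss-based curriculum construction can be sketched in a few lines: a proxy model's per-sample loss serves as the difficulty estimate, and the training data is reordered from easy to hard. This is a minimal illustration under my own assumptions (the function names and the squared-error proxy are hypothetical), not the actual criteria studied in the project:

```python
import numpy as np

def difficulty_scores(proxy_loss, X, y):
    # One common difficulty estimate: the per-sample loss of a proxy model.
    return np.array([proxy_loss(x, t) for x, t in zip(X, y)])

def curriculum_order(X, y, scores):
    # Sort training samples from easy (low loss) to hard (high loss).
    order = np.argsort(scores)
    return X[order], y[order]

# Toy data: a weak "proxy model" (the identity predictor) scores each sample.
X = np.array([[0.1], [2.0], [1.0]])
y = np.array([0.1, 5.0, 1.1])
proxy = lambda x, t: float((x[0] - t) ** 2)  # squared error of the proxy

scores = difficulty_scores(proxy, X, y)
X_sorted, y_sorted = curriculum_order(X, y, scores)
```

In practice, the difficulty criterion itself is the research question; the sorted data would then be fed to the model in stages rather than all at once.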
Simultaneously, working on projects like Teflon, AASIS and Kielibuusti enabled me to learn more about children’s ASR, speech assessment and tools that can aid language learners. Our best models have been successfully integrated into a mobile application that can aid immigrants in learning the Finnish language.
In 2022, we developed a system that can recognize different kinds of stuttering (e.g. word/phrase repetition, prolongation, sound repetition and others) and won the INTERSPEECH 2022 Stefan Steidl Computational Paralinguistics Award. Later, we investigated how the emotional state of speakers can be recognized from non-verbal vocal expressions (such as laughter, cries, moans, and screams). Our system achieved first place in both tasks of the ACM Multimedia ComParE challenge. Since then, I have also worked on multimodal solutions for emotion and humor detection.
My current work mainly focuses on training and understanding Self-Supervised Foundation models as part of our Extreme-scale LUMI project and the LAREINA project. Explainable AI and model interpretation have been a long-term interest of mine, and with these new models and computational resources, I have had the opportunity to explore new techniques. Recently, I have developed ways to find the relevant subspaces inside large foundation models, explore the concepts the models discover during pre-training, and understand the changes caused by the fine-tuning process. These techniques have enabled us to better understand our models and have guided us in designing new, better training algorithms.
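One attribution technique used in this line of work is integrated gradients, which scores each input (or embedding) dimension by accumulating gradients along a path from a baseline to the input; the highest-scoring dimensions then indicate a relevant subspace. The sketch below is a simplified numerical illustration on a toy linear probe, with hypothetical names of my own choosing, not the group's actual pipeline:

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=64, eps=1e-5):
    """Riemann-sum approximation of integrated gradients for a scalar
    function f, along the straight path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grad_sum = np.zeros_like(x, dtype=float)
    for a in alphas:
        point = baseline + a * (x - baseline)
        for i in range(x.size):
            e = np.zeros_like(x, dtype=float)
            e[i] = eps
            # central finite difference for df/dx_i at this path point
            grad_sum[i] += (f(point + e) - f(point - e)) / (2 * eps)
    return (x - baseline) * grad_sum / steps

def relevant_dimensions(attributions, k):
    # The k dimensions with the largest absolute attribution.
    return set(np.argsort(-np.abs(attributions))[:k])

# Toy "model": a linear probe over a 3-dimensional embedding.
w = np.array([3.0, -1.0, 0.1])
f = lambda x: float(w @ x)
x = np.array([1.0, 2.0, 1.0])
baseline = np.zeros(3)

attr = integrated_gradients(f, x, baseline)
# Completeness property: the attributions sum to f(x) - f(baseline).
```

For a real foundation model, the finite-difference loop would be replaced by automatic differentiation, and attributions would be aggregated over many inputs before selecting a subspace.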
As modern speech recognizers require a considerable amount of data, it became a priority to collect and annotate suitable corpora. In 2020, I joined the team creating the Donate Speech datasets (puhelahjat). This corpus, with its approximately 3200 hours of donated speech, enabled various other projects, including our FinW2V2 project at LUMI. Using this dataset and Aalto’s Finnish Parliament ASR Corpus 2008–2020, we have developed numerous Finnish ASR systems over the years.
Currently, I am also involved in the LAREINA project, building large speech foundation models and making them available to industrial partners.
Getman, Y., Grósz, T., Hiovain-Asikainen, K. & Kurimo, M. (2024), Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on northern Sámi, in Proc. of Interspeech. DOI: 10.21437/Interspeech.2024-479
Karakasidis, G., Kurimo, M., Bell, P. & Grósz, T. (2024), Comparison and analysis of new curriculum criteria for end-to-end ASR, Speech Communication p. 103113. DOI: 10.1016/j.specom.2024.103113
Moisio, A., Porjazovski, D., Rouhe, A., Getman, Y., Virkkunen, A., AlGhezi, R., Lennes, M., Grósz, T., Linden, K. & Kurimo, M. (2023), Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks, Language Resources and Evaluation 57(3), 1295–1327. DOI: 10.1007/s10579-022-09606-3
Phan, N., von Zansen, A., Kautonen, M., Grósz, T. & Kurimo, M. (2024), CaptainA: a self-study mobile app for practising speaking, in Proc. of Interspeech. https://www.isca-archive.org/interspeech_2024/phan24b_interspeech.pdf
Virkkunen, A., Sarvas, M., Huang, G., Grósz, T. & Kurimo, M. (2024), Investigating the clusters discovered by pre-trained AV-HuBERT, in Proc. of IEEE ICASSP 2024, pp. 11196–11200. DOI: 10.1109/icassp48485.2024.10447434
Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grósz, T., Kurimo, M., Salvi, G., Svendsen, T., Strömbergsson, S. et al. (2023), Developing an AI-assisted low-resource spoken language learning app for children, IEEE Access. DOI: 10.1109/access.2023.3304274
Grósz, T., Getman, Y., Al-Ghezi, R., Rouhe, A. & Kurimo, M. (2023), Investigating wav2vec2 context representations and the effects of fine-tuning, a case study of a Finnish model, in Proc. of Interspeech. DOI: 10.21437/interspeech.2023-837
Grósz, T., Virkkunen, A., Porjazovski, D. & Kurimo, M. (2023), Discovering relevant sub-spaces of BERT, wav2vec 2.0, ELECTRA and ViT embeddings for humor and mimicked emotion recognition with integrated gradients, in Proc. of the 4th Multimodal Sentiment Analysis Challenge and Workshop, pp. 27–34. DOI: 10.1145/3606039.3613102
Porjazovski, D., Getman, Y., Grósz, T. & Kurimo, M. (2023), Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference, in Proc. of the 31st ACM International Conference on Multimedia, pp. 9477–9481. DOI: 10.1145/3581783.3612848
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in the social sciences and humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.