Researcher of the Month: Pekka Posio

Photo: Maarit Kytöharju

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Pekka Posio tells us about a research project that explores the link between gender and language use in the Spanish-speaking world. The extensive CoLaGe Corpus compiled during the project will be available via the Language Bank.

Who are you?

I am Pekka Posio, Professor of Ibero-Romance Languages at the Faculty of Arts, University of Helsinki. I focus on Spanish and Portuguese, and my research covers sociolinguistics, pragmatics, and language variation and change. Currently, I am the head of discipline for Portuguese, Galician and Basque languages.

I studied Romance Languages and General Linguistics at the University of Helsinki, where I obtained my PhD in 2012. The topic of my dissertation was the expression of subject pronouns in Spanish and Portuguese. As a postdoctoral researcher, I worked in Salamanca, Berlin, Cologne and Ghent, studying impersonal constructions in Spanish and Portuguese. I also worked for three years as a university lecturer in Spanish at Stockholm University before returning to Helsinki in 2019 as an associate professor. In 2024, I was appointed professor.

What is your research topic?

Currently, my research focuses on language and gender in the Spanish-speaking world, and I lead the research project Gender, Society, and Language Use: Evidence from Mexico and Spain (2021-2025), funded by the Kone Foundation. Language and gender is a well-established area of research in English linguistics, but it has received less attention in Spanish studies.

In this project, we are particularly interested in the mechanisms that link society and gender to language use, and whether there are differences in the relationship between gender and language in different societies that use the same language. These questions will be approached through both sociolinguistics and social psychology. We have collected a wide range of data, including both spoken and transcribed language and socio-psychological data on our informants. By combining these data, we will be able to explore the links between language and gender in a completely new way and at the same time renew the concept of gender as a sociolinguistic variable. In addition to the traditional comparison of female and male speech, we use scalar variables such as speakers’ perceptions of their own masculinity and femininity, and gender-related attitudes and perceptions.

We study different phenomena of language use – for example, the prevalence of different grammatical persons and ways of interacting in speech – in two societies that share the same language but differ in terms of gender roles and norms. We collected data between 2022 and 2023 in Guadalajara, Mexico, and Valencia, Spain. The research data generated by this project will help to broaden and diversify our understanding of gender and its manifestations, particularly in the societies we studied.

The postdoctoral researchers in this project are Gloria Uclés Ramada, Sven Kachel, Andrea Carcelén Guerrero and Fien de Latte. The project has also employed a number of students as data collectors, transcribers and coders in Finland, Spain, Mexico and Germany.

How is your research related to Kielipankki – the Language Bank of Finland?

We have produced a corpus called Corpus for the Study of Language and Gender in Mexico and Spain (CoLaGe), which contains 111 hours and over one million words of recorded and transcribed speech from 127 informants. The corpus is divided into two main sub-corpora, one for Valencia (CoLaGe-V) and one for Guadalajara (CoLaGe-G), plus a smaller CoLaGe-D(iversity) corpus collected in Guadalajara, with informants representing gender and/or sexual minorities. In collecting the data, we have tried to make the sub-corpora as comparable as possible, with speakers from two age groups (30-40 and 60-70) and two countries. The data include sociolinguistic interviews, role-plays simulating conflict situations and material elicited for phonetic research in which informants describe images they have seen.

In addition to comparability, the collection of data was guided by the need to make all the extensive material available to other researchers, which is why a great deal of attention has been paid to issues such as pseudonymisation. The majority of the speech material has also been recorded on studio equipment, which allows it to be used for phonetic analysis. The Language Bank of Finland has been a natural location for the CoLaGe corpus since its inception. The social psychology data from the project will also be made available to researchers via the Finnish Social Science Data Archive.

Selected publications from the project

Carcelén Guerrero, A., Posio, P., Kachel, S. & Uclés Ramada, G. (Accepted 2025). CoLaGe: Corpus for the study of language and gender in two varieties of Spanish. Corpora. https://researchportal.helsinki.fi/files/328418218/CoLaGe-accepted.pdf

Uclés Ramada, G., Kachel, S. & Posio, P., 2025. Conflict, gender, and amount of talk: Gender differences in Spanish role play data. Pragmatics and Society. DOI: 10.1075/ps.23144.ucl

Posio, P., Kachel, S., & Uclés Ramada, G. 2024. Morphosyntactic stereotypes of speakers with different genders and sexual orientations: an experimental investigation. Linguistics. DOI: 10.1515/ling-2022-0143

More publications from Pekka Posio: https://researchportal.helsinki.fi/en/persons/pekka-posio

Corpus

Corpus for the Study of Language and Gender in Mexico and Spain (CoLaGe)

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in the social sciences and humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Simo Määttä

Photo: Veikko Somerpuro

Simo Määttä tells us about his research that is based on sociological translation studies, critical sociolinguistics and critical discourse studies.

Who are you?

I am Simo Määttä, Assistant Professor of Translation Studies at the Faculty of Arts, University of Helsinki. I am Head of the Translation Studies Research Community TRAST and hold the title of docent in French Studies. I teach in the Master’s programme in Translation and Interpreting at the University of Helsinki. I am also Chair of the Board of the Register of Legal Interpreters.

I received my PhD from the University of California, Berkeley in 2004 and have since worked at several universities in Finland, and since 2014 at the University of Helsinki.

What is your research topic?

My research is based on sociological translation studies, critical sociolinguistics and critical discourse studies. I am interested in how language use and other interactions are represented and what meanings are given to linguistic interactions – especially multilingual communication and linguistic variation.

One of my main research interests is public service (or community) and legal interpreting. In this field, I examine language ideologies, accuracy of interpreting, multimodality, the agency of participants in the interpreter-mediated encounter, the expression of empathy and the realisation of linguistic rights. In particular, I have studied lingua franca interpreting, where both the interpreter and the client speaking a foreign language communicate in a language that is not their first language. This is common, for example, when an asylum seeker, a migrant, or a foreign national who is suspected of a crime or is the victim of one communicates with an interpreter in French or English.

I lead the Translation, Immigration and Democracy project (2022-2025) funded by the Kone Foundation, where our research team analyses translation policies and practices in multilingual communication targeted at migrant populations. The research focuses on organisations (e.g. municipalities, associations, companies, universities, media) operating in the Helsinki metropolitan area (Helsinki, Espoo and Vantaa) and in Tallinn. The project combines theories and methods of functionalist and sociological translation studies and critical linguistics.

The project is founded on the idea that multilingualism constitutes not only an opportunity for democracy, but also a challenge: the language barrier prevents migrants from participating in social, cultural and political life and from becoming full members of their local community and society. Translation aims to promote migrants’ access to information and participation, but it does not reach all migrants. The project approaches translation as a practice of governmentality, through which power is exercised and produced. One of the objectives is to propose new solutions, together with different actors, to improve the quality of translation policies and practices.

I am also involved in the EU Horizon-funded project ARENAS (Analysis of and Responses to Extremist Narratives), coordinated by Professor Julien Longhi (Cergy Paris Université), in which our international, multidisciplinary consortium analyses the extremist narratives affecting and threatening European political and social life. We explore the nature of extremist narratives and seek to understand them, in particular those concerning science, gender and the Nation. By understanding how these narratives work, we aim to find ways to counter extremist narratives and thus contribute to the harmonious development of Europe.

Within the ARENAS project, I am involved in a work package related to the circulation of extremist narratives, coordinated by historian Steven Forti from the Autonomous University of Barcelona. The ARENAS team in Helsinki is led by Dr. Katalin Miklóssy, Jean Monnet Professor and Associate Professor of Political History. I am responsible for a qualitative research task on how extremist narratives circulate between political discourse, traditional media and new media. The qualitative data for the study are selected on the basis of the quantitative data produced and analysed in the other tasks of the work package.

I also analyse the theory of discourse, ideology (especially language ideology), performativity, and hate speech. My previous research has focused on the translation of sociolinguistic variation in literature and language policies related to regional and minority languages.

How is your research related to Kielipankki – the Language Bank of Finland?

In the part of the ARENAS project that I am responsible for, we use the corpora available in the Language Bank on speeches made in the Finnish Parliament, especially in plenary sessions. These data have allowed us to see exactly how the topics discussed in traditional and new media correspond to the political debate in Parliament. In addition, our research has made use of the ParlaMint corpus and a corpus compiled for the ARENAS project, which consists of social media posts by politicians in different countries.

I also used the Suomi24 corpus from the Language Bank in a study co-authored with Yrjö Lauranto to examine how online discussants express dissenting and sympathetic opinions about gender and sexual minorities. We also used Suomi24 data in articles written with Ulla Tuomarla and Karita Suomalainen in Finnish and English, analysing discussions on immigration.

Selected publications

Määttä, S. & Kinnunen, T. 2024. The Interplay between Linguistic and Non-verbal Communication in an Interpreter-mediated Main Hearing of a Victim’s Testimony. Multilingua: Journal of Cross-Cultural and Interlanguage Communication 43(3), 299–330. DOI: 10.1515/multi-2023-0153

Määttä, S., Kinnunen, T., Kuusi, P. & Probirskaja, S. 2024. Kohderyhmätietous monikielisen kriisiviestinnän asiantuntijatyössä koronapandemian aikana. Työelämän tutkimus 22(4), 555–587. https://journal.fi/tyoelamantutkimus/article/view/142675

Määttä, S. 2023. Linguistic and Discursive Properties of Hate Speech and Speech Facilitating the Expression of Hatred: Evidence from Finnish and French Online Discussion Boards. Internet Pragmatics 6(2), 156–172. DOI: 10.1075/ip.00094.maa

Määttä, S. & Wiklund, M. 2023. Resolving Comprehension Problems in a Telephone-interpreted Screening Interview. In: E. de Boe, J. Vranjes & H. Salaets (eds.) Interactional Dynamics in Remote Interpreting: Micro-analytical Approaches. New York: Routledge, 42–65. https://www.routledge.com/Interactional-Dynamics-in-Remote-Interpreting-Micro-analytical-Approaches/Boe-Vranjes-Salaets/p/book/9781032213286

Määttä, S. & Hall, M. 2022. Ideology and Discourse: Convergent and Divergent Developments. In: S. Määttä & M. Hall (eds.) Mapping Ideology in Discourse Studies. Boston & Berlin: De Gruyter Mouton, 1–20. DOI: 10.1515/9781501513602-001

Määttä, S. & Lauranto, Y. 2022. Eriävän ja myötämielisen mielipiteen esittäminen sukupuoli- ja seksuaalivähemmistöjä koskevissa Suomi24-keskusteluissa. Virittäjä 126(2), 205–230. https://journal.fi/virittaja/article/view/100240

Määttä, S., Puumala, E. & Ylikomi, R. 2021. Linguistic, Psychological, and Epistemic Vulnerability in Asylum Procedures: An Interdisciplinary Approach. Discourse Studies 23(1), 46–66. DOI: 10.1177/1461445620942909

Määttä, S., Suomalainen, K. & Tuomarla, U. 2021. Everyday Discourse as a Space of Citizenship: The Linguistic Construction of In-groups and Out-groups in Online Discussion Boards. Citizenship Studies 25(6), 773–790. DOI: 10.1080/13621025.2021.1968715

Vernet, S. & Määttä, S. 2021. Modalités syntaxiques et argumentatives du discours homophobe en ligne : chroniques de la haine ordinaire. Mots – Les langages du politique 125, 35–51. https://journals.openedition.org/mots/27943

Määttä, S., Suomalainen, K. & Tuomarla, U. 2020. Maahanmuuttovastaisen ideologian ja ryhmäidentiteetin rakentuminen Suomi24-keskustelussa. Virittäjä 124(2), 190–216. https://journal.fi/virittaja/article/view/81931

Researcher of the Month: Marko Jouste

Photo: Sigga-Marja Magga

Marko Jouste tells us about his research on Sámi culture and about his work with the Giellagas Corpus of Spoken Sámi languages.

Who are you?

I am Marko Jouste, a university lecturer and associate professor (title of docent) specializing in Sámi culture at the Giellagas Institute for Sámi Studies, University of Oulu. I have been an active member of the Giellagas Institute since the early 2010s, and my research there focuses on various aspects of Sámi culture, including music, history, and heritage. Additionally, I serve as the main developer of the Sámi Cultural Archive, located within the institute. Beyond my academic duties, I also work as a musician, performing with music groups such as Ulla Pirttijärvi & Ulda and Suõmmkar.

What is your research topic?

My research focuses primarily on Sámi music, culture and history, with a particular emphasis on historical audio recordings. Currently, I am leading several active research projects, including The Northern Sámi Fairy Tale Book 1956 – Returning Historical Archive Material to the Community and Developing Ethical and Legal Practices for Open Access (funded by the Kone Foundation), and Skolt Sámi Dance: The Transformative Journey of Tradition, Resilience, and the Arctic Quadrille, a collaborative project with dance researcher Petri Hoppu (funded by the Jenny and Antti Wihuri Foundation). Another project is Jaakko Sverloff’s Life Story – From Suonikylä in Petsamo Through the World Wars to a Leader of the Skolt Sámi (also supported by the Jenny and Antti Wihuri Foundation).

Additionally, I have contributed to the Research Council of Finland’s key project, the Skolt Sámi Memory Bank. This pilot project, operational between 2016 and 2018, focused on the management and cultural revitalization of Skolt Sámi music, language, and cultural materials preserved in sound archives in Finland. Through these projects, I aim to promote community engagement, advance ethical practices in archival work, and contribute to the revitalization and preservation of Sámi cultural heritage.

How is your research related to Kielipankki?

Kielipankki – The Language Bank of Finland is closely connected to my research through its integration with archival work. Since the 2010s, the Sámi Cultural Archive has collaborated with the Language Bank to develop Sámi language materials for broader use in both academic research and Sámi language communities. The Giellagas Institute’s Corpus of Spoken Sámi Languages currently includes three Sámi languages spoken in Finland: Northern Sámi, Aanaar (Inari) Saami, and Skolt Saami. The first sub-corpus added to the Language Bank was the Northern Sámi sample corpus. In spring 2025, this collection will be expanded with the Aanaar Saami spoken language corpus.

The FIN-CLARIN consortium, which is the organization behind the Language Bank of Finland, has also provided funding for corpus development work at the Sámi Cultural Archive in 2014, 2019, and 2022. This collaboration significantly enhances the accessibility, preservation, and usability of Sámi language materials, aligning with my broader focus on Sámi culture and heritage. In my research, I extensively use language technology tools such as the Korp service, which facilitates the analysis and exploration of linguistic data, particularly in the context of Sámi languages.

Publications

Petri Hoppu & Marko Jouste (2025). Skolt Saami Dance: The Transformative Journey of Tradition, Resilience, and the Arctic Quadrille. London: Bloomsbury. [In press]

Jouste, Marko (2022) ”Skolt Saami Leuʹdd. Tradition as a medium of individual and collective remembrance”. The Sámi World. Edited by Sanna Valkonen, Áile Aikio, Saara Alakorva and Sigga-Marja Magga. London: Routledge, pp. 53–71.

Jouste, Marko & Mettovaara, Jukka & Morottaja, Petter & Partanen, Niko (2022). Archive Infrastructure and Spoken Language Corpora for Saami Languages in Finland. The 6th Digital Humanities in the Nordic and Baltic Countries 2022 Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022. CEUR Workshop Proceedings. Aachen: RWTH Aachen University, pp. 269–278. https://ceur-ws.org/Vol-3232/paper25.pdf

Jouste, Marko & Lehtola, Veli-Pekka & Juutinen, Markus & Tanhua, Sonja (2022). ”Jääkk Sverloff johtajana ja kulttuuritulkkina – Kolttasaamelaisten historian käänteitä 1900-luvulla”. [Jääkk Sverloff as a Leader and a Cultural Interpreter – Turning Points of Skolt Saami History in the 20th Century]. Suomen rajaseutujen kolonialismi. [Colonialism of Finnish Borderlands]. Edited by Rinna Kullaa, Janne Lahti and Sami Lakomäki. Helsinki: Gaudeamus.

Jouste, Marko (2020). ”Suonikylän kolttasaamelainen itkuperinne 1900-luvulla”. [The Skolt Saami Lament Tradition of Suonikylä in the 20th Century]. Etnomusikologian vuosikirja Vol 32. Edited by Janne Mäkelä, Kaj Ahlsved and Viliina Silvonen. Helsinki: Suomen etnomusikologinen seura, pp. 10–45. https://doi.org/10.23985/evk.90118

Marko Jouste, Markus Juutinen, Eino Koponen (2020). ”Kolttasaamelaisen Näskk Moshnikoffin leuʹdd-kielen idiolekti”. [The Idiolect of leuʹdd Language of Skolt Saami Näskk Moshnikoff]. Kulttuurintutkimus Vol 37, 1–2, pp. 32–56. Edited by Janne Saarikivi and Pirjo Kristiina Virtanen. Joensuu: Kulttuurintutkimuksen seura ry. https://journal.fi/kulttuurintutkimus/article/view/98099

Taarna Valtonen, Kati Kallio, Marko Jouste (2019). ”Olaus Sirman runojen vertailevaa luentaa -runojen poetiikka suhteessa suullisiin ja kirjallisiin lähikulttuureihin”. [Comparative Reading of Poems by Olaus Sirma. The Poetics of Poems in Relation to Oral and Literary Cultures Nearby]. Suomalais-Ugrilainen Seuran Aikakauskirja 97. Helsinki: Suomalais-Ugrilainen Seura, pp. 109–152. https://doi.org/10.33340/susa.75266

Marko Jouste, Markus Juutinen, Miika Lehtinen (2019). ”Isak Saba ja Paččjogas 1919:s čohkejuvvon nuortalaš leuʹddat. Isak Saba og de skoltesamiske leuʹddene som ble samlet inn i Paččjokk i 1919”. [Isak Saba and the Skolt Saami Leuʹdds Collected in Paččjogg in 1919]. Optegnelser. Isak Sabas folkeminnesamling. Čállosat. Isak Saba álbmotmuitočoakkáldat, Norsk Folkeminnelags skrifter 173. Oslo: Scandinavian Academic Press, pp. 283–301.

Jouste, Marko (2017). ”Áillohaš ja uuden joiun synty”. [Nils-Aslak Valkeapää and the Birth of the New Yoik]. Minä soin. Mun čuojan: Kirjoituksia Nils-Aslak Valkeapään elämäntyöstä. Edited by Taarna Valtonen and Leena Valkeapää. Rovaniemi: Lapland University Press, pp. 233–258.

Marko Jouste (2011). Tullâčalmaaš kirdâččij ’tulisilmill lenteli’ – Inarinsaamelainen 1900-luvun alun musiikkikulttuuri paikallisen perinteen ja ympäröivien kulttuurien vuorovaikutuksessa. [The One Who Flew with the Fire eyes – The Musical Culture of the Aanar Sámi People in the Interaction of the Local Tradition and the Neighbouring Cultures]. Acta Universitatis Tamperensis 1650. Tampere: Tampere University Press. http://urn.fi/urn:isbn:978-951-44-8551-0

Corpora

The Giellagas Corpus of Spoken Saami Languages

More information

Giellagas Institute | University of Oulu

Researcher of the Month: Tamás Grósz

Photo: Szabina Korbai

Tamás Grósz tells us about his research on speech technology.

Who are you?

My name is Tamás Grósz and I am a Research Fellow in the Speech Recognition group at the Department of Information and Communications Engineering of Aalto University.

What is your research topic?

During my PhD years, my research was focused on Speech Technology, specifically on developing new deep-learning-based solutions for Automatic Speech Recognition (ASR). Although my main interest was acoustic modelling, I was also active in other areas. Paralinguistics, in particular, piqued my interest, and I worked on a wide variety of tasks. I regularly participated in the Interspeech ComParE challenges and won several times over the years. Perhaps the most notable of our systems is the one that automatically assesses the condition of patients suffering from Parkinson’s disease. Besides the competitions, I was also part of a project that concentrated on developing a speech-based solution for early detection of mild cognitive impairment. In the last years of my studies, my focus shifted towards silent speech interfaces. I had the pleasure of working with state-of-the-art prototypes and developing new systems that could generate speech from ultrasound tongue movement videos.

After graduation, I joined Mikko Kurimo’s lab as a postdoc, where I had an opportunity to work on other topics, including language modelling and AI explainability. Initially, I worked on subword-based language models for agglutinative languages like Hungarian and Finnish. While working with various models, I noticed the importance of curriculum learning. As a spin-off project, I started investigating different ways of estimating the difficulty of training samples and constructing new curricula for AI models.

Simultaneously, working on projects like Teflon, AASIS and Kielibuusti enabled me to learn more about children’s ASR, speech assessment and tools that can aid language learners. Our best models have been successfully integrated into a mobile application that can aid immigrants in learning the Finnish language.

In 2022, we developed a system that can recognize different kinds of stuttering (e.g. word/phrase repetition, prolongation, sound repetition and others) and won the INTERSPEECH 2022 Stefan Steidl Computational Paralinguistics Award. Later, we investigated how the emotional state of speakers can be recognized from non-verbal vocal expressions (such as laughter, cries, moans, and screams). Our system achieved first place for both tasks in the ACM Multimedia ComParE competition. Since then, I have also worked on multimodal solutions for emotion and humor detection.

My current work mainly focuses on training and understanding self-supervised foundation models as part of our Extreme-scale LUMI project and the LAREINA project. Explainable AI and model interpretation have been long-term interests of mine, and with these new models and computational resources, I have had the opportunity to explore new techniques. Recently, I have developed ways to find the relevant subspaces inside large foundation models and explore the concepts discovered by the models during pre-training, as well as understand the changes caused by the fine-tuning process. These techniques have enabled us to better understand our models and guided us in designing new, better training algorithms.

How is your research related to Kielipankki?

As modern speech recognizers require a considerable amount of data, it became a priority to collect and annotate suitable corpora. In 2020, I joined the team creating the Donate Speech datasets (puhelahjat). This corpus, with its approximately 3200 hours of donated speech, enabled various other projects, including our FinW2V2 project at LUMI. Using this dataset and Aalto’s Finnish Parliament ASR Corpus 2008-2020, we have developed numerous Finnish ASR systems over the years.

Currently, I am also involved in the LAREINA project, building large speech foundation models and making them available to industrial partners.

Recent publications

Getman, Y., Grósz, T., Hiovain-Asikainen, K. & Kurimo, M. (2024), Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on northern Sámi, in Proc. of Interspeech. DOI: 10.21437/Interspeech.2024-479

Karakasidis, G., Kurimo, M., Bell, P. & Grósz, T. (2024), Comparison and analysis of new curriculum criteria for end-to-end ASR, Speech Communication p. 103113. DOI: 10.1016/j.specom.2024.103113

Moisio, A., Porjazovski, D., Rouhe, A., Getman, Y., Virkkunen, A., AlGhezi, R., Lennes, M., Grósz, T., Linden, K. & Kurimo, M. (2023), Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks, Language Resources and Evaluation 57(3), 1295–1327. DOI: 10.1007/s10579-022-09606-3

Phan, N., von Zansen, A., Kautonen, M., Grósz, T. & Kurimo, M. (2024), CaptainA a self-study mobile app for practising speaking, in Proc. of Interspeech. https://www.isca-archive.org/interspeech_2024/phan24b_interspeech.pdf

Virkkunen, A., Sarvas, M., Huang, G., Grósz, T. & Kurimo, M. (2024), Investigating the clusters discovered by pre-trained AV-Hubert, in Proc. of IEEE ICASSP 2024, pp. 11196–11200. DOI: 10.1109/icassp48485.2024.10447434

Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grósz, T., Kurimo, M., Salvi, G., Svendsen, T., Strömbergsson, S. et al. (2023), Developing an AI-assisted low-resource spoken language learning app for children, in IEEE Access. DOI: 10.1109/access.2023.3304274

Grósz, T., Getman, Y., Al-Ghezi, R., Rouhe, A. & Kurimo, M. (2023), Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model, in Proc. of Interspeech. DOI: 10.21437/interspeech.2023-837

Grósz, T., Virkkunen, A., Porjazovski, D. & Kurimo, M. (2023), Discovering relevant sub-spaces of Bert, wav2vec 2.0, Electra and ViT embeddings for humor and mimicked emotion recognition with integrated gradients, in Proc. of the 4th Multimodal Sentiment Analysis Challenge and Workshop, pp. 27–34. DOI: 10.1145/3606039.3613102

Porjazovski, D., Getman, Y., Grósz, T. & Kurimo, M. (2023), Advancing audio emotion and intent recognition with large pre-trained models and Bayesian inference, in Proc. of the 31st ACM International Conference on Multimedia, pp. 9477–9481. DOI: 10.1145/3581783.3612848

Researcher of the Month: Sofoklis Kakouros

Photo: Sofoklis Kakouros

Sofoklis Kakouros tells us about his research on prosody and its associated phenomena.

Who are you?

I am Sofoklis Kakouros, a postdoctoral researcher with the Phonetics and Speech Synthesis Research Group in the Department of Digital Humanities at the University of Helsinki. Before joining this group, I held research positions at different universities across Finland and in the Netherlands, and I also worked in the industry as a speech scientist. My background centers on signal processing, cognitive science, and phonetics.

What is your research topic?

My research interests are rooted in speech and language, with a particular emphasis on understanding prosody and its associated phenomena. Prosody is about how we say something rather than what we say; it adds meaning beyond the words themselves. This includes elements like intonation and timing. Over the years, I have explored various aspects of prosody, focusing on information-theoretic processes within this domain. Overall, my work aims to improve our understanding of how acoustic and linguistic variations are statistically organized into the prosody we perceive. For the past few years, I have been working in the Research Council of Finland project titled “Computational Modeling of Prosody in Speech”, which aims to understand the statistical organization in speech acoustics and its connections to prosodic dimensions such as prominence and emotions. This research has numerous applications, including the prosodic analysis of dialectal varieties and parliamentary speech.

How is your research related to Kielipankki?

To effectively analyze and train computational models for speech, an increasing amount of data is required. Kielipankki offers a diverse platform that provides access to the essential resources needed for my research, including materials for speech and language studies. In a recent project conducted by our group, I utilized the Finnish ASR corpus from Kielipankki to analyze recordings of Finnish parliamentary speeches.

Recent publications

Vainio, M., Suni, A., Šimko, J., and Kakouros, S. (2024). The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech. In Proceedings of the Conference of the Speech Prosody Special Interest Group (SProSIG) of the International Speech Communication Association – Speech Prosody (SpeechPro-2024), Leiden, The Netherlands, pp. 662–666. DOI: 10.21437/SpeechProsody.2024-134

Kakouros, S., Šimko, J., Vainio, M., and Suni, A. (2023). Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody. In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW-2023), Grenoble, France, pp. 127–133. DOI: 10.21437/SSW.2023-20

Kakouros, S. and O’Mahony, J. (2023). What does BERT learn about prosody? In R. Skarnitzl, & J. Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (ICPhS-2023) (pp. 1454–1458). GUARANT International spol. s r.o., Prague, Czechia. https://www.internationalphoneticassociation.org/icphs-proceedings/ICPhS2023/full_papers/622.pdf

Kakouros, S., Stafylakis, T., Mošner, L., and Burget, L. (2023). Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2023), Rhodes, Greece, pp. 1–5. DOI: 10.1109/ICASSP49357.2023.10094673

Corpora

Aalto Finnish Parliament ASR Corpus 2008–2020

Researcher of the Month: Katri Hiovain-Asikainen

Photo: Kai Lukander

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Katri Hiovain-Asikainen describes her research on spoken Sámi languages and speech synthesis.

Who are you?

I am Katri Hiovain-Asikainen, now in my fourth year as a speech and language technologist in the Divvun group at the Arctic University of Norway. Our group develops language and speech technology applications especially for the Sámi languages, but also for other minority languages. I am responsible for the design and implementation of speech technology projects, in which collecting different types of audio data and building speech corpora for different Sámi languages is essential.

This year, our team has published the world’s first speech synthesis for Lule Sámi, and updated North Sámi speech synthesis to modern standards. In October, we also published the world’s first South Sámi synthesis. All the software and tools that we have developed are free and easily accessible to all.

My background is in linguistics and phonetics, and I received my PhD from the University of Helsinki in autumn 2023. The topic of my dissertation was the influence of the majority languages on the spoken North Sámi language. The aim of the research was to investigate the variations of prosodic features, such as quantity and intonation, in the regional spoken varieties of North Sámi, where the contacts with the majority languages (Finnish and Norwegian) are very close and multidimensional.

What is your research topic?

Currently, I focus on the development of speech synthesis and automatic speech recognition for three Sámi languages: North, Lule and South Sámi, all of which are official languages in Norway. There is a great need for speech technology applications in the Sámi-speaking communities, as written forms of the Sámi languages are relatively new, and not all Sámi speakers have had the opportunity to learn the written language in school in the same way as the speakers of the majority languages. Speech technology enables the oral use of minority languages in new contexts: for example, as a reading assistant at school, for learning the pronunciation, as an easy-to-use tool for dyslexic or visually impaired people, and in general, even for listening to the news instead of reading. Audio books and other spoken language content are also becoming more common, allowing you to listen to books while doing something else with your hands. Today, a smart home and smart loudspeakers speak Lule Sámi in a home where the language of the family is Lule Sámi. This strengthens the role of the language and supports the revitalisation of Sámi languages at a new level.

An automatic speech recognizer, on the other hand, enables different speech interfaces, for example in the car and at home, and of course on smart devices. It will soon be possible to dictate texts in Sámi languages and, for example, to produce automatic transcriptions for old archival recordings so that researchers can make better use of them. The possibilities are endless.

The focus of my research is strongly related to speech technology, and I am currently a visiting researcher in the Phonetics and Speech Synthesis Research Group at the University of Helsinki. In collaboration with other researchers in the group, we have been working on automatic dialect recognition, where the aim is to automatically identify the speaker’s dialect based, among other things, on various prosodic features. In addition, I am interested in different methods of speech synthesis evaluation, for example, how well the speech synthesis learns to produce complex and rare prosodic features such as quantity.

How is your research related to Kielipankki?

In the Divvun group we are currently preparing various Sámi speech corpora for publication via Kielipankki. There are Sámi archive recordings in different countries, but they are relatively scattered or not necessarily processed for publication, and transcriptions are not always available. We believe that making these existing materials more accessible would help many researchers and developers of speech technologies without making new recordings.

I have also gained access to a North Sámi speech corpus (Giellagas) in Kielipankki for research purposes, and the corpus has been very useful because of its versatility, especially in the study of automatic dialect recognition. Our aim at Divvun is to make similar corpora available as soon as possible. However, in the case of indigenous and minority languages, the publication of the corpora should be treated with caution, which we respect in our work.

Recent publications

Hiovain-Asikainen, K. (2023). Prosodic change and majority language influence in spoken North Sámi varieties. Helsingin yliopisto, Humanistinen tiedekunta, Digitaalisten ihmistieteiden osasto. Helsingin yliopisto. http://urn.fi/URN:ISBN:978-951-51-9406-0

Kakouros, S., & Hiovain-Asikainen, K. (2023). North Sámi dialect identification with self-supervised speech models. arXiv Preprint arXiv:2305.11864. In Proceedings of the 24th INTERSPEECH Conference (pp. 5306–5310). https://doi.org/10.48550/arXiv.2305.11864

Pirinen, F., Moshagen, S., & Hiovain-Asikainen, K. (2023, May). GiellaLT—a stable infrastructure for Nordic minority languages and beyond. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 643-649). https://aclanthology.org/2023.nodalida-1.63/

Hiovain-Asikainen, K., & de la Rosa, J. (2023). Developing TTS and ASR for Lule and North Sámi languages. In Proceedings of the 2nd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL). http://dx.doi.org/10.21437/SIGUL.2023-11

Corpora and Tools

Giellagas, Samples of Northern Saami
Borealium – tools for the small languages of the Nordic countries.

More information

The Divvun research group (The Arctic University of Norway)

Researcher of the Month: Elina Vaahensalo

Photo: Elina Vaahensalo

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Elina Vaahensalo tells us about her research on confrontation and otherness in online discussions.

Who are you?

I am Elina Vaahensalo, doctoral researcher in Digital Culture at the Faculty of Humanities, University of Turku, in the Degree Programme in Digital Culture, Landscape and Cultural Heritage. In addition, at the beginning of October I will start as a researcher in the Academy project SoliPro (”Solidariteetit käytäntöön – Nuorten arkiyhteisöt tunnustuksen lähteenä ja ehkäisevän sosiaalityön areenana”), coordinated by the University of Tampere.

What is your research topic?

In my dissertation, I examine online discussion that produces otherness, especially from the perspective of anonymous Finnish-language online communities. I am interested in how confrontation, alienation and even violent hostility are constructed in Finnish-language online discussion cultures, and what different forms the concept of otherness takes in these cultures. Otherness is a fruitful conceptual starting point for research on online discussions because it can be used in a variety of ways to outline descriptions of community, group identities, and the sense of being an outsider or downgraded and different. In Finnish-language online discussions, otherness takes very different – and also contradictory – forms: the other can be an enemy who is violently and dehumanisingly opposed, but also a relatable fellow sufferer with whom one shares common, peer-based experiences of marginalisation.

In addition, my colleague Lilli Sihvonen and I have studied online cultures from the framework of media archaeology. In particular, we are interested in what happens when a cybercultural phenomenon or object – a meme that has gone viral or a social media platform – dies, and what kind of afterlife can be associated with it. Our interest is driven by the perception of the vulnerability of digital phenomena. In our view, online phenomena in Finnish, for example, are particularly vulnerable because they often do not spread globally and are therefore not stored very widely online. In storing Finnish-language online cultural phenomena, Kielipankki has therefore done a valuable job by depositing online discussions from both the Suomi24 forum and the Ylilauta forum.

In my research for the SoliPro project, I will continue my work on othering, but from an even more robust perspective of community and solidarity. My aim is to examine the descriptions of community, otherness and solidarity shared by young people on social media.

How is your research related to Kielipankki?

In my more recent research, I have used qualitative and ethnographic online discussion data that I collected myself, but the Suomi24 data from Kielipankki also played an important role at the beginning of my research career. In 2017, I started as a research assistant in the ”Citizen Mindscapes” consortium project, funded by the Research Council of Finland. The project, in which I also wrote my Master’s thesis, was built around the Suomi24 data from Kielipankki. Already then, I developed the concept of othering online discourse, and tested its identification and quantitative measurement using the Suomi24 data. Experimenting with corpus-based research was quite a dive into the unknown for a cultural researcher such as myself. However, with all its challenges, it was a valuable lesson to see how working on a Master’s thesis provides opportunities to try out different research tools, including those outside one’s own comfort zone.

From time to time, I also teach digital culture students, and my teaching focuses on the tools and methods that can be used for conducting qualitative research on online discussions. I always encourage my students to use the online discussion corpora in Kielipankki, as they are unique collections of Finnish online culture, and they also prove that the language used online is worth saving and remembering.

Recent publications

Vaahensalo, E., & Sihvonen, L. (2022). Elävät, kuolleet ja elävät kuolleet keskustelufoorumit: verkkoyhteisöjen elämänvaiheet ja niiden tutkiminen. In R. Mähkä, M. Ahonen, N. Heikkilä, S. Ollitervo, & M. Räsänen (Eds.), Kulttuurihistorian tutkimusmenetelmät (pp. 411-429). Turun yliopisto.

Vaahensalo, E. (2022). ”Uuniin siitä” – Väkivaltainen ja toiseuttava verkkokeskustelu Ylilaudalla. Lähikuva – audiovisuaalisen kulttuurin tieteellinen julkaisu, 35(3), 29–44. https://doi.org/10.23994/lk.121893

Vaahensalo, E. (2022). Organisaatiot ja toiseuttava verkkokeskustelu. In H. Kantanen & M. Koskela (Eds.), Procomma Academic 2022: Poikkeuksellinen viestintä. ProCom – Viestinnän ammattilaiset ry. https://doi.org/10.31885/2022.00001

Vaahensalo, E. (2021). Samanlaista toiseuttamista, erilaisia toisia: Toiseuttavan verkkokeskustelun muodot anonyymeissä suomenkielisissä keskustelukulttuureissa. Media & Viestintä, 44(3), 1–29. https://doi.org/10.23983/mv.111507

Vaahensalo, E. (2021). Kontekstualisointimalli sosiaalisen median lähdekritiikin avaimena. Informaatiotutkimus, 40(3), 110–141. https://doi.org/10.23978/inf.107897

Vaahensalo, E. (2021). Creating the other in online interaction: Othering online discourse theory. In J. Bailey, A. Flynn, & N. Henry (Eds.), Handbook on technology-facilitated violence and abuse: International perspectives and experiences (pp. 227-246). Emerald Studies on Digital Crime, Technology & Social Harms. https://doi.org/10.1108/978-1-83982-848-520211016

Suominen, J., Saarikoski, P., & Vaahensalo, E. (2019). Digitaalisia kohtaamisia: Verkkokeskustelut BBS-purkeista sosiaaliseen mediaan. Helsinki: Gaudeamus.

Researcher of the Month: Aku Rouhe

Photo: Jasmine Gustafsson

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Aku Rouhe tells us about his research on speech recognition. His current work includes, among other things, fine-tuning large language models that are optimized for Finnish and Nordic languages. These openly available LLMs have been created through successful academia-enterprise collaboration.

Who are you?

I am Aku Rouhe. For several years, I did research in the Aalto University Speech Recognition research group, and defended my doctoral thesis there this past February. After Aalto, I moved to Silo AI (now owned by AMD), where I work with large language models (LLMs) – I have moved from speech to text. My interest in language also extends to my free time, which I spend on creative writing.

What is your research topic?

In my doctoral thesis, I compared end-to-end models with more traditional multi-model decomposed systems. In recent years, both academia and commercial deployments in speech recognition have largely moved to end-to-end models. However, my work showed that multi-model decomposed systems remain a competitive alternative, for instance, in terms of recognition accuracy. Indeed, the main advantage of end-to-end models is probably their simplicity.

End-to-end models often require vast training resources. Thus, it was important for me to study end-to-end models applied to under-resourced languages as well.

My current work at Silo is on fine-tuning large language models such as Poro and Viking, which are optimized for Finnish and Nordic languages. These LLMs were developed in a collaborative research project between Silo and TurkuNLP.

How is your research related to Kielipankki?

End-to-end models hunger for data, so large corpora are needed. I was involved in compiling the Aalto Finnish Parliament ASR Corpus 2008-2020, which consists of Finnish Parliament plenary session recordings, and also in the Lahjoita Puhetta project, where volunteers donated their speech to produce the Puhelahjat corpus. I got to combine both of these large speech corpora in an article that was published when I was finalizing my PhD, at a time when I was involved with the LAREINA project. Nowadays, the Finnish speech recognition resources are respectable for a language spoken by so few.

Recent publications

Rouhe, A., Grósz, T., Kurimo, M. 2024. Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 623-638, 2024. doi: 10.1109/taslp.2023.3336517

Virkkunen, A., Rouhe, A., Phan, N. et al. 2023. Finnish parliament ASR corpus. Lang Resources & Evaluation 57, 1645–1670 (2023). doi: 10.1007/s10579-023-09650-7

Moisio, A., Porjazovski, D., Rouhe, A. et al. 2023. Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Lang Resources & Evaluation 57, 1295–1327 (2023). doi: 10.1007/s10579-022-09606-3

Rouhe, A., Virkkunen, A., Leinonen, J., Kurimo, M. 2022. Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0. Proc. Interspeech 2022, 3543–3547. doi: 10.21437/Interspeech.2022-11318

Researcher of the Month: Tuukka Törö

Photo: Riina Kiianmies

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Tuukka Törö tells us about his research on Finnish speech synthesis. Neural network models, which are trained with large amounts of audio data from varied datasets, enable researchers to analyze speech in new ways.

Who are you?

I am Tuukka Törö. I have been working as a doctoral researcher at the University of Helsinki’s Phonetics and Speech Synthesis Research Group since the beginning of this year. My background is in linguistics and phonetics, and I hold a BA in English studies from the University of Malmö and an MA in Phonetics from the University of Helsinki. After writing my Master’s thesis on controlling speaking styles in speech synthesis, I spent some time working with YLE on AI radio projects where we created synthetic ‘actors’ for radio features.

In my current position, I work on the Academy of Finland-funded project Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis. The project’s aim is to develop text-to-speech (TTS) synthesis inspired by the predictive processing theory of human cognition.

While my focus has become more technically inclined, the primary inspiration behind my work stems from a fascination with how social structures influence speech, from macro level variation to how people convey social dynamics in specific contexts.

What is your research topic?

Currently I am researching macro level language variation using neural-network models built for TTS and speech recognition. While the models’ original purpose is in technological applications, they enable us to analyze speech in new ways. As the models are trained with large amounts of audio, they can be used to model ’wild’ data of varying quality on a large scale instead of picking apart specific acoustic features from small, professionally recorded datasets.

Within the academy project, my aim is to tie together sociolinguistic variation with the predictive processing and speech synthesis side of things. Hopefully, in the coming years we will learn something new about how humans perceive social cues in speech and how high-level social predictions can be utilized to improve speech synthesis.

How is your research related to Kielipankki?

I often use corpora from Kielipankki such as Samples of Spoken Finnish (SKN), FinSyn (to be available in Kielipankki), and most of all Donate Speech (Lahjoita puhetta). In order to train speech synthesizers that can be controlled for social variables, such as age, gender, and dialect, we need a large amount of audio data from people with a rich variety of backgrounds. With Finnish being a relatively small language, it is vital to have a concentrated effort to build large datasets like the Donate Speech corpus.

Recent publications

Törö, T., Suni, A. and Šimko, J. (2024). Analysis of regional variants in a vast corpus of Finnish spontaneous speech using a large-scale self-supervised model, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024-8

Šimko, J., Törö, T., Vainio M., and Suni, A. (2023). Prosody under control: Controlling prosody in text-to-speech synthesis by adjustments in latent reference space, Proceedings of the 20th International Congress of Phonetic Sciences, Prague, Czech Republic. http://hdl.handle.net/10138/565382

Other related work

Rembrandt, kissa ja lapsi (AI radio drama, Yle, 2023)
Toivon paluu (AI radio drama, Yle, 2023)

Researcher of the Month: Heidi Niva

Photo: Emmi Pollari

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Heidi Niva tells us about her research on Finnish grammatical phenomena and introduces a Vepsian-Finnish dictionary project. In a joint research project, she also aims to evaluate a corpus of online discussions as a source for language researchers.

Who are you?

I am Heidi Niva, a postdoc Finnish language researcher. I am currently a substitute lecturer of Finnish language and culture at the University of Helsinki. I am also actively involved in the LOST DOC collective, a community for postdoc language researchers.

What is your research topic?

Both in my dissertation and afterwards, grammatical phenomena have been the focus of my research. Among other things, I have studied the structures that are used to express futurity in Finnish. Now I am involved in a joint project where we study the structures expressing avertivity, i.e. the non-realization of events. I am also working on a project where we aim to compile a Vepsian-Finnish dictionary. Vepsian, also known as Veps, is a related but endangered language spoken south of Lake Onega (Ääninen). In addition to the dictionary project, I am also doing research on adpositional structures in the Veps language.

How is your research related to Kielipankki?

In my research on Finnish grammar, I am less interested in normativity than in how people actually use linguistic structures, and what types of meanings and connotations these structures can convey. For this purpose, I have used the resources in Kielipankki: the Suomi24 Sentences Corpus 2001-2020 for the study of Modern Finnish, and the corpora of Early Modern Finnish and Old Literary Finnish for the study of the older forms of the language. I am also currently using the Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s and the Finnish News Agency Archive Corpus.

In fact, the Suomi24 Sentences Corpus 2001-2020 is itself the subject of our joint research with Max Wahlström and Olli Silvennoinen. What is interesting about this corpus is that it largely represents informal language use but is still different from spoken language in terms of its linguistic features. In addition, the corpus is a diverse source in terms of the formality of language use and the occurrence of linguistic phenomena as they seem to be influenced by the various topics of discussion and their styles of expression. In our forthcoming article, we will critically examine what kind of source the Suomi24 corpus actually is for a language researcher.

Publications

Niva, Heidi 2022: Suomen progressiivirakenne intentioiden ja ennakoinnin ilmaisuissa. Helsinki: Helsingin yliopisto. Available: http://urn.fi/URN:ISBN:978-951-51-8727-7

Niva, Heidi 2024: Tulen muistamaan hänet aina. Tulla V-mAAn vääjäämättömän tulevaisuuden ilmaisukeinona. Virittäjä 128(2), 238–263. DOI: 10.23982/vir.126878

Researcher of the Month: Krister Lindén

Photo: Juhani Jokinen

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Krister Lindén, the Director of the Language Bank, describes how researchers in Humanities can benefit from the use of artificial intelligence in their corpus-based research.

Who are you?

I am Krister Lindén. At the University of Helsinki, I am Research Director for Language Technology at the Department of Digital Humanities, and Deputy Team Leader at the Centre of Excellence for Ancient Near Eastern Empires. For national research infrastructures, I am the Director of the Language Bank of Finland, the National Coordinator of FIN-CLARIN, and the PI of FIN-CLARIAH. At the EU level, I am Chair of the National Coordinators Forum of CLARIN, a research infrastructure for the humanities and social sciences, and a member of the CLARIN Legal Issues Committee (CLIC).

What is your research topic?

I have always been interested in language technology and its application and, due to my involvement in the Language Bank, increasingly also in the prerequisites for developing and applying technology:

How can we use data to answer a broad range of research questions in the humanities and social sciences?
Where can we obtain development and test data to develop and evaluate our data processing methods?
Under what conditions can data be shared with other researchers so that they can verify the proclaimed performance of the methods?

An independent evaluation of methods is important to ensure progress and that we find the best methods in each case. If only a preliminary evaluation is needed, and a small-scale experiment is sufficient, you can give ChatGPT a few examples to see how it copes with the task. If there is insufficient data to reliably use a statistical method, and the task requires a high precision method, it may be quicker to use manually developed methods. On the other hand, if there is enough data, a suitable machine learning method is available, and the processing environment performance is sufficient, this combination often provides the most reproducible development path.

All the above development paths are data-driven and require data to be shared with other researchers for replication. In previous years, there has been strong enthusiasm for completely open source data sets. While this is still a desirable goal, there are many datasets that, for one reason or another, cannot be made available to everyone. Gradually, our community of researchers, together with lawmakers, has succeeded in developing a legal framework for data access that is open enough for academic researchers to study the data and verify the results in a relatively straightforward way, while keeping the audience with access small enough that personal data is not put at risk and copyrights are not infringed.

A new development need is to create a method for researchers in the humanities and social sciences to discuss the content of datasets which they deposit in the Language Bank with an AI.

How is your research related to Kielipankki?

The Language Bank provides both a platform for tool development and an opportunity to show how different types of research-oriented datasets can be shared with other researchers in a safe and legal way.

Recent publications

Jauhiainen, T., Zampieri, M., Baldwin, T. C., & Linden, K. (2024). Automatic Language Identification in Texts. (Synthesis Lectures on Human Language Technologies). Springer. https://doi.org/10.1007/978-3-031-45822-4

Jauhiainen, T., Piitulainen, J., Axelson, E., Dieckmann, U., Lennes, M., Niemi, J., Rueter, J., & Linden, K. (2024). Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification. In D. Fišer, M. Eskevich, & D. Bordon (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): ParlaCLARIN IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (pp. 48-56). (International conference on computational linguistics), (LREC proceedings). European Language Resources Association (ELRA). https://researchportal.helsinki.fi/files/312866811/ArtikkeliJulkaistu.pdf

Sahala, A., & Linden, K. (2023). BabyLemmatizer 2.0 – A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages. In A. Anderson, S. Gordin, B. Li, Y. Liu, & M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 (pp. 203-212). INCOMA. https://aclanthology.org/2023.alp-1.23

Linden, K., Niemi, J., & Kontino, T. (Eds.) (2023). CLARIN Annual Conference Proceedings 2023. (CLARIN Annual Conference Proceedings). CLARIN ERIC. https://researchportal.helsinki.fi/files/298353929/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf

Lindén, K., Ruokolainen, T., Hämäläinen, L., & Harviainen, J. T. (2023). Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC). In M. M. Rantanen, S. Westerstrand, O. Sahlgren, & J. Koskinen (Eds.), Proceedings of the Conference on Technology Ethics 2023 – Tethics 2023 (pp. 114-131). (CEUR Workshop Proceedings; Vol. 3582). CEUR-WS.org. https://researchportal.helsinki.fi/files/295005165/FP_10.pdf

Kamocki, P., Linden, K., Puksas, A., & Kelli, A. (2023). EU Data Governance Act: Outlining a Potential Role for CLARIN. In T. Erjavec, & M. Eskevich (Eds.), Selected papers from the CLARIN Annual Conference 2022 (pp. 57-65). (Linköping Electronic Conference Proceedings; No. 198). CLARIN ERIC. https://doi.org/10.3384/ecp198006

Linden, K., Jauhiainen, T., & Hardwick, S. (2023). FinnSentiment: A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation, 57(2), 581-609. https://doi.org/10.1007/s10579-023-09644-5

Axelson, E., Hardwick, S., & Linden, K. (2023). HFST Training Environment and Recent Additions. In A. Hurskainen, K. Koskenniemi, & T. P. (Eds.), Rule-Based Language Technology (pp. 60-69). (NEALT Monograph Series; No. 2[1]). Northern European Association for Language Technology. http://hdl.handle.net/10062/89595

Researcher of the Month: Juraj Šimko

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juraj Šimko tells us about his research on speech articulation and prosody. The Phonetics and Speech Synthesis Research Group at the University of Helsinki also aims to use large language models for finding answers to certain theoretical questions related to speech.

Who are you?

I am a University Lecturer in Phonetics and have been working at the University of Helsinki since 2013. Prior to that, I studied and worked at several universities in Slovakia, Ireland and Germany, and I spent several years as a Language Specialist at Microsoft. I currently also hold an Honorary Professorship at the Indian Institute of Technology in Guwahati. My background is in Maths, Cognitive Science and Phonetics.

I am a member of the Phonetics and Speech Synthesis Research Group at the Department of Digital Humanities, but I am currently also involved in an ERC Advanced grant (to Professor Alice Turk) called Planning the Articulation of Spoken Utterances at the University of Edinburgh, where we investigate and model cognitive processes behind speech production and articulation.

What is your research topic?

I am passionate about human speech research. Besides speech articulation, my own as well as our Group’s main research interest is speech prosody, that is, essentially, all those melodic, rhythmic, emotional aspects of speech that go beyond the linguistic message that we pass on when we speak. In our current project Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis we are working on a novel speech synthesis architecture that is inspired by the influential theoretical and modelling paradigm of human cognition called Predictive Processing. Of course, the first obvious aim is to produce a world-class speech synthesis, and our team has indeed been creating state-of-the-art Finnish and Finland Swedish synthesis systems. But we also want to use the huge language models that drive technological applications as statistical representations of speech material used for their training, and use them to answer theoretical questions related to speech. These questions include, among others, distribution and evolution of accents and dialects, relationship between sociolinguistics and prosody, and prosodic patterns in politicians’ parliamentary speeches.

How is your research related to Kielipankki?

In order to do all that, we need quite a lot of data. Some of it we create ourselves, with invaluable assistance from Kielipankki experts: we have designed and recorded the FinSyn corpus of high-quality speech material intended for speech technology applications, primarily for speech synthesis. The corpus contains ~75 hours of studio-quality recordings from three voice talents, two of them speaking Finnish and one Finland Swedish. This corpus will appear as part of the Kielipankki collection. Our work on dialects and sociolinguistics relies heavily on other Kielipankki corpora, primarily the groundbreaking Donate Speech (Lahjoita puhetta) Corpus and the Aalto Finnish Parliament ASR Corpus.

Recent publications

Vainio, M., Suni, A., Šimko, J. and Kakouros, S. (2024). The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024

Elie, B., Šimko, J., and Turk, A. (2024). Optimization-based modeling of Lombard speech articulation: Supraglottal characteristics. JASA Express Letters, 4(1). https://doi.org/10.1121/10.0024364

Kakouros, S., Šimko, J., Vainio M., and Suni, A. (2023). Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody, Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France. https://doi.org/10.21437/SSW.2023-20

Šimko, J., Adigwe, A., Suni, A. and Vainio, M. (2022). A Hierarchical Predictive Processing Approach to Modelling Prosody, Proc. 11th International Conference on Speech Prosody, Lisbon, Portugal. https://doi.org/10.21437/SpeechProsody.2022-86

Corpora

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

Researcher of the Month: Lotta Leiwo

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Lotta Leiwo tells us about her research in folkloristics, digging into the life and work of Finnish-American T-Bone Slim.

Who are you?

I am Lotta Leiwo, a doctoral researcher at the University of Helsinki, where I am studying for a PhD in history and cultural heritage. My dissertation in Folklore Studies examines the political role and nature-related rhetoric of Finnish-American women in the Finnish Socialist Federation (FSF) in the early 20th century. My main research data consists of FSF documents and a newspaper called Toveritar. The Toveritar, a mouthpiece of the FSF, targeted women and was edited and written mainly by women.

Prior to my doctoral project, I worked for two years as a research assistant on the project T-Bone Slim and the transnational poetics of the migrant left in North America (Kone Foundation 2022–2023). My main responsibility in this international project was the construction of the T-Bone Slim corpus and database. During the project, I wrote my Master’s thesis on Finnish socialist women in North America and found the topic for my dissertation.

What is your research topic?

In the T-Bone Slim project, an international research team studied the life and literary works of the second-generation Finnish American Matti Valentinpoika Huhta (1882–1942), also known as T-Bone Slim. Huhta was born in Ashtabula, Ohio, to a Finnish family that had emigrated from Kälviä, Central Ostrobothnia. He spent his childhood and youth in Finnish communities in the US, working as a dock worker and as a correspondent for the local chapter of the temperance movement. In the 1910s, Huhta abandoned his family and took up a life as a ’hobo’ or itinerant worker. By the 1920s, Huhta had become radicalised, joining the Industrial Workers of the World (IWW) and becoming a columnist for IWW newspapers and periodicals. He continued his writing career under the pen name T-Bone Slim until his death. Huhta lived his last years in New York, where he worked as a deck scow captain. In May 1942, he was found drowned in New York’s East River and was almost forgotten for several decades. For further exploration of the unresolved questions surrounding T-Bone Slim’s death, please visit our project blog and read Saku Pinta’s two-part text ”Who Killed T-Bone Slim” Part I and Part II.

In the late 2010s, musician John Westmoreland, a relative of Slim’s, discovered his ”Uncle Matt’s” writing career as T-Bone Slim. Around the same time, academic interest in Slim, who had a Finnish background, began to grow, and his relatives and researchers found each other over T-Bone Slim Studies. The research continued in a project funded by the Kone Foundation, which brought together John Westmoreland and scholars from Finland, the UK, the US, Canada, and Australia. Kirsti Salmi-Niklander is the Principal Investigator of the project. We collected the T-Bone Slim materials gathered by the researchers from various archives, organizing them into a corpus to enhance accessibility for others interested in the subject. In total, data from 14 archives across three continents and five countries – the United States, Canada, Finland, Sweden and Australia – provided the materials.

The corpus encompasses a total of 1294 texts written by T-Bone Slim and published in English in IWW periodicals. However, Slim also wrote in Finnish on occasion and sometimes used Swedish. The corpus also includes the surviving manuscripts written by Slim.

The texts written by T-Bone Slim are a gold mine for researchers. Slim used language cleverly, combining different genres and means of expression. In addition, the historical, literary and cultural references found in the texts provide an opportunity to examine the IWW movement, transnational migration and history in the United States from diverse perspectives. The language employed in the texts is rich, insightful, and even playful, and may be of interest to linguists. As the material comprises both published and unpublished texts, it offers insights into both the editorial processes of political publishing and the writing practices of an individual author.

Within the framework of the project, I have examined the literary practices and literacy acquisition of Finnish migrant-settlers, as well as Slim’s use of genres from a semiotic perspective. Notably, Slim’s texts exhibit multilingualism in both background and content, incorporating intertextuality and multimodality across various genres and oral-literary practices. Such practices are evident, for example, in his song lyrics. In typical IWW style, Slim wrote lyrics addressing social injustices to popular song tunes known to readers. The lyrics were thus written to be sung, with the aim of provoking the reader/singer to reflect on the message of the lyrics. As Owen Clayton, a collaborator on our project, has observed, T-Bone Slim sought to activate and engage readers through language and words. I, too, am continually amazed and delighted by Slim’s skilful written expression.

How is your research related to Kielipankki?

In the early stages of the project, we thought long and hard about a suitable repository for the T-Bone Slim corpus and database. Our priority was to find a long-term storage solution for the materials that would ensure the materials’ widespread accessibility. Equally important was the need for the corpus to be explored and analysed through digital humanities methods.

The T-Bone Slim corpus and database will be published in April 2024 in Kielipankki, which fulfills all our storage and access requirements. The collection consists of photographic and microfilm scans of the original materials (newspapers, periodicals and manuscripts) with transcriptions and a database. The database includes all the texts in the corpus accompanied by metadata (date of publication, publication, title of the text, archive from which the material was collected, language, etc.). Additionally, we have experimented with abstracting data from a subset of the materials. For example, the people and places mentioned by T-Bone Slim and information about the poems or songs contained in the texts are listed in the abstracted data. The purpose of the database is to facilitate data navigation and serve as a foundation for more detailed abstraction of the data by other researchers.

T-Bone Slim Corpus and Database Launching Event

Welcome to the Resurrection – T-Bone Slim Corpus and Database Launching Event on Monday May 20th, 2024 at 15:00–17:00. The launching event is open to the public and the program can be followed both via Zoom and on-site at the Finnish Literature Society (Hallituskatu 1, Helsinki). More information and registration for remote participants.

Publications

Apajalahti, Eeva-Lotta et al. (2022). ”Ihmistieteelliset näkökulmat metsiin tuottavat tietoa moninaisista metsäsuhteista ja niiden tulevaisuuksista.” Vuosilusto 14(2022): 13–51. Available: https://lusto.fi/wp-content/uploads/2022/12/Lusto-Vuosilusto14.pdf.

Leiwo, Lotta (2024). ”When One’s Life Becomes the Field. Assessing the Field in Collaborative Autoethnography.” Marburg Journal of Religion 25(1). https://doi.org/10.17192/mjr.2024.25.8693.

Leiwo, Lotta (2023). ”Luontokin näkyy olevan köyhälistöä vastaan” Luonto kolmantena tilana Toveritar-lehden paikkakuntakirjeissä 1916–1917. Master’s thesis. Helsinki: University of Helsinki. http://urn.fi/URN:NBN:fi:hulib-202305302306.

Leiwo, Lotta (2023). ”Suomen koloniaalin osallisuuden kontekstit haltuun: Hoegaerts, Josephine, Tuire Liimatainen, Laura Hekanaho ja Elizabeth Peterson (toim.). 2022. Finnishness, Whiteness and Coloniality.” Elore, 30(2), 142–147. Book review. https://doi.org/10.30666/elore.137470.

Mäkelä, Heidi Henriikka, Leiwo, Lotta, Linkola, Hannu and Rinne, Jenni (2023). ”The spiritual forest: an ethnographic exploration on Finnish forest yoga and the forest landscape.” Landscape Research. https://doi.org/10.1080/01426397.2023.2268550.

Corpora

T-Bone Slim Corpus, source (Kielipankki)

T-Bone Slim Corpus, Westmoreland materials (Metashare)

Entries from the Research Project’s Blog

Leiwo, Lotta (2023). ”T-Bone Slim Database – Final Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 18.12.2023. https://blogs.helsinki.fi/tboneslim/2023/12/18/t-bone-slim-database-final-steps/.

Leiwo, Lotta (2023). ”T-Bone Slim Database – Next Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 22.6.2023. https://blogs.helsinki.fi/tboneslim/2023/06/22/t-bone-slim-database-next-steps/.

Salmi-Niklander, Kirsti (2023). ”’T-Bone Slim’ eli Matti V. Huhta ajatteli ja kirjoitti kahdella kielellä kulkurielämästä ja työläisten oikeuksista” ’Vähäisiä lisiä’ Blog. Published 12.5.2023. https://www.finlit.fi/ajankohtaista/blogi/t-bone-slim-eli-matti-v-huhta-ajatteli-ja-kirjoitti-kahdella-kielella-kulkurielamasta-ja-tyolaisten-oikeuksista/.

Clayton, Owen (2023). ”Technocracy and T-Bone Slim’s Break with Ralph Chaplin” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 1.3.2023. https://blogs.helsinki.fi/tboneslim/2023/03/01/technocracy-and-t-bone-slims-break-with-ralph-chaplin/.

Dalbello, Marija (2022). ”From my Archival ‘Digs’, part I. Finding Slim!” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 12.12.2022. https://blogs.helsinki.fi/tboneslim/2022/12/12/finding-slim/.

Pinta, Saku (2022). ”T-Bone Slim’s Forgotten Finnish-Language Writings in the IWW Press” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 20.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/20/t-bone-slims-forgotten-finnish-language-writings-in-the-iww-press/.

Leiwo, Lotta (2022). ”T-Bone Slim Database – First Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 5.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/05/t-bone-slim-database-first-steps/.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Harri Uusitalo

Photo: Timo Tuovinen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Harri Uusitalo tells us about his research using various types of Finnish-language corpora from different time periods.

Who are you?

I am Harri Uusitalo, postdoctoral researcher at the University of Turku. I am a researcher of the Finnish language and currently, I am working at the School of History, Culture and Arts Studies in the interdisciplinary projects Fauna et Flora Fennica and Disappeared, Endangered and Newly Arrived Species: The Human Relationship with the Changing Biodiversity of the Baltic Sea. In the research groups, we examine the historical relationship of the Finnish people with nature.

What is your research topic?

I have studied Finnish texts from different periods, from the time of Agricola to the present day. My doctoral thesis focused on the legal language of the 17th century, and more recently I have been fascinated by environmental themes and ecolinguistic perspectives.

How is your research related to Kielipankki?

Together with my colleagues, I have used the Kielipankki data in some of my research. For example, Karita Suomalainen and I used the Suomi24 corpus and the Korp tool to investigate how Finnish people identify and discuss invasive alien species. With Duha Elsayed and Heidi Salmi, I used the Morpho-Syntactic Database of Mikael Agricola’s Works to study the translative form of the A-infinitive in Agricola’s works.

In my future research, I will certainly make use of many other corpora in Kielipankki, such as the Corpus of Old Literary Finnish, the Corpus of Early Modern Finnish and the Newspaper and Periodical Corpus of the National Library of Finland.

Publications

Uusitalo Harri, Lähdesmäki Heta, Sonck-Rautio Kirsi, Latva Otto, Salmi Hannu & Alenius Teija (forthcoming): Alien Plants between Practices and Representations: the Cases of European Spruce and Beach Rose in Finland. Plant Perspectives.

Uusitalo Harri & Suomalainen Karita 2023: Ecolinguistic Approach to Online Finnish Discourse on Invasive Alien Species. Language@Internet 21. https://www.languageatinternet.org/articles/2023/uusitalo

Elsayed Duha, Salmi Heidi & Uusitalo Harri 2022: A-infinitiivin translatiivi Mikael Agricolan teksteissä. Sananjalka 64. Suomen Kielen Seura, Turku. DOI: 10.30673/sja.107377

Corpora and tools

The Korp tool

The Suomi24 resource group

The Morpho-Syntactic Database of Mikael Agricola’s Works

The Corpus of Old Literary Finnish (VKS)

The Corpus of Early Modern Finnish (VNSK)

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (KLK)

More information

Fauna et Flora Fennica (FaFFe) project

Disappeared, Endangered and Newly Arrived Species: The Human Relationship with the Changing Biodiversity of the Baltic Sea (HumBio) project

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Tanja Säily

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tanja Säily tells us about her research on the English language, which combines corpus linguistics, digital humanities and historical sociolinguistics.

Who are you?

I am Tanja Säily, Assistant Professor in English Language at the University of Helsinki.

What is your research topic?

I study variation and change in the English language from a sociolinguistic perspective. My research combines corpus linguistics, digital humanities and historical sociolinguistics. I frequently collaborate with other linguists and historians, and I develop new methods with data scientists and language technologists. I analyse sociolinguistic variation especially in linguistic productivity, such as the use of neologisms. I have also studied gendered styles and factors influencing the rate of language change.

How is your research related to Kielipankki?

In my research, I use English text corpora, which I have also deposited in Kielipankki for myself and others to use. I am currently studying the productivity of various linguistic constructions in the Corpus of Historical American English (e.g. Säily & Vartiainen, forthcoming). I have been using this corpus with the Korp tool and have also downloaded it to my own computer.

I have prepared openly available teaching materials on the methods of historical corpus linguistics for graduate students and other interested parties. They are included in the Method Bank for Linguistics, and the Early Modern English section of the Helsinki Corpus of English Texts used in the exercises can be found in Kielipankki.

Publications

Here are a few of my most recent publications; the entire list can be found at https://tanjasaily.fi/publications/

Accepted. Säily, Tanja, Martin Hilpert & Jukka Suomela. New approaches to investigating change in derivational productivity: Gender and internal factors in the development of -ity and -ness, 1600–1800. Patricia Ronan, Theresa Neumaier, Lisa Westermayer, Andreas Weilinghoff & Sarah Buschfeld (eds.), Crossing boundaries through corpora: Innovative approaches to corpus linguistics (Studies in Corpus Linguistics). Amsterdam: John Benjamins.

Accepted. Säily, Tanja & Turo Vartiainen. Historical linguistics. Michaela Mahlberg & Gavin Brookes (eds.), Bloomsbury handbook of corpus linguistics. London: Bloomsbury.

Accepted. Säily, Tanja, Turo Vartiainen, Harri Siirtola & Terttu Nevalainen. Changing styles of letter-writing? Evidence from 400 years of early English letters in a POS-tagged corpus. Luisella Caon, Moragh Gordon & Thijs Porck (eds.), Unlocking the history of English: Pragmatics, prescriptivism and text types (Current Issues in Linguistic Theory). Amsterdam: John Benjamins.

2023. Landert, Daniela, Tanja Säily & Mika Hämäläinen. TV series as disseminators of emerging vocabulary: Non-codified expressions in the TV Corpus. ICAME Journal 47(1): 63–79. DOI: 10.2478/icame-2023-0004

2022. Rodríguez-Puente, Paula, Tanja Säily & Jukka Suomela. New methods for analysing diachronic suffix competition across registers: How -ity gained ground on -ness in Early Modern English. International Journal of Corpus Linguistics 27(4): 506–528. Special issue, Corpus studies of language through time, ed. by Tony McEnery, Gavin Brookes & Isobelle Clarke. DOI: 10.1075/ijcl.22014.rod

2021. Säily, Tanja, Eetu Mäkelä & Mika Hämäläinen. From plenipotentiary to puddingless: Users and uses of new words in early English letters. Mika Hämäläinen, Niko Partanen & Khalid Alnajjar (eds.), Multilingual Facilitation, 153–169. Helsinki: University of Helsinki. DOI: 10.31885/9789515150257.15

2020. Mäkelä, Eetu, Krista Lagus, Leo Lahti, Tanja Säily, Mikko Tolonen, Mika Hämäläinen, Samuli Kaislaniemi & Terttu Nevalainen. Wrangling with non-standard data. Sanita Reinsone, Inguna Skadiņa, Anda Baklāne & Jānis Daugavietis (eds.), Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, Riga, Latvia, October 21–23, 2020 (CEUR Workshop Proceedings 2612), 81–96. Aachen: CEUR-WS.org. DHN 2020 Best Paper Award. http://ceur-ws.org/Vol-2612/paper6.pdf

2020. Nevalainen, Terttu, Tanja Säily, Turo Vartiainen, Aatu Liimatta & Jefrey Lijffijt. History of English as punctuated equilibria? A meta-analysis of the rate of linguistic change in Middle English. Journal of Historical Sociolinguistics 6(2): article 20190008. Special issue, Comparative Sociolinguistic Perspectives on the Rate of Linguistic Change, ed. by Terttu Nevalainen, Tanja Säily & Turo Vartiainen. DOI: 10.1515/jhsl-2019-0008

2019. Hill, Mark J., Ville Vaara, Tanja Säily, Leo Lahti & Mikko Tolonen. Reconstructing intellectual networks: From the ESTC’s bibliographic metadata to historical material. Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, Copenhagen, Denmark, March 6–8, 2019 (CEUR Workshop Proceedings 2364), 201–219. Aachen: CEUR-WS.org. DHN 2019 Best Paper Award. http://ceur-ws.org/Vol-2364/19_paper.pdf

2018. Säily, Tanja. Change or variation? Productivity of the suffixes -ness and -ity. Terttu Nevalainen, Minna Palander-Collin & Tanja Säily (eds.), Patterns of Change in 18th-century English: A Sociolinguistic Approach (Advances in Historical Sociolinguistics 8), 197–218. Amsterdam: John Benjamins. DOI: 10.1075/ahs.8

Corpora and teaching materials

The Corpus of Historical American English (COHA)

Helsinki Corpus of English Texts, Early Modern English section

Historical Corpus Linguistics (The Method Bank for Linguistics)

More information

Homepage: https://tanjasaily.fi

ORCID: https://orcid.org/0000-0003-4407-8929

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Liisa Mustanoja

Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Liisa Mustanoja tells us about her research on sociolinguistics. With the help of a longitudinal corpus, it is possible to observe changes in the spoken language of the same people at different points in time.

Who are you?

I am Liisa Mustanoja, PhD, from Tampere. I work as a University Lecturer of Finnish Language in the Unit of Languages at the Faculty of Information Technology and Communication Sciences, Tampere University. From January 2024, I will be the Head of the Unit of Languages for the next five years. I am also an Associate Professor of Finnish at the University of Oulu, specialising in sociolinguistics.

What is your research topic?

So far, all my research fits under the large umbrella of sociolinguistics. I am interested in the relationship between language and society, especially in all forms of change, upheaval and movement. In my doctoral research, I examined change in the spoken language of Tampere at the level of the idiolect. This was a so-called real-time panel survey, in which I examined the language of the same people at two points in time. Later, my colleagues and I extended the study to the spoken language of Helsinki, and we also included a third time point. The focus has largely been on the phonetic and formal structure of the language, but the data has also allowed for a sociophonetic approach. In one article, for example, we investigated changes in pitch over time.

In addition to the path of variation studies, I am interested in the interface between spoken and written language, and this has provided me with another research direction, namely the study of letter writing. I have investigated – both on my own as well as together with Finnish language students – the correspondence during the Second World War. As there was no other means of communication during the war, everyone took up their pen, regardless of age, profession or educational background. Although this correspondence resource is old, it has provided essential insights into the importance of human contact in times of crisis, as well as into everyday life and humanity in the midst of world turmoil.

How is your research related to Kielipankki?

For some time now, Kielipankki has made accessible the Longitudinal Corpus of Finnish Spoken in Helsinki, which has provided me and my colleagues with an important source of data for studying language change. This corpus will hopefully be joined in the coming months by a little sister, the Longitudinal data of Tampere spoken language. Previously, recordings of the spoken language of Tampere had been made in the 1970s and 1990s. In 2019, I started a third round of data collection in Tampere, which has been continued by students up to the present day. Thanks to the funding I received from FIN-CLARIN, I have also been able to hire some temporary help to work on the material. Everything is now in place, except for the final paperwork. The transfer and archiving of personal speech data has its own complications, but Kielipankki is by far the best possible repository for this valuable longitudinal data. On the eve of handing over the material, it feels like there should be more material and it should be more complete, and the transcripts should be revised countless more times. But really, every little addition to Kielipankki is a great gift to the research community. And once even a part of the resource is opened up, someone else also has the opportunity to join the transcription work if they want to!

Among the resources in Kielipankki, I would also like to mention the Suomi24 Corpus, which is well suited to student work. Nowadays, when data protection requirements are demanding, it is a relief to be able to direct students to these ready-made resources. For me, too, there are still many new things to discover in Kielipankki. My interest in wartime letters, for example, has recently led me to Kalle Päätalo’s Iijoki series, and I have been quite surprised by the research potential of this cornucopia.

Publications

Mustanoja Liisa, O’Dell Michael & Lappalainen Hanna, 2022: Helsinkiläis- ja tamperelaispuhujien äänenkorkeuden muutokset 1970-luvulta 2010-luvulle. Puhe ja kieli. https://doi.org/10.23997/pk.121404

Kuparinen Olli, Santaharju Jenni, Leino Unni, Mustanoja Liisa & Peltonen Jaakko 2022: Katomuotojen eteneminen hd-yhtymässä Helsingin puhekielessä. Virittäjä 126, s. 316–338. https://doi.org/10.23982/vir.100585

Kuparinen Olli, Peltonen Jaakko, Mustanoja Liisa, Leino Unni & Santaharju Jenni, 2021: Lects in Helsinki Finnish – a probabilistic component modeling approach. Language Variation and Change. https://doi.org/10.1017/S0954394521000041

Lappalainen Hanna, Mustanoja Liisa & O’Dell Michael, 2019: Miten ja milloin yksilön kieli muuttuu? Helsinkiläisidiolektien muutos ja muutoksen tutkimuksen menetelmät. Virittäjä 123, s. 550–581. https://doi.org/10.23982/vir.67808

Kuparinen Olli, Mustanoja Liisa, Peltonen Jaakko, Santaharju Jenni & Leino Unni, 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. Sananjalka 61. s. 30–56. https://doi.org/10.30673/sja.80056

Mustanoja Liisa, 2018: Sydämellisiä kirjeitä talvisodasta. Hämäläisten sotilaiden kiitoskirjeet aikansa kielen ja kirjeenvaihtokulttuurin heijastajina. Sisko Brunni, Niina Kunnas, Santeri Palviainen ja Jari Sivonen (toim.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia humaniora ouluensia 16. Oulu, s. 251–285. https://urn.fi/URN:ISBN:9789526221120

Mustanoja Liisa (toim.), 2017: Arjen sirpaleita ja suuria tunteita: Kirjeet sodan sanoittajina ja ihmissuhteiden ylläpitäjinä 1939–1944. Tampere Studies in Language, Translation and Literature B5. Tampereen yliopisto. https://urn.fi/URN:ISBN:978-952-03-0527-7

Mustanoja Liisa, 2011: Idiolekti ja sen muuttuminen: reaaliaikatutkimus Tampereen puhekielestä. Tampere: Tampere University Press. https://urn.fi/urn:isbn:978-951-44-8417-9

Corpora

The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s)

Longitudinal data of Tampere spoken language

Suomi24 resource group

Iijoki, the University of Oulu Päätalo collection

More information

Faculty of Information Technology and Communication Sciences | Tampere Universities community (tuni.fi)

The Linguistic Study of Wartime Correspondence | Tampere Universities community (tuni.fi)

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Tiina Onikki-Rantajääskö

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tiina Onikki-Rantajääskö tells us about the principles of the Helsinki Term Bank for the Arts and Sciences (HTB) and invites interested experts to join the collaborative terminology work.

Who are you?

I am Tiina Onikki-Rantajääskö, Professor of Finnish at the University of Helsinki. I also lead the Helsinki Term Bank for the Arts and Sciences (HTB).

What is your research topic?

I am generally interested in how vocabulary and grammatical structures construe linguistic meaning and how they function in relation to the wider textual context. Most of my published research is related to the local cases of the Finnish language. Currently, I am delighted to see how younger researchers aim to combine qualitative and quantitative research in the project Platforms and Rhetorical Group Strategies (in Finnish, ”Alustat ja retoriset ryhmästrategiat”), run by me and Eetu Mäkelä and funded by Kone Foundation. I am particularly interested in discovering whether some constructions can indicate broader discourse structures. However, during this winter, I am spending most of my time on my duties as the Finnish Language Rapporteur, appointed by the Ministry of Justice.

How is your research related to Kielipankki?

I tend to use the Finnish language resources in Kielipankki whenever I need information about the context of a word or grammatical element. Many of the corpora that I have used in the past can now be found in Kielipankki, such as the HS.fi News and Comments Corpus that was compiled in one of my earlier projects.

In addition, the Helsinki Term Bank for the Arts and Sciences (HTB) is part of the FIN-CLARIAH Research Infrastructure, together with Kielipankki. This is reflected in the fact that the online service of the HTB is also accessible via Kielipankki. The HTB also has an employee funded through the FIN-CLARIAH project (FIRI funding from the Research Council of Finland). There is a need for collaboration in the field of language technologies.

The contents of the Helsinki Term Bank for the Arts and Sciences (HTB) are still under construction. We are constantly working to involve more and more researchers from different disciplines in the terminology work and to invite new disciplines to join the HTB. Defining scientific terms and providing other background information on concepts require expertise in each field. Therefore, the selected method is niche-sourcing of experts, supported by our project planner. The aim is to promote the multilingualism of science in addition to providing openly accessible information describing the formation of scientific knowledge and facilitating the utilization of science. Scientific concepts are at the heart of research. Multilingualism can be promoted by offering translation equivalents for terms in different languages. The Finnish language is in focus, since the aim is to support Finnish as a language of science. However, it is possible to present definitions and concept pages in languages other than Finnish. The term bank thus opens up opportunities for international collaboration. Especially for multilingual and multidisciplinary research groups, the term bank provides an opportunity to shape the common terminological ground. All interested experts are welcome to participate.

My research interests in the Helsinki Term Bank for the Arts and Sciences (HTB) include the presentation of background knowledge frames and the emergence of prototypicality, as well as collaborative interactions: the network of experts in the HTB and the online service interact and form a field of action that differs from traditional research projects.

Publications

Enqvist, Johanna & Tiina Onikki-Rantajääskö & Kaarina Pitkänen-Heikkilä 2021: Terminology work as open, communal and collaborative crowdsourcing practice of academic communities. – Terminology 27:1, pp. 56–79. DOI: 10.1075/term.00058.enq

Jaakola, Minna & Tiina Onikki-Rantajääskö (eds.) 2023: The Finnish Case System: Cognitive Linguistic Perspectives. Helsinki: SKS. DOI: 10.21435/sflin.23

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Vetenskapstermbanken i Finland i samhällets tjänst. – Publikation Nordterm 2023.

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Tieteen termipankki tieteentekemisen ytimessä. – Kieliviesti 2/2023.

Onikki-Rantajääskö, Tiina & Harri Kettunen 2023: Vuosi 2022 Tieteen termipankissa: Laajenemista uusille aihealueille ja tunnustuspalkintoja avoimen tieteen edistämisestä. – Tieteen termipankin blogi. Helmikuu/2023. https://blogs.helsinki.fi/tieteentermipankki/2023/02/16/vuosi-2022-tieteen-termipankissa-laajenemista-uusille-aihealueille-ja-tunnustuspalkintoja-avoimen-tieteen-edistamisesta/

Corpora

HS.fi News and Comments Corpus

More information

Helsinki Term Bank for the Arts and Sciences (HTB)

Instructions for a new expert for joining the collaborative terminology work

FIN-CLARIAH Research Infrastructure

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.


Researcher of the Month: Aleksi Sahala

Photo: Marianne Ough

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoc researcher in Assyriology and Language Technology. I am currently working for the University of Helsinki in an Academy of Finland funded project “The Origins of Emesal”, where our goal is to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time using computational methods.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “Contributions to Computational Assyriology”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the Centre of Excellence in Ancient Near Eastern Empires at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is not only challenging due to data sparsity and the fragmentary nature of the primary sources, but also due to the complexity of the cuneiform writing system and inflectional morphology. In theory, most words can occur in several thousands of different forms, each of which can also be spelled in several different ways.

My focus has been on the development of a pipeline that is able to linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively by close reading. By applying a more computational approach, it becomes easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (BabyFST), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (BabyLemmatizer). In addition, I have built a word-embedding-based tool for analyzing semantic relationships between words in sparse and fragmentary data sets (PMI Embeddings).
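To illustrate the idea behind PMI-based word representations, the following minimal sketch computes positive pointwise mutual information (PPMI) scores from co-occurrence counts within a symmetric window. It is not the PMI Embeddings tool itself: the function name `ppmi_scores`, the window size and the toy token sequences are assumptions made purely for illustration. A full pipeline would typically also factorize the resulting matrix (for example with SVD) to obtain dense vectors, a step that is left out here.

```python
import math
from collections import Counter


def ppmi_scores(sentences, window=2):
    """Return {(word, context): PPMI} for tokenized sentences (hypothetical helper)."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    # Marginal counts are derived from the pair counts so that the
    # probabilities below are all estimated over the same sample.
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    scores = {}
    for (word, context), n in pair_counts.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) )
        pmi = math.log2(n * total / (word_counts[word] * context_counts[context]))
        if pmi > 0:  # keep only positive associations (PPMI)
            scores[(word, context)] = pmi
    return scores


# Toy token sequences, invented for illustration only.
toy_sentences = [
    ["king", "palace", "king", "palace"],
    ["scribe", "tablet", "scribe", "tablet"],
]
toy_scores = ppmi_scores(toy_sentences, window=1)
```

Words that only ever co-occur with each other receive a high score, while pairs that never co-occur are absent from the result, which is why this kind of measure remains usable even when the data are sparse.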

My current project focuses on Emesal, a liturgical variant of the Sumerian language, which is only attested in writing after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate the direct speech of goddesses and women, its origins and evolution are still widely debated. None of the Emesal texts were written entirely in this language variant, but rather in Sumerian, with Emesal used only here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code-switching, if such rules ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts could reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how the use of this language variant was introduced in written documents and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by annotating cuneiform texts and publishing them in the Korp concordance service. My responsibilities have been collecting these data sets, converting them into a Korp-compatible format, and developing tools for annotating and harmonizing them with the existing resources in such a way that they can be used together efficiently for quantitative analysis.
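As a rough illustration of what such a conversion can involve, the sketch below writes annotated tokens in a verticalized text (VRT) layout of the kind Korp reads: one token per line with tab-separated attributes, wrapped in XML-like structural tags. The attribute order (form, lemma, part of speech), the tag names, the function `to_vrt` and the example tokens are assumptions for illustration, not the conversion tools actually used in this work.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(text_id, sentences):
    """Serialize sentences of (form, lemma, pos) tuples as a VRT string (hypothetical helper)."""
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in sentences:
        lines.append("<sentence>")
        for form, lemma, pos in sentence:
            # Escape &, < and > so that token lines cannot be mistaken for tags.
            lines.append("\t".join(escape(field) for field in (form, lemma, pos)))
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Illustrative toy annotation; the tokens and tags are placeholders.
example_vrt = to_vrt("demo-1", [[("šarru", "šarru", "N"), ("rabû", "rabû", "AJ")]])
```

Keeping every corpus in the same column layout is what makes it possible to query harmonized resources together.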

Recently, we have been working on the harmonization, lemmatization and tagging of Achemenet, a collection of Neo-Babylonian administrative and legal documents.

Publications

Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). Aššur and his friends: a statistical analysis of neo-assyrian texts. Journal of Cuneiform Studies, 71(1), 159–180. http://hdl.handle.net/10138/303986

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). Digital Approaches to Analyzing and Translating Emotion: What Is Love?. In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/348398

Bennet, E. & Sahala, A. (2023). Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023. http://hdl.handle.net/10138/565513

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). Akkadian treebank for early neo-assyrian royal inscriptions. In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics. http://hdl.handle.net/10138/322305

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of ancient babylonian. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894). http://hdl.handle.net/10138/317691

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA). http://hdl.handle.net/10138/317688

Sahala, A. (2021). Contributions to Computational Assyriology. PhD Thesis. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-7416-1

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, s.49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). Language technology approach to “seeing” in Akkadian. In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/339256

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction. In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119). http://hdl.handle.net/10138/563733

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). Semantic Domains in Akkadian Texts. CyberResearch on the Ancient Near East and Neighboring Regions. Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256. http://hdl.handle.net/10138/241805

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). Fear in akkadian texts: New digital perspectives on lexical semantics. In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill. http://hdl.handle.net/10138/328017

Tools

BabyLemmatizer, an OpenNMT-based neural lemmatizer and tagger. Pretrained models available for Ancient Greek, Latin and various cuneiform languages.

BabyFST, Finite-state morphology of Akkadian, specifically Babylonian dialect.

PMI-Embeddings, Hyper-parametrized tool for creating PMI+SVD based word embeddings from sparse or fragmentary data sets.

Corpora

Open Richly Annotated Cuneiform Corpus (Oracc)

Achemenet Babylonian texts

More information

Centre of Excellence in Ancient Near Eastern Empires (ANEE)


Researcher of the Month: Anna Dmitrieva

Anna Dmitrieva (standing) with Aleksandra Konovalova (sitting), co-creators of the Parallel Corpus of Finnish and Easy-to-read Finnish. Photo: Anna Dmitrieva

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Dmitrieva tells us about her research on text simplification. Computational methods and the compiling of parallel corpora are an integral part of her work.

Who are you?

I am Anna Dmitrieva, a doctoral researcher at HELSLANG, the Doctoral Programme in Language Studies at the University of Helsinki.

What is your research topic?

My main field of interest is text simplification. I have studied computational linguistics since 2012, when I started my studies for the Bachelor’s degree. Since then, I have been involved in many projects related to natural language processing (NLP), but text simplification has been my main focus during my doctoral studies.

Text simplification is a process of making a text “easier”. A simplified text should be more readable and accessible to a broader audience. In NLP, text simplification can be viewed as a monolingual machine translation problem. We train models that are capable of translating or transforming texts, taking a source text in a particular language and producing a “simpler” version of the text in the same language. This task typically requires a lot of parallel data, where there is a corresponding “easy” target text for each source text.

I work with languages that do not have a lot of simplification data, make datasets for them, and train simplification models. During my time as a doctoral researcher, I have made Russian and Finnish text simplification datasets and models. I am also investigating controlled text simplification, the task of manipulating certain linguistic properties in the output of the simplification model.
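One common way to set up controlled simplification is to prepend control tokens to the source text during training, each encoding a target property such as the length ratio between the simplified and the original text. The sketch below assumes this setup. The token format `<LENRATIO_x>`, the bucket width and the helper names are hypothetical and are not taken from the models described here.

```python
def length_ratio_token(source, target, step=0.05, max_ratio=2.0):
    """Build a control token from the character-length ratio target/source."""
    if not source:
        raise ValueError("source text must not be empty")
    ratio = min(len(target) / len(source), max_ratio)
    # Round the ratio to the nearest bucket so that the model sees a small,
    # fixed vocabulary of control tokens instead of arbitrary floats.
    bucket = round(round(ratio / step) * step, 2)
    return f"<LENRATIO_{bucket:.2f}>"


def add_control_token(source, target):
    """Return the training input: control token followed by the source text."""
    return f"{length_ratio_token(source, target)} {source}"


# Illustrative sentence pair, invented for this example.
training_input = add_control_token(
    "The committee decided to postpone the final decision.",
    "The decision was delayed.",
)
```

At inference time there is no target text, so the token is chosen by the user instead, which is what makes the output controllable.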

How is your research related to Kielipankki?

As a Finnish university student, I have naturally thought of making a Finnish simplification model. Since there were no parallel simplification corpora for Finnish, I had to make one myself. The most obvious choice for the data source was Yle Easy-to-read Finnish News: they exist in the form of text, have been around for a relatively long time, and have equivalents in “regular” Finnish. It was a relief to know that I didn’t have to scrape the news myself using Yle’s API because all the archives are already on Kielipankki.

However, I had to solve the problem of aligning Easy Finnish and Standard Finnish news. I performed automatic alignment, but there was no gold-standard test set of document pairs against which to test the quality of the alignments. This is where my friend Aleksandra Konovalova (University of Turku) stepped in and helped me, evaluating 1919 pairs of documents herself. Together, we created the Parallel Corpus of Finnish and Easy-to-read Finnish, which is now available in Kielipankki. Currently, I am adding more document pairs and creating a sentence-aligned version, which will hopefully also be made available via Kielipankki when completed.
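The general workflow of aligning documents automatically and then checking the result against manual judgments can be sketched as follows. The alignment method itself is not described here, so the sketch assumes a simple token-overlap (Jaccard) similarity purely for illustration; the function names, the threshold and the toy documents are likewise hypothetical.

```python
def jaccard(text_a, text_b):
    """Token-set overlap between two texts, between 0.0 and 1.0."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def align(easy_docs, standard_docs, threshold=0.2):
    """Pair each easy document with its most similar standard document."""
    pairs = []
    for easy_id, easy_text in easy_docs.items():
        best_id, best_score = None, 0.0
        for std_id, std_text in standard_docs.items():
            score = jaccard(easy_text, std_text)
            if score > best_score:
                best_id, best_score = std_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((easy_id, best_id))
    return pairs


def precision(predicted_pairs, judged_correct):
    """Share of predicted pairs that a human annotator judged to be correct."""
    if not predicted_pairs:
        return 0.0
    hits = sum(1 for pair in predicted_pairs if pair in judged_correct)
    return hits / len(predicted_pairs)


# Toy documents in English, invented for illustration only.
easy_docs = {
    "e1": "the storm closed roads in the north",
    "e2": "the city builds a new school",
}
standard_docs = {
    "s1": "a severe storm closed several roads in the north of the country",
    "s2": "the city council decided to build a new school and library",
}
predicted = align(easy_docs, standard_docs)
score = precision(predicted, judged_correct={("e1", "s1")})
```

A manually judged set of pairs is what turns the output of such a procedure from a plausible guess into a measurable result.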

Publications

Dmitrieva, A. & Konovalova, A. Creating a parallel Finnish–Easy Finnish dataset from news articles. Jun 2023, Proceedings of the 1st Workshop on Open Community-Driven Machine Translation. Esplá-Gomis, M., Forcada, M., Kuzman, T., Ljubešić, N., van Noord, R., Ramírez-Sánchez, G., Tiedemann, J. & Toral, A. (eds.). Universitat d’Alacant, pp. 21–26. https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27

Dmitrieva, A. Automatic text simplification of Russian texts using control tokens. May 2023, Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Piskorski, J., Marcińczuk, M., Nakov, P. et al. (eds.). Stroudsburg: Association for Computational Linguistics (ACL), pp. 70–77. DOI: 10.18653/v1/2023.bsnlp-1.9

Dmitrieva, A. The role of language technology in accessible communication research. Jun 2023, Emerging Fields in Easy Language and Accessible Communication Research. Deilen, S., Hansen-Schirra, S., Hernández Garrido, S., Maaß, C. & Tardel, A. (eds.). Frank & Timme, pp. 319–338. (Easy – Plain – Accessible; vol. 14). https://researchportal.helsinki.fi/fi/publications/the-role-of-language-technology-in-accessible-communication-resea

Corpora

Yle News Archive in Kielipankki

Parallel Corpus of Finnish and Easy-to-read Finnish

More information

HELSLANG – The Doctoral Programme in Language Studies at the University of Helsinki


Researcher of the Month: Sampo Pyysalo

Photo: Pasi Leino / University of Turku

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing tools similar to ChatGPT also for smaller languages, such as Finnish.

Who are you?

I’m Sampo Pyysalo, University Research Fellow at the TurkuNLP group of the University of Turku.

What is your research topic?

My research is on machine learning approaches to natural language processing, with particular focus on processing Finnish text and analyzing biomedical domain scientific literature. A lot of my recent work revolves around training large neural network models, including general “foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition model for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create resources for supervised training, such as the Turku NER and TurkuONE corpora.

Large neural language models are central to a lot of state-of-the-art natural language processing and the basis for tools such as ChatGPT, but most such models focus on English and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish language with comparable capabilities to tools available for English.

How is your research related to Kielipankki?

Creating large language models from scratch requires billions of words of text, and collections of Finnish of this size are not readily available. To compile sufficiently large corpora for language model training we have drawn on various sources, including web crawls and resources available through Kielipankki such as the Yle News Archive, the Finnish News Agency Archive (STT) and the Suomi 24 Corpus. We also distribute resources created by TurkuNLP through Kielipankki among other channels.

In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.

Publications

J. Luoma & LH. Chang & F. Ginter & S. Pyysalo. 2021. Fine-grained Named Entity Annotation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.14

A. Virtanen & J. Kanerva & R. Ilo & J. Luoma & J. Luotolahti & T. Salakoski & F. Ginter & S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. In CoRR, abs/1912.07076. https://doi.org/10.48550/arXiv.1912.07076

Corpora

Turku NER Corpus (data available via GitHub)

TurkuONE Corpus (data available via GitHub)

The Yle News Archive resource group in Kielipankki

The Finnish News Agency Archive resource group in Kielipankki

The Suomi 24 Corpus resource group in Kielipankki

More information

TurkuNLP group of the University of Turku

FinBERT, a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group

FinGPT, generative GPT-3-like models for Finnish

Finnish NER, a Named Entity Recognition system for Finnish (based on FinBERT and a new NER annotation layer of the UD_Finnish-TDT treebank)
