Anna Dmitrieva (standing) with Aleksandra Konovalova (sitting), co-creators of the Parallel Corpus of Finnish and Easy-to-read Finnish. Photo: Anna Dmitrieva
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Dmitrieva tells us about her research on text simplification. Computational methods and the compiling of parallel corpora are an integral part of her work.
I am Anna Dmitrieva, a doctoral researcher at HELSLANG, the Doctoral Programme in Language Studies at the University of Helsinki.
My main field of interest is text simplification. I have studied computational linguistics since 2012, when I began my Bachelor’s degree. Since then, I have been involved in many projects related to natural language processing (NLP), but text simplification has been my main focus during my doctoral studies.
Text simplification is the process of making a text “easier”: a simplified text should be more readable and accessible to a broader audience. In NLP, text simplification can be viewed as a monolingual machine translation problem. We train models that translate or transform texts, taking a source text in a particular language and producing a “simpler” version of it in the same language. This task typically requires a lot of parallel data, in which each source text has a corresponding “easy” target text.
I work with languages for which little simplification data exists: I build datasets for them and train simplification models. During my time as a doctoral researcher, I have made Russian and Finnish text simplification datasets and models. I am also investigating controlled text simplification, the task of manipulating certain linguistic properties in the output of the simplification model.
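To make the idea of controlled simplification concrete, here is a minimal sketch of one common approach, in which a control token is prepended to the source text. It is an illustration under stated assumptions, not the implementation used in the research described here: the token format `<LENGTH_RATIO_…>`, the choice of character-length ratio as the controlled property, the bucket size, and all function names are hypothetical.

```python
# Minimal sketch of control tokens for controlled text simplification.
# Assumptions (not from the article): the controlled property is the
# character-length ratio, and the token format and bucket size are invented.

def length_ratio(source: str, target: str) -> float:
    """Character length of the simplified text relative to the source."""
    if not source:
        raise ValueError("source text must not be empty")
    return len(target) / len(source)


def bucket(ratio: float, step: float = 0.05) -> float:
    """Round a ratio to the nearest multiple of `step`,
    so that the model only ever sees a small set of token values."""
    return round(round(ratio / step) * step, 2)


def format_token(ratio: float) -> str:
    """Render a ratio as a (hypothetical) control token."""
    return f"<LENGTH_RATIO_{ratio:.2f}>"


def training_input(source: str, target: str) -> str:
    """During training, the token is computed from the actual text pair."""
    return f"{format_token(bucket(length_ratio(source, target)))} {source}"


def inference_input(source: str, desired_ratio: float) -> str:
    """At inference time, the user chooses the value to request."""
    return f"{format_token(bucket(desired_ratio))} {source}"
```

In this setup, the model would learn during training to associate the token value with the observed property of the target text, so that a different value can be requested when generating a simplification.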
As a student at a Finnish university, I naturally wanted to build a Finnish simplification model. Since there were no parallel simplification corpora for Finnish, I had to make one myself. The most obvious choice for the data source was Yle Easy-to-read Finnish News: the news items are available as text, have been published for a relatively long time, and have equivalents in “regular” Finnish. It was a relief to know that I didn’t have to scrape the news myself using Yle’s API, because all the archives are already available in Kielipankki.
However, I had to solve the problem of aligning Easy Finnish and Standard Finnish news. I performed automatic alignment, but there was no gold-standard test set of document pairs against which to check the quality of the alignments. This is where my friend Aleksandra Konovalova (University of Turku) stepped in and helped me, manually evaluating 1919 pairs of documents. Together, we created the Parallel Corpus of Finnish and Easy-to-read Finnish, which is now available in Kielipankki. Currently, I am adding more document pairs and creating a sentence-aligned version, which will hopefully also be made available via Kielipankki when completed.
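As a rough illustration of how manual judgements can be used to check automatic alignments, the sketch below computes the share of aligned document pairs that an annotator accepted, optionally after filtering by a similarity score. Everything specific here is an assumption made for the example: the record layout, the presence of a similarity score, the document IDs, and the toy data are invented and do not come from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedPair:
    easy_id: str       # hypothetical ID of the Easy Finnish article
    standard_id: str   # hypothetical ID of the Standard Finnish article
    score: float       # assumed similarity score from the automatic aligner
    is_correct: bool   # manual judgement: do the two articles really match?


def precision(pairs: list[JudgedPair]) -> float:
    """Share of automatically aligned pairs that the annotator accepted."""
    if not pairs:
        raise ValueError("no judged pairs")
    return sum(p.is_correct for p in pairs) / len(pairs)


def precision_above(pairs: list[JudgedPair], threshold: float) -> tuple[float, int]:
    """Precision among pairs scoring at least `threshold`,
    together with the number of pairs that were kept."""
    kept = [p for p in pairs if p.score >= threshold]
    return precision(kept), len(kept)


# Invented toy data, for illustration only.
judged = [
    JudgedPair("easy-001", "std-104", 0.91, True),
    JudgedPair("easy-002", "std-230", 0.78, True),
    JudgedPair("easy-003", "std-077", 0.42, False),
    JudgedPair("easy-004", "std-310", 0.66, True),
]
```

Comparing `precision(judged)` with `precision_above(judged, 0.6)` shows the trade-off one would typically look at: a stricter threshold keeps fewer pairs but may raise the share of correct ones.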
Dmitrieva, A. & Konovalova, A. Creating a parallel Finnish-Easy Finnish dataset from news articles. Jun 2023, Proceedings of the 1st Workshop on Open Community-Driven Machine Translation. Esplá-Gomis, M., Forcada, M., Kuzman, T., Ljubešić, N., van Noord, R., Ramírez-Sánchez, G., Tiedemann, J. & Toral, A. (eds.). Universitat d’Alacant, p. 21-26. https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27
Dmitrieva, A. Automatic text simplification of Russian texts using control tokens. May 2023, Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Piskorski, J., Marcińczuk, M., Nakov, P., et al. (eds.). Stroudsburg: Association for Computational Linguistics (ACL), p. 70-77. DOI: 10.18653/v1/2023.bsnlp-1.9
Dmitrieva, A. The role of language technology in accessible communication research. Jun 2023, Emerging Fields in Easy Language and Accessible Communication Research. Deilen, S., Hansen-Schirra, S., Hernández Garrido, S., Maaß, C. & Tardel, A. (eds.). Frank & Timme, p. 319-338. (Easy – Plain – Accessible; vol. 14). https://researchportal.helsinki.fi/fi/publications/the-role-of-language-technology-in-accessible-communication-resea
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.