Researcher of the Month: Krister Lindén

Krister Lindén
Photo: Juhani Jokinen

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Krister Lindén, the Director of the Language Bank, describes how researchers in the humanities can benefit from artificial intelligence in their corpus-based research.

Who are you?

I am Krister Lindén. At the University of Helsinki, I am Research Director for Language Technology at the Department of Digital Humanities, and Deputy Team Leader at the Centre of Excellence for Ancient Near Eastern Empires. For national research infrastructures, I am the Director of the Language Bank of Finland, the National Coordinator of FIN-CLARIN, and the PI of FIN-CLARIAH. At the EU level, I am Chair of the National Coordinators Forum of CLARIN, a research infrastructure for the humanities and social sciences, and a member of the CLARIN Legal Issues Committee (CLIC).

What is your research topic?

I have always been interested in language technology and its applications and, through my involvement in the Language Bank, increasingly also in the prerequisites for developing and applying the technology:

  • How can we use data to answer a broad range of research questions in the humanities and social sciences?
  • Where can we obtain development and test data to develop and evaluate our data processing methods?
  • Under what conditions can data be shared with other researchers so that they can verify the reported performance of the methods?

Independent evaluation of methods is important both for ensuring progress and for finding the best method in each case. If only a preliminary evaluation is needed and a small-scale experiment is sufficient, you can give ChatGPT a few examples to see how it copes with the task. If there is too little data to use a statistical method reliably and the task requires a high-precision method, it may be quicker to use manually developed methods. On the other hand, if there is enough data, a suitable machine learning method is available, and the processing environment is powerful enough, this combination often provides the most reproducible development path.

All of the above development paths are data-driven and require data to be shared with other researchers for replication. In previous years, there was strong enthusiasm for completely open datasets. While this is still a desirable goal, there are many datasets that, for one reason or another, cannot be made available to everyone. Gradually, our community of researchers, together with lawmakers, has succeeded in developing a legal framework for data access that is open enough for academic researchers to study the data and verify the results in a relatively straightforward way, while keeping the audience small enough not to put personal data at risk or infringe copyrights.

A new development need is a method that lets researchers in the humanities and social sciences discuss with an AI the content of the datasets they deposit in the Language Bank.

How is your research related to Kielipankki?

The Language Bank provides both a platform for tool development and an opportunity to show how different types of research-oriented datasets can be shared with other researchers in a safe and legal way.

Recent publications

Jauhiainen, T., Zampieri, M., Baldwin, T. C., & Linden, K. (2024). Automatic Language Identification in Texts. (Synthesis Lectures on Human Language Technologies). Springer. https://doi.org/10.1007/978-3-031-45822-4

Jauhiainen, T., Piitulainen, J., Axelson, E., Dieckmann, U., Lennes, M., Niemi, J., Rueter, J., & Linden, K. (2024). Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification. In D. Fišer, M. Eskevich, & D. Bordon (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): ParlaCLARIN IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (pp. 48-56). (International conference on computational linguistics), (LREC proceedings). European Language Resources Association (ELRA). https://researchportal.helsinki.fi/files/312866811/ArtikkeliJulkaistu.pdf

Sahala, A., & Linden, K. (2023). BabyLemmatizer 2.0 – A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages. In A. Anderson, S. Gordin, B. Li, Y. Liu, & M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 (pp. 203-212). INCOMA. https://aclanthology.org/2023.alp-1.23

Linden, K., Niemi, J., & Kontino, T. (Eds.) (2023). CLARIN Annual Conference Proceedings 2023. (CLARIN Annual Conference Proceedings). CLARIN ERIC. https://researchportal.helsinki.fi/files/298353929/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf

Lindén, K., Ruokolainen, T., Hämäläinen, L., & Harviainen, J. T. (2023). Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC). In M. M. Rantanen, S. Westerstrand, O. Sahlgren, & J. Koskinen (Eds.), Proceedings of the Conference on Technology Ethics 2023 – Tethics 2023 (pp. 114-131). (CEUR Workshop Proceedings; Vol. 3582). CEUR-WS.org. https://researchportal.helsinki.fi/files/295005165/FP_10.pdf

Kamocki, P., Linden, K., Puksas, A., & Kelli, A. (2023). EU Data Governance Act: Outlining a Potential Role for CLARIN. In T. Erjavec, & M. Eskevich (Eds.), Selected papers from the CLARIN Annual Conference 2022 (pp. 57-65). (Linköping Electronic Conference Proceedings; No. 198). CLARIN ERIC. https://doi.org/10.3384/ecp198006

Linden, K., Jauhiainen, T., & Hardwick, S. (2023). FinnSentiment: A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation, 57(2), 581-609. https://doi.org/10.1007/s10579-023-09644-5

Axelson, E., Hardwick, S., & Linden, K. (2023). HFST Training Environment and Recent Additions. In A. Hurskainen, K. Koskenniemi, & T. P. (Eds.), Rule-Based Language Technology (pp. 60-69). (NEALT Monograph Series; No. 2[1]). Northern European Association for Language Technology. http://hdl.handle.net/10062/89595

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in the social sciences and humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.