Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Filip Ginter tells us about his work with the TurkuNLP research group.
I am Filip Ginter, an associate professor of language technology at the University of Turku. I am also presently the longest-serving member of the TurkuNLP research group. I am a computer scientist by training, and I profoundly enjoy the many unique challenges human language poses.
Blessed with neither patience nor a long attention span, I have managed to dip into quite a few research topics over the years with our TurkuNLP team. We started off with scientific literature mining, but then branched out into more general development of various NLP tools and resources. I’ve always had a soft spot for Finnish and chose to contribute especially to Finnish NLP, perhaps to give back to the society which so generously hosted me for my PhD research. My personally most important, or at least most visible, undertaking was the Turku Dependency Treebank, which later became one of the first treebanks in the super-successful Universal Dependencies (UD) initiative and allowed TurkuNLP to be an important member of the UD community from Day 1. The treebank was also the basis for TurkuNLP's relatively widely used line of statistical dependency parsers for Finnish. I am proud that this work helped to bring Finnish into the results tables of ACL papers and to close the gap to much more studied languages, at least in terms of parsing accuracy.
Recently, I of course could not help but jump on board the deep learning tsunami. TurkuNLP’s previous work on crawling the Finnish Internet and gathering billions of words of Finnish paid off when it became a crucial part of the training corpus of the FinBERT model. If you have recently done any machine learning on the Finnish language, it is quite likely you used this model to squeeze out those extra few percentage points of accuracy. The story of FinBERT is a story of having plenty of language data ready at the right moment, and it shows the importance of gathering and maintaining language resources. You never know when you will next need a few billion words of Finnish.
And where do I go from here? I see it as my goal to bring to Finnish, one way or another, most of the tools, tasks, and resources that the bigger languages have. Think about question answering, summarization, semantic search, paraphrase models and many other NLP tasks not yet properly covered for Finnish. If they can exist for English, then they should also exist for Finnish. We are living in exciting times in NLP, and we now have many more opportunities to make it happen than we had even five years ago. And of course, with the LUMI supercomputer around the corner, you can expect exciting new language models from the TurkuNLP workshop.
Apart from these more or less mainstream NLP projects, I have had several, I dare say, successful collaborations in the field of digital humanities, in particular with historians. I enjoyed these projects as they challenged us with interesting technical and algorithmic problems to solve.
Perhaps my most visible contribution to the Language Bank is the Finnish dependency parser (of course, there were many of us working on it in TurkuNLP), which is used by the Language Bank to make data more accessible to researchers. The most recent version of the parser brings a substantial improvement in accuracy on all levels of analysis. One day, when the legislation catches up with present-day language technology needs, I hope to also see our Internet Parsebank and other large-scale web-based data contributed to the Language Bank.
Naturally, we have used the Language Bank’s resources extensively here in TurkuNLP, perhaps most of all the Suomi24 corpus, in various research projects as well as in language model training. We have also benefited enormously from the Newspaper and Periodical OCR Corpus of the National Library of Finland in our work with historians.
I cannot stress enough how important it is for Finnish NLP that we all contribute open datasets and free tools and models to the Language Bank and also maintain our edge in terms of computational resources, with LUMI being the perfect example.
J. Kanerva & F. Ginter & S. Pyysalo 2020. Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task. Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. DOI: 10.18653/v1/2020.iwpt-1.17
J. Kanerva & F. Ginter & T. Salakoski 2020. Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks. Natural Language Engineering. DOI: 10.1017/S1351324920000224
J. Kanerva & F. Ginter & N. Miekka & A. Leino & T. Salakoski 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. DOI: 10.18653/v1/K18-2013
A. Vesanto & A. Nivala & T. Salakoski & H. Salmi & F. Ginter 2017. A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). https://aclanthology.org/W17-0249
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.