D1.1.2: Ingesting new unstructured resources

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.1: Report on ingesting new unstructured resources
Date of reporting: 30-11-2023

Report authors: Mietta Lennes, Jussi Piitulainen (University of Helsinki)
Contributors: Ute Dieckmann, Erik Axelson, Jyrki Niemi, Jack Rueter, Tommi Jauhiainen, Krister Lindén (University of Helsinki)
Deliverable location: Corpora and tools available via the Language Bank of Finland

Keywords for the deliverable page: corpus, data set, automatic language identification

Description

The Newspaper and Periodical Corpus of the National Library of Finland was extended with a significant amount of new material from the National Library. The new version was organized according to the automatically identified language of each sentence. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (klk-fi-v2), consisting of more than 22 billion word tokens, was published in Korp in summer 2023. It consists of the text elements that contain at least one sentence identified as "fin", drawn from the new material, from the previous version of klk-fi, and from the previous klk-sv. In addition, summary attributes indicate the frequency distribution of languages within each text and each paragraph. An extended version of the Swedish sub-corpus (klk-sv-v2) has been compiled in the same way (any text containing a "swe" sentence), but the remaining annotations for the Swedish data have not yet been completed. For details of how the National Library data was reorganized according to language, see Jauhiainen et al. 2022.
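The selection rule and the summary attributes described above can be illustrated with a minimal Python sketch. It is not the code used for klk-fi-v2: the function names, the in-memory data layout (a text as a list of paragraphs, each a list of sentence language codes) and the "fin:2 swe:1" summary format are assumptions made for illustration only.

```python
from collections import Counter


def language_summary(sentence_langs):
    """Return a compact frequency summary such as 'fin:2 swe:1'.

    The output format is hypothetical; the actual attribute format
    used in the corpus is not specified in this report.
    """
    counts = Counter(sentence_langs)
    # Sort by descending count, then by language code for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return " ".join(f"{lang}:{n}" for lang, n in ordered)


def belongs_to_subcorpus(text, lang):
    """True if at least one sentence in the text is labelled `lang`."""
    return any(lang in paragraph for paragraph in text)


# Hypothetical text: two paragraphs with per-sentence language codes.
text = [["fin", "fin", "swe"], ["swe"]]

paragraph_summaries = [language_summary(p) for p in text]
text_summary = language_summary(code for p in text for code in p)
```

Under these assumptions, the example text would be included in both the Finnish and the Swedish sub-corpus, because it contains at least one sentence in each language.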

The HeLI-OTS language identification tool was adapted to the format used in the Language Bank of Finland, and a post-processor was written to correct the language identified for each sentence on the basis of its context. Another new tool was written to partition the corpus, first by the main identified languages and then by the year of publication.
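The two-level partitioning can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool that was written in the project: the record fields (`id`, `year`, `sentence_langs`), the `MAIN_LANGUAGES` tuple and the choice to place a mixed-language text in every matching partition are all hypothetical.

```python
from collections import defaultdict

# Assumption: the "main identified languages" are taken to be
# Finnish and Swedish for the purposes of this sketch.
MAIN_LANGUAGES = ("fin", "swe")


def partition(texts):
    """Group text ids by (language, year of publication).

    A text is placed in the partition of every main language that
    occurs in at least one of its sentences.
    """
    parts = defaultdict(list)
    for text in texts:
        langs = set(text["sentence_langs"])
        for lang in MAIN_LANGUAGES:
            if lang in langs:
                parts[(lang, text["year"])].append(text["id"])
    return dict(parts)


# Hypothetical records standing in for corpus texts.
texts = [
    {"id": "a", "year": 1900, "sentence_langs": ["fin", "swe"]},
    {"id": "b", "year": 1900, "sentence_langs": ["fin"]},
    {"id": "c", "year": 1901, "sentence_langs": ["rus"]},
]

partitions = partition(texts)
```

In this sketch, text "a" ends up in both the ("fin", 1900) and ("swe", 1900) partitions, while text "c" is left out because it contains none of the assumed main languages.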

As a demonstration of ingesting resources including parallel spoken material in multiple languages, the corpus Christmas Gospel text-to-speech in four Uralic languages was prepared and made available for searching and playback via Korp (for details on this effort, see D2.3.2).

Other corpora published in Korp during 2022-2023 include, e.g., the Finnish News Agency Archive 1992-2018, Kielipankki Korp Version; Corpus of Contemporary American English (COCA) – Kielipankki Korp version 2020; and Erzya and Moksha Extended Corpora (ERME) version 2, Korp.

In addition, various downloadable resources were published, e.g., Corpus of Contemporary American English – Kielipankki VRT version 2020; FinnTreeBank 1, 2 and 3; Word embeddings trained with word2vec from the Finnish Text Collection; The Coronavirus Corpus (Mark Davies, english-corpora.org) – Kielipankki version 2021-05; and The Finnish Dark Web Marketplace Corpus.

During the project, the resource publication pipeline of the Language Bank of Finland was refined and documented. The structure of the pipeline was first presented at the CLARIN Annual Conference in 2022 and described in the conference proceedings (Dieckmann et al. 2023, see below).

Publications

  • Jauhiainen, T., Piitulainen, J., Axelson, E., Lindén, K. (2022) Language diversity in the newspaper and periodical corpus of the National Library of Finland. Poster presented at Digital Research Data and Human Sciences (DRDHum), 1.-3.12.2022, Jyväskylä, Finland.
  • Dieckmann, U., Lennes, M., Piitulainen, J., Niemi, J., Axelson, E., Jauhiainen, T., Lindén, K. (2023) The Pipeline for Publishing Resources in the Language Bank of Finland. Erjavec, T., Eskevich, M. (editors), Selected Papers from the CLARIN Annual Conference 2022, pp. 33-43. Linköping University Electronic Press.