Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.1: Report on Updating LBF resource selection
Date of reporting: 2022-09
Report author: Jussi Piitulainen (UHEL)
Contributors: Ute Dieckmann, Varpu Vehomäki, Krister Lindén, Mietta Lennes (UHEL)
Deliverable location: Corpora | Kielipankki
The Kielipankki data sets are available through the appropriate channels: the download service, the Korp concordance engine, and a data directory in the Puhti computing environment. The data sets have persistent identifiers and are documented in public metadata records, resource family pages, and resource group pages.
We are currently updating data sets (Suomi24, STT newswire) with Universal Dependencies (UD2) annotations, which are added alongside the previous annotation model. We are also using automatic language identification to separate the Finnish and Swedish texts in a large new batch of the National Library newspaper corpus (KLK). Data sets in the ingestion pipeline are being documented and prioritized so that they become available through the appropriate Kielipankki channels.