This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).
Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.
The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)
To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)
D1.1.1 | Support common CLARIN formats like TEI (CSC/Martin Matthiesen). | 2026-12 |
D1.1.2 | Convert VRT to TEI and showcase the result in a compatible web interface like the KorAP platform used in German CLARIN. (CSC/Martin Matthiesen) | 2027-07 |
D1.1.3 | Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) | 2028-04 |
D1.1.4 | Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) | 2029-10 |
To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)
D1.2.1 | Updated backend of existing ASRs (CSC/Sam Hardwick) | 2026-10 |
D1.2.2 | A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) | 2027-08 |
D1.2.3 | Support for additional future models, and a transparent processing pipeline that makes it easy to evaluate suitability for data with elevated security requirements (CSC/Sam Hardwick) | 2028-06 |
D1.2.4 | Expansion and upgrade of Oulu Clarin-D centre to C or B status; provision of access to additional language resources sourced from multimedia social media content. (OU/Steven Coats) | 2029-11 |
To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)
D1.3.1 | Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) | 2026-06 |
D1.3.2 | Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) | 2027-08 |
D1.3.3 | Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) | 2028-09 |
D1.3.4 | Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) | 2029-10 |
To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)
D2.1.1 | Document the current options for using other processing environments, such as the supercomputers provided by CSC, and their fitness for purpose. (CSC/Martin Matthiesen) | 2026-05 |
D2.1.2 | Propose a proof-of-concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) | 2027-09 |
D2.1.3 | Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) | 2028-06 |
D2.1.4 | Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) | 2029-11 |
To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelsson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)
D2.2.1 | Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelsson) | 2026-12 |
D2.2.2 | Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelsson) | 2027-12 |
D2.2.3 | Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelsson) | 2028-06 |
D2.2.4 | Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelsson) | 2029-08 |
D2.3.1 | Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) | 2026-05 |
D2.3.2 | Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2027-06 |
D2.3.3 | Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2028-08 |
D2.3.4 | A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) | 2029-10 |
D2.4.1 | Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. | 2026-09 |
D2.4.2 | Initiate and develop terminology groups on geography, social geography, and environmental sciences. | 2027-12 |
D2.4.3 | Initiate and develop terminology groups on social policy, economics, and political science. | 2028-05 |
D2.4.4 | Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. | 2029-11 |
This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)
To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)
D3.1.1 | Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models | 2026-05 |
D3.1.2 | Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models | 2027-09 |
D3.1.3 | Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements | 2028-04 |
D3.1.4 | Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements | 2029-10 |
To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)
D3.2.1 | Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and the FIN-CLARIAH infrastructure. (NLF/FINNA/Riitta Peltonen) | 2026-11 |
D3.2.2 | Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitation of dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) | 2027-06 |
D3.2.3 | Ingestion of in-copyright publications/webarchive. Building a research environment for legal deposit material (NLF/Aija Vahtola) | 2028-12 |
D3.2.4 | Ingestion of in-copyright publications/webarchive. Piloting the research environment for legal deposit material with researchers (NLF/Aija Vahtola) | 2029-11 |
D3.3.1 | Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) | 2029-11 |
D3.3.2 | Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) | 2027-11 |
D3.3.3 | Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) | 2029-11 |
D3.3.4 | Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) | 2029-11 |
D3.3.5 | Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) | 2029-11 |
The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)
D4.1.1 | Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) | 2029-11 |
D4.1.2 | Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) | 2029-11 |
D4.1.3 | Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) | 2029-11 |
D4.1.4 | Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) | 2029-11 |
D5.1.1 | Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen) | 2026-06 |
D5.1.2 | Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen) | 2027-11 |
D5.1.3 | Community engagement: User interaction with multimodal data. (TAU/Sanna Kumpulainen) | 2028-06 |
D5.1.4 | Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen) | 2029-11 |
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Integrate environment for personal data
Date of reporting: 30-09-2024
Report authors: Mietta Lennes (UH)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/sd-services/
Keywords for the deliverable page: sensitive data; confidential data; secure desktop; SD services
In case a research dataset contains special categories of personal data or other types of confidential information that cannot be removed without hampering the research purpose, it may be necessary to use a secure environment for processing the data (cf. Deliverable 2.1.2 of the previous funding period of FIN-CLARIAH 2022-2023).
CSC – IT Center for Science provides Sensitive Data services for sharing and analyzing data securely from a web browser. The sensitive data files can be encrypted and uploaded via SD Connect, where they are available to the secure desktop instances of the members of the same project. The virtual machines for the secure desktops are configured and accessed via SD Desktop.
It is also possible to install and use special tools in the SD Desktop environment. Researchers who need to process audio and video material securely can now conveniently install tools such as ELAN (video and audio) or Praat (audio) for viewing, editing, annotating, querying and analyzing their data, or well-known command-line tools such as Whisper (automatic speech recognition), as part of their workflow in the secure environment. For faster access to audio and video files, an external volume can be selected when configuring the virtual machine.
We will continue testing, documenting and improving the functionalities of the SD Desktop with the users of the Language Bank. We are also looking into the possibility of the Language Bank using SD Desktop instances for providing individual users with restricted access to specific sensitive datasets. The SD services are still under active development and the remaining issues can be addressed in collaboration with the experts at CSC.
For researchers in the SSH fields, the step-by-step instructions for using the Sensitive Data services are now maintained on a support page in the online portal of the Language Bank of Finland.
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.1: Data collection for minority languages
Date of reporting: 26-09-2024
Report authors: Martin Matthiesen (CSC)
Contributors: Wilhelmina Dyster (UH), Sjur Moshagen, Katri Hiovain-Asikainen (UiT)
Deliverable location: n/a
Keywords for the deliverable page: Finland-Swedish, Sámi
In this work package, data is collected for two minority language groups: Swedish as spoken in Finland, and the Sámi languages spoken in Norway, Sweden and Finland.
Data collected during the Donera Prat campaign[1] is currently being transcribed manually. This work is expected to be ready by November 2024. The planned release date of the data for research is January 2025.
The data collection for the Sámi languages focuses on the national broadcasting companies of the Nordic countries where the languages are spoken (NRK[2], SVT[3], YLE[4]) and on the University of Tromsø. The national broadcasters already have some of their Sámi content subtitled both in a Sámi language and in their respective national languages, which makes it a valuable resource for research.
We reached a general understanding that the Language Bank of Finland can serve as the main sharing organisation for Sámi data, and we have already carried out test transfers of data from SVT and Tromsø. YLE’s Sámi data is available via KAVI[5]. Before the data can be shared via the Language Bank of Finland, we need to overcome both technical and legal hurdles. On the technical side, we have already reached broad agreement: for example, the data from the various sources will be shared with little or no modification, and KAVI and Aalto University already have experience of collaborating on the LUMI supercomputer. The legal side appears to be the bigger challenge. NRK, SVT and YLE are currently investigating the legal implications of sharing their data via the Language Bank of Finland.
[1] Donera Prat: https://svenska.yle.fi/a/7-10009203
[2] Norwegian Television: https://www.nrk.no/about/
[3] Swedish Television: https://omoss.svt.se/about-svt.html
[4] Finnish Television: https://yle.fi/aihe/about-yle
[5] The Finnish National Audio Visual Institute, https://kavi.fi/en/
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024
Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
Keywords for the deliverable page: versioning, updates, differences
The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files in order to avoid unnecessary growth of the versions stored in Allas.
Two potential users of the data, Erik Axelson and Ville Vaara (both UH), have been interviewed. Both interviews are summarized below.
In 2024, FIN-CLARIN published a new version of “The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1], or klk-fi-v2-1874-vrt for short. This version was created from data obtained directly from the National Library, since our harvesting mechanism was not quite ready at the start of the project to create the new dataset. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of the included publications was compiled[4], along with end-user notes that document inconsistencies found after publication[5]. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format from the data provided in this work package[2]. However, the differences are small, and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the required changes with a proof of concept.
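For readers unfamiliar with verticalized text, the following is a minimal sketch of what a VRT-style representation looks like: one token per line, wrapped in XML-like structural tags. It is an illustration only and not the Language Bank's conversion pipeline. The function name `to_vrt`, the whitespace tokenization, the single `id` attribute and the sample sentences are all assumptions; real VRT data normally carries several tab-separated annotation columns per token.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(text_id, sentences):
    """Render pre-tokenized sentences as a minimal VRT-style string.

    Assumption: tokens are separated by whitespace and only the word form
    is emitted (no lemma or syntactic columns).
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in sentences:
        lines.append("<sentence>")
        # One token per line; escape characters that would clash with the tags.
        lines.extend(escape(token) for token in sentence.split())
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines)


# Hypothetical sample input, for illustration only.
vrt = to_vrt("demo-1", ["Hyvää päivää !", "A & B"])
print(vrt)
```

Keeping one token per line is what makes it straightforward to add further annotation layers as extra tab-separated columns later on.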
Another use case for the data is the Elasticsearch-based tool developed in WP4.3 of the previous FIN-CLARIAH development round[6]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine. When considering newer versions, it became clear that an easy way of finding the differences between versions would be a reasonable addition to the present implementation. The dataset is presently 10 TB in size. Comparing two datasets of that size (the present version and an earlier one) should be done once, during the update, and the result provided to users as a service, which makes updating indexes easier.
Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier, by providing the differences between versions in a machine-readable way. In deliverable 3.1.2 we will demonstrate the changes with working code.
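As an illustration of what "differences between versions in a machine-readable way" could mean, here is a minimal sketch that compares two checksum manifests and emits the delta as JSON. It is not the harvester's implementation. The manifest layout (`{path: checksum}`), the function names and the file names are hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import json


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def diff_versions(old: dict, new: dict) -> dict:
    """Compare two {path: checksum} manifests and list what changed."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }


# Hypothetical file names and contents, for illustration only.
v1 = {
    "1771/a.xml": checksum(b"page one"),
    "1772/b.xml": checksum(b"page two"),
}
v2 = {
    "1771/a.xml": checksum(b"page one"),
    "1772/b.xml": checksum(b"page two, corrected"),
    "1773/c.xml": checksum(b"page three"),
}

delta = diff_versions(v1, v2)
print(json.dumps(delta, indent=2))
```

With a delta like this computed once per release, a derived dataset such as a search index would only need to reprocess the paths listed under `added` and `changed` and drop those under `removed`, instead of comparing the full datasets itself.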
[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401
[2] See the Harvester documentation for details.
[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121
[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801
[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001
[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)
The Language Bank of Finland maintains metadata records of all the resources it distributes. Each individual resource version has its own metadata record with a persistent identifier.
For providing the metadata records, the Language Bank has been using a platform called META-SHARE, but the system is no longer supported. All our currently existing metadata records have been moved to COMEDI, a service hosted by a Norwegian CLARIN centre, CLARINO Bergen. The persistent identifiers of the metadata records curated by the Language Bank of Finland now point to the corresponding records on the COMEDI system.
Please note that, although the metadata records now look a bit different, the content and location of the actual language resources remain unchanged.
Kielipankki maintains metadata records for all the resources it distributes. Each individual resource version has its own metadata record with a persistent identifier.
Kielipankki has used a platform called META-SHARE for providing the metadata, but support for it has ended. All of Kielipankki's current metadata records have been moved to the COMEDI platform, which is maintained by a Norwegian CLARIN centre, CLARINO Bergen. The persistent identifiers of all metadata records curated by Kielipankki have been automatically redirected to their new addresses in COMEDI.
Please note that although the metadata records now look slightly different, the content and location of the resources themselves have not changed.
Due to very low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC’s cloud services move to the new version of Rahti during summer 2024. Mylly will be available until 17th June 2024. Due to the short notice, we will keep the users’ data for three months after the shutdown.
If you wish to retrieve your data, you can download it yourself until 17th June, or contact the CSC service desk within three months after that.
If you wish to use the tool scripts from Mylly on other services (e.g., Puhti or CSC Notebooks), the software will still be available on GitHub.
Due to low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC's cloud services move to the new version of Rahti during summer 2024. Mylly will remain available until 17 June 2024. Because of the tight schedule, we aim to keep users' data for a further 3 months after that.
If you want to save the data you had in Mylly, you can download it yourself until 17 June, or during the following three months by contacting the CSC service desk.
If you want to use Mylly's tool scripts on other platforms (e.g. Puhti or CSC Notebooks), the scripts will remain available on GitHub.
An article presenting the LAREINA – Language Resource Infrastructure for AI (2023–25) project has been published on the website of the University of Helsinki. The LAREINA project is funded by Business Finland and implemented by Aalto University and the University of Helsinki as part of Tietoevry’s Veturi programme. The project involves companies and public sector organisations as partners.
The LAREINA project develops speech recognition and speech synthesis for Finnish, Finnish-Swedish and the Sámi languages. The project partners will test the components in different tasks and in areas such as call centres and machine translation. The LAREINA project aims to ensure that high-quality speech interfaces and speech-based AI services are also available for speakers of small languages.
The outputs of the LAREINA project will be published under an open licence that also allows commercial use, and they will be made available through the Language Bank of Finland – Kielipankki.
Read more about the LAREINA project on the University of Helsinki website: ”Speech-based AI services needed for small languages as well – researchers support companies in product development” (Published on 11.04.2024)
Visit the LAREINA project webpage: https://www.kielipankki.fi/business/lareina/
An article presenting the LAREINA – Language Resource Infrastructure for AI project (2023–25) has been published on the University of Helsinki website. The project is funded by Business Finland and carried out by Aalto University and the University of Helsinki as part of Tietoevry's Veturi programme. Companies and public-sector organisations take part in the project as partners.
The LAREINA project develops speech recognition and speech synthesis for Finnish, Finland-Swedish and the Sámi languages. The partners involved test them in, for example, telephone services and translation. The project aims to ensure that high-quality speech interfaces and speech-based AI services can also be produced for speakers of small languages.
The outputs of the LAREINA project will be published under an open licence that also permits commercial use, and later also through Kielipankki.
Read more about the LAREINA project on the University of Helsinki website: ”Puheella toimivia tekoälypalveluja tarvitaan myös pienille kielille – tutkijat vauhdittavat yritysten tuotekehitystä” (published on 11.4.2024).
Visit the LAREINA project web pages: https://www.kielipankki.fi/yrityksille/lareina/
CLARIN is a consortium partner in the Advancing FronTier Research In the Arts and hUManities (ATRIUM) project.
The ATRIUM project invites researchers to apply to participate in Transnational Access training visits to support their research. ATRIUM’s Transnational Access (TNA) scheme offers researchers the possibility to apply for a fully funded placement at several different partner organisations to access expert knowledge and advice from leading Data Management organisations across Europe.
The TNA scheme aims to recruit and support approximately 200 Arts and Humanities researchers with mentorship and access to knowledge, data and tools from 14 different institutions across Europe. Researchers who are successful in their applications will be supported to visit the infrastructure providers in our consortium in person, benefiting from direct contact, knowledge sharing and network building. In total, 388 weeks of Transnational Access will be provided during the ATRIUM project.
There are two types of TNA applications: Individual Access and Summer Schools.
The first collection date is 31 May, 2024, and applicants will be notified by 28 June, 2024. Calls for applications will be issued several times per year throughout the duration of the project (March 2024 to December 2028).
Individual Access applications will be offered on a rolling basis with a deadline every three months. Summer Schools will be offered 1 to 2 times a year with a fixed deadline 3 to 4 months ahead of the scheduled event.
Visit www.atrium-research.eu for more information.
A big thank-you to all donors!
Yle, the University of Helsinki and the State Development Company Vake (later Ilmastorahasto Oy) jointly ran the Lahjoita puhetta campaign for collecting spoken Finnish, which has been running since 16 June 2020. During the first year, the campaign successfully gathered more than 3,000 hours of speech donations. In recent years and months, however, donations have come in only sporadically. From 2021 onwards, Finland-Swedish was also collected through the smaller Donera prat campaign.
Both collection campaigns have now been closed. The data is being organised and stored in Kielipankki, through which researchers and companies can obtain speech data for their use under certain conditions. We hope that the data will help researchers and companies create better models of spoken Finnish and develop future services that work smoothly in Finnish.
More information on the Finnish and Finland-Swedish campaigns 2020-2024 (in English)
The Korp concordancing service in the Language Bank of Finland is moving to a new, faster server between 1 and 2 p.m. on 7th February 2024. There will be a short service break.
After the change, Korp will be available via the new address https://www.kielipankki.fi/korp/. The old address https://korp.csc.fi will be automatically redirected to the new site.
The corpora offered via Korp will continue to be available via the new server. However, queries will be completed faster.
NB: At the moment, logging in to the new Korp (to access ACA or RES licensed corpora) may sometimes fail if the complete address (www.kielipankki.fi/korp/) was not used to open Korp. This issue will be fixed as soon as possible, and you should already be able to log in normally when using the link mentioned above.
On 7 February 2024 between 1 and 2 p.m., Korp will move to a new, faster server. There will be a short service break during the move.
At the same time, the address of Kielipankki's Korp changes to https://www.kielipankki.fi/korp/. The old address https://korp.csc.fi is redirected automatically to the new location.
The corpora available via Korp will work on the new server as before, but searches will run faster.
NB: For the time being, logging in to Korp (ACA and RES licensed corpora) may occasionally fail if the full address beginning with www (www.kielipankki.fi/korp/) was not used when entering the Korp service. We aim to fix this problem as soon as possible. Logging in should, however, already work normally via the link mentioned above.
The Finnish treebank FinnTreeBank 3 contains the same data that is also available in the Finnish parts of two separate corpora, the Helsinki Korp Europarl corpus collection (Europarl) and the Helsinki Korp JRC-Acquis corpus collection (JRC-Acquis).
The rightsholder details have been added to the metadata and licence pages of Europarl, JRC-Acquis and Finnish TreeBank 3. When citing any of these resources, please remember to follow their citation instructions.
Further information is available at http://urn.fi/urn:nbn:fi:lb-2023111302.
The Finnish TreeBank 3 shares data with two corpora: the Finnish part of the Helsinki Korp Europarl Bilingual Corpora (Europarl) and the Finnish part of the Helsinki Korp JRC-Acquis Bilingual Parallel Corpora (JRC-Acquis).
The resource metadata and the license pages of Europarl, JRC-Acquis and Finnish TreeBank 3 have been updated with the details of the rightsholders. When referring to one of these resources, please follow the corresponding citation instructions.
For further information please see http://urn.fi/urn:nbn:fi:lb-2023111301.
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.1: Report on ingesting new unstructured resources
Date of reporting: 30-11-2023
Report authors: Mietta Lennes, Jussi Piitulainen (University of Helsinki)
Contributors: Ute Dieckmann, Erik Axelson, Jyrki Niemi, Jack Rueter, Tommi Jauhiainen, Krister Lindén (University of Helsinki)
Deliverable location: Corpora and tools available via the Language Bank of Finland
Keywords for the deliverable page: corpus, data set, automatic language identification
The Newspaper and Periodical Corpus of the National Library of Finland was extended with a significant amount of new material from the National Library. The new version was organized according to the automatically identified language of each sentence. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (klk-fi-v2), consisting of more than 22 billion word tokens, was published in Korp in summer 2023. It consists of the text elements that contain at least one “fin” sentence (from the new material, from the previous version of klk-fi, and from the previous klk-sv). In addition, summary attributes indicate the frequency distribution of languages within each text and each paragraph. An extended version of the Swedish sub-corpus (klk-sv-v2) has been compiled in a similar way (from texts containing any “swe” sentence), but the Swedish data is still waiting for the remaining annotations to be completed. For details of how the National Library data was reorganized according to language, see Jauhiainen et al. 2022.
The HeLI-OTS language identification tool was adapted for the format used in the Language Bank of Finland, together with a post-processor written to correct the identification of each sentence within its context. Another new tool was written to partition the corpus, first by the main identified languages, then by the year of publication.
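The selection and partitioning logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the HeLI-OTS adaptation or the actual partitioning tool. The record layout, the function names and the sample data are hypothetical, sentences are assumed to arrive already labelled with a language code, and every text is assumed to contain at least one sentence.

```python
from collections import Counter, defaultdict


def language_counts(text):
    """Count sentences per identified language within one text."""
    return Counter(lang for lang, _sentence in text["sentences"])


def has_language(text, lang):
    """True if at least one sentence in the text was identified as `lang`."""
    return language_counts(text)[lang] > 0


def partition(texts):
    """Group text ids first by main (most frequent) language, then by year."""
    groups = defaultdict(lambda: defaultdict(list))
    for text in texts:
        # Assumes the text has at least one sentence.
        main_lang, _count = language_counts(text).most_common(1)[0]
        groups[main_lang][text["year"]].append(text["id"])
    return {lang: dict(years) for lang, years in groups.items()}


# Hypothetical sample records; sentence contents are placeholders.
texts = [
    {"id": "t1", "year": 1850, "sentences": [("fin", "s1"), ("fin", "s2"), ("swe", "s3")]},
    {"id": "t2", "year": 1850, "sentences": [("swe", "s1"), ("swe", "s2")]},
    {"id": "t3", "year": 1862, "sentences": [("swe", "s1"), ("swe", "s2"), ("fin", "s3")]},
]

# Selection rule from the text: keep every text with at least one "fin" sentence.
finnish_subcorpus = [t["id"] for t in texts if has_language(t, "fin")]
by_language_and_year = partition(texts)
print(finnish_subcorpus)
print(by_language_and_year)
```

Note that the two operations answer different questions: a text such as `t3` belongs to the Finnish selection because it contains a "fin" sentence, even though its main language in the partition is "swe". The per-text `Counter` corresponds to the kind of language frequency summary that the corpus stores as attributes.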
As a demonstration of ingesting resources including parallel spoken material in multiple languages, the corpus Christmas Gospel text-to-speech in four Uralic languages was prepared and made available for searching and playback via Korp (for details on this effort, see D2.3.2).
Other corpora published in Korp during the years 2022-23 include, e.g., the Finnish News Agency Archive 1992-2018, Kielipankki Korp Version; Corpus of Contemporary American English (COCA) – Kielipankki Korp version 2020; and Erzya and Moksha Extended Corpora (ERME) version 2, Korp.
In addition, various downloadable resources were published, e.g., Corpus of Contemporary American English – Kielipankki VRT version 2020; FinnTreeBank 1, 2 and 3; Word embeddings trained with word2vec from the Finnish Text Collection; The Coronavirus Corpus (Mark Davies, english-corpora.org) – Kielipankki version 2021-05; and The Finnish Dark Web Marketplace Corpus.
During the project, the resource publication pipeline of the Language Bank of Finland has been refined and documented. The structure of the pipeline was first presented at the CLARIN Annual Conference in 2022 and described in the conference proceedings (Dieckmann & al., 2023, see below).
Last modified on 2023-11-30