Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on analysis tools for multimodal born-digital social media: Nordic Tweet Stream (NTS)
Date of reporting: 18-12-2024
Report author: Mikko Laitinen (UEF)
Contributors: Paula Rautionaho (UEF), Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location: https://nordictweetstream.fi/
The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region, covering over 11 years from 2013 to 2023. It is accessible through a graphical interface that allows users to search, subset, visualize, and download very large-scale user-generated data from one social media application.
The objective of this digital interface is to enable easy access to and distribution of born-digital data for basic research. We have recently witnessed the closing down of free access to various digital sources because of the APIcalypse (Bruns 2019) and feel that, despite restrictive measures by social media giants, it is extremely important to preserve cultural heritage from social media. We operate according to the FAIR Data Principles, which aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).
The NTS provides data spanning from January 2013 to May 2023, encompassing over 900 million tokens from more than 73 million messages, generated by nearly 900,000 individuals. The dataset includes content in 73 languages. The largest languages are Swedish (c. 31 %), English (c. 26 %) and Finnish (c. 13 %). Detailed information on the material can be found in the Statistics pages of the interface.
The NTS dataset is intended for use by researchers across various disciplines, including sociolinguistics, dialectology, social sciences, and cultural studies. It can serve as both primary data and supplementary material alongside structured corpus data. This interface is designed for users seeking quick access to the data. Advanced users, however, may prefer to utilize the download function to retrieve the data for further processing in other environments.
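For users who take the download route, further processing can start with something as simple as filtering an export by language. The following is a minimal sketch only: it assumes a hypothetical CSV export with `lang` and `text` columns, and the actual download format and field names of the NTS interface may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded NTS subset.
# The column names "lang" and "text" are assumptions, not the documented export format.
SAMPLE_EXPORT = """lang,text
sv,God morgon Stockholm
en,Good morning from Helsinki
fi,Hyvää huomenta Joensuu
sv,Hej hej
"""


def language_shares(rows):
    """Return each language's share of messages as a percentage."""
    counts = Counter(row["lang"] for row in rows)
    total = sum(counts.values())
    return {lang: 100 * n / total for lang, n in counts.items()}


def subset_by_language(rows, lang):
    """Keep only the messages tagged with the given language code."""
    return [row for row in rows if row["lang"] == lang]


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
shares = language_shares(rows)
swedish = subset_by_language(rows, "sv")
```

With a real export, `io.StringIO(SAMPLE_EXPORT)` would be replaced by an opened file, and the column names adjusted to whatever the download actually contains.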
Laitinen, M., Lundberg, J., Levin, M., & Martins, R. M. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data. In DHN 2018 Digital Humanities in the Nordic Countries 3rd Conference: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference Helsinki, Finland, pp. 349–362. https://erepo.uef.fi/handle/123456789/6697
The NTS was presented at the following event:
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Enrich survey data with register data and unstructured text
Date of reporting: 12-12-2024
Report authors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Deliverable location: https://cran.r-project.org/web/packages/finnsurveytext/index.html
The finnsurveytext R package has been developed to aid researchers in analyzing responses to open-ended survey questions and other structured text data. This user-friendly tool facilitates reproducible analysis of text data by providing features such as summarizing response properties, identifying frequent words and phrases, visualizing responses, and generating concept network plots. The second version of the package, released in August 2024, integrates with the widely used R package survey, allowing survey design to be incorporated into the analysis. Although originally designed for analyzing text in Finnish, the package is versatile and can be used for text analysis in other languages as well.
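To illustrate what "identifying frequent words and phrases" means in practice, the sketch below counts single words and two-word phrases in a handful of open-ended responses. It is written in Python purely for illustration and does not reproduce the finnsurveytext API; the example responses, the stopword list, and the function names are all invented for this sketch.

```python
import re
from collections import Counter

# Invented example responses; not taken from any real survey.
RESPONSES = [
    "More bike lanes and better public transport",
    "Better public transport in the evenings",
    "Public transport is too expensive",
]

# A tiny illustrative stopword list (an assumption; real analyses use fuller lists).
STOPWORDS = {"and", "in", "the", "is", "too"}


def tokenize(text):
    """Lowercase a response and keep its alphabetic tokens, minus stopwords."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in STOPWORDS]


def top_ngrams(responses, n, k):
    """Count n-grams within each response and return the k most frequent."""
    counts = Counter()
    for response in responses:
        tokens = tokenize(response)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


top_words = top_ngrams(RESPONSES, n=1, k=2)
top_bigrams = top_ngrams(RESPONSES, n=2, k=1)
```

N-grams are counted within each response rather than across response boundaries, so a phrase is never assembled from the end of one answer and the start of the next.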
The R package finnsurveytext has been released on CRAN, with two updates. The package is hosted on CRAN, and additional material is available on the website. An article on the package has been written; it is available on Zenodo and is under review for the new DARIAH publication.
The results of the work package were presented at two events: an invited lecture at the Workshop on Survey Statistics 2024, held in Poznan, Poland, on 26-30 August, and the Statistics Sweden and Örebro University Summer School 2024 on 28 August.
This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).
Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.
The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)
To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)
D1.1.1 | Support common CLARIN formats like TEI (CSC/Martin Matthiesen). | 2026-12 |
D1.1.3 | Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) | 2028-04 |
D1.1.4 | Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) | 2029-10 |
To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)
D1.2.1 | Updated backend of existing ASRs (CSC/Sam Hardwick) | 2026-10 |
D1.2.2 | A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) | 2027-08 |
D1.2.3 | Support for additional future models and make the processing pipeline transparent for easy evaluation of suitability for data with elevated security requirements (CSC/Sam Hardwick) | 2028-06 |
To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)
D1.3.1 | Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) | 2026-06 |
D1.3.2 | Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) | 2027-08 |
D1.3.3 | Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) | 2028-09 |
D1.3.4 | Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) | 2029-10 |
To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)
D2.1.1 | Document the current options for using other processing environments, such as the supercomputers provided by CSC, and their fitness for purpose. (CSC/Martin Matthiesen) | 2026-05 |
D2.1.2 | Propose a proof of concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) | 2027-09 |
D2.1.3 | Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) | 2028-06 |
D2.1.4 | Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) | 2029-11 |
To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelsson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)
D2.2.1 | Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelsson) | 2026-12 |
D2.2.2 | Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelsson) | 2027-12 |
D2.2.3 | Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelsson) | 2028-06 |
D2.2.4 | Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelsson) | 2029-08 |
D2.3.1 | Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) | 2026-05 |
D2.3.2 | Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2027-06 |
D2.3.3 | Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2028-08 |
D2.3.4 | A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) | 2029-10 |
D2.4.1 | Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. | 2026-09 |
D2.4.2 | Initiate and develop terminology groups on geography, social geography, and environmental sciences. | 2027-12 |
D2.4.3 | Initiate and develop terminology groups on social policy, economics, and political science. | 2028-05 |
D2.4.4 | Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. | 2029-11 |
This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)
To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)
D3.1.1 | Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models | 2026-05 |
D3.1.2 | Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models | 2027-09 |
D3.1.3 | Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements | 2028-04 |
D3.1.4 | Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements | 2029-10 |
To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)
D3.2.1 | Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and the FIN-CLARIAH infrastructure. (NLF/FINNA/Riitta Peltonen) | 2026-11 |
D3.2.2 | Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitating dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) | 2027-06 |
D3.2.3 | Ingestion of in-copyright publications/webarchive. Building a research environment for legal deposit material (NLF/Aija Vahtola) | 2028-12 |
D3.2.4 | Ingestion of in-copyright publications/webarchive. Piloting the research environment for legal deposit material with researchers (NLF/Aija Vahtola) | 2029-11 |
D3.3.1 | Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) | 2029-11 |
D3.3.2 | Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) | 2027-11 |
D3.3.3 | Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) | 2029-11 |
D3.3.4 | Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) | 2029-11 |
D3.3.5 | Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) | 2029-11 |
The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)
D4.1.1 | Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) | 2029-11 |
D4.1.2 | Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) | 2029-11 |
D4.1.3 | Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) | 2029-11 |
D4.1.4 | Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) | 2029-11 |
D5.1.1 | Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen) | 2026-06 |
D5.1.2 | Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen) | 2027-11 |
D5.1.4 | Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen) | 2029-11 |
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Integrate environment for personal data
Date of reporting: 30-09-2024
Report authors: Mietta Lennes (UH)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/sd-services/
Keywords for the deliverable page: sensitive data; confidential data; secure desktop; SD services
In case a research dataset contains special categories of personal data or other types of confidential information that cannot be removed without hampering the research purpose, it may be necessary to use a secure environment for processing the data (cf. Deliverable 2.1.2 of the previous funding period of FIN-CLARIAH 2022-2023).
CSC – IT Center for Science provides Sensitive Data services for sharing and analyzing data securely from a web browser. The sensitive data files can be encrypted and uploaded via SD Connect, where they are available to the secure desktop instances of the members of the same project. The virtual machines for the secure desktops are configured and accessed via SD Desktop.
It is also possible to install and use special tools in the SD Desktop environment. Researchers who need to process audio and video material securely can now also conveniently install tools such as ELAN (video and audio) or Praat (audio) for viewing, editing, annotating, querying and analyzing their data, or well-known command-line tools such as Whisper (automatic speech recognition) as part of their workflow in the secure environment. For faster access to audio and video files, an external volume can be selected when configuring the virtual machine.
We will continue testing, documenting and improving the functionalities of the SD Desktop with the users of the Language Bank. We are also looking into the possibility of the Language Bank using SD Desktop instances for providing individual users with restricted access to specific sensitive datasets. The SD services are still under active development and the remaining issues can be addressed in collaboration with the experts at CSC.
For researchers in the SSH fields, the step-by-step instructions for using the Sensitive Data services are now maintained on a support page in the online portal of the Language Bank of Finland.
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.1: Data collection for minority languages
Date of reporting: 26-09-2024
Report authors: Martin Matthiesen (CSC)
Contributors: Wilhelmina Dyster (UH), Sjur Moshagen, Katri Hiovain-Asikainen (UiT)
Deliverable location: n/a
Keywords for the deliverable page: Finland-Swedish, Sámi
In this work package, data is collected for two minority language groups: Swedish as spoken in Finland, and the Sámi languages spoken in Norway, Sweden and Finland.
Data collected during the Donera Prat campaign[1] is currently being transliterated manually. This work is expected to be completed by November 2024. The planned release date of the data for research is January 2025.
The data collection for Sámi languages focuses on the University of Tromsø and on the national broadcasting companies of the Nordic countries where the languages are spoken (NRK[2], SVT[3], YLE[4]). The national broadcasters already have some of their Sámi data subtitled in a Sámi language and in their respective national languages, which makes it a valuable resource for research.
We reached a general understanding that the Language Bank of Finland can serve as the main sharing organisation for Sámi data, and we have already carried out test transfers of data from SVT and Tromsø. YLE’s Sámi data is available via KAVI[5]. Before the data can be shared via the Language Bank of Finland, we need to overcome technical and legal hurdles. On the technical side, we have already reached broad agreement: for example, the data from the various sources will be shared with little or no change, and KAVI and Aalto University already have experience in collaborating using the LUMI supercomputer. The legal side seems to be the bigger challenge: NRK, SVT and YLE are currently investigating the legal implications of sharing their data via the Language Bank of Finland.
[1] Donera Prat https://svenska.yle.fi/a/7-10009203
[2] Norwegian Television: https://www.nrk.no/about/
[3] Swedish Television: https://omoss.svt.se/about-svt.html
[4] Finnish Television: https://yle.fi/aihe/about-yle
[5] The Finnish National Audio Visual Institute, https://kavi.fi/en/
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024
Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
Keywords for the deliverable page: versioning, updates, differences
The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files in order to avoid unnecessary growth of the versions stored in Allas.
Interviews with potential users of the data have been conducted: Erik Axelson and Ville Vaara (both UH). Both interviews are summarized below.
In 2024, FIN-CLARIN published a new version of “The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1], klk-fi-v2-1874-vrt for short. This version was created using data obtained directly from the National Library, since our harvesting mechanism was not quite ready at the start of the project to create the new dataset. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of included publications[4] was compiled, as were end user notes[5], which document inconsistencies found after publication. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format than the data provided in this work package[2]. However, the differences are small, and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the changes needed with a proof of concept.
Another use case for the data is the Elasticsearch-based tool developed in WP4.3 of the previous FIN-CLARIAH development round[6]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine. When considering newer versions, it became clear that an easy way of finding the differences between versions would be a reasonable addition to the present implementation. The dataset is presently 10 TB in size, and comparing two datasets of that size (the present version and an earlier one) is something that should be done once, during the update, and provided to the user as a service, enabling easier updates of indexes.
Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier by providing the differences between versions in a machine-readable way. In Deliverable 3.1.2, we will demonstrate the changes with working code.
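The idea of providing the differences between versions in a machine-readable way can be sketched as follows. This is a minimal illustration and not the harvester's implementation: it assumes that each version can be summarised as a mapping from file path to content checksum, and the manifest layout (`added`, `removed`, `changed`) as well as the file names are hypothetical.

```python
import hashlib
import json


def checksum(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying the content of one file."""
    return hashlib.sha256(content).hexdigest()


def diff_versions(old: dict, new: dict) -> dict:
    """Compare two {path: checksum} mappings and list what differs."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }


# Invented file names and contents standing in for two dataset versions.
version_1 = {
    "1771/issue_a.xml": checksum(b"first issue"),
    "1772/issue_b.xml": checksum(b"second issue"),
    "1773/issue_c.xml": checksum(b"third issue"),
}
version_2 = {
    "1771/issue_a.xml": checksum(b"first issue"),              # unchanged
    "1772/issue_b.xml": checksum(b"second issue, corrected"),  # changed
    "1774/issue_d.xml": checksum(b"fourth issue"),             # added
}

manifest = diff_versions(version_1, version_2)
# JSON form that could be published alongside a new version
manifest_json = json.dumps(manifest, indent=2)
```

A consumer such as a search index would then only need to re-index the `added` and `changed` entries and delete the `removed` ones, rather than comparing two full copies of the dataset.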
[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401
[2] See the Harvester documentation for details.
[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121
[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801
[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001
[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)
The Donera prat (Donate Speech) campaigns for Finnish and Finland-Swedish ended on 6.3.2024. Many thanks to all donors!
Since 16 June 2020, Yle, the former Vake Oy (Valtion kehitysyhtiö; currently Ilmastorahasto Oy) and the University of Helsinki have run the Lahjoita puhetta campaign to collect Finnish speech. In a smaller Donera prat campaign, which started in 2021, Finland-Swedish speech has also been collected. During the first year of the Finnish campaign, more than 3,000 hours of speech were donated. Recently, however, very few donations have come in.
The donation campaigns for Finnish and Finland-Swedish speech have now ended. The datasets will be organized and stored by the Language Bank of Finland (Kielipankki). Through the Language Bank of Finland, researchers and companies can obtain access to the Donate Speech datasets under specific conditions. We hope that the data will help both researchers and companies to create better models of Finnish and Finland-Swedish speech and to develop future services that can easily be used in Finnish and Finland-Swedish.
Read more:
Updated: 6.3.2024
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.1: Report on ingesting new unstructured resources
Date of reporting: 30-11-2023
Report authors: Mietta Lennes, Jussi Piitulainen (University of Helsinki)
Contributors: Ute Dieckmann, Erik Axelson, Jyrki Niemi, Jack Rueter, Tommi Jauhiainen, Krister Lindén (University of Helsinki)
Deliverable location: Corpora and tools available via the Language Bank of Finland
Keywords for the deliverable page: corpus, data set, automatic language identification
The Newspaper and Periodical Corpus of the National Library of Finland was extended with a significant amount of new material from the National Library. The new version was organized according to the automatically identified language of each sentence. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (klk-fi-v2), consisting of more than 22 billion word tokens, was published in Korp in summer 2023. It consists of the text elements that contain at least one “fin” sentence (from the new material, from the previous version of klk-fi, and from the previous klk-sv). Moreover, the summary attributes indicate the frequency distribution of languages within each text and each paragraph. An extended version of the Swedish sub-corpus (klk-sv-v2) has been compiled in a similar way (any “swe” in a text), but the Swedish data is currently still waiting for the rest of the annotations to be completed. For details on the reorganization process of the National Library data according to language, see Jauhiainen et al. 2022.
The HeLI-OTS language identification tool was adapted for the format used in the Language Bank of Finland, together with a post-processor written to correct the identification of each sentence within its context. Another new tool was written to partition the corpus, first by the main identified languages, then by the year of publication.
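The partitioning logic described above (a text belongs to a language's sub-corpus if it contains at least one sentence in that language, and sub-corpora are then split by year) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool that was written for the Language Bank: the record layout, the identifiers, and the function names are invented, and the per-sentence language codes are assumed to have been assigned already by a language identifier.

```python
from collections import Counter, defaultdict

# Invented records: each text has a publication year and per-sentence
# language codes (ISO 639-3), assumed to come from a language identifier.
TEXTS = [
    {"id": "t1", "year": 1850, "sentence_langs": ["fin", "fin", "swe"]},
    {"id": "t2", "year": 1850, "sentence_langs": ["swe", "swe"]},
    {"id": "t3", "year": 1871, "sentence_langs": ["fin"]},
]


def language_distribution(sentence_langs):
    """Summarise how many sentences of each language a text contains."""
    return dict(Counter(sentence_langs))


def partition(texts, languages=("fin", "swe")):
    """Group text ids first by language, then by year of publication.

    A text goes into a language's partition if at least one of its
    sentences was identified as that language, so it may appear in several.
    """
    parts = defaultdict(lambda: defaultdict(list))
    for text in texts:
        for lang in languages:
            if lang in text["sentence_langs"]:
                parts[lang][text["year"]].append(text["id"])
    return {lang: dict(years) for lang, years in parts.items()}


summary = {t["id"]: language_distribution(t["sentence_langs"]) for t in TEXTS}
parts = partition(TEXTS)
```

In this toy data, the mixed-language text `t1` lands in both the Finnish and the Swedish partition, which mirrors the "at least one sentence" criterion described for klk-fi-v2 and klk-sv-v2.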
As a demonstration of ingesting resources including parallel spoken material in multiple languages, the corpus Christmas Gospel text-to-speech in four Uralic languages was prepared and made available for searching and playback via Korp (for details on this effort, see D2.3.2).
Other corpora published in Korp during the years 2022-23 include, e.g., the Finnish News Agency Archive 1992-2018, Kielipankki Korp Version; Corpus of Contemporary American English (COCA) – Kielipankki Korp version 2020 and Erzya and Moksha Extended Corpora (ERME) version 2, Korp.
In addition, various downloadable resources were published, e.g., Corpus of Contemporary American English – Kielipankki VRT version 2020; FinnTreeBank 1, 2 and 3; Word embeddings trained with word2vec from the Finnish Text Collection; The Coronavirus Corpus (Mark Davies, english-corpora.org) – Kielipankki version 2021-05; and The Finnish Dark Web Marketplace Corpus.
During the project, the resource publication pipeline of the Language Bank of Finland has been refined and documented. The structure of the pipeline was first presented at the CLARIN Annual Conference in 2022 and described in the conference proceedings (Dieckmann & al., 2023, see below).
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP x.y: Report on <topic of the deliverable>
Date of reporting: dd-mm-2024
Report authors: Firstname Lastname (Organization)
Contributors: Firstname Lastname (Organization)
Deliverable location: <link to, e.g., a GitHub repository, or other external location that includes further information or relevant content>
Keywords for the deliverable page: (any relevant keywords separated with semicolons; for search engines etc.)
The description text (max. 3000 characters) may include the following, if applicable:
The publication-ready deliverable should be emailed as a MS Word document (or similar) to wilhelmina.dyster (ATT) helsinki.fi, Cc:krister.linden (ATT) helsinki.fi.
Deadline for deliverables due 2024-12: Send the content for your deliverable page by 12.12.2024.
This page showcases the project deliverables (see template and instructions for reporting).
FIN-CLARIAH Funding period 2024-2025
FIN-CLARIAH Funding period 2022-2023 (Completed)
D1.1.1 | Named-entity annotation | 2024-09 |
D1.1.2 | Ingesting new unstructured resources | 2025-12 |
D1.2.1 | Data collection for minority languages | 2024-09 |
D1.2.2 | Transcription service for minority languages | 2025-09 |
D1.3.1 | Tools and guidelines for video processing | 2025-06 |
D2.1.1 | Integrate environment for personal data | 2024-09 |
D2.1.2 | Framework for processing copyrighted data for verification of research | 2025-09 |
D2.2.1 | Transformer training for specialised data | |
D2.2.2 | Transformer adaptation for specialised data | 2025-12 |
D2.3.1 | Remote access to text data repositories | |
D2.3.2 | Remote access to video data repositories | 2025-12 |
D2.4.1 | Term definition discovery procedures | 2024-09 |
D2.4.2 | Initializing terminology collections | 2025-12 |
D3.1.1 | Comprehensive data versioning | 2024-09 |
D3.1.2 | Workflow automation and version syncing | 2025-09 |
D3.2.1 | Ingestion of structured data from Finna (NLF) | 2025-03 |
D3.2.2 | Ingestion of heritage and societal data from Sampo | 2025-06 |
D3.2.3 | Ingestion of multimodal societal data from the Web | 2025-12 |
D3.3.1 | Automated metadata of archival data from NARC | 2025-03 |
D3.3.2 | Automated harmonisation and enrichment of metadata | |
D3.3.3 | Machine-learning -based enrichment of social media | 2025-06 |
D3.3.4 | Computer vision -based enrichment of multimodal data | 2025-09 |
D4.1.1 | Analysis of video stream interactions with AI solutions | 2025-06 |
D4.1.2 | Analysis Tools for Multimodal Born-digital Social Media | 2024-12 |
D4.1.3 | Analysis of interactions and regional language variation in social media | 2025-12 |
D4.1.4 | Analysis of multimodal properties of naturalistic speech | 2025-12 |
D4.1.5 | Analysis of multimodal cultural heritage | 2025-12 |
D4.1.6 | Enrich survey data with register data and unstructured text | 2025-06 |
D5.1.1 | Community engagement: multim. societal data researchers | 2024-09 |
D5.1.2 | Community engagement: multim. heritage researchers | 2025-06 |
D5.1.3 | Evidence-based infrastructure development | 2024-12 |
D5.1.4 | Educational resource development | 2025-12 |
Completed
D1.1.1 | Updating LBF resource selection | 2022-09 |
D1.1.2 | Ingesting new unstructured resources | 2023-12 |
D1.2.1 | Forced-Alignment Service | 2022-09 |
D1.2.2 | Transcription Service for Finnish Interviews | 2023-09 |
D1.3.1 | Corpora of non-standard language | 2022-09 |
D1.3.2 | System for detecting toxic language | 2023-06 |
D1.3.3 | Models for retrieving QA pairs from the web | 2023-09 |
D1.3.4 | QA pair corpora | 2023-12 |
D2.1.1 | Licensing agreements for personal data | 2022-09 |
D2.1.2 | Licensing agreements for special categories | 2023-06 |
D2.2.1 | Speech recognition for L2 | 2022-12 |
D2.2.2 | Speech recognition for L2 update | 2023-12 |
D2.3.1 | Licensing interpretation sessions | 2022-12 |
D2.3.2 | Aligning and retrieving | 2023-12 |
D2.4.1 | Term discovery procedures | 2022-09 |
D2.4.2 | Terminology application | 2023-06 |
D2.4.3.1 | Initializing terminology collections | 2022-09 |
D2.4.3.2 | Initializing terminology collections | 2023-06 |
D2.4.3.3 | Initializing terminology collections | 2023-12 |
D2.5.1 | Test performances storage | 2022-12 |
D2.5.2 | Analysis and annotation tools for learner performances | 2023-12 |
D3.1.1 | Initial NLF data | 2022-09 |
D3.1.2 | Ingestion framework | 2022-12 |
D3.1.3 | Versioning support | 2023-06 |
D3.1.4 | Incremental update process | 2023-12 |
D3.2.1 | Pipeline for transferring archival data | |
D3.2.2 | Annotation & analysis tools for NARC data | 2023-12 |
D3.3.1 | Qualitative survey data concept network | 2022-09 |
D3.3.2 | R package for data concept network | |
D3.4.1 | Livestream data collector | 2022-12 |
D3.5.1 | Text network analysis of political texts | |
D3.5.2 | Text network analysis of political texts | |
D4.1.1 | Harmonized FNB | 2022-09 |
D4.1.2 | Harmonization code | 2022-12 |
D4.1.3 | Visualisation workflow | 2023-06 |
D4.1.4 | R/Python module | 2023-12 |
D4.2.1 | LDF knowledge extraction tools | 2022-12 |
D4.2.2 | Parliament of Finland Ontology | 2023-12 |
D4.3.1 | Subsetting tool | 2022-09 |
D4.3.2 | Statistical overviews and bias detection | 2023-06 |
D4.3.3 | Representative Twitter dataset | 2023-12 |
D5.1.1 | User experience questionnaire | 2022-09 |
D5.1.2 | Log data collection and analysis | 2023-06 |
D5.1.3 | Protocol for collecting workshop data | 2023-12 |
D5.2.1 | Actor network | 2022-12 |
D5.2.2 | Educational material | 2023-12 |
Kielipankki Live is a series of online events in which researchers are interviewed and current topics related to Kielipankki, the Language Bank of Finland, are discussed. Presentations recorded at the events are published afterwards on YouTube (see the links under the previous events). To stay up to date on Kielipankki Live events and other Language Bank news, subscribe to the newsletter!
Register for the event using this form by 11 December 2020 at the latest. When registering, you can submit questions to the guest researchers and the Language Bank's experts. You can also ask questions and join the discussion during the event.
Everyone who registers in advance will be sent a link for joining on Zoom before the event begins. After advance registration has closed, you can still get the link by sending an email to fin-clarin [AT] helsinki.fi.
Please note that Kielipankki Live events are recorded and the key parts of the video recording are published online afterwards. If you do not want your image or voice to appear in the recording, please keep your camera and microphone off during the event. You can also take part in the discussion via the chat. The names and contact details of participants will not be published.
will be organized in Joensuu at the University of Eastern Finland. The theme of the conference is language, life, and society. The Language Bank of Finland will also be present on site, and especially on Friday morning, 17 May, you might spot people on campus wearing a pale blue T-shirt with a happy piglet… Come and talk to us, visit our stand, or come to listen to our presentations!
Introduction to the Language Bank of Finland at the workshop “Digital Parliamentary data and research”
Friday 3 May at 12.00
Aalto University (Otaniemi), CS-Building, Room T4 / A238 (Konemiehentie 2)
The aim of the workshop was to discuss the novel digital parliamentary datasets, in particular those of the Parliament of Finland: their use in research, the related research resources and tools, and their future development for researchers as well as for citizens and the media. FIN-CLARIN and Korp version 1.1 of the Plenary Sessions of the Parliament of Finland, available in the Language Bank of Finland, were also presented during the afternoon.
Mietta Lennes: FIN-CLARIN and Parliamentary Data in Kielipankki – the Language Bank of Finland (PowerPoint / PDF slides)
Further information including the programme of the workshop can be found at https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities/workshop-digital-parliamentary-data-and-research.
CLARIN ERIC has compiled an impressive publication from the Tour de CLARIN showcase tour, which began in 2016 and puts the CLARIN member countries and their resources, tools, and research projects in the spotlight one at a time. The newly published first volume of the Tour de CLARIN collection opens with Finland's FIN-CLARIN. The publication also features Sweden, Austria, the Netherlands, Poland, Flanders (Belgium), the Czech Republic, Greece, and Lithuania.
The Tour de CLARIN continues, and the latest stops can be followed on the CLARIN ERIC website.
Go Finland!
The quickest way to explore the Language Bank's services is to try the Korp interface, where many of our corpora are available and can be queried without logging in or applying for access rights to the language resources. Korp includes, for example, the Suomi 24 discussion forum corpus, which is of interest to several fields in the digital humanities and social sciences.
Other good ways to begin are the Newspaper and Periodical Corpus of the National Library of Finland and the Plenary Sessions of the Parliament of Finland, which are also available for download in addition to Korp.
Every year, the Language Bank of Finland is presented in Roadshow events that are organized at each of the member organizations of FIN-CLARIN. Come and see how you could use the services of the Language Bank in your research!
Photo: Risto Turunen
The Language Bank of Finland (Kielipankki) consists of a comprehensive set of resources and the software suited to studying them, in a powerful computing environment. Risto Turunen, a doctoral student at the University of Tampere, describes his research on the National Library of Finland's newspaper and periodical collection held in the Language Bank.
I am Risto Turunen. I am writing a doctoral dissertation in history at the School of Social Sciences and Humanities of the University of Tampere.
In 1907, Finland had the largest socialist party in Europe. I study the breakthrough of socialism, particularly from the perspective of language. What kind of discourse, conceptual system, or political language was Finnish socialism, really? The labour press in particular sowed the seed of socialism effectively among the people. Almost all Finnish-language newspapers have been digitised up to 1910. Because the papers are in machine-readable form, I can study the language of socialism at the macro level using quantitative methods.
I have studied these newspapers using, among other tools, the Language Bank's Korp interface. For example, I can find out when the word "sosialismi" (socialism) becomes common across the whole press, or which individual papers write the most about "sosialismi". I have also compared the linguistic contexts in which "sosialismi" occurs in socialist and non-socialist papers. The comparison reveals what kinds of meanings the supporters and opponents of the ideology try to attach to the word.
Background information on the acquisition of the National Library of Finland's newspaper and periodical collection held in the Language Bank
FIN-CLARIN, a consortium formed by Finnish universities, the IT Center for Science (CSC), and the Institute for the Languages of Finland, helps researchers in the humanities to use, refine, preserve, and share research data. The resources and tools are provided by the Language Bank of Finland.
All Language Bank users featured so far can be found in the Researcher of the Month archive.
Photo: Mika Federley
The Language Bank of Finland (Kielipankki) consists of a comprehensive set of resources and the software suited to studying them, in a powerful computing environment. Hanna Westerlund, a doctoral student at the University of Helsinki, describes her research on the legal and statutory language resources held in the Language Bank.
I am Hanna Westerlund, a doctoral student in translation studies in the Doctoral Programme in Language Studies.
I am interested in collocations, that is, co-occurrences, as a challenge for translators, and in what linguistic research can do to answer questions about recognising and producing collocations. My primary research material consists of European Union regulations translated into Finnish, from which I have compiled a text corpus covering the period of Finland's accession to the Community. The comparison material contains corresponding texts from Finnish legislation.
I consider it important to find out, at least in part, what has happened over time to the occurrences I have found in my research material: are the co-occurrences that entered Finnish statutory language along with the translations still to be found in the statute collection, have they displaced the native alternatives, or do they all live side by side in the texts? Compiling, processing, and managing a text corpus have proved to be technically and qualitatively demanding and time-consuming tasks, and compiling a corresponding dataset from the current statute collection would be completely impossible for me. Fortunately, I do not have to: the Language Bank offers both a technically clean and reliable dataset of laws and directives and the tools for processing it. For the second part of my research, the Language Bank's legal and statutory language resources are quite irreplaceable.
Researchers describe how they make use of the Language Bank's resources: http://bit.ly/2g6Ds1J.