D4.1.2: Analysis Tools for Multimodal Born-digital Social Media

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on analysis tools for multimodal born-digital social media: Nordic Tweet Stream (NTS)
Date of reporting: 18-12-2024

Report author: Mikko Laitinen (UEF)
Contributors: Paula Rautionaho (UEF), Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location: https://nordictweetstream.fi/

Description

The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region, covering more than ten years from 2013 to 2023. It is accessible through a graphical interface that allows users to search, subset, visualize, and download extremely large-scale user-generated data from one social media application.

The objective of this digital interface is to enable easy access to and distribution of born-digital data for basic research. We have recently witnessed the closing down of free access to various digital sources because of the APIcalypse (Bruns 2019) and feel that, despite restrictive measures by social media giants, it is extremely important to store cultural heritage from social media. We operate according to the FAIR Data Principles, which aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).

The NTS provides data spanning from January 2013 to May 2023, encompassing over 900 million tokens from more than 73 million messages, generated by nearly 900,000 individuals. The dataset includes content in 73 languages. The largest languages are Swedish (c. 31 %), English (c. 26 %) and Finnish (c. 13 %). Detailed information on the material can be found in the Statistics pages of the interface.

The NTS dataset is intended for use by researchers across various disciplines, including sociolinguistics, dialectology, social sciences, and cultural studies. It can serve as both primary data and supplementary material alongside structured corpus data. This interface is designed for users seeking quick access to the data. Advanced users, however, may prefer to utilize the download function to retrieve the data for further processing in other environments.

Publications

Laitinen, M., Lundberg, J., Levin, M., & Martins, R. M. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data. In DHN 2018 Digital Humanities in the Nordic Countries 3rd Conference: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference Helsinki, Finland, pp. 349–362. https://erepo.uef.fi/handle/123456789/6697

Events

The NTS was presented at the following events:

FIN-CLARIAH meeting, Tampere, Dec 2023: LINK
DRDHUM Pre-conference workshop, Joensuu, Dec 2024: FIN-CLARIAH tools to make sense of web data

References

Bruns, Axel. 2019. After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566, doi: 10.1080/1369118X.2019.1637447
Wilkinson, M. D. et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.6: Enrich survey data with register data and unstructured text

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Enrich survey data with register data and unstructured text
Date of reporting: 12-12-2024

Report authors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Deliverable location: https://cran.r-project.org/web/packages/finnsurveytext/index.html

Description

The finnsurveytext R package has been developed to aid researchers in analyzing responses to open-ended survey questions and other structured text data. This user-friendly tool facilitates reproducible analysis of text data by providing features such as summarizing response properties, identifying frequent words and phrases, visualizing responses, and generating concept network plots. The second version of the package, released in August 2024, integrates with the widely used R package survey, allowing survey design to be incorporated into the analysis. Although originally designed for analyzing text in Finnish, the package is versatile and can also be used for text analysis in other languages.
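As an illustration of what "identifying frequent words and phrases" means in practice, the following is a minimal Python sketch. It does not use or reproduce the finnsurveytext API: the example responses, the `tokenize` and `ngrams` helpers, and the simple regular-expression tokenization are all assumptions made for this example, and the sketch ignores lemmatization, stopword removal and survey weights.

```python
from collections import Counter
import re

# Hypothetical open-ended survey responses (invented for illustration).
responses = [
    "Public transport should be cheaper",
    "Cheaper public transport and more bike lanes",
    "More bike lanes please",
]


def tokenize(text):
    """Lowercase a response and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return consecutive n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


word_counts = Counter()
bigram_counts = Counter()
for response in responses:
    tokens = tokenize(response)
    word_counts.update(tokens)              # single-word frequencies
    bigram_counts.update(ngrams(tokens, 2))  # two-word phrase frequencies

# The most frequent words and two-word phrases across all responses.
print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Counting n-grams per response, rather than over the concatenated text, avoids creating spurious phrases that span two different respondents' answers.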

The finnsurveytext R package was released on CRAN with two updates. The package is available on CRAN, and additional material is available on the website. An article on the package has been written; it is available on Zenodo and has been submitted for review to the new DARIAH publication.

The results of the work package were presented at two events: an invited lecture at the Workshop on Survey Statistics 2024, held in Poznan, Poland, from 26 to 30 August, and the Statistics Sweden and Örebro University Summer School 2024 on 28 August.

FIN-CLARIAH Deliverables

This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).

FIN-CLARIAH Funding period 2026-2029

Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.

Module 1: Natural Language Processing (NLP)

The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)

W1.1 Text processing and annotation environments

To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)

D1.1.1	Support common CLARIN formats like TEI (CSC/Martin Matthiesen).	2026-12
~~D1.1.2~~	~~Convert VRT to TEI and showcase the result in a compatible web interface like the KorAP platform used in German CLARIN. (CSC/Martin Matthiesen)~~	~~2027-07~~
D1.1.3	Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen)	2028-04
D1.1.4	Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen)	2029-10

W1.2 Speech processing and annotation

To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)

D1.2.1	Updated backend of existing ASRs (CSC/Sam Hardwick)	2026-10
D1.2.2	A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats)	2027-08
D1.2.3	Support for additional future models, and a transparent processing pipeline that allows easy evaluation of its suitability for data with elevated security requirements (CSC/Sam Hardwick)	2028-06
~~D1.2.4~~	~~Expansion and upgrade of Oulu Clarin-D centre to C or B status; provision of access to additional language resources sourced from multimedia social media content. (OU/Steven Coats)~~	~~2029-11~~

W1.3 Video processing and annotation

To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)

D1.3.1	Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes)	2026-06
D1.3.2	Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick)	2027-08
D1.3.3	Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes)	2028-09
D1.3.4	Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes)	2029-10

Module 2: Language Research Infrastructure (LRI)

This module takes care of the specialised language processing needs in the fields of language-based research. (L:UHEL/ARTS Krister Lindén)

W2.1 Processing Research Data

To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)

D2.1.1	Document the current options and fitness for purpose to use other processing environments, like supercomputers provided by CSC. (CSC/Martin Matthiesen)	2026-05
D2.1.2	Propose a proof-of-concept to address issues found in D 2.1.1. (CSC/Martin Matthiesen)	2027-09
D2.1.3	Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen)	2028-06
D2.1.4	Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes)	2029-11

W2.2 Training environments

To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)

D2.2.1	Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelson)	2026-12
D2.2.2	Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelson)	2027-12
D2.2.3	Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelson)	2028-06
D2.2.4	Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelson)	2029-08

W2.3 Translation and Interpretation

To provide infrastructure for translation and interpretation research on fact checking and verification of LLM output. (L:UHEL/ARTS Tommi Jauhiainen; P:CSC; C:UTA, UEF)

D2.3.1	Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen)	2026-05
D2.3.2	Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen)	2027-06
D2.3.3	Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen)	2028-08
D2.3.4	A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen)	2029-10

W2.4 Terminology

To provide infrastructure for the terminology work in the Helsinki Term Bank for the Arts and Sciences (HTB) and related terminology development projects. (L:UHEL/ARTS Tiina Onikki; C:UVAASA)

D2.4.1	Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics.	2026-09
D2.4.2	Initiate and develop terminology groups on geography, social geography, and environmental sciences.	2027-12
D2.4.3	Initiate and develop terminology groups on social policy, economics, and political science.	2028-05
D2.4.4	Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences.	2029-11

Module 3: Structuring Data

This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)

W3.1 Data Management

To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)

D3.1.1	Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models	2026-05
D3.1.2	Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models	2027-09
D3.1.3	Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements	2028-04
D3.1.4	Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements	2029-10

W3.2 Data Ingestion

To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)

D3.2.1	Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and FIN-CLARIAH-infrastructure. (NLF/FINNA/Riitta Peltonen)	2026-11
D3.2.2	Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitation of dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen)	2027-06
D3.2.3	Ingestion of in-copyright publications/webarchive. Building a research environment for legal deposit material (NLF/Aija Vahtola)	2028-12
D3.2.4	Ingestion of in-copyright publications/webarchive. Piloting the research environment for legal deposit material with researchers (NLF/Aija Vahtola)	2029-11

W3.3 Enrichment

To enable the systematic and detailed analysis of noisy datasets in different formats and thereby provide unseen possibilities for SSH research. (All the deliverables set to 2029 also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UTU Veronika Laippala; P:UEF, JYU, OU, UHEL/ARTS, UHEL/SOC, Aalto; C:UHEL/NLF)

D3.3.1	Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti)	2029-11
D3.3.2	Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola)	2027-11
D3.3.3	Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala)	2029-11
D3.3.4	Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila)	2029-11
D3.3.5	Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa)	2029-11

Module 4: Analyzing Structured Data

The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)

W4.1 Analytical Support for computational SSH

To enable researchers to utilise large born-digital data effectively and to focus on analysis rather than on the technical details of handling data that often arrives in high volume and at high velocity. (All the deliverables also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UEF Mikko Laitinen; P:JYU, OU, UHEL/SOC; C:UHEL/NLF)

D4.1.1	Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki)	2029-11
D4.1.2	Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen)	2029-11
D4.1.3	Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste)	2029-11
D4.1.4	Analysis tools for multimodal livestream data (JYU/Raine Koskimaa)	2029-11

Module 5: Information Interaction (IIA)

Interaction refers to the need 1) to collect information on how researchers interact with the RI in order to develop the tools and services accordingly, and 2) to offer education and consultation on how researchers can enhance their work by using the infrastructure, thus increasing the RI’s active user base. (L:TAU Sanna Kumpulainen)

W5.1 Evidence-Based Infrastructure Development

To provide a close dialogue with the user community to ensure the best possible development of the RI. (L:TAU Sanna Kumpulainen; P:UHEL/ARTS; C:UHEL/NLF, UTU, CSC, UHEL/SOC, AALTO, JYU, UEF, OU)

D5.1.1	Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen)	2026-06
D5.1.2	Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen)	2027-11
~~D5.1.3~~	~~Community engagement: User interaction with multimodal data. (TAU/Sanna Kumpulainen)~~	~~2028-06~~
D5.1.4	Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen)	2029-11

D2.1.1: Integrate environment for personal data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.1: Report on Integrate environment for personal data
Date of reporting: 30-09-2024

Report authors: Mietta Lennes (UH)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/sd-services/

Keywords for the deliverable page: sensitive data; confidential data; secure desktop; SD services

Description

In case a research dataset contains special categories of personal data or other types of confidential information that cannot be removed without hampering the research purpose, it may be necessary to use a secure environment for processing the data (cf. Deliverable 2.1.2 of the previous funding period of FIN-CLARIAH 2022-2023).

CSC – IT Center for Science provides Sensitive Data services for sharing and analyzing data securely from a web browser. The sensitive data files can be encrypted and uploaded via SD Connect, where they are available to the secure desktop instances of the members of the same project. The virtual machines for the secure desktops are configured and accessed via SD Desktop.

It is also possible to install and use special tools in the SD Desktop environment. Researchers who need to process audio and video material securely can now also conveniently install tools such as ELAN (video and audio) or Praat (audio) for viewing, editing, annotating, querying and analyzing their data, or well-known command-line tools such as Whisper (automatic speech recognition) as part of their workflow in the secure environment. For faster access to audio and video files, an external volume can be selected when configuring the virtual machine.

We will continue testing, documenting and improving the functionalities of the SD Desktop with the users of the Language Bank. We are also looking into the possibility of the Language Bank using SD Desktop instances for providing individual users with restricted access to specific sensitive datasets. The SD services are still under active development and the remaining issues can be addressed in collaboration with the experts at CSC.

For researchers in the SSH fields, the step-by-step instructions for using the Sensitive Data services are now maintained on a support page in the online portal of the Language Bank of Finland.

D1.2.1: Data collection for minority languages

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 1.2: Data collection for minority languages
Date of reporting: 26-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Wilhelmina Dyster (UH), Sjur Moshagen, Katri Hiovain-Asikainen (UiT)
Deliverable location: n/a

Keywords for the deliverable page: Finland-Swedish, Sámi

Description

In this work package, data is collected for two groups of minority languages: Swedish spoken in Finland and the Sámi languages spoken in Norway, Sweden and Finland.

Data collected during the Donera Prat campaign[1] is currently being manually transliterated. This work is expected to be ready by November 2024. The planned release date of the data for research is January 2025.

The data collection for Sámi languages focuses on the University of Tromsø and on the national broadcasting companies of the Nordic countries where the languages are spoken (NRK[2], SVT[3], YLE[4]). The national broadcasters already have some of their Sámi data subtitled in a Sámi language and in their respective national languages, making it a valuable resource for research.

We have reached a general understanding that the Language Bank of Finland can serve as the main sharing organisation for Sámi data, and we have already carried out test transfers of data from SVT and Tromsø. YLE’s Sámi data is available via KAVI[5]. Before the data can be shared via the Language Bank of Finland, we need to overcome technical and legal hurdles. On the technical side, we have already reached broad agreement: for example, we will share the data from the various sources with little or no change, and KAVI and Aalto University already have experience in collaborating using the LUMI supercomputer. The legal side seems to be the bigger challenge: NRK, SVT and YLE are currently investigating the legal implications of sharing their data via the Language Bank of Finland.

[1] Donera Prat https://svenska.yle.fi/a/7-10009203

[2] Norwegian Television: https://www.nrk.no/about/

[3] Swedish Television: https://omoss.svt.se/about-svt.html

[4] Finnish Television: https://yle.fi/aihe/about-yle

[5] The Finnish National Audio Visual Institute, https://kavi.fi/en/

D3.1.1: Comprehensive data versioning

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords for the deliverable page: versioning, updates, differences

Description

The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files in order to avoid unnecessary growth of the versions stored in Allas.

We interviewed two potential users of the data, Erik Axelson and Ville Vaara (both UH). Both interviews are summarized below.

Using the data set as a potential source for newer versions of the KLK dataset in Kielipankki. (Erik Axelson)

In 2024, FIN-CLARIN published a new version of “The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1], or klk-fi-v2-1874-vrt for short. This version was created from data obtained directly from the National Library, since our harvesting mechanism was not yet ready to create the new dataset at the start of the project. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of the included publications was compiled[4], along with end-user notes that document inconsistencies found after publication[5]. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format than the data provided in this work package[2]. However, the differences are small, and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the required changes with a proof of concept.

Using the dataset as a basis for an Elasticsearch instance containing NLF data (Ville Vaara)

Another use case for the data is the Elasticsearch-based tool developed in WP4.3 of the previous FIN-CLARIAH development round[6]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine. When considering newer versions, it became clear that an easy way of finding the differences between versions would be a reasonable addition to the present implementation. The dataset is presently 10 TB in size. Comparing two datasets of that size (the present version and an earlier one) to find the differences should be done once, during the update, and the result provided to users as a service, enabling easier updates of indexes.

Next steps

Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier by providing the differences between versions in a machine-readable form. In deliverable 3.1.2 we will demonstrate the changes with working code.
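One possible way to provide differences between versions in machine-readable form is sketched below in Python. This is not the harvester's actual mechanism: the manifest layout (relative path mapped to a SHA-256 checksum), the function names and the example file names are assumptions made for illustration. The idea is that each version is summarized once as a small manifest, so that later comparisons operate on manifests rather than on the full 10 TB of data.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Return the paths added, removed and changed between two manifests."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in shared if old[p] != new[p]),
    }


# Tiny in-memory example with hypothetical file names and checksums.
old_version = {"1871/issue1.xml": "aaa", "1871/issue2.xml": "bbb"}
new_version = {
    "1871/issue1.xml": "aaa",
    "1871/issue2.xml": "ccc",
    "1872/issue1.xml": "ddd",
}
delta = diff_manifests(old_version, new_version)
print(json.dumps(delta, indent=2))
```

A consumer such as a search index could then re-index only the paths listed under "added" and "changed" and delete those under "removed", instead of rebuilding the index from the complete dataset.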

References

[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121

[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801

[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001

[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)

Donera prat (Lahjoita puhetta)

The Donera prat (Donate Speech) campaigns in Finnish and Finland-Swedish have ended as of 6.3.2024. A big thank you to all donors!

Since 16 June 2020, Yle, the former Vake Oy (Valtion kehitysyhtiö; currently Ilmastorahasto Oy) and the University of Helsinki have run the Lahjoita puhetta campaign to collect Finnish speech. In a smaller Donera prat campaign that started in 2021, Finland-Swedish speech has also been collected. During the first year of the Finnish campaign, more than 3000 hours of speech were donated. Recently, however, very few donations have come in.

The donation campaigns for Finnish and Finland-Swedish speech have now ended. The datasets will be organized and stored by the Language Bank of Finland (Kielipankki). Through the Language Bank of Finland, researchers and companies can gain access to the Donate Speech datasets on specific terms. We hope that the data will help both researchers and companies to create better models of Finnish and Finland-Swedish speech and to develop future services that are easy to use in Finnish and Finland-Swedish.

Updated: 6.3.2024

D1.1.2: Ingesting new unstructured resources

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.1: Report on ingesting new unstructured resources
Date of reporting: 30-11-2023

Report authors: Mietta Lennes, Jussi Piitulainen (University of Helsinki)
Contributors: Ute Dieckmann, Erik Axelson, Jyrki Niemi, Jack Rueter, Tommi Jauhiainen, Krister Lindén (University of Helsinki)
Deliverable location: Corpora and tools available via the Language Bank of Finland

Keywords for the deliverable page: corpus, data set, automatic language identification

Description

The Newspaper and Periodical Corpus of the National Library of Finland was extended with a significant amount of new material from the National Library. The new version was organized according to the automatically identified language of each sentence. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (klk-fi-v2), consisting of more than 22 billion word tokens, was published in Korp in summer 2023. It consists of the text elements that contain at least one ”fin” sentence (from the new material, from the previous version of klk-fi, and from the previous klk-sv). Moreover, the summary attributes indicate the frequency distribution of languages within each text and each paragraph. An extended version of the Swedish sub-corpus (klk-sv-v2) has been compiled in a similar way (any ”swe” in a text), but the Swedish data is currently still waiting for the rest of the annotations to be completed. For details of the reorganization process of the National Library data according to language, see Jauhiainen et al. 2022.
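The selection rule and the summary attributes described above can be sketched as follows in Python. This is an illustrative reconstruction, not the tooling actually used for the corpus: the input structure (each text as a list of sentence and language-code pairs), the function names and the toy sentences are assumptions, and the language codes are taken as already assigned by a language identifier.

```python
from collections import Counter

# Hypothetical input: each text is a list of (sentence, language_code) pairs.
texts = {
    "text-1": [("Hyvää päivää.", "fin"), ("God dag.", "swe")],
    "text-2": [("God morgon.", "swe"), ("Tack så mycket.", "swe")],
    "text-3": [("Kiitos paljon.", "fin"), ("Näkemiin.", "fin"), ("Hej då.", "swe")],
}


def language_distribution(sentences):
    """Count how many sentences of a text were identified as each language."""
    return dict(Counter(lang for _, lang in sentences))


def select_by_language(texts, lang):
    """Keep the texts that contain at least one sentence in the given language."""
    return {
        text_id: sentences
        for text_id, sentences in texts.items()
        if any(code == lang for _, code in sentences)
    }


finnish_subcorpus = select_by_language(texts, "fin")
swedish_subcorpus = select_by_language(texts, "swe")

# Summary attribute per text: the frequency distribution of languages.
summaries = {tid: language_distribution(s) for tid, s in texts.items()}

print(sorted(finnish_subcorpus))
print(sorted(swedish_subcorpus))
print(summaries["text-3"])
```

Under this rule a multilingual text can belong to both sub-corpora, which is why the per-text language distribution is useful for filtering at search time.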

The HeLI-OTS language identification tool was adapted for the format used in the Language Bank of Finland, together with a post-processor written to correct the identification of each sentence within its context. Another new tool was written to partition the corpus, first by the main identified languages, then by the year of publication.

As a demonstration of ingesting resources including parallel spoken material in multiple languages, the corpus Christmas Gospel text-to-speech in four Uralic languages was prepared and made available for searching and playback via Korp (for details on this effort, see D2.3.2).

Other corpora published in Korp during the years 2022-23 include, e.g., the Finnish News Agency Archive 1992-2018, Kielipankki Korp Version; Corpus of Contemporary American English (COCA) – Kielipankki Korp version 2020 and Erzya and Moksha Extended Corpora (ERME) version 2, Korp.

In addition, various downloadable resources were published, e.g., Corpus of Contemporary American English – Kielipankki VRT version 2020; FinnTreeBank 1, 2 and 3; Word embeddings trained with word2vec from the Finnish Text Collection; The Coronavirus Corpus (Mark Davies, english-corpora.org) – Kielipankki version 2021-05; and The Finnish Dark Web Marketplace Corpus.

During the project, the resource publication pipeline of the Language Bank of Finland has been refined and documented. The structure of the pipeline was first presented at the CLARIN Annual Conference in 2022 and described in the conference proceedings (Dieckmann et al. 2023, see below).

Publications

Jauhiainen, T., Piitulainen, J., Axelson, E., Lindén, K. (2022) Language diversity in the newspaper and periodical corpus of the National Library of Finland. Poster presented at Digital Research Data and Human Sciences (DRDHum), 1.-3.12.2022, Jyväskylä, Finland.
Dieckmann, U., Lennes, M., Piitulainen, J., Niemi, J., Axelson, E., Jauhiainen, T., Lindén, K. (2023) The Pipeline for Publishing Resources in the Language Bank of Finland. Erjavec, T., Eskevich, M. (editors), Selected Papers from the CLARIN Annual Conference 2022, pp. 33-43. Linköping University Electronic Press.

DX.Y.Z: Title of Deliverable

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP x.y: Report on <topic of the deliverable>
Date of reporting: dd-mm-2024

Report authors: Firstname Lastname (Organization)
Contributors: Firstname Lastname (Organization)
Deliverable location: <link to, e.g., a GitHub repository, or other external location that includes further information or relevant content>

Keywords for the deliverable page: (any relevant keywords separated with semicolons; for search engines etc.)

Description

The description text (max. 3000 characters) may include the following, if applicable:

Links to external resources
Publications, if any (including DOI)
Events, if any (including links)
Insert the following text to the end of your report: FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

The publication-ready deliverable should be emailed as an MS Word document (or similar) to wilhelmina.dyster (ATT) helsinki.fi, Cc:krister.linden (ATT) helsinki.fi.

Deadline for deliverables due 2024-12: Send the content for your deliverable page by 12.12.2024.

FIN-CLARIAH Deliverables

This page showcases the project deliverables (see template and instructions for reporting).

FIN-CLARIAH Funding period 2024-2025
FIN-CLARIAH Funding period 2022-2023 (Completed)

FIN-CLARIAH Funding period 2024-2025

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1	Named-entity annotation	2024-09
D1.1.2	Ingesting new unstructured resources	2025-12

W1.2 Speech processing and annotation

D1.2.1	Data collection for minority languages	2024-09
D1.2.2	Transcription service for minority languages	2025-09

W1.3 Video processing and annotation

D1.3.1	Tools and guidelines for video processing	2025-06

Module 2: Language Research Infrastructure (LRI)

W2.1 Personal and Copyrighted Research Data

D2.1.1	Integrate environment for personal data	2024-09
D2.1.2	Framework for processing copyrighted data for verification of research	2025-09

W2.2 Training environments

D2.2.1	Transformer training for specialised data	~~2024-12~~ 2025-03
D2.2.2	Transformer adaptation for specialised data	2025-12

W2.3 Translation and Interpretation

D2.3.1	Remote access to text data repositories	~~2024-12~~ 2025-03
D2.3.2	Remote access to video data repositories	2025-12

W2.4 Terminology

D2.4.1	Term definition discovery procedures	2024-09
D2.4.2	Initializing terminology collections	2025-12

Module 3: Structuring Data

W3.1 Data Management

D3.1.1	Comprehensive data versioning	2024-09
D3.1.2	Workflow automation and version syncing	2025-09

W3.2 Data Ingestion

D3.2.1	Ingestion of structured data from Finna (NLF)	2025-03
D3.2.2	Ingestion of heritage and societal data from Sampo	2025-06
D3.2.3	Ingestion of multimodal societal data from the Web	2025-12

W3.3 Enrichment

D3.3.1	Automated metadata of archival data from NARC	2025-03
D3.3.2	Automated harmonisation and enrichment of metadata	~~2024-12~~ 2025-03
D3.3.3	Machine-learning -based enrichment of social media	2025-06
D3.3.4	Computer vision -based enrichment of multimodal data	2025-09

Module 4: Analyzing Structured Data

W4.1 Analytical Support for computational SSH

D4.1.1	Analysis of video stream interactions with AI solutions	2025-06
D4.1.2	Analysis Tools for Multimodal Born-digital Social Media	2024-12
D4.1.3	Analysis of interactions and regional language variation in social media	2025-12
D4.1.4	Analysis of multimodal properties of naturalistic speech	2025-12
D4.1.5	Analysis of multimodal cultural heritage	2025-12
D4.1.6	Enrich survey data with register data and unstructured text	2025-06

Module 5: Information Interaction (IIA)

W5.1 Evidence-Based Infrastructure Development

D5.1.1	Community engagement: multim. societal data researchers	2024-09
D5.1.2	Community engagement: multim. heritage researchers	2025-06
D5.1.3	Evidence-based infrastructure development	2024-12
D5.1.4	Educational resource development	2025-12

FIN-CLARIAH Funding period 2022-2023

Completed

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1	Updating LBF resource selection	2022-09
D1.1.2	Ingesting new unstructured resources	2023-12

W1.2 Speech processing and annotation

D1.2.1	Forced-Alignment Service	2022-09
D1.2.2	Transcription Service for Finnish Interviews	2023-09

W1.3 Noise-tolerant NLP

D1.3.1	Corpora of non-standard language	2022-09
D1.3.2	System for detecting toxic language	2023-06
D1.3.3	Models for retrieving QA pairs from the web	2023-09
D1.3.4	QA pair corpora	2023-12

Module 2: Language Research Infrastructure

W2.1 Social Data Science

D2.1.1	Licensing agreements for personal data	2022-09
D2.1.2	Licensing agreements for special categories	2023-06

W2.2 Learners’ Assessment Environments

D2.2.1	Speech recognition for L2	2022-12
D2.2.2	Speech recognition for L2 update	2023-12

W2.3 Translation and Interpretation

D2.3.1	Licensing interpretation sessions	2022-12
D2.3.2	Aligning and retrieving	2023-12

W2.4 Terminology

D2.4.1	Term discovery procedures	2022-09
D2.4.2	Terminology application	2023-06
D2.4.3.1	Initializing terminology collections	2022-09
D2.4.3.2	Initializing terminology collections	2023-06
D2.4.3.3	Initializing terminology collections	2023-12

W2.5 Solutions for better use of language learner performances in research

D2.5.1	Test performances storage	2022-12
D2.5.2	Analysis and annotation tools for learner performances	2023-12

Module 3: Structuring Data

W3.1 Increasingly automated ingestion of material

D3.1.1	Initial NLF data	2022-09
D3.1.2	Ingestion framework	2022-12
D3.1.3	Versioning support	2023-06
D3.1.4	Incremental update process	2023-12

W3.2 AI solutions to better use of National Archives mass digitisation services

D3.2.1	Pipeline for transferring archival data	~~2022-12~~ 2023-06
D3.2.2	Annotation & analysis tools for NARC data	2023-12

W3.3 AI solutions to better use of textual qualitative survey data

D3.3.1	Qualitative survey data concept network	2022-09
D3.3.2	R package for data concept network	~~2023-09~~ 2023-12

W3.4 Developing analysis methods for real-time chats in gameplay streams

D3.4.1	Livestream data collector	2022-12

W3.5 Developing analysis methods for text network analysis of political texts

D3.5.1	Text network analysis of political texts	~~2022-12~~ 2023-06
D3.5.2	Text network analysis of political texts	~~2023-09~~ 2023-12

Module 4: Analyzing Structured Data

W4.1 Metadata harmonization and analysis

D4.1.1	Harmonized FNB	2022-09
D4.1.2	Harmonization code	2022-12
D4.1.3	Visualisation workflow	2023-06
D4.1.4	R/Python module	2023-12

W4.2 Linked Open Data Services

D4.2.1	LDF knowledge extraction tools	2022-12
D4.2.2	Parliament of Finland Ontology	2023-12

W4.3 Subsetting data

D4.3.1	Subsetting tool	2022-09
D4.3.2	Statistical overviews and bias detection	2023-06
D4.3.3	Representative Twitter dataset	2023-12

Module 5: Information Interaction

W5.1 Evidence-based RI development

D5.1.1	User experience questionnaire	2022-09
D5.1.2	Log data collection and analysis	2023-06
D5.1.3	Protocol for collecting workshop data	2023-12

W5.2 Education and dissemination

D5.2.1	Actor network	2022-12
D5.2.2	Educational material	2023-12

Kielipankki Live

Kielipankki Live is a series of online events in which researchers are interviewed and topical issues related to the Language Bank of Finland (Kielipankki) are discussed. The presentations recorded at the events are published afterwards on YouTube (see the links under the past events). To stay up to date on Kielipankki Live events and other news from the Language Bank, subscribe to the newsletter!

Next Kielipankki Live: 14.12.2020, 13:00–15:00

Main topic: Research data containing speech and the related data protection practices
Expect expert guests and discussion! The presentations will be given in English, but questions may also be asked in Finnish. The event starts at 13:00 and has a flexible end time, finishing at 15:00 at the latest.

Programme

Mietta Lennes: Current news from the Language Bank
Krister Lindén: Briefing on legal issues concerning language resources
Interview with Rosa González Hautamäki and Tomi Kinnunen: Experiences of collecting and sharing the AVOID corpus and other speech data for speech technology research
Satu Saalasti: The DELAD project aims to share corpora of disordered speech with researchers
Aleksi Rossi: A brief update on the Lahjoita puhetta (Donate Speech) campaign
Questions & Answers: Ask the Language Bank staff and experts
Open discussion

Registration

Register for the event using this form by 11.12.2020 at the latest. When registering, you can submit questions to the guest researchers and the Language Bank experts. There will also be an opportunity to ask questions and join the discussion during the event.

Everyone who has registered in advance will be sent a link for joining on the Zoom platform before the event begins. After advance registration has closed, you can still obtain the link by emailing fin-clarin [AT] helsinki.fi.

Kielipankki Live events are recorded

Please note that Kielipankki Live events are recorded and the key parts of the video recording are published online afterwards. If you do not want your image or voice to be included in the recording, please keep your camera and microphone switched off during the event. You can also take part in the discussion via chat. The names and contact details of participants will not be published.

All Kielipankki Live events

14.12.2020, 13:00–15:00 (Register for the event)
24.8.2020

The XLVI Annual Conference of Linguistics, 16–18 May 2019

The conference will be organized in Joensuu by the University of Eastern Finland. The theme of the conference is language, life and society. The Language Bank of Finland will be present during the conference, and especially on Friday morning you might notice some people wearing a pale blue t-shirt with a happy piglet… Come and talk to us, visit our stand or see our presentations!

Preliminary schedule of the presentations related to the Language Bank of Finland:

Thursday 16.5. 16:30 room AG106 / Selkokielen työpaja (Klaara-verkosto):
Kielipankin selkosuomen aineistot (The Easy-to-read Finnish corpora in the Language Bank of Finland; Hanna Westerlund)
Friday 17.5. 10:00-10:30 room AG101:
Kielipankin kiertue 2019: Työkalut, aineistot ja muut palvelut (Kielipankki Roadshow 2019: Tools, corpora and other services; Mietta Lennes)

Updated programme and further information about the Annual Conference of Linguistics

Come and meet Kielipankki, the Language Bank of Finland, at its stand during the conference!

Introduction to the Language Bank of Finland at the workshop “Digital Parliamentary data and research”

Friday 3 May at 12.00
Aalto University (Otaniemi), CS-Building, Room T4 / A238 (Konemiehentie 2)

The aim of the workshop was to discuss the novel digital parliamentary datasets (in particular those of the Parliament of Finland), their use in research, the related research resources and tools, and their future development for researchers as well as for citizens and the media. FIN-CLARIN and version 1.1 of the Korp corpus of the Plenary Sessions of the Parliament of Finland, available in the Language Bank of Finland, were also presented during the afternoon.

Mietta Lennes: FIN-CLARIN and Parliamentary Data in Kielipankki – the Language Bank of Finland (PowerPoint / PDF slides)

Further information including the programme of the workshop can be found at https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities/workshop-digital-parliamentary-data-and-research.

FIN-CLARIN and the Language Bank of Finland featured internationally in the Tour de CLARIN book

Darja Fišer and Jakob Lenardič, eds. (2018). Tour de CLARIN – Volume One (PDF version)

CLARIN ERIC has compiled an impressive publication from the Tour de CLARIN presentation tour, which began in 2016 and puts the CLARIN member countries and their resources, tools and research projects in the spotlight in turn. The newly published first volume of the Tour de CLARIN collection presents Finland's FIN-CLARIN first of all. The publication also features Sweden, Austria, the Netherlands, Poland, Flanders (Belgium), the Czech Republic, Greece and Lithuania.

The Tour de CLARIN tour continues and can be followed as it unfolds on the CLARIN ERIC website.

Well done, Finland!

Get to know the Language Bank of Finland

The quickest way to explore the Language Bank’s services is to try the Korp interface, where many of our corpora are deposited and can be queried without logging in or applying for access rights to the language resources. Korp features, for example, the Suomi 24 discussion forum corpus, which is of interest to several fields in the digital humanities and social sciences.

Other good ways to begin are the Newspaper and Periodical Corpus of the National Library of Finland and the Plenary Sessions of the Parliament of Finland, which are also available for download in addition to Korp.

Introductory videos

Presentation in Tiedekulma (Think Corner) on 8.11.2016. The video includes English subtitles.

Poster of the Language Bank of Finland (Kielipankki)

A poster of the services offered by Kielipankki – the Language Bank of Finland and FIN-CLARIN

Roadshow events

Every year, the Language Bank of Finland is presented in Roadshow events that are organized at each of the member organizations of FIN-CLARIN. Come and see how you could use the services of the Language Bank in your research!

Roadshow schedule:

2020:

12.2.2020 University of Vaasa

2019:

26.2.2019 University of Turku
16.-18.5.2019 Kielitieteen päivät, Joensuu
14.-16.8.2019 Research Data and Humanities – RDHum 2019 conference, Oulu
11.10.2019 University of Jyväskylä

Presentations and examples from the roadshow in 2016–2017

Presentation of FIN-CLARIN and the Language Bank of Finland (from the 20th Jubilee Roadshow)
How to search for the words ”mieleni pahoitin” in the Suomi 24 Sentences corpus in Korp and show the trend diagram (no soundtrack)

Researcher of the Month: Risto Turunen

Photo: Risto Turunen

The Language Bank of Finland (Kielipankki) consists of a comprehensive set of resources and of software suited to studying them, in a high-performance computing environment. Risto Turunen, a doctoral student at the University of Tampere, tells us about his research on the Newspaper and Periodical Collection of the National Library of Finland, which is available in the Language Bank.

Who are you?

I am Risto Turunen. I am writing a doctoral dissertation in history in the School of Social Sciences and Humanities at the University of Tampere.

What is your research topic?

In 1907, Finland had the largest socialist party in Europe. I study the breakthrough of socialism particularly from the perspective of language. What kind of discourse, conceptual system or political language was Finnish socialism, really? The labour press in particular sowed the seed of socialism effectively among the people. Almost all Finnish-language newspapers up to 1910 have been digitised. Because the newspapers are in machine-readable form, I can study the language of socialism at the macro level using quantitative methods.

How is the Language Bank related to your research?

I have studied these newspapers with, among other things, the Language Bank's Korp interface. For example, I can find out when ”sosialismi” (socialism) becomes common as a word across the whole press, or which individual newspapers write most about it. In addition, I have compared the linguistic contexts in which ”sosialismi” occurs in socialist and non-socialist newspapers. The comparison reveals what kinds of meanings the supporters and opponents of the ideology try to attach to the word.

Background information on the acquisition of the National Library of Finland's newspaper and periodical collection in the Language Bank

FIN-CLARIN, a consortium formed by Finnish universities, CSC – IT Center for Science and the Institute for the Languages of Finland, helps researchers in the humanities to use, refine, preserve and share research data. The resources and tools are provided by the Language Bank of Finland.

All Language Bank users featured so far can be found in the Researcher of the Month archive.

Researcher of the Month: Hanna Westerlund

Photo: Mika Federley

The Language Bank of Finland (Kielipankki) consists of a comprehensive set of resources and of software suited to studying them, in a high-performance computing environment. Hanna Westerlund, a doctoral student at the University of Helsinki, tells us about her research on the corpora of legal and statutory language available in the Language Bank.

Who are you?

I am Hanna Westerlund, a doctoral student of translation studies in the Doctoral Programme in Language Studies.

What is your research topic?

I am interested in collocations, that is, co-occurrences, as a challenge for translators, and in the possibilities linguistics offers for investigating questions related to recognising and producing collocations. My actual research material consists of European Union regulations translated into Finnish, from which I have compiled a text corpus covering the period of Finland's accession to the Community. The comparison material contains corresponding texts of Finnish legislation.

How is the Language Bank related to your research?

I consider it important to find out, at least in part, what has happened over time to the occurrences I have found in my research material: are the collocations that entered Finnish statutory language along with the translations still to be found in the statute collection, have they displaced the native alternatives, or do they all live side by side in the texts? Compiling, processing and managing a text corpus have proved to be technically and qualitatively challenging and time-consuming tasks, and compiling corresponding material from the current statute collection would be a completely impossible task for me. Fortunately, there is no need: for my research the Language Bank offers both a technically clean and reliable corpus consisting of acts and directives and tools for processing it. For carrying out the second part of my research, the Language Bank's legal and statutory language corpora are quite irreplaceable.

All Language Bank users featured so far can be found in the Researcher of the Month archive.

What can I borrow from the Language Bank?

Researchers describe how they make use of the Language Bank's resources: http://bit.ly/2g6Ds1J.

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information