Jyväskylä Corpus of Middle French

A digitized corpus for the study of the lexis and syntax of Middle French and for text editions. The corpus consists of 14 documents and 430,000 words. It comprises prose, novels, plays and lyrical poetry from the period 1300-1550.

More information on the corpus can be found here (in Finnish).

Latest versions/subcorpora:
Jyväskylä Corpus of Middle French Metadata and license Attribution instructions	Access the corpus in Puhti
Search for all versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040501

Oulu Corpus

Latest versions/subcorpora:

Oulu Corpus
Metadata and license
Attribution instructions

Apply for access

This version is available via the computing environment Puhti

Search for all versions in META-SHARE

Content

The Oulu Corpus is a research corpus of Standard Finnish from the 1960s. The original material was collected by a group led by Prof. Pauli Saukkonen at the University of Oulu. The original corpus project aimed to collect a representative sample of the Standard Finnish used in 1960s media in order to create a frequency dictionary of Finnish. The annotated text material was converted into SGML format by the Institute for the Languages of Finland in 1997.

The resource is available via the Puhti computing environment. Access rights can be granted for research use by individual application.

Last updated: 10.5.2023

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040502

The Wikipedia Corpus (Mark Davies, english-corpora.org) – Kielipankki version

This resource contains a copy of the original Wikipedia Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains the full text of Wikipedia from the year 2014, with 1.9 billion words in more than 4.4 million articles. The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Wikipedia Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source Metadata and license Attribution instructions	The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this language corpus are published (or may be published in the future) in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032905

Triangle of Aspects Analysis of Frozen

This data set includes an analysis of the original, English-language version and of the Dutch-language version (as released in the Netherlands) of the nine songs with lyrics from the Disney film Frozen. The analysis employs the triangle of aspects, an analytical model developed specifically for translation research into songs from musical films. The collection of these data is part of the licensor’s PhD project, tentatively titled “Musical, visual and verbal aspects of animated film song dubbing: A case study of Disney’s Frozen” (projected for publication in early 2020). The data set comprises 9 PDF files, one for each song, as well as a Word document that summarizes the findings and provides copyright notices.

Note: This resource was previously available for download in Kielipankki – The Language Bank of Finland. However, the license expired on 21.12.2023 and the resource can no longer be accessed. If you downloaded the data, you are no longer allowed to use it and you must delete it from your devices.

Latest versions/subcorpora:
Triangle of Aspects Analysis of Frozen Metadata and license Attribution instructions	not available anymore
Search for all versions in META-SHARE

Different versions or subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032104

Collection of OTA Texts in Public Use

This is a snapshot of the Oxford Text Archive, for testing purposes. For more up-to-date versions of the archive see http://ota.ox.ac.uk/
The snapshot is available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, /appl/data/kielipankki/ota), see Access rights.

Latest versions/subcorpora:
Collection of OTA Texts in Public Use Metadata and license Attribution instructions	Access the corpus in Puhti
Search for all versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032101

Collection of corpora from The University of Helsinki Language Corpus Server (UHLCS)

The University of Helsinki Language Corpus Server (UHLCS) was a multilingual data bank founded in the late 1980s. The UHLCS collection includes text corpora of more than 50 languages, including minority languages and various text types. There are also tools specifically developed for analyzing the UHLCS corpora. The use of most corpora is restricted to research and teaching.

Subcorpora:
Chuvash Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
English Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Corpus of Erzya and Moksha Mordvin Literature and Journals and Komi Zyrian Literature (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Erzya and Moksha Mordvin Word List Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Estonian Corpus 1 (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Estonian Corpus 2 (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Finnish Corpus (Bibles) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Finnish Corpus (Literature) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
The Helsinki Korp Version of the Finland-Swedish Text Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Korp
The Finland-Swedish Text Corpus (UHLCS), source Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Ingrian Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Khanty Corpus (North Khanty, Corpora and Translations) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Komi Zyrian Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Latin Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Lude (Ludian) Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Nenets Corpus (Tundra Nenets) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
North Saami Corpus (Literature) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
North Saami Corpus (Sámikultuvradoaibmagotti smiehttamush) (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Quantifiers and Quantification in Finnish and Languages Spoken in the Central Volga–Kama Region (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Somali Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
The Susanne Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Ume Saami Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Uralic, Turkic, Indo-Iranian and Mongol languages; languages of Siberia and Caucasia (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Uzbek-English Dictionary (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti
Lists of Words Corpus (UHLCS) Metadata and license Attribution instructions	Apply for access rights Access the corpus in Puhti

Corpus contents

The University of Helsinki Language Corpus Server (UHLCS) is a multilingual data bank founded in the late 1980s and maintained by the Department of General Linguistics at the University of Helsinki until September 2007. When the old server was taken out of use, the UHLCS corpora were moved to servers maintained by CSC – IT Center for Science, and the corpora were made available via the Language Bank of Finland.

At present, the UHLCS collection includes text corpora of more than 50 languages, including samples of minority languages and extensive corpora representing different text types. There are also tools specifically developed for analyzing the UHLCS corpora.

The use of most corpora is restricted to research and teaching. Resource-specific information and license conditions can be found in the metadata record of the corpus in question.

In 2000, the corpora of the Uralic, Turkic, Tungusic, Mongolic, Chukotko-Kamchatkan, Iranian and North-East Caucasian languages were edited for public use with the financial support of the Max Planck Institute for Evolutionary Anthropology, Leipzig. In summer 2003, the basis for the metadata descriptions of the corpora was prepared with the financial support of the ECHO project (ECHO = European Cultural Heritage Online).

Last updated: 28.2.2024

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023030901

The Intelligent Web Corpus (Mark Davies, english-corpora.org) – Kielipankki version

This resource contains a copy of the original Intelligent Web Corpus (iWeb), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 14 billion words in 22 million web pages. The data was collected in 2017 from around 100,000 of the most widely used English-language websites in the world.

The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Intelligent Web Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source Metadata and license Attribution instructions	The corpus will be available soon
Search for all versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112310

Psycholinguistic Descriptives

This material comprises a dataset and a query tool for acquiring commonly used psycholinguistic descriptives for Finnish words. The dataset is based on six large corpora from sources such as magazines, newspapers, movie and TV series subtitles, encyclopedia topics and Internet discussions.
The material includes word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. In addition, the query tool can be used to acquire descriptives such as orthographic neighbors for lists of words.
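
To make two of the descriptives above concrete, here is a minimal Python sketch of how letter n-gram frequencies and orthographic neighbors can be computed from a word list. It is not the resource's query tool: the function names and the toy word list are hypothetical, and the neighbor definition used (same length, exactly one substituted letter) is an assumption, since the page does not state which definition the tool applies.

```python
from collections import Counter


def letter_ngram_counts(words, n):
    """Count letter n-grams over a list of word tokens (hypothetical helper)."""
    counts = Counter()
    for word in words:
        # Slide a window of width n across the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def orthographic_neighbors(target, lexicon):
    """Return lexicon words of the same length as `target` that differ from it
    in exactly one letter position (assumed substitution-only definition)."""
    return sorted(
        w for w in set(lexicon)
        if len(w) == len(target)
        and sum(a != b for a, b in zip(w, target)) == 1
    )


# Toy token list, chosen only for illustration.
words = ["talo", "talo", "palo", "valo", "tali", "kala"]
bigrams = letter_ngram_counts(words, 2)
neighbors = orthographic_neighbors("talo", words)
```

Because the counts are taken over tokens, a word that occurs twice contributes its n-grams twice; counting over `set(words)` instead would give type-based frequencies.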

Latest versions/subcorpora:
Psycholinguistic Descriptives Metadata and license Attribution instructions	Download the resource
Search for these versions in META-SHARE

Different versions of this language corpus are published (or may be published in the future) in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081601

Opusparcus: Open Subtitles Paraphrase Corpus for Six Languages

Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

The data in Opusparcus have been extracted from OpenSubtitles2016, which is in turn based on data from http://www.opensubtitles.org.

For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been annotated manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.

Opusparcus is available for download at the Language Bank of Finland. The README file in the download folder contains detailed descriptions of the data sets.

Please cite the following paper in any work that utilizes any part of the Opusparcus corpus:
Mathias Creutz (2018). Open Subtitles Paraphrase Corpus for Six Languages. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018), 7-12 May, Miyazaki, Japan.

Latest versions/subcorpora:
Opusparcus: Open Subtitles Paraphrase Corpus for Six Languages (version 1.0) Metadata and license Attribution instructions	Download the resource
Search for these versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081203

Finnish OpenSubtitles 2017

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Further information

The corpus contains Finnish subtitles for movies and TV series from http://www.opensubtitles.org.

The corpus is a derivative of the OPUS OpenSubtitles2018 multilingual corpus. Information on the material processing up to sentence splitting can be found in the original publication Lison & Tiedemann (2016). The corpus has been tokenized and annotated with morpho-syntactic analysis produced with the Turku Dependency Parser.

P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Different versions of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081202

New Year’s Speeches of the Presidents of the Republic of Finland

This corpus contains the New Year’s speeches given by the Presidents of the Republic of Finland in 1935-2007.

More information on the corpus: http://kaino.kotus.fi/korpus/teko/meta/presidentti/presidentti_coll_rdf.xml

Latest versions/subcorpora:
New Year’s Speeches of the Presidents of the Republic of Finland Metadata and license Attribution instructions	Select the corpus in Korp
Search for these versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051202

FinStud86 Corpus

The corpus contains Finnish-language essays (compositions) written by Finnish-speaking students taking the matriculation examination in 1986.

Latest versions/subcorpora:
FinStud86 Corpus Metadata and license Attribution instructions	Select the corpus in Korp
Search for these versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042604

Corpus of Finnish Matriculation Examination Essays from 1994, 1999 and 2004

The corpus contains Finnish essays written by students taking the 1994, 1999 and 2004 matriculation examinations.

Latest versions/subcorpora:
Corpus of Finnish Matriculation Examination Essays from 1994, 1999 and 2004 Metadata and license Attribution instructions	Select the corpus in Korp
Search for these versions in META-SHARE

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042603

FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

The corpus contains legal texts in Russian and Finnish arranged as a comparable text corpus. More information can be found at https://mustikka.uta.fi

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092402

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
