A digitized corpus for the study of the lexis and syntax of Middle French and for text editions. The corpus consists of 14 documents and 430 000 words. It comprises prose, novels, plays and lyrical poetry from the period 1300-1550.
More information on the corpus can be found here (in Finnish).
Latest versions/subcorpora: | |
Jyväskylä Corpus of Middle French Metadata and license Attribution instructions |
Access the corpus in Puhti |
Search for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040501
Latest versions/subcorpora: | |
Oulu Corpus Metadata and license Attribution instructions |
Apply for access This version is available via the |
Search for all versions in META-SHARE |
The Oulu Corpus is a research corpus of Standard Finnish in the 1960s. The original material was collected by a group led by Prof. Pauli Saukkonen at the University of Oulu. The original corpus project aimed to collect a representative sample of Standard Finnish as used in the media of the 1960s, in order to create a frequency dictionary of Finnish. The annotated text material was converted into SGML format by the Institute for the Languages of Finland in 1997.
The resource is available via the computing environment. Access rights can be granted for research use by individual application.
Last updated: 10.5.2023
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040502
This resource contains a copy of the original Wikipedia Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains the full text of Wikipedia from the year 2014, with 1.9 billion words in more than 4.4 million articles. The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The Wikipedia Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032905
This data set includes an analysis of the original English-language version and of the Dutch-language version (as released in the Netherlands) of the nine songs with lyrics from the Disney film Frozen. The analysis employs the triangle of aspects, an analytical model developed specifically for translation research into songs from musical films. The data were collected as part of the licensor’s PhD project, tentatively titled “Musical, visual and verbal aspects of animated film song dubbing: A case study of Disney’s Frozen” (projected for publication in early 2020). The data set comprises 9 PDF files, one for each song, as well as a Word document that summarizes the findings and provides copyright notices.
Note: This resource was previously available for download in Kielipankki – The Language Bank of Finland. However, the license expired on 21.12.2023 and the resource can no longer be accessed. If you downloaded the data, you are not allowed to continue using it and you must delete it from your devices.
Latest versions/subcorpora: | |
Triangle of Aspects Analysis of Frozen Metadata and license Attribution instructions |
No longer available |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032104
This is a snapshot of the Oxford Text Archive, for testing purposes. For more up-to-date versions of the archive see http://ota.ox.ac.uk/
The snapshot is available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, /appl/data/kielipankki/ota), see Access rights.
Latest versions/subcorpora: | |
Collection of OTA Texts in Public Use Metadata and license Attribution instructions |
Access the corpus in Puhti |
Search for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032101
The University of Helsinki Language Corpus Server (UHLCS) was a multilingual data bank founded in the late 1980s. The UHLCS collection includes text corpora of more than 50 languages, including minority languages and various text types. There are also tools specifically developed for analyzing the UHLCS corpora. The use of most corpora is restricted to research and teaching.
Subcorpora: | |
Chuvash Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
English Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Corpus of Erzya and Moksha Mordvin Literature and Journals and Komi Zyrian Literature (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Erzya and Moksha Mordvin Word List Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Estonian Corpus 1 (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Estonian Corpus 2 (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Finnish Corpus (Bibles) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Finnish Corpus (Literature) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
The Helsinki Korp Version of the Finland-Swedish Text Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Korp |
The Finland-Swedish Text Corpus (UHLCS), source Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Ingrian Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Khanty Corpus (North Khanty, Corpora and Translations) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Komi Zyrian Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Latin Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Lude (Ludian) Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Nenets Corpus (Tundra Nenets) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
North Saami Corpus (Literature) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
North Saami Corpus (Sámikultuvradoaibmagotti smiehttamush) (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Quantifiers and Quantification in Finnish and Languages Spoken in the Central Volga–Kama Region (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Somali Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
The Susanne Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Ume Saami Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Uralic, Turkic, Indo-Iranian and Mongol languages; languages of Siberia and Caucasia (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Uzbek-English Dictionary (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
Lists of Words Corpus (UHLCS) Metadata and license Attribution instructions |
Apply for access rights Access the corpus in Puhti |
The University of Helsinki Language Corpus Server (UHLCS) is a multilingual data bank founded in the late 1980s and maintained by the Department of General Linguistics at the University of Helsinki until September 2007. When the old server was taken out of use, the UHLCS corpora were moved to servers maintained by CSC – IT Center for Science, and the corpora were made available via the Language Bank of Finland.
At present, the UHLCS collection includes text corpora of more than 50 languages, including samples of minority languages and extensive corpora representing different text types. There are also tools specifically developed for analyzing the UHLCS corpora.
The use of most corpora is restricted to research and teaching. Resource-specific information and license conditions can be found in the metadata record of the corpus in question.
In 2000, the corpora from the Uralic, Turkic, Tungusic, Mongolic, Chukotko-Kamchatkan, Iranian and North-East Caucasian languages were edited for public use with the financial support of the Max Planck Institute for Evolutionary Anthropology, Leipzig. In summer 2003, the basis for the metadata descriptions of the corpora was prepared with the financial support of the ECHO project (ECHO = European Cultural Heritage Online).
Last updated: 28.2.2024
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023030901
This resource contains a copy of the original Intelligent Web Corpus (iWeb), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 14 billion words in 22 million web pages. The data was collected in 2017 from around 100,000 of the most widely used English-language websites in the world.
The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The Intelligent Web Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112310
This material comprises a dataset and a query tool for obtaining commonly used psycholinguistic descriptives for Finnish words. The dataset is based on six large corpora from sources such as magazines, newspapers, movie and TV series subtitles, encyclopedia topics and Internet discussions.
The material includes word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. In addition, the query tool can be used to obtain descriptives such as orthographic neighbors for lists of words.
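As a rough illustration of two of the descriptives mentioned above, the following Python sketch computes letter n-gram frequencies and orthographic neighbors from a small word list. It is not the resource's query tool and does not reflect its interface or file formats. The function names and the toy counts are hypothetical, and the neighbor definition used here (same length, exactly one letter substituted) is an assumption based on the common usage of the term, which may differ from the definition the resource applies.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return all contiguous letter n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_frequencies(word_counts, n):
    """Sum n-gram counts over a {word: frequency} mapping,
    weighting each n-gram by the frequency of the word it occurs in."""
    freqs = Counter()
    for word, count in word_counts.items():
        for gram in letter_ngrams(word, n):
            freqs[gram] += count
    return freqs


def orthographic_neighbors(target, vocabulary):
    """Words of the same length as target that differ from it
    in exactly one letter position (assumed definition)."""
    return sorted(
        w for w in vocabulary
        if len(w) == len(target)
        and sum(a != b for a, b in zip(w, target)) == 1
    )


# Toy, made-up counts for illustration; not taken from the resource.
toy_counts = {"talo": 5, "tali": 2, "kala": 3, "pala": 1, "talot": 4}

bigrams = ngram_frequencies(toy_counts, 2)
neighbors = orthographic_neighbors("kala", toy_counts)
```

With these toy counts, `neighbors` contains only `"pala"`, since the other four-letter words differ from `"kala"` in two positions.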
Latest versions/subcorpora: | |
Psycholinguistic Descriptives Metadata and license Attribution instructions | Download the resource |
Search for these versions in META-SHARE |
Different versions of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081601
Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows and is in turn based on data from http://www.opensubtitles.org.
For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been annotated manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
Opusparcus is available for download at the Language Bank of Finland. The README file in the download folder contains detailed descriptions of the data sets.
Please cite the following paper in any work that utilizes any part of the Opusparcus corpus:
Mathias Creutz (2018). Open Subtitles Paraphrase Corpus for Six Languages. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018), 7-12 May, Miyazaki, Japan.
Latest versions/subcorpora: | |
Opusparcus: Open Subtitles Paraphrase Corpus for Six Languages (version 1.0) Metadata and license Attribution instructions |
Download the resource |
Search for these versions in META-SHARE |
Different versions of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081203
Shortname | Name and metadata | License | Location | Cite | Resource group and help | Apply | Publication year | Support level |
---|---|---|---|---|---|---|---|---|
The corpus contains Finnish subtitles for movies and TV series from http://www.opensubtitles.org.
The corpus is a derivative of the OPUS OpenSubtitles2018 multilingual corpus. Information on the material processing up to sentence splitting can be found in the original publication Lison & Tiedemann (2016). The corpus has been tokenized and annotated with morpho-syntactic analysis produced with the Turku Dependency Parser.
P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Different versions of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081202
This corpus contains the New Year’s speeches given by the Presidents of the Republic of Finland in 1935-2007.
More information on the corpus: http://kaino.kotus.fi/korpus/teko/meta/presidentti/presidentti_coll_rdf.xml
Latest versions/subcorpora: | |
New Year’s Speeches of the Presidents of the Republic of Finland Metadata and license Attribution instructions | Select the corpus in Korp |
Search for these versions in META-SHARE |
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051202
The corpus contains Finnish-language essays (compositions) written by Finnish-speaking students taking the matriculation examination in 1986.
Latest versions/subcorpora: | |
FinStud86 Corpus Metadata and license Attribution instructions | Select the corpus in Korp |
Search for these versions in META-SHARE |
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042604
The corpus contains Finnish essays written by students taking the matriculation examination in 1994, 1999 and 2004.
Latest versions/subcorpora: | |
Corpus of Finnish Matriculation Examination Essays from 1994, 1999 and 2004 Metadata and license Attribution instructions | Select the corpus in Korp |
Search for these versions in META-SHARE |
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042603
Shortname | Name and metadata | License | Location | Cite | Resource group and help | Apply | Publication year | Support level |
---|---|---|---|---|---|---|---|---|
The corpus contains juridical texts in Russian and Finnish arranged as a comparable text corpus. More information can be found at https://mustikka.uta.fi.
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the different versions can be found in the list above.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092402