This corpus contains newspapers and magazines from Finland starting from 1770, compiled by the National Library of Finland.
NB: The Finnish acronym for The Newspaper and Periodical OCR Corpus of the National Library of Finland used to be ”Digilib”. The acronym ”klk” and the short names klk-fi-1874-dl and klk-fi-1920-dl are now recommended instead.
Latest versions/subcorpora:
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT | Metadata and license | Attribution instructions | Download the resource
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, VRT | Metadata and license | Attribution instructions | Download the resource
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp | Metadata and license | Attribution instructions | Example queries in Korp | Select the corpus in Korp
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version | Metadata and license | Attribution instructions | Select the corpus in Korp
The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version | Metadata and license | Attribution instructions | Select the corpus in Korp
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) | Metadata and license | Attribution instructions | Download the resource
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) | Metadata and license | Attribution instructions | Download the resource
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1771–1879, VRT | Metadata and license | Attribution instructions | Download the resource
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT | Metadata and license | Attribution instructions | Download the resource
Search for these versions in META-SHARE |
Different versions and subcorpora of this corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content, user rights and license of each version can be found in its own metadata record in META-SHARE.
Based on the KLK data, word-level collections of uni-, bi- and trigrams have been created and are available for download. They are published as separate data sets:
The N-grams of the Newspaper and Periodical Corpus of the National Library of Finland
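For readers unfamiliar with the term, the following is a minimal sketch of what word-level uni-, bi- and trigram counts are. It is not the procedure used to build the published n-gram data sets: the whitespace tokenization, the sample sentence and the function names `word_ngrams` and `ngram_counts` are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=3):
    """Count n-grams for n = 1..max_n (uni-, bi- and trigrams by default)."""
    return {n: Counter(word_ngrams(tokens, n)) for n in range(1, max_n + 1)}


# Hypothetical sample text, split on whitespace (an assumption; the real
# corpora are distributed already tokenized).
tokens = "suomen kansa ja suomen kieli".split()
counts = ngram_counts(tokens)

for n, counter in counts.items():
    print(n, counter.most_common(2))
```

With OCR output, misrecognized words end up as distinct low-frequency n-grams, so counts drawn from parts of the material with poor OCR quality may need to be interpreted with care.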
The corpora consist mainly of digitized versions of texts originally printed on paper. The physical papers were scanned, and optical character recognition (OCR) was performed on the resulting images. The digitized material spans a long period and contains different kinds of texts, writing styles and fonts. Some parts of the material are harder to scan than others, and the physical condition of the original texts also varies. The OCR techniques used have varied as well, and some of the texts may have gone through manual post-correction. As a result, the quality of the text ranges from very poor in some parts of the corpora to good in others. We have collected a list of publications related to OCR quality and collection processing:
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092404
Last updated: 19.6.2024