The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version


This corpus contains newspapers and magazines from Finland starting from 1770, compiled by the National Library of Finland.

NB: The Finnish acronym for the Newspaper and Periodical OCR Corpus of the National Library of Finland used to be "Digilib". The acronym "klk" and the short names klk-fi-1874-dl and klk-fi-1920-dl are now recommended instead.

Latest versions/subcorpora:

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT
Metadata and license
Attribution instructions
Download the resource

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, VRT
Metadata and license
Attribution instructions
Download the resource

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp
Metadata and license
Attribution instructions
Example queries in Korp
Select the corpus in Korp

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version
Metadata and license
Attribution instructions
Select the corpus in Korp

The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version
Metadata and license
Attribution instructions
Select the corpus in Korp

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)
Metadata and license
Attribution instructions
Download the resource

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920)
Metadata and license
Attribution instructions
Download the resource

The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1771–1879, VRT
Metadata and license
Attribution instructions
Download the resource

The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT
Metadata and license
Attribution instructions
Download the resource

Search for these versions in META-SHARE

Several versions and subcorpora of this corpus are published in the Language Bank of Finland. They are available through the Language Bank Download Service, through the Korp concordance tool, or both. Links to the individual versions are in the list above.

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

N-grams

Word-level collections of unigrams, bigrams and trigrams have been created from the KLK data and are available for download. They are distributed as data sets of their own:

The N-grams of the Newspaper and Periodical Corpus of the National Library of Finland
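
To illustrate what a word-level collection of uni-, bi- and trigrams contains, here is a minimal Python sketch. It is not the procedure used to build the published N-gram data sets, whose tokenization and file format are described in their own metadata. The function names, the toy sentences, and the choice not to let n-grams cross sentence boundaries are all assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(
    sentences: Iterable[Sequence[str]], max_n: int = 3
) -> Dict[int, Counter]:
    """Count 1..max_n-grams per sentence.

    Assumption: n-grams do not span sentence boundaries.
    """
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


# Invented toy sentences, already tokenized; not taken from the corpus.
sample = [
    ["sosialismi", "on", "aate"],
    ["sosialismi", "on", "sana"],
]

counts = count_ngrams(sample)
for n in sorted(counts):
    for gram, freq in counts[n].most_common():
        print(n, " ".join(gram), freq)
```

Each n-gram is stored as a tuple of tokens with its frequency, so the bigram "sosialismi on" is counted twice in the toy input while each trigram is counted once.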

 

Example queries from Korp

 

Concordance view of any form of the word 'sosialismi' in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp

 

Word picture of the word 'sosialismi' in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp (klk-fi-v2-korp)

 

Trend diagram of all forms of the word 'sosialismi' occurring in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp (klk-fi-v2-korp)

OCR quality

The corpora consist mainly of digitized versions of texts originally printed on paper. The physical papers were scanned, and optical character recognition (OCR) was performed on the resulting images. The digitized material spans a long period and contains many kinds of texts, writing styles and fonts. Some parts of the material are harder to scan than others, and the physical condition of the original texts also varies. The OCR techniques used have varied as well, and some of the texts may have gone through manual post-correction. As a result, the quality of the corpora is uneven: some parts are of good quality, while others are of very poor quality. We have collected a list of publications related to OCR quality and collection processing:

 


This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092404

Last updated: 19.6.2024

