This corpus is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group. The original session transcripts and videos are available at the web
portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi).
The Finnish corpus is split into three parts:
1. 2015-2020 set
2. 2008-2016 set
3. Development and evaluation sets
A non-overlapping combination of the 2008-2016 set and the 2015-2020 set form a training set of size:
– 1 422 318 sample pairs
– 3 130 hours of speech
– 19 356 831 word tokens
The Finland Swedish corpus contains:
– 3889 sample pairs
– 6.4 hours of speech
– 333 483 word tokens
All audio files in this corpus are single-channel wavs with sample rate 16 kHz and 16-bit precision.
The transcript files (.trn) are plain text files.
Latest versions/subcorpora: | |
Aalto Finnish Parliament ASR Corpus 2008-2020, version 2 Metadata and license Attribution instructions |
Download the resource |
Aalto Finland Swedish Parliament ASR Corpus 2015-2020 Metadata and license Attribution instructions |
Download the resource |
Of this language corpus different versions are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found from the list above.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081105