Donate Speech datasets for commercial use: see further details on another page
Important information for all users of this resource: Removal requests
Versions of this resource:

| Version | Documentation | Access |
|---|---|---|
| Donate Speech Corpus, version 1.0 | Metadata; License (for researchers); Attribution instructions | Academic research use only; Apply for access rights; +PRIV: This resource contains personal data; Submit public information about personal data processing; Download the resource |
| Donate Speech Corpus: Sample | Metadata; License (for researchers); Attribution instructions | Download the resource |
| Donate Speech Corpus: Selected dataset | Metadata; License (for researchers); Attribution instructions | Download the resource |
| Donate Speech Corpus: Training data (100h) | Metadata; License (for researchers); Attribution instructions | |
| Donate Speech Corpus: Test data (10h) | Metadata; License (for researchers); Attribution instructions | |
| Donate Speech Corpus: Development data (10h) | Metadata; License (for researchers); Attribution instructions | |
| Donate Speech Corpus: Multi-transcriber test data (1h) | Metadata; License (for researchers); Attribution instructions | |
| Donate Speech Corpus: Test data from multi-transcriber speakers (10h) | Metadata; License (for researchers); Attribution instructions | |

Look for other versions of this resource
The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign, which was launched on June 16, 2020 and implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki. During the project, anyone who spoke at least some Finnish could donate their own speech to promote language research and the development of language technology. The donated speech was recorded with an easy-to-use browser or mobile application.
The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3,200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.
Version 1.0 of the dataset is available in the download service to researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and citation practices of each subset can be found in the corresponding metadata record.
The Donate Speech datasets may be updated later, for instance after a sufficient number of new donations has accumulated. New versions may also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.
Research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.
The instructions for commercial use can be found on a separate page.
Last updated: 7.3.2024
Persistent identifier of this page: urn:nbn:fi:lb-2022102121