Please note that the descriptions and size information are based on our current estimates and may be updated at a later stage.
For companies and non-academic organizations, the following versions of this resource are currently available or forthcoming:

| Version | Description | Price |
|---|---|---|
| Donate Speech Corpus: Sample | A free sample that contains a randomly selected set of 40 audio files and their corresponding transcripts as plain text files and as annotation files including time alignments. The metadata regarding the recorded samples and the background details supplied by the speakers (if available) are also included. The total duration of the audio files is about 35 minutes. See: Metadata; Reference instructions for this version. | Free of charge. See instructions. |
| Donate Speech: Selected dataset | This resource contains five different subsets that were selected at Aalto University especially for developing, training and testing ASR systems. The total duration of the audio files is about 131 hours. See: Metadata; Reference instructions for this version. | 1000 €. See instructions. |
| Donate Speech: Annotated dataset | This resource contains all the annotated audio files, their transcriptions as raw text files and annotation files, and the background information regarding the recordings and speakers. The total duration of the audio files is about 1600 hours. See: Metadata; Reference instructions for this version. | 5000 €. See instructions. |
| Donate Speech: Complete dataset, version 1 | The Complete dataset (version 1) includes the Annotated dataset (and the Selected dataset and the Sample). In addition, the Complete dataset also includes the audio files that were not transcribed or annotated. See: Metadata; Reference instructions for this version. | 10 000 €. See instructions. |
The first version of the Donate Speech Corpus (Puhelahjat) is a collection of speech recordings accumulated during the Donate Speech campaign between 16 June 2020 and 14 September 2021.
The resource contains a total of about 3200 hours of speech recordings, of which about 1600 hours have been transcribed. The resource also includes information about the elicitation tasks for which each of the speech samples was donated in the original campaign, and the background details that were voluntarily provided by speech donors.
The resource is available via the download service of the Language Bank of Finland under restricted terms and conditions. The services of the Language Bank are directed at academic researchers. For companies and non-academic organizations, access to Puhelahjat datasets may be acquired for a fee. Further details can be requested by email at lahjoita-puhetta@helsinki.fi.
NB: These instructions are still subject to change.
In accordance with the specific terms and conditions of the Puhelahjat resource, it is also possible to grant access to the data for commercial and non-academic purposes. However, in this case, a separate license agreement between the University of Helsinki and the company or organization is required. When the agreement is signed and the payment has been made, access can be granted to the representative authorized by the user organization.
When applying for access to paid material, the applicant must show that the license fee has been paid.
Last updated: 20.4.2023
Persistent Identifier of this page: urn:nbn:fi:lb-2022111627