Donate Speech datasets (puhelahjat) for research use

Donate Speech datasets for commercial use: see further details on another page

Important information for all users of this resource: Removal requests

Versions of this resource:
Donate Speech Corpus, version 1.0. Metadata; License (for researchers); Attribution instructions. Apply for access rights (academic research use only). +PRIV: This resource contains personal data. Submit public information about personal data processing. Download the resource.
Donate Speech Corpus: Sample. Metadata; License (for researchers); Attribution instructions. Download the resource.
Donate Speech Corpus: Selected dataset. Metadata; License (for researchers); Attribution instructions. Download the resource.
Donate Speech Corpus: Training data (100h). Metadata; License (for researchers); Attribution instructions. Downloadable as part of the Selected dataset, see above.
Donate Speech Corpus: Test data (10h). Metadata; License (for researchers); Attribution instructions. Downloadable as part of the Selected dataset, see above.
Donate Speech Corpus: Development data (10h). Metadata; License (for researchers); Attribution instructions. Downloadable as part of the Selected dataset, see above.
Donate Speech Corpus: Multi-transcriber test data (1h). Metadata; License (for researchers); Attribution instructions. Downloadable as part of the Selected dataset, see above.
Donate Speech Corpus: Test data from multi-transcriber speakers (10h). Metadata; License (for researchers); Attribution instructions. Downloadable as part of the Selected dataset, see above.
Look for other versions of this resource

Contents of the resource

The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki, launched on June 16, 2020. During the project, anyone who speaks some Finnish had the opportunity to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.

The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3,200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.

Version 1.0 of the dataset is available in the download service to researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and citation practices of each subset can be found in the corresponding metadata record.

The Donate Speech datasets may be updated later, for instance once a sufficient number of new donations has accumulated. New versions may also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.

How to obtain access to use the material?

Research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.

Research use

Researchers can apply for the right to use the data via the usual application procedure in the Language Bank Rights system (see instructions).
When applying for access, the researcher must take into account the license requirements, including the resource-specific data protection terms and conditions regarding the processing of personal data; see the license (for researchers).
Before starting to process the data, the researcher must submit the title of the project and the link to the public Privacy Notice regarding the processing of personal data in their project (see the e-form).
When the application is approved, the researcher can access the entire Donate Speech Corpus as well as all versions and subsets of the resource.

The instructions for commercial use can be found on a separate page.

Last updated: 7.3.2024

Persistent identifier of this page: urn:nbn:fi:lb-2022102121

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
