In addition to publicly available corpora, the Language Bank of Finland can store or distribute resources whose handling requires special protective measures because they contain personal data.
Personal data are usually included, for example, in speech corpora that contain audio or video material. In such cases, the material cannot usually be made entirely openly available, but, under certain terms, it too can be stored in the Language Bank of Finland. This page describes some of the protective measures that can be applied, on a case-by-case basis, particularly when distributing material that contains personal data.
In the end, decisions pertaining to the processing of personal data and the required protective measures are the responsibility of the controller of the relevant data. The services required can be negotiated with the Language Bank of Finland.
Collecting research data and organising them into the form required by the research consumes time, effort and money. If further use is expected for the data, their storage may be justified, even if they contain personal data or if parts of the data are sensitive. However, the grounds for storing personal data must always be carefully documented and the data subjects informed of them. Furthermore, the appropriate protective measures must be utilised in the processing of personal data.
For scientific research, it is important that other researchers can replicate previously published studies using exactly the same data, to verify that the original study was appropriately conducted. For example, the peer-review process may require that the reviewer be given access to the data used in the study.
It is common to aim to carry out further research on the same topic or a similar topic after the completion of the original research project, which makes it necessary to use the same data again. For example, the research hypothesis may need to be reformulated, or another kind of analysis technique may be trialled. At times, more extensive research is needed in a given area, for which it is necessary to analyse datasets collected earlier. Accumulating an entirely new and massive dataset from the start would often be too costly or otherwise arduous. In such cases, previously collected, carefully documented and securely stored data may turn out to be a treasure trove.
Alongside scientific research, historical research purposes or statistical purposes can also serve as grounds for the processing of personal data. In the case of corpora, the storage of speech recordings, for example, may in certain cases be justified by their historical and cultural value.
An ethics review is required for certain types of research. In such cases, the researcher requests a statement from an ethics committee before initiating the collection of data. For example, in the case of medical research, the research ethics committee may, as a rule, require the destruction of the data after the conclusion of the research. The grounds for storing the corpus should therefore be clearly stated in the request for a statement, and the request should include a specific plan for the protective measures to be applied to the corpus.
In connection with research, study subjects must also be clearly informed of the protective measures used in the processing of personal data. It should also be noted that if the subjects were told, when they were originally informed of the project, that their data would be destroyed after a given date, such a commitment cannot usually be revoked later (unless the study subjects can be contacted again, informed of the further research and asked to participate in it as well).
You can read more about data protection and informing study subjects in, for example, the Data Management Guidelines published by the Finnish Social Science Data Archive.
The information security of the equipment and systems used in the processing of personal data must be sufficient and up to date. Corpora must be protected without gaps from start to finish, including during data transfer.
When necessary, the party who has compiled the corpus or is storing the corpus in the Language Bank of Finland can protect corpora that contain personal data, for example, by pseudonymising them in a way that fits the purpose of processing and by classifying the personal data in a way that makes them less identifying.
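As an illustration of classifying personal data in a way that makes them less identifying, the following minimal Python sketch replaces an exact age with an age band and a municipality with a broader region. The field names, the band width and the region mapping are hypothetical assumptions made for this example; they are not practices prescribed by the Language Bank of Finland.

```python
# Hypothetical mapping from municipality to a broader region (illustrative only).
REGION_BY_MUNICIPALITY = {
    "Helsinki": "Uusimaa",
    "Espoo": "Uusimaa",
    "Tampere": "Pirkanmaa",
}


def age_band(age: int, width: int = 10) -> str:
    """Return a coarse age band such as '30-39' instead of the exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalise(record: dict) -> dict:
    """Return a copy of a speaker record with less identifying categories."""
    return {
        "age_band": age_band(record["age"]),
        # Municipalities missing from the mapping fall back to a neutral label.
        "region": REGION_BY_MUNICIPALITY.get(record["municipality"], "other"),
        "dialect": record["dialect"],
    }


# Hypothetical speaker metadata record.
speaker = {"age": 34, "municipality": "Espoo", "dialect": "southern"}
print(generalise(speaker))
```

How coarse the categories need to be depends on the corpus: the fewer speakers share a category, the more identifying it remains, so the choice has to fit the purpose of processing.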
When necessary, identifying data can be stored by encrypting them with a sufficiently strong encryption key.
If a corpus is pseudonymised and a code key associated with the study subjects needs to be stored, the key must be separate from the actual corpus, both technically and administratively. Please note that the Language Bank of Finland does not pseudonymise corpora, nor does it accept for storage code keys associated with study subjects. This means that researchers themselves are responsible for coding corpora content, file names, etc.
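The separation of the corpus and the code key can be sketched in Python as follows. This is only one possible approach, shown under stated assumptions: the record layout, the `speaker` field, the `S-` code prefix and the example names are all hypothetical, and the researcher remains responsible for choosing a method that fits the corpus.

```python
import secrets


def pseudonymise(records, id_field="speaker"):
    """Replace direct identifiers with random codes.

    Returns the pseudonymised records and the code key as two separate
    objects, so that the key can be stored apart from the corpus.
    """
    code_key = {}  # real identifier -> pseudonym; must be kept outside the corpus
    pseudonymised = []
    for record in records:
        name = record[id_field]
        if name not in code_key:
            # Random codes reveal nothing about the original identifier.
            code = "S-" + secrets.token_hex(4)
            while code in code_key.values():  # avoid an (unlikely) collision
                code = "S-" + secrets.token_hex(4)
            code_key[name] = code
        # Copy the record, overwriting only the identifying field.
        pseudonymised.append({**record, id_field: code_key[name]})
    return pseudonymised, code_key


# Hypothetical transcript records with placeholder names.
utterances = [
    {"speaker": "Maija Meikäläinen", "text": "hei"},
    {"speaker": "Matti Meikäläinen", "text": "moi"},
    {"speaker": "Maija Meikäläinen", "text": "kiitos"},
]
corpus, key = pseudonymise(utterances)
```

In this sketch only `corpus` would be deposited; `key` stays with the researcher in a separate, access-controlled location. Identifiers that occur inside the content itself, such as names mentioned in the speech or embedded in file names, are not touched by this function and would need to be coded separately.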
Corpora that have been fully anonymised no longer contain any personal data. Full anonymisation means that no living persons can be identified in any way on the basis of the anonymised data, not even by combining details from the corpus with data available elsewhere. Fully anonymised corpora need not be separately protected on the basis of data protection regulations. In other words, fully anonymised corpora to which no copyright restrictions apply can be openly published.
Fully anonymising data is often impossible in practice, whether because of the amount of effort required, for technical reasons, or because full anonymisation would make the data useless for research purposes. If such corpora need to be stored regardless, any clearly unnecessary identifiers must be removed whenever possible. In the Language Bank of Finland, other protective measures can be applied to corpora, such as restricting access to specific users (see below).
You can read more about anonymisation and pseudonymisation in, for example, the Data Management Guidelines published by the Finnish Social Science Data Archive.
When corpora are stored in the Language Bank of Finland, they are centrally maintained and distributed under conditions agreed together with the rights holders. Standardised, clear and fit-for-purpose practices are helpful to researchers who need corpora in their work. At the same time, they reduce the risk of misconduct.
The user administration and other technical solutions of the services provided by the Language Bank of Finland are managed by CSC – IT Center for Science Ltd. For example, students and researchers covered by Finnish and international federations of trust can log in to the Language Bank of Finland securely using the username granted to them by their organisation.
The following corpus-specific protective measures are included in the list of measures that can be applied to the corpora stored in the Language Bank of Finland at the moment:
When necessary, other protective measures required by individual corpora can be negotiated with the Language Bank of Finland.
Guidelines for processing corpora containing personal data stored in the Language Bank of Finland
Privacy practices of the Language Bank of Finland
Last updated: 8.2.2024