A language resource consists of three parts at the minimum:
In addition, a language resource may have its own license page and instructions, if needed. If several members of a language resource family share the same license terms, only one license document is produced. Resource-specific instruction pages describe only those features of the resource’s usage that are not covered in the general instructions of the applicable tool or other application.
All parts of a language resource are referred to using persistent identifiers (PIDs). The Language Bank of Finland uses both the URN and Handle systems. Of these two, URN is more common in the Nordic countries and Handle is more widely used globally. At the Language Bank, URNs and Handles have a 1:1 mapping, e.g. hdl:11113/lb-201710212 and urn:nbn:fi:lb-201710212 point to the same page.
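The 1:1 mapping can be illustrated with a minimal Python sketch. It assumes that the two identifiers differ only in their prefix and share the same suffix, as in the example above; the function names are hypothetical, and this is not the Language Bank’s actual resolver logic.

```python
# Prefixes taken from the example pair above; treating them as fixed is an assumption.
HANDLE_PREFIX = "hdl:11113/"
URN_PREFIX = "urn:nbn:fi:"


def handle_to_urn(handle: str) -> str:
    """Return the URN that shares its suffix with the given Handle."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"unexpected Handle prefix: {handle!r}")
    return URN_PREFIX + handle[len(HANDLE_PREFIX):]


def urn_to_handle(urn: str) -> str:
    """Return the Handle that shares its suffix with the given URN."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"unexpected URN prefix: {urn!r}")
    return HANDLE_PREFIX + urn[len(URN_PREFIX):]


print(handle_to_urn("hdl:11113/lb-201710212"))
print(urn_to_handle("urn:nbn:fi:lb-201710212"))
```

Both directions raise an error on an unfamiliar prefix rather than guessing, since an identifier outside this pattern may not belong to the Language Bank at all.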
A persistent identifier in the Language Bank means that the user can rely on the information referred to by the identifier to remain accessible, even if the language resource’s location changes. The new location is accessible either directly (the identifier points directly to the new location) or indirectly (the identifier points at a page with information about the location of the old version and how to continue using it as well as how to access the new version).
Persistent identifiers have two main functions:
A language resource may have several different variants (i.e. versions) that form a language resource family.
Examples of language resource families:
In all aforementioned cases, it is important that the language resource’s user be able to unambiguously refer to the applicable resource at present as well as in the future. This is why each version always has its own abbreviation, metadata page and location. On the other hand, a language resource family may share a license or instruction page.
To see how the Language Bank fares in relation to RDA recommendations, see the commented RDA Data Versioning Working Group report.
A new version of a corpus is generated when the corpus’s content changes significantly. What constitutes a significant change is defined individually for each corpus. Unless the corpus description specifies otherwise, changes that may substantially affect research results or that are not easily reversible are considered significant. All non-significant changes are recorded in the change log in the corpus’s metadata.
Examples of non-significant changes:
If a new version of a corpus is generated, its relation to the previous versions is recorded in the metadata in COMEDI. The new version receives a new PID and a new metadata record. In the metadata records, the new and old versions are linked with the IsNewVersionOf and IsPreviousVersionOf relations (see below).
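The reciprocal linking of two versions can be sketched as follows. This is only an illustration: the `MetadataRecord` class, the `link_versions` function, the abbreviations and the PIDs are all hypothetical placeholders and do not reflect COMEDI’s actual data model. Only the relation names come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MetadataRecord:
    """Hypothetical stand-in for a metadata record; not COMEDI's real schema."""
    pid: str
    abbreviation: str
    # Each relation is a (relation name, PID of the related resource) pair.
    relations: List[Tuple[str, str]] = field(default_factory=list)


def link_versions(old: MetadataRecord, new: MetadataRecord) -> None:
    """Record the version relation in both directions."""
    old.relations.append(("IsPreviousVersionOf", new.pid))
    new.relations.append(("IsNewVersionOf", old.pid))


# Placeholder PIDs and abbreviations, invented for this example only.
v1 = MetadataRecord(pid="urn:nbn:fi:lb-EXAMPLE-1", abbreviation="example-v1")
v2 = MetadataRecord(pid="urn:nbn:fi:lb-EXAMPLE-2", abbreviation="example-v2")
link_versions(v1, v2)
print(v1.relations)
print(v2.relations)
```

Storing the relation on both records means that a reader who arrives at either version can find the other.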
If the previous version is no longer relevant to research, the new version replaces it in the Language Bank’s corpus list. The kielipankki.fi/<abbreviation> links also always point at the most recent version. However, PIDs are always preserved: they point either at the old version or at a “tombstone page” with information on how to obtain it or how queries executed on the old version can be reproduced in the new version.
Suomi24: The corpus is updated biannually. The versions’ abbreviations follow the format Suomi24-<year><year half>, e.g. Suomi24-2016H1. Newer versions always contain the previous versions, and queries can be reproduced by defining the period accordingly.
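The abbreviation format lends itself to a small parsing sketch. The regular expression follows the Suomi24-<year><year half> pattern and the Suomi24-2016H1 example. The function names are hypothetical, and the cut-off dates (30 June for H1, 31 December for H2) are an assumption made for illustration, not a documented property of the corpus.

```python
import re
from datetime import date
from typing import Tuple

# Pattern inferred from the example "Suomi24-2016H1"; only H1 and H2 are assumed to exist.
ABBREVIATION_RE = re.compile(r"^Suomi24-(\d{4})H([12])$")


def parse_abbreviation(abbreviation: str) -> Tuple[int, int]:
    """Split a version abbreviation into (year, year half)."""
    match = ABBREVIATION_RE.match(abbreviation)
    if match is None:
        raise ValueError(f"not a Suomi24 version abbreviation: {abbreviation!r}")
    return int(match.group(1)), int(match.group(2))


def period_end(abbreviation: str) -> date:
    """Last day covered by a version, under the ASSUMED half-year cut-offs."""
    year, half = parse_abbreviation(abbreviation)
    return date(year, 6, 30) if half == 1 else date(year, 12, 31)


print(parse_abbreviation("Suomi24-2016H1"))
print(period_end("Suomi24-2016H1"))
```

Under these assumptions, a query originally run on an older version could be repeated on a newer one by restricting it to material dated no later than the older version’s `period_end`.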
New corpora receive new version numbers, e.g. helpuhe-v2. The metadata contains a description of the difference between the new and the old version. The old version is archived if need be, and PIDs point at a “tombstone page”.
The Language Bank does not delete the deposited language resources without their owner’s consent.
Two versions or variations of a language resource, e.g. a corpus packaged in different ways. Downloadable versions are usually considered the “OriginalFormOf” VariantForms.
The language resource is derived from another, e.g. a frequency lexicon or a language model.
The language resource is a previous / newer version of the related resource.
E.g. version 1 points to version 2 using IsPreviousVersionOf. Example: lehdet90ff-v1.
The language resource is a part of another (broader resource or collection). Can be used e.g. for parts of a serial corpus.
The corpus is a continuation of another. The content is different, but the compilation method is the same.
The tool that was used in creating the corpus, e.g. a parser.
If none of the relations described above applies, other possible relations can be found in the DataCite documentation [1]. Using relation terminology other than DataCite’s is not permitted.
[1] DataCite Metadata Working Group. (2016, from page 37 onward). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.0. DataCite e.V. http://doi.org/10.5438/0012