Life cycle and metadata model of language resources

Parts of a language resource
Persistent identifiers
Language resource versions
When is a new version generated?
How is a new version generated?
- Accumulating corpora
- Other corpora
Preservation of language resources
Common language resource relations
Sources

Parts of a language resource

A language resource consists of at least three parts:

Abbreviation (used in URLs, directory names and downloadable packages’ names)
Metadata (edited in COMEDI)
Location of the content, e.g. in Korp or the download service.

In addition, a language resource may have its own license page and instructions, if needed. If several members of a language resource family share license terms, only one license information document is produced. Resource-specific instruction pages describe only those features of the resource’s usage that are not already covered in the general instructions of the applicable tool or application.

Persistent identifiers

All parts of a language resource are referred to using persistent identifiers (PIDs). The Language Bank of Finland uses both the URN and Handle systems. Of these two, URN is more common in the Nordic countries and Handle is more widely used globally. At the Language Bank, URNs and Handles have a 1:1 mapping, e.g. hdl:11113/lb-201710212 and urn:nbn:fi:lb-201710212 point to the same page.
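
The 1:1 mapping can be illustrated with a minimal Python sketch. It assumes that every Language Bank identifier uses the same two prefixes as the example above and that the part after the prefix is identical in both systems; the source only shows this for a single identifier, and the function names are hypothetical.

```python
# Prefixes copied from the example pair above. That they apply to all
# Language Bank identifiers is an assumption, not a documented rule.
HANDLE_PREFIX = "hdl:11113/"
URN_PREFIX = "urn:nbn:fi:"


def handle_to_urn(handle: str) -> str:
    """Return the URN that corresponds to a Language Bank Handle."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a Language Bank Handle: {handle!r}")
    return URN_PREFIX + handle[len(HANDLE_PREFIX):]


def urn_to_handle(urn: str) -> str:
    """Return the Handle that corresponds to a Language Bank URN."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a Language Bank URN: {urn!r}")
    return HANDLE_PREFIX + urn[len(URN_PREFIX):]
```

For example, handle_to_urn("hdl:11113/lb-201710212") returns "urn:nbn:fi:lb-201710212", and urn_to_handle reverses the conversion.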

A persistent identifier in the Language Bank means that the user can rely on the information referred to by the identifier to remain accessible, even if the language resource’s location changes. The new location is accessible either directly (the identifier points directly to the new location) or indirectly (the identifier points at a page with information about the location of the old version and how to continue using it as well as how to access the new version).

Persistent identifiers have two main functions:

To ensure accessibility of information if its location changes (e.g. if the corpora in Korp have been migrated elsewhere).
To retain information about past language resources when continuing to provide the old version is not practical (e.g. for financial reasons).

Language resource versions

A language resource may have several different variants (i.e. versions) that form a language resource family.

Examples of language resource families:

Different parsers’ morphological analysis results for a single corpus.
Text version of an audio or video corpus (manually or automatically generated).
Accumulating corpus: the content is almost identical but one version has more or newer content.
Repaired corpus: flaws in a corpus have been identified and fixed manually or automatically.

In all aforementioned cases, it is important that the language resource’s user be able to unambiguously refer to the applicable resource at present as well as in the future. This is why each version always has its own abbreviation, metadata page and location. On the other hand, a language resource family may share a license or instruction page.

To see how the Language Bank fares in relation to RDA recommendations, see the commented RDA Data Versioning Working Group report.

When is a new version generated?

A new version of a corpus is generated when the corpus’s content changes significantly. What constitutes a significant change is defined individually for each corpus. Unless the corpus description specifies otherwise, changes that may substantially affect research results or that are not easily reversible are considered significant. All non-significant changes are recorded in the change log in the corpus’s metadata.

Examples of non-significant changes:

A single article in a large conversation corpus has to be removed at an informant’s request. In this case, providing the previous version would not be possible in the first place.
Some hand-written tags in a large corpus have been found to contain a typographical error.
A corpus has been automatically converted from Latin-1 to UTF-8 character encoding. The old encoding remains accessible in the archive.

How is a new version generated?

If a new version of a corpus is generated, its relation to the previous versions is recorded in the metadata in COMEDI. The new version receives a new PID and a new metadata record. In the metadata records, the new and old versions are linked with the IsNewVersionOf and IsPreviousVersionOf relations (see below).
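
The linking step can be sketched in Python as follows. This is only an illustration of the relation directions, not COMEDI’s actual data model: the MetadataRecord class, its fields and the link_versions function are hypothetical, and any abbreviations or PIDs passed to them are placeholders. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical, simplified stand-in for a metadata record."""
    abbreviation: str
    pid: str
    # Each relation is a (relation name, PID of the related resource) pair.
    relations: list[tuple[str, str]] = field(default_factory=list)


def link_versions(old: MetadataRecord, new: MetadataRecord) -> None:
    """Record the version relation in both metadata records."""
    # The new version points back at the version it supersedes.
    new.relations.append(("IsNewVersionOf", old.pid))
    # The old version points forward at its successor.
    old.relations.append(("IsPreviousVersionOf", new.pid))
```

Each version keeps its own record and PID; only the relations tie the two together, so references to the old version remain valid.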

If the previous version is no longer relevant to research, the new version replaces it in the Language Bank’s corpus list. The kielipankki.fi/<abbreviation> links also always point at the most recent versions. However, PIDs are always preserved. They point at either the old version or relevant information (“tombstone page”) about how to obtain it or how queries executed in the old version can be reproduced in the new version.

Accumulating corpora

Suomi24: The corpus is updated biannually. The versions’ abbreviations follow the format Suomi24-<year><year half>, e.g. Suomi24-2016H1. Newer versions always contain the previous versions, and queries can be reproduced by defining the period accordingly.
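
The abbreviation format can be handled with a short Python sketch. The function names are hypothetical. The sketch assumes a four-digit year and that H1 covers January to June and H2 July to December; the source gives only the pattern and one example, so the calendar boundaries in particular are an assumption.

```python
import re
from datetime import date

# Pattern for Suomi24-<year><year half>, e.g. Suomi24-2016H1.
_SUOMI24_RE = re.compile(r"Suomi24-(\d{4})H([12])")


def suomi24_abbreviation(year: int, half: int) -> str:
    """Build the abbreviation for a given year and year half (1 or 2)."""
    if half not in (1, 2):
        raise ValueError("half must be 1 or 2")
    return f"Suomi24-{year}H{half}"


def parse_suomi24_abbreviation(abbreviation: str) -> tuple[int, int]:
    """Split an abbreviation into (year, half)."""
    match = _SUOMI24_RE.fullmatch(abbreviation)
    if match is None:
        raise ValueError(f"unexpected abbreviation: {abbreviation!r}")
    return int(match.group(1)), int(match.group(2))


def period_end(year: int, half: int) -> date:
    """Last day covered by a version (assumes calendar half-years)."""
    return date(year, 6, 30) if half == 1 else date(year, 12, 31)
```

Under these assumptions, a query made against Suomi24-2016H1 could be repeated in a later version by restricting the period to end on period_end(2016, 1).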

Other corpora

New versions receive new version numbers, e.g. helpuhe-v2. The metadata contains a description of the difference between the new and the old version. The old version is archived if need be, and its PIDs point at a “tombstone page”.

Preservation of language resources

The Language Bank does not delete the deposited language resources without their owner’s consent.

Common language resource relations

IsVariantFormOf / IsOriginalFormOf

Two versions or variations of a language resource, e.g. a corpus packaged in different ways. Downloadable versions are usually considered the original form (IsOriginalFormOf) of the variant forms.

IsDerivedFrom / IsSourceOf

The language resource is derived from another, e.g. a frequency lexicon or a language model.

IsPreviousVersionOf / IsNewVersionOf

The language resource is a previous / newer version of the related resource.

E.g. version 1 points to version 2 using IsPreviousVersionOf. Example: lehdet90ff-v1.

IsPartOf / HasPart

The language resource is a part of another (broader resource or collection). Can be used e.g. for parts of a serial corpus.

IsContinuedBy / Continues

The corpus is a continuation of another. The content is different but the compilation method is the same.

IsCompiledBy / Compiles

The tool that was used in creating the corpus, e.g. a parser.
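
Because each relation above comes as a pair, the pairs can be kept in a simple lookup table. The following Python sketch lists only the pairs named in this section; the table and function names are hypothetical, and further relation types from DataCite are not included.

```python
# Relation pairs as listed in this section: (relation, inverse relation).
RELATION_PAIRS = [
    ("IsVariantFormOf", "IsOriginalFormOf"),
    ("IsDerivedFrom", "IsSourceOf"),
    ("IsPreviousVersionOf", "IsNewVersionOf"),
    ("IsPartOf", "HasPart"),
    ("IsContinuedBy", "Continues"),
    ("IsCompiledBy", "Compiles"),
]

# Build a lookup that works in both directions.
INVERSE = {}
for relation, inverse in RELATION_PAIRS:
    INVERSE[relation] = inverse
    INVERSE[inverse] = relation


def inverse_relation(name: str) -> str:
    """Return the relation to record on the related resource."""
    try:
        return INVERSE[name]
    except KeyError:
        raise ValueError(f"relation not listed in this section: {name!r}") from None
```

For instance, if one record states IsPartOf, the related record would state inverse_relation("IsPartOf"), which is "HasPart".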

Other relations

If none of the relations described above applies, other possible relations can be found at DataCite ([1]). Using relation terminology other than DataCite’s is not permitted.

Sources

[1] DataCite Metadata Working Group. (2016, from page 37 onward). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.0. DataCite e.V. http://doi.org/10.5438/0012

