Creating a metadata record
This document describes the minimum set of details that should be included in a metadata record that is created with COMEDI.
NB: To request the creation of a metadata record for a new/forthcoming resource, Language Bank users are requested to fill in this form:
Submit information about a language resource to Kielipankki
Access the preliminary details submitted via the e-form (NB. Only for Kielipankki elomake admin users)
How to create a metadata record
Metadata checklist
Legend:
* = The metadata field is mandatory, i.e., at least some relevant information should be filled in right from the start.
(*) = The metadata field is required by the Language Bank at a later stage, but it is not mandatory for publishing the initial version of the metadata record.
- *Identification info: Specify the identification info details:
- *Resource name (Required: Identification): The full name or title by which the resource is known.
- Preferably, the resource name should be available both in English and in Finnish. Swedish is to be avoided, since it is not well supported and might be confusing.
- The resource name may also include an extension at the end, identifying a specific variant or version of the original content, e.g., ”The Suomi24 Corpus 2001-2017, VRT version 1.1”. See resource naming conventions.
- The element can be repeated for the different language versions using the ”lang” attribute to specify the language.
- *Description (Required: Identification): Provides the description of the resource in prose.
- The element can be repeated for the different language versions using the ”lang” attribute to specify the language. Provide a Description at least in English. A Finnish translation is usually also provided.
- The Description field should begin with the following type of sentence (English):
”This resource [is | will be] available [via Korp | for download] in Kielipankki – the Language Bank of Finland[, see Access location].”
A standard expression can then be used for finding all resources that are distributed via Kielipankki.
- Do not repeat URNs or other links that are already mentioned elsewhere in the metadata!
- Do not mention physical links like ”korp.csc.fi/download”. (If really necessary, use more current aliases: kielipankki.fi/download; kielipankki.fi/lataus.)
- In the text, try to describe things like what, why, how much, created how, by whom, etc. The initial draft of the English and Finnish versions should preferably be written by the corpus owner/depositor.
- If external documentation is available, you may use a standard note like ”For links to further information, please see Documentation.”
- Please enter the keyword/hashtag to be used for the resource on its own line at the end of the Description(s). See instructions in naming conventions, e.g.,
#lb_suomi24
- Do not include Change Log information in the Description (see instructions under Documentation).
- (*)Resource short name (Required: Identification): The short form (abbreviation, acronym etc.) used to identify the resource
- Do not localize the short name! It is usually in English. It is used to create directory names, or it may be part of a filename.
- Do not use spaces. Use only lowercase letters. See the resource naming conventions.
- Formatting the short names for different versions and variants of the same resource group:
- If the resource is the source version (the first version obtained by the Language Bank), end the shortname with ”-src”, as in urn:nbn:fi:lb-2017070501
- If the resource is the Korp version, end the shortname with ”-korp”, as in urn:nbn:fi:lb-2019120403
- If the resource is the VRT-file exported from Korp, intended for the download service, end the shortname with ”-vrt”, as in urn:nbn:fi:lb-2019052701
- If the resource is a scrambled version, insert a ”-s-” in the shortname before the -src, -korp, or -vrt, as in urn:nbn:fi:lb-2019120404
- If the resource is a parallel version of another corpus, insert ”-par-” in the shortname, somewhat as in urn:nbn:fi:lb-2019042605 (although that example would be better if the ”-par-” came after date information such as ”-2018-” and before the endings mentioned above).
- (*)Url (Required: Identification): The ”Access location” of the resource. Usually a URN, but not necessarily. Can be a download location or a URL pointing to Korp.
- (*)Identifier (Required: Identification): The unique citable URN is the primary ID of this particular version or variant of the resource. The URN is used to refer to the resource from various services, e.g., from the corresponding end-user license pages, from the Language Bank Rights system, and from the list of resources on the Language Bank website.
- For the time being, we add ”http://urn.fi/” in front of the URN, in order to make the link clickable via COMEDI.
- The URN may be requested by the Language Bank and added to the Identifier field of the metadata record as soon as
- the resource already exists, and
- a preliminary assessment has been made that the resource can probably be included in the Language Bank, even if a deposition agreement or a specific license has not yet been fully confirmed.
- *Distribution (Required: Identification): Specify the license details.
- *Availability:
- Minimally, select ”Under Negotiation”, in case the resource is not yet available (and/or nothing further is known).
- Select ”Available – Restricted Use” in case the resource is available and access to the resource is restricted in some way (CLARIN ACA, CLARIN RES, or similar).
- Select ”Available – Unrestricted Use” in case the resource is available and access to the resource is not restricted in any way (CLARIN PUB, Creative Commons licenses, or similar).
- If the general license category (such as CLARIN RES) or a specific license (e.g., CC-BY) is already known to apply to the final resource, the Licence (see below) can be added at an early stage, even if Availability is technically still ”Under Negotiation”.
- (*)Licence: Select the appropriate license category and the individual terms and conditions that are applicable. Note, however, that the license cannot be specified fully via COMEDI, and this is why we always additionally include a persistent reference to the license page under Documentation.
- Note that it is in principle possible to define the metadata for several different licenses for a given resource, but this feature is not systematically used by the Language Bank. (It might, however, be useful in case the same content is available, e.g., for research use as well as for commercial use. To avoid confusion, such parallel license documents should be very clearly separated in the Documentation section for the different types of users and purposes.)
- (*)Attribution text: Information regarding the recommended/required way of citing the resource.
- If the resource is not yet available in the Language Bank, the automatic citation instructions will not yet be found on the Language Bank website. Meanwhile, for the convenience of the resource creator/depositor, we can offer to include the corresponding citation format in the Attribution text field.
- When the resource is available in the Language Bank, the Attribution text should include the text: ”see Documentation” (English), ”katso Documentation” (Finnish). At the same time, the link to the automatically generated citation instructions should be added under Documentation.
- (*)Licensor: In case an organization or a person (or both) specifically licenses the resource to be distributed by the Language Bank, they should be listed here.
- This field can list the parties who signed the deposition license agreement with the University of Helsinki (which represents the Language Bank).
- In the case of resources that contain personal data, the Licensors should include the Data Controller.
- It is recommended that this field is filled in when possible.
- (*)Distribution rights holder: This field is relevant only in cases of ACA and RES licenses. It has not been systematically used by the Language Bank, but it is recommended to include this information for future reference, if possible.
- The Distribution rights holder is usually the University of Helsinki, in case the license was given to the Language Bank in a deposition license agreement (for resources including personal data, this is the default option 1 in agreements made after 2021) and the Language Bank is not required to ask the original rights holder’s permission for granting access to the resource.
- In some cases of RES licenses, the deposition license agreement may have been made so that the original depositor/Data Controller remains in charge of distribution and the Language Bank is only a Data Processor. In this case, the Distribution rights holder is the same entity as the Licensor.
- (*)Ipr holder: Regardless of the licensing process, the IPR holders of the resource can be listed here.
- It is recommended that this field is filled in (if possible and relevant).
- (*)Availability start/end date:
- (*)Availability start date can be used for keeping records of when the resource was first made available via the Language Bank.
- The date should be added at the time of publishing the resource.
- NB: Previously, this field has not been used generally, and accurate information may be lacking for many older resources.
- In addition, in case there are specific terms in the license agreement, for instance an embargo period that allows the Language Bank to keep the resource available to users during a specified time period only, the availability start and/or end dates can be defined here even before the resource is actually available.
- *Contact person (Required: Identification): Contact information for inquiries about the resource, e.g., regarding access to the resource or obtaining further information about the content.
- In case a specific contact person cannot be specified for the resource, please use the existing contact records for Language Bank helpdesk, ”FIN-CLARIN User support fin-clarin@helsinki.fi” or ”User support at CSC – IT Center for Science Ltd. The Language Bank of Finland kielipankki@csc.fi”.
- Resource documentation info (Recommended: Resource Documentation): Further information items that can include brief pieces of text or more elaborate references to external documentation of the resource. The Language Bank regularly includes the following details:
- (*)documentInfoType: License. A reference to the end-user license page in the Language Bank (see the internal instructions on creating and updating license pages). This information should be added as soon as the license details have been confirmed with the resource depositor.
- (*)documentInfoType: A reference to resource group page on the Language Bank website.
- The resource group page collects together all the current versions and variants of the same resource as well as the available instructions, manuals and pieces of further documentation that may be available for the group of resources.
- Especially for large groups of resources, it makes sense to create a documentInfoType item once.
- The title of the document should be ”Resource group page: <shortname>”. This will make it easy to reuse the link for other resources in the same group on COMEDI.
- The URL of the document should be the URN of the (English) resource group page.
- (*)documentInfoType: A ’Notes for the user’ page on the Language Bank website (optional; if required).
- The ’Notes for the user’ page describes known issues in the data of a particular corpus or corpus version.
- Especially for large groups of resources, it makes sense to create a documentInfoType item once.
- The title of the document should be ”<shortname>: Notes for the user” (in Finnish: ”<shortname>: Huomautuksia käyttäjälle”). This will make it easy to reuse the link for other resources in the same group on COMEDI.
- The URL of the document should be the URN of the (English) notes page.
- (*)documentUnstructured: Citation instructions. A reference to the automatically generated citation instructions on the Language Bank website. This information is added at the time when the resource is published in the Language Bank. (example)
- In the documentUnstructured dialog box, insert the text: How to cite / Viittausohje: https://www.kielipankki.fi/viittaus/?key=[CORPUS-URN]&lang=en (in the citation link, replace the text [CORPUS-URN] with the URN of the resource as mentioned in the Identifier field, but excluding the first part ”http://urn.fi/”).
- Test the citation link and make sure it works. If it does not, make sure that the information in the Portal is correct.
- (*)documentUnstructured: Change Log. (See example)
- This field can be used for keeping track of
- major changes in the metadata (such as modifying the corpus name/title) or
- minor, ”backwards compatible” changes in the resource itself (in case the changes are not significant enough to create a completely new metadata record).
- Note that there is a size limit of 1000 characters in the documentUnstructured type of field, so you should add a new CHANGE LOG item if required.
- Please use the ISO 8601 date format, e.g., 2017-07-17 (year, month, day).
- Example of a ”Change Log”:
CHANGE LOG:
<date1>: what changed;
<date2>: what changed item1
* what changed item2
* what changed item3;
<date3>: what changed
- Unfortunately, links included in documentUnstructured fields cannot currently be made clickable in COMEDI.
- Do not use the Required: Revision or Recommended: Version/Revision fields for storing the log details.
- Version (Recommended: Version > Version+Revision): These fields are not consistently used by the Language Bank. They may include information about the most recent version of the resource, or potentially about frequent updates to the resource, if planned. If versions or revisions are made, you should include a comment about them in the Change Log (as explained above, see Documentation).
- NB: this section describes the versioning and revision of the resource itself.
- The corresponding field for metadata revisions is Required: Metadata > Revision.
- (*)Resource creation info: Specify the details:
- (*)Resource creator (Recommended: Resource Creation): This field should contain the names of the people who are to be cited as the ”authors” of the resource.
- (Until now, the field has not been used consistently, but the information should be added for future reference.)
- The same author/creator names should be mentioned in the citation instructions that are also provided via the list of Corpora/Resources on the Language Bank website.
- (*)Relations (Recommended: Relations > Show): The Relation fields should be used extensively when the resource group includes several versions or variants of the resource.
- Each relation describes a specific relation that the current resource has with regard to a ”target” resource. For instance, (the current resource) IsNewVersionOf ”<Name of the target resource>, <URN of the target resource>”.
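The short name conventions in the checklist above (lowercase only, no spaces, the ”-src”, ”-korp” and ”-vrt” endings, and ”-s-” before the ending for scrambled versions) can be checked mechanically. The following Python sketch is only an illustration and is not part of COMEDI or any Language Bank tool. The function names and the sample short names are invented for the example, and it assumes that ”-s-” comes directly before the ending.

```python
import re

# Endings for the source, Korp and VRT variants, as listed in the checklist.
VARIANT_ENDINGS = ("-src", "-korp", "-vrt")


def variant_of(short_name):
    """Return the variant ending without the hyphen, or None if there is none."""
    for ending in VARIANT_ENDINGS:
        if short_name.endswith(ending):
            return ending[1:]
    return None


def check_short_name(short_name):
    """Return a list of detected problems; an empty list means none were found."""
    problems = []
    if any(ch.isspace() for ch in short_name):
        problems.append("contains spaces")
    if short_name != short_name.lower():
        problems.append("contains uppercase letters")
    # Assumption: "-s-" (scrambled version) sits directly before the ending.
    if "-s-" in short_name and not re.search(r"-s-(src|korp|vrt)$", short_name):
        problems.append("'-s-' is not directly before -src, -korp or -vrt")
    return problems


# Hypothetical short names, invented for this example.
for name in ("example-2018-s-korp", "Example corpus", "example-s-2018-vrt"):
    print(name, variant_of(name), check_short_name(name))
```

A check like this covers only the formal rules. Whether a short name fits the resource naming conventions as a whole still needs to be verified against those instructions.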
Please note: Names of Institutions (e.g. entered as Licensor) should be used consistently in COMEDI. A list of official institutions’ names can be found here.
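The Identifier and citation link conventions from the checklist can also be illustrated with a short Python sketch. It only concatenates the strings given in this document (the ”http://urn.fi/” prefix and the citation link pattern). The function names are hypothetical, and the URN is one of the examples mentioned under Resource short name.

```python
URN_PREFIX = "http://urn.fi/"
CITATION_BASE = "https://www.kielipankki.fi/viittaus/"


def identifier_url(urn):
    """Build the value for the Identifier field from a bare URN."""
    return URN_PREFIX + urn


def citation_link(identifier, lang="en"):
    """Build the citation link from the Identifier field value.

    The "http://urn.fi/" part is dropped, as the checklist requires.
    """
    urn = identifier
    if urn.startswith(URN_PREFIX):
        urn = urn[len(URN_PREFIX):]
    return f"{CITATION_BASE}?key={urn}&lang={lang}"


identifier = identifier_url("urn:nbn:fi:lb-2017070501")
print(identifier)
print("How to cite / Viittausohje: " + citation_link(identifier))
```

The generated link should still be tested by hand, as the checklist says, since the sketch cannot tell whether the citation instructions for the resource actually exist yet.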
Corpus Text Info
(to be defined)
In order to specify the languages of the text(s) included in the resource, the following links might help to find the correct language codes:
https://kotoistus.fi/suositukset/suositukset-kielet-fi-koodi/
https://iso639-3.sil.org/code_tables/639/data/all
Corpus Audio Info
(to be defined)
Corpus Video Info
(to be defined)
How to create a new metadata record on COMEDI
Before starting to create a new metadata record on COMEDI, please make sure that a record does not already exist for the resource in question.
Please see more detailed documentation on how to create a metadata record on COMEDI.
Last modified on 2024-12-03