This guideline defines the minimal steps necessary to prepare a corpus for publication as a download at the Language Bank of Finland.
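The corpus needs a name, a license, a README.txt file, and descriptive metadata with a PID. Each of these is described below.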
If an older version of the same corpus exists, decide whether to update the metadata in the existing description or to create new metadata. See our Lifecycle Model for details. The name of the corpus will be visible in the column "Description" of the page kielipankki.fi/download/, and the text should link to the metadata record. The name is essentially the same as the metadata long name of the corpus, possibly shortened a bit if the long name is too long. If the directory does not have a metadata page, just create a descriptive name for it (e.g. the semfinlex corpus has subcorpora that are grouped under a common directory).
The package must have a license that tells users what they can and cannot do with the corpus. Less restrictive licenses are preferred. The license should be stated in README.txt or in a LICENSE.txt file.
The README.txt should contain at least the name of the corpus, a short description of its content, and a PID pointing to the metadata record that describes this resource. If the package contains several directories or files, the README.txt should also describe the directory and filename scheme. The license can be given in README.txt or in a separate LICENSE.txt.
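The file requirements above can be checked automatically before a package is published. The following is only a minimal sketch, not an official Language Bank tool: the function name check_package is hypothetical, and it assumes that a README.txt which states the license contains the word "license" or "licence" somewhere in its text.

```python
from pathlib import Path


def check_package(package_dir):
    """Return a list of problems found in a download package directory."""
    root = Path(package_dir)
    problems = []

    readme = root / "README.txt"
    readme_text = ""
    if not readme.is_file():
        problems.append("README.txt is missing")
    else:
        readme_text = readme.read_text(encoding="utf-8")
        if not readme_text.strip():
            problems.append("README.txt is empty")

    # The license may live in README.txt or in a separate LICENSE.txt.
    # Assumption: a README that states the license mentions the word
    # "license" or "licence" (both contain the substring "licen").
    has_license_file = (root / "LICENSE.txt").is_file()
    has_license_in_readme = "licen" in readme_text.lower()
    if not (has_license_file or has_license_in_readme):
        problems.append("no license found in README.txt or LICENSE.txt")

    return problems
```

An empty list means that the package passed these basic checks. Whether the corpus name and description are adequate still has to be reviewed by a person.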
The descriptive metadata describes a specific instance of the corpus. It is not a manual, but it helps a user searching for corpora to determine whether the corpus is worth downloading. The PID pointing to the metadata is the persistent identifier of the corpus version in question. The metadata in turn points to the download location of the corpus and explains where the manual can be found (e.g. inside the package or on a separate web page). Every update gets a new version number. The PID of the metadata needs to be mentioned in the README.txt of the downloadable packages.
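Whether a README.txt actually mentions a PID can also be checked with a short script. The sketch below is illustrative only: it assumes that the PIDs are URNs of the form urn:nbn:fi:lb- followed by digits, and the function name find_pids is hypothetical. Adjust the pattern if the PIDs in use look different.

```python
import re

# Assumption: PIDs are URNs of the form "urn:nbn:fi:lb-<digits>".
PID_PATTERN = re.compile(r"urn:nbn:fi:lb-\d+")


def find_pids(readme_text):
    """Return all PID-like strings found in the README text, in order."""
    return PID_PATTERN.findall(readme_text)
```

An empty result means that no PID of the assumed form was found, so the README.txt should be reviewed before the package is published.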
A quick reminder of the topics above.
A case example: The semfinlex corpus was first published in Korp with beta status and was advertised in Korp. After it had been available for testing for two weeks, the beta status was removed, and no backward-incompatible changes to the corpus were allowed from then on. The download packages were created at this point. The corpus (including the freshly generated download packages) was then advertised to a wider audience in the portal.
Most of the corpora have the name, README.txt, metadata etc. in English but some are in Finnish.