Importing a new corpus to Korp (or updating an old one)

Steps in brief

Decide if the corpus should be split into subcorpora
Decide the identifier of the corpus and its possible subcorpora
Encode the corpus and import it to Korp
1. Convert corpus data to the VRT format.
2. Validate the VRT data and otherwise verify its correctness.
3. Run parser and named-entity recognizer on the VRT data (if the tools exist for the language of the corpus)
4. Run korp-make on the (parsed and NER-tagged) VRT data to make a corpus package. For parallel corpora,
  1. run korp-make for each aligned language but do not package them;
  2. add alignment information; and
  3. package all the languages to a single package (korp-make-corpus-package.sh)
Add the corpus to the Korp user interface:
1. Add corpus configuration to Korp’s configuration file (config.js, modes/modename_mode.js)
2. Add translations of corpus attribute names to translation files (translations/corpora-{fi,en,sv}.js)
3. Commit the changes to the configuration to the Kielipankki-korp-frontend repository in GitHub
Add a piece of news on the corpus to Korp’s newsdesk.
Install the corpus to korp.csc.fi
1. Install the corpus package (korp-install-corpora.sh)
2. Install the changes to the Korp configuration from the GitHub repository (korp-install.sh)

General instructions

The directories mentioned in these instructions refer to the currently preferred environment for importing corpora to Korp, CSC’s computing environment, and the Korp server korp.csc.fi. Please refer to the documentation on the environment.

Each (sub)corpus that is to be shown as its own item in the Korp corpus selector must be imported to CWB as its own corpus, and vice versa.

It is recommended that you make a script or makefile (or similar) for the different stages of corpus processing, so that they can be easily repeated after possible fixes or other changes.

Corpus organization

Each distinct item (at the leaf level) in the Korp corpus selector corresponds to a single corpus from the point of view of Korp and CWB. If it is desired (or necessary) that a corpus is shown as divided into subcorpora in the corpus selector, each subcorpus must be made its own corpus for CWB. The corpora from the point of view of CWB are sometimes referred to as physical corpora, as opposed to the logical corpus comprising all the subcorpora and typically having one metadata record, for example.

There are two main reasons for splitting logical corpora:

The corpus consists of clearly defined subcorpora that users may wish to select individually, for example, by year, genre or author. That can also be done with conditions on text attributes containing the same information in the extended search but not in the simple search. There may be more than one level of divisions, so the subcorpora may form a tree in the Korp corpus selector.
The corpus is so large that it needs to be split into subcorpora (for example, Suomi24). The size of a physical corpus should preferably be kept under 500 million tokens.

The results of each (sub)corpus in Korp are shown separately in the concordance, and the results are sorted within a (sub)corpus, not across them. The statistics result shows the number of hits in each (sub)corpus in its own column. Depending on the case, these may be desired or undesired consequences of splitting a logical corpus into several subcorpora.

In a parallel corpus, each aligned language (or variant) forms its own (sub)corpus. In addition, a parallel corpus may have a division to subcorpora.

Corpus identifier

Each corpus in Korp and CWB is identified by a corpus name or corpus id(entifier). The identifier is needed at the latest when running korp-make on the corpus, but it may be a good idea to choose the corpus identifier earlier, so that it can be used in file and directory names related to the corpus.

A corpus id may contain only lower-case letters a…z, digits 0…9, underscores _ and dashes -, and it must begin with a letter. A corpus id should be relatively short, preferably shorter than 30 characters, and never longer than 63 characters.
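The rules above can be checked mechanically before any files are named after the corpus id. The following is a minimal Python sketch, not part of korp-make; the function name `check_corpus_id` and the wording of the messages are assumptions made for illustration.

```python
import re

# Rule stated above: lower-case a-z, digits, "_" and "-", beginning with a letter.
CORPUS_ID_RE = re.compile(r"[a-z][a-z0-9_-]*")
MAX_LENGTH = 63          # never longer than this
RECOMMENDED_LENGTH = 30  # preferably shorter than this


def check_corpus_id(corpus_id):
    """Return a list of problems found in corpus_id (empty if none)."""
    problems = []
    if not CORPUS_ID_RE.fullmatch(corpus_id):
        problems.append("invalid characters or does not begin with a letter")
    if len(corpus_id) > MAX_LENGTH:
        problems.append("longer than 63 characters")
    elif len(corpus_id) >= RECOMMENDED_LENGTH:
        problems.append("warning: 30 characters or longer")
    return problems
```

For example, `check_corpus_id("vks_agricola")` returns an empty list, whereas an id beginning with a digit or containing upper-case letters yields a problem message.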

The corpus id should resemble the name of the corpus, but abbreviations may (and often should) be used. The short name of the corpus (as shown in the metadata record of the corpus) may sometimes be used as the corpus id, or at least it may give hints for devising a corpus id.

Do not use too generic, vague or ambiguous corpus ids: for example, puhe for a speech corpus, since Korp has many speech corpora. Even if a corpus is the only one of its kind at the moment of adding it to Korp, other corpora of the same type may be added later.

When a logical corpus is split into multiple physical subcorpora, each subcorpus needs its own corpus id, since they are the actual corpora from the point of view of CWB and Korp. It is customary that the corpus ids of the subcorpora have a common prefix, followed by an underscore and a distinguishing suffix: for example, vks_agricola, vks_almanakat, vks_biblia for subcorpora of the Corpus of old literary Finnish (VKS). Multiple levels of subcorpora can be represented similarly: for example, coha_1810s_fic, coha_1810s_mag, coha_1820s_fic, coha_1820s_mag for the COHA corpus. (However, this convention does not preclude using the underscore to separate words in corpus ids.)

For parallel corpora, it is customary to append an underscore followed by a two-letter code of the language: for example, mulcold_fi, mulcold_ru for the MULCOLD corpus.

Converting a corpus and importing it to Korp

For information on the corpus input file format (VRT), please read the page on Korp corpus input format. Additional information may be found in the CWB Corpus Encoding Tutorial in the CWB documentation (a possibly more up-to-date local copy retrieved from the CWB version control) and on Språkbanken’s Korp backend information page.

Character encoding

The input data for Korp must be UTF-8-encoded Unicode. If the original data is not in UTF-8, you need to convert it, preferably as early as possible in corpus processing, unless the script converting the data to VRT does it. In any case, you should know the character encoding of the original data.

For conversion, you can use for example the iconv program:


  iconv -f latin1 -t utf-8 < input > output

You can also use iconv at the beginning of a conversion pipeline to avoid intermediate files.
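If the conversion script is written in Python, the same re-encoding can be done without an external program. This is a minimal sketch that mirrors the iconv command above; the function name `convert_to_utf8` is hypothetical, and the source encoding must be the one you have verified for the original data.

```python
import shutil


def convert_to_utf8(input_path, output_path, source_encoding="latin-1"):
    """Re-encode a text file to UTF-8, like iconv -f latin1 -t utf-8."""
    # newline="" keeps the line endings exactly as they are in the input.
    with open(input_path, encoding=source_encoding, newline="") as infile, \
         open(output_path, "w", encoding="utf-8", newline="") as outfile:
        shutil.copyfileobj(infile, outfile)
```

Decoding fails with a UnicodeDecodeError if the input is not valid in the given encoding, which can serve as an early warning for encodings such as UTF-8 or ASCII. Note that latin-1 accepts any byte sequence, so a wrong guess there goes unnoticed.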

For more information on the required and recommended character content of VRT files, please refer to the section Character encoding and character content on the page for the Korp corpus input format.

Convert the data to VRT (custom script)

To convert the original data to the VRT format, you can use an existing script, make a copy of an existing script and modify it for the corpus, or write your own conversion script. The resulting VRT should represent tokenized and sentence-split text.

Even if you may need to write your own script for doing the basic conversion, the tokenization and sentence-splitting scripts could be shared. (We might eventually consider moving those stages to korp-make, in which case its input could be untokenized running text with XML tags carrying structural attribute information.)
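To illustrate the shape of the expected output, here is a rough Python sketch that turns running text into VRT with one token per line inside sentence and text structures. The sentence splitter and tokenizer are deliberately naive regular expressions and only stand in for proper tools; the function name `text_to_vrt` and the attribute name in the example are assumptions. Please check the page on the Korp corpus input format for the authoritative requirements.

```python
import re
from xml.sax.saxutils import escape


def text_to_vrt(text, attrs):
    """Return VRT lines for one text with the given text attributes."""
    attr_str = "".join(
        ' {}="{}"'.format(name, escape(value, {'"': "&quot;"}))
        for name, value in attrs.items())
    lines = ["<text{}>".format(attr_str)]
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        lines.append("<sentence>")
        # Naive tokenization: runs of word characters or single other symbols.
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            lines.append(escape(token))  # encode &, < and > as XML entities
        lines.append("</sentence>")
    lines.append("</text>")
    return lines
```

For example, `"\n".join(text_to_vrt("Hei maailma! Mitä kuuluu?", {"title": "Testi"}))` produces a text structure with two sentence structures of three tokens each.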

Using existing scripts

The GitHub repository Kielipankki-konversio for corpus conversion scripts contains some scripts for converting from source formats to VRT. Some conversion scripts are in the scripts/ subdirectory for general-purpose scripts, and others in subdirectories specific to a corpus, corpus group or corpus origin under the corp/ subdirectory, unfortunately not completely consistently.

The scripts currently in the Kielipankki-konversio repository are written mainly in Python, Bash (or plain Bourne shell) or Perl.

[TODO: List available conversion scripts]

Modifying an existing script

If you find in the conversion script repository a script that does much of what is needed but not quite, you can make a copy of it and modify the copy. To make it easier to see what changes you have made, preferably commit the copy to the Git repository before making any changes. You can develop the script on your own private branch first and merge it to the master branch later.

Writing your own script

For your own scripts, you may use other programming languages instead of Python, Bash and Perl, but to make it easier for others to modify the script or a copy of it, lesser-known languages should be avoided. Even though this document refers to conversion scripts, you may use languages not considered as scripting languages. The conversion script should run on Linux, preferably (also) in CSC’s computing environment, unless it is justified to make the script run only elsewhere.

You should add your own script to the Kielipankki-konversio GitHub repository. Scripts (and associated data) specific to a corpus (or group of corpora, corpus origin (owner) or corpus type (such as speech)) should be placed in a subdirectory of the top-level directory corp/. You can develop the script on your own (private) branch first and merge it to the master branch when you think the script is stable enough.

For Python scripts, you may find some useful functions (and classes) in the modules under scripts/korpimport/ (package korpimport), and for Bash scripts in scripts/korp-lib.sh.

VRT output

Depending on the size of the corpus and the input file structure, the output VRT may be a single VRT file, in particular for small corpora, or a VRT file corresponding to each input file, or something in between, for example, a VRT file corresponding to each directory.

Validate the converted VRT

After converting the original data to VRT, you should validate the resulting VRT against the guidelines on the VRT format page. A VRT validator script will be provided later.

[TODO: Add instructions on using Jussi’s VRT validator once available.]
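Until the validator is available, a few basic properties can be checked with a short script. The sketch below is not the validator referred to above and covers only two checks chosen here as assumptions: structural tags are properly nested, and all token lines have the same number of tab-separated fields. The function name `check_vrt` is hypothetical.

```python
import re

# A structural tag on a line of its own: <name attr="value" ...> or </name>.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_]*)(?:\s[^>]*)?>")


def check_vrt(lines):
    """Return a list of (line_number, message) for basic VRT problems."""
    problems = []
    open_tags = []      # stack of the currently open structures
    field_count = None  # number of fields on the first token line
    num = 0
    for num, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("<!--"):
            continue  # empty lines and comments are ignored in this sketch
        match = TAG_RE.fullmatch(line)
        if match:
            closing, name = match.groups()
            if not closing:
                open_tags.append(name)
            elif open_tags and open_tags[-1] == name:
                open_tags.pop()
            else:
                problems.append((num, "unexpected </{}>".format(name)))
        else:
            count = line.count("\t") + 1
            if field_count is None:
                field_count = count
            elif count != field_count:
                problems.append(
                    (num, "{} fields, expected {}".format(count, field_count)))
    for name in reversed(open_tags):
        problems.append((num, "unclosed <{}>".format(name)))
    return problems
```

The function can be applied to an open file, for example `check_vrt(open("corpus.vrt", encoding="utf-8"))`, where the file name is only an example.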

Run parser and named-entity recognizer on the VRT data

For Finnish and other languages with a parser and named-entity recognizer available, they should be run on the validated VRT data. Currently these programs are run by Jussi Piitulainen. The steps of the process are the following:

Package the VRT files to zip archives. For smallish corpora, a single zip archive suffices. If the corpus is packaged into multiple zip files, please name them in a way that their order can be easily reconstructed. [How large should or may a single input zip file be?]
Copy the files to the /proj/clarin/vrt-in directory (or a subdirectory under it) and make sure their group is clarin and they have read permissions for the group (chgrp clarin file; chmod g+r file).
Send email to Jussi informing of the file names.
Jussi runs the parser and NER tagger on the files, places the results as zip files in /proj/clarin/vrt-out/ and informs you.
Get the zips containing parsed VRT files from /proj/clarin/vrt-out/ for further processing.
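The packaging step can be scripted so that the archive names sort in the intended order. The following Python sketch writes the VRT files into zero-padded, numbered zip archives. The function name `package_vrt_files`, the naming scheme and the default of 100 files per archive are assumptions; the appropriate archive size is still an open question above.

```python
import zipfile
from pathlib import Path


def package_vrt_files(vrt_files, out_dir, basename, files_per_zip=100):
    """Write vrt_files into numbered zip archives and return their paths."""
    vrt_files = sorted(Path(f) for f in vrt_files)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = [vrt_files[i:i + files_per_zip]
              for i in range(0, len(vrt_files), files_per_zip)]
    zip_paths = []
    for number, chunk in enumerate(chunks, start=1):
        # Zero-padded numbers keep the archives in order when sorted by name.
        zip_path = out_dir / "{}_{:03d}.zip".format(basename, number)
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for vrt_file in chunk:
                # Only the file name is stored, so the names must be unique.
                archive.write(vrt_file, arcname=vrt_file.name)
        zip_paths.append(zip_path)
    return zip_paths
```

The resulting archives still need to be copied to /proj/clarin/vrt-in and given the group and permissions described above (chgrp clarin file; chmod g+r file).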

Run `korp-make` on the VRT data

What `korp-make` does

The korp-make script processes VRT files to make a Korp corpus package containing CWB data files and Korp MySQL database import files. korp-make replaces a number of the steps previously needed for generating all the required data based on the VRT file. In particular, korp-make

adds lemmas without compound boundaries (for corpora with lemmas containing compound boundary markers)
adds lemgrams (combined lemma and part of speech code, for corpora with lemma and part-of-speech annotations)
adds datefrom and dateto attributes to text structures in the required format, based on other attributes (unless already present in the VRT)
adds timefrom and timeto attributes
sorts text structures based on the values of a structural attribute (if requested)
adds unique (within a corpus) id attributes to text, paragraph and sentence structures (unless already present)
encodes certain special characters in the attribute values for Korp (space, /, <, > and |)
encodes the data for CWB (using cwb-encode)
indexes and compresses the CWB data (cwb-make)
creates an .info file for the corpus
creates database tables for corpus time data
creates database tables for lemgram frequency information
creates database tables for the Korp word picture
converts name attributes to the format used in Korp
creates a Korp corpus package for installation and archival
imports data to the Korp MySQL database (by request; neither supported nor needed in CSC’s computing environment)

Note that korp-make does not currently fully support processing the alignment information for parallel corpora.

If you find that korp-make does not suit the corpus you are processing, please inform Jyrki Niemi. Alternatively, you may modify the code yourself if you wish.

korp-make has a large number of options. The most important options are described below. Run korp-make --help to list all options.

Usage

korp-make is run as follows:


  korp-make [options] [corpus] [input_file ...]

The corpus id must be specified either as the first non-option argument (corpus) or via the option --corpus-id.

The input files may be either (possibly compressed) VRT files, or ZIP or (possibly compressed) tar archives containing such VRT files. If no input files are specified, korp-make reads from the standard input.

korp-make uses corpus directories under the corpus root /proj/clarin/korp/corpora by default, but that can be overridden by assigning the appropriate directory to the CORPUS_ROOT environment variable or via the --corpus-root option. This corpus root is referred to as corpus_root below.

Output

korp-make creates a corpus package under corpus_root/pkgs/corpus_id/ (unless --no-package has been specified). The name of the corpus package is of the form corpus_korp_yyyymmdd[-nn].ext, where corpus is the name of the corpus, yyyymmdd is the date of the most recent corpus file, the optional nn a number distinguishing between corpus packages with the same date, and ext the filename extension for the package, by default tgz for a gzipped tar archive.

korp-make writes a log file to corpus_root/log/korp-make_corpus_yyyymmddhhmmss.log, where corpus is the corpus id and yyyymmddhhmmss the time of invocation of korp-make.
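When writing wrapper scripts or makefiles, it can be handy to compute the expected package and log file names. The sketch below only restates the naming patterns described above; the function names are hypothetical, and it assumes that the corpus part of the package name is the corpus id and that nn is a two-digit number.

```python
from datetime import date, datetime
from pathlib import PurePosixPath


def package_path(corpus_root, corpus_id, newest_file_date, number=None,
                 ext="tgz"):
    """Build corpus_root/pkgs/corpus_id/corpus_korp_yyyymmdd[-nn].ext."""
    suffix = "" if number is None else "-{:02d}".format(number)
    name = "{}_korp_{:%Y%m%d}{}.{}".format(
        corpus_id, newest_file_date, suffix, ext)
    return PurePosixPath(corpus_root) / "pkgs" / corpus_id / name


def log_path(corpus_root, corpus_id, started):
    """Build corpus_root/log/korp-make_corpus_yyyymmddhhmmss.log."""
    name = "korp-make_{}_{:%Y%m%d%H%M%S}.log".format(corpus_id, started)
    return PurePosixPath(corpus_root) / "log" / name
```

For example, `package_path("/proj/clarin/korp/corpora", "vks_agricola", date(2017, 12, 1))` gives a path ending in pkgs/vks_agricola/vks_agricola_korp_20171201.tgz.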

Configuration file

Because of the large number of options, korp-make also supports specifying them via a configuration file specified by the option --config-file. It is recommended that you create a korp-make configuration file for each corpus you process and add it to the Kielipankki-konversio Git repository, in the subdirectory for the corpus under the top-level directory corp/.

The configuration file is written in (a variant of) the INI file format, as recognized by the Python configparser module, with two extensions:

options are specified without a section heading
an option that can be specified multiple times on the command line may be specified multiple times with different values also in the configuration file

In a configuration file, option names are specified without the leading dashes, and the dashes within the option name may be replaced with underscores, and CamelCase is also allowed. For example, the option name --text-sort-transform may be written in a configuration file as text-sort-transform, text_sort_transform or TextSortTransform. An argumentless option on the command line needs to be given the value 1 in the configuration file.

For example, the following is a typical korp-make configuration file:


[TODO: ADD EXAMPLE]
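The option-name rules described above (no section heading, repeatable options, and dashes, underscores or CamelCase in names) can be illustrated with a small Python sketch. This is not how korp-make itself reads the file, and the configuration text is made up for illustration: the option names are taken from this page, but the values do not describe any real corpus, and the use of "=" as the separator and "#" for comments is assumed from the INI format.

```python
import re

CONFIG_TEXT = """\
# Illustrative values only; not the configuration of any real corpus.
licence-type = ACA
input_attrs = ref lemma pos msd dephead deprel nertag
TextSortAttribute = datefrom
text-sort-transform = s/^unknown$/9999/
text-sort-transform = s/-//g
force = 1
"""


def canonical_option_name(name):
    """Map text_sort_transform and TextSortTransform to text-sort-transform."""
    # Insert a dash before each capital letter that is not at the start.
    name = re.sub(r"(?<!^)(?=[A-Z])", "-", name)
    return name.replace("_", "-").lower()


def parse_config(text):
    """Return a dict from canonical option names to lists of values."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        name, _, value = line.partition("=")
        key = canonical_option_name(name.strip())
        # A repeated option accumulates its values in the order given.
        options.setdefault(key, []).append(value.strip())
    return options
```

With this sketch, `parse_config(CONFIG_TEXT)["text-sort-transform"]` contains both substitutions in the order in which they appear in the text, and the argumentless option --force is represented by the value 1.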

Options

General

--config-file, --configuration-file FILE: Read FILE as an INI-style configuration file.
--force: Force all stages of processing by first removing all the output files if they exist. This is often needed when rerunning korp-make on a corpus after changes to the original VRT.
--times: Output the amount of CPU time used for each step.
--quiet: Do not output information on the processing steps. Using this option is not recommended, because the log file will not contain information on the processing, either.

Corpus licence information

--licence-type LIC: Set the corpus licence type (category) to LIC, where LIC is one of PUB, ACA, ACA-Fi or RES. The licence type should not include any additional conditions, such as “+NC”.
--lbr-id URN: Set the LBR id of the corpus to URN, which is of the form [urn:nbn:fi:lb-]YYYYMMNNN[@LBR], where YYYYMM is year and month and NNN 3 to 5 digits; the bracketed parts are added if left out. The LBR id is usually the same as the metadata URN for the corpus.
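The completion of a shortened LBR id can be sketched as follows. The regular expression is derived from the form given above; the function name `normalize_lbr_id` is hypothetical, the digits in the example are made up, and korp-make may validate the value differently.

```python
import re

# [urn:nbn:fi:lb-]YYYYMMNNN[@LBR], where NNN is 3 to 5 digits
LBR_ID_RE = re.compile(r"(?:urn:nbn:fi:lb-)?([0-9]{6}[0-9]{3,5})(?:@LBR)?")


def normalize_lbr_id(value):
    """Return value with the urn:nbn:fi:lb- prefix and @LBR suffix added."""
    match = LBR_ID_RE.fullmatch(value)
    if match is None:
        raise ValueError("not a valid LBR id: {!r}".format(value))
    return "urn:nbn:fi:lb-{}@LBR".format(match.group(1))
```

For example, `normalize_lbr_id("201407301")` returns urn:nbn:fi:lb-201407301@LBR, and a value that already has the prefix and suffix is returned unchanged.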

Attributes

--input-attrs, --input-fields ATTRS: Specify the names of the positional attributes in the input in the order they are in the VRT, excluding the first one (”word” or token), separated by spaces. The default is ”ref lemma pos msd dephead deprel nertag”, which is appropriate for dependency-parsed and NER-tagged VRT data.
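The relation between the attribute list and the columns of a token line can be shown with a short Python sketch. The function name `parse_token_line` and the example token are assumptions; the default attribute list is the one given above.

```python
DEFAULT_INPUT_ATTRS = "ref lemma pos msd dephead deprel nertag"


def parse_token_line(line, input_attrs=DEFAULT_INPUT_ATTRS):
    """Return a dict from attribute names to values for one VRT token line."""
    # The first column is always the word form and is not listed in the option.
    names = ["word"] + input_attrs.split()
    values = line.rstrip("\n").split("\t")
    if len(values) != len(names):
        raise ValueError(
            "expected {} fields, got {}".format(len(names), len(values)))
    return dict(zip(names, values))
```

If the VRT has fewer or differently ordered columns, for example only lemma and pos, the value for --input-attrs would be "lemma pos", and the same string can be passed as input_attrs here.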

Date information

--corpus-date DATE: Use DATE as the fixed date of all texts in the corpus; ”unknown” if not known.
--corpus-date-pattern PATTERN: Recognize corpus date information based on PATTERN of the form ”ELEM ATTR REGEX”: extract date information from the attribute ATTR of element (structural attribute) ELEM using the regular expression REGEX. ELEM and ATTR may be ”*” (any element or attribute), or they may contain several element or attribute names separated by vertical bars. REGEX is a Python regular expression that may contain the named groups (subpatterns) Y, M and D, which extract year, month and day; for example, ”(?P<Y>[0-9]{4})” (without the quotation marks) would recognize a year. (However, this particular case is also covered by the default pattern, so you need not specify it explicitly.) REGEX may also cover both the start and end date, in which case the subpatterns for the start date are Y1, M1 and D1, and those for the end date, Y2, M2 and D2. If REGEX does not contain named subpatterns, the first group is recognized as the start date and the possible second group as the end date.
--corpus-date-full-order ORDER: Recognize full dates in the order ORDER, which must be one of ”ymd”, ”dmy”, ”mdy”.
--corpus-date-ranges: Make the patterns recognize date ranges with different start and end days.
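It may help to try out a date pattern in Python before passing it to korp-make. The sketch below interprets a pattern of the form ”ELEM ATTR REGEX” with the named groups described above. It is only an approximation for experimenting: the function name `extract_dates` is hypothetical, patterns without named groups are not handled, and korp-make may apply the regular expression differently (this sketch searches anywhere in the attribute value).

```python
import re


def extract_dates(pattern, elem, attrs):
    """Return (start, end) as (Y, M, D) tuples of strings, or None."""
    elems, attr_names, regex = pattern.split(" ", 2)
    if elems != "*" and elem not in elems.split("|"):
        return None
    for name, value in attrs.items():
        if attr_names != "*" and name not in attr_names.split("|"):
            continue
        match = re.search(regex, value)
        if match:
            groups = match.groupdict()
            if "Y1" in groups:  # separate start and end dates
                start = tuple(groups.get(key) for key in ("Y1", "M1", "D1"))
                end = tuple(groups.get(key) for key in ("Y2", "M2", "D2"))
            else:               # a single date is both the start and the end
                start = end = tuple(
                    groups.get(key) for key in ("Y", "M", "D"))
            return start, end
    return None
```

For example, the pattern ”text date (?P<Y>[0-9]{4})-(?P<M>[0-9]{2})-(?P<D>[0-9]{2})” applied to a text element with date="2017-12-01" gives the year 2017, month 12 and day 01 for both the start and the end date. The attribute name date is only an example.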

Annotation mappings

--lemgram-posmap, --posmap POSMAP_FILE: Use POSMAP_FILE as the mapping file from the corpus parts of speech to those used in Korp lemgrams (the category codes of SUC2). The file should contain lines with corpus POS and lemgram POS separated by a tab. The default mapping is for the TDT POS codes (corp/lemgram_posmap_tdt.tsv in the Kielipankki-konversio repository).
--wordpict-relmap, --wordpicture-relation-map RELMAP_FILE: Use RELMAP_FILE as the mapping file from corpus dependency relation codes to those used in the Korp word picture. [TODO: Add a link to a list of the relation codes.] The file should contain lines with corpus dependency relation code and word picture dependency relation code separated by a tab. The default mapping is for the TDT dependency relation codes (corp/wordpict_relmap_tdt.tsv in the Kielipankki-konversio repository).
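Both mapping files have the same two-column, tab-separated layout, so a custom mapping can be checked by loading it into a dictionary. This is a minimal sketch; the function name `read_mapping` and the codes in the usage example are illustrative only, and the sketch assumes that the file contains no comment lines.

```python
def read_mapping(lines):
    """Read lines of the form SOURCE<tab>TARGET into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Fails with ValueError if a line does not have exactly two columns.
        source, target = line.split("\t")
        mapping[source] = target
    return mapping
```

For example, `read_mapping(["N\tnn\n", "V\tvb\n"])` returns `{"N": "nn", "V": "vb"}`; a line with a missing or extra tab raises an error, which points out malformed lines early.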

Sorting `text` structures

--text-sort-attribute ATTR: Sort text elements in the corpus by the value of the attribute ATTR; sort by byte values into ascending order, without taking the locale into account.
--text-sort-transform TRANSFORM: Transform the sorting attribute values with the Perl regular expression substitution (s/.../.../) TRANSFORM to get the key to be used for sorting. This option may be specified multiple times, in which case the substitutions are applied in the specified order. If read from a configuration file, TRANSFORM is treated as enclosed in single quotes, so you need not protect $ nor \.
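The effect of the two sorting options can be sketched in Python as follows. Python's re.sub stands in here for the Perl substitutions that korp-make uses, so the replacement syntax differs (\1 instead of $1); the function name `sort_texts` and the attribute names in the example are assumptions.

```python
import re


def sort_texts(texts, attr, transforms=()):
    """Sort texts (dicts of attributes) by the transformed value of attr.

    transforms is a sequence of (regex, replacement) pairs applied in order.
    """
    def sort_key(text):
        key = text[attr]
        for regex, replacement in transforms:
            key = re.sub(regex, replacement, key)
        # Compare UTF-8 byte values, without taking the locale into account.
        return key.encode("utf-8")
    return sorted(texts, key=sort_key)
```

For example, sorting texts with date values such as 01.12.2017 and 15.03.2016 with the transform `(r"^(\d+)\.(\d+)\.(\d+)$", r"\3\2\1")` orders them chronologically, since the keys become 20171201 and 20160315. Note that byte order places upper-case letters before lower-case ones and letters such as Ä after z.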

Structure ids

--add-structure-ids, --add-element-ids STRUCTLIST: Add id attributes to the structures listed in STRUCTLIST (separated by spaces). The attribute values are positive integers in ascending order. If STRUCTLIST is an empty string, do not add id attributes. Default: ”text paragraph sentence”.
--overwrite-structure-ids, --overwrite-element-ids: Overwrite possible existing id attribute values in the structures listed with --add-structure-ids.
--scramble STRUCTS: Scramble structures listed in STRUCTS, separated by spaces. Allowed structures are sentence and paragraph; they are scrambled within the immediately containing structure, typically within paragraph and text, respectively; ”sentence paragraph” scrambles both ways.
--scramble-seed SEED: use the string SEED as the random number generator seed for scrambling data; ”0” for random seed (non-reproducible order) (default: corpus id)
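The idea of seeded scrambling within the containing structure can be sketched in Python as follows. The sketch uses Python's own random number generator, so it does not reproduce the order produced by korp-make for the same seed; the function name `scramble_sentences` is hypothetical.

```python
import random


def scramble_sentences(paragraphs, seed):
    """Return paragraphs with the sentences of each paragraph shuffled."""
    rng = random.Random(seed)  # the same seed gives the same order every run
    result = []
    for sentences in paragraphs:
        shuffled = list(sentences)
        rng.shuffle(shuffled)  # sentences never move to another paragraph
        result.append(shuffled)
    return result
```

Using a fixed string such as the corpus id as the seed makes the scrambled order reproducible when the corpus is processed again, which corresponds to the default described above.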

Controlling the output

--no-lemmas-without-boundaries, --skip-lemmas-without-boundaries: Do not add lemmas without compound boundaries. This option should be used if the lemmas in the input have no compound boundaries. Currently the only recognized compound boundary marker is the vertical bar as used by the TDT.

Packaging

If possible, you should provide a read-me file or other documentation on the corpus and conversion process, and conversion or other scripts used to process the corpus, to be included in the corpus package.

--no-package: Do not create a corpus package.

--package-readme-file, --readme-file FILE: Include FILE as a top-level read-me file in the corpus package. FILE may contain shell wildcards (but braces are not expanded).
--package-doc-dir, --doc-dir DIR: Include DIR as a documentation directory ”doc” in the corpus package.
--package-doc-file, --doc-file FILE: Include FILE as a documentation file in directory ”doc” in the corpus package; FILE may contain shell wildcards.
--package-script-dir, --script-dir DIR: Include DIR as a (conversion) script directory ”scripts” of the corpus package.
--package-script-file, --script-file FILE: Include FILE as a (conversion) script file in directory ”scripts” of the corpus package; FILE may contain shell wildcards.

Processing a parallel corpus

A parallel corpus consists of two or more separate corpora which have been aligned with each other using alignment attributes. By convention, the id of each separate corpus of a multilingual parallel corpus is of the form corpus_lg, where corpus is the id of the whole corpus and lg is a (typically) two-letter code for the language. If a corpus has multiple versions for the same language, a number may be added after the language code.

The content of a parallel corpus requires an alignment or linking structure (element) with an attribute, typically id, that marks the links between the aligned corpora. The linking structure may be a separate structure, preferably named link, or sentence or paragraph may be used if the alignment is one-to-one at the level of sentences or paragraphs.

Importing a parallel corpus to Korp (CWB) currently requires the following steps. In the descriptions, corpus is the corpus id, l1 and l2 are language codes and linkstruct is the linking structure, typically link, paragraph or sentence.

1. Encode the corpora for each language using korp-make with the option --no-package:


korp-make --no-package [options] corpus_l1 corpus_l1.vrt ...
korp-make --no-package [options] corpus_l2 corpus_l2.vrt ...

2. Create alignment files (in the current directory, unless the file names contain directory components; assumes that the attribute id is used to mark aligned structures). The commands below use link as the linking structure; for another linkstruct, replace link with linkstruct and link_id with linkstruct_id:


cwb-align -v -r /v/corpora/registry -o corpus_l1_l2.align -V link_id corpus_l1 corpus_l2 link
cwb-align -v -r /v/corpora/registry -o corpus_l2_l1.align -V link_id corpus_l2 corpus_l1 link

3. Add the alignment attributes to the registry files of the individual corpora:


cwb-regedit -r /v/corpora/registry corpus_l1 :add :a corpus_l2
cwb-regedit -r /v/corpora/registry corpus_l2 :add :a corpus_l1

4. Encode the alignment attributes for CWB:


cwb-align-encode -v -r /v/corpora/registry -D corpus_l1_l2.align
cwb-align-encode -v -r /v/corpora/registry -D corpus_l2_l1.align

5. Package all the aligned corpora to a single Korp corpus package:


korp-make-corpus-package.sh --target-corpus-root /v/corpora --database-format tsv --include-vrt-dir [other_options] corpus corpus_l1 corpus_l2

Step 2 assumes that the corpora have already been aligned. cwb-align also provides a simple alignment method; please see its manual page for more information.

A simple script will eventually be provided for the steps 2 to 4 (or possibly including 5). [TODO: Update the instructions once the script is available.]

If the corpus contains more than two aligned languages, the above commands have to be repeated for each language pair as appropriate.
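With three or more languages, writing the commands by hand becomes error-prone, so they can be generated instead. The Python sketch below produces the cwb-align, cwb-regedit and cwb-align-encode command lines shown above for every ordered pair of languages. It assumes that all languages are aligned with each other; the function name `alignment_commands` is hypothetical, and the registry path is the one used in the commands above.

```python
from itertools import permutations


def alignment_commands(corpus, languages, linkstruct="link",
                       registry="/v/corpora/registry"):
    """Return alignment command lines for every ordered language pair."""
    commands = []
    for l1, l2 in permutations(languages, 2):
        source = "{}_{}".format(corpus, l1)
        target = "{}_{}".format(corpus, l2)
        align_file = "{}_{}_{}.align".format(corpus, l1, l2)
        commands += [
            "cwb-align -v -r {} -o {} -V {}_id {} {} {}".format(
                registry, align_file, linkstruct, source, target, linkstruct),
            "cwb-regedit -r {} {} :add :a {}".format(
                registry, source, target),
            "cwb-align-encode -v -r {} -D {}".format(registry, align_file),
        ]
    return commands
```

For example, `print("\n".join(alignment_commands("mulcold", ["fi", "ru"])))` prints six commands corresponding to the alignment steps above, grouped by language pair rather than by step. The output can be reviewed and then run in a shell.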

Processing a corpus with an ACA or RES licence

For a corpus with a CLARIN ACA or RES licence, the licence type and LBR id (metadata URN) need to be given to korp-make with the options --licence-type and --lbr-id; see above. The licence type of the corpus can be seen on the metadata record of the corpus, under the heading “Licence”. The LBR id is required for a RES corpus and for an ACA corpus with the possibility to apply for corpus-specific access rights.

Alternatively, or if you do not use korp-make, you can generate the TSV files containing the appropriate information with the script korp-make-auth-info.sh, which takes the same options. With it, you can add the same information to several corpora at the same time, which is useful if the corpus has been divided into multiple subcorpora in Korp.

Adding news to the Korp newsdesk

When a new or a user-visibly updated corpus has been configured in Korp and is ready to use, it is worthwhile to write a short piece of news that is shown in Korp’s internal newsdesk, opened from the bell icon near the top right corner of the Korp page.

Initializing a Git workspace for the news

Korp news items are currently stored in the independent branch news/master of the Kielipankki-korp-frontend GitHub repository. Since the branch is independent (has a completely separate set of files from the other branches), it is better to have a separate directory for its workspace, instead of switching back and forth between it and the usual Korp frontend code in the same workspace. To create a separate workspace, clone the korp-frontend repository as follows:


git clone --branch news/master git@github.com:CSCfi/Kielipankki-korp-frontend.git Kielipankki-korp-frontend-news

If you are using Git version 1.7.10 or newer, you can add the option --single-branch, so that the clone only contains the branch news/master:


git clone --branch news/master --single-branch git@github.com:CSCfi/Kielipankki-korp-frontend.git Kielipankki-korp-frontend-news

Alternatively, if you are using Git version 2.5 or newer, you can create a separate worktree for the news; for more information, please see git help worktree.

Adding pieces of news

Pieces of news on the production Korp are in the subdirectory korp and those on the Korp laboratory (Korp beta) in the subdirectory korpbeta. There is one news file per language: fi.txt contains Finnish, en.txt English and sv.txt Swedish news. A piece of news must have an English version for it to appear in any language, but in general, you should also write a Finnish version and, at least for Swedish corpora, preferably a Swedish one as well. News is shown in the user interface language of Korp, or in English if a piece of news is not available in that language.

Each piece of news begins with an HTML/XML comment that contains a heading for the piece of news, followed by the date of the news in the ISO format YYYY-MM-DD:


<!-- A new corpus added 2017-12-01 -->

If there is more than one piece of news for a single day, you need to append a lowercase letter to the date, for example, 2017-12-01a, 2017-12-01b. You can also specify an expiration date after which the piece of news will not be shown any more: it is added after the date of the news in the same format. The end date is hardly necessary for news on corpora, but it is useful for news on service breaks, for example.
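The heading format can be checked with a regular expression before compiling the news. The Python sketch below reflects only the format as described above; the function name `parse_news_heading` is hypothetical, and the actual news compilation script may accept or reject headings differently.

```python
import re

HEADING_RE = re.compile(
    r"<!--\s*(?P<title>.*?)\s+"
    r"(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2})(?P<letter>[a-z]?)"
    r"(?:\s+(?P<expires>[0-9]{4}-[0-9]{2}-[0-9]{2}))?\s*-->"
)


def parse_news_heading(line):
    """Return a dict with title, date, letter and expires, or None."""
    match = HEADING_RE.fullmatch(line.strip())
    return match.groupdict() if match else None
```

For the heading shown above, the function returns the title "A new corpus added" and the date "2017-12-01", with an empty letter and no expiration date. A heading without a date returns None.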

The actual text body of a piece of news comes after the heading and an empty line. The body text may use Markdown markup for formatting. A piece of news need not (and usually should not) be long, but it should have a link to an added corpus in Korp or its metadata URN. Please look at previous pieces of news for examples. One option is to write a piece of news also to Kielipankki’s news (in English), so that the body of the piece of news in Korp may essentially be a link to the piece of news in question. Such a link should preferably open to a new window (or tab), in which case the link must be written as HTML and not with Markdown markup:


<a href="https://www.kielipankki.fi/uutiset/…" target="_blank">Uutinen…</a>

You should add new pieces of news at the top of the file, so that the file is in reverse chronological order.

Compiling and committing news

When you have added a piece of news to the desired files, the files have to be compiled to JSON by running the following command in the top directory of the Git workspace for news:


./compile.bash

This command creates JSON files in the subdirectory json.

If you wish to see what the piece of news looks like in Korp before committing it to Git, copy the file json/korpnews.json (or for the beta version, json/korpbetanews.json) to the subdirectory news/json of the Korp frontend installation directory.

After this, you need to commit the changes (in particular, the changed JSON files) and push the commits to the master GitHub repository, so that they will be visible when updating the Korp frontend installation (korp-install.sh also installs the news):


git commit .
git push

More information on using Git for Korp.
