[Importing corpus data to Korp: technical documentation]
Run korp-make on the (parsed and NER-tagged) VRT data to make a corpus package. For parallel corpora, run korp-make for each aligned language but do not package them; […] (korp-make-package.sh)
[…] (config.js, modes/modename_mode.js)
[…] (translations/corpora-{fi,en,sv}.js)
[…] korp.csc.fi […] (korp-install-corpora.sh)
[…] (korp-install.sh)
The directories mentioned in these instructions refer to the currently preferred environment for importing corpora to Korp, CSC’s computing environment, and the Korp server korp.csc.fi. Please refer to the documentation on the environment.
Each (sub)corpus that is to be shown as its own item in the Korp corpus selector must be imported to CWB as its own corpus, and vice versa.
It is recommended that you make a script or makefile (or similar) for the different stages of corpus processing, so that they can be easily repeated after possible fixes or other changes.
Each distinct item (at the leaf level) in the Korp corpus selector corresponds to a single corpus from the point of view of Korp and CWB. If it is desired (or necessary) that a corpus is shown as divided into subcorpora in the corpus selector, each subcorpus must be made its own corpus for CWB. The corpora from the point of view of CWB are sometimes referred to as physical corpora, as opposed to the logical corpus comprising all the subcorpora and typically having one metadata record, for example.
There are two main reasons for splitting logical corpora:
The results of each (sub)corpus in Korp are shown separately in the concordance, and the results are sorted within a (sub)corpus, not across them. The statistics result shows the number of hits in each (sub)corpus in its own column. Depending on the case, these may be desired or undesired consequences of splitting a logical corpus into several subcorpora.
In a parallel corpus, each aligned language (or variant) forms its own (sub)corpus. In addition, a parallel corpus may be divided into subcorpora.
Each corpus in Korp and CWB is identified by a corpus name or corpus id(entifier). The identifier is needed at the latest when running korp-make on the corpus, but it may be a good idea to choose the corpus identifier earlier, so that it can be used in file and directory names related to the corpus.
A corpus id may contain only lower-case letters a…z, digits 0…9, underscores _ and dashes -, and it must begin with a letter. A corpus id should be relatively short, preferably shorter than 30 characters, and never longer than 63 characters.
The corpus id should resemble the name of the corpus, but abbreviations may (and often should) be used. The short name of the corpus (as shown in the metadata record of the corpus) may sometimes be used as the corpus id, or at least it may give hints for devising a corpus id.
Do not use corpus ids that are too generic, vague or ambiguous: for example, do not use puhe for a speech corpus, since Korp has many speech corpora. Even if a corpus is the only one of its kind when it is added to Korp, other corpora of the same type may be added later.
When a logical corpus is split into multiple physical subcorpora, each subcorpus needs its own corpus id, since they are the actual corpora from the point of view of CWB and Korp. It is customary that the corpus ids of the subcorpora have a common prefix, followed by an underscore and a distinguishing suffix: for example, vks_agricola, vks_almanakat, vks_biblia for subcorpora of the Corpus of old literary Finnish (VKS). Multiple levels of subcorpora can be represented similarly: for example, coha_1810s_fic, coha_1810s_mag, coha_1820s_fic, coha_1820s_mag for the COHA corpus. (However, this convention does not preclude using the underscore to separate words in corpus ids.)
For parallel corpora, it is customary to append an underscore followed by a two-letter code of the language: for example, mulcold_fi, mulcold_ru for the MULCOLD corpus.
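The character and length rules for corpus ids given above can be checked mechanically before running any further processing. The following is a minimal Python sketch of such a check; the function names and the constants are illustrative assumptions and are not part of korp-make or of the Kielipankki-konversio scripts.

```python
import re

# Assumed pattern derived from the rules above: lower-case letters, digits,
# underscores and dashes, beginning with a letter.
CORPUS_ID_RE = re.compile(r"[a-z][a-z0-9_-]*")
MAX_LENGTH = 63          # "never longer than 63 characters"
RECOMMENDED_LENGTH = 29  # "preferably shorter than 30 characters"


def is_valid_corpus_id(corpus_id):
    """Return True if corpus_id satisfies the hard rules for corpus ids."""
    return (len(corpus_id) <= MAX_LENGTH
            and CORPUS_ID_RE.fullmatch(corpus_id) is not None)


def has_recommended_length(corpus_id):
    """Return True if corpus_id is within the recommended length."""
    return len(corpus_id) <= RECOMMENDED_LENGTH


if __name__ == "__main__":
    for candidate in ["vks_agricola", "mulcold_fi", "1810s_coha", "VKS"]:
        print(candidate, is_valid_corpus_id(candidate))
```

The check covers only the formal rules; whether an id is too generic or ambiguous still has to be judged by hand.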
For information on the corpus input file format (VRT), please read the page on Korp corpus input format. Additional information may be found in the CWB Corpus Encoding Tutorial in the CWB documentation (a possibly more up-to-date local copy retrieved from the CWB version control) and on Språkbanken’s Korp backend information page.
The input data for Korp must be UTF-8-encoded Unicode. If the original data is not in UTF-8, you need to convert it, preferably as early as possible in corpus processing, unless the script converting the data to VRT does it. In any case, you should know the character encoding of the original data.
For conversion, you can use for example the iconv program:
iconv -f latin1 -t utf-8 < input > output
You can also use iconv at the beginning of a conversion pipeline to avoid intermediate files.
For more information on the required and recommended character content of VRT files, please refer to the section Character encoding and character content on the page for the Korp corpus input format.
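If the conversion script is written in Python, the encoding conversion can also be done inside the script instead of with iconv. The following is a minimal sketch, assuming that the source encoding is known; the function names are hypothetical and the default encoding latin-1 is only an example.

```python
def to_utf8(data, source_encoding="latin-1"):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src_path, dst_path, source_encoding="latin-1"):
    """Convert a text file to UTF-8 line by line to keep memory use low."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Decoding fails with an exception if the data does not match the given encoding in the case of encodings such as UTF-16 or ASCII, but latin-1 accepts any byte sequence, so a wrong guess may go unnoticed; this is one reason to find out the encoding of the original data beforehand.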
To convert the original data to the VRT format, you can use an existing script, make a copy of an existing script and modify it for the corpus, or write your own conversion script. The resulting VRT should represent tokenized and sentence-split text.
Even if you need to write your own script for the basic conversion, the tokenization and sentence-splitting scripts could be shared. (We might eventually consider moving those stages to korp-make, in which case its input could be untokenized running text with XML tags carrying structural attribute information.)
The GitHub repository Kielipankki-konversio for corpus conversion scripts contains some scripts for converting from source formats to VRT. Some conversion scripts are in the scripts/ subdirectory for general-purpose scripts, and others in subdirectories specific to a corpus, corpus group or corpus origin under the corp/ subdirectory, unfortunately not completely consistently.
The scripts currently in the Kielipankki-konversio repository are written mainly in Python, Bash (or plain Bourne shell) or Perl.
[TODO: List available conversion scripts]
If you find in the conversion script repository a script that does much of what is needed but not quite, you can make a copy of it and modify the copy. To make it easier to see what changes you have made, preferably commit the copy to the Git repository before making any changes. You can develop the script on your own private branch first and merge it to the master branch later.
For your own scripts, you may use other programming languages instead of Python, Bash and Perl, but to make it easier for others to modify the script or a copy of it, lesser-known languages should be avoided. Even though this document refers to conversion scripts, you may use languages not considered as scripting languages. The conversion script should run on Linux, preferably (also) in CSC’s computing environment, unless it is justified to make the script run only elsewhere.
You should add your own script to the Kielipankki-konversio GitHub repository. Scripts (and associated data) specific to a corpus (or group of corpora, corpus origin (owner) or corpus type (such as speech)) should be placed in a subdirectory of the top-level directory corp/. You can develop the script on your own (private) branch first and merge it to the master branch when you think the script is stable enough.
For Python scripts, you may find some useful functions (and classes) in the modules under scripts/korpimport/ (package korpimport), and for Bash scripts in scripts/korp-lib.sh.
Depending on the size of the corpus and the input file structure, the output VRT may be a single VRT file, in particular for small corpora, or a VRT file corresponding to each input file, or something in between, for example, a VRT file corresponding to each directory.
After converting the original data to VRT, you should validate the resulting VRT against the guidelines on the VRT format page. A VRT validator script will be provided later.
[TODO: Add instructions on using Jussi’s VRT validator once available.]
For Finnish and other languages for which a parser and a named-entity recognizer are available, these tools should be run on the validated VRT data. Currently these programs are run by Jussi Piitulainen. The steps of the process are the following:
[…] /proj/clarin/vrt-in directory (or a subdirectory under it) and make sure their group is clarin and they have read permissions for the group (chgrp clarin file; chmod g+r file).
[…] /proj/clarin/vrt-out/ and informs you.
[…] /proj/clarin/vrt-out/ for further processing.
[…] korp-make on the VRT data
[…] korp-make does
The korp-make script processes VRT files to make a Korp corpus package containing CWB data files and Korp MySQL database import files. korp-make replaces a number of the steps previously needed for generating all the required data based on the VRT file. In particular, korp-make
[…] datefrom and dateto attributes to text structures in the required format, based on other attributes (unless already present in the VRT)
[…] timefrom and timeto attributes
[…] text structures based on the values of a structural attribute (if requested)
[…] id attributes to text, paragraph and sentence structures (unless already present)
[…] /, <, > and |)
[…] cwb-encode)
[…] cwb-make)
[…] .info file for the corpus
Note that korp-make does not currently fully support processing the alignment information for parallel corpora.
If you find that korp-make does not suit the corpus you are processing, please inform Jyrki Niemi. You may also modify the code yourself if you wish.
korp-make has a large number of options. The most important options are described below. Run korp-make --help to list all options.
korp-make is run as follows:
korp-make [options] [corpus] [input_file ...]
The corpus id must be specified either as the first non-option argument (corpus) or via the option --corpus-id.
The input files may be either (possibly compressed) VRT files, or ZIP or (possibly compressed) tar archives containing such VRT files. If no input files are specified, korp-make reads from the standard input.
korp-make uses corpus directories under the corpus root /proj/clarin/korp/corpora by default, but that can be overridden by assigning the appropriate directory to the CORPUS_ROOT environment variable or via the --corpus-root option. This corpus root is referred to as corpus_root below.
korp-make creates a corpus package under corpus_root/pkgs/corpus_id/ (unless --no-package has been specified). The name of the corpus package is of the form corpus_korp_yyyymmdd[-nn].ext, where corpus is the name of the corpus, yyyymmdd is the date of the most recent corpus file, the optional nn is a number distinguishing between corpus packages with the same date, and ext is the filename extension for the package, by default tgz for a gzipped tar archive.
korp-make writes a log file to corpus_root/log/korp-make_corpus_yyyymmddhhmmss.log, where corpus is the corpus id and yyyymmddhhmmss the time of invocation of korp-make.
Because of the large number of options, korp-make also supports reading them from a configuration file given with the option --config-file. It is recommended that you create a korp-make configuration file for each corpus you process and add it to the Kielipankki-konversio Git repository, in the subdirectory for the corpus under the top-level directory corp/.
The configuration file is written in (a variant of) the INI file format, as recognized by the Python configparser module, with two extensions:
In a configuration file, option names are specified without the leading dashes; the dashes within the option name may be replaced with underscores, and CamelCase is also allowed. For example, the option name --text-sort-transform may be written in a configuration file as text-sort-transform, text_sort_transform or TextSortTransform. An option that takes no argument on the command line needs to be given the value 1 in the configuration file.
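To illustrate the naming rule above, the following Python sketch maps the three accepted spellings of an option name to the same canonical form and plugs the mapping into configparser. It is only an assumption about how such a normalization could be written, not korp-make's actual implementation; the section name korp-make and the function names are hypothetical.

```python
import configparser
import re


def normalize_option_name(name):
    """Map --opt-name, opt_name and OptName to the canonical opt-name."""
    name = name.lstrip("-")
    # Insert a dash at each lower-case/digit to upper-case boundary (CamelCase).
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", name)
    return name.replace("_", "-").lower()


def parse_options(text, section="korp-make"):
    """Read options from INI text, normalizing the option names."""
    parser = configparser.ConfigParser()
    # configparser applies optionxform to every option name it reads.
    parser.optionxform = normalize_option_name
    parser.read_string(text)
    return dict(parser[section])


SAMPLE = """\
[korp-make]
text_sort_attribute = title
AddStructureIds = text sentence
force = 1
"""

if __name__ == "__main__":
    print(parse_options(SAMPLE))
```

Normalizing the names once at reading time means that the rest of the program only needs to know the dashed form used on the command line.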
For example, the following is a typical korp-make configuration file:
[TODO: ADD EXAMPLE]
--config-file, --configuration-file FILE
--force
[…] korp-make on a corpus after changes to the original VRT.
--times
--quiet
--licence-type LIC
[…] PUB, ACA, ACA-Fi or RES. The licence type should not include any additional conditions, such as “+NC”.
--lbr-id URN
[…] [urn:nbn:fi:lb-]YYYYMMNNN[@LBR], where YYYYMM is year and month and NNN 3 to 5 digits; the bracketed parts are added if left out. The LBR id is usually the same as the metadata URN for the corpus.
--input-attrs, --input-fields ATTRS
[…] ”word” or token), separated by spaces. The default is ”ref lemma pos msd dephead deprel nertag”, which is appropriate for dependency-parsed and NER-tagged VRT data.
--corpus-date DATE
--corpus-date-pattern PATTERN
[…] ”ELEM ATTR REGEX”: extract date information from the attribute ATTR of element (structural attribute) ELEM using the regular expression REGEX. ELEM and ATTR may be ”*” (any element or attribute) or they may contain several attribute or element names separated by vertical bars. REGEX may contain the named groups (subpatterns) Y, M and D, written in Python’s regular expression syntax, which extract year, month and day; for example, ”(?P<Y>[0-9]{4})” (without the quotation marks) would recognize a year. (However, this particular case is also covered by the default pattern, so you need not specify it explicitly.) REGEX may also cover both the start and end date, in which case the subpatterns for the start date are Y1, M1 and D1, and those for the end date, Y2, M2 and D2. If REGEX does not contain named subpatterns, the first group is recognized as the start date and the possible second group as the end date.
--corpus-date-full-order ORDER
[…] ”ymd”, ”dmy”, ”mdy”.
--corpus-date-ranges
--lemgram-posmap, --posmap POSMAP_FILE
[…] corp/lemgram_posmap_tdt.tsv in the Kielipankki-konversio repository).
--wordpict-relmap, --wordpicture-relation-map RELMAP_FILE
[…] corp/wordpict_relmap_tdt.tsv in the Kielipankki-konversio repository).
[…] text structures
--text-sort-attribute ATTR
[…] text elements in the corpus by the value of the attribute ATTR; sort by byte values into ascending order, without taking the locale into account.
--text-sort-transform TRANSFORM
[…] s/.../.../) TRANSFORM to get the key to be used for sorting. This option may be specified multiple times, in which case the substitutions are applied in the specified order. If read from a configuration file, TRANSFORM is treated as enclosed in single quotes, so you need not protect $ or \.
--add-structure-ids, --add-element-ids STRUCTLIST
[…] id attributes to the structures listed in STRUCTLIST (separated by spaces). The attribute values are positive integers in ascending order. If STRUCTLIST is an empty string, do not add id attributes. Default: ”text paragraph sentence”.
--overwrite-structure-ids, --overwrite-element-ids
[…] --add-structure-ids.
--scramble STRUCTS
[…] ”sentence paragraph” scrambles both ways.
--scramble-seed SEED
[…] ”0” for random seed (non-reproducible order) (default: corpus id)
--no-lemmas-without-boundaries, --skip-lemmas-without-boundaries
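The named groups Y, M and D used in the REGEX part of --corpus-date-pattern are ordinary Python named groups, so a pattern can be tried out in Python before it is passed to korp-make. The sketch below shows only how such groups extract date parts from an attribute value; it does not reproduce korp-make's own date handling, and the function name and the FULL_DATE pattern are illustrative assumptions.

```python
import re

# The year-only pattern quoted in the option description.
YEAR_ONLY = r"(?P<Y>[0-9]{4})"
# A hypothetical pattern for ISO-style dates, for illustration only.
FULL_DATE = r"(?P<Y>[0-9]{4})-(?P<M>[0-9]{2})-(?P<D>[0-9]{2})"


def extract_date(regex, value):
    """Return (year, month, day) as strings, None for parts not matched.

    Return None if the regular expression does not match at all.
    """
    match = re.search(regex, value)
    if match is None:
        return None
    groups = match.groupdict()
    return (groups.get("Y"), groups.get("M"), groups.get("D"))


if __name__ == "__main__":
    print(extract_date(YEAR_ONLY, "Published in 1897"))
    print(extract_date(FULL_DATE, "2017-12-01"))
```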
If possible, you should provide a read-me file or other documentation on the corpus and conversion process, and conversion or other scripts used to process the corpus, to be included in the corpus package.
--no-package
--package-readme-file, --readme-file FILE
--package-doc-dir, --doc-dir DIR
[…] ”doc” in the corpus package.
--package-doc-file, --doc-file FILE
[…] ”doc” in the corpus package; FILE may contain shell wildcards.
--package-script-dir, --script-dir DIR
[…] ”scripts” of the corpus package.
--package-script-file, --script-file FILE
[…] ”scripts” of the corpus package; FILE may contain shell wildcards.
A parallel corpus consists of two or more separate corpora which have been aligned with each other using alignment attributes. By convention, the id of each separate corpus of a multilingual parallel corpus is of the form corpus_lg, where corpus is the id of the whole corpus and lg is a (typically) two-letter code for the language. If a corpus has multiple versions for the same language, a number may be added after the language code.
Parallel corpora require an alignment or linking structure (element) with an attribute that marks the links between the aligned corpora, typically id. The linking structure may be a separate structure, preferably named link, or sentence or paragraph may be used if the alignment is one-to-one at the level of sentences or paragraphs.
Importing a parallel corpus to Korp (CWB) currently requires the following steps. In the descriptions, corpus is the corpus id, l1 and l2 are language codes and linkstruct is the linking structure, typically link, paragraph or sentence.
1. […] korp-make with the option --no-package:
korp-make --no-package [options] corpus_l1 corpus_l1.vrt ...
korp-make --no-package [options] corpus_l2 corpus_l2.vrt ...
2. […] id is used to mark aligned structures):
cwb-align -v -r /v/corpora/registry -o corpus_l1_l2.align -V link_id corpus_l1 corpus_l2 link
cwb-align -v -r /v/corpora/registry -o corpus_l2_l1.align -V link_id corpus_l2 corpus_l1 link
3. […]
cwb-regedit -r /v/corpora/registry corpus_l1 :add :a corpus_l2
cwb-regedit -r /v/corpora/registry corpus_l2 :add :a corpus_l1
4. […]
cwb-align-encode -v -r /v/corpora/registry -D corpus_l1_l2.align
cwb-align-encode -v -r /v/corpora/registry -D corpus_l2_l1.align
5. […]
korp-make-corpus-package.sh --target-corpus-root /v/corpora --database-format tsv --include-vrt-dir [other_options] corpus corpus_l1 corpus_l2
Step 2 assumes that the corpora have already been aligned. cwb-align also provides a simple alignment method; please see its manual page for more information.
A simple script will eventually be provided for steps 2 to 4 (or possibly including 5). [TODO: Update the instructions once the script is available.]
If the corpus contains more than two aligned languages, the above commands have to be repeated for each language pair as appropriate.
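Until the planned script is available, the repeated cwb-align, cwb-regedit and cwb-align-encode commands can be generated for every language pair with a few lines of Python. The sketch below only builds the command lines as strings, following the commands shown above; it assumes that every language is aligned with every other one, and the function name and its parameters are hypothetical.

```python
from itertools import permutations


def alignment_commands(corpus, languages, registry="/v/corpora/registry",
                       link_struct="link", link_attr="id"):
    """Return the alignment command lines for all ordered language pairs."""
    pairs = list(permutations(languages, 2))
    commands = []
    # Compute the alignments in both directions.
    for l1, l2 in pairs:
        commands.append(
            f"cwb-align -v -r {registry} -o {corpus}_{l1}_{l2}.align"
            f" -V {link_struct}_{link_attr}"
            f" {corpus}_{l1} {corpus}_{l2} {link_struct}")
    # Declare the alignment attributes in the registry files.
    for l1, l2 in pairs:
        commands.append(
            f"cwb-regedit -r {registry} {corpus}_{l1} :add :a {corpus}_{l2}")
    # Encode the alignment files.
    for l1, l2 in pairs:
        commands.append(
            f"cwb-align-encode -v -r {registry} -D {corpus}_{l1}_{l2}.align")
    return commands


if __name__ == "__main__":
    for command in alignment_commands("mulcold", ["fi", "ru"]):
        print(command)
```

Printing the commands instead of running them makes it possible to review them, or to drop the pairs that are not actually aligned, before executing them in a shell.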
For a corpus with a CLARIN ACA or RES licence, the licence type and LBR id (metadata URN) need to be given to korp-make with the options --licence-type and --lbr-id; see above. The licence type of the corpus can be seen on the metadata record of the corpus, under the heading “Licence”. The LBR id is required for a RES corpus and for an ACA corpus with the possibility to apply for corpus-specific access rights.
Alternatively, or if you do not use korp-make, you can generate the TSV files containing the appropriate information with the script korp-make-auth-info.sh, which takes the same options. With it, you can add the same information to several corpora at the same time, which is useful if the corpus has been divided into multiple subcorpora in Korp.
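The LBR id format described for --lbr-id (an optional urn:nbn:fi:lb- prefix, YYYYMM followed by 3 to 5 digits, and an optional @LBR suffix) can be checked and completed in a script that prepares the options for several subcorpora. The following Python sketch is one reading of that description and is not taken from korp-make or korp-make-auth-info.sh; the function name is hypothetical.

```python
import re

# Optional prefix, YYYYMM plus 3 to 5 digits, optional suffix.
LBR_ID_RE = re.compile(r"(?:urn:nbn:fi:lb-)?([0-9]{6}[0-9]{3,5})(?:@LBR)?")


def normalize_lbr_id(value):
    """Return the full LBR id, adding the prefix and suffix if left out."""
    match = LBR_ID_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid LBR id: {value!r}")
    return f"urn:nbn:fi:lb-{match.group(1)}@LBR"


if __name__ == "__main__":
    print(normalize_lbr_id("201412345"))
```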
When a new or a user-visibly updated corpus has been configured in Korp and is ready to use, it is worthwhile to write a short piece of news that is shown in Korp’s internal newsdesk, opened from the bell icon near the top right corner of the Korp page.
Korp news are currently stored in the independent branch news/master of the Kielipankki-korp-frontend GitHub repository. Since the branch is independent (has a completely separate set of files from the other branches), it is better to have a separate directory for its workspace, instead of switching back and forth between it and the usual Korp frontend code in the same workspace. To create a separate workspace, clone the Kielipankki-korp-frontend repository as follows:
git clone --branch news/master git@github.com:CSCfi/Kielipankki-korp-frontend.git Kielipankki-korp-frontend-news
If you are using Git version 1.7.10 or newer, you can add the option --single-branch, so that the clone only contains the branch news/master:
git clone --branch news/master --single-branch git@github.com:CSCfi/Kielipankki-korp-frontend.git Kielipankki-korp-frontend-news
Alternatively, if you are using Git version 2.5 or newer, you can create a separate worktree for the news; for more information, please see git help worktree.
Pieces of news on the production Korp are in the subdirectory korp and those on the Korp laboratory (Korp beta) in the subdirectory korpbeta. The news are in a single file per language: fi.txt contains Finnish, en.txt English and sv.txt Swedish news. A piece of news must exist in English for it to appear in any language, but in general, you should also write a Finnish version and, at least for Swedish corpora, preferably a Swedish one as well. The news are shown in the user interface language of Korp or, if a piece of news is not available in that language, in English.
Each piece of news begins with an HTML/XML comment that contains a heading for the piece of news, followed by the date of the news in the ISO format YYYY-MM-DD:
<!-- A new corpus added 2017-12-01 -->
If there is more than one piece of news for a single day, you need to append a lowercase letter to the date, for example, 2017-12-01a, 2017-12-01b. You can also specify an expiration date after which the piece of news will no longer be shown: it is added after the date of the news in the same format. The end date is hardly necessary for news on corpora, but it is useful for news on service breaks, for example.
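A heading comment of the form described above can be checked with a small Python function before compiling the news. This is only a sketch: it assumes that the expiration date is separated from the news date by whitespace, which the description does not state explicitly, and it is not the parser used by the news compilation script. The names are hypothetical.

```python
import re

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
HEADING_RE = re.compile(
    r"<!--\s*(?P<title>.*?)\s+"
    rf"(?P<date>{DATE})(?P<suffix>[a-z])?"  # date with an optional letter
    rf"(?:\s+(?P<expires>{DATE}))?"         # assumed: whitespace-separated
    r"\s*-->"
)


def parse_news_heading(line):
    """Return the parts of a news heading comment as a dict, or None."""
    match = HEADING_RE.fullmatch(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_news_heading("<!-- A new corpus added 2017-12-01 -->"))
```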
The actual text body of a piece of news comes after the heading and an empty line. The body text may use Markdown markup for formatting. A piece of news need not (and usually should not) be long, but it should have a link to an added corpus in Korp or its metadata URN. Please look at previous pieces of news for examples. One option is to write a piece of news also to Kielipankki’s news (in English), so that the body of the piece of news in Korp may essentially be a link to the piece of news in question. Such a link should preferably open to a new window (or tab), in which case the link must be written as HTML and not with Markdown markup:
<a href="https://www.kielipankki.fi/uutiset/…" target="_blank">Uutinen…</a>
You should add new news at the top of the file, so that the news are in the file in reverse temporal order.
When you have added a piece of news to the desired files, the files have to be compiled to JSON by running the following command in the top directory of the Git workspace for news:
./compile.bash
This command creates JSON files in the subdirectory json.
If you wish to see what the piece of news looks like in Korp before committing it to Git, copy the file json/korpnews.json (or for the beta version, json/korpbetanews.json) to the subdirectory news/json of the Korp frontend installation directory.
After this, you need to commit the changes (in particular, the changed JSON files) and push the commits to the master GitHub repository, so that they will be visible when updating the Korp frontend installation (korp-install.sh also installs the news):
git commit .
git push