[Importing corpus data to Korp: technical documentation]
The primary environment for processing corpora to be imported to Korp is CSC’s computing environment. You may also process corpora on your own Linux workstation, but that is recommended mainly for corpora with free licences. Please note that the previously used local Korp test server nyklait-09-01.hum.helsinki.fi is being phased out, so it is no longer a recommended corpus processing or Korp environment.
However, Korp itself cannot be run in the computing environment, so once you wish to test your corpus in practice, you will need to do it elsewhere, on the Korp server. The Korp test servers are currently not functional. On your own Linux server, you may set up the Korp frontend, but it is currently difficult to test new corpora locally, since the Korp frontend assumes the presence of all configured corpora on the Korp (backend) server used. (We have plans to fix this, though.)
If you wish to process corpora or install Korp locally, please note that you need a development version of Corpus Workbench (CWB), at least version 3.4.9, but preferably the latest one. (Korp does not work with the “stable” version 3.0.) See the IMS Open Corpus Workbench (CWB) page for information on accessing the CWB Subversion repository. In addition to the cwb section of the repository, you will also need cwb-perl to be able to use the korp-make script. The cwb-doc section contains documentation for both importing corpora and using the query language CQP.
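To check whether a local CWB installation meets the version requirement, something like the following sketch may help. It assumes that the cwb-config helper installed with CWB is on your path and that its --version option prints a plain version number. The function name version_at_least is made up for this example, and the comparison relies on GNU sort -V.

```shell
# Return success if version $1 is at least version $2 (uses GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required=3.4.9
if command -v cwb-config >/dev/null 2>&1; then
    # Assumption: cwb-config --version prints only the version number.
    installed=$(cwb-config --version)
    if version_at_least "$installed" "$required"; then
        echo "CWB $installed is recent enough"
    else
        echo "CWB $installed is older than $required"
    fi
else
    echo "cwb-config not found on PATH"
fi
```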
[TODO: More information on how to install CWB and other tools locally.]
The directories related to Korp and corpora are under the CLARIN project directory /proj/clarin/, most of them under /proj/clarin/korp/. Note that you need to be in the user group clarin to access the directories. The relevant directories are the following, relative to /proj/clarin:
Directory | Description |
---|---|
Corpus data directories | |
korp/corpora/src/corpus/ | Corpus data files for the corpus corpus in the source (non-VRT) format |
korp/corpora/data/corpus/ | CWB data file for corpus |
korp/corpora/registry/ | CWB registry files for corpora |
korp/corpora/pkgs/corpus/ | Korp package files for corpus |
korp/corpora/log/ | Log files, in particular for korp-make |
korp/corpora/vrt/corpus/ | VRT and other generated files for corpus |
vrt-in/ | VRT files to be parsed and NER-tagged |
vrt-out/ | Parsed and NER-tagged VRT files produced from those in vrt-in |
Code and other directories | |
korp/cwb/bin/ | Executables for the CWB |
korp/git-work/Kielipankki-konversio/ | A working copy of the Kielipankki-konversio GitHub repository, kept up-to-date with the repository. (The older directory name korp/git-work/korp-corpimport/ may also be used.) |
korp/scripts/ | A symbolic link to korp/git-work/Kielipankki-konversio/scripts/ containing many general-purpose corpus processing scripts |
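Because the directories listed above are accessible only to members of the group clarin, it can be useful to check your group membership before starting. The following is a minimal sketch. The helper name in_group is made up for this example.

```shell
# Return success if the current user belongs to the given group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if in_group clarin; then
    # Show the permissions of the shared corpus directory.
    ls -ld /proj/clarin/korp/corpora
else
    echo "You are not in the group clarin, so /proj/clarin is not accessible to you" >&2
fi
```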
To reduce the amount of typing when running corpus processing scripts, you should add the following directories to your path: /proj/clarin/korp/cwb/bin, /proj/clarin/korp/scripts and, if you use corpus-specific scripts for the corpus corpus, /proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus:
PATH=$PATH:/proj/clarin/korp/cwb/bin:/proj/clarin/korp/scripts:/proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus
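If you put the path setting into a shell startup file, the directories are appended again every time the file is read. The following sketch avoids such duplicates. The function name add_to_path is made up for this example, and which startup file to use (for example ~/.bashrc for bash) depends on your shell.

```shell
# Append a directory to PATH unless it is already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already on PATH: do nothing
        *) PATH="$PATH:$1" ;;     # otherwise append it
    esac
}

add_to_path /proj/clarin/korp/cwb/bin
add_to_path /proj/clarin/korp/scripts
export PATH
```

A corpus-specific script directory can be added in the same way with a further add_to_path call.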
Alternatively, you may also use your own working copy of the Kielipankki-konversio GitHub repository, in particular if you use your own private branch for new conversion scripts before pushing them to the public repository.
You may also use your own work directory /wrk/username/ in the computing environment for processing corpora, or for smaller corpora, your home directory. It is easiest if you set up under /wrk/username/corpora/ a subdirectory structure similar to that under /proj/clarin/korp/corpora/, as in the above table. In that case, you need to set the following environment variables to simplify corpus processing:
CORPUS_ROOT=/wrk/username/corpora
CORPUS_REGISTRY=$CORPUS_ROOT/registry
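The personal directory structure and the variable settings can be combined as in the following sketch. The helper name make_corpus_dirs is made up for this example, and the subdirectory names are taken from the table of directories under korp/corpora/. The variables are exported here on the assumption that the processing scripts read them from the environment. Replace username with your own user name.

```shell
# Hypothetical helper: create a corpus directory skeleton under the given root.
make_corpus_dirs() {
    root=$1
    for sub in src data registry pkgs log vrt; do
        mkdir -p "$root/$sub"
    done
}

CORPUS_ROOT=/wrk/username/corpora
CORPUS_REGISTRY=$CORPUS_ROOT/registry
# Export the variables so that scripts started from this shell can see them.
export CORPUS_ROOT CORPUS_REGISTRY

# Create the skeleton only if the parent work directory exists.
if [ -d "$(dirname "$CORPUS_ROOT")" ]; then
    make_corpus_dirs "$CORPUS_ROOT"
fi
```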
Please keep in mind that neither /proj/clarin nor your personal work directory is backed up, so valuable data and scripts should be copied elsewhere. Also note that files that have not been used in 90 days are deleted automatically from the personal work directory.
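To see which files in the work directory may be approaching the 90-day limit, you could list files that have not been accessed for a while, as in the sketch below. It assumes that "used" refers to the file access time, which the text above does not specify. Use -mtime instead of -atime to go by modification time. The function name stale_files and the 80-day threshold are only examples.

```shell
# List regular files under directory $1 that were last accessed
# more than $2 days ago.
# Assumption: the automatic clean-up goes by access time (-atime).
stale_files() {
    find "$1" -type f -atime +"$2" -print
}

# Example (replace username with your own user name):
# stale_files /wrk/username 80
```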