[Importing corpus data to Korp: technical documentation]

The computing environment for corpus processing

The primary environment for processing corpora to be imported to Korp is CSC’s computing environment. You may also process corpora on your own Linux workstation, but that is recommended mainly for corpora with free licences. Please note that the previously used local Korp test server nyklait-09-01.hum.helsinki.fi is being phased out, so it is no longer a recommended corpus processing or Korp environment.

However, Korp itself cannot be run in the computing environment, so to test your corpus in practice you need to do it elsewhere, on a Korp server. The Korp test servers are currently not functional. You may set up the Korp frontend on your own Linux server, but testing new corpora locally is currently difficult, because the Korp frontend assumes that all configured corpora are present on the Korp backend server it uses. (We plan to fix this, though.)

If you wish to process corpora or install Korp locally, please note that you need a development version of Corpus Workbench (CWB): at least version 3.4.9, but preferably the latest one. (Korp does not work with the “stable” version 3.0.) See the IMS Open Corpus Workbench (CWB) page for information on accessing the CWB Subversion repository. In addition to the cwb section of the repository, you also need cwb-perl to be able to use the korp-make script; cwb-doc contains documentation for both importing corpora and using the query language CQP.
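
A local installation can be checked against the 3.4.9 minimum with a small script. The following is only a sketch: the helper name version_at_least is hypothetical, the comparison relies on GNU sort -V, and it assumes that cwb-config is on your PATH and prints a bare version number with --version.

```shell
#!/bin/sh
# Succeed if version $2 is at least version $1.
# Relies on GNU sort -V (version sort), available on typical Linux systems.
version_at_least() {
    required=$1
    found=$2
    # The smaller of the two versions sorts first; the check passes
    # when that smallest version is the required one.
    [ "$(printf '%s\n%s\n' "$required" "$found" | sort -V | head -n 1)" = "$required" ]
}

# Assumption: cwb-config is installed and reports the CWB version.
if command -v cwb-config >/dev/null 2>&1; then
    cwb_version=$(cwb-config --version)
    if version_at_least 3.4.9 "$cwb_version"; then
        echo "CWB $cwb_version is recent enough"
    else
        echo "CWB $cwb_version is too old; 3.4.9 or later is needed" >&2
    fi
else
    echo "cwb-config not found; is CWB installed and on PATH?" >&2
fi
```

A plain string comparison would rank 3.4.10 below 3.4.9, which is why the sketch uses version sort.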

[TODO: More information on how to install CWB and other tools locally.]

Processing corpora in the computing environment

Directory structure

The directories related to Korp and corpora are under the CLARIN project directory /proj/clarin/, most of them under /proj/clarin/korp/. Note that you need to be in the user group clarin to access the directories. The relevant directories are the following, relative to /proj/clarin:

Directory	Description
Corpus data directories
`korp/corpora/src/corpus/`	Data files in the source (non-VRT) format for the corpus named `corpus`
`korp/corpora/data/corpus/`	CWB data files for `corpus`
`korp/corpora/registry/`	CWB registry files for corpora
`korp/corpora/pkgs/corpus/`	Korp package files for `corpus`
`korp/corpora/log/`	Log files, in particular for `korp-make`
`korp/corpora/vrt/corpus/`	VRT and other generated files for `corpus`
`vrt-in/`	VRT files to be parsed and NER-tagged
`vrt-out/`	Parsed and NER-tagged VRT files produced from those in `vrt-in`
Code and other directories
`korp/cwb/bin/`	Executables for the CWB
`korp/git-work/Kielipankki-konversio/`	A working copy of the Kielipankki-konversio GitHub repository, kept up-to-date with the repository. (The older directory name `korp/git-work/korp-corpimport/` may also be used.)
`korp/scripts/`	A symbolic link to `korp/git-work/Kielipankki-konversio/scripts/` containing many general-purpose corpus processing scripts

Setting up your environment

To reduce the amount of typing when running corpus processing scripts, you should add the following directories to your PATH: /proj/clarin/korp/cwb/bin, /proj/clarin/korp/scripts and, if you use corpus-specific scripts, /proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus (where corpus is the name of the corpus):


  PATH=$PATH:/proj/clarin/korp/cwb/bin:/proj/clarin/korp/scripts:/proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus
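
If the PATH setting goes into a shell startup file, it can help to avoid adding the same directory more than once. A minimal sketch, assuming a POSIX shell; the helper name path_append is hypothetical and not part of the Korp scripts:

```shell
#!/bin/sh
# Append a directory to PATH only if it is not already listed there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present: do nothing
        *) PATH=$PATH:$1 ;;        # otherwise append to the end
    esac
}

path_append /proj/clarin/korp/cwb/bin
path_append /proj/clarin/korp/scripts
export PATH
```

Appending rather than prepending means that system tools with the same names keep taking precedence.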

Alternatively, you may also use your own working copy of the Kielipankki-konversio GitHub repository, in particular if you use your own private branch for new conversion scripts before pushing them to the public repository.

You may also use your own work directory /wrk/username/ in the computing environment for processing corpora or, for smaller corpora, your home directory. It is easiest to set up a subdirectory structure under /wrk/username/corpora/ similar to that under /proj/clarin/korp/corpora/, as shown in the table above. In that case, you need to set the following environment variables to simplify corpus processing:


  export CORPUS_ROOT=/wrk/username/corpora
  export CORPUS_REGISTRY=$CORPUS_ROOT/registry
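
The private directory structure itself can be created in one go. This is a sketch under assumptions: the subdirectory names are taken from the directory table above, and the default location under /tmp is only a placeholder so that the example is safe to try. For real use, set CORPUS_ROOT to /wrk/username/corpora first.

```shell
#!/bin/sh
# Create a private corpus directory tree mirroring the subdirectories
# of /proj/clarin/korp/corpora/ listed in the directory table.
# The /tmp default is a placeholder (assumption), not a recommended location.
CORPUS_ROOT=${CORPUS_ROOT:-${TMPDIR:-/tmp}/corpora-example}
CORPUS_REGISTRY=$CORPUS_ROOT/registry
export CORPUS_ROOT CORPUS_REGISTRY

for subdir in src data registry pkgs log vrt; do
    # -p creates missing parents and does not fail on existing directories
    mkdir -p "$CORPUS_ROOT/$subdir"
done

ls "$CORPUS_ROOT"
```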

Please keep in mind that neither /proj/clarin nor your personal work directory is backed up, so valuable data and scripts should be copied elsewhere. Also note that files that have not been used in 90 days are deleted automatically from the personal work directory.
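
To see which files are approaching the 90-day limit, you can list files by their last access time. The sketch below assumes that "used" corresponds to the file access time, which may not match the actual clean-up rule; the function name list_stale_files and the 80-day threshold are illustrative only.

```shell
#!/bin/sh
# List regular files under a directory that were last accessed
# more than the given number of days ago.
list_stale_files() {
    dir=$1
    days=$2
    find "$dir" -type f -atime +"$days" -print
}

# Assumption: the work directory is /wrk/<user name>, unless WORK_DIR is set.
work_dir=${WORK_DIR:-/wrk/${USER:-username}}
if [ -d "$work_dir" ]; then
    list_stale_files "$work_dir" 80
fi
```

Files listed this way are candidates for copying to a backed-up location before they are removed.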

