[Importing corpus data to Korp: technical documentation]
This page contains information on the input formats VRT and HRT used for Kielipankki’s Korp text corpus search service. The information is primarily aimed at Kielipankki’s staff importing corpora to Korp, but it may also be useful for corpus providers if they can affect the corpus format. For further information, please contact fin-clarin [at] helsinki.fi.
The section VRT (Kielipankki flavour) in brief below describes briefly the important characteristics of VRT for those who wish or need to produce corpus data in the VRT format.
The separate document VRT format in a nutshell describes the contents of a VRT formatted document in simple terms, e.g., for a user who downloads and needs to use resources in VRT format.
The original format of corpora to be imported to Korp may vary widely. The data may be plain text, HTML, a tabular format such as CoNLL-X, or an XML format such as TEI. In any case, it is important that the data format is as consistent as possible, since consistency makes corpus processing faster and the result better.
The data in the original format is converted to VRT (VeRticalized Text), possibly via HRT (HoRizontal Text). VRT is the input format for the IMS Open Corpus Workbench (CWB) software underlying Korp, whereas HRT is an untokenized intermediate format.
VRT is a token-oriented columnar text format: each token (word) is on its own line together with its possible annotation attributes (positional attributes) separated by tabs. The structure of the text is represented with XML-style tags (structural attributes) on their own lines. Start tags may contain XML-style attributes (structural attribute annotations) for the structure.
Note that this document describes the VRT format as used in Kielipankki (the Language Bank of Finland) for Korp and downloadable data, with some additional conventions and constraints compared with the more general input format recognized by CWB. Also note that although CWB does not mandate all the requirements and recommendations described in this document, violating them may make parts of the corpus data difficult or impossible to search in Korp and may make the VRT data harder to process with other tools.
This section describes briefly the important characteristics of the Kielipankki flavour of VRT for those who wish or need to produce corpus data in the VRT format. For more details and an example, please see the following sections.
Positional attributes: the word form (word) as the first attribute; the rest can vary
Use _ for empty or missing attribute values
The names of the positional attributes are listed in a comment before the first token, for example: <!-- #vrt positional-attributes: word lemma pos msd -->
Structures used by Korp: text, paragraph (optional) and sentence
A text is a logical unit with common characteristics and metadata
Each token should be in a sentence enclosed in a text
If the data has paragraphs, each sentence should be enclosed in a paragraph
Other structures are allowed, such as chapter between text and paragraph, or clause or ne (name expression) within sentence; they need not cover all tokens
No single root element is required: the data may consist of several texts
Structures may cross, for example sentence and page; however, avoid this if not needed
Attributes of text structures: datefrom, dateto, timefrom, timeto: the creation date and time interval of the original text: yyyymmdd for dates, hhmmss for times; all empty if the creation time is not known (see below for more information)
Attributes of text, paragraph and sentence structures: id: an identifier of the structure, unique within a single corpus
Names of structures and attributes: only a…z and 0…9, attribute names also underscores _; cannot begin with a digit
Multi-valued (feature-set) attribute values: | at the beginning and end and separating individual values; e.g. |Adj|Noun|Verb|
Empty feature-set value: |
Ranked attribute values: each value of a feature set is followed by : and a number (integer or float); e.g. |Adj:0.7|Noun:0.22|Verb:0.08|
Encode <, > and & as &lt;, &gt; and &amp; in positional and structural attribute values
Encode " as &quot; in structural attribute values
Do not use numeric character references (&#nnnn;, &#xhhhh;) and HTML character entity references, such as &auml;
Comments <!-- … --> on their own lines are allowed and ignored (no leading or trailing spaces on the line)
Comments of the form <!-- #vrt key: value --> contain information used and generated by Kielipankki VRT Tools
Parallel corpora: alignment is marked with link structures with id attributes
link structures with the same id in different parts are linked with each other
A link structure may cover one or more sentences or paragraphs, for example
The following is an example of the VRT format as used by Korp. Tab characters are represented as → followed by spaces. Structural elements are text, paragraph and sentence, and the positional attributes are word form (word), the number of the token within the sentence (ref), lemma (lemma), lemma with compound boundaries marked (lemmacomp), part of speech (pos), morphological analysis (msd), dependency head number (dephead) and dependency relation (deprel).
<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel -->
<text filename="EuroParl Corpus/fi-en/fi/ep-00-01-17.txt" title="" datefrom="20000117" dateto="20000117" timefrom="000000" timeto="235959">
<paragraph id="1">
<sentence id="1" line="2">
Istuntokauden→ 1→ istuntokausi→ istunto#kausi→ N→ N Gen Sg→ 2→ obj
uudelleenavaaminen→ 2→ uudelleenavaaminen→ uudelleen#avaaminen→ N→ N Nom Sg→ 0→ main
</sentence>
</paragraph>
<paragraph id="2">
<sentence id="2" line="4">
Julistan→ 1→ julistaa→ julistaa→ V→ V Prs Act Sg1→ 0→ main
perjantaina→ 2→ perjantai→ perjantai→ N→ N Ess Sg→ 1→ advl
joulukuun→ 3→ joulukuu→ joulu#kuu→ N→ N Gen Sg→ 5→ attr
17.→ 4→ 17.→ 17.→ Num→ Num Digit→ 5→ attr
päivänä→ 5→ päivä→ päivä→ N→ N Ess Sg→ 1→ advl
...
.→ 26→ .→ .→ Punct→ Punct→ -→ -
</sentence>
</paragraph>
...
</text>
Note that you need to use double quotation marks around the structural attribute annotation values. Even though the CWB encoder also understands single quotes, other tools processing VRT data assume double quotes.
The first line of the example is a positional attributes comment listing the names of the token (positional) attributes in the order they appear in the data. This comment is a Kielipankki extension to the VRT format.
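As an illustration of the line types in the example above, the following is a minimal Python sketch of a VRT reader. The function name read_vrt and the returned (kind, value) pairs are hypothetical and not part of any Kielipankki tool; the sketch assumes that literal < characters in tokens are encoded, so that any line beginning with < is a tag or a comment.

```python
import re

# Matches the Kielipankki positional-attributes comment shown above.
POS_ATTRS_RE = re.compile(r"<!-- #vrt positional-attributes: (.*?) -->")


def read_vrt(lines):
    """Yield (kind, value) pairs for the lines of a VRT document.

    kind is one of "attrs", "comment", "tag" or "token" (hypothetical labels).
    """
    attr_names = ["word"]  # assumption: only the word form if no comment is seen
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # empty lines should not occur; skip them defensively
        match = POS_ATTRS_RE.fullmatch(line)
        if match:
            attr_names = match.group(1).split()
            yield ("attrs", attr_names)
        elif line.startswith("<!--") and line.endswith("-->"):
            yield ("comment", line)
        elif line.startswith("<") and line.endswith(">"):
            yield ("tag", line)
        else:
            # Token line: tab-separated values in the declared attribute order.
            yield ("token", dict(zip(attr_names, line.split("\t"))))
```

Each token is returned as a dictionary keyed by attribute name, so downstream code does not depend on the column order of a particular corpus.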
Note that the initial input VRT often contains only the attribute word (word form) and the rest are added in the annotation process in Kielipankki.
In contrast to XML, VRT does not require a single root element (structural attribute), so a VRT input may consist of a sequence of texts, for example. Another difference from XML is that VRT allows crossing structural attributes; for example:
<page id="p1">
...
<sentence id="s8">
...
</page>
<page id="p2">
...
</sentence>
...
</page>
However, crossing structures make it impossible to use XML tools for VRT data, so they should be avoided unless they are necessary.
As the VRT format resembles XML, an XML format may be a good basis for corpus data to be imported to Korp. However, note that nesting the same structural attribute type in VRT is somewhat cumbersome, so it would be better to avoid having a clause inside a clause, for example.
Completely empty lines in the VRT input should be avoided. Even though leading and trailing spaces in attribute values are stripped in the encoding phase, they should preferably be stripped from the VRT input. The VRT input may contain XML-style comments <!-- ... --> that are ignored, but each comment must be on its own line: multi-line comments are not recognized. An XML declaration at the beginning of a file is ignored. (All these require explicit options to the CWB encoding program, but it probably makes sense to use them.)
For more information on the VRT format in general, please refer to the CWB Corpus Encoding Tutorial in the CWB documentation. Some Korp-specific information is also found on Språkbanken’s Korp backend information page.
The input data for Korp must be UTF-8-encoded Unicode. If the original data is not in UTF-8, you need to convert it, using e.g. iconv. In that case, you need to know the original character encoding of the data. In a bad case, it may be a mix of two or more 8-bit encodings, such as ISO 8859-1 and Windows-1252, which complicates converting the encoding correctly; and in the worst case, the data may already have been converted incorrectly.
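Where iconv is not at hand, the same conversion can be sketched in Python. The function name to_utf8 is hypothetical, and the sketch assumes that the whole input is in a single, known source encoding; it deliberately fails on bytes that are invalid in that encoding instead of guessing, since a failure may indicate mixed encodings.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8.

    errors="strict" raises UnicodeDecodeError on bytes that are not valid in
    the source encoding, which may be a sign of mixed or misdeclared encodings.
    """
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")
```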
The characters & and < in the data need to be encoded as XML predefined entities &amp; and &lt;. Other XML predefined entities may also be used: &quot; for " (straight ASCII double quotation mark), &apos; for ' (straight ASCII single quotation mark) and &gt; for >, but they are mandatory only if the value of a structural attribute enclosed in quotes contains the same type of quote.
In contrast, do not use the numeric character references of XML &#nnnn; and &#xhhhh;, nor HTML character entity references, such as &auml;, since the CWB encoder treats them literally. Instead, use the corresponding UTF-8-encoded Unicode characters directly.
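The encoding rules for special characters can be sketched as two small Python functions. The function names are hypothetical; the sketch encodes > as well, which the text above allows but does not require, and it assumes that structural attribute values are always enclosed in double quotes.

```python
def escape_token_value(value: str) -> str:
    """Encode &, < and > in a positional attribute value.

    & is replaced first so that the entities produced for < and >
    are not encoded a second time.
    """
    return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_struct_value(value: str) -> str:
    """Encode a structural attribute value for use inside double quotes."""
    return escape_token_value(value).replace('"', "&quot;")
```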
The line endings in the VRT input may be either Unix- or Windows-style (bare LF or CR+LF), but Unix-style bare LF is preferred.
VRT data may not contain tabs or any line-separating characters anywhere else than as separators of positional attributes and lines, respectively. Moreover, the data should not contain any other control characters (characters in the ranges U+0000…U+001F and U+007F…U+009F), nor preferably the Unicode line and paragraph separators (U+2028, U+2029). They should be stripped from the data, or if their presence is essential, encoded in a corpus-specific way. In addition, soft hyphens (U+00AD) should also be removed.
The space (U+0020) and no-break space (NBSP, U+00A0) may be used in the values of positional and structural attributes between other characters. However, spaces and NBSPs at the beginning and end of an attribute value should be stripped, multiple consecutive spaces and NBSPs should be converted to a single one, and values consisting of only spaces and NBSPs should be emptied. In particular, tokens consisting of only spaces and NBSPs should be removed.
The Unicode characters FIGURE SPACE (U+2007) and NARROW NO-BREAK SPACE (U+202F) (and also THIN SPACE, U+2009, when used as a thousands separator) should be converted to NBSPs and treated as above. Other Unicode spaces should be converted to and treated as plain spaces. (Information on Unicode spaces.)
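The character-level rules above can be sketched as a normalization function for a single attribute value. This is only an illustration under stated assumptions: the function name normalize_value is hypothetical, "other Unicode spaces" is taken to mean the remaining characters of the Unicode general category Zs, a mixed run of spaces and NBSPs is reduced to its first character, and THIN SPACE is always converted to a plain space because detecting its use as a thousands separator would need context.

```python
import re
import unicodedata

NBSP = "\u00a0"
# Control characters, soft hyphen, and the Unicode line and paragraph separators.
REMOVE_RE = re.compile(r"[\x00-\x1f\x7f-\x9f\xad\u2028\u2029]")
SPACE_RUN_RE = re.compile("[ \u00a0]+")


def normalize_value(value: str) -> str:
    """Normalize spaces and strip disallowed characters in one attribute value."""
    value = REMOVE_RE.sub("", value)
    chars = []
    for ch in value:
        if ch in ("\u2007", "\u202f"):  # FIGURE SPACE, NARROW NO-BREAK SPACE
            chars.append(NBSP)
        elif ch != NBSP and unicodedata.category(ch) == "Zs":
            chars.append(" ")  # any other Unicode space becomes a plain space
        else:
            chars.append(ch)
    value = "".join(chars)
    # Assumption: a run of spaces and NBSPs collapses to its first character.
    value = SPACE_RUN_RE.sub(lambda m: m.group(0)[0], value)
    return value.strip(" " + NBSP)
```

A token whose word form becomes empty after this normalization consisted only of spaces and should be dropped from the data.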
The positional token attributes (columns) may be, for example, the following (in a dependency-parsed corpus with named entities marked):
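word form (word)
number of the token within the sentence (ref)
lemma (lemma)
lemma with compound boundaries marked (lemmacomp)
part of speech (pos)
morphological analysis (msd)
dependency head number (dephead)
dependency relation (deprel)
named-entity tag (nertag)
The names of positional attributes are specified at the beginning of VRT data (before the first token line) via a positional-attributes comment indicating the order of the attributes; for example: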
<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel -->
Apart from the word form, which should be first, the attributes can be in some other order, as long as all the tokens in a corpus have the same attributes in the same order. If a corpus does not have an attribute, it is left out. If some tokens have an attribute and some others do not, a single underscore (_) should preferably be used to denote the empty value, even though a completely empty value is also allowed. Even if the missing values are left completely empty, the attribute-separating tabs must be present to keep the attribute alignment correct.
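The following Python sketch shows one way to write token lines that keep the columns aligned. The function names and the dictionary representation of a token are hypothetical; the sketch assumes that values have already been cleaned and encoded as described earlier.

```python
def positional_attributes_comment(attr_names):
    """Return the comment line declaring the order of positional attributes."""
    if not attr_names or attr_names[0] != "word":
        raise ValueError("the word form (word) should be the first attribute")
    return "<!-- #vrt positional-attributes: " + " ".join(attr_names) + " -->"


def token_line(token, attr_names):
    """Return one tab-separated token line.

    Missing or empty values are written as a single underscore, so every
    line has the same number of columns.
    """
    values = [token.get(name) or "_" for name in attr_names]
    return "\t".join(values)
```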
An attribute may have multiple values, in which case it is referred to as a feature-set attribute in CWB. Multi-valued attributes can be used to represent ambiguity or uncertainty, for example. A multi-valued attribute is represented by separating the values by vertical bars and adding vertical bars at the beginning and end of the whole value; for example, |Adj|Noun|Verb|. An empty feature set attribute value (no values in the set) is denoted by a single vertical bar |.
If a value of a feature set would itself contain the vertical bar, it should be replaced with another character, such as the Unicode broken bar U+00A6 ¦.
Ranked attributes are a Korp extension of feature-set attributes. Their values are feature-set values in which each individual value has a suffix consisting of a colon : followed by a number (an integer or a float); e.g. |Adj:0.7|Noun:0.22|Verb:0.08|. The numbers may denote probabilities or some other ranking of the values.
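The feature-set and ranked value syntax can be illustrated with a few Python helpers. All function names are hypothetical; replacing a vertical bar inside a value with the broken bar U+00A6 follows the suggestion above, and the parser assumes that individual values are not empty.

```python
def format_feature_set(values):
    """Format values as a feature set, e.g. |Adj|Noun|Verb| (or | if empty)."""
    cleaned = [v.replace("|", "\u00a6") for v in values]
    return "|" + "".join(v + "|" for v in cleaned)


def parse_feature_set(text):
    """Return the list of values in a feature-set string."""
    return [v for v in text.strip("|").split("|") if v]


def format_ranked(pairs):
    """Format (value, number) pairs as a ranked attribute value."""
    return format_feature_set(f"{value}:{rank}" for value, rank in pairs)


def parse_ranked(text):
    """Return (value, number) pairs from a ranked attribute value."""
    pairs = []
    for item in parse_feature_set(text):
        # Split on the last colon, in case the value itself contains a colon.
        value, _, rank = item.rpartition(":")
        pairs.append((value, float(rank)))
    return pairs
```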
Korp recognizes and uses three levels of structures: text, paragraph and sentence. In the input VRT, they are represented with structural attributes of the same names (text, paragraph and sentence). Paragraphs are optional, but texts and sentences are required.
The sentence is the (default) context of a match in the KWIC concordance in Korp, whereas the enclosing paragraph is shown in the context view. (If a corpus has no paragraphs, the context view also shows sentences.) A text is a logical unit with common characteristics and metadata, such as the same writer and timestamp. It may be, for example, a single message on a discussion forum, an article in a magazine or a complete novel. What constitutes a text is a matter of judgment; for example, in OCR’d newspapers and magazines without article boundaries marked, a text might be one issue or one page.
In addition to texts, paragraphs and sentences, the VRT input may also contain other structures, even though they are seen in Korp only via their possible attributes. The structures may be at any level, for example, chapters containing paragraphs, or clauses within sentences. If the original data has other structures, they should preferably be preserved in the VRT format. However, please note that some VRT processing tools currently do not correctly handle VRT data containing structures within sentences.
Note that all tokens should be within text and sentence structures; otherwise Korp will not show them and some processing tools, such as the parser, will fail. Moreover, if the data has paragraphs, each sentence should be inside a paragraph. For example, if the text contains headings, they should be enclosed in sentence structures (and also paragraph if the data has paragraphs), with perhaps the attribute type having value heading, instead of inventing a distinct structure for headings.
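A check for tokens outside the required structures could be sketched as follows. The function name and the shape of the result are hypothetical; the sketch counts open structures per name instead of using a stack, so that crossing structures do not confuse it, and it assumes that every line beginning with < is a tag, a comment or a declaration.

```python
import re

# Start or end tag at the beginning of a line, e.g. <text ...> or </sentence>.
TAG_RE = re.compile(r"<(/?)([a-z0-9_]+)[ >]")


def tokens_outside_structures(lines, required=("text", "sentence")):
    """Return (line number, missing structures) for tokens not enclosed
    in all of the required structures."""
    depth = {}
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<"):
            match = TAG_RE.match(line)
            if match:  # comments and XML declarations do not match
                name = match.group(2)
                change = -1 if match.group(1) else 1
                depth[name] = depth.get(name, 0) + change
            continue
        missing = [s for s in required if depth.get(s, 0) < 1]
        if missing:
            problems.append((number, missing))
    return problems
```

For data with paragraphs, the same function can be called with required=("text", "paragraph", "sentence").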
The attributes of structural elements can mostly be free-form, but the same attribute names should be used for the same information across corpora; see below for commonly used attribute names. The metadata for a complete text should be represented as attributes of the text start tag. The names of attributes should be in English.
If the creation date (and time) of the original text is known, it is represented (as local time) in the attributes datefrom, dateto, timefrom and timeto of the structure text. The values of datefrom and dateto must be of the form yyyymmdd (or empty), and the values of timefrom and timeto of the form hhmmss (or empty). If the full creation date is known but not the time, the values of datefrom and dateto are the same, the value of timefrom is 000000, and that of timeto is 235959. If only the year is known, the value of datefrom should be yyyy0101 and that of dateto yyyy1231. If the creation date is unknown, the attributes should be left empty. No ad-hoc values may be used, such as marking uncertainty with a question mark.
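The date rules can be summarized in a short Python sketch. The function name creation_time_attrs is hypothetical, and the sketch covers only the cases where the time of day is unknown. Using 000000 and 235959 as the times when only the year is known is an assumption; the text above gives these values explicitly only for the case where the full date is known.

```python
from datetime import date


def creation_time_attrs(year=None, month=None, day=None):
    """Return datefrom, dateto, timefrom and timeto for a text whose
    creation time of day is unknown."""
    if year is None:
        # Unknown creation date: all four attributes are left empty.
        return {"datefrom": "", "dateto": "", "timefrom": "", "timeto": ""}
    if month is None or day is None:
        first, last = date(year, 1, 1), date(year, 12, 31)  # only the year known
    else:
        first = last = date(year, month, day)  # ValueError for an invalid date

    def fmt(d):
        return f"{d.year:04d}{d.month:02d}{d.day:02d}"

    return {
        "datefrom": fmt(first),
        "dateto": fmt(last),
        "timefrom": "000000",  # assumption for the year-only case
        "timeto": "235959",    # assumption for the year-only case
    }
```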
If you need separate, human-readable date and time attributes for the corpus, you can use date_orig and time_orig containing values extracted from the original corpus data. For uniformity across corpora, date_iso and time_iso can be used to represent the date and time in the long ISO formats yyyy-mm-dd and hh:mm:ss.
Dependency-parsed corpora in particular require each sentence structure to have the attribute id, whose value is unique within the corpus.
All the structures of the same type in a corpus should have the same attributes. If the value of an attribute is empty for some structure, it should still be represented as attrname="". Although the order of the attributes in the VRT does not matter, they should preferably be in the same order in all the structures of a corpus. An alphabetic (lexicographic) order is recommended.
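A start tag that follows these recommendations could be generated as in the following Python sketch. The function name start_tag is hypothetical; the sketch always uses double quotes, writes empty values as attrname="", sorts the attributes alphabetically, and encodes the special characters in values.

```python
def start_tag(name, attrs):
    """Return a VRT start tag with attributes in alphabetical order."""
    parts = [name]
    for key in sorted(attrs):
        value = attrs[key] or ""  # a missing value is written as attrname=""
        value = (
            value.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
        )
        parts.append(f'{key}="{value}"')
    return "<" + " ".join(parts) + ">"
```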
The names of structures and their attributes in VRT may only contain the characters a…z (lowercase only), 0…9 and _ (underscore). (CWB also allows a - (hyphen), but not all VRT tools can handle it.) The names may not begin with a digit. Moreover, do not use the underscore in the names of structures; in their attributes it may be used without problems. You should also avoid using the following reserved words of the CQP query language as structure or (positional) attribute names:
asc ascending by cat cd collocate contains cut def define delete desc descending diff difference discard dump exclusive exit expand farthest foreach group host inclusive info inter intersect intersection join keyword left leftmost macro maximal match matchend matches meet MU nearest no not NULL off on randomize reduce RE reverse right rightmost save set show size sleep sort source subset TAB tabulate target target[0-9] to undump union unlock user where with within without yes
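The naming rules can be checked mechanically, as in the following Python sketch. The function name name_problems and the wording of the messages are hypothetical. The reserved words are copied from the list above and compared exactly as written there, which is an assumption about case sensitivity; the entry target[0-9] is treated as a pattern.

```python
import re

# Reserved words copied from the list above; "target[0-9]" is handled
# separately as a pattern.
CQP_RESERVED = frozenset("""
asc ascending by cat cd collocate contains cut def define delete desc
descending diff difference discard dump exclusive exit expand farthest
foreach group host inclusive info inter intersect intersection join keyword
left leftmost macro maximal match matchend matches meet MU nearest no not
NULL off on randomize reduce RE reverse right rightmost save set show size
sleep sort source subset TAB tabulate target to undump union unlock user
where with within without yes
""".split())
VALID_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")
TARGET_N_RE = re.compile(r"target[0-9]")


def name_problems(name, is_structure=False):
    """Return a list of problems with a structure or attribute name."""
    problems = []
    if not VALID_NAME_RE.fullmatch(name):
        problems.append("only a-z, 0-9 and _ are allowed, and no leading digit")
    if is_structure and "_" in name:
        problems.append("structure names should not contain an underscore")
    if name in CQP_RESERVED or TARGET_N_RE.fullmatch(name):
        problems.append("reserved word of the CQP query language")
    return problems
```

An empty list means that the name passes all the checks, so the function can be used directly in a condition.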
A structural attribute may also be a multi-valued (feature set) or ranked attribute, in which case its values are represented in the same way as the values of a positional feature-set or ranked attribute; for example, nertag="|LocGpl|LocPpl|" or nertag="|LocGpl:0.8|LocPpl:0.2|".
The following structural attributes occur in several corpora and should be used for future corpora with similar information. The attributes whose names are in bold are required by Korp. The format of the value is shown for attributes with fixed-format values. The list is not exhaustive, so if you think your corpus has attributes that have probably been used before, you may ask for advice on naming them.
Structure | Attribute | Description | Value |
---|---|---|---|
text | datefrom | The starting date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy0101 |
text | dateto | The ending date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy1231 |
text | timefrom | The starting time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 000000 |
text | timeto | The ending time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 235959 |
text | date_orig | Creation or publication date of the text | Free-form date in the original format present in the data; may contain ranges and indicators of uncertainty |
text | time_orig | Creation or publication time of the text | Free-form time in the original format present in the data; may contain time zone information |
text | datetime_orig | Combined creation or publication date and time of the text | Free-form date and time in the original format present in the data |
text | date_iso | Creation or publication date of the text | Long ISO format yyyy-mm-dd |
text | time_iso | Creation or publication time of the text | Long ISO format hh:mm:ss |
text | title | Title of the text | |
text | author | Author of the text | |
text | translator | Translator of the text | |
text | year | Writing or publication year of the text | |
text | filename | Name of the source file for the text | |
text | url | URL of a human-readable version of the text, possibly with context | |
text | issue | Magazine or newspaper issue | |
text | lang | Language of the text | Preferably a three-letter ISO 639-2 code |
text | subject | Subject of the text | |
text | publisher | Publisher of the text | |
text | wordcount | The number of words in the text, excluding punctuation marks | |
paragraph | id | An identifier for the paragraph | |
paragraph | type | The type of the paragraph | For example, paragraph, heading |
sentence | id | An identifier unique within a corpus; required for dependency-parsed corpora | |
[TODO: Extend the list]
The Korp search interface requires information on the structures and attributes used in a corpus. Each corpus should preferably be accompanied by a list of the annotation attributes of tokens and in particular of the structural attributes, together with their brief labels or descriptions in at least Finnish, preferably also in English and Swedish. (For Swedish corpora, the attribute labels may be only in Swedish and for corpora in other languages only in English.)
If the set of values of an attribute is fixed and relatively small, such as parts of speech, Korp’s extended search may have a selection list for it with human-readable names of the values (e.g., N = noun). It would be good to have a list of such names for attribute values as well.
Each language (or otherwise parallel part) of parallel corpora is encoded separately. The alignment between different languages is marked by using the same value for the id attribute of the aligned structures. It is now preferable to use the separate alignment structure link, which marks alignment units that may span several sentences, for example. It is possible to add the alignment id attribute to paragraphs or sentences, as in a number of existing corpora, but using link enables better compatibility with Språkbanken’s Korp.
CWB allows one-to-one, one-to-many, many-to-one, many-to-many and crossing alignments, but the alignment regions must be contiguous.
It is also possible to use alignments specified separately either as correspondences of aligned regions by token positions in the corpus or as alignment beads referring to the attributes of aligned structures. [TODO: Add examples]
[TODO: Add information about word alignment (optional).]
Note the following technical limitations of the CWB, which affect the VRT files (Appendix B of the CWB Corpus Encoding Tutorial):