The Korp corpus search interface of Kielipankki (The Language Bank of Finland) is accompanied by and based on a publicly accessible Web service. Other programs can use this service to retrieve the same data as the Korp search interface, so it serves as an API for Korp. The Korp Web service also offers some features not currently available in the Korp search interface, such as pre-queries that limit the scope of the main query to structures (sentences, paragraphs or the like) that also match the pre-queries.
This document describes the commands available in the Korp API, their parameters and the content of the data they return. It is largely based on the Korp Web service documentation of Språkbanken (The Swedish Language Bank at the University of Gothenburg).
The address for Kielipankki’s main Korp Web service is
https://www.kielipankki.fi/korp/cgi-bin/korp/korp.cgi
The Korp Web service is accessed using standard HTTP(S) GET requests, so the URLs for API calls are of the form
https://www.kielipankki.fi/korp/cgi-bin/korp/korp.cgi?command=…&corpus=…&…
Here, command is the Web service command to execute and corpus is the list of corpora that the command targets. Most commands require these parameters.
The Web service returns data in JSON (JavaScript Object Notation) format, which shows up as text in the Web browser.
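As a minimal sketch of such a call, the following Python builds a request URL from a command and its parameters and decodes the JSON reply. The helper names build_url and call_korp and the corpus ids CORPUS_A and CORPUS_B are illustrative assumptions, not part of Korp; list-valued parameters are joined with commas.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

KORP_URL = "https://www.kielipankki.fi/korp/cgi-bin/korp/korp.cgi"


def build_url(command, **params):
    """Return the GET URL for a Korp command (helper name is illustrative)."""
    query = {"command": command}
    for name, value in params.items():
        # Multi-valued parameters (such as corpus) are joined with commas.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        query[name] = value
    return KORP_URL + "?" + urlencode(query)


def call_korp(command, **params):
    """Send the request and decode the JSON response (needs network access)."""
    with urlopen(build_url(command, **params)) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URL is built here; no request is sent.
# CORPUS_A and CORPUS_B are hypothetical corpus ids.
print(build_url("info", corpus=["CORPUS_A", "CORPUS_B"]))
```

Calling call_korp("info") would send the request and return the decoded JSON as a Python dictionary.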
In addition, the KWIC search results can be downloaded from Korp in various other formats using the download service at
https://www.kielipankki.fi/korp/cgi-bin/korp/korp_download.cgi
To get an idea of how the results of a Korp search could be obtained via the API, you can open a network traffic monitor in your Web browser’s developer tools and see how the Korp search interface interacts with the API. Look for calls to korp.cgi or korp_download.cgi. You can copy their URLs to the browser address bar, modify their parameters and see the effect. (The parameter querydata often appears in the URL, but the Web service does not need it.)
Korp is based on the (IMS Open) Corpus Workbench (CWB) and uses the query language of its Corpus Query Processor (CQP). It may be helpful to know the basics of CQP for using the Korp Web service. Please note that the CQP queries in Korp are single query expressions: you cannot use the CQP commands for counting, sorting, naming results or setting options, for example.
Corpora in Korp (CWB) consist of tokens (words and punctuation marks). Each token has named positional attributes, at least a word form. In addition, sequences of tokens may be grouped by structural attributes that correspond roughly to XML elements and their attributes.
The positional and structural attributes vary from corpus to corpus but some attributes are common to many corpora. Even if the attribute names are the same, it does not guarantee that the attribute value sets are the same.
Common positional (token) attributes for corpora are:
word: word form
lemma: base form
lemmacomp: base form with compound boundary marked
pos: part of speech
msd: morpho-syntactic description (morphological analysis)
ref: the number of the token in the sentence (one-based)
dephead: the number of the dependency head of the token
deprel: dependency relation with dephead
lex: a “lemgram” for the word: lemma..pos.n
The use of lemgrams originates from the Swedish corpus markup, where it serves as a sense identifier. For Finnish corpora, lemgrams are constructed artificially. They use a different (fixed) set of part-of-speech tags and the sense number n is always 1.
Common structural attributes are text, paragraph and sentence, corresponding to divisions of the text. The attribute values associated with these structures are represented by structural attributes of the form struct_attr; for example, text_title for the title of a text and sentence_id for a sentence identifier.
Almost all corpora have the structural attributes text_datefrom, text_dateto, text_timefrom and text_timeto, which correspond to the creation date and time of the text, represented in the formats yyyymmdd and hhmmss, respectively. If the exact date is known, the values of text_datefrom and text_dateto are the same. If only the year is known, text_datefrom is yyyy0101 and text_dateto is yyyy1231; if the year and month are known, text_datefrom is the first day of the month and text_dateto is the last one. If the time is known, text_timefrom and text_timeto are the same; otherwise text_timefrom is 000000 and text_timeto is 235959. An unknown date is represented by an empty string in all these attributes.
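The date convention above can be sketched in Python as follows. The function date_range is a hypothetical helper written for illustration only; it derives the pair of values that text_datefrom and text_dateto would hold for a date known to the year, the month or the day.

```python
import calendar


def date_range(year, month=None, day=None):
    """Return (text_datefrom, text_dateto) as yyyymmdd strings."""
    if month is None:
        # Only the year is known: the range covers the whole year.
        return f"{year:04d}0101", f"{year:04d}1231"
    if day is None:
        # Year and month are known: first to last day of the month.
        last_day = calendar.monthrange(year, month)[1]
        return (f"{year:04d}{month:02d}01",
                f"{year:04d}{month:02d}{last_day:02d}")
    # The exact date is known: both values are the same.
    exact = f"{year:04d}{month:02d}{day:02d}"
    return exact, exact


print(date_range(1999))         # ('19990101', '19991231')
print(date_range(2000, 2))      # ('20000201', '20000229')
print(date_range(2000, 2, 15))  # ('20000215', '20000215')
```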
The parameters of Korp Web service commands are described below as follows:
a = …: a is a required parameter
b = x: the optional parameter b takes value x
c = …: the parameter c takes multiple values separated with commas
The above list would be represented as URL parameters as follows:
?a=…&b=x&c=…,…
The properties of the JSON objects returned by the Korp Web service are described so that the following JSON:
{ "a": …, name: { "b": "y", "c": … } }
is described as follows:
a: … (description)
name [+]
  b: y
  c: … (description)
The [+] above for name indicates that the property may be repeated multiple times, obviously with different property names. If a value is an array of objects, it is mentioned explicitly.
All commands take the following optional parameters:
indent = num: Format the resulting JSON with indentation step num. The default is to return the JSON in a compact form, with no indentation or line breaks.
callback = string: Enclose the resulting JSON in string( … ) (for some AJAX calls, for example).
cache = true: Use a cached result if available; if not, store the result in the Korp query cache for future queries.
If a command causes an error, it returns JSON with the property ERROR:
ERROR
  type: The type of the error
  value: Error message
All commands also return the real time (as opposed to CPU time) that executing the command took:
time: Run time in seconds
Retrieve information about the available corpora and the CQP version used.
Parameters:
command = info
Returns:
corpora: Comma-separated list of corpora available on the Korp server. The corpora are shown as upper-case corpus ids.
protected_corpora: An array of the names (ids) of the corpora that are protected
cqp-version: The CQP version used on the server
Please note that trying to access protected corpora (corpora requiring authentication for access) via the Korp Web service results in an error.
Example
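Since an error is reported inside the returned JSON (in the property ERROR), a client may want to check every decoded response before using it. The following is a minimal sketch under that assumption; the names KorpError and check_response and the sample error content are hypothetical.

```python
import json


class KorpError(Exception):
    """Raised when a Korp response contains the ERROR property."""


def check_response(result):
    """Return result unchanged, or raise KorpError if it reports an error."""
    error = result.get("ERROR")
    if error is not None:
        raise KorpError(f"{error.get('type')}: {error.get('value')}")
    return result


# Hypothetical response text; real error types and messages may differ.
sample = json.loads(
    '{"ERROR": {"type": "SomeError", "value": "Something went wrong"}}'
)
try:
    check_response(sample)
except KorpError as exc:
    print("Korp reported an error:", exc)
```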
Retrieve information on one or more corpora and their attributes.
Parameters:
command = info
corpus = List of upper-case corpus ids
Returns:
corpora
  corpusname [+]
    attrs
      p: Comma-separated list of positional (word) attributes in corpus corpusname
      s: Comma-separated list of structural (text) attributes in corpus corpusname. Attributes with a simple name without underscores designate structures and they have no particular values: for example, sentence for a sentence. Attributes with names of the form struct_attr containing underscores designate the attribute attr of the structure struct and they have values; for example, sentence_id for the identifier of a sentence.
      a: Comma-separated list of alignment attributes (for parallel corpora)
    info
      Charset: Character encoding of the corpus
      Size: The number of tokens in the corpus
      Sentences: The number of sentences in the corpus
      Updated: The date of last update in ISO format yyyy-mm-dd
total_size: The total number of tokens in the above corpora
total_sentences: The total number of sentences in the above corpora
Examples:
Perform a KWIC concordance search for one or more corpora.
Parameters:
command = query
corpus = Corpus id in uppercase
cqp = CQP query expression
cqpn = Additional CQP query expressions; the result contains only matches within structures (as specified with defaultwithin or within) that contain a match for all the queries. See below for more information.
start = The number of the first hit to include in the concordance (starting from 0)
end = The number of the last hit to include in the concordance
defaultcontext = n struct: The default context (n struct elements) to show around the match: typically 1 sentence or 1 paragraph to show only the containing sentence or paragraph. You can also use n words to show n words around the match, disregarding structure boundaries.
context = corpus:n struct: The context to show for corpus corpus instead of the default
show = The positional attributes to show for tokens (from the list attrs.p returned by the command info for a corpus), and also the structures whose opening and closing are to be shown within tokens (from the list attrs.s returned by info for a corpus, typically structure names without an underscore)
show_struct = The structural attributes to show (from the list attrs.s returned by info for a corpus)
cut = The maximum number of hits to search
defaultwithin = struct: Limit the search within the structural attribute struct
within = corpus:struct: Limit the search in corpus corpus within struct instead of the structure given with defaultwithin
sort = Sort criterion for the search results within each corpus: one of keyword (the searched word), left (left context), right (right context) or random (random order)
random_seed = n: Use n as the seed for the random number generator, to get a reproducible random order with sort=random
incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora
The word form is always shown in the concordance, even if show=word is not specified.
The additional CQP query parameters cqpn can be used to simulate an order-independent conjunction of search criteria: it does not matter in which order the matches for the separate CQP queries appear in the text structure. This contrasts with a single CQP query, which always specifies the order in which the matching tokens must appear in the text. Note that the result will only indicate match positions for the largest-numbered query (the number of the unnumbered parameter cqp is 0); the rest are considered pre-queries limiting the scope of the matches.
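To illustrate the numbering, the sketch below builds a query URL in which the pre-queries come first and the main query gets the largest number, so that match positions are reported for the main query. The function build_query_url, the corpus id CORPUS_A and the sample CQP expressions are hypothetical, and numbering the additional queries consecutively from 1 is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

KORP_URL = "https://www.kielipankki.fi/korp/cgi-bin/korp/korp.cgi"


def build_query_url(corpus, main_query, pre_queries=(), within="sentence"):
    """Build a query URL with the main query as the largest-numbered cqp."""
    # Pre-queries first, main query last: matches are shown for the last one.
    all_queries = list(pre_queries) + [main_query]
    params = {"command": "query", "corpus": corpus, "defaultwithin": within}
    for number, cqp in enumerate(all_queries):
        # Query number 0 is the unnumbered parameter cqp.
        name = "cqp" if number == 0 else f"cqp{number}"
        params[name] = cqp
    return KORP_URL + "?" + urlencode(params)


# Sentences containing both lemmas, in either order (hypothetical example).
url = build_query_url(
    "CORPUS_A", '[lemma = "kissa"]', pre_queries=['[lemma = "koira"]']
)
decoded = parse_qs(urlsplit(url).query)
print(decoded["cqp"], decoded["cqp1"])
```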
Returns:
hits: The total number of hits
corpus_hits
kwic: An array of KWIC rows with the following properties:
  corpus: Corpus name in uppercase
  match: Information on the match (of the main CQP query only):
    start: The start position (word) of the match on the KWIC row
    end: The end position (word) of the match on the KWIC row
    position: Global corpus position (token number from the beginning of the corpus) for the match
  tokens: An array of tokens on the KWIC row. Each token is an object, whose properties are the positional attributes specified in the parameter show (if they exist in the corpus in question). If structural attributes (structures) are specified in show, their opening and closing are shown in the property structs of the first and last token of the structure, respectively: the property structs.open lists all the structures opening before the token and structs.close the structures closing after the token.
  structs: The values of the structural attributes listed in show_struct for the first token of the matching row.
  aligned: For parallel corpora only
Examples:
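As a hedged illustration of how a client might read one KWIC row, the sketch below splits the tokens of a row into left context, match and right context. The sample row is invented for illustration, and the code assumes that match.end is exclusive (the position just after the last matching token), which the description above does not state; adjust the slicing if the server reports an inclusive end.

```python
def split_kwic_row(kwic_row):
    """Return (left, match, right) lists of word forms for one KWIC row."""
    words = [token["word"] for token in kwic_row["tokens"]]
    start = kwic_row["match"]["start"]
    # Assumption: end is exclusive, as in Python slices.
    end = kwic_row["match"]["end"]
    return words[:start], words[start:end], words[end:]


# Hypothetical KWIC row following the structure described above.
sample_row = {
    "corpus": "CORPUS_A",
    "match": {"start": 1, "end": 2, "position": 1234},
    "tokens": [
        {"word": "Pieni", "lemma": "pieni"},
        {"word": "kissa", "lemma": "kissa"},
        {"word": "nukkuu", "lemma": "nukkua"},
        {"word": ".", "lemma": "."},
    ],
}

left, match, right = split_kwic_row(sample_row)
print(left, match, right)  # ['Pieni'] ['kissa'] ['nukkuu', '.']
```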
Count the absolute and relative frequency of one or more attributes for a CQP query.
Parameters:
command = count (or count_all for counting statistics for all tokens)
cqp = CQP query expression (not applicable to count_all)
cqpn = Additional CQP query expressions (not applicable to count_all)
groupby = Positional and/or structural attributes according to which to group the results
corpus = Corpus names in uppercase
defaultwithin = struct: Limit the search within the structural attribute struct
within = corpus:struct: Limit the search in corpus corpus within struct instead of the structure given with defaultwithin
ignore_case = Attributes for which case is ignored
start = The number of the first row (groupby attribute value) to return
end = The number of the last row to return
incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora
You should use the command count_all for counting statistics for all the tokens in one or more corpora: it is optimized for this task and much faster than count. count_all takes the same arguments as count, except for cqp (and cqpn).
Returns:
corpora:
  absolute: Absolute frequencies by the values of the attributes in the groupby parameter
  relative: Relative frequencies
  sums: Sums of all the attribute values for the corpus
    absolute: Sum of absolute frequencies
    relative: Sum of relative frequencies
total: The total frequencies for all corpora in the same format as above for individual corpora
count: The total number of different values
Examples:
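The sketch below shows one way to rank the values in a count result by absolute frequency. The sample result is invented, and its layout (a corpus id under corpora, mapping each groupby value to a frequency under absolute) is an assumption based on the description above; the helper top_values is hypothetical.

```python
def top_values(count_result, corpus, n=3):
    """Return the n most frequent (value, frequency) pairs for a corpus."""
    absolute = count_result["corpora"][corpus]["absolute"]
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(absolute.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical result for groupby=word; the real layout may differ.
sample_result = {
    "corpora": {
        "CORPUS_A": {
            "absolute": {"kissa": 12, "kissan": 7, "kissat": 7, "kissoja": 2},
            "sums": {"absolute": 28},
        }
    },
    "count": 4,
}

print(top_values(sample_result, "CORPUS_A"))
# [('kissa', 12), ('kissan', 7), ('kissat', 7)]
```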
Get the frequencies of one or more expressions over time.
Parameters:
command = count_time
cqp = CQP query expression
corpus = Corpus names in uppercase
granularity = Temporal granularity of the result: y (year; the default), m (month) or d (day)
incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora
If one or more subcqpn parameters are specified, the frequency information is returned for these queries as well.
The result is returned both by corpus and as a total for all corpora.
Returns:
corpora:
  cqp: The sub-CQP query in question (not returned for the main query)
  absolute:
  relative:
  sums:
    absolute: Sum of absolute frequencies
    relative: Sum of relative frequencies
combined: The combined frequencies for all the corpora in corpora in the above format
Examples:
Compare the search results of two sets of corpora using log-likelihood.
Parameters:
command = loglike
set1_cqp = CQP query expression for set 1
set2_cqp = CQP query expression for set 2
groupby = Positional and/or structural attributes according to whose values to group the results
set1_corpus = Corpus names in uppercase for set 1
set2_corpus = Corpus names in uppercase for set 2
max = The maximum number of results
incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora
The command may be used to compare two different queries (or the same query) on two different sets of corpora (or the same set) as long as both sets of corpora have the attributes listed in the groupby parameter.
Returns:
average: Average value for log-likelihood
loglike: Log-likelihood values by the values of the groupby attributes.
set1
set2
If the parameter groupby contains more than one attribute name, the values above have them separated by slashes (value1/value2/…).
Examples:
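For readers unfamiliar with the statistic, the sketch below computes a widely used form of the log-likelihood (G2) score for comparing the frequency of one value in two sets. It is an illustration only: whether the Korp server computes its values in exactly this way is not stated in this document, and the function name and arguments are hypothetical.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """G2 score for a value seen freq1 times in set 1 and freq2 in set 2.

    size1 and size2 are the total numbers of observations in each set.
    """
    total = freq1 + freq2
    # Expected frequencies if the value were equally common in both sets.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    score = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Equal relative frequencies give a score of zero.
print(log_likelihood(10, 1000, 10, 1000))
# A larger difference between the sets gives a larger score.
print(log_likelihood(20, 1000, 5, 1000) < log_likelihood(40, 1000, 5, 1000))
```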
Retrieve the most frequent dependency relations in which a lemgram or word form occurs.
Parameters:
command = relations
corpus = Corpus name in uppercase
word = The lemgram or word form to search
type = Search type: word (word form; the default) or lemgram
min = Minimum frequency to be shown
max = The maximum number of results (0 = no limit)
incremental = true: Return information incrementally, as soon as the computation for each individual corpus is ready
Returns:
relations: An array of relations with the following properties:
  source: List of sources, which are strings of the form CORPUS:id, where CORPUS is a corpus id and id an internal relation id; to be used as an input parameter to the command relations_sentences
  dep: Dependent lemgram (or word form)
  depextra: Dependent prefix (not used in Finnish corpora)
  deppos: Dependent part of speech (mapped to the SUC2 tagset)
  freq: Number of occurrences
  head: Head lemgram (or word form)
  headpos: Head part of speech (mapped to the SUC2 tagset)
  mi: Lexicographer’s mutual information value
  rel: Dependency relation (mapped to the Swedish treebank dependency labels)
Retrieve the sentences in which a dependency relation occurs. The dependency relation is often from the word picture.
Parameters:
command = relations_sentences
source = Strings of the form CORPUS:id, where CORPUS is a corpus id and id an internal relation id, as returned in the source value by the command relations
head = The lemgram of the head word
rel = Dependency relation
dep = Dependent lemgram
depextra = Dependent prefix (not used in Finnish corpora)
start = The number of the first hit to include in the concordance (starting from 0)
end = The number of the last hit to include in the concordance
Returns:
The command returns a structure of the same type as the basic KWIC concordance returned by query.
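The two commands are meant to be chained: an item returned by relations supplies the parameters for relations_sentences. The following sketch shows that step under stated assumptions: the helper relations_sentences_url is hypothetical, the sample relation item is invented, and joining several source strings with commas is assumed from the general convention for multi-valued parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

KORP_URL = "https://www.kielipankki.fi/korp/cgi-bin/korp/korp.cgi"


def relations_sentences_url(relation, start=0, end=9):
    """Build a relations_sentences URL from one item of a relations result."""
    params = {
        "command": "relations_sentences",
        # Assumption: multiple sources are joined with commas.
        "source": ",".join(relation["source"]),
        "head": relation["head"],
        "rel": relation["rel"],
        "dep": relation["dep"],
        "start": start,
        "end": end,
    }
    return KORP_URL + "?" + urlencode(params)


# Hypothetical item from the relations array; real values will differ.
sample_relation = {
    "source": ["CORPUS_A:123", "CORPUS_B:456"],
    "head": "nukkua..vb.1",
    "rel": "SS",
    "dep": "kissa..nn.1",
    "freq": 5,
}

url = relations_sentences_url(sample_relation)
print(parse_qs(urlsplit(url).query)["source"])
# ['CORPUS_A:123,CORPUS_B:456']
```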
The KWIC concordance search results can be downloaded (exported) in various formats using the Web service at
https://www.kielipankki.fi/korp/cgi-bin/korp/korp_download.cgi
The main parameters the service takes are the following:
query_params = The parameters to the korp.cgi command query for generating a KWIC result; if specified, korp.cgi is called to generate the result.
query_result = The Korp query result (in JSON) to format; overrides query_params
format = The format to which to convert the result; default: json (JSON). See below for further information.
filename_format = A format specification (template) for the (suggested) name of the file to generate; may contain the following format keys: {cqpwords} (the words in the CQP query), {start} (the number of the first hit), {end} (the number of the last hit), {date} (the date of the query as yyyymmdd), {time} (the time of the query as hhmmss), {ext} (file name extension based on the format); default: korp_kwic_{cqpwords}_{date}_{time}.{ext}
filename = The (suggested) name of the file to generate; overrides filename_format.
The service can either take the query results returned by the main Korp Web service command query in the parameter query_result or take the parameters for the command query, pass them to the main Web service and use the query results it returns. If neither query_params nor query_result is specified, the service assumes that its parameters are parameters for the main Korp Web service command query to perform a CQP query.
The result is formatted according to format and possible additional parameters for the format.
The currently supported values for the format parameter are
json = JSON (default): The original JSON format returned by the main Korp Web service
nooj = NooJ format
annot (= tokens) = Linguistic annotations in a tabular format: one line per token
sentences = Sentence per row, the word forms of the sentence in one column and each text attribute in its own column, and query information at the end of the file (but see below for a variant)
ref (= bibref) = Bibliographical reference in a tabular format: the whole sentence on one line and metadata information on the following lines
The format sentences has a variant which has the information on the whole result repeated for each row, instead of only once at the end of the file, and a field containing all the lemmas of a sentence. This variant is selected by giving the value lemmas-resultinfo to the parameter subformat. You can further customize the format with the parameters described further below.
The tabular formats annot, sentences and ref are usually followed by a comma and a physical format specifying the physical representation of the tabular data:
tsv = TSV (tab-separated values) (default): fields (columns) separated by tabs, field values not quoted, Unix-style line endings (LF)
csv = CSV (comma-separated values): fields separated by commas, all field values in double quotes (also numeric values), literal double quotes doubled, DOS/Windows-style line endings (CR+LF)
xls = Excel 97 XLS spreadsheet
Examples:
… ref, lemma, pos, msd, dephead and deprel, and the structural attributes …, sentence_id, text_date, text_tid and text_cid:
The physical format of TSV and CSV can be customized via the following parameters:
delimiter = the string separating fields (columns)
newline = the end-of-line character(s) (literally; does not accept C-style escape sequences)
quote = the quote character enclosing field values
replace_quote = the character(s) with which to replace quote characters in field values
The annot, sentences and ref formats use the parameters structs and attrs to specify which structural and positional attributes should be shown in the result. They take a comma-separated list of the following values:
* = Show all attributes listed in the Korp Web service parameter show_struct for structs and show for attrs.
+ = As above, but show only those that actually occur in the corpora from which the results come.
… * or + to omit an attribute.
The tabular formats can be customized with a number of formatting parameters, including the following:
sentence_fields = a comma-separated list of the names of the fields of sentences to display; available values include:
  hit_num = the number of the hit across all pages of hits for the query (zero-based)
  sentence_num = the number of the sentence in this file (zero-based)
  corpus_name = name of the corpus
  tokens = all tokens (typically, word forms) of the sentence, separated by a string that can be changed via the parameter token_sep
  left_context = the tokens of the sentence to the left of the match, formatted in the same way as tokens
  match = the tokens of the sentence that are a part of the match, formatted as tokens
  right_context = the tokens of the sentence to the right of the match, formatted as tokens
  attrs_type = the positional attribute attr for each of the tokens of the sentence, formatted as tokens, where attrs is a pluralized form of attribute name attr (note that attr needs to be listed in the parameter sentence_token_attrs) and type is one of all (all tokens of the sentence), left_context, match or right_context; for example, lemmas_all is a list of the lemma attribute of all the tokens of a sentence and poses_match is a list of the pos attribute of the tokens of the match
  aligned = for parallel corpora, the tokens of the sentence aligned with the match; use ?aligned to include the field only if the result contains aligned sentences
  structs = all the structural attributes of the sentence (and containing elements), each formatted by default as name:value and separated by semicolons
  *structs = expanded to all the structural attributes of the sentence (and containing elements), each in its own field
  params = Korp query parameters, formatted as name=value, separated by semicolons
  date = the date and time of the query in the format YYYY-MM-DD hh:mm:ss
  urn = URN of the corpus
  metadata_linktype = link to metadata, where linktype is one of urn (bare URN), url (URL) or link (the URN as a URL if URN is available, otherwise the URL)
  licence_linktype = link to licence information for the corpus (linktype as above)
  licence_name = the name of the corpus licence
Note that values for the corpus information fields urn, metadata_linktype, licence_linktype and licence_name are currently not available in the API for all corpora. In sentence_fields, the field names may be prefixed with a question mark ? to include the field in the result only if any of the corpora in the result has a value for the field.
match_open = the string to be added to (by default, before) the first token of the match in the sentence fields tokens, match, attrs_all and attrs_match (default: the empty string)
match_close = the string to be added to (by default, after) the last token of the match (in the sentence fields as for match_open) (default: the empty string)
For example, the default value for sentence_fields in the format sentences is corpus,?urn,?metadata_link,?licence_name,?licence_link,match_pos,left_context,match,right_context,?aligned,*structs, whereas that for the variant lemmas-resultinfo is hit_num,corpus,tokens,lemmas_all,?aligned,*structs,?urn,?metadata_link,?licence_name,date,hitcount,?korp_url,params.
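Putting the download parameters together, the sketch below builds a URL for korp_download.cgi that passes the query parameters directly (relying on the behaviour described above when neither query_params nor query_result is given) and asks for the sentences format as CSV. The function build_download_url, the corpus id and the chosen field list are illustrative assumptions; the URL is only constructed, not requested.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DOWNLOAD_URL = "https://www.kielipankki.fi/korp/cgi-bin/korp/korp_download.cgi"


def build_download_url(corpus, cqp, fmt="sentences,csv", start=0, end=99,
                       **format_params):
    """Build a download URL; extra keyword arguments are format parameters."""
    params = {
        "corpus": corpus,
        "cqp": cqp,
        "start": start,
        "end": end,
        # Logical format and physical format, separated by a comma.
        "format": fmt,
    }
    params.update(format_params)
    return DOWNLOAD_URL + "?" + urlencode(params)


# Hypothetical corpus id and query; the match is wrapped in angle brackets.
url = build_download_url(
    "CORPUS_A",
    '[lemma = "kissa"]',
    sentence_fields="corpus,left_context,match,right_context",
    match_open="<",
    match_close=">",
)
decoded = parse_qs(urlsplit(url).query)
print(decoded["format"], decoded["sentence_fields"])
```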