The Korp version of Wanca 2016 is a collection of web corpora in small Uralic languages. The collection consists of 29 sentence corpora, each in a different language. The corpora were collected from the Internet using the automated system developed in the Finno-Ugric Languages and the Internet project (SUKI), which was supported by the Kone Foundation through its Language Programme 2012-2016.
The Korp version was published as a BETA release in the autumn. The BETA status has now been removed, following these corrections to the Korp version:
Tokenization problems fixed for Nganasan, Nenets, Võro and Ludian.
Single tokens ”, ’ and ' were attached to the preceding word if there
was no space between them in the original text.
E.g. ”нинту ”” -> ”нинту””, ”ott ’” -> ”ott’”, ”A '” -> ”A'”.
Also, tokens ” and ’ at the end of a word were combined with the next word
if there was no space between them in the original text.
E.g. ”ӈизисибся” дя” -> ”ӈизисибся”дя”.
Tokenization problems fixed for Karelian, Livvi and Veps.
A single token ’ was combined with the preceding and/or next word
if there was no space between them in the original text.
E.g. ”bol ’ ševikkivallankumouš” -> ”bol’ševikkivallankumouš”,
”Uz ’ Zavet” -> ”Uz’ Zavet”, ”rodn ’ astettih” -> ”rodn’astettih”.
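The re-attachment rules listed above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used for the Korp version: the function name `reattach_marks`, its parameters, and the restriction that a following token is joined only when it starts with a letter are all hypothetical. The sketch also assumes that the tokens occur in the original sentence in the same order.

```python
# Hypothetical helper, not part of Korp or the Wanca tooling.
QUOTE_MARKS = frozenset({"”", "’", "'"})


def reattach_marks(original, tokens, marks=QUOTE_MARKS):
    """Merge quote-mark tokens with neighbouring words wherever the
    original sentence had no space between them."""
    merged = []      # tokens after re-attachment
    prev_end = 0     # offset in `original` just past the previous token
    for tok in tokens:
        start = original.find(tok, prev_end)
        if start < 0:
            raise ValueError(f"token {tok!r} not found in original text")
        # Adjacent means: no character separated this token from the last one.
        adjacent = bool(merged) and start == prev_end
        if adjacent:
            prev = merged[-1]
            # Rule 1: a lone mark is attached to the preceding word.
            to_preceding = tok in marks
            # Rule 2: a word ending in a mark is combined with the next word
            # (assumption: "word" means the token starts with a letter).
            to_next = len(prev) > 1 and prev[-1] in marks and tok[:1].isalpha()
            if to_preceding or to_next:
                merged[-1] = prev + tok
            else:
                merged.append(tok)
        else:
            merged.append(tok)
        prev_end = start + len(tok)
    return merged


# Karelian example from the list above; only ’ is treated as a mark here.
print(reattach_marks("bol’ševikkivallankumouš",
                     ["bol", "’", "ševikkivallankumouš"],
                     marks={"’"}))
# prints ['bol’ševikkivallankumouš']
```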
These errors were introduced when the original sentence corpora were adapted to Korp. We have now also published the original source version of Wanca 2016, containing sentence-per-line text files for each language, as well as the VRT version exported from Korp. Both are published under the CC-BY license and are available at the Language Bank Download service.
The creation process of the corpora is described in:
Jauhiainen, Heidi, Tommi Jauhiainen and Krister Lindén. 2019. Wanca in Korp: Text corpora for underresourced Uralic languages. In Jantunen, J. H., Brunni, S., Kunnas, N., Palviainen, S. & Västi, K. (eds.), Proceedings of the Research data and humanities (RDHUM) 2019 conference: data, methods and tools. Oulu: University of Oulu, pp. 21-40 (Studia Humaniora Ouluensia; no. 17).