The GitHub repository contains finite-state source files for the Udmurt language, used for building morphological analysers, proofing tools and dictionaries.
Udmurt belongs to the Permian branch of the Uralic language family. Its closest relatives are Komi-Permyak and Komi-Zyrian, which some regard as mutually intelligible with each other but which are not mutually intelligible with Udmurt. Like its Komi siblings, Udmurt shares many etymologically related words with Finnish and Hungarian, although the relationships are not obvious enough to recognize immediately, e.g. (Udmurt = Finnish, Hungarian), “ki” = “käsi”, “kéz” ‘hand; arm’; “vir” = “veri”, “vér” ‘blood’; “śin” = “silmä”, “szem” ‘eye’; “vu” = “vesi”, “víz” ‘water’.
Udmurt is used in regularly published newspapers, journals and readers, as well as in radio and television broadcasts. There is a complete translation of the New Testament from the end of the last century (1997) and of the entire Bible from the second decade of this millennium (2013). Unlike many Biblical translations, the Udmurt New Testament and Bible are the work of an individual native speaker and writer versed in linguistics, folklore and the Holy Scriptures. The finite-state description of the language, however, was only begun in 2006 by Trond Trosterud and then continued in the next decade by Ryan Johnson and Jack Rueter, with much collaboration from István Kozmács and numerous native-speaker researchers. Much-needed help with the vocabulary in dictionary format has also come from Sirkka Saarinen and Sergei Maksimov.
The analyzer is currently relatively extensive. The finite-state analyzer contains 180 lexicons and approximately 196,925 lemma and stem pairs. Of these, 94,663 come from a shared proper-nouns file, which has also been enhanced with proper nouns from Udmurt texts. The lexicon contains over 42,000 nouns, 12,900 adjectives and 12,300 verbs, a ratio of nominals to verbs that contrasts sharply with what is found in the Komi-Zyrian analyzer. So far, the finite-state analyzers have been used in the annotation of the Uspenskij parallel texts hosted on the Language Bank of Finland Korp server.
In anticipation of the forthcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in the New Testament (NT 1997) and Bible (2013), so some books are represented by more than one version. The total number of tokens, i.e., words and punctuation marks, was 786,201, with 48,696 unique word forms. 13,196 unique forms were not recognized, and a total of 6,047 unique forms occurred more than once.
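A minimal sketch of how counts of this kind could be collected from a tokenized text. It is not the project's evaluation script: the function name `evaluate_coverage`, the `is_recognized` callback and the toy word list are all hypothetical, and the callback merely stands in for a lookup in the finite-state analyzer.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def evaluate_coverage(tokens: Iterable[str],
                      is_recognized: Callable[[str], bool]) -> Dict[str, int]:
    """Summarize type-level coverage for an already tokenized text."""
    # Keep only tokens with at least one letter or digit, so that
    # punctuation marks are not counted as words.
    counts = Counter(t for t in tokens if any(ch.isalnum() for ch in t))
    # Unique forms the (stand-in) analyzer does not recognize.
    misses = {form: n for form, n in counts.items() if not is_recognized(form)}
    return {
        "total_words": sum(counts.values()),
        "unique_forms": len(counts),
        "unique_misses": len(misses),
        # Unrecognized forms that are not hapax legomena.
        "misses_seen_more_than_once": sum(1 for n in misses.values() if n > 1),
    }


# Toy stand-in for the analyzer: a hypothetical set of known forms
# (transliterated examples) and placeholder unknown tokens.
known = {"vu", "vir", "ki"}
sample = ["vu", ",", "vir", "vu", "unk1", "unk1", "unk2", "."]
report = evaluate_coverage(sample, lambda form: form in known)
print(report)
```

In practice the lambda would be replaced by a call into the compiled analyzer; unrecognized forms that occur more than once are the natural first candidates for lexicon additions.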
New Testament 1997, Bible 2013
total tokens: 786,201
total words: 666,789
total characters (from words): 8,077,454
unique word forms: 48,696
Beginning (2024-07-01):
unique misses: 15,902
number of lines before hapax: 7,535
Lacking unambiguous PoS: 80,637
Lacking unambiguous dependency: 202,272
Size of lexicon.lexc: 196,925
Number of LEXICONs: 180
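The figures above can be turned into rough coverage percentages. The following sketch assumes that both miss counts (15,902 at the start and the 13,196 unrecognized forms mentioned in the text) refer to the same 48,696 unique word forms; the source does not state this explicitly. The result is type-level coverage (share of unique forms recognized), not token-level coverage.

```python
# Figures copied from the statistics above.
total_tokens = 786_201
total_words = 666_789
unique_forms = 48_696
misses_at_start = 15_902   # "unique misses" listed under 2024-07-01
misses_reported = 13_196   # unrecognized unique forms given in the text


def type_coverage(unique: int, misses: int) -> float:
    """Share of unique word forms that received at least one analysis."""
    return 1 - misses / unique


# Tokens that are not words (punctuation marks and the like).
non_word_tokens = total_tokens - total_words

print(f"non-word tokens: {non_word_tokens:,}")
print(f"coverage at start: {type_coverage(unique_forms, misses_at_start):.1%}")
print(f"coverage reported: {type_coverage(unique_forms, misses_reported):.1%}")
```

Under these assumptions, type-level coverage rises from roughly 67.3% to roughly 72.9%.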
The finite-state description of Udmurt requires enhancements in many dimensions, ranging from lexica to morphosyntax. While evaluating the coverage, it was noted that the Biblical texts had low coverage of associated proper nouns as well as of pair-verb and collective-noun constructions, as was the case in Komi-Zyrian. Lexical additions might be made using the word form documentation in a Hunspell data set obtained from Enye Lav, and work by Timothy Arkhangelsk. Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.