The GitHub repository contains finite-state source files for the Komi-Permyak language, for building morphological analyzers, proofing tools and dictionaries.
Komi-Permyak belongs to the Permian branch of the Uralic language family. Its closest relatives are Komi-Zyrian and Udmurt; Komi-Zyrian appears to form the northern cluster of a pluricentric Komi language whose varieties are nearly mutually comprehensible. The two Komi language forms are close in many ways and share many of the same paradigmatic relations, as the alignment of their conjugation and declension shows. A geopolitical divide, however, has long separated the two language forms, and certain local particularities of the vernaculars, including the orthography, have become well established.
The Komi-Permyak language has previously been widely used in regular issues of newspapers, journals and readers, and this language awareness and usage is maintained to some extent on the Internet. In 2019, a complete translation of the New Testament was published, a continuation of translation efforts from the 1990s. The language underwent several changes of orthography in the early twentieth century: from Cyrillic to a Latin-based orthography in 1932–1937, and then to a new Cyrillic-based orthography after the purges of 1937–1938. At present, a few letters still cannot be represented in Unicode, namely the Latin letters ‹d›, ‹l›, ‹s› and ‹t› with descender. Hence, descriptive work on the 1932–1937 period requires a workaround with approximate characters.
When Jack Rueter began work on the open-source, finite-state description of Komi-Permyak in 2016, he opted to exploit the striking similarities between the two Komi language forms. He found that simply applying Permyak concatenative suffixes and suffixation to an otherwise Komi-Zyrian lexicon, with a lexicon size of 120,000, would provide nearly 83% pseudo-coverage of Komi-Permyak texts. “Pseudo-” or false-positive readings, however, are not helpful in the description of a language, and so the extensive morphological input of Larisa Ponomareva, PhD (2018–2021), and the lexical work of Enye Lav (2020–present), both native speakers, have been of great importance to the present state of the analyzer.
The Komi-Permyak finite-state analyzer is supported by 579 lexicons and approximately 186,172 lemma-stem pairs. The actual size of the Komi lexicon can be arrived at by subtracting the 94,663 mutually shared proper nouns from the total, which leaves 91,509. The largest sublexica of this remaining Komi lexicon contain 20,501 verbs, 38,798 common nouns and 13,401 adjectives. The analyzer has been used in the development of the manually verified treebanks of the language found in the Universal Dependencies project, such as those in the «UD v2.13» corpora hosted on the Language Bank of Finland Korp server.
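The lexicon arithmetic above can be checked with a few lines of Python. This is only a sketch built from the counts quoted in the paragraph; the variable names are illustrative and do not come from the repository, and the size of the remaining sublexica is derived here rather than reported in the text.

```python
# Counts quoted in the description above (variable names are illustrative).
total_lemma_stem_pairs = 186_172
shared_proper_nouns = 94_663

# Komi lexicon proper: everything except the mutually shared proper nouns.
komi_lexicon = total_lemma_stem_pairs - shared_proper_nouns

# The three largest sublexica named in the text.
largest_sublexica = {
    "verbs": 20_501,
    "common nouns": 38_798,
    "adjectives": 13_401,
}
in_largest = sum(largest_sublexica.values())

# Derived figure (not stated in the text): entries outside the three classes.
in_other_sublexica = komi_lexicon - in_largest

print(f"Komi lexicon without shared proper nouns: {komi_lexicon:,}")
print(f"Covered by the three largest sublexica:   {in_largest:,}")
print(f"Left for all other sublexica:             {in_other_sublexica:,}")
```

The three named word classes together account for roughly four fifths of the non-proper-noun lexicon.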
In preparation for work with the Komi analyzer, a set of figures was compiled for evaluation purposes. It was noted that the New Testament and a previous version of the Gospel of Mark together contained 174,642 tokens and 15,642 unique tokens. When lemmatization was checked, 818 unique forms were missing from the analyzer, of which 206 occurred more than once. Part-of-speech marking failed for 4,616 instances, which would indicate a need for further work in disambiguation. Finally, dependency tagging showed as many as 17,401 ambiguous tokens.
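Figures of this kind (tokens, unique tokens, missing unique forms, and missing forms that occur more than once) can be tallied as in the sketch below. It assumes a hypothetical `is_known` predicate standing in for a lookup against the analyzer; the function name, the predicate and the toy tokens are illustrative only and are neither part of the repository nor real Komi-Permyak data.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def tally_missing(tokens: Iterable[str],
                  is_known: Callable[[str], bool]) -> Dict[str, int]:
    """Summarise token, type and missing-form counts for a tokenised text.

    `is_known` is a hypothetical stand-in for an analyzer lookup: it should
    return True when the analyzer yields at least one reading for the form.
    """
    counts = Counter(tokens)
    # Unique forms for which the (assumed) analyzer lookup finds nothing.
    missing = {form: n for form, n in counts.items() if not is_known(form)}
    return {
        "tokens": sum(counts.values()),
        "unique_tokens": len(counts),
        "missing_unique": len(missing),
        "missing_repeated": sum(1 for n in missing.values() if n > 1),
    }


# Toy data: placeholder strings, not real word forms.
toy_tokens = ["a", "b", "a", "x", "x", "y"]
toy_known_forms = {"a", "b"}

stats = tally_missing(toy_tokens, toy_known_forms.__contains__)
print(stats)
```

Separating repeated missing forms from one-off ones is useful for prioritising lexicon work, since a form that recurs in the text contributes more to token-level coverage once it is added.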
It was determined that the majority of the missing analyses were due to shortcomings in the lexicon and paradigms. This is rather straightforward, as the Biblical texts contain numerous proper nouns not present in other genres of Komi texts. Close scrutiny of the continuation lexica revealed that some of the morphology had been copied directly from the Komi-Zyrian analyzer without corrections for distinguishing features, such as the voiced vs. voiceless opposition between the morpheme-final ‹d› of Komi-Zyrian and the morpheme-final ‹t› of Komi-Permyak.
During the enhancement phase, additions were made to the lexical entries and the network of continuation lexica was extended. At the end of this phase, the number of continuation lexica had risen from 579 to 594, and the number of lemma-stem pairs from approximately 186,172 to 186,741. The number of unique word forms missing from the analyzer had dropped to 359, with 74 unique forms occurring more than once. The number of ambiguous part-of-speech tags dropped to 4,099, and the number of ambiguous dependencies dropped to 15,890.
Future work will take us beyond simple matters of vocabulary and basic paradigms. Specific work will be needed to deal with orthographic variants found in the translation that are not currently part of the spelling norm for the language. The 97% coverage for part-of-speech tagging must be improved, and further analysis is needed to detect false positives in dependency tagging, where 91% of the tokens have a specified dependency. Extended work with parallel Permyak and Zyrian Komi texts might also help to align improvements in both sets of analyzers.
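The coverage percentages quoted above appear to follow from the ambiguity counts reported earlier. The sketch below reproduces them under one assumption that the text does not state explicitly: that the denominator is the 174,642-token total of the evaluation texts and that each failed or ambiguous instance corresponds to one token. The names used are illustrative.

```python
# Token total reported for the New Testament plus the earlier Gospel of Mark.
TOTAL_TOKENS = 174_642

# (before, after) counts of failed or ambiguous instances, as reported.
AMBIGUOUS = {
    "part-of-speech": (4_616, 4_099),
    "dependency": (17_401, 15_890),
}


def coverage_percent(ambiguous: int, total: int = TOTAL_TOKENS) -> float:
    """Share of tokens that are not ambiguous, as a percentage.

    Assumes one ambiguous instance per token (an assumption, see lead-in).
    """
    return 100.0 * (total - ambiguous) / total


for layer, (before, after) in AMBIGUOUS.items():
    print(f"{layer}: {coverage_percent(before):.2f}% -> "
          f"{coverage_percent(after):.2f}% "
          f"({before - after:,} fewer ambiguous tokens)")
```

Under this assumption the post-enhancement values fall between 97% and 98% for part-of-speech tagging and close to 91% for dependency tagging, in line with the figures in the paragraph above; how the source rounded its percentages is not stated.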
The Komi-Permyak analyzer is being enhanced for better coverage in the upcoming publication of the new version of Parallel Biblical Verses for Uralic Studies (PaBiVUS) through the Language Bank of Finland.