This GitHub repository contains finite-state source files for the Western Mari language, for building morphological analysers, proofing tools and dictionaries.
Hill Mari, also known as Western Mari, belongs to the Mari branch of the Uralic language family. Its closest relative is Meadow Mari, also known as Eastern Mari. The Mari languages, which are geographically adjacent to the Mordvin languages, were earlier classified together with them as members of a Volga branch of the Uralic languages. Nowadays this classification appears to be motivated only by geographic proximity, without a genetic foundation. Hill Mari shares many cognates with the Mordvin languages:
“kid” = Moksha “käd́” = Finnish “käsi” ‘hand; arm’; “vÿr” = Moksha “veŕ” = Finnish “veri” ‘blood’; “vÿc” = Moksha “vet́ə” = Finnish “viisi” ‘five’.
The Hill Mari language was reborn and actively invigorated with the rise of the Internet. This involved Valeri Alikow, a notable translator and writer who lived in Estonia and Finland from the late 1980s to the mid 2010s, as well as Julia Kuprina and young people living in the Mari Republic. Among other things, this has meant a good presence in Wikipedia and translations of a few prominent Finnish and Estonian authors, but also of children’s books, most recently “The Little Prince”.
Work on the open-source, finite-state description of Hill Mari, however, was only begun in 2012 by Julia Kuprina and Jack Rueter under the auspices of the “Kone Foundation Language Programme”. Kuprina contributed to the extension of lexical work by A. A. Savatkova and K. G. Yuadarov while actively working with Alikow. Rueter’s work was also highly influenced by Alikow. Only after two years of constructing the model was it compared to the separately developed Meadow Mari description. In the late 2010s, the Hill Mari analyzer was complemented with shared tools for both Mari literary languages, implemented by Alexandra Simonenko, Anna Volkova, Jeremy Bradley, Jack Rueter and Trond Trosterud, and further Meadow Mari to Hill Mari vocabulary work was contributed by Andrei Chemyshev.
The Hill Mari-language materials in the upcoming version of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2) represent a less documented genre of Hill Mari. The materials will include the New Testament published in 2014.
The Hill Mari analyzer provides coverage for an extensive morphology in both verbs and nominals. There are approximately 543 continuation lexica for 173,284 lemma-stem pairs. In addition to a shared set of over 94,000 proper nouns, the lexicon contains approximately 55,000 more proper noun forms, 11,975 common nouns, 3,500 verbs and 5,472 adjectives. Due to what appears to be an irregular vowel shift, developing shared vocabularies for the Western and Eastern Mari languages is still a challenge. The coverage afforded by the analyzer can be observed in Korp materials at the Language Bank of Finland, among them «Uspenskij 4 battles» as well as the forthcoming «Pavlik Morozov» and «PaBiVUS v2».
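Counts such as the number of continuation lexica can be approximated directly from lexc source text. The following is a minimal sketch, not the project's own tooling: it assumes standard lexc conventions (a lexicon opens with a LEXICON line, entries end in a semicolon, and an unescaped "!" starts a comment). The sample text, including the continuation lexicon name N_INFL, is hypothetical and is not taken from the repository.

```python
import re


def count_lexc(text):
    """Return (number of LEXICON declarations, number of entry lines)."""
    lexicons = 0
    entries = 0
    in_lexicon = False
    for raw in text.splitlines():
        # In lexc, "!" starts a comment unless it is escaped as "%!".
        line = re.sub(r"(?<!%)!.*", "", raw).strip()
        if not line:
            continue
        if line.startswith("LEXICON "):
            lexicons += 1
            in_lexicon = True
        elif in_lexicon and line.endswith(";"):
            # Anything ending in ";" inside a lexicon is counted as one entry.
            entries += 1
    return lexicons, entries


# Toy lexc fragment for illustration only (hypothetical lexicon names).
SAMPLE = """\
Multichar_Symbols
+N +Sg

LEXICON Root
Nouns ;            ! continue to the noun lexicon

LEXICON Nouns
kid:kid N_INFL ;   ! 'hand; arm'
vÿr:vÿr N_INFL ;   ! 'blood'

LEXICON N_INFL
+N+Sg:0 # ;
"""

lexicon_count, entry_count = count_lexc(SAMPLE)
print(lexicon_count, entry_count)
```

A count obtained this way need not match the figures reported here exactly, since the sketch counts every entry line and does not distinguish lemma-stem pairs from other entries.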
In preparation for the upcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in the books of the New Testament. All in all, there was a total of 63,022 word forms, of which 6,906 were unique; 5,066 of these unique word forms were missing from the analyzer. There were 2,247 missing unique forms that appeared more than once, 19,119 words were ambiguous for part-of-speech tagging, and 36,098 tokens had ambiguous dependency tagging.
New Testament 2014
total words: 63,022
total characters (from words): 329,606
unique words: 6,906
date of attestation: (2024-09-01)
unique misses: 5,066
number of lines before hapax: 2,247
Lacking unambiguous PoS: 19,119
Lacking unambiguous dependency: 36,098
Size of lexicon.lexc: 173,284
Number of LEXICONs: 543
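The figures above can be turned into rates with a few lines of arithmetic. This is a sketch over the reported numbers only; it assumes that "number of lines before hapax" means the missing unique forms that occur more than once, which is how the running text describes the figure.

```python
# Figures as reported for the New Testament 2014 evaluation.
total_words = 63_022
total_chars = 329_606
unique_words = 6_906
unique_misses = 5_066
misses_before_hapax = 2_247  # assumed: missing forms seen more than once
no_unambiguous_pos = 19_119
no_unambiguous_dep = 36_098


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


rates = {
    # Share of unique word forms the analyzer did not recognise.
    "unique_miss_rate": pct(unique_misses, unique_words),
    # Share of running words without an unambiguous PoS or dependency tag.
    "pos_ambiguity_rate": pct(no_unambiguous_pos, total_words),
    "dep_ambiguity_rate": pct(no_unambiguous_dep, total_words),
    # Missing forms that occur only once (hapax legomena among the misses).
    "hapax_misses": unique_misses - misses_before_hapax,
    # Average word length in characters.
    "mean_word_length": round(total_chars / total_words, 2),
}

for name, value in rates.items():
    print(f"{name}: {value}")
```

Read this way, roughly three quarters of the unique word forms in this text are missing, and more than half of those missing forms occur only once.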
The Hill Mari description is still wanting in lexicon and perhaps in extensions to the inflectional paradigms. The lexicon, as can be discerned above, lacks a majority of the unique word forms used in the Biblical genre. These include proper names, on the one hand, but also a high number of direct loanwords, on the other. Words of this kind had not been considered earlier in the documentation of Hill Mari. The relatively low number of verbs in the description, fewer than four thousand, would intuitively point to the need for further work on verbs as well.
Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.
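Readers who want to track coverage on their own texts could compute the same kind of summary as in the evaluation above. The sketch below is an illustration, not the procedure actually used: the analyzer is passed in as any callable that returns a (possibly empty) list of analyses, and a toy stand-in is used here so that the example runs without any models installed. With the UralicNLP Python library, one would presumably pass something like a wrapper around uralicApi.analyze(word, "mrj") after downloading the Hill Mari models; that pairing is an assumption and is not exercised here.

```python
from collections import Counter


def coverage_report(tokens, analyze):
    """Summarise analyzer coverage over a list of word forms.

    `analyze` is any callable that returns a list of analyses for a word,
    with an empty list meaning the word form is not recognised.
    """
    counts = Counter(tokens)
    missing = {word: n for word, n in counts.items() if not analyze(word)}
    return {
        "total_words": sum(counts.values()),
        "unique_words": len(counts),
        "unique_misses": len(missing),
        "misses_seen_more_than_once": sum(1 for n in missing.values() if n > 1),
    }


# Toy stand-in analyzer for illustration only (hypothetical, not a real model).
KNOWN_FORMS = {"kid", "vÿr"}


def toy_analyze(word):
    return ["<analysis>"] if word in KNOWN_FORMS else []


report = coverage_report(["kid", "vÿr", "kid", "xyz", "xyz", "abc"], toy_analyze)
print(report)
```

Keeping the analyzer as a parameter means the same report can be produced for any analyzer version, which makes it easy to compare coverage before and after lexicon work.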