The Eastern Mari (Meadow and Eastern Mari) morphology and tools

The GitHub repository contains finite-state source files for the Eastern Mari language, used for building morphological analysers, proofing tools and dictionaries.

The Meadow Mari aka Eastern Mari language

The Mari literary languages are represented by a two-way split into the majority Meadow and Eastern Mari language, on the one hand, and the Hill Mari or Western Mari language, on the other. The term «Meadow Mari» is a translation of the endonym for the geographically central part of the Mari continuum, adjacent to «Hill Mari» in the west, and the term «Eastern Mari» refers to the geographically eastern segment of the majority Mari continuum. Hence, the terms «Meadow Mari» and «Eastern Mari», although indicating different parts of the Mari language continuum, are often used interchangeably, whereas the terms «Hill Mari» (endonym translation) and «Western Mari» are simply two names for the same language form. The split between Meadow & Eastern Mari versus Hill Mari is seen by some as a division in the continuum of a single language, despite the presence of lexical, vowel harmony and inflectional distinctions that might suggest otherwise. Although the Meadow & Eastern Mari language shares many etymologically related words with Finnish and Hungarian, the correspondences are not obvious enough to recognize immediately, e.g., “kid” = “käsi”, “kéz” ‘hand; arm’; “vür” = “veri”, “vér” ‘blood’; “šinča” = “silmä”, “szem” ‘eye’.

Meadow Mari is used in regular issues of newspapers, journals and readers, as well as in radio and television programs. Mari underwent one great change in its orthography in the latter part of the 1930s. Until then, the principle had been one sound, one letter. With the change, the Cyrillic spelling of Mari no longer gave non-speakers a clue as to the pronunciation of words, and the loanwords could readily be recognized as such. The stage was being set for a further influx of internationalisms and Russian loans, but, even today, Mari words are actively used and created.

Only a few years after Kimmo Koskenniemi produced a finite-state description of the Finnish language in the early 1980s, Jorma Luutonen began work on a finite-state description of the Meadow Mari language. This work was recognized by Trond Trosterud, who initiated parallel work for Saami languages at Giellatekno in Tromsø, Norway. By the end of the 1990s, Trosterud was converting the original Latin-based code to Unicode. At the beginning of the new millennium, work on Meadow Mari was being conducted by yet a third programmer, Jeremy Bradley, in Vienna, Austria. In the second decade, Trosterud, Luutonen and Bradley joined together with Jack Rueter and the prolific linguists Alexandra Simonenko and Anna Volkova to merge the finite-state descriptions developed in Turku, Tromsø and Vienna. Trosterud, Simonenko and Volkova worked specifically with the syntax and considered its portability to Hill Mari as well.

The finite-state analyzer

The present analyzer for Meadow Mari is extensive. There are 438 lexicons and approximately 156,490 lemma and stem pairs in the finite-state analyzer. Over 94,600 of the total come from a shared proper nouns file, and 4,343 come from a list of Meadow Mari proper nouns. Additionally, there are approximately 25,881 common nouns, 10,305 adjectives and 7,643 verbs. In Yoshkar-Ola, Andrei Chemyshev went to great lengths to ensure both extensive vocabularies and Mari-language data, which were used in development.
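Counts like the ones above can be approximated by scanning a lexc source file. The following is a minimal sketch, not the procedure actually used for the reported figures: the sample fragment, its placeholder entries and the function name `count_lexc` are all hypothetical, and the comment handling is deliberately naive.

```python
# Hypothetical miniature lexc fragment; the entries are placeholders,
# not real Meadow Mari data.
SAMPLE_LEXC = """\
! comment line
Multichar_Symbols
+N +Sg

LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
lemma1:stem1 N_INFL ;
lemma2:stem2 N_INFL ; ! trailing comment
lemma3 N_INFL ;

LEXICON Verbs
lemma4:stem4 V_INFL ;

LEXICON N_INFL
+N+Sg: # ;

LEXICON V_INFL
# ;
"""


def count_lexc(text):
    """Return (number of LEXICON declarations, number of lemma:stem entries)."""
    lexicons = 0
    pairs = 0
    for raw in text.splitlines():
        # Naive comment stripping: everything after "!" is dropped
        # (an escaped "%!" is not handled in this sketch).
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("LEXICON "):
            lexicons += 1
            continue
        if not line.endswith(";"):
            continue  # e.g. the Multichar_Symbols section
        fields = line[:-1].split()
        # An entry counts as a pair when its first field is "upper:lower",
        # both sides are non-empty, and the upper side is not a bare tag.
        if len(fields) >= 2 and ":" in fields[0]:
            upper, lower = fields[0].split(":", 1)
            if upper and lower and not upper.startswith("+"):
                pairs += 1
    return lexicons, pairs


print(count_lexc(SAMPLE_LEXC))
```

Entries without a colon (where lemma and stem coincide) and tag-only inflection entries are left out of the pair count here; whether the reported total includes them is not stated in the text.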

Coverage for Meadow Mari in PaBiVUS

In anticipation of the forthcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in the New Testament (2007). The total number of words was approximately 127,460, with 15,250 unique forms. Of the unique forms, 472 were not recognized, and 129 of these unrecognized forms occurred more than once.
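As a rough illustration, the type-level coverage implied by these figures can be worked out as follows. This is a sketch under one assumption: that the 129 forms occurring more than once are a subset of the 472 unrecognized forms, so the remainder are hapax legomena. The function and variable names are illustrative only.

```python
def type_coverage(unique_forms, unique_misses):
    """Share of unique word forms that the analyzer recognized."""
    return (unique_forms - unique_misses) / unique_forms


unique_forms = 15_250    # unique word forms in the New Testament (2007)
unique_misses = 472      # unique forms not recognized
repeated_misses = 129    # assumed: unrecognized forms occurring more than once

coverage = type_coverage(unique_forms, unique_misses)
hapax_misses = unique_misses - repeated_misses

print(f"type coverage: {coverage:.1%}")
print(f"unrecognized forms occurring only once: {hapax_misses}")
```

This is coverage over unique forms only. Token-level coverage would also require the frequency of each unrecognized form, which is not given here.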

Meadow Mari materials

New Testament 2007
total tokens: 150,503
total words: 127,640
total characters (from words): 127,640
unique words: 15,250
Beginning: (2024-08-20)
unique misses: 472
number of lines before hapax: 129
Lacking unambiguous PoS: 1023
Lacking unambiguous dependency: 28,558
Size of lexicon.lexc: 156,490
Number of LEXICONs: 438

While evaluating the coverage, it was noted that the Biblical texts had low coverage of associated proper nouns, but also that many of the languages written in Cyrillic letters share the same names in present or historical translations of the Bible. In the long run, this part of the vocabulary might be shared, but it would definitely entail extra work to curtail an overflow of names inflecting according to foreign name templates.

Future work with the analyzers

Even though the analyzer recognized a high percentage of the text, there are still parts of the lexicon, inflection and, naturally, syntax to deal with. We can look forward to a third version of PaBiVUS featuring the entire Bible in Meadow Mari. This will mean the introduction of additional proper nouns and nominal inflection, some additional solutions to disambiguation, and an enhanced understanding of variation in the ordering of the categories of number, case and possessive person, a feature previously addressed by Jorma Luutonen and still quite distinctive in the Mari languages. Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.