The Moksha morphology and tools

The GitHub repository contains finite-state source files for the Moksha language, used for building morphological analysers, proofing tools and dictionaries.

The Moksha language

Moksha belongs to the Mordvin branch of the Uralic language family. Its closest ties are with the Erzya language and especially the Shoksha language form, whose status as an individual language has been contested. Earlier the Mordvin languages, which are geographically adjacent to the Mari languages, were classified as members of the Volga branch of the Uralic languages. Nowadays, it seems, this classification is motivated only by geographic proximity, without a genetic foundation. Of the Mordvin languages, it is Moksha that shares Finnic *ä, keeping it distinct from *e in words such as “käd́” = “käsi” ‘hand; arm’; “ked́” = “kesi” ‘skin’; “veŕ” = “veri” ‘blood’. Like Erzya, it also shares multiple cognates with Finnish and Saami: “śeĺmə” = “silmä” = “čalbmi” ‘eye’; “kargə” = “kurki” = “guorgga” ‘crane’; “vet́ə” = “viisi” = “vihtta” ‘five’.

The Moksha language is actively used in many genres. In addition to regular issues of some newspapers, journals and readers, it can also be found in radio and television news programs and on the Internet. Work on the open-source, finite-state description of Moksha, however, only began in 2012, when Merja Salo and Jack Rueter took it up under the auspices of the “Kone Foundation Language Programme”. While Salo contributed to the extension of lexical work by Aleksandr Feoktistov and Eeva Herrala, and by Osip Polâkov, Rueter strove to copy, where possible, the finite-state description already developed for Erzya. Copying the structure of an existing language model was meant to make subsequent development easier for both languages: work with national corpora for the Mordvin languages, Universal Dependencies projects (see, for example, Uralic UD v2.13) and shallow-transfer rule-based machine translation in Apertium. Jorma Luutonen, Sirkka Saarinen and the helpful people at the University of Turku provided ample opportunity to work on lemmatization in 2020 and, in 2021–2022, on hands-on Constraint Grammar dependencies for both Moksha and Erzya.

The Moksha-language materials in the upcoming version of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2) represent a less documented genre of Moksha. The materials will include the New Testament published in 2016 and a test translation from 1995. The Moksha translator, Valentina Mishanina, is a seasoned writer, and this is quite apparent in the vivid and diverse text.

The finite-state analyzer

The Moksha analyzer provides coverage for an extensive morphology in both verbs and nominals. There are approximately 852 continuation lexica for 189,476 lemma-stem pairs. In addition to a shared set of over 94,000 proper nouns, the lexicon contains approximately 13,200 common nouns, 13,500 verbs and over 11,000 adjectives. As with any language, the noun, verb and adjective lists include loanwords of many varieties: some are written the same as they are in the majority language, while others have undergone varying degrees of integration into the Moksha language. As in Erzya, Zyrian Komi, Permyak Komi and Mansi, an effort is being made to document features of loanword integration into Moksha and at the same time indicate where native vocabulary already exists or is emerging. This documentation will be helpful in language identification and enhancement. The coverage of the analyzer can be observed in Korp materials at the Language Bank of Finland, among others «UD v2.13», «ERME v2», «Uspenskij 4 battles» and «PaBiVUS».

Coverage for Moksha in PaBiVUS

In preparation for the upcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in books of the New Testament (NT). In total there were 166,243 word forms and 20,889 unique tokens, of which 2,015 were unique missing word forms. Of the missing unique forms, 366 appeared more than once. In addition, 2,754 words were ambiguous for part-of-speech tagging, and 13,816 tokens had ambiguous dependency tagging.
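The kind of count described above can be illustrated with a minimal Python sketch. It is not the evaluation script actually used for PaBiVUS: the function name `coverage_report`, the `is_known` lookup and the toy tokens are all hypothetical stand-ins, and a real run would replace `is_known` with a lookup against the compiled Moksha analyzer.

```python
from collections import Counter

def coverage_report(tokens, is_known):
    """Count word forms, unique forms, and forms the analyzer misses.

    `is_known` is a hypothetical stand-in for an analyzer lookup:
    it takes a word form and returns True if an analysis exists.
    """
    counts = Counter(tokens)
    # Unique forms without any analysis, with their frequencies
    missing = {form: n for form, n in counts.items() if not is_known(form)}
    return {
        "total_tokens": sum(counts.values()),
        "unique_tokens": len(counts),
        "unique_misses": len(missing),
        # Missing forms that are not hapax legomena
        "misses_more_than_once": sum(1 for n in missing.values() if n > 1),
    }

# Toy data only; placeholder strings, not real Moksha word forms
toy_lexicon = {"a", "b", "c"}
report = coverage_report(["a", "b", "a", "x", "x", "y"], toy_lexicon.__contains__)
print(report)
```

Separating missing forms that recur from those that occur only once gives a rough priority list, since recurring misses are more likely to reflect gaps in the description than one-off spellings.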

Moksha materials

New Testament 2016; Gospel of John 1901; Gospel of Mark 1995.

Total tokens: 201,579
Total words: 158,742
Total characters (from words): 2,012,818
Unique tokens: 20,853
Date of attestation: 2024-07-24
Unique misses: 2,032
Number of lines before hapax: 366
Lacking unambiguous PoS: 2,754
Lacking unambiguous dependency: 13,816
Size of lexicon.lexc: 189,476
Number of LEXICONs: 852
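
As a rough illustration of what these figures imply, the following Python sketch derives a type-level coverage rate from the listed values. It assumes that "number of lines before hapax" counts the missing forms that occur more than once; the variable names are chosen here for illustration only.

```python
# Figures copied from the list above (attested 2024-07-24)
unique_tokens = 20_853
unique_misses = 2_032
misses_before_hapax = 366  # assumed: missing forms occurring more than once

# Share of unique word forms that receive at least one analysis
type_coverage = (unique_tokens - unique_misses) / unique_tokens

# Missing forms that occur only once in the material
hapax_misses = unique_misses - misses_before_hapax

print(f"type coverage: {type_coverage:.2%}")
print(f"missing forms occurring only once: {hapax_misses}")
```

Under these assumptions, roughly nine out of ten unique word forms are recognized, and most of the remaining misses are forms attested only once.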

Observations

The oldest version of the Gospel of John was written in a consistent orthography, which was readily normalized to the modern standard language. The description of the modern language, it seems, requires additional work in suffix and linking vowel variation. Linking vowel variation is present in words with specific consonant clusters followed by a word-final schwa /ə/ written as ‹а› or ‹е›. In nominals, the presence-absence alternation of a linking vowel before the plural marker or locative cases (Illative, Inessive or Elative) may require additional statistical work for predicting which variant is more prominent.

New combinations have been discovered whose description may require renewed approaches to the concept of mood in verbs. Moksha and Erzya are described as having a conditional mood that may further combine with a conjunctional (aka subjunctional) mood. In the most recent translation of the New Testament in Moksha, it should be noted, an additional form combining conditional + optative has been found. This discovery may spark a renewed evaluation of the Mordvin Conditional, which, in fact, is a protasis marker with an approximate meaning ‘if’, as discussed by Petar Kehayov. Should the protasis marker be dealt with as derivational morphology?

Future work with the analyzers

Although the Moksha analyzer now has better coverage than before, there is still work to be done in the declension of noun modifiers in instances of ellipsis. Unlike in German, Russian and Finnish, Moksha adjectives and numerals do not decline as noun modifiers. If the head noun is dropped, however, the right-most premodifier in a noun phrase becomes the locus of case, number, possessor and definiteness marking. This phenomenon also affects nouns modifying nouns if they are in the Genitive, Inessive, Elative or Abessive case. This latter subgroup has often been dealt with separately under the term «secondary declension». As in the most recent fieldwork-driven article by Mariia Privizentseva, the description of noun modifier declension will treat adjectives, numerals and declined nouns as representative of one phenomenon.
