
Current state of Uralic transducers used in the annotation of the PaBiVUS version 2 corpora

The evaluation of finite-state transducers requires test corpora representative of the individual languages. One variety of test corpora has been found in the well-curated Bible translations for many of the smaller Uralic languages, thanks to the Institute for Bible Translation in Helsinki[1]. We look forward to their continued success in translation and to collaborating with them in the future. Hopefully, extensions in this genre can also draw on the work of the ‘Finnish Bible Society’[2]. Other curated genres will prove vital for broader coverage and more extensive evaluation in the future. The better the analyzer, the better the tools for language invigoration.

As of December 10, 2024, improvements have been made to the individual analyzers, and the lexica and shared proper-noun files have become more extensive. The most pronounced upgrade can be attributed to the Veps-language analyzer: whereas over 8,000 wordforms previously went unrecognized, corrections to the paradigmatic descriptions and word lists have reduced this number to approximately 2,500. The shared proper-noun file for Uralic languages written in Cyrillic script has grown in size from approximately 94,000 to over 145,000. This dramatic leap can be attributed to the addition of place names of the Russian Federation[3] and over 1,000 Pokémon names from Bulbapedia[4].
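The count of unrecognized wordforms mentioned above can be illustrated with a minimal Python sketch. It is not the evaluation script used for PaBiVUS; the function names, the toy lexicon and the sample tokens are all hypothetical, and the analyzer is passed in as a plain function that returns a list of analyses (empty when the wordform is unknown).

```python
from collections import Counter
from typing import Callable, Iterable, List

# Any function that maps a wordform to a list of analyses (empty if unknown).
Analyzer = Callable[[str], List]


def unrecognized_wordforms(tokens: Iterable[str], analyze: Analyzer) -> Counter:
    """Count, per wordform, the tokens for which the analyzer returns nothing."""
    missing: Counter = Counter()
    for token in tokens:
        if not analyze(token):
            missing[token] += 1
    return missing


def token_coverage(tokens: Iterable[str], analyze: Analyzer) -> float:
    """Share of running tokens that receive at least one analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missing = unrecognized_wordforms(tokens, analyze)
    return 1.0 - sum(missing.values()) / len(tokens)


# Hypothetical stand-in for a real transducer: a two-item lexicon.
TOY_LEXICON = {"aaa", "bbb"}


def toy_analyze(wordform: str) -> List[str]:
    return ["toy-analysis"] if wordform in TOY_LEXICON else []


if __name__ == "__main__":
    sample = ["aaa", "bbb", "aaa", "zzz"]
    missing = unrecognized_wordforms(sample, toy_analyze)
    print(len(missing), "unrecognized wordform type(s)")
    print(f"token coverage: {token_coverage(sample, toy_analyze):.2f}")
```

With UralicNLP installed and the relevant models downloaded, a function such as `lambda w: uralicApi.analyze(w, "vep")` could presumably be passed in place of `toy_analyze`; this is an assumption about the setup, not a description of how the figures above were produced.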

Not all Uralic languages were covered in this evaluation. First, majority languages with their own state were not included. Second, only languages with a subcorpus to be published in Parallel Bible Corpora for Uralic Studies (PaBiVUS) version 2[5] were evaluated. Finally, languages with minimal appropriate descriptions, such as the pluricentric language Khanty, were not discussed. Thus, there are enhancements to be made at many levels, and they will, in turn, be evaluated at a later point in time when Saamic, Samoyedic and additional Balto-Finnic languages are incorporated into the PaBiVUS collection.

During the evaluation period, it became apparent that the dictionary component used in the lexica for the transducers at GiellaLT might also be made available for other open-source development. To this end, the “Verdd” dictionary editing platform[6] has been enhanced to offer not only comma-separated-value (CSV) and lexc downloads (the latter vital for analyzers), but also bilingual downloads in various formats. These formats include an XML download for the GiellaLT click-in-text dictionaries, a Bidix download for Apertium[7], the open-source shallow-transfer machine translation platform, and pivot translation predictions for language pairs from RootRoo[8].
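To give a rough idea of how tabular dictionary data relates to lexc, the following Python sketch turns CSV rows into lexc entries of the form `upper:lower ContinuationClass ;`. It is not Verdd's export code: the column names (`lemma`, `stem`, `contlex`), the lexicon name and the sample rows are hypothetical, and the list of characters escaped with `%` is deliberately non-exhaustive.

```python
import csv
import io

# Characters escaped with '%' in this sketch (a non-exhaustive assumption).
LEXC_SPECIAL = set(" :;!<>%")


def escape_lexc(text: str) -> str:
    """Prefix each special character with '%' so lexc reads it literally."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def csv_to_lexc(csv_text: str, lexicon_name: str) -> str:
    """Build one lexc sublexicon from CSV text.

    Expects the hypothetical columns 'lemma', 'stem' and 'contlex'.
    """
    lines = [f"LEXICON {lexicon_name}"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        upper = escape_lexc(row["lemma"])
        lower = escape_lexc(row["stem"])
        lines.append(f"{upper}:{lower} {row['contlex']} ;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_csv = "lemma,stem,contlex\nabc,ab,N_TOY\na b,a b,N_TOY\n"
    print(csv_to_lexc(toy_csv, "ToyNouns"), end="")
```

Real GiellaLT lexica carry tags, glosses and comments that this sketch leaves out; it only shows why a row-per-lemma table maps naturally onto lexc entries.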

Follow our progress on GiellaLT[9], Verdd and in the UralicNLP Python[10], Java[11] and .NET libraries.

 

[1] Institute for Bible Translation in Finland https://www.rki.fi/

[2] The Finnish Bible Society [Suomen Pipliaseura] works with translation of the Bible; this includes work with the Saami and Karelian languages. https://www.piplia.fi/mita-teemme/muutosvoimana-oma-kieli/raamattu-vahemmistokielille/

[3] Lists of regions, cities, and towns in Russia: https://commons.wikimedia.org/wiki/Category:Cities_in_Russia

[4] Names of Pokémon creatures in Russian, most of which are transcriptions of the English version: https://bulbapedia.bulbagarden.net/wiki/Main_Page

[5] http://urn.fi/urn:nbn:fi:lb-2023030902

[6] A multilingual database with development and download possibilities for work with GiellaLT and Apertium shallow-transfer machine translation https://akusanat.com/verdd

[7] https://wiki.apertium.org/wiki/Main_Page

[8] https://rootroo.com/en/

[9] https://giellalt.github.io/

[10] https://github.com/mikahama/uralicNLP

[11] https://github.com/mikahama/uralicNLP-Java

Last modified on 2024-12-19
