Minority languages are often extremely poor in refined, usable resources. Every language has the potential, but not all languages have been studied well enough to render their resources predictable or comprehensible. The smaller the number of language users, the less likely it is that the raw language resources have been recognized and subsequently rendered useful in natural language processing (NLP).
When we consider all the neural network research, and NLP in general, carried out on the majority languages of the world, we must see a continuum of refinement. English-language NLP, for instance, is fed by researchers around the world, whereas for other languages research and NLP are almost entirely neglected. One way of speeding up the discovery of a language’s resources is through rule-based descriptions.
Rule-based discovery can be developed on the basis of observable variation in a given language. The development of a rule-based description can be depicted as a learning curve comparable to that of a young child. At the age of four, the child says: walks, walked; drinks, drinked. By seven, however, there is sophistication: the array «walks, walked» is retained, but «drinks, drank, drunk», «swim, swam, swum» emerge, even «brings, brang, brung». Later, there may be further refinement, which teaches the child «brings, brought, brought». In stage one, all verbs are regular. Stage two shows the recognition of smaller groups of regularity (i : a : u before nasals); other small groups may also be recognized. Subsequently, there comes a time for refining the exceptions to these two initial stages.
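The three stages can be sketched as layered rules, where each later stage is consulted before the earlier, more general one. The following Python fragment is only an illustration under simplifying assumptions: the function name `past_tense`, the one-word exception lexicon and the regular expression for "i before a nasal" are hypothetical and are not taken from any existing analyser.

```python
import re

# Stage 3: lexical exceptions, consulted first (hypothetical mini-lexicon).
EXCEPTIONS = {"bring": "brought"}

# Stage 2: a small subregular group, i -> a before a stem-final nasal
# (n, m, ng, nk). This pattern is an assumption made for illustration.
SUBREGULAR = re.compile(r"^(.*)i(ng|nk|n|m)$")


def past_tense(lemma: str, stage: int) -> str:
    """Return the past tense a learner at stage 1, 2 or 3 would produce."""
    # Stage 3 knowledge overrides everything else.
    if stage >= 3 and lemma in EXCEPTIONS:
        return EXCEPTIONS[lemma]
    # Stage 2 knowledge overrides the default rule.
    if stage >= 2:
        match = SUBREGULAR.match(lemma)
        if match:
            return match.group(1) + "a" + match.group(2)
    # Stage 1: every verb is treated as regular.
    return lemma + "ed"


if __name__ == "__main__":
    for verb in ["walk", "drink", "swim", "bring"]:
        print(verb, [past_tense(verb, stage) for stage in (1, 2, 3)])
```

At stage one the sketch yields «drinked», at stage two «drank» and «brang», and only at stage three «brought», while «walked» stays the same throughout. The ordering of the checks, from most specific to most general, is the design choice that lets refinements be added without rewriting the earlier rules.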
In languages such as English, regular inflection involves only a few word forms. For verbs, this might be four or five: walk, walks, walking, walked; swim, swims, swimming, swam, swum. These, as well as nouns, might simply be listed, i.e., the number of word forms will not rise any higher than five times the number of lemmas (walk, swim). In languages with richer morphology, the length of such lists grows very quickly. Finnish, for example, easily generates ten times more verb forms than English simply by inflecting for six persons and the passive in various tenses and moods, and by alternating between question and statement strategies. This is where rule-based, finite-state descriptions come in.
Rule-based, finite-state description of regular inflection with regularly associated semantics makes it possible, even desirable, to list 1000 lemmas with 100 shared forms, so that a list of 100,000 forms might be crunched to close to 1100 entries (1000 lemmas plus the 100 shared forms). Even an English-language form list could be crunched from approximately 4000 to nearly 1004, but, of course, neither of the language lists includes combinations of separate words. Hence, there is no equivalence between the Finnish 100,000 and the English 4000. These are just enumerations of possible morphological combinations within each language.
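The arithmetic behind these figures can be checked with a short Python sketch. It assumes a fully regular toy system in which every lemma combines with every shared ending; the function names and the three-verb sample are hypothetical and serve only to show why a list that grows as a product can be stored as a sum.

```python
def listed_size(lemmas: int, forms_per_lemma: int) -> int:
    """Entries needed when every inflected form is listed separately."""
    return lemmas * forms_per_lemma


def factored_size(lemmas: int, forms_per_lemma: int) -> int:
    """Entries needed when lemmas and shared endings are each listed once."""
    return lemmas + forms_per_lemma


def expand(stems, endings):
    """Regenerate the full form list from the two short lists."""
    return [stem + ending for stem in stems for ending in endings]


# Toy data (hypothetical): three regular English verbs, four shared endings.
STEMS = ["walk", "talk", "jump"]
ENDINGS = ["", "s", "ing", "ed"]

if __name__ == "__main__":
    forms = expand(STEMS, ENDINGS)
    print(len(STEMS) + len(ENDINGS), "entries generate", len(forms), "forms")
    print(listed_size(1000, 100), "->", factored_size(1000, 100))
    print(listed_size(1000, 4), "->", factored_size(1000, 4))
```

In the toy sample, seven entries regenerate twelve forms; scaled up, the same relation gives 100,000 against 1100 and 4000 against 1004. A real finite-state description additionally has to handle stem alternations and irregular items, which this sketch leaves out.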
Many of the languages related to Finnish, Karelian, Saami and Estonian have complex morphological systems. This complexity is readily described using finite-state morphology, where rule-based technologies help to further condense the expression of seemingly endless variation, with over fifty forms per word. It therefore comes as no surprise that an algorithm to cope with this complexity was developed for Finnish by Kimmo Koskenniemi and subsequently by Helsinki Finite-State Technologies (HFST). This algorithm has now been implemented for work with the Saami languages, among others, in Norway (GiellaLT < Giellatekno, Divvun).
The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to make use of technological solutions that might otherwise require several years of individual development. It is here that descriptions for many of the Uralic languages have been initiated and developed, both as funded projects and as the work of language technology enthusiasts.
The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, enhancing the finite-state tools at GiellaLT while extending the annotation of corpora on the Language Bank of Finland’s Korp server benefits the users of the search engine as well.
We will evaluate the state of development of the analysers for individual languages in relation to the text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new extended version of PaBiVUS (Parallel Biblical Verses for Uralic Studies). The objective is to increase the coverage of lemmatization and of morphological and syntactic annotation, which has not previously been offered for non-majority languages in the parallel corpus. So, here we will provide an illustrative depiction of each individual finite-state description and of the steps that have been taken to improve it. This might be seen as enhanced, but not complete, coverage of various genres as we go.
The evaluations will tend to illustrate the capacities of the analysers, which do have equivalent generators, but the possible overproductivity of these generators is presently not the focus of these evaluations. In time, attention will also be drawn to the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland to the shared responsibility for improved coverage of lesser-described languages and the NLP addressing them. Thus, the resulting analysers will be available for building within the GiellaLT infrastructure or through the UralicNLP Python, Java and .NET libraries available through GitHub or the Language Bank of Finland.