Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Matias Tamminen, a graduate student at the University of Helsinki, tells us about his research on the corpora Classics of English and American literature in Finnish and Classics of Finnish Literature, which makes use of the data analysis platform Mylly.
I am Matias Tamminen. This is my fifth year at the University of Helsinki. I am majoring in English Translation. My minors are Swedish Translation, Finnish, and Translation Studies. I am currently writing my Master’s thesis.
In my Master’s thesis, I use corpus data to compare the frequencies of part-of-speech n-grams in native and translated Finnish prose literature. The purpose of the thesis is to find out whether translated Finnish prose literature has syntactic features that differ from those of native Finnish prose literature. I will also examine whether any differences found can be predicted from various translation universals (i.e. features that are common to all translated language).
My thesis is based on the article Borin L., Prütz K. 2001. Through a glass darkly: Part of speech distribution in original and translated text. In: W. Daelemans, K. Sima’an, J. Veenstra, J. Zavrel (eds) Computational Linguistics in the Netherlands 2000; 2001:30–44.
Both main corpora come from Kielipankki: the translated Finnish corpus, Classics of English and American literature in Finnish, and the native Finnish corpus, Classics of Finnish Literature (I use the subcorpora that contain native Finnish prose literature, since the corpus contains other material as well). In addition to these two corpora, I have two control corpora that I have compiled myself. The first contains Finnish literature translated from languages other than English, and the second contains the source texts of the books translated from English.
I process the data using Mylly, a data analysis platform provided by Kielipankki. I have downloaded the corpora to my computer and imported them into Mylly. In Mylly, I have tagged the corpora with a parser, i.e. the properties of every word, such as its part-of-speech category, have been added to the corpora automatically. I then calculated part-of-speech n-grams, with n ranging from one to five, from these annotated corpora, and normalized the n-gram frequencies to make them comparable across corpora. Lastly, I calculated the differences between these normalized frequencies using a couple of different methods. Mylly helps with processing the results, too. For example, I can automatically pick out a given n-gram and all its occurrences, and sort and number the n-gram list by any factor I want.
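The counting, normalization and comparison steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Mylly workflow: the function names, the tag names, the toy sentences, the choice to count n-grams within sentence boundaries, the scaling to a rate per 1,000 n-grams, and the two comparison measures (a plain difference and a smoothed log ratio) are all hypothetical choices made for the example.

```python
from collections import Counter
from math import log2


def pos_ngrams(tags, n):
    """Return all contiguous n-grams of POS tags as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(sentences, n, per=1000):
    """Count POS n-grams sentence by sentence and scale the counts
    to a rate per `per` n-grams (assumed normalization scheme)."""
    counts = Counter()
    for tags in sentences:
        counts.update(pos_ngrams(tags, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: per * count / total for gram, count in counts.items()}


def compare(freq_a, freq_b, smoothing=0.5):
    """For every n-gram seen in either corpus, return a pair:
    (freq_a - freq_b, log2 ratio with additive smoothing)."""
    result = {}
    for gram in set(freq_a) | set(freq_b):
        a = freq_a.get(gram, 0.0)
        b = freq_b.get(gram, 0.0)
        result[gram] = (a - b, log2((a + smoothing) / (b + smoothing)))
    return result


# Toy data: each sentence is a list of POS tags (hypothetical tag names).
native = [["NOUN", "VERB", "NOUN"], ["PRON", "VERB", "ADJ", "NOUN"]]
translated = [["PRON", "VERB", "NOUN"], ["PRON", "VERB", "PRON", "NOUN"]]

native_freq = relative_frequencies(native, n=2)
translated_freq = relative_frequencies(translated, n=2)
differences = compare(translated_freq, native_freq)

# List bigrams, most over-represented in the translated sample first.
ranked = sorted(differences.items(), key=lambda item: item[1][0], reverse=True)
for gram, (diff, ratio) in ranked:
    print(" ".join(gram), round(diff, 1), round(ratio, 2))
```

Scaling to a common rate is what makes corpora of different sizes comparable. The smoothing constant only keeps the log ratio defined for n-grams that are missing from one of the corpora.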
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive.