Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 2024-01-01
Duration: 24 months
Report author: Jussi Piitulainen (UHEL)
WP 1.1: Report on Named-Entity Annotation
Date of reporting: 2024-09-26
Contributors: Jussi Piitulainen, Jyrki Niemi (UHEL), Sam Hardwick (CSC)
Deliverable location:
Keywords for the deliverable page: named-entity; finnish-nertag; VRT; Suomi24
Name-like phrases are annotated in the Suomi24 2001–2020 VRT corpus in the Language Bank of Finland, using the computational resources of CSC. The new annotations come in the three output formats of the finnish-nertag 1.6 tool: maximally long identified names, names nested within those, and the BIO (begin, inside, outside) format for the maximal names.
All 20 years have already been processed with the tool. A small number of triply nested annotations required correction, for which a post-processing tool was written. All years are pending the addition of structural markup tags for each maximal name.
The final annotations are expected to be available in the Language Bank both through the Korp search engine and as a new downloadable version of the corpus in October 2024.
As an example of the tag format, below is a VRT fragment (from the year 2010 data) in which "Turun hallinto-oikeudelle" is recognized as a maximally long name, with "Turun" as a shorter name nested inside it. A third nesting level is also possible. (The example shows only the word and the new fields; base forms and other morpho-syntactic annotations remain in the corpus.)
word                 nertag2         nertags2/                            nerbio2
joka                 _               |                                    O
jätetään             _               |                                    O
Turun                EnamexOrgCrp-B  |EnamexOrgCrp-B-0|EnamexLocPpl-F-1|  B-ORG
hallinto-oikeudelle  EnamexOrgCrp-E  |EnamexOrgCrp-E-0|                   I-ORG
ensi                 _               |                                    O
maanantaina          _               |                                    O
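The nested-name field can be read mechanically. The following is a minimal sketch in Python, not part of the actual tool chain, that splits a set-valued nertags2/ value into its members. It assumes, based only on the example above, that each member has the form name-position-depth, that depth 0 marks the maximal name, and that the empty set is written as a single "|". The names NestedTag and parse_nertags are hypothetical, and the meaning of the position letters (B, E and F appear in the example) is not interpreted here.

```python
from typing import NamedTuple


class NestedTag(NamedTuple):
    name: str      # e.g. "EnamexOrgCrp"
    position: str  # position marker as it appears in the data, e.g. "B", "E", "F"
    depth: int     # assumed: 0 for the maximal name, 1 for a name nested in it


def parse_nertags(value: str) -> list[NestedTag]:
    """Split a set-valued field such as '|A-B-0|C-F-1|' into its members."""
    tags = []
    for member in value.strip("|").split("|"):
        if not member:
            continue  # the empty set is written as a single '|'
        name, position, depth = member.rsplit("-", 2)
        tags.append(NestedTag(name, position, int(depth)))
    return tags


# The rows of the example fragment: word, nertag2, nertags2/, nerbio2.
rows = [
    ("joka", "_", "|", "O"),
    ("jätetään", "_", "|", "O"),
    ("Turun", "EnamexOrgCrp-B", "|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|", "B-ORG"),
    ("hallinto-oikeudelle", "EnamexOrgCrp-E", "|EnamexOrgCrp-E-0|", "I-ORG"),
    ("ensi", "_", "|", "O"),
    ("maanantaina", "_", "|", "O"),
]

for word, nertag, nertags, bio in rows:
    print(word, parse_nertags(nertags))
```

Splitting from the right keeps the sketch robust if a tag name itself were to contain a hyphen; whether that ever occurs in the real data is not known here.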
The number of maximally long names identified in the years 2001–2010 (roughly half of the corpus) is given below, obtained by counting the BIO start tags (the B of BIO). The BIO tags classify the recognized names into six types; the other formats provide a finer classification.
Start tag (BIO) | frequency |
B-PER | 22 416 185 |
B-PRO | 17 347 958 |
B-LOC | 14 271 499 |
B-ORG | 9 088 301 |
B-MISC | 4 419 947 |
B-DATE | 2 590 846 |
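Counting start tags is straightforward once the nerbio2 values are available as a sequence. The sketch below is an illustration in Python, not the script actually used: count_start_tags is a hypothetical helper, and the sample sequence is the nerbio2 column of the example fragment. The second part sums the frequencies reported in the table above.

```python
from collections import Counter


def count_start_tags(bio_tags):
    """Count the B- tags; each one starts exactly one maximal name."""
    return Counter(tag for tag in bio_tags if tag.startswith("B-"))


# The nerbio2 column of the example fragment.
sample = ["O", "O", "B-ORG", "I-ORG", "O", "O"]
print(count_start_tags(sample))

# Frequencies as reported in the table for the years 2001-2010.
reported = {
    "B-PER": 22_416_185,
    "B-PRO": 17_347_958,
    "B-LOC": 14_271_499,
    "B-ORG": 9_088_301,
    "B-MISC": 4_419_947,
    "B-DATE": 2_590_846,
}
total = sum(reported.values())
for tag, frequency in reported.items():
    print(f"{tag:7} {frequency:>12,} {frequency / total:6.1%}")
print(f"{'total':7} {total:>12,}")
```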
The annotation work was facilitated by a new preprocessing tool that hides from the finnish-nertag tool those input sentences that have been found empirically to risk extreme resource consumption (usually excessive time, sometimes excessive space, either of which leads to a crash). Some of these sentences originate in trollish behaviour in the discussion forum; others are not really ordinary sentences at all. Some may have been segmented unhelpfully, possibly because of missing punctuation marks or missing spaces.
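The general idea of hiding risky sentences can be sketched as follows. This is only an illustration in Python: the actual criteria of the preprocessing tool are not described in this report, so the two thresholds, the function names looks_risky and tag_with_hiding, and the choice to give hidden sentences plain "O" tags are all assumptions made for the example. The tagger argument is a stand-in for the real finnish-nertag call.

```python
# Hypothetical thresholds; the real tool's criteria are not given in this report.
MAX_TOKENS = 200
MAX_TOKEN_LENGTH = 100


def looks_risky(tokens):
    """Guess whether a sentence might exhaust the tagger's time or memory."""
    return len(tokens) > MAX_TOKENS or any(
        len(token) > MAX_TOKEN_LENGTH for token in tokens
    )


def tag_with_hiding(sentences, tagger):
    """Tag each sentence, but never show a risky sentence to the tagger."""
    results = []
    for tokens in sentences:
        if looks_risky(tokens):
            # Assumed fallback: a hidden sentence gets no names at all.
            results.append(["O"] * len(tokens))
        else:
            results.append(tagger(tokens))
    return results


def toy_tagger(tokens):
    """Stand-in for the real tagger; marks one known word as a name start."""
    return ["B-LOC" if token == "Turun" else "O" for token in tokens]


sentences = [
    ["jätetään", "Turun", "hallinto-oikeudelle"],
    ["a" * 500],  # a single absurdly long token, hidden from the tagger
]
print(tag_with_hiding(sentences, toy_tagger))
```

Keeping the hidden sentences in place with default tags means the output stays aligned with the input, so the annotations can be merged back into the corpus without renumbering.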
In addition to the names, the corpus was annotated with HeLI-OTS 2.0 language identification for each sentence, with summaries in the paragraph and text elements.
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 358720.