Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024
Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
Keywords for the deliverable page: versioning, updates, differences
The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files to avoid unnecessary growth of the versions stored in Allas.
Two potential users of the data, Erik Axelson and Ville Vaara (both UH), have been interviewed. Both interviews are summarized below.
In 2024, FIN-CLARIN published a new version of “The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1], klk-fi-v2-1874-vrt for short. This version was created using data obtained directly from the National Library, because our harvesting mechanism was not yet ready at the start of the project. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of included publications was compiled[4], along with end user notes that document inconsistencies found after publication[5]. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format than the data provided in this work package[2]. However, the differences are small and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the changes needed with a proof of concept.
Another use case for the data is the Elasticsearch-based tool developed in the previous FIN-CLARIAH development round in WP4.3[6]. In that use case, the NLF data is converted to JSON suitable as input data for Elasticsearch. When considering newer versions, it became clear that an easy way of finding the differences between versions would be a useful addition to the present implementation. The dataset is presently 10 TB in size. Comparing two datasets of that size (the present version and an earlier one) to find the differences should be done once, during the update, and the result provided to users as a service, so that indexes can be updated more easily.
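The comparison described above can be illustrated with a minimal sketch. It is not the harvester's implementation: it assumes that each version is available as a directory tree and that a file counts as changed when its SHA-256 checksum differs. The function names build_manifest and diff_manifests are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum.

    Assumption: one version of the dataset is one directory tree.
    """
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

If a manifest is stored alongside each version, only the manifests need to be compared during an update, which avoids re-reading both full datasets for every user.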
Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier, by providing the differences between versions in a machine-readable way. In deliverable 3.1.2 we will demonstrate the changes with working code.
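As an illustration of what a machine-readable difference could look like, the sketch below assumes a JSON document with the keys "added", "removed" and "changed", each holding a list of file paths. This format, the version labels and the function plan_index_updates are hypothetical and not part of the harvester. A maintainer of a derived dataset could turn such a document into a list of update actions:

```python
import json


def plan_index_updates(diff_json: str) -> list[tuple[str, str]]:
    """Turn a version diff into (action, path) pairs for a derived index.

    Assumed diff format (hypothetical): a JSON object whose "added",
    "removed" and "changed" keys each hold a list of file paths.
    """
    diff = json.loads(diff_json)
    # Removed files are deleted from the index first.
    actions = [("delete", path) for path in diff.get("removed", [])]
    # Added and changed files are (re)indexed; re-indexing replaces old content.
    actions += [("index", path) for path in diff.get("added", [])]
    actions += [("index", path) for path in diff.get("changed", [])]
    return actions


# Hypothetical diff between two versions of the dataset.
example_diff = json.dumps({
    "from_version": "v1",
    "to_version": "v2",
    "added": ["1875/issue_001.xml"],
    "removed": ["1771/issue_009.xml"],
    "changed": ["1800/issue_042.xml"],
})

for action, path in plan_index_updates(example_diff):
    print(action, path)
```

With a document of this kind, only the listed files need to be processed during an update, rather than the full dataset.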
[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401
[2] See the Harvester documentation for details.
[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121
[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801
[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001
[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)