Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 3.1: Report on Ingestion framework
Date of reporting: 2023-02
Report author: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
The basic concept for downloading the data is in place. Apache Airflow was evaluated for workflow management, found fit for purpose, and chosen. A script has been created that downloads the METS XML files and then the ALTO XML files via Airflow. A CSC project with the necessary quota has been created. Download of the dataset (METS, ALTO) started in January 2023. Two areas for improvement have been identified: download speed, and the METS file paths, which need post-processing. The next steps have been agreed between NLF and CSC, and the fruitful collaboration continues.
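To illustrate the METS-then-ALTO order and the file path post-processing mentioned above, the following is a minimal sketch of how ALTO file references could be read from a METS document and turned into relative paths. It is not the harvester's actual code. The file group name "ALTOGRP", the "file://./" prefix, the sample document, and the function names alto_hrefs and normalise_path are assumptions made for illustration only. The METS and XLink namespaces are the standard ones.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for METS and XLink.
METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily shortened METS document used only for this example.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="ALTOGRP">
      <mets:file ID="ALTO00001">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="file://./alto/00001.xml"/>
      </mets:file>
      <mets:file ID="ALTO00002">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="file://./alto/00002.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="IMGGRP">
      <mets:file ID="IMG00001">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="file://./img/00001.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def alto_hrefs(mets_xml, file_group_use="ALTOGRP"):
    """Return the xlink:href values of all files in the given METS file group.

    The default group name "ALTOGRP" is an assumption, not a confirmed value.
    """
    root = ET.fromstring(mets_xml)
    hrefs = []
    for file_grp in root.iter(f"{{{METS_NS}}}fileGrp"):
        if file_grp.get("USE") != file_group_use:
            continue
        for flocat in file_grp.iter(f"{{{METS_NS}}}FLocat"):
            href = flocat.get(f"{{{XLINK_NS}}}href")
            if href:
                hrefs.append(href)
    return hrefs


def normalise_path(href):
    """Turn a METS file reference into a plain relative path.

    Assumes references look like "file://./alto/00001.xml"; other forms
    would need their own handling.
    """
    path = href
    if path.startswith("file://"):
        path = path[len("file://"):]
    if path.startswith("./"):
        path = path[2:]
    return path.lstrip("/")


if __name__ == "__main__":
    for href in alto_hrefs(SAMPLE_METS):
        print(normalise_path(href))
```

In a workflow of this kind, one task would fetch the METS file and a later task would use the normalised paths to request each ALTO file. How the paths map to actual download locations is not specified in this report and is left open here.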
FIN-CLARIAH WP3.1 presentation from the DARIAH-FI workshop on 9 November 2022.