Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 5.1: Report on evidence-based infrastructure development
Date of reporting: 17-12-2024
Report authors: Anna Sendra Toset, Farid Alijani, Jaakko Peltonen, Sanna Kumpulainen (Tampere University)
Contributors: Elina Late (Tampere University)
Deliverable location: The recommendation system is available through a GitHub repository
The main objective of this deliverable is to develop FIN-CLARIAH from a bottom-up perspective by collecting information on how users interact with the tools and materials available in the RI, both implicitly (interaction log analysis) and explicitly (interviews, workshops).
PART A: IMPLICIT USER MONITORING
For implicit user monitoring, the goal is to design and develop methods for analysing log data from systems in the FIN-CLARIAH infrastructure that can also be applied to other compatible systems. Such analysis can serve purposes such as monitoring system use and recommending content to end users.
Firstly, as part of an earlier deliverable, we conducted a comprehensive study of the utility of the log data to assess the feasibility of developing both user-based and item-based recommender systems that could potentially be deployed for end users in the future.
Secondly, as a proof of concept, we have developed a collaborative recommender system to assist information retrieval in digital libraries, based on log data gathered from use of the libraries. The recommender system currently uses the National Library of Finland (NLF) dataset, which includes metadata on the collection, description, preservation and accessibility of Finland’s printed national heritage as digitized materials. The proof of concept is easily extensible to comparable log files of other digital libraries, and similar approaches can be applied to other DARIAH-FI collections. We maintain an open-access GitHub repository for public use (see Deliverable location), which has been primarily tailored to the SLURM clusters provided by CSC for data storage and large-scale computation.
The developed recommender system combines collaborative and content-based recommendation. It was initially developed with similarity search approaches and can be extended in future work to various inference schemes, including neural approaches. During the current research period we have further improved the operation of the recommender system by analysing its behavior and by identifying and resolving data quality issues, in particular OCR issues, which we addressed by incorporating language detection and spell checking for Finnish, Swedish and English with public software. We have made it possible to deploy and run the system as a standalone system on a suitable (virtual) server. We have further tuned the responsiveness of the system to make it as fast as possible even on a large corpus. Moreover, we have greatly improved the user interface of the system to allow seamless interplay between the recommendations and the baseline NLF search engine without requiring the user to switch back and forth manually; we have also added information on search result counts and (as an optional future feature) on their time distribution.
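The collaborative, similarity-search side of such a recommender can be illustrated with a minimal sketch. It is not the implementation in the project repository: the log records, the function names (build_profiles, cosine, recommend) and the scoring rule are all hypothetical, chosen only to show how search terms logged by similar users might be turned into suggestions. The content-based component is not covered here.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical log records (user id, search term, count); not real NLF data.
LOG = [
    ("u1", "helsinki", 3), ("u1", "sanomalehti", 1),
    ("u2", "helsinki", 2), ("u2", "rautatie", 4),
    ("u3", "rautatie", 1), ("u3", "sanomalehti", 2),
]


def build_profiles(log):
    """Aggregate log records into one sparse term-count profile per user."""
    profiles = defaultdict(dict)
    for user, term, count in log:
        profiles[user][term] = profiles[user].get(term, 0) + count
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query_terms, profiles, top_n=3):
    """Suggest terms used by users whose profiles resemble the query."""
    query = {term: 1.0 for term in query_terms}
    scores = defaultdict(float)
    for profile in profiles.values():
        similarity = cosine(query, profile)
        if similarity == 0.0:
            continue
        # Each similar user votes for their other terms, weighted by
        # similarity to the query and by how often they used the term.
        for term, weight in profile.items():
            if term not in query:
                scores[term] += similarity * weight
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    print(recommend(["helsinki"], build_profiles(LOG)))
```

A brute-force scan over all user profiles, as above, is only practical for small logs; on a large corpus the same idea would typically rely on precomputed or indexed similarities.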
We have demonstrated our recommender system in two research events to promote it for researchers and audiences alike:
We have further developed a testing scheme for quantitative evaluation of the performance of the system and of users’ satisfaction with it. The procedure can be run across a series of user experiments to compare the new system with a baseline system, which is here the original NLF search system. Recruitment of participants for these experiments is underway.
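One way a comparison against a baseline could be quantified is sketched below. The report does not specify which measures the testing scheme uses, so precision at k and the paired mean difference are assumptions made purely for illustration, as are the function names and the example values.

```python
from statistics import mean


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that the participant judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def paired_mean_difference(new_scores, baseline_scores):
    """Mean per-participant difference (new minus baseline).

    Assumes every participant used both systems, so scores are paired
    by position in the two lists.
    """
    if len(new_scores) != len(baseline_scores):
        raise ValueError("scores must be paired per participant")
    return mean(n - b for n, b in zip(new_scores, baseline_scores))


if __name__ == "__main__":
    # Hypothetical judgments for a single participant and query.
    print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=2))
    # Hypothetical per-participant scores for the two systems.
    print(paired_mean_difference([0.5, 1.0], [0.5, 0.5]))
```

A positive paired difference would favour the new system in this sketch; whether a difference is meaningful would still require an appropriate statistical test and enough participants.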
PART B: EXPLICIT USER MONITORING
For explicit user monitoring, we launched a round of interviews in November 2024 specifically targeting Kielipankki users, with the objective of evaluating how this service supports the research tasks of social sciences and humanities scholars. In particular, the interviews focus on assessing the experiences of these researchers with Kielipankki and on identifying their perceived needs and expectations regarding the service. The round of interviews is ongoing, and we expect to have the first results during Q1/2025.
Other work related to explicit user monitoring consisted of exploring research data management practices among scholars with experience in digital humanities and computational social sciences research, in order to find ways of better supporting these practices. This study was presented at Informaatiotutkimuksen päivät 2024 (Turku, 6-7 November) and was submitted for publication during Q4/2024.
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.