Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.3: Report on Models for retrieving QA pairs from the web
Date of reporting: 02-11-2023
Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP
Our pipeline for retrieving question-answer (QA) pairs from text corpora uses two transformer models: one identifies documents in web-crawled corpora that are likely to contain QA pairs, and the other extracts the actual QA pairs from those documents.
The model for QA document identification is a cross-lingual sequence classification model. It is trained on register-annotated data in English and Finnish, together with unpublished versions of Swedish and French register-annotated data, and is fine-tuned specifically to predict whether a document (a piece of text) contains question-and-answer content.
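The document-filtering step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the classifier returns two logits per document in the order (not QA, QA) and that a document is kept when the softmax probability of the QA class reaches a threshold. The function names, the label order and the threshold of 0.5 are hypothetical and are not taken from the released model.

```python
import math

def qa_probability(logits):
    """Softmax probability of the assumed 'QA' class (index 1)."""
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def filter_qa_documents(docs, logits_per_doc, threshold=0.5):
    """Keep documents whose assumed QA probability reaches the threshold."""
    return [
        doc
        for doc, logits in zip(docs, logits_per_doc)
        if qa_probability(logits) >= threshold
    ]

# Toy logits standing in for real classifier output (hypothetical values).
docs = ["Recipe for rye bread", "FAQ: How do I renew my passport?"]
logits_per_doc = [[2.0, -1.0], [-1.0, 2.0]]
kept = filter_qa_documents(docs, logits_per_doc)
```

In practice the threshold would be tuned on held-out annotated data rather than fixed at 0.5.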
The model for QA pair extraction is a token classification model (for English and Finnish). It predicts whether each token in the text belongs to a question, an answer, or neither ("other"); the text is then split into QA pairs based on these predictions and aggregation strategies. This model is applied to the documents that the first model labels as containing question-and-answer content.
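The splitting step can be sketched as follows. This is one simple aggregation strategy written under stated assumptions, not necessarily the one used in the pipeline: consecutive tokens with the same predicted label are merged into spans, and each question span is paired with the next answer span that follows it. The label names QUESTION, ANSWER and OTHER and both function names are hypothetical.

```python
def group_spans(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) spans."""
    spans = []
    for token, label in zip(tokens, labels):
        if spans and spans[-1][0] == label:
            spans[-1][1].append(token)
        else:
            spans.append((label, [token]))
    return [(label, " ".join(words)) for label, words in spans]

def extract_qa_pairs(tokens, labels):
    """Pair each question span with the next answer span that follows it."""
    pairs = []
    question = None
    for label, text in group_spans(tokens, labels):
        if label == "QUESTION":
            # A new question replaces any earlier question left unanswered.
            question = text
        elif label == "ANSWER" and question is not None:
            pairs.append((question, text))
            question = None
        # OTHER spans are skipped; a pending question stays open across them.
    return pairs

# Toy token-level predictions standing in for real model output.
tokens = "Menu What is QA ? Question answering . Footer".split()
labels = ["OTHER", "QUESTION", "QUESTION", "QUESTION", "QUESTION",
          "ANSWER", "ANSWER", "ANSWER", "OTHER"]
pairs = extract_qa_pairs(tokens, labels)
```

Answer spans with no preceding question are dropped in this sketch; a different strategy could instead attach them to the nearest earlier question.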
The publication details will be updated later (work submitted to LREC-COLING 2024).