D1.3.1: Corpora of non-standard language
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.3: Report on Corpora of non-standard language
Date of reporting: 2022-09
Report author: Veronika Laippala (UTU)
Contributors: Veronika Laippala, Filip Ginter, Sampo Pyysalo, Anni Eskelinen, Anna Salmela (UTU)
Deliverable location: turkunlp.org | github.com/TurkuNLP
Description
1) Text quality data
2) Register (genre) annotations for OSCAR
3) Toxic language use for Finnish
- Toxic language can be defined as rude or disrespectful language that is likely to make someone leave a discussion
- Toxic language data and models for Finnish are to be published in early 2023 (to be submitted to NoDaLiDa)
- They will be available at github.com/TurkuNLP and as a Hugging Face dataset