The EU is in the process of creating an internal market for all types of data. The aim is to ensure that data can be shared from one stakeholder to another within the region, in accordance with EU legislation. Data sharing requires interactive networks – data spaces – that connect data providers and users, and offer them a platform to communicate, make contracts and trade with each other.
All the upcoming European data spaces will be developed in line with the European Data Strategy. There are development plans for data spaces for approximately 15 different strategic fields. According to the vision, data spaces will allow for the commercialisation and more efficient re-use of data. This will benefit not only commercial stakeholders in the EU, but also EU citizens by providing them with better digital services, for example. In addition, researchers could gain access to new types of data and materials, which could boost basic research and increase opportunities for product development and innovation.
The European Language Data Space (LDS) is an ecosystem for the sharing and commercialisation of language data, such as text and speech data, and for the development of large language models and language-centric artificial intelligence. The Language Data Space is being developed and coordinated by the LDS Consortium, which was established in early 2023 with the support of the European Commission. The first phase of the LDS will last three years. During this period, the technical and legal framework for the operation of the common language data platform will be established in cooperation with the various stakeholders.
The work on the language data space will also be driven forward by ALT-EDIC, the language technology alliance of EU member states established in early 2024. In particular, ALT-EDIC aims to ensure the development of EU-based large language models.
The Language Data Space will be built partly on top of existing networks and language technology infrastructures. Sitra’s publication Snapshot of Finnish data spaces (2024, in Finnish) gives a good summary of the current situation in Finland with regard to language technologies and the Language Data Space.
In spring 2024, the LDS Consortium launched a series of country-specific workshops to share information about the possibilities of the common Language Data Space, and to reach as many stakeholders in each member country as possible. The workshops are organised in collaboration with local institutions. In April 2024, Finland had the honour of being the first EU member state to host an LDS workshop. The event was organised locally by the University of Helsinki. More information on workshops in other EU countries and upcoming LDS events can be found on the Language Data Space website.
The Finnish LDS workshop provided an opportunity for organisations and companies in Finland to exchange ideas on the possibilities and challenges that a common platform and marketplace for language models and data could offer. The workshop featured two remote presenters: Philippe Gelin from the European Commission and Georg Rehm from DFKI in Germany. In the panel discussions (see photos below), partners from the LAREINA project coordinated by the University of Helsinki shared their views on the importance of language data and on the challenges concerning data availability, technical quality and copyright constraints. Without access to electronic data of sufficient quality and scope, it is difficult to develop language models for speakers of small and medium-sized languages.
After the LDS workshop, Finland began the process of joining ALT-EDIC as an observer member. After summer 2024, Finland’s full membership of ALT-EDIC was confirmed for the next three years. Finland’s administrative representative in ALT-EDIC is the Ministry of Transport and Communications, with which the University of Helsinki aims to maintain an active dialogue.
The Language Data Space invites European stakeholders to join the LDS User Group. The group includes commercial stakeholders from different sectors as well as representatives of public administration and research. The news from the remote meeting of the LDS User Group in November 2024 can be found here. To join the LDS User Group, fill in the form on the LDS website. Language data providers and users, as well as language model developers, are especially welcome to join the group.
At the end of 2024, the Language Data Space is entering its pilot phase, in which the Language Bank of Finland is also actively involved. The aim is to test the pilot version of the LDS platform in Finland and to collect user feedback. The Language Bank of Finland is also planning to organise a workshop in spring 2025 on the Language Data Space, ALT-EDIC and copyright issues. We will announce this event on our website and through the LAREINA project.
All photos in the article: Jyrki Niemi / University of Helsinki