Suomalaisen viittomakielen korpus

Saatavilla olevat versiot

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso

Tietoa aineistosta

Kunkin aineistoversion tarkemmat tiedot päivitetään kuvailutietueeseen, joka löytyy pysyvällä tunnisteella (ks. linkki aineiston otsikon kohdalla).

Tarkemmat tämän korpuksen toisen osan videoiden kokoamisesta ja koosta löytyvät täältä.

Tärkeitä huomautuksia

Lisenssin muutos (6.12.2024): Syksyllä 2024 päivitetyn tallennussopimuksen mukaisesti tämän aineiston lisensseihin on lisätty aineistokohtaiset tietosuojaehdot.

Lisenssi ja pääsy aineistoon

Jotkin tämän aineiston versiot ovat saatavilla julkisesti (PUB), kun taas toisiin täytyy hakea erikseen henkilökohtaista käyttöoikeutta (RES).
Lisenssikuvaketta napauttamalla näet tarkan aineistokohtaisen lisenssin.
Tämän aineiston versioihin sisältyy henkilötietoja (lisenssissä on merkintä +PRIV). Henkilötietojen käsittelyssä on noudatettava aineistokohtaisia tietosuojaehtoja. Jos käsittelet henkilötietoja, ylläpidä projektiasi koskevaa julkista tietosuojailmoitusta ja toimita sen linkki Kielipankille, ks. ohjeet.

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2021092401

Corpus of Finnish Sign Language

Suomeksi

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level
Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Further information

Further details of each version of the resource are maintained in the metadata record, findable via the persistent identifier (see the link at the resource title).

Details on the compilation of the videos and sizes of the second part of this corpus can be found here.

Important notes

License change (6.12.2024): According to the deposition agreement updated in autumn 2024, the resource-specific data protection terms and conditions were added to the licenses of the different versions of this resource.

License and access

Some versions of this resource are available publicly (PUB), whereas others require you to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
All versions of this resource may contain personal data (license condition +PRIV). The license includes additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2024060525

Finland Swedish Online

Finland Swedish Online is a platform offering online courses for learners of Finland Swedish. The service is provided by the University of Helsinki. The service is based on Icelandic Online provided by the University of Iceland. The courses are offered at different levels. They are learner centered with interactive visual and listening exercises organized around themes relevant to life in Finland. The courses are supported by glossaries, grammars and dictionaries.

Access Finland Swedish Online

Try out the related service for Icelandic, Iclandic Online

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024112801

STT:n uutisarkisto (1992-)

In English

Tärkeää: STT:n uutisarkiston kokotekstiaineistojen käyttöoikeus päättyy 21.2.2025

Saatavilla olevat versiot

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso

Tulossa olevat versiot

Nämä aineistoversiot eivät vielä ole saatavilla Kielipankin kautta.

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Muoto	Tukitaso	Contact Person	Sijainti	Aineistoryhmä ja ohje	Muu tieto
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Muoto	Tukitaso	Contact Person	Sijainti	Aineistoryhmä ja ohje	Muu tieto

Poistetut versiot

Nämä aineistoversiot eivät enää ole saatavilla Kielipankin kautta.

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Viite	Aineistoryhmä ja ohje	Julkaisuvuosi
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Viite	Aineistoryhmä ja ohje	Julkaisuvuosi

Tietoa aineistosta

Suomen Tietotoimiston (STT) uutisarkisto sisältää uutisjakelun suomenkieliset artikkelit, jotka STT on lähettänyt media-asiakkaidensa käytettäväksi vuodesta 1992 lähtien. Valtaosa artikkeleista on uutisjuttuja, joiden pituus vaihtelee hyvin lyhyistä ”viivauutisista” uutissähkeisiin ja pidempiin uutisjuttuihin. Artikkelit on luokiteltu osastoittain (kotimaa, ulkomaat, talous, politiikka, kulttuuri, viihde ja urheilu) sekä sisältää metadataa (IPTC-asiasanat tai avainsanat sekä tietyiltä osin paikkaluokitukset). Arkisto sisältää myös muuta STT:n luomaa tai välittämää materiaalia kuten asiakkaille lähetettäviä uutislupauksia, urheilutuloksia, vieraskynäartikkeleita ja tiedotteita.

Tarkempaa tietoa eri aineistoversioiden sisällöstä löytyy niiden kuvailutiedoista. Kuvailutiedoista löytyvät myös tiedot aineiston käyttöoikeuksista ja lisensseistä.

Tärkeitä huomautuksia

Lisenssin muutos 2024-11-21: Oikeudenhaltija on ilmoittanut, että STT:n uutisarkiston kokotekstiaineistoja koskeva lisenssi päättyy 21.2.2025. Mikäli olet saanut Kielipankin kautta käyttöoikeuden STT:n uutisarkiston kokotekstiaineistoihin, sinun on lisenssiehtojen mukaisesti lopetettava kyseisten aineistojen käyttö ja poistettava ne laitteiltasi kolmen kuukauden siirtymäajan kuluessa eli 21.2.2025 mennessä (ks. lisenssin linkki edellä). Aiemmin luvan saaneille käyttäjille on ilmoitettu asiasta myös sähköpostitse.

Huomaathan, että käyttöoikeus päättyy vain STT:n uutisarkiston kokotekstiversioiden osalta! Niitä STT:n uutisarkiston versioita, joissa on saatavilla vain rajallisia konteksteja kerrallaan (esim. Kielipankissa olevat STT:n uutisarkiston Korp-versiot) tai joissa tekstisisällön virkejärjestys on sekoitettu, on edelleen sallittua käyttää. Kielipankki pyrkii lähitulevaisuudessa toimittamaan korvaavia aineistoversioita saataville latauspalvelun kautta.

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2018121001

Finnish News Agency Archive (1992-)

Suomeksi

Important: The license of the full-text versions of the Finnish News Agency Archive will be terminated on 21.2.2025

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level
Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information
Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Removed versions of this resource

These resource versions are no longer available in the Language Bank of Finland.

Shortname	Nimi ja kuvailutiedot	Lisenssi	Viite	Aineistoryhmä ja ohje	Publication year
Shortname	Nimi ja kuvailutiedot	Lisenssi	Viite	Aineistoryhmä ja ohje	Publication year

Further information

The Finnish News Agency Archive corpus comprises newswire articles in Finnish sent to media outlets by the Finnish News Agency (STT) since 1992.

Most of the material is news articles that vary from short “news flashes” to telegrams and longer articles. News articles are categorized by department (domestic, foreign, economy, politics, culture, entertainment and sports) as well as by metadata (IPTC subject categories or keywords and location data). The archive also includes other material STT has created or forwarded such as news planning lists, sports results, analysis articles and press releases.

Further details of each version of the resource are maintained in the metadata record, findable via the persistent identifier (see the link at the resource title).

Important notes

License change 2024-11-21: According to a notice from the rightholder, the end-user license of the full-text versions of the Finnish News Agency Archive will be terminated on 21st February 2025. In case you were granted the right to use the full text versions via the Language Bank of Finland, you must stop using the resources in question and you must remove them from your devices by the aforementioned deadline (see the license link above). The users who have access rights to the full-text versions have also been notified by email on 21st November 2024.

Please note that the termination of the license only affects the full-text versions of the resource! You may continue using those versions of the Finnish News Agency Archive that only show restricted contexts (e.g., the Korp versions of the archive in the Language Bank) or where the order of the sentences has been scrambled. The Language Bank is already working on new downloadable versions that can be made available under the public license.

License and access

Click on the license image at each resource to see the resource-specific license text.

Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2023072121

Suomenruotsalaisen viittomakielen korpus

Suomenruotsalaisen viittomakielen korpus

In English
På svenska

Suomenruotsalaisen viittomakielen korpus (CFSTS) on alun perin Suomen viittomakielten korpusprojektissa (CFINSL) systemaattisesti kerätty ja käsitelty aineistokokoelma. Korpus koostuu videotiedostoista, videoita koskevista annotaatioista ELAN-ohjelman tiedostoformaatissa sekä viittojia koskevista metatiedoista. Aineisto on jaettu kahteen osakorpukseen, joista yhdessä on kerronta-aineistoa (cfsts-elicit) ja toisessa viitottua keskustelua (cfsts-conv) kahdeltatoista viittojalta. Kerronta-aineisto on julkisesti saatavilla, kun taas keskustelut ovat luvanvaraisesti saatavilla rajoitetulla lisenssillä. Tarkempia tietoja löytyy kummankin aineiston kuvailutietueesta, ks. alla olevat linkit.

Vinkki: Katso myös Signbank: suomenruotsalainen viittomakieli.

Saatavilla olevat versiot

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2024090328

Finlandssvensk teckenspråkskorpus

Suomeksi
In English

Den finlandssvenska teckenspråkskorpusen (CFSTS) är en systematisk samling av material på finlandssvenskt teckenspråk som ursprungligen samlades in i korpusprojektet för Finlands teckenspråk (CFINSL). Resursen innehåller videofiler, inspelade från upp till sex olika kameravinklar, annoteringar av videorna i ELAN-format och metadata om 12 teckenspråksanvändare. Korpusen är uppdelad i två delkorpusar: den ena innehåller eliciterade berättelser (cfsts-elicit) och den andra innehåller diskussioner (cfsts-conv) från teckenspråksanvändarna. Berättelserna är allmänt tillgängliga, medan diskussionerna är tillgängliga under en begränsad licens.

Mer information finns i metadataposten för varje delkorpus, se nedan.

Tips: Se även Signbank: finlandssvenskt teckenspråk.

Tillgängliga versioner av denna resurs

Förkortning	Namn och metadata	Lisens	Tillgång	Citera	Resursgrupp och hjälp	Ansök	Utgivningsår	Servicenivå
Förkortning	Namn och metadata	Lisens	Tillgång	Citera	Resursgrupp och hjälp	Ansök	Utgivningsår	Servicenivå

Den här sidan har en beständig identifierare: http://urn.fi/urn:nbn:fi:lb-2024090329

Corpus of Finland-Swedish Sign Language

Corpus of Finland-Swedish Sign Language

Suomeksi
På svenska

The Corpus of Finland-Swedish Sign Language (CFSTS) is a systematic collection of materials in Finland-Swedish Sign Language, originally collected in the Corpus project of Finland’s Sign Languages (CFINSL). The resource contains video files, recorded from up to six different camera angles, annotations of the videos in ELAN format, and metadata about the signers. The corpus is divided into two subcorpora, one including elicited narratives (cfsts-elicit) and the other including signed conversations (cfsts-conv) from 12 signers. The elicited narratives are publicly available, whereas the conversations are available under a restricted license.

Further details can be found in the metadata record of each subcorpus, see below.

Tip: See also Signbank: Finland-Swedish Sign Language.

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level
Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2024090327

Amerikansuomalaisten siirtolaisten puhuttu suomen ja englannin kieli

In English

Tämän aineiston tulossa olevat versiot (ei vielä saatavilla)

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Muoto	Tukitaso	Contact Person	Sijainti	Aineistoryhmä ja ohje	Muu tieto
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Muoto	Tukitaso	Contact Person	Sijainti	Aineistoryhmä ja ohje	Muu tieto

Viimeksi päivitetty: 5.9.2024

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2024082922

Spoken Finnish and English of Finnish-American Immigrants

Suomeksi

Upcoming versions of this resource (not yet available)

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information
Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Last updated: 5.9.2024

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2024082921

INCEpTION

INCEpTION is a certified open-source web annotation service that has been developed by the Faculty of Computer Science of Technische Universität Darmstadt and is available to all registered users of the CLARIN:EL Research Infrastructure.

INCEpTION offers a generic multi-user annotation environment aiming

to cover three essential aspects of text annotation in a single tool: corpus building, knowledge modelling and annotation and
to combine them with machine-learning-based assistive mechanisms (so-called recommenders) to improve the annotation efficiency and quality.

INCEpTION service is hosted at Kielipankki’s CLARIN partners at CLARIN:EL in Greece. (Click here to view their Privacy Policy.)

To start using the INCEpTION service Click ”Use Service” > ”Log in to access” > ”CLARIN Service Provider Federation login” and select your home organization.

For more information see the INCEpTION User Documentation.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024081601

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Suomeksi

This corpus contains newspapers and magazines from Finland starting from 1770, compiled by the National Library of Finland.

NB: The Finnish acronym for the corpora The Newspaper and Periodical OCR Corpus of the National Library of Finland used to be ”Digilib”. Currently, however, the acronym ”klk” and the short names klk-fi-1874-dl and klk-fi-1920-dl are recommended instead.

Latest versions/subcorpora:
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT Metadata and license Attribution instructions	Download the resource
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, VRT Metadata and license Attribution instructions	Download the resource
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp Metadata and license Attribution instructions Example queries in Korp	Select the corpus in Korp
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Metadata and license Attribution instructions	Select the corpus in Korp
The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Metadata and license Attribution instructions	Select the corpus in Korp
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Metadata and license Attribution instructions	Download the resource
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) Metadata and license Attribution instructions	Download the resource
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1771–1879, VRT Metadata and license Attribution instructions	Download the resource
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT Metadata and license Attribution instructions	Download the resource
Search for these versions in META-SHARE

Of this language corpus different versions/subcorpora are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found from the list above.

Detailed information on the content of each version, user rights and licenses can be found from it’s specific metadata record in META-SHARE.

N-grams

Based on the KLK data, word-level collections of uni-, bi- and trigrams have been created and are available for download. These are their own data sets:

The N-grams of the Newspaper and Periodical Corpus of the National Library of Finland

Example queries from Korp

Concordance view of any form of the word 'sosialismi' in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp — Concordance view of any form of the word ’sosialismi’ in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp

Word picture of the word 'sosialismi' in klk-fi-v2-korp — Word picture of the word ’sosialismi’ in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp

Trend diagram of all forms of the word 'sosialismi' occurring in klk-fi-v2-korp — Trend diagram of all forms of the word ’sosialismi’ occurring in the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp

OCR quality

The corpora consist mainly of digitized versions of texts originally printed on paper. These physical papers have been scanned, and optical character recognition (OCR) was performed on the resulting images. The digitized material spans a long period and contains different kinds of texts, writing styles and fonts. Scanning some parts of the material is more complex than scanning other parts, and the physical condition of the original texts also varies. The OCR techniques used have also varied, and there is the possibility that some of the texts have gone through manual post-correction. This results in some parts of the corpora being of terrible quality while others are of good quality. We have collected a list of publications related to OCR quality and collection processing:

Kettunen, K., & Pääkkönen, T. (2016, May). Measuring lexical quality of a historical finnish newspaper collection―Analysis of Garbled OCR data with basic language technology tools and means. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16) (pp. 956-961).

Koistinen, M., Kettunen, K., & Kervinen, J. (2017). How to improve optical character recognition of historical Finnish newspapers using open source Tesseract OCR engine. Proc. of LTC, 279-283.

Kettunen, K., Kervinen, J., & Koistinen, M. (2018, April). Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals. In DHN (pp. 162-169).

Kettunen, K., Koistinen, M., & Ruokolainen, T. (2018, April). Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland. In DHN (pp. 474-479).

Kettunen, K., & Koistinen, M. (2019, March). Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals-Collected Notes on Quality Improvement. In DHN (pp. 270-282).

Drobac, S. (2020). OCR and post-correction of historical newspapers and journals. University of Helsinki.

Hamdi, A., Jean-Caurant, A., Sidère, N., Coustaty, M., & Doucet, A. (2020). Assessing and minimizing the impact of OCR quality on named entity recognition. In Digital Libraries for Open Knowledge: 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Lyon, France, August 25–27, 2020, Proceedings 24 (pp. 87-101). Springer International Publishing.

Kettunen, K., Koistinen, M., & Kervinen, J. (2020). Ground truth OCR sample data of Finnish historical newspapers and journals in data improvement validation of a re-OCRing process. LIBER Quarterly: The Journal of the Association of European Research Libraries, 30(1), 1-20.

Jauhiainen, T., Piitulainen, J., Axelson, E., & Lindén, K. (2022, October). Language identification as part of the text corpus creation pipeline at the language bank of finland. In Digital Humanities in the Nordic and Baltic Countries Conference (pp. 251-259). CEUR.

Kettunen, K., Keskustalo, H., Kumpulainen, S., Pääkkönen, T., & Rautiainen, J. (2022). OCR quality affects perceived usefulness of historical newspaper clippings–a user study. arXiv preprint arXiv:2203.03557.

Rautiainen, J. (2022). Digitized Finnish Newspapers in Digital Humanities Research Projects: Challenges and Solutions from the Library Perspective.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092404

Last updated: 19.6.2024

Tweet #lb_textreuse-sv

Tekstin uudelleenkäyttöklusterit ruotsinkielisessä lehdistössä 1645-1918

In English

Saatavilla olevat versiot

Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso
Lyhenne	Nimi ja kuvailutiedot	Lisenssi	Sijainti	Viite	Aineistoryhmä ja ohje	Hae käyttöoikeutta	Julkaisuvuosi	Tukitaso

Aineiston sisältö

Resurssi perustuu Suomen ja Ruotsin kansalliskirjastojen digitoiman ruotsinkielisen sanoma- ja aikakauslehtiaineiston päällekkäisyyksien ja toistojen tutkimukseen. Tarkoituksena oli löytää kaikki yli 300 merkkiä pitkät tekstit tai tekstinpätkät, jotka olivat toistuneet tai kopioitu vähintään kerran. Näitä samankaltaisuuksia tai päällekkäisyyksiä löytyi yli 101 miljoonaa. Kun samoja tekstejä klusteroitiin, klustereita löytyi lähes 22 miljoonaa. Tutkimus kattoi vuodet 1645-1918 alkaen ensimmäisestä Ruotsissa painetusta sanomalehdestä. Tutkimuksessa oli mukana yhteensä 7,5 miljoonaa sivua digitoitua sanomalehtiaineistoa. Edellä mainittujen Suomessa ja Ruotsissa painettujen sanomalehtien lisäksi tietokanta sisältää Pohjois-Amerikassa julkaistuja ruotsinkielisiä maahanmuuttajien sanomalehtiä.

Materiaali on tuotettu hankkeessa ”Informationsflöden över Östersjön: Svenskspråkig press som kulturförmedlare”, jota rahoittaa Suomen ruotsalaisen kirjallisuuden seura (Svenska Litteratursällskapet i Finland). Digitoitu aineisto koottiin marraskuussa 2022.

Kokeile hakukonetta, joka on suunniteltu näiden tekstikokonaisuuksien etsimiseen ja analysointiin.

Lisätietoja sisällöstä ja eri korpusversioita koskevista ehdoista ja edellytyksistä on saatavilla vastaavissa metatietueissa.

Viimeksi päivitetty: 28.05.2024

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2023092726

Tweet #lb_textreuse-sv

Text reuse clusters in the Swedish-language press 1645-1918

Suomeksi

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level
Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Corpus contents

The resource is based on a study of overlaps and repetitions of texts in the Swedish-language newspaper and magazine material that has been digitised by the national libraries of Finland and Sweden. The idea was to locate all texts or text fragments longer than 300 characters that had been repeated or copied at least once. More than 101 million of these similarities or overlaps were found. When the same texts were clustered together, there were almost 22 million clusters. The study covered the years 1645-1918, starting with the first newspaper printed in Sweden. In total, 7.5 million pages of digitised newspaper material were included in the study. In addition to the aforementioned newspapers printed in Finland and Sweden, the database includes Swedish-language immigrant newspapers published in North America.

The resource was produced by the project ”Informationsflöden över Östersjön: Svenskspråkig press som kulturförmedlare”, funded by Society of Swedish Literature in Finland (Svenska Litteratursällskapet i Finland). The digitised material was compiled in November 2022.

Try out the Search engine designed for searching and analysing these clusters of text reuse.

Further details about the content and the terms and conditions regarding the different corpus versions are available in the corresponding metadata records.

Last updated: 28.05.2024

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2023092725

GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

The GiellaLT website contains the technical documentation of the GiellaLT infrastructure, developed and used by Divvun and Giellatekno.

It is an open source website providing analysers and tools for a wide range of languages, as well as a ready-made setup for adding more languages.

Testing and enhancement of language models (transducers) from GiellaLT

The Language Bank of Finland is currently in the process of evaluating the state of development of GiellaLT’s analysers for individual languages in relation to text data being annotated for the Korp search engine.

Read more about the details and findings of the evaluation performed by Jack Rueter.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050301

Testing and enhancement of language models (transducers) from GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. The web site of GiellaLT offers language models (transducers) for a wide range of languages. Writing documentation for each language repository is an ongoing effort, and part of the development process.

Analyser enhancement

The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to make use of technological solutions that, otherwise, might require several years of individual development. It is here that descriptions for many of the Uralic languages have been initialized and developed as both financed projects and the work of language technology enthusiasts.
The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, contributing to the enhancement of the finite-state tools at GiellaLT, when extending the annotation of corpora on the Language Bank of Finland’s Korp server, is beneficial to the search engine users as well.

On this page, we will evaluate the state of development of analysers for individual languages in relation to text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new extended version of PaBiVUS (Parallel Biblical Verses for Uralic Studies). The objective is to increase the lemmatization, morphological and syntactic annotation coverage not previously offered for non-majority languages in the parallel corpus. So, here we will provide an illustrative depiction of each individual finite-state description and what steps have been made for improvement. This might be seen as enhanced but not complete coverage of various genre as we go.

The evaluations will tend to illustrate the capacities of the analysers, which do have equivalent generators, but the possible overproductivity of these generators is presently not the focus of these evaluations. In time, attention will be also drawn towards the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland in the shared responsibilities towards improved coverage of lesser described languages and NLP addressing them. Thus, the resulting analysers will available for building within the GiellaLT infrastructure or the UralicNLP python, java and .net libraries available through Github or the Language Bank of Finland.

For more details see the complete description on the analyser enhancement by Jack Rueter.

Evaluations of analysers for individual languages:

Please follow this link for a Follow-up on the analyser enhancement by Jack Rueter.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050302

Nordic Tweet Stream (NTS) haku- ja visualisointikäyttöliittymä

In English

NTS on monikielinen monitorikorpus, joka sisältää maantieteellisesti paikannettuja twiittejä ja niihin liittyviä metatietoja Pohjoismaista. Kaikkiaan se sisältää lähes 74 miljoonaa viestiä sadoilta tuhansilta käyttäjätileiltä Tanskasta, Suomesta, Islannista, Norjasta ja Ruotsista. NTS-tiedot kattavat ajanjakson tammikuun 2013 ja toukokuun 2023 välillä, ja ne kerättiin Twitter Academic API:n avulla, joka on nyt suljettu.

NTS:n tarkoituksena on helpottaa SSH:n perustutkimusta. NTS:ssä on helppokäyttöinen graafinen käyttöliittymä, joka tukee nopeaa tiedonsaantia, jotta tutkijat voivat keskittyä tietojen analysointiin. Tietoaineisto mahdollistaa erityyppiset tutkimukset. Esimerkiksi on mahdollista tutkia julkista keskustelua ja tunteita lähihistorian tapahtumista (esim. COVID-19-pandemia, Nato-jäsenyysprosessi jne.). Tietokokonaisuus on myös resurssi sosiolingvistiselle tutkimukselle ja monikielisyyden tutkijoille.

Tutustu verkkosivustoon.

Lisää tietoa NTS:stä

Jos käytät NTS-käyttöliittymää ja hyödynnät tuloksia julkaisuissasi, mainitse hiljattain julkaistu artikkeli, joka on saatavilla verkossa:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2024041502

Nordic Tweet Stream (NTS) search & visualization interface

Suomeksi

The NTS is a multilingual monitor corpus of geolocated tweets and associated metadata from the Nordic region. Altogether, it contains nearly 74 million messages from hundreds of thousands of user accounts from Denmark, Finland, Iceland, Norway, and Sweden. The NTS data cover the period between January 2013 and May 2023 and were collected using the Twitter Academic API, which is now closed.

The purpose of the NTS is to facilitate fundamental research in SSH. The NTS comes with an easy-to-use graphic interface that supports quick data access so that researchers can focus on data analysis. The dataset enables various types of research. For instance, it is possible to study public discourses and sentiment concerning events in recent history (e.g., the COVID-19 pandemic, the NATO membership process, etc.). The dataset is also a resource for sociolinguistic research and for scholars of multilingualism.

Please visit the website.

About NTS

If you use the NTS interface and use the findings in your publications, please cite the recent paper, which is available online:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024041501

Tweet #uralic-ud

Uralic UD

In English

Aineiston viimeisimmät versiot:
Uralic UD v2.13, Kielipankin Korp-versio (beta) Kuvailutiedot ja lisenssi Tämän version viittausohje	Avaa aineisto Korpissa
Etsi muut saatavilla olevat versiot

Tämän korpuksen uusin versio on annotoitu Universal Dependencies -järjestelmän version 2.13 mukaisesti seuraavien uralilaisten kielten osalta: Erzya, Estonian, Finnish, Hungarian, Karelian, Komi-Permyak, Komi-Zyrian, Livvi, Moksha, North Sami, Skolt Sami, Veps.

Puupankit ja niiden lisenssit:

Erzya (JR); CC BY-SA 4.0
Estonian (EDT, EWT); CC BY-NC-SA 4.0
Finnish (FTB, OOD, PUD, TDT); FTB: CC BY 4.0, other: CC BY-SA 4.0
Hungarian (Szeged); CC BY-NC-SA 3.0
Karelian (KKPP); CC BY-SA 4.0
Komi-Permyak (UH); CC BY-SA 4.0
Komi-Zyrian (IKDP, Lattice); CC BY-SA 4.0
Livvi (KKPP); CC BY-SA 4.0
Moksha (JR); CC BY-SA 4.0
North Sami (Giella); CC BY-SA 4.0
Skolt Sami (Giellagas); CC BY-SA 4.0
Veps (VWT); CC BY-SA 4.0

Universal Dependencies v2.13 License Agreement

Viitetiedot

Uralic UD-hankkeet aakkosjärjestyksessä kielen ja osahankkeen mukaan jaoteltuina:

UD_Erzya-JR
Osallistujat: Rueter, Jack; Tyers, Francis; Klementieva, Elena; Erina, Olga; Riabov, Ivan
https://github.com/UniversalDependencies/UD_Erzya-JR/blob/master/README.md

UD_Estonian-EDT
Osallistujat: Muischnek, Kadri; Müürisep, Kaili; Puolakainen, Tiina; Rääbis, Andriela; Torga, Liisi
https://github.com/UniversalDependencies/UD_Estonian-EDT/blob/master/README.md

UD_Estonian-EWT
Osallistujat: Muischnek, Kadri; Müürisep, Kaili; Puolakainen, Tiina; Särg, Dage; Eiche, Sandra; Rääbis, Andriela
https://github.com/UniversalDependencies/UD_Estonian-EWT/blob/master/README.md

UD_Finnish-FTB
Osallistujat: Piitulainen, Jussi; Nurmi, Hanna
https://github.com/UniversalDependencies/UD_Finnish-FTB/blob/master/README.md

UD_Finnish-OOD
Osallistujat: Kanerva, Jenna
https://github.com/UniversalDependencies/UD_Finnish-OOD/blob/master/README.md

UD_Finnish-PUD
Osallistujat: Kanerva, Jenna; Ginter, Filip; Ojala, Stina; Missilä, Anna
https://github.com/UniversalDependencies/UD_Finnish-PUD/blob/master/README.txt

UD_Finnish-TDT
Osallistujat: Ginter, Filip; Kanerva, Jenna; Laippala, Veronika; Miekka, Niko; Missilä, Anna; Ojala, Stina; Pyysalo, Sampo
https://github.com/UniversalDependencies/UD_Finnish-TDT/blob/master/README.txt

UD_Hungarian-Szeged
Osallistujat: Farkas, Richárd; Simkó, Katalin; Szántó, Zsolt; Varga, Viktor; Vincze, Veronika
https://github.com/UniversalDependencies/UD_Hungarian-Szeged/blob/master/README.md

UD_Karelian-KKPP
Osallistujat: Pirinen, Flammie
https://github.com/UniversalDependencies/UD_Karelian-KKPP/blob/master/README.md

UD_Komi_Permyak-UH
Osallistujat: Ponomareva, Larisa; Partanen, Niko; Rueter, Jack; Tyers, Francis
https://github.com/UniversalDependencies/UD_Komi_Permyak-UH/blob/master/README.md

UD_Komi_Zyrian-IKDP
Osallistujat: Partanen, Niko; Blokland, Rogier; Rießler, Michael; Rueter, Jack
https://github.com/UniversalDependencies/UD_Komi_Zyrian-IKDP/blob/master/README.md

UD_Komi_Zyrian-Lattice
Osallistujat: Partanen, Niko; Lim, KyungTae; Poibeau, Thierry; Rueter, Jack
https://github.com/UniversalDependencies/UD_Komi_Zyrian-Lattice/blob/master/README.md

UD_Livvi-KKPP
Osallistujat: Pirinen, Flammie
https://github.com/UniversalDependencies/UD_Livvi-KKPP/blob/master/README.md

UD_Moksha-JR
Osallistujat: Rueter, Jack; Levina, Maria; Kabaeva, Nadezhda; Molnár, Judit; Alnajjar, Khalid
https://github.com/UniversalDependencies/UD_Moksha-JR/blob/master/README.md

UD_North_Sami-Giella
Osallistujat: Trosterud, Trond; Antonsen, Lene; Tyers, Francis
https://github.com/UniversalDependencies/UD_North_Sami-Giella/blob/master/README.md

UD_Skolt_Sami-Giellagas
Osallistujat: Rueter, Jack; Juutinen, Markus; Tyers, Francis; Pirinen, Tommi A; Hämäläinen, Mika
https://github.com/UniversalDependencies/UD_Skolt_Sami-Giellagas/blob/master/README.md

UD_Veps-VWT
Osallistujat: Laan, Käbi
https://github.com/UniversalDependencies/UD_Veps-VWT/blob/master/README.md

Viimeksi päivitetty: 09.04.2024

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2024040901

Tweet #lb_digitala

DigiTala (2019–2023)

In English

Aineiston viimeisimmät versiot:
DigiTala: lukioissa ja yliopistossa kerätty S2-aineisto, syksy 2021 Kuvailutiedot ja lisenssi Tämän version viittausohje	Lataa aineisto
DigiTala: lukioissa kerätty S2-aineisto, kevät 2021 Kuvailutiedot ja lisenssi Tämän version viittausohje	Lataa aineisto
DigiTala: aikuisoppijoilta kerätty ruotsi toisena kielenä -aineisto, kevät 2023 Kuvailutiedot ja lisenssi Tämän version viittausohje	Lataa aineisto
DigiTalan YKI-aineisto Kuvailutiedot ja lisenssi Tämän version viittausohje	Lataa aineisto
Etsi muut saatavilla olevat versiot

Aineiston sisältö

Tämä resurssi sisältää näytteitä L2-suomea ja L2-ruotsia puhuvilta henkilöiltä, transkriptioita, ihmisten antamia arvioita, oppijoiden vastauksia testin jälkeisiin kyselyihin ja arvioijien vastauksia arvioinnin jälkeisiin kyselyihin. Aineisto on kerätty DigiTala-tutkimushankkeessa (2019-2023) suomea tai ruotsia toisena kielenä oppivilta aikuisopiskelijoilta.

DigiTala-tutkimushankkeen (2019-2023) päätavoitteena on kehittää digitaalinen työkalu, joka käyttää automaattista puheentunnistusta ja automaattista pisteytystä suomen- ja ruotsinkielisten oppijoiden suullisen kielitaidon arviointiin. Työkalu antaa myös automaattista palautetta oppijoiden puhesuorituksista. Hankkeessa kehitetyn digitaalisen työkalun tarkoituksena on mahdollistaa suullisen kielitaidon arviointi korkean tason kielikokeissa. Lisäksi oppilaat voivat harjoitella ääntämistä ja puheen tuottamista vierailla kielillä itsenäisesti koulun ulkopuolella tai ilman opettajan ohjausta kielitunneilla.

Hankkeen aikana kerättiin aineistoa suomea tai ruotsia toisena kielenä opiskelevilta lukiolaisilta ja yliopisto-opiskelijoilta. Lisäksi hankkeessa hyödynnettiin suomen ja ruotsin yleisten kielitutkintojen (Yleiset kielitutkinnot, YKI) puheaineistoa.

Hanke on Suomen Akatemian rahoittama 2019-2023, ja siinä yhdistyvät Helsingin yliopiston (apurahanumero 322619), Aalto-yliopiston (apurahanumero 322625) ja Jyväskylän yliopiston (apurahanumero 322965) asiantuntemus puheen ja kielen prosessoinnissa, kielikasvatuksessa ja fonetiikassa. Nykyinen hanke perustuu pilottihankkeen aikana saatuihin kokemuksiin, ks. DigiTala (2015-2017).

Lisätietoja sisällöstä ja eri korpusversioita koskevista ehdoista ja edellytyksistä löytyy kunkin aineistoversion kuvailutiedoista.

Lisätiedot

DigiTala-hankkeen (2019-2023) verkkosivusto

DigiTala-hankkeen materiaaleja: Tehtävät, kyselylomakkeet ja arviointikriteerit

Viimeksi päivitetty: 07.03.2024

Tämän sivun pysyvä tunniste: http://urn.fi/urn:nbn:fi:lb-2024013002

Viimeksi muokattu 2024-03-07

Hae Kielipankki-portaalista:

Kuukauden tutkija: Sofoklis Kakouros

Yhteystiedot

Kielipankin tekninen ylläpito:
kielipankki (ät) csc.fi
p. 09 4572001

Aineistoihin ja muuhun sisältöön liittyvät asiat:
fin-clarin (ät) helsinki.fi
p. 029 4129317

Tarkemmat yhteystiedot

Suomalaisen viittomakielen korpus

Saatavilla olevat versiot

Tietoa aineistosta

Tärkeitä huomautuksia

Lisenssi ja pääsy aineistoon

Corpus of Finnish Sign Language

Currently available versions of this resource

Further information

Important notes

License and access

Finland Swedish Online

STT:n uutisarkisto (1992-)

Saatavilla olevat versiot

Tulossa olevat versiot

Poistetut versiot

Tietoa aineistosta

Tärkeitä huomautuksia

Finnish News Agency Archive (1992-)

Currently available versions of this resource

Upcoming versions of this resource

Removed versions of this resource

Further information

Important notes

License and access

Suomenruotsalaisen viittomakielen korpus

Saatavilla olevat versiot

Finlandssvensk teckenspråkskorpus

Tillgängliga versioner av denna resurs

Corpus of Finland-Swedish Sign Language

Currently available versions of this resource

Amerikansuomalaisten siirtolaisten puhuttu suomen ja englannin kieli

Tämän aineiston tulossa olevat versiot (ei vielä saatavilla)

Spoken Finnish and English of Finnish-American Immigrants

Upcoming versions of this resource (not yet available)

INCEpTION

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

N-grams

Example queries from Korp

OCR quality

Tekstin uudelleenkäyttöklusterit ruotsinkielisessä lehdistössä 1645-1918

Saatavilla olevat versiot

Aineiston sisältö

Text reuse clusters in the Swedish-language press 1645-1918

Currently available versions of this resource

Corpus contents

GiellaLT

Testing and enhancement of language models (transducers) from GiellaLT

Testing and enhancement of language models (transducers) from GiellaLT

Analyser enhancement

Evaluations of analysers for individual languages:

Nordic Tweet Stream (NTS) haku- ja visualisointikäyttöliittymä

Nordic Tweet Stream (NTS) search & visualization interface

Uralic UD

Viitetiedot

DigiTala (2019–2023)

Aineiston sisältö

Lisätiedot

Uutisia

Yhteystiedot