Text reuse clusters in the Swedish-language press 1645-1918

Tekstin uudelleenkäyttöklusterit ruotsinkielisessä lehdistössä 1645-1918

Shortname: textreuse-sv-src

Metadata: http://urn.fi/urn:nbn:fi:lb-2023092721

Rightholder: University of Turku

License: CC BY
The complete license is available at http://urn.fi/urn:nbn:fi:lb-2023092723

A copy of the license is included in LICENSE.txt. The license details
may be subject to change, so before downloading the resource, please
refer to the latest version of the license at the above link.


ABOUT THE CORPUS:
=================
The resource is based on a study of overlaps and repetitions of texts 
in the Swedish-language newspaper and magazine material that has been 
digitised by the national libraries of Finland and Sweden. The idea was 
to locate all texts or text fragments longer than 300 characters that 
had been repeated or copied at least once. More than 101 million of 
these similarities or overlaps were found. When the same texts were 
clustered together, there were almost 22 million clusters. The study 
covered the years 1645-1918, starting with the first newspaper printed 
in Sweden. In total, 7.5 million pages of digitised newspaper material 
were included in the study. In addition to the aforementioned newspapers 
printed in Finland and Sweden, the database includes Swedish-language 
immigrant newspapers published in North America.

The resource was produced by the project "Informationsflöden över Östersjön:
Svenskspråkig press som kulturförmedlare", funded by the Society of Swedish
Literature in Finland (Svenska Litteratursällskapet i Finland). The digitised
material was compiled in November 2022.


DATA FORMAT:
============
The data files are gzip-compressed JSON. They are known to be readable in
Python with json.load(gzip.open('D_DD.gz')), given sufficient memory (some
files need more than 8G, but all need less than 16G).
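
As an illustration only, the loading step might be wrapped as in the sketch
below. The function names load_data_file and describe are hypothetical and
not part of the resource. The sketch assumes the files are UTF-8 encoded and
makes no assumption about the JSON schema, which is not documented here; it
only reports the type and length of the top-level value.

```python
import gzip
import json


def load_data_file(path):
    """Decompress one corpus data file and parse it as JSON.

    Hypothetical helper; UTF-8 encoding is an assumption.
    """
    # Text mode with an explicit encoding, so that Swedish characters
    # decode the same way regardless of the platform default.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def describe(data):
    """Summarise the top-level JSON value without assuming its schema."""
    size = len(data) if isinstance(data, (dict, list)) else None
    return type(data).__name__, size
```

A call such as describe(load_data_file('D_DD.gz')), with the file name given
as in the one-liner above, would load the whole file into memory, so the
memory requirements stated above still apply.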

The total compressed size of the corpus is 61G, split across archives of 12G or less.

For further information, please contact fin-clarin@helsinki.fi .