Text reuse clusters in the Swedish-language press 1645-1918
Tekstin uudelleenkäyttöklusterit ruotsinkielisessä lehdistössä 1645-1918

Shortname: textreuse-sv-src
Metadata: http://urn.fi/urn:nbn:fi:lb-2023092721
Rightholder: University of Turku
License: CC BY

The complete license is available at
http://urn.fi/urn:nbn:fi:lb-2023092723
A copy of the license is included in LICENSE.txt. The license details
may be subject to change, so before downloading the resource, please
refer to the latest version of the license at the above link.


ABOUT THE CORPUS:
=================

The resource is based on a study of overlaps and repetitions of texts
in the Swedish-language newspaper and magazine material that has been
digitised by the national libraries of Finland and Sweden. The aim was
to locate all texts or text fragments longer than 300 characters that
had been repeated or copied at least once. More than 101 million such
similarities or overlaps were found. When instances of the same text
were grouped together, they formed almost 22 million clusters.

The study covered the years 1645-1918, starting with the first
newspaper printed in Sweden. In total, 7.5 million pages of digitised
newspaper material were included in the study. In addition to the
aforementioned newspapers printed in Finland and Sweden, the database
includes Swedish-language immigrant newspapers published in North
America.

The resource was produced by the project "Informationsflöden över
Östersjön: Svenskspråkig press som kulturförmedlare", funded by the
Society of Swedish Literature in Finland (Svenska Litteratursällskapet
i Finland). The digitised material was compiled in November 2022.


DATA FORMAT:
============

The data files are gzip-compressed JSON. They are known to be readable
into Python with json.load(gzip.open('D_DD.gz')), given sufficient
memory (some files need more than 8G, but all need less than 16G).

The total compressed size of the corpus is 61G, split across archives
of 12G or less.

For further information, please contact fin-clarin@helsinki.fi.
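

EXAMPLE (ILLUSTRATIVE):
=======================

The loading call given under DATA FORMAT can be wrapped as in the
minimal sketch below. The function names load_data_file and summarise
are hypothetical helpers introduced here, not part of the resource, and
UTF-8 encoding is an assumption. The internal structure of the JSON is
not described in this README, so the sketch only reports the top-level
type and size rather than assuming any field names.

```python
import gzip
import json


def load_data_file(path):
    """Load one gzip-compressed JSON data file fully into memory."""
    # Text mode with explicit UTF-8 is an assumption about the encoding.
    # The whole file is parsed at once, so the memory requirements stated
    # under DATA FORMAT apply.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarise(data):
    """Return the top-level type name and size, assuming no schema."""
    size = len(data) if isinstance(data, (dict, list)) else None
    return type(data).__name__, size
```

A call such as summarise(load_data_file('D_DD.gz')), with D_DD.gz
replaced by the name of an actual data file, is a cautious first step
before exploring the contents further.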