The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT

Persistent identifier:
Licence: CC BY 4.0, IPR holder: The National Library of Finland
Short name: klk-sv-1880-1948-s-vrt

Description

The corpus contains the years 1880–1948 of the Swedish sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland in the VRT (VeRticalized Text) format. The data has been digitized by the National Library of Finland, and converted to the VRT format and annotated by FIN-CLARIN. The sentences within each page have been scrambled into a random order for copyright reasons. For more information, please see the corpus metadata record.

The data has been annotated with an old version of Språkbanken’s Korp corpus pipeline, with text-level metadata from the original data. Please note that the text has been recognized programmatically from page images (OCR’d) and annotated without any manual correction, so its quality varies significantly.

The data for each year is in a single file, named klk-sv-YYYY-s.vrt. The data is encoded in UTF-8, with Unix-style line endings (LF). The literal characters &, < and > have been encoded as the XML predefined entities &amp;, &lt; and &gt;, and in structural attribute annotations also " as &quot;.

Each token is on a line of its own, with the token and its annotation attributes (positional attributes) separated by tabs.
The attributes are the following (in this order, also listed in the “#vrt positional-attributes” comment at the beginning of the file):

  word: word form
  pos: part-of-speech tag
  msd: morpho-syntactic description
  lemma: base form(s)
  lex: lemgram(s) (lemma + part-of-speech code)
  saldo: lemma(s) with sense information
  prefix: prefix lemgram(s)
  suffix: suffix lemgram(s)
  ref: the number of the token in the sentence
  dephead: the number of the dependency head of the token
  deprel: dependency relation
  ocr: OCR confidence for the token (0.01…1.00)
  style: “_” (normal text), “subscript” or “superscript”

The attributes lemma, lex, saldo, prefix and suffix are feature-set (multi-valued) attributes, in which the different values are separated by vertical bars (|), with a leading and trailing vertical bar. A lone vertical bar denotes the empty set (no value).

Structural divisions are marked with XML-style tags, with annotations associated with each structure as attributes in the start tag. The order of the annotation attributes may vary.
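
As an illustration of the token line format and the feature-set notation described above, the following is a minimal Python sketch that splits one token line into its positional attributes. The function names and the sample line are hypothetical and invented for this example (the sample is not taken from the corpus); all non-feature-set values are kept as strings, since the README does not state how missing values are marked.

```python
# Attribute order as listed in this README.
POSITIONAL_ATTRIBUTES = [
    "word", "pos", "msd", "lemma", "lex", "saldo", "prefix", "suffix",
    "ref", "dephead", "deprel", "ocr", "style",
]
FEATURE_SET_ATTRIBUTES = {"lemma", "lex", "saldo", "prefix", "suffix"}


def decode_entities(value):
    """Decode the four XML entities used in the data; &amp; goes last
    so that an encoded "&amp;lt;" ends up as the literal text "&lt;"."""
    for entity, char in (("&lt;", "<"), ("&gt;", ">"),
                         ("&quot;", '"'), ("&amp;", "&")):
        value = value.replace(entity, char)
    return value


def parse_feature_set(value):
    """Turn "|a|b|" into ["a", "b"]; a lone "|" becomes an empty list."""
    return [decode_entities(item) for item in value.split("|") if item]


def parse_token_line(line):
    """Split one tab-separated token line into a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(POSITIONAL_ATTRIBUTES):
        raise ValueError(
            f"expected {len(POSITIONAL_ATTRIBUTES)} fields, got {len(fields)}"
        )
    token = {}
    for name, value in zip(POSITIONAL_ATTRIBUTES, fields):
        if name in FEATURE_SET_ATTRIBUTES:
            token[name] = parse_feature_set(value)
        else:
            token[name] = decode_entities(value)
    return token


# Invented sample line, for illustration only.
SAMPLE_LINE = (
    "flickan\tNN\tNN.UTR.SIN.DEF.NOM\t|flicka|\t|flicka..nn.1|\t"
    "|flicka..1|\t|\t|\t1\t2\tSS\t0.95\t_"
)

if __name__ == "__main__":
    print(parse_token_line(SAMPLE_LINE))
```

Checking the field count against the attribute list is a deliberate choice here: because the text was OCR’d and annotated without manual correction, failing loudly on a malformed line seems safer than silently shifting columns.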
The structures and their annotation attributes are:

text: A single page of a newspaper or magazine
  binding_id: issue identifier used for linking to page images at the National Library of Finland
  datefrom: the first date of the date range covering the issue date (yyyymmdd): if the issue date is a year, “yyyy0101”, if a month, “yyyymm01”
  dateto: the last date of the date range covering the issue date (yyyymmdd): e.g., if the issue date is a year, “yyyy1231”
  elec_date: digitization date (yyyy-mm-dd)
  file: original single-page VRT file name
  img_url: template for page image file name
  issue_date: date of the issue in the format [[dd.]mm.]yyyy
  issue_no: number of the issue
  issue_title: title of the issue
  label: name of the publication, issue number and date
  language: two-letter ISO 639-1 language code
  page_id: page identifier
  page_no: page number
  part_name: name of the part of publication (seldom used)
  publ_id: publication identifier: either ISSN or “fk” + number for publications without an ISSN
  publ_part: part of publication (number) (seldom used)
  publ_title: name of the publication
  publ_type: type of publication: “sanomalehti” for a newspaper, “aikakausi” for a periodical
  sentcount: number of sentences on the page
  timefrom: always “000000” (time information at day granularity)
  timeto: always “235959”
  tokencount: number of tokens on the page

sentence: A sentence
  id: unique identifier of the sentence

Note that sentences broken by page breaks have not been concatenated.

The data also contains some single-line XML-style comments at the beginning and end of each file.
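
The structure described above can be read line by line. The following Python sketch is one possible way to do so, under stated assumptions: the names iter_sentences, parse_start_tag and decode_entities are hypothetical, comment lines are assumed to start with “<!--”, and the sample lines are invented for illustration rather than taken from the corpus. Start-tag attributes are collected into a dict, so their varying order does not matter. Token lines are returned as raw lists of tab-separated fields.

```python
import re

# Matches name="value" pairs; values cannot contain a literal ",
# because it is encoded as &quot; in structural attributes.
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def decode_entities(value):
    """Decode the four XML entities used in the data (&amp; last)."""
    for entity, char in (("&lt;", "<"), ("&gt;", ">"),
                         ("&quot;", '"'), ("&amp;", "&")):
        value = value.replace(entity, char)
    return value


def parse_start_tag(line):
    """Return (structure name, attribute dict) for a start tag line."""
    name = line[1:].split(None, 1)[0].rstrip(">")
    attrs = {key: decode_entities(val) for key, val in ATTR_RE.findall(line)}
    return name, attrs


def iter_sentences(lines):
    """Yield (text attributes, sentence attributes, token field lists)."""
    text_attrs = {}
    sentence_attrs = None
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<!--"):
            continue  # skip blank lines and single-line comments
        if line.startswith("</"):
            if line == "</sentence>" and sentence_attrs is not None:
                yield text_attrs, sentence_attrs, tokens
                sentence_attrs, tokens = None, []
            continue
        if line.startswith("<"):
            # A literal < in a token is encoded as &lt;, so this is a tag.
            name, attrs = parse_start_tag(line)
            if name == "text":
                text_attrs = attrs
            elif name == "sentence":
                sentence_attrs, tokens = attrs, []
            continue
        if sentence_attrs is not None:
            tokens.append(line.split("\t"))


# Invented sample input, for illustration only.
SAMPLE_LINES = [
    "<!-- an example comment line -->",
    '<text page_no="3" publ_title="Exempel &amp; Co" issue_date="01.02.1900">',
    '<sentence id="s1">',
    "Hej\tIN\tIN\t|hej|\t|hej..in.1|\t|hej..1|\t|\t|\t1\t_\tROOT\t0.90\t_",
    "</sentence>",
    "</text>",
]

if __name__ == "__main__":
    for text, sentence, toks in iter_sentences(SAMPLE_LINES):
        print(text["publ_title"], sentence["id"], [t[0] for t in toks])
```

With a real file, the same generator could be fed an open file object, for example open("klk-sv-1900-s.vrt", encoding="utf-8"), where the year is only an example of the klk-sv-YYYY-s.vrt naming pattern. Since the sentences on each page are in random order, any processing that relies on sentence sequence within a page would not be meaningful.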