Resource title (English): The Suomi24 Corpus 2018-2020, VRT version 1.1
Resource title (Finnish): Suomi24-korpus 2018-2020, VRT-versio 1.1
Shortname: suomi24-2018-2020-vrt-v1-1
Metadata: http://urn.fi/urn:nbn:fi:lb-2021101523
Rightholder: City Digital Group
License: ACA-NC

The complete license is available at http://urn.fi/urn:nbn:fi:lb-2022020821

A copy of the license is included in LICENSE.txt. The license details may be subject to change, so before downloading the resource, please refer to the latest version of the license at the above link.

Resource group page: http://urn.fi/urn:nbn:fi:lb-2022011221


Short description

The corpus contains all the texts available in the discussion forums of the Suomi24 online social networking website from 1 January 2018 to 31 December 2020. The data was tokenized, converted to VRT format and annotated at the Language Bank of Finland. VRT version 1.1 includes annotations for names and identified languages as well as minor changes in some text attribute values. The entire corpus in the VRT format is downloadable for academic research purposes.


Detailed description

This data set is an annotated VRT version of a full database dump of the content of the Suomi24 discussion forums (https://keskustelu.suomi24.fi) from 1 January 2018 to 31 December 2020, received from City Digital Group in May 2021. The data set excludes data from closed or hidden discussion topics.

The data was tokenized, transformed to VRT format and morpho-syntactically annotated for FIN-CLARIN in the CSC Puhti environment with ad-hoc FIN-CLARIN VRT Tools scripts. These ran, among others, the UDPipe tokenizer (finnish-tdt model, with post-processors) and the older dependency analysis tools and models (TDPP) from the Turku NLP group (the TDPP script was adapted for VRT at the Language Bank; the models were used as they were). The messages were then reordered and augmented with derived attributes.
Later, sentence sentiment polarity was annotated by a sentiment classifier trained on the FinnSentiment corpus (see https://arxiv.org/pdf/2012.02613.pdf), names were recognized by the FiNER tagger, a part of Finnish Tagtools 1.6 (http://urn.fi/urn:nbn:fi:lb-2024021401), and sentence languages were identified with HeLI-OTS 2.0 (https://urn.fi/urn:nbn:fi:lb-2024040301).

In VRT version 1.1, name attributes have been added, the language identified for each sentence has been updated to that produced by HeLI-OTS 2.0, and aggregate identified-language attributes have been added to each paragraph and text. In addition, spurious spaces have been removed from some text attributes, and some attributes have been renamed. Please see the end of this file for more details on the changes.

The data has been divided into files by year, corresponding to the subcorpora in Korp. The messages within each year are sorted by thread, and threads are sorted by the timestamp of the first message of the thread. Messages within a thread are sorted in thread order: each message is followed by the direct comments to it (recursively), sorted by their timestamp. Threads that span several years have been split by year.

Messages appear as text elements that contain paragraph elements that contain sentence elements that contain a sequence of annotated tokens. Thread titles appear both as an attribute in each message and as a paragraph in the first message of the thread.
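
The thread order described above can be illustrated with a short Python sketch. It is not the script that was used to order the corpus (the actual order follows the text attribute "_sort_key"); it only assumes that each message is available as a dictionary holding the values of the text attributes "comment_id", "parent_comment_id" and "datetime" (described below), with the identifiers converted to integers, and that the thread start message is present. The function name thread_order and the sample values are made up for the illustration.

```python
from collections import defaultdict


def thread_order(messages):
    """Return the messages of one thread in thread order.

    Each message is a dict with the keys comment_id, parent_comment_id
    and datetime (hypothetical in-memory form of the text attributes).
    """
    # The thread start message has comment_id 0, and direct comments
    # to it have parent_comment_id 0.
    replies = defaultdict(list)
    start = None
    for msg in messages:
        if msg["comment_id"] == 0:
            start = msg
        else:
            replies[msg["parent_comment_id"]].append(msg)
    if start is None:
        raise ValueError("thread start message not found")
    # "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain strings.
    for group in replies.values():
        group.sort(key=lambda m: m["datetime"])
    ordered = []
    stack = [start]
    while stack:
        msg = stack.pop()
        ordered.append(msg)
        # Push replies in reverse so that the earliest one is handled first.
        stack.extend(reversed(replies.get(msg["comment_id"], [])))
    return ordered


# Made-up sample thread: a start message, two direct comments and one reply.
thread = [
    {"comment_id": 0, "parent_comment_id": 0, "datetime": "2018-01-01 01:30:00"},
    {"comment_id": 5, "parent_comment_id": 0, "datetime": "2018-01-01 03:00:00"},
    {"comment_id": 3, "parent_comment_id": 0, "datetime": "2018-01-01 02:00:00"},
    {"comment_id": 7, "parent_comment_id": 3, "datetime": "2018-01-01 04:00:00"},
]
print([m["comment_id"] for m in thread_order(thread)])
```

Note that a thread split by year may lack its start message in the later year's file; the sketch raises an error in that case instead of guessing an order.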

The text elements contain the following essential attributes:

- msg_type: "thread_start" or "comment"
- thread_id: thread identifier (number)
- comment_id: comment identifier (number; 0 if thread start message)
- msg_id: constructed message id (thread_id:comment_id)
- parent_comment_id: parent-comment identifier (0 if thread start message or if parent is the thread start message)
- quoted_comment_id: quoted-comment identifier (0 if no quotation)
- date: creation date (2019-01-11)
- time: creation time (16:55:26)
- datetime: combined creation date and time (2019-01-11 16:55:26)
- thread_start_datetime: creation date and time of the thread start message (2018-01-01 01:30:00)
- parent_datetime: creation date and time of the parent comment (2018-01-01 01:30:00, empty for thread start messages)
- author: user nickname
- author_logged_in: whether author was logged in (y, n)
- title: thread title from starting message
- topic_names: hierarchical topic (discussion area) name, top level first, levels separated by " > " ("Ajoneuvot ja liikenne > Autot > Automerkit > Honda")
- topic_names_set: topic level names as a set ("|Ajoneuvot ja liikenne|Automerkit|Autot|Honda|")
- topic_name_top: top-level topic name ("Ajoneuvot ja liikenne")
- topic_name_leaf: bottom-level topic name ("Honda")
- topic_adultonly: whether the topic is for adults only (y, n)
- thread_closed: whether the thread is closed (new comments cannot be written) (y, n)
- empty: whether the original message was completely empty (y, n)
- sum_lang: the ISO 639-3 codes of languages identified in the sentences of the text and the number of sentences in each language (see the sentence attribute lang below for some more information) ("|fin:37|izh:1|und:1|", ordered by number of occurrences, tied codes in alphabetic order)

The following text element attributes can be derived from other attributes, are included mostly for backward-compatibility or are otherwise less essential:

- title_orig: original title with possible leading, trailing and multiple consecutive spaces preserved
- author_orig: original author with possible leading, trailing and multiple consecutive spaces preserved
- topic_names_orig: original hierarchical topic (discussion area) name, with double spaces preserved
- datetime_approximated: whether the date and time were approximated based on the surrounding messages (always "n" (no) in this data)
- author_nick_registered: whether nickname was registered (always "?" in this data, as the information was not available)
- user_id: a user identifier of 32 hexadecimal digits for logged-in users, corresponding to the nickname in attribute author; 0 for others
- hierarchy_id: an id (number) in the comment messages of the original data whose purpose is unknown to us but kept just in case; empty for thread start messages
- datefrom, dateto: creation date (20190111)
- timefrom, timeto: creation time (165526)
- author_name_type: always "user_nickname"
- filename_vrt: the name of the VRT file containing the message during processing
- filename_orig: the name of the VRT file containing the message in the VRT version 1.0 of the corpus
- origfile_textnum: the number of the corresponding text element in the VRT file in the VRT version 1.0 (1-based)
- id: same as msg_id (the previous name of msg_id)
- _sort_key: the key according to which the messages were sorted (byte-wise) within each thread

Paragraph attributes:

- type: "title" or "body"
- sum_lang: the ISO 639-3 codes of languages identified in the sentences of the paragraph and the number of sentences in each language (see the sentence attribute lang below for some more information) ("|fin:2|und:1|")
- id: running number of the paragraph within the subcorpus

Sentence attributes:

- lang: ISO 639-3 code of the language of the sentence as identified by HeLI-OTS 2.0; "und" for non-language data
- lang_conf: a confidence value of the language identification provided by HeLI-OTS
- lang_v1: ISO 639-3 code of the language of the sentence as identified by HeLI-OTS 1.1 (http://urn.fi/urn:nbn:fi:lb-2021062801); "xxx" for non-language data ("lang" in version 1.0)
- sentiment_polarity: sentiment polarity of the sentence: "pos", "neut" or "neg"
- polarity: an alias of (and the older name for) sentiment_polarity
- id: running number of the sentence within the subcorpus
- _skip: "|finnish-nertag|" if the sentence was not annotated with names; completely missing otherwise ("|" in Korp)

In addition to these elements to which all tokens belong, name (and time and number) expressions recognized by FiNER 1.6 are enclosed in "ne" elements with the following attributes:

- name: the name enclosed by the element, possibly multi-word; for name expressions, the last word is the base form of the last token, whereas the preceding ones are word forms
- fulltype: the complete type of the name as recognized by FiNER ("EnamexOrgCrp")
- ex: the main category of the expression: "ENAMEX" (name), "TIMEX" (time expression) or "NUMEX" (numerical expression)
- type: the broad type of the expression ("ORG")
- subtype: the finer type of the expression ("CRP")
- placename: same as the value for "name" if the name is recognized as a place name, empty otherwise
- placename_source: "ner" if the name is recognized as a place name, empty otherwise

Nested name expressions are enclosed in "ne1" and "ne2" elements with the same attributes as "ne". "ne1" elements occur only within "ne" and "ne2" only within "ne1".

The order of the attributes in the element start tags is arbitrary but fixed.

The original data contained one completely empty message. To preserve its information in the VRT data, a lone underscore was added as its content, with the appropriate annotations. The attribute "empty" of this text has the value "y".
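
The set-valued attributes described above ("sum_lang", "topic_names_set" and "_skip") share the same notation: items separated and surrounded by vertical bars. The following Python sketch shows one possible way to read them; the function names parse_set and parse_sum_lang are made up for this illustration and are not part of any FIN-CLARIN tool.

```python
def parse_set(value):
    """Split a vertical-bar-delimited value into a list of items.

    "|Ajoneuvot ja liikenne|Autot|" -> ["Ajoneuvot ja liikenne", "Autot"]
    An empty set ("|" or "") yields an empty list.
    """
    return [item for item in value.strip("|").split("|") if item]


def parse_sum_lang(value):
    """Turn a sum_lang value into (language code, sentence count) pairs.

    The pairs are returned in the order in which they occur in the value,
    that is, by decreasing number of sentences.
    """
    pairs = []
    for item in parse_set(value):
        code, _, count = item.rpartition(":")
        pairs.append((code, int(count)))
    return pairs


# Example values taken from the attribute descriptions above.
print(parse_sum_lang("|fin:37|izh:1|und:1|"))
print(parse_set("|Ajoneuvot ja liikenne|Automerkit|Autot|Honda|"))
```

Because the values are ordered by the number of occurrences, the first pair returned by parse_sum_lang is the most frequent identified language of the text or paragraph.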

The first line of each VRT file is a special comment that names the positional attributes (tab-separated fields) in order:

<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel spaces initid lex/ nertag2 nertags2/ nerbio2 -->

- word: surface form of the token
- lemma: base form
- lemmacomp: base form with compound-boundary markers (vertical bars) separating compound parts
- pos: part of speech
- msd: morpho-syntactic description
- ref: the number of the token in the sentence
- dephead: dependency head number (0 if no head)
- deprel: dependency relation
- spaces: spaces around (or within) the token in the original data (from tokenizer)
- initid: running number (from tokenizer; largely redundant with ref)
- lex/: lemgram, a combination of base form and a part-of-speech tag, surrounded by vertical bars
- nertag2: maximal name information produced by FiNER, of the form CategoryTypSbt-X, where CategoryTypSbt is the full type of the name (see above) and X is one of "B" (the first word of a multi-word name), "E" (the last word of a multi-word name) or "F" (a full single-word name)
- nertags2/: name information produced by FiNER, including possible nested names: values CategoryTypSbt-X-N separated by vertical bars, where CategoryTypSbt and X are as in "nertag2" and N is the nesting level (0, 1 or 2), with 0 being the outermost (maximal) name
- nerbio2: a different kind of name information produced by FiNER for maximal names: B-TYP (the first word of a name with broad type TYP), I-TYP (a subsequent word of a name with type TYP) or O (outside a name)

Since the parser produced some multi-rooted analyses anyway, long sentences that were parsed in shorter pieces were left multi-rooted when the pieces were put back together.

The three characters < > & appear as &lt; &gt; &amp; everywhere (because in bare form they are used for the markup), and the double quotation mark " appears as &quot; in text attribute values.
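
As an illustration of the file layout described above, the following Python sketch reads the positional-attribute names from the first line, splits a token line into named fields and reads the attributes of a structure start tag. It assumes that the encoded characters use the standard XML entities and decodes them with html.unescape from the Python standard library. The function names and the sample token and tag are made up for the illustration; the placeholder "_" stands for field values not shown here.

```python
import html
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_positional_names(first_line):
    """Return the positional-attribute names listed in the first line."""
    match = re.match(r"<!-- #vrt positional-attributes: (.*?) -->", first_line)
    if match is None:
        raise ValueError("not a positional-attributes comment")
    return match.group(1).split()


def parse_token(line, names):
    """Map a tab-separated token line to a dict of decoded field values."""
    fields = line.rstrip("\n").split("\t")
    return {name: html.unescape(value) for name, value in zip(names, fields)}


def parse_start_tag(line):
    """Return the attributes of a structure start tag as a dict."""
    return {name: html.unescape(value) for name, value in ATTR_RE.findall(line)}


header = ("<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd "
          "dephead deprel spaces initid lex/ nertag2 nertags2/ nerbio2 -->")
names = parse_positional_names(header)

# Made-up token line: only the first three fields carry real-looking values.
token_line = "\t".join(["R&amp;D", "1", "R&amp;D"] + ["_"] * 11)
token = parse_token(token_line, names)
print(token["word"], token["ref"], token["lemma"])

# Made-up start tag with encoded characters in an attribute value.
tag = '<text msg_type="comment" thread_id="123" title="A &quot;B&quot; &amp; C">'
print(parse_start_tag(tag)["title"])
```

Lines beginning with "<" are structure tags or comments, so a reader of a whole file would typically pass those lines to parse_start_tag (or skip them) and all other lines to parse_token.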
Attribute values are always enclosed in double quotation marks. Otherwise all content is encoded as UTF-8.

Spurious control characters were interpreted or removed, space characters were normalized and non-characters were removed. However, Unicode normalization was not done, nor ligatures considered; unassigned code points and private-use characters may have been addressed somehow. No attempt was made to normalize the various characters used or abused for quotation marks, apostrophes, or dashes.

The values of the text attributes "title", "topic_names", "topic_name_leaf" and "topic_names_set" were cleaned up so that they do not contain leading, trailing or multiple consecutive spaces. The original values of "title" and "topic_names" are preserved in attributes "title_orig" and "topic_names_orig", respectively. The original values of "topic_name_leaf" and "topic_names_set" can be derived from "topic_names_orig".

Over-long "words" were shortened and marked with "REDACTED" in the data, partly for processing reasons.

The tokenizer has not split tokens at punctuation marks lacking the space that normatively follows them. If the punctuation mark would have indicated a sentence break, the break has been omitted. In addition, if the word following the punctuation mark contains a non-ASCII letter, such as ä or ö, the word has incorrectly been split before the non-ASCII letter. This may be corrected in a future version of the data. (Tokenization issues are hard to fix at scale when further annotations depend on the old tokenization.)

Each VRT file contains a couple of informational XML-style comment lines ("<!-- ... -->") at the beginning and end of the file.


Differences from VRT version 1.0

- Names have been annotated with the positional attributes "nertag2", "nertags2/" and "nerbio2" and structures "ne", "ne1" and "ne2" and their attributes.
- Sentence languages have been identified with HeLI-OTS 2.0 and annotated with the attributes "lang" and "lang_conf" (and "_skip"), with aggregate values in paragraph and text attributes "sum_lang". In version 1.0, the "lang" attribute in sentences was the language identified with HeLI-OTS 1.1; its value is preserved in attribute "lang_v1".
- The sentence attribute "polarity" has been renamed to the more appropriate "sentiment_polarity", but "polarity" is kept as an alias for backward-compatibility.
- The value of the text attribute "title" has been cleaned up by removing leading, trailing and multiple consecutive spaces. The original value is preserved in attribute "title_orig". The attribute "author_orig" has been added for compatibility with Suomi24 2001–2017, VRT version 1.3, but its value is always the same as that of "author".
- The values of text attributes "topic_names", "topic_names_set" and "topic_name_leaf" have been cleaned up by removing the spurious space in " Työpaikkailmoitukset" and "Ravinto ja ruokavaliot". The original value of "topic_names" is preserved in attribute "topic_names_orig"; the other two attributes have no corresponding original-value attribute, but their original values can be inferred from "topic_names_orig".
- The text attribute "id" has been aliased to "msg_id" for forward-compatibility.


Differences from Suomi24 2001–2017, VRT version 1.3

The format of the data of Suomi24 2018–2020, VRT version 1.1, is mostly the same as that of Suomi24 2001–2017, VRT version 1.3 (http://urn.fi/urn:nbn:fi:lb-2020021801). The few differences are due to differences in the original source data or in processing the data.

Additional text attributes (see above for their meaning):

- hierarchy_id
- thread_closed
- user_id

Additional sentence attribute:

- lang_v1

Some attributes are missing, as either the information they contained was not available in the source data or they were not applicable to this data:

- author_v1: not applicable
- author_nick_type: whether nickname was registered (information not in source data)
- author_signed_status: whether nickname was registered and the author logged in (nickname registration status not in source data)
- topic_nums: comma-separated topic numbers (not in source data)
- topic_nums_set: topic numbers as a set (not in source data)

Values for the positional attribute "lemma" for compound words may differ from those in Suomi24 2001–2017, as they are intended to be more natural, without lemmatizing all compound parts of the word.

Some Unicode characters may have been treated differently.


For further information, please contact fin-clarin@helsinki.fi.