The Suomi24 Corpus 2018–2020, VRT version (beta)

PLEASE NOTE that this data is a beta test version and may change
without notice.

Persistent identifier: http://urn.fi/urn:nbn:fi:lb-2021101523
Licence: CLARIN ACA +NC 1.0: http://urn.fi/urn:nbn:fi:lb-20150304151


Short description

The corpus contains all the texts available in the discussion forums
of the Suomi24 online social networking website from 1 January 2018 to
31 December 2020. The data was tokenized, converted to VRT format and
annotated at the Language Bank of Finland.

The entire corpus in the VRT format is downloadable for academic
research purposes.


Detailed description

This data set is an annotated VRT version of a full database dump of
the content of the Suomi24 discussion forums
(https://keskustelu.suomi24.fi) from 1 January 2018 to 31 December
2020 from City Digital Group, received in May 2021. The data set
excludes data from closed or hidden discussion topics.

The data was cleaned up, tokenized, transformed to VRT format and
morpho-syntactically annotated for FIN-CLARIN in the CSC Puhti
environment by Jussi Piitulainen using various ad-hoc and FIN-CLARIN
VRT Tools scripts running e.g. the UDPipe tokenizer and the old
dependency analysis tools and models (TDPP) from Turku NLP. The
messages were then reordered and augmented with derived attributes by
Jyrki Niemi.

The data has been divided into files by year, corresponding to the
subcorpora in Korp. The messages within each year are sorted by thread
and threads by the timestamp of the first message of the thread.
Messages within a thread are sorted in thread order: each message is
followed by the direct comments to it (recursively), sorted by their
timestamp. Threads that span several years have been split by
year.
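
The thread order described above amounts to a depth-first, pre-order
traversal of the reply tree. The following is only a minimal sketch of
that idea, not the script actually used: it assumes that the messages
of one thread are available as dictionaries keyed by the text
attribute names comment_id, parent_comment_id and datetime, and the
function name thread_order is hypothetical.

```python
from collections import defaultdict


def thread_order(messages):
    """Return the messages of one thread in thread order.

    Each message is a dict with the string-valued keys "comment_id",
    "parent_comment_id" and "datetime". The thread start message has
    comment_id "0"; direct comments to it have parent_comment_id "0".
    """
    children = defaultdict(list)
    ordered = []
    for msg in messages:
        if msg["comment_id"] == "0":
            ordered.append(msg)  # the thread start message comes first
        else:
            children[msg["parent_comment_id"]].append(msg)
    for replies in children.values():
        # Newest first, so that popping from the stack yields the oldest.
        replies.sort(key=lambda m: m["datetime"], reverse=True)
    stack = list(children.get("0", []))
    while stack:
        msg = stack.pop()
        ordered.append(msg)
        # The replies to this message follow it directly.
        stack.extend(children.get(msg["comment_id"], []))
    return ordered
```

In this sketch, comments whose parent is not among the input messages
are left out, and ties between equal timestamps are not resolved in
any particular way.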

Messages appear as text elements that contain paragraph elements that
contain sentence elements that contain a sequence of annotated tokens.
Thread titles appear both as an attribute in each message and as a
paragraph in the first message of the thread.
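
The nesting described above can be read with a simple line-based
loop. The following is a minimal sketch under assumptions: the element
names are taken to be "text", "paragraph" and "sentence", each tag is
assumed to be on a line of its own, and the function name iter_texts
is hypothetical. It relies on the fact that a bare < never begins a
token line, as < is encoded as &lt; in the data.

```python
def iter_texts(lines):
    """Yield each text as a list of sentences.

    A sentence is a list of tokens, and a token is the list of its
    tab-separated fields. Paragraph boundaries are ignored here.
    """
    sentences = None
    tokens = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<text"):
            sentences = []
        elif line.startswith("<sentence"):
            tokens = []
        elif line.startswith("</sentence>"):
            sentences.append(tokens)
            tokens = None
        elif line.startswith("</text>"):
            yield sentences
            sentences = None
        elif line.startswith("<"):
            continue  # comment lines and paragraph tags
        elif tokens is not None:
            tokens.append(line.split("\t"))
```

As the function takes any iterable of lines, an open file object can
be passed to it directly, so that a large VRT file need not be read
into memory at once.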

The text elements contain the following essential attributes:
- msg_type: "thread_start" or "comment"
- thread_id: thread identifier (number)
- comment_id: comment identifier (number; 0 if thread start message)
- id: constructed message id (thread_id:comment_id)
- parent_comment_id: parent-comment identifier (0 if thread start
  message or if parent is the thread start message)
- quoted_comment_id: quoted-comment identifier (0 if no quotation)
- date: creation date (2019-01-11)
- time: creation time (16:55:26)
- datetime: combined creation date and time (2019-01-11 16:55:26)
- thread_start_datetime: creation date and time of the thread start
  message (2018-01-01 01:30:00)
- parent_datetime: creation date and time of the parent comment
  (2018-01-01 01:30:00, empty for thread start messages)
- author: user nickname
- author_logged_in: whether author was logged in (y, n)
- title: thread title from starting message
- topic_names: hierarchical topic (discussion area) name, top level
  first, levels separated by " > " ("Ajoneuvot ja liikenne >
  Autot > Automerkit > Honda")
- topic_names_set: topic level names as a set ("|Ajoneuvot ja
  liikenne|Automerkit|Autot|Honda|")
- topic_name_top: top-level topic name ("Ajoneuvot ja liikenne")
- topic_name_leaf: bottom-level topic name ("Honda")
- topic_adultonly: whether the topic is for adults only (y, n)
- thread_closed: whether the thread is closed (new comments cannot be
  written) (y, n)
- empty: whether the original message was completely empty (y, n)

The following text element attributes can be derived from other
attributes, are included mostly for backward-compatibility or are
otherwise less essential:
- datetime_approximated: whether the date and time were approximated
  based on the surrounding messages (always "n" (no) in this data)
- author_nick_registered: whether nickname was registered (always "?"
  in this data, as the information was not available)
- user_id: a user identifier of 32 hexadecimal digits for logged-in
  users, corresponding to the nickname in attribute author; 0 for
  others
- hierarchy_id: an id (number) in the comment messages of the original
  data whose purpose is unknown to us but kept just in case; empty for
  thread start messages
- datefrom, dateto: creation date (20190111)
- timefrom, timeto: creation time (165526)
- author_name_type: always "user_nickname"
- filename_vrt: the name of the VRT file containing the message during
  processing
- filename_orig: the name of the VRT file containing the message in
  the VRT version 1.0 of the corpus
- origfile_textnum: the number of the corresponding text element in
  the VRT file in the VRT version 1.0 (1-based)
- _sort_key: the key according to which the messages were sorted
  (byte-wise) within each thread

Paragraph attributes:
- id: running number of the paragraph within the subcorpus
- type: "title" or "body"

Sentence attributes:
- id: running number of the sentence within the subcorpus
- lang: ISO 639-3 code of the language of the sentence as identified
  by HeLI-OTS 1.1 (http://urn.fi/urn:nbn:fi:lb-2021062801); "xxx" for
  non-language data
- polarity: sentiment polarity of the sentence: "pos", "neut" or
  "neg"; added by a sentiment classifier trained on the FinnSentiment
  corpus (see https://arxiv.org/pdf/2012.02613.pdf)

The order of the attributes in the element start tags is arbitrary but
fixed.

The original data contained one completely empty message. To preserve
its information in the VRT data, a lone underscore was added as its
content, with the appropriate annotations. The attribute "empty" of
this text has the value "y".

The first line of each VRT file is a special comment that names the
positional attributes (tab-separated fields) in order:

<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel spaces initid lex/ -->

- word: surface form of the token
- ref: the number of the token in the sentence
- lemma: base form
- lemmacomp: base form with compound-boundary markers (vertical bars)
  separating compound parts
- pos: part of speech
- msd: morpho-syntactic description
- dephead: dependency head number (0 if no head)
- deprel: dependency relation
- spaces: spaces around (or within) the token in the original data
  (from tokenizer)
- initid: running number (from tokenizer; largely redundant with ref)
- lex/: lemgram, a combination of base form and a part-of-speech tag,
  surrounded by vertical bars

Since the parser produced some multi-rooted analyses anyway, the long
sentences that were parsed in shorter shreds were left multi-rooted
when the shreds were put back together.
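
Code that uses the dependency annotation should therefore not assume
exactly one root per sentence. A minimal sketch, assuming that the
tokens of a sentence are available as dictionaries keyed by the
positional attribute names; the function name root_refs is
hypothetical.

```python
def root_refs(tokens):
    """Return the ref values of all tokens without a head (dephead 0)."""
    return [tok["ref"] for tok in tokens if tok["dephead"] == "0"]
```

A sentence is multi-rooted if the returned list has more than one
element.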

The three characters < > & appear as &lt; &gt; &amp; everywhere
(because in bare form they are used for the markup), and the double
quotation mark " appears as &quot; in text attribute values. Attribute
values are always enclosed in double quotation marks.

Otherwise all content is encoded as UTF-8. Spurious control characters
were interpreted or removed, space characters were normalized and
non-characters were removed. However, Unicode normalization was not
done, ligatures and unassigned code points were not considered, and
private-use characters were preserved as such; we may learn to do
better.
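
As the data is not Unicode-normalized, strings that look the same may
differ in their code points. Users who need to compare strings can
normalize them on their side. This is a sketch using the unicodedata
module of the Python standard library; the function name
normalize_text is hypothetical, and whether NFC or NFKC is appropriate
depends on the use.

```python
import unicodedata


def normalize_text(value, form="NFC"):
    """Return value in the given Unicode normalization form.

    NFC composes base letters and combining marks; NFKC in addition
    replaces compatibility characters such as ligatures.
    """
    return unicodedata.normalize(form, value)
```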

No attempt was made to normalize the various characters used or abused
for quotation marks, apostrophes, or dashes.

Note that some values of the text attributes "author" and "title"
contain two (or more) consecutive spaces. Moreover, one value of the
attribute "topic_name_leaf" contains a leading space and one value a
double space between words: " Työpaikkailmoitukset" and "Ravinto  ja
ruokavaliot", respectively. This is also reflected in the attributes
"topic_names" and "topic_names_set". These may be normalized in a
future version of the data.
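
Until then, the extra spaces can be removed when reading the data.
This is a minimal sketch that only touches ordinary space characters;
the function name normalize_spaces is hypothetical.

```python
import re


def normalize_spaces(value):
    """Collapse runs of spaces and strip leading and trailing spaces."""
    return re.sub(r" {2,}", " ", value).strip(" ")
```

The same function can be applied to the value of "topic_names", but
the names within "topic_names_set" would need to be normalized one by
one.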

Over-long "words" were shortened and marked with "REDACTED" in the
data, partly for processing reasons.

The tokenizer has not split tokens at punctuation marks lacking the
space that normatively follows them. If the punctuation mark would
have indicated a sentence break, the break has been omitted. In
addition, if the word following the punctuation mark contains a
non-ASCII letter, such as ä or ö, the word has incorrectly been split
before the non-ASCII letter. This may be corrected in a future version
of the data.

Each VRT file contains a couple of informational XML-style comment
lines ("<!-- ... -->") at the beginning and end of the file.


Differences from Suomi24 2001–2017, VRT version 1.2

The data format of Suomi24 2018–2020, VRT version, is mostly the
same as that of Suomi24 2001–2017, VRT version 1.2
(http://urn.fi/urn:nbn:fi:lb-2020021801). The few differences are due
to differences in the original source data or in processing the data.

Additional text attributes (see above for their meaning):
- hierarchy_id
- thread_closed
- user_id

Additional sentence attribute:
- lang

Some attributes are missing as either the information they contained
was not available in the source data or they were not applicable to
this data:
- author_v1: not applicable
- author_nick_type: whether nickname was registered (information not
  in source data)
- author_signed_status: whether nickname was registered and the author
  logged in (nickname registration status not in source data)
- topic_nums: comma-separated topic numbers (not in source data)
- topic_nums_set: topic numbers as a set (not in source data)

Values for the positional attribute "lemma" for compound words may
differ from those in Suomi24 2001–2017, as they are intended to be
more natural, without lemmatizing all compound parts of the word.

Some Unicode characters may have been treated differently.