Tool category in Mylly: Relation algebra
Relation algebra
This is a technical document about an implementation of relation algebra in Mylly. For practical examples, see (pages to be written).
This is work in progress. Added first examples 2017-10-07.
Introduction
A relation is a set of records that assign definite values to a set of attributes. The attributes are names that constitute the type of the relation.
In a fuller theory, the attributes also have their own types, but in Mylly, there is only an intended type for each attribute; formally all values are character strings. This limitation makes it possible to represent a relation as a simple text file in TSV format: a header line of attribute names, followed by one line for each record with the corresponding values, adjacent fields separated by tabs. (Link to the IANA RFC here.)
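As a rough illustration of this representation, the following Python sketch reads TSV text into a set of records. The function name `read_relation` and the sample data are assumptions made for this example; this is not how Mylly itself is implemented.

```python
def read_relation(text):
    """Parse TSV text into (attribute names, set of records).

    Each record is a frozenset of (attribute, value) pairs, so neither the
    order of the lines nor the order of the columns survives the parse.
    All values stay character strings.
    """
    lines = text.splitlines()
    head = lines[0].split("\t")
    if len(set(head)) != len(head):
        raise ValueError("attribute names must be distinct")
    records = set()
    for line in lines[1:]:
        values = line.split("\t")
        if len(values) != len(head):
            raise ValueError("wrong number of fields: " + repr(line))
        records.add(frozenset(zip(head, values)))
    return frozenset(head), records


# Hypothetical sample: a header line, then one line per record.
SAMPLE = "sen\tmood\n3\tcalm\n4\tcalm\n5\tangry\n"

names, records = read_relation(SAMPLE)
```

Keeping every value as a string mirrors the text: the intended type of `sen` may be a number, but formally it is just text.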
Relations can contain an annotation structure attached to tokenized text (with unique identifiers within the relation), or metadata attached to sentence identifiers, or data frames that can be the input or output of a statistical analysis.
sen | fi | en | mood |
---|---|---|---|
3 | On se niin. | It is so. | calm |
4 | Ei ole. | Is not. | calm |
5 | On! | Is! | angry |
word | tok | sen | lemma |
---|---|---|---|
on | 1 | 3 | olla |
se | 2 | 3 | se |
niin | 3 | 3 | niin |
on | 1 | 5 | olla |
ei | 1 | 4 | ei |
ole | 2 | 4 | olla |
A relation algebra consists of operations that make relations out of relations. There are also actions that produce relations but take another kind of input on the side, and tools that take relations but produce something else.
Such tools can take apart and put together related relations in useful ways, opening up ways for further analysis. Some tools are generally applicable. In addition, Mylly provides more special tools to facilitate the manipulation and investigation of language data.
Unique and unordered attributes and records
Duplicate records are not allowed in a relation. The order of the lines in the TSV file that represents a relation does not matter: the record lines can be reordered arbitrarily, and the file will still represent the same relation. A relation is literally a set of records.
When multiplicity or order is important, the relation can contain attributes that make it explicit.
The attributes are similarly distinct from each other, and their order in the file that represents a relation is similarly accidental. The attributes, too, are literally a set.
pos | lemma |
---|---|
V | olla |
V | ei |
V | voida |
lemma | pos |
---|---|
ei | V |
olla | V |
voida | V |
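The two tables above represent the same relation. A minimal Python sketch of this idea, assuming that a record is modelled as a frozenset of (attribute, value) pairs and a relation as a set of such records (the helper `relation` is hypothetical):

```python
def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


first = relation(("pos", "lemma"),
                 ("V", "olla"), ("V", "ei"), ("V", "voida"))

# Same content with the columns and the lines in another order.
second = relation(("lemma", "pos"),
                  ("ei", "V"), ("olla", "V"), ("voida", "V"))

# A repeated line does not add a record to the set.
repeated = relation(("pos", "lemma"),
                    ("V", "olla"), ("V", "olla"), ("V", "ei"), ("V", "voida"))

print(first == second)    # True
print(first == repeated)  # True
print(len(repeated))      # 3
```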
The join
Two relations share the attributes that have the same name (and type, in a fuller theory).
The join of two relations consists of records that have all the attributes of either relation. Each record of the join combines a record from one relation with a record from the other, provided that the two have the same values in the shared attributes; the value of each non-shared attribute is taken from the record that has that attribute.
word | pos | tok | sen | lemma |
---|---|---|---|---|
ei | V | 1 | 4 | ei |
on | V | 1 | 3 | olla |
on | V | 1 | 5 | olla |
ole | V | 2 | 4 | olla |
(When there are no shared attributes, the join is the product of the relations. When all attributes are shared, the join is the intersection of the relations. It makes no sense to join relations where the formally shared attributes have different intended types.)
(An example is to join data and metadata that share a sentence identifier.)
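The join described in this section can be sketched in Python as follows. The representation (records as frozensets of attribute-value pairs) and the helper names `relation`, `agree` and `join` are assumptions made for the sketch, not Mylly's own code.

```python
def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def agree(a, b):
    """True when records a and b have equal values in every shared attribute."""
    left = dict(a)
    return all(left.get(name, value) == value for name, value in b)


def join(r, s):
    """Combine every pair of records that agree on the shared attributes."""
    return {a | b for a in r for b in s if agree(a, b)}


tokens = relation(("word", "tok", "sen", "lemma"),
                  ("on", "1", "3", "olla"), ("se", "2", "3", "se"),
                  ("niin", "3", "3", "niin"), ("on", "1", "5", "olla"),
                  ("ei", "1", "4", "ei"), ("ole", "2", "4", "olla"))

vocabulary = relation(("pos", "lemma"),
                      ("V", "olla"), ("V", "ei"), ("V", "voida"))

# Only lemma is shared, so only the records for olla and ei survive.
joined = join(tokens, vocabulary)
```

With no shared attributes every pair of records agrees, which gives the product; when all attributes are shared, two records agree only if they are equal, which gives the intersection.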
Renaming of attributes
One or more attributes of a relation can be renamed for any reason or none, and particularly to control whether or not they participate in joins as shared attributes.
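A small Python sketch of how renaming changes which attributes two relations share. The helpers `relation`, `attributes` and `rename`, and the new name `base`, are hypothetical and chosen only for this illustration.

```python
def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def attributes(r):
    """The set of attribute names that occur in the records of r."""
    return {name for record in r for name, _ in record}


def rename(r, mapping):
    """Rename attributes according to mapping (old name -> new name)."""
    renamed = set()
    for record in r:
        new = {mapping.get(name, name): value for name, value in record}
        if len(new) != len(record):
            raise ValueError("renaming would merge two attributes")
        renamed.add(frozenset(new.items()))
    return renamed


# Two of the token records are enough to show the attribute names.
tokens = relation(("word", "tok", "sen", "lemma"),
                  ("on", "1", "3", "olla"), ("ei", "1", "4", "ei"))
vocabulary = relation(("pos", "lemma"),
                      ("V", "olla"), ("V", "ei"), ("V", "voida"))

shared_before = attributes(tokens) & attributes(vocabulary)
shared_after = attributes(tokens) & attributes(rename(vocabulary, {"lemma": "base"}))

print(shared_before)  # {'lemma'}
print(shared_after)   # set()
```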
The projections
(Project to or out of selected attributes. Mylly provides versions that count the selected combinations.)
lemma |
---|
ei |
niin |
olla |
se |
lemma | count |
---|---|
olla | 3 |
ei | 1 |
niin | 1 |
se | 1 |
The classification of sentences into moods without the actual sentences can be obtained either by keeping the mood and the sentence number, or by dropping the Finnish and English sentences.
mood | sen |
---|---|
calm | 4 |
calm | 3 |
angry | 5 |
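The projections shown above can be sketched in Python along these lines. The function names `project`, `drop` and `project_count` are assumptions for the example and do not correspond to the names of Mylly tools.

```python
from collections import Counter


def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def project(r, keep):
    """Project to the attributes in keep; duplicates collapse in the set."""
    return {frozenset(p for p in record if p[0] in keep) for record in r}


def drop(r, names):
    """Project out of the attributes in names."""
    return {frozenset(p for p in record if p[0] not in names) for record in r}


def project_count(r, keep, count_name="count"):
    """Project to keep, recording how many records gave each combination."""
    counts = Counter(frozenset(p for p in record if p[0] in keep)
                     for record in r)
    return {part | {(count_name, str(n))} for part, n in counts.items()}


tokens = relation(("word", "tok", "sen", "lemma"),
                  ("on", "1", "3", "olla"), ("se", "2", "3", "se"),
                  ("niin", "3", "3", "niin"), ("on", "1", "5", "olla"),
                  ("ei", "1", "4", "ei"), ("ole", "2", "4", "olla"))

sentences = relation(("sen", "fi", "en", "mood"),
                     ("3", "On se niin.", "It is so.", "calm"),
                     ("4", "Ei ole.", "Is not.", "calm"),
                     ("5", "On!", "Is!", "angry"))

lemmas = project(tokens, {"lemma"})
lemma_counts = project_count(tokens, {"lemma"})

# Keeping mood and sen gives the same relation as dropping fi and en.
moods = project(sentences, {"mood", "sen"})
```

The count is stored as a string because, formally, all values in a relation are character strings.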
Variations on the join
(Matching records, and those that do not match. Compose and image, probably, because it should be easier to provide them than to explain that an immediate use case was not at hand. The latter two are not implemented at the time of this writing, which is 2017-10-06.)
The composition of the annotated sentences with the mood classification is like the corresponding join except for the omission of the shared attribute.
tok | mood | word | lemma |
---|---|---|---|
1 | calm | on | olla |
2 | calm | se | se |
3 | calm | niin | niin |
1 | angry | on | olla |
1 | calm | ei | ei |
2 | calm | ole | olla |
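The composition above, together with the matching and non-matching records mentioned at the start of this section, might be sketched in Python as follows. All function names are hypothetical, and the sketch makes no claim about how the Mylly tools work internally.

```python
def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def attributes(r):
    """The set of attribute names that occur in the records of r."""
    return {name for record in r for name, _ in record}


def agree(a, b):
    """True when records a and b have equal values in every shared attribute."""
    left = dict(a)
    return all(left.get(name, value) == value for name, value in b)


def join(r, s):
    return {a | b for a in r for b in s if agree(a, b)}


def compose(r, s):
    """Like the join, but with the shared attributes omitted."""
    shared = attributes(r) & attributes(s)
    return {frozenset(p for p in record if p[0] not in shared)
            for record in join(r, s)}


def matching(r, s):
    """The records of r that agree with at least one record of s."""
    return {a for a in r if any(agree(a, b) for b in s)}


tokens = relation(("word", "tok", "sen", "lemma"),
                  ("on", "1", "3", "olla"), ("se", "2", "3", "se"),
                  ("niin", "3", "3", "niin"), ("on", "1", "5", "olla"),
                  ("ei", "1", "4", "ei"), ("ole", "2", "4", "olla"))
moods = relation(("sen", "mood"), ("3", "calm"), ("4", "calm"), ("5", "angry"))
vocabulary = relation(("pos", "lemma"),
                      ("V", "olla"), ("V", "ei"), ("V", "voida"))

composed = compose(tokens, moods)
in_corpus = matching(vocabulary, tokens)   # the records for olla and ei
not_in_corpus = vocabulary - in_corpus     # the record for voida
```

Because the result is a set, records that differ only in the omitted attribute would collapse into one in a composition; in this example all six remain distinct.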
Parts and partitions
(Sum, selection of a part by a combination of attribute values, partition on an attribute either to a specific combination and complement or all parts – within reason, because who would want to deal with more files than they can reasonably deal with, in a graphical user interface)
tok | mood | word | lemma |
---|---|---|---|
1 | calm | on | olla |
1 | angry | on | olla |
2 | calm | ole | olla |
tok | mood | word | lemma |
---|---|---|---|
2 | calm | se | se |
3 | calm | niin | niin |
1 | calm | ei | ei |
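The two tables above split a relation into the part where the lemma is olla and its complement. A Python sketch of such a selection, and of a partition into all parts on one attribute, assuming hypothetical helpers `select` and `partition`:

```python
def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def select(r, **wanted):
    """The part of r whose records have the given attribute values."""
    pairs = set(wanted.items())
    return {record for record in r if pairs <= record}


def partition(r, name):
    """Split r into parts, one for each value of the named attribute."""
    parts = {}
    for record in r:
        parts.setdefault(dict(record)[name], set()).add(record)
    return parts


composed = relation(("tok", "mood", "word", "lemma"),
                    ("1", "calm", "on", "olla"), ("2", "calm", "se", "se"),
                    ("3", "calm", "niin", "niin"), ("1", "angry", "on", "olla"),
                    ("1", "calm", "ei", "ei"), ("2", "calm", "ole", "olla"))

olla_part = select(composed, lemma="olla")
complement = composed - olla_part

# One part for each mood: a dict from the value to a relation.
by_mood = partition(composed, "mood")
```

The sum of the parts, here `olla_part | complement`, gives back the original relation.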
(Random part. Not to be misunderestimated. (Also a tool to sample observations, more difficult to classify but also nice to have.))
(Union, intersection and difference. No complement.)
Functional extension
A number of tools extend each record with a value that can be computed based on the values in that record. Such operations may involve another source of information, which can be another relation but can also be anything whatsoever.
(Examples are the key-value expander and the new, as of 2017-10-06, frequency counter.)
For example, the vocabulary relation can be extended with the frequency of each lemma as counted in the corpus relation.
lemma | pos | freq |
---|---|---|
ei | V | 1 |
olla | V | 3 |
voida | V | 0 |
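A frequency extension of this kind could look roughly like the following Python sketch. The function `extend_with_freq` is a hypothetical stand-in and not the actual frequency counter tool.

```python
from collections import Counter


def relation(names, *rows):
    """Build a relation from attribute names and rows of values."""
    return {frozenset(zip(names, row)) for row in rows}


def extend_with_freq(r, corpus, name, freq_name="freq"):
    """Extend each record of r with the number of corpus records that
    have the same value in the named attribute."""
    counts = Counter(dict(record)[name] for record in corpus)
    return {record | {(freq_name, str(counts[dict(record)[name]]))}
            for record in r}


vocabulary = relation(("lemma", "pos"),
                      ("ei", "V"), ("olla", "V"), ("voida", "V"))

corpus = relation(("word", "tok", "sen", "lemma"),
                  ("on", "1", "3", "olla"), ("se", "2", "3", "se"),
                  ("niin", "3", "3", "niin"), ("on", "1", "5", "olla"),
                  ("ei", "1", "4", "ei"), ("ole", "2", "4", "olla"))

extended = extend_with_freq(vocabulary, corpus, "lemma")
```

A lemma that does not occur in the corpus, such as voida, gets the frequency 0 rather than being left out, because every record of the original relation is extended.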