The Language Bank offers a wide variety of text and speech corpora. Many of them can be downloaded from our download service in their original source format and/or in VRT format (the text file format extracted from Korp). In addition, some of the downloadable resources are directly available in uncompressed form in CSC’s computing environment (see the instructions for locating resources in the Language Bank).
This page shows examples of the folder structures of downloadable resources. They are meant to give you an idea of what to expect after downloading a resource or when accessing datasets in the computing environment. The examples may also help you design the structure of your own datasets if you wish to make them available via the Language Bank of Finland.
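To compare a resource you have downloaded and extracted with the kinds of structures described on this page, you can list its folders and files as a text tree. The following is a minimal sketch in Python, not a tool provided by the Language Bank: the function name `render_tree` is hypothetical, and the sketch assumes the resource has already been unpacked into a local folder.

```python
from pathlib import Path


def render_tree(root: Path, prefix: str = "") -> list[str]:
    """Return the lines of a simple text tree for the folder `root`.

    Hypothetical helper for illustration only; it assumes `root` is an
    already extracted resource folder on the local file system.
    """
    lines = []
    # List subfolders first, then files, each group sorted by name.
    entries = sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name))
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        connector = "└── " if is_last else "├── "
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{prefix}{connector}{entry.name}{suffix}")
        if entry.is_dir():
            # Indent the children; keep a vertical bar while siblings remain.
            child_prefix = prefix + ("    " if is_last else "│   ")
            lines.extend(render_tree(entry, child_prefix))
    return lines
```

For example, `print("\n".join(render_tree(Path("my-resource"))))` prints the tree of a folder named `my-resource`, where the folder name is a placeholder for wherever you extracted the resource.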
A simple structure with only one data file and a README.txt:
A complex structure with files in various formats, ordered by date:
Two variants of the same resource: the original, unannotated text documents, and the annotated version in VRT format, in which the individual text documents from four categories are included and described in a smaller number of files:
Last updated: 2023-05-19