Allas is a new data storage service that uses the Object Storage architecture. Data is stored as objects in so-called ”buckets”. While buckets are somewhat similar to directories, their names have to be unique across the whole system. In other words, there can only be one bucket named ”test” in Allas.
In Allas, there are only buckets and objects. Directories are simulated by Swift tools, but paths are really part of the object name. Thus, ”suomi-24/src/comment-1.txt” and ”suomi-24/src/comment-2.txt” are two objects named in a way that Swift and other tools can group together, but there is no such thing as a ”suomi-24” directory.
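The grouping described above can be illustrated with a minimal Python sketch. It does not talk to Allas at all; it only shows how a tool could derive pseudo-directories from flat object names. The function name `group_by_prefix` and the object ”README.txt” are hypothetical and used for illustration only.

```python
from collections import defaultdict


def group_by_prefix(object_names, delimiter="/"):
    """Group flat object names by the part before the first delimiter."""
    groups = defaultdict(list)
    for name in object_names:
        prefix, sep, _rest = name.partition(delimiter)
        # Names without a delimiter belong to no pseudo-directory;
        # they are filed under the empty string.
        key = prefix if sep else ""
        groups[key].append(name)
    return dict(groups)


# The bucket content is just a flat list of names.
objects = [
    "suomi-24/src/comment-1.txt",
    "suomi-24/src/comment-2.txt",
    "README.txt",  # hypothetical object without any "/" in its name
]

pseudo_dirs = group_by_prefix(objects)
for prefix, names in pseudo_dirs.items():
    print(repr(prefix), names)
```

The ”suomi-24” key exists only in the result of the grouping; nothing with that name is stored in the bucket.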
Allas also does not, by default, register who uploads the data, and it does not support fine-grained access control. All buckets associated with a project (in our case, ”clarin”) can be deleted by anyone with write access. Buckets are, by default, private but can be made publicly available by any group member.
The removal of permanent work storage on Puhti forces us to rethink data management. We will need to better separate valuable primary data, tools and generated data.
To get started, type in CSC’s computing environment:
module load allas
allas-conf clarin
a-list
Remember the following conventions:
The Swift tools are intended for more advanced use of Allas. They allow you to upload and download directories and individual files.
CSC has created a set of tools to make Allas access a bit easier: a-tools. Tools such as ”a-put” and ”a-get” can convert entire directories into compressed tar archives and put/get them to/from Allas. The tools are basically wrappers around Swift and other tools such as Rclone.
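To make the idea of turning a whole directory into a single archive object more concrete, here is a minimal Python sketch. It is not how the a-tools are implemented; it only packs a small temporary directory into one compressed tar archive in memory and lists what ended up inside. The helper names and the file contents are hypothetical, and gzip is used only because it ships with Python's standard library; the compression used by ”a-put” may differ.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def pack_directory(directory):
    """Return the bytes of a gzip-compressed tar archive of `directory`."""
    directory = Path(directory)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # arcname stores paths relative to the directory's own name.
        archive.add(directory, arcname=directory.name)
    return buffer.getvalue()


def list_files(archive_bytes):
    """Return the sorted names of the regular files inside the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "comment-1.txt").write_text("first\n")
    (src / "comment-2.txt").write_text("second\n")

    payload = pack_directory(src)   # one blob that would become one object
    members = list_files(payload)
    print(members)
```

The point of the sketch is that many small files become one object, which is usually easier to move around than thousands of tiny objects.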
When should you use Allas, when IDA and when Puhti’s file system, like /scratch/clarin or /projappl/clarin?
Puhti’s file system has no permanent large data storage. The only permanent directory apart from your home directory is /projappl/clarin, which has only a limited quota and is intended to be used for shared software. /scratch/clarin/ is intended for data you are currently working on. Old data is removed regularly.
IDA contains data that cannot be easily recreated or reacquired from other sources, such as raw language data from depositors.
The use cases for Allas are still developing. Objects in Allas can be provided in a massively parallel manner, which should make massively parallel processing of data easier in the future. The most likely use cases:
The work directory. Data is removed regularly (see CSC’s documentation)
Space for tools that we want to share across the project but not (yet) provide to all users. The tools should be version-controlled.
Personal files and tools.
Especially when you use Swift, you should consider how to organize your buckets. Swift has easy commands for uploading and downloading entire buckets, so which structure is preferable depends on your use case.
Say you have two corpora, named ”Suomi24” and ”Ylilauta”. In both cases, there are the subdirectories ”src” and ”vrt”. You can now create one bucket named ”clarin-corpora” with objects named ”Suomi24/src/…” (and obviously ”Suomi24/vrt/…”) and ”Ylilauta/src/…” and so on, or two buckets ”clarin-corpora-Suomi24” and ”clarin-corpora-Ylilauta” with objects in them named ”src/…” and ”vrt/…”. (The ”corpora” part would be a convention to give a hint about the content.)
One consideration could be size: Suomi24-sized corpora should likely be in a separate bucket, but various small corpora can probably be grouped together in one bucket.
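The two layouts, and the idea of choosing between them by size, can be sketched as a small Python function. This is only an illustration of the naming scheme, not an upload tool. The function name `plan_location` and the file names are hypothetical, and treating Ylilauta as a small corpus is purely an assumption made for the example.

```python
def plan_location(corpus, relative_path, large_corpora, shared_bucket="clarin-corpora"):
    """Return (bucket, object_name) for one file of a corpus."""
    if corpus in large_corpora:
        # Large corpus: its own bucket, object names start at src/ or vrt/.
        return f"{shared_bucket}-{corpus}", relative_path
    # Small corpus: shared bucket, the corpus name becomes the first
    # component of the object name.
    return shared_bucket, f"{corpus}/{relative_path}"


# Which corpora count as large is decided by the caller.
large = {"Suomi24"}

files = [
    ("Suomi24", "src/comment-1.txt"),   # hypothetical file names
    ("Ylilauta", "vrt/thread-1.vrt"),
]

for corpus, path in files:
    bucket, name = plan_location(corpus, path, large)
    print(f"{bucket}: {name}")
```

Passing an empty set for `large_corpora` gives the single shared bucket layout, while listing every corpus gives one bucket per corpus.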