HeLI-OTS 1.0 is a general-purpose language identifier: an automatic tool that identifies the language of each line of text in an input file, selecting the best match among 200 languages.
The publication of HeLI-OTS 1.0 is one of the results of the co-operation project Language Identification of Speech and Text between the University of Helsinki and Lingsoft Oy, supported by “Tandem Industry Academia 2020” funding from the Finnish Research Impact Foundation. The tool is based on the HeLI method, developed by Tommi Jauhiainen and Heidi Jauhiainen as a continuation of Tommi’s research for his Master's and PhD projects at the Department of Digital Humanities at the University of Helsinki.
The language identifier is available under the Apache 2 and CC-BY licenses. The tool is simple to use: it reads the text file given as a parameter, identifies the language of each line, and writes the ISO 639-3 code of that language to the corresponding line of the output file. The source code for the entire language identifier can be downloaded from Zenodo, but if you only want to use the tool, the file HeLI.jar (42 MB) is all you need. While running, the language identifier uses about 3 gigabytes of memory and a single CPU core, and it identifies the language of about 3,000 sentences per second on a modern laptop.
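Because the output file holds one ISO 639-3 code per input line, the two files can be read side by side once the tool has finished. The following Python sketch pairs each input line with its code and tallies the codes. It is not part of HeLI-OTS: the function names pair_lines_with_codes and count_languages are hypothetical, and UTF-8 encoding for both files is an assumption.

```python
from collections import Counter


def _read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines without line terminators."""
    # newline=None enables universal newlines, so \n, \r\n and \r all count
    # as line breaks and are translated to \n before we strip them.
    with open(path, encoding=encoding, newline=None) as handle:
        return [line.rstrip("\n") for line in handle]


def pair_lines_with_codes(infile, outfile, encoding="utf-8"):
    """Return (code, text) pairs, one per line of the input file.

    Assumes the output file contains exactly one language code per input
    line. UTF-8 is an assumption of this sketch, not a documented fact.
    """
    texts = _read_lines(infile, encoding)
    codes = _read_lines(outfile, encoding)
    if len(texts) != len(codes):
        raise ValueError(
            f"line count mismatch: {len(texts)} input lines, {len(codes)} codes"
        )
    return [(code.strip(), text) for code, text in zip(codes, texts)]


def count_languages(pairs):
    """Count how many lines were assigned to each language code."""
    return Counter(code for code, _ in pairs)
```

The line-count check is there so that a truncated or mismatched output file is reported instead of silently misaligning codes and text.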
You can run the tool with a command of the following form:
java -jar HeLI.jar <infile> <outfile>
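For use from a script, the same command can be wrapped in a few lines of Python. This is a minimal sketch, not part of HeLI-OTS: the helper names build_command and identify_file are hypothetical. The optional -Xmx heap setting is a standard JVM option, added here on the assumption that the default heap on some machines may be smaller than the roughly 3 gigabytes the tool uses.

```python
import subprocess


def build_command(infile, outfile, jar="HeLI.jar", max_heap=None):
    """Build the argument list for: java -jar HeLI.jar <infile> <outfile>.

    max_heap (for example "4g") is optional and adds the standard JVM
    flag -Xmx. Whether a larger heap is needed is an assumption.
    """
    cmd = ["java"]
    if max_heap:
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", str(jar), str(infile), str(outfile)]
    return cmd


def identify_file(infile, outfile, **kwargs):
    """Run the identifier; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(infile, outfile, **kwargs), check=True)
```

With Java installed and HeLI.jar in the working directory, calling identify_file("input.txt", "output.txt", max_heap="4g") would run java -Xmx4g -jar HeLI.jar input.txt output.txt.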