Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is the only large and "representative" corpus of American English, and it is probably the most widely used corpus of English. It is related to many other corpora of English that were formerly known as the "BYU Corpora", which together offer unparalleled insight into variation in English.

For general terms and conditions for this and other corpora from BYU, please see https://www.corpusdata.org/restrictions.asp

For more information about access rights see Mark Davies’ downloadable corpora at Kielipankki

Latest versions/subcorpora:
Corpus of Contemporary American English – Kielipankki VRT version 2020	Metadata and license	Attribution instructions	Download the resource
Corpus of Contemporary American English – Kielipankki Korp version 2020	Metadata and license	Attribution instructions	Select the corpus in Korp
Corpus of Contemporary American English – Kielipankki download version 2020	Metadata and license	Attribution instructions	Download the resource
Corpus of Contemporary American English – Kielipankki Korp version 2017H1	Metadata and license	Attribution instructions	Select the corpus in Korp
Corpus of Contemporary American English – Kielipankki download version 2017H1	Metadata and license	Attribution instructions	Download the resource

Different versions and subcorpora of this corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2017061921

The Helsinki Korp Version of the Parole Corpus

This electronic language resource covers several languages spoken in Europe and was compiled during the international project Le Parole.

Latest versions/subcorpora:
The Helsinki Korp Version of the Swedish Parole Corpus	Metadata and license	Attribution instructions	Select the corpus in Korp
The Finnish Parole Corpus	Metadata and license	Attribution instructions	available upon request via our IDA-service
Search for all versions in META-SHARE

Different versions and subcorpora of this corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042601

Helsinki Corpus of Swahili 2.0

Helsinki Corpus of Swahili 2.0 is available for research purposes in Kielipankki – the Language Bank of Finland. The corpus contains about 25 million words of written text and is available in two formats. The annotated version contains morphological and syntactic annotation as well as glosses in English. The unannotated version contains plain text. The text of the corpus was randomly shuffled within each document, and the resulting sentence order is the same in both versions.

More information on the corpus: https://www.kielipankki.fi/corpora/hcs2/

Latest versions/subcorpora:
Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version	Metadata and license	Attribution instructions	Select the corpus in Korp
Helsinki Corpus of Swahili 2.0 (HCS 2.0) Downloadable Annotated Version	Metadata and license	Attribution instructions	Download the resource	A copy of this version is available in the computing environment.
Helsinki Corpus of Swahili 2.0 (HCS 2.0) Not Annotated Version	Metadata and license	Attribution instructions	Download the resource	A copy of this version is available in the computing environment.
Search for these versions in META-SHARE

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2014032624

Wanca 2016

Wanca 2016 is a collection of web corpora in small Uralic languages. The collection is composed of 29 sentence corpora in different languages. The corpora were collected from the Internet using the automated system developed in the Finno-Ugric Languages and the Internet project (SUKI), supported by the Kone Foundation through its Language Programme 2012-2016. The sentences were extracted from the pages found while harvesting with Heritrix, and the language of each sentence was identified with MultiLi using HeLI as the identification method. Each sentence has a link to the original page it was found on, but some of the links may stop working. In that case we recommend searching for the page in the Internet Archive Wayback Machine: https://archive.org/web/.
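If a sentence's source link no longer resolves, the Wayback Machine lookup recommended above can be scripted. The following is a minimal sketch, not part of the Wanca tooling: the function name `wayback_url`, the default timestamp `"2016"` and the `example.org` address are illustrative assumptions. It relies on the Wayback Machine's public URL pattern `https://web.archive.org/web/<timestamp>/<original URL>`, where a partial timestamp is understood to resolve to the capture closest to that date, if one exists.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str = "2016") -> str:
    """Build a Wayback Machine lookup URL for a page that may have gone offline.

    `timestamp` is a prefix of YYYYMMDDhhmmss (4 to 14 digits). The default
    "2016" is an assumption chosen to match the Wanca 2016 collection year.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits (YYYY[MMDDhhmmss])")
    if not original_url:
        raise ValueError("original_url must not be empty")
    # The original URL is appended as-is after the timestamp segment.
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical source link; real links come from the corpus data itself.
    print(wayback_url("http://example.org/page"))
```

Whether an archived copy actually exists still has to be checked by opening the resulting address; the function only constructs the lookup URL.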

More information on Wanca: http://www.suki.ling.helsinki.fi/wanca

Latest versions/subcorpora:
Wanca 2016, Korp Version	Metadata and license	Attribution instructions	Select the corpus in Korp
Wanca 2016, source	Metadata and license	Attribution instructions	Download the resource
Wanca 2016, VRT	Metadata and license	Attribution instructions	Download the resource
Search for these versions in META-SHARE

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

The languages in Wanca 2016 are:

ISO 639-3	Name of language
fit	Tornedalen Finnish (meänkieli)
fkv	Kven (kvääni)
izh	Ingrian (ižoran keel)
kca	Khanty (ханты ясанг)
koi	Komi-Permyak (перем коми кыв)
kpv	Komi-Zyrian (Коми кыв)
krl	Karelian (karjal)
liv	Liv (līvõ kēļ)
lud	Ludian (lüüdin kiel’)
mdf	Moksha (мокшень)
mhr	Eastern and Meadow Mari (марий йылме)
mns	Mansi (мāньси лāтыӈ)
mrj	Western or Hill Mari (Кырык мары)
myv	Erzya (эрзянь)
nio	Nganasan (ня”)
olo	Livvi (Olonets / livvin karjal)
sjd	Kildin Sami (Кӣллт са̄мь кӣлл)
sjk	Kemi Sami (samääškiela)
sju	Ume Sami (uumajanlappi)
sma	Southern Sami (åarjel-saemien)
sme	Northern Sami (davvisámi, davvisámegiella)
smj	Lule Sami (julevsábme)
smn	Inari Sami (anarâškielâ)
sms	Skolt Sami (sää´mǩiõll)
udm	Udmurt (удмурт кыл)
vep	Veps (vepsän kel’)
vot	Votic (vad̕d̕a ceeli)
vro	Võro (võro kiil)
yrk	Nenets (ненэцяʼ вада)

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-202104141

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
