The Indo-European Cognate Relationships dataset is now published

Mika Norling

Data, Publications, Research

08/09/2025


Harald Hammarström is one of the co-authors.

Abstract

The Indo-European Cognate Relationships (IE-CoR) dataset is an open-access relational dataset showing how related, inherited words (‘cognates’) pattern across 160 languages of the Indo-European family. IE-CoR is intended as a benchmark dataset for computational research into the evolution of the Indo-European languages. It is structured around 170 reference meanings in core lexicon, and contains 25731 lexeme entries, analysed into 4981 cognate sets. Novel, dedicated structures are used to code all known cases of horizontal transfer. All 13 main documented clades of Indo-European, and their main subclades, are well represented. Time calibration data for each language are also included, as are relevant geographical and social metadata. Data collection was performed by an expert consortium of 89 linguists drawing on 355 cited sources. The dataset is extendable to further languages and meanings and follows the Cross-Linguistic Data Format (CLDF) protocols for linguistic data. It is designed to be interoperable with other cross-linguistic datasets and catalogues, and provides a reference framework for similar initiatives for other language families.

Fig. 1. Language sample in IE-CoR 1.2. Colours represent main clades.

Background: the Indo-European languages and phylogenetic research

Almost half of the world’s population speaks a language of the Indo-European lineage. This huge family of over 400 languages has a long research tradition stretching back well over two hundred years, but much remains to be understood about its origins, dispersal, and internal structure. In particular, major phylogenetic analyses in recent years have supported conflicting hypotheses for the time depth and geographical origin of Indo-European. Recent analyses have mostly used state-of-the-art Bayesian phylogenetic analysis tools, applied to datasets of cognates (related words) across the Indo-European languages, i.e. forerunners of the new IE-CoR dataset presented here. Those past datasets have been criticised, however, for their limited and uneven coverage of the Indo-European family through time and space, and across its internal diversity, as well as for poor data coding. These data problems are directly implicated in the inconsistent phylogenetic results obtained.

The new Indo-European Cognate Relationships (IE-CoR) dataset is designed to overcome the limitations of past datasets. It encodes cognate relationships in 170 meanings of core vocabulary (i.e. basic terms like hand, drink, black, three) across 160 Indo-European languages. (For explanations of linguistic terminology used in this text, such as ‘cognate’, see the Definitions box.) IE-CoR aims to provide a benchmark dataset for quantitative and phylogenetic research on the Indo-European (IE) language family.
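To make the idea of cognate coding concrete, here is a minimal sketch of how lexeme entries and cognate-set assignments can be turned into the binary presence/absence matrix that phylogenetic tools commonly take as input. The column names follow the general CLDF convention of a form table and a cognate table; the row IDs, the set labels `hand-1` and `hand-2`, and the helper functions are hypothetical, and the four rows are a toy illustration rather than data drawn from IE-CoR.

```python
from collections import defaultdict

# Toy rows shaped like a CLDF form table (hypothetical IDs, not IE-CoR data).
FORMS = [
    {"ID": "f1", "Language_ID": "English", "Parameter_ID": "hand", "Form": "hand"},
    {"ID": "f2", "Language_ID": "German", "Parameter_ID": "hand", "Form": "Hand"},
    {"ID": "f3", "Language_ID": "French", "Parameter_ID": "hand", "Form": "main"},
    {"ID": "f4", "Language_ID": "Spanish", "Parameter_ID": "hand", "Form": "mano"},
]

# Toy rows shaped like a CLDF cognate table: each form is assigned to one set.
# The set labels are made up for this example.
COGNATES = [
    {"Form_ID": "f1", "Cognateset_ID": "hand-1"},
    {"Form_ID": "f2", "Cognateset_ID": "hand-1"},
    {"Form_ID": "f3", "Cognateset_ID": "hand-2"},
    {"Form_ID": "f4", "Cognateset_ID": "hand-2"},
]


def cognate_sets(forms, cognates):
    """Map each cognate set ID to the sorted list of languages attesting it."""
    language_of = {row["ID"]: row["Language_ID"] for row in forms}
    members = defaultdict(set)
    for row in cognates:
        members[row["Cognateset_ID"]].add(language_of[row["Form_ID"]])
    return {set_id: sorted(langs) for set_id, langs in members.items()}


def presence_matrix(forms, cognates):
    """Return (languages, set_ids, matrix); a cell is 1 if the language has a form in the set."""
    sets = cognate_sets(forms, cognates)
    languages = sorted({row["Language_ID"] for row in forms})
    set_ids = sorted(sets)
    matrix = [
        [1 if lang in sets[set_id] else 0 for set_id in set_ids]
        for lang in languages
    ]
    return languages, set_ids, matrix


if __name__ == "__main__":
    languages, set_ids, matrix = presence_matrix(FORMS, COGNATES)
    print("language", *set_ids)
    for lang, row in zip(languages, matrix):
        print(lang, *row)
```

In this toy case the English and German words fall into one cognate set and the French and Spanish words into another, so each language gets a 1 in exactly one column. The real dataset is relational and also records loans and other metadata, which this sketch does not attempt to model.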

Anderson, C., Scarborough, M., Jocz, L. et al. The Indo-European Cognate Relationships dataset. Sci Data 12, 1541 (2025). https://doi.org/10.1038/s41597-025-05445-3


Cognate, dataset, human past, Indo-European languages, interdisciplinary research, linguistics, Nature, phylogenetic


Center for the Human Past
