Concept: Language family
Zipf’s law on word frequency and Heaps' law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and have very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf’s exponent diverges. Indeed, the character frequency decays exponentially in the Zipf plot. (ii) The number of distinct characters grows with the text length in three stages: It grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. A theoretical model for the writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding of Zipf’s and Heaps' laws in human language systems.
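The two ingredients named in this abstract, a rich-get-richer mechanism and a limited dictionary size, can be illustrated with a small simulation. The sketch below is an assumption made for illustration and is not necessarily the model proposed in the paper: each new character is chosen with probability proportional to its current count plus a constant `epsilon`, from a dictionary of fixed size `dict_size`. The names `simulate_text`, `dict_size` and `epsilon` are hypothetical.

```python
import random


def simulate_text(length, dict_size, epsilon=1.0, seed=0):
    """Generate a character sequence with rich-get-richer selection.

    Assumed rule (illustrative only): character i is chosen with
    probability proportional to count_i + epsilon, where count_i is the
    number of times it has appeared so far. The dictionary is finite.
    Returns the sequence and the number of distinct characters seen
    after each step (a Heaps-style growth curve).
    """
    rng = random.Random(seed)
    text = []
    seen = set()
    growth = []
    for t in range(length):
        # Total weight is t + epsilon * dict_size. Copying a uniformly
        # chosen earlier token selects character i with weight count_i;
        # a uniform draw from the dictionary supplies the epsilon part.
        p_copy = t / (t + epsilon * dict_size)
        if rng.random() < p_copy:
            ch = text[rng.randrange(t)]
        else:
            ch = rng.randrange(dict_size)
        text.append(ch)
        seen.add(ch)
        growth.append(len(seen))
    return text, growth


if __name__ == "__main__":
    text, growth = simulate_text(length=20000, dict_size=500)
    # Print the distinct-character count at a few text lengths.
    for n in (10, 100, 1000, 10000, 20000):
        print(n, growth[n - 1])
```

Because the dictionary is finite, the number of distinct characters in this sketch can never exceed `dict_size`, so the growth curve must eventually level off. Whether the intermediate regime looks logarithmic depends on the parameters chosen.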
India is a patchwork of tribal and non-tribal populations that speak many different languages from various language families. Indo-European, spoken across northern and central India, and also in Pakistan and Bangladesh, has been frequently connected to the so-called “Indo-Aryan invasions” from Central Asia ~3.5 ka and the establishment of the caste system, but the extent of immigration at this time remains extremely controversial. South India, on the other hand, is dominated by Dravidian languages. India displays a high level of endogamy due to its strict social boundaries, and high genetic drift as a result of long-term isolation which, together with a very complex history, makes the genetic study of Indian populations challenging.
There would be little adaptive value in a complex communication system like human language if there were no ways to detect and correct problems. A systematic comparison of conversation in a broad sample of the world’s languages reveals a universal system for the real-time resolution of frequent breakdowns in communication. In a sample of 12 languages of 8 language families of varied typological profiles we find a system of ‘other-initiated repair’, where the recipient of an unclear message can signal trouble and the sender can repair the original message. We find that this system is frequently used (on average about once per 1.4 minutes in any language), and that it has detailed common properties, contrary to assumptions of radical cultural variation. Unrelated languages share the same three functionally distinct types of repair initiator for signalling problems and use them in the same kinds of contexts. People prefer to choose the type that is the most specific possible, a principle that minimizes cost both for the sender being asked to fix the problem and for the dyad as a social unit. Disruption to the conversation is kept to a minimum, with the two-utterance repair sequence being on average no longer than the single utterance which is being fixed. The findings, controlled for historical relationships, situation types and other dependencies, reveal the fundamentally cooperative nature of human communication and offer support for the pragmatic universals hypothesis: while languages may vary in the organization of grammar and meaning, key systems of language use may be largely similar across cultural groups. They also provide a fresh perspective on controversies about the core properties of language, by revealing a common infrastructure for social interaction which may be the universal bedrock upon which linguistic diversity rests.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 2 years ago
Understanding how and why language subsystems differ in their evolutionary dynamics is a fundamental question for historical and comparative linguistics. One key dynamic is the rate of language change. While it is commonly thought that the rapid rate of change hampers the reconstruction of deep language relationships beyond 6,000-10,000 y, there are suggestions that grammatical structures might retain more signal over time than other subsystems, such as basic vocabulary. In this study, we use a Dirichlet process mixture model to infer the rates of change in lexical and grammatical data from 81 Austronesian languages. We show that, on average, most grammatical features actually change faster than items of basic vocabulary. The grammatical data show less schismogenesis, higher rates of homoplasy, and more bursts of contact-induced change than the basic vocabulary data. However, there is a core of grammatical and lexical features that are highly stable. These findings suggest that different subsystems of language have differing dynamics and that careful, nuanced models of language change will be needed to extract deeper signal from the noise of parallel evolution, areal readaptation, and contact.
Genetic and archaeological studies have established a sub-Saharan African origin for anatomically modern humans with subsequent migrations out of Africa. Using the largest multi-locus data set known to date, we investigated genetic differentiation of early modern humans, human admixture and migration events, and relationships among ancestries and language groups. We compiled publicly available genome-wide genotype data on 5,966 individuals from 282 global samples, representing 30 primary language families. The best evidence supports 21 ancestries that delineate genetic structure of present-day human populations. Independent of self-identified ethno-linguistic labels, the vast majority (97.3%) of individuals have mixed ancestry, with evidence of multiple ancestries in 96.8% of samples and on all continents. The data indicate that continents, ethno-linguistic groups, races, ethnicities, and individuals all show substantial ancestral heterogeneity. We estimated correlation coefficients ranging from 0.522 to 0.962 between ancestries and language families or branches. Ancestry data support the grouping of Kwadi-Khoe, Kx'a, and Tuu languages, support the exclusion of Omotic languages from the Afroasiatic language family, and do not support the proposed Dené-Yeniseian language family as a genetically valid grouping. Ancestry data yield insight into a deeper past than linguistic data can, while linguistic data provide clarity to ancestry data.
There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes about 6000 years ago. An alternative hypothesis claims that the languages spread from Anatolia with the expansion of farming 8000 to 9500 years ago. We used Bayesian phylogeographic approaches, together with basic vocabulary data from 103 ancient and contemporary Indo-European languages, to explicitly model the expansion of the family and test these hypotheses. We found decisive support for an Anatolian origin over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning 8000 to 9500 years ago. These results highlight the critical role that phylogeographic inference can play in resolving debates about human prehistory.
The claim that Eskimo languages have words for different types of snow is well-known among the public, but has been greatly exaggerated through popularization and is therefore viewed with skepticism by many scholars of language. Despite the prominence of this claim, to our knowledge the line of reasoning behind it has not been tested broadly across languages. Here, we note that this reasoning is a special case of the more general view that language is shaped by the need for efficient communication, and we empirically test a variant of it against multiple sources of data, including library reference works, Twitter, and large digital collections of linguistic and meteorological data. Consistent with the hypothesis of efficient communication, we find that languages that use the same linguistic form for snow and ice tend to be spoken in warmer climates, and that this association appears to be mediated by lower communicative need to talk about snow and ice. Our results confirm that variation in semantic categories across languages may be traceable in part to local communicative needs. They suggest moreover that despite its awkward history, the topic of “words for snow” may play a useful role as an accessible instance of the principle that language supports efficient communication.
The negative bias accompanying the terms left and left-handers has long interested researchers. This paper examines a large number of languages of Indo-European and non-Indo-European origin for such biasing. One surprising outcome is that, within the Indo-European language family, the terms for right and left do not go back to one set of antonyms but have their etymological roots in a number of different core semantic concepts. As in the non-Indo-European languages, right is almost always thought of positively, whereas left carries negative connotations. This is interpreted as the outcome of a universal human evaluation process, partly based on the principle of embodiment. The terms for right never have, in any of the examined languages, a negative bias; the words for left, usually not positively biased, were turned into euphemisms in three language groups (Scandinavian, Greek, and Avestan). On one interpretation, this seems to be an act of historical political correctness, corroborating the negative attitude cultures have for left-handers, very likely an outcome of discrimination against minorities.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 5 years ago
We summarize a number of findings in laryngology demonstrating that perturbations of phonation, including increased jitter and shimmer, are associated with desiccated ambient air. We predict that, given the relative imprecision of vocal fold vibration in desiccated versus humid contexts, arid and cold ecologies should be less amenable, when contrasted to warm and humid ecologies, to the development of languages with phonemic tone, especially complex tone. This prediction is supported by data from two large independently coded databases representing 3,700+ languages. Languages with complex tonality have generally not developed in very cold or otherwise desiccated climates, in accordance with the physiologically based predictions. The predicted global geographic-linguistic association is shown to operate within continents, within major language families, and across language isolates. Our results offer evidence that human sound systems are influenced by environmental factors.
Language is the best example of a cultural evolutionary system, able to retain a phylogenetic signal over many thousands of years. The temporal stability (conservatism) of basic vocabulary is relatively well understood, but the stability of the structural properties of language (phonology, morphology, syntax) is still unclear. Here we report an extensive Bayesian phylogenetic investigation of the structural stability of numerous features across many language families and we introduce a novel method for analyzing the relationships between the “stability profiles” of language families. We found that there is a strong universal component across language families, suggesting the existence of universal linguistic, cognitive and genetic constraints. Against this background, however, each language family has a distinct stability profile, and these profiles cluster by geographic area and likely deep genealogical relationships. These stability profiles seem to show, for example, the ancient historical relationships between the Siberian and American language families, presumed to be separated by at least 12,000 years, and possible connections between the Eurasian families. We also found preliminary support for the punctuated evolution of structural features of language across families, types of features and geographic areas. Thus, such higher-level properties of language seen as an evolutionary system might allow the investigation of ancient connections between languages and shed light on the peopling of the world.