Concept: August Schleicher
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
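The two static regularities named in this abstract can be illustrated on a toy token list. The sketch below is not the authors' pipeline; the function names (`rank_frequency`, `vocabulary_growth`, `loglog_slope`) and the sample sentence are assumptions made for illustration. Under the classic Zipf law the log-log slope of frequency against rank is close to -1, and under the Heaps law the log-log slope of vocabulary size against corpus size is below 1, which is the "decreasing marginal need for new words".

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """(corpus_size, vocabulary_size) pairs after each token is read."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Toy input (hypothetical, far too small for a meaningful fit).
tokens = "the cat sat on the mat the dog sat".split()

freqs = rank_frequency(tokens)
ranks = range(1, len(freqs) + 1)
zipf_slope = loglog_slope(ranks, freqs)        # Zipf: frequency ~ rank^slope

sizes, vocab = zip(*vocabulary_growth(tokens))
heaps_slope = loglog_slope(sizes, vocab)       # Heaps: vocabulary ~ size^slope
```

On a real corpus the fit would normally be restricted to a chosen rank or size range, since the abstract reports that only the more common words follow the classic Zipf regime.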
- Proceedings. Biological sciences / The Royal Society
- Published about 6 years ago
Phonology and syntax represent two layers of sound combination central to language’s expressive power. Comparative animal studies represent one approach to understanding the origins of these combinatorial layers. Traditionally, phonology, where meaningless sounds form words, has been considered a simpler combination than syntax, and thus should be more common in animals. A linguistically informed review of animal call sequences demonstrates that phonology in animal vocal systems is rare, whereas syntax is more widespread. In the light of this and the absence of phonology in some languages, we hypothesize that syntax, present in all languages, evolved before phonology.
It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical of academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
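The comparison of decades "via a divergence measure" can be sketched as follows. The abstract does not name the measure, so the Jensen-Shannon divergence is assumed here as one common choice. The function names and the word counts are hypothetical toy values, not data from the corpus. The per-word terms show which words contribute most to the difference between two periods.

```python
import math


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def jsd_contributions(counts_a, counts_b):
    """Per-word contributions to the Jensen-Shannon divergence (in bits)."""
    p = normalize(counts_a)
    q = normalize(counts_b)
    contributions = {}
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)          # mixture distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def jsd(counts_a, counts_b):
    """Jensen-Shannon divergence: 0 for identical, 1 bit for disjoint."""
    return sum(jsd_contributions(counts_a, counts_b).values())


# Hypothetical toy counts standing in for two decades of a corpus.
decade_early = {"love": 50, "heart": 30, "figure": 5}
decade_late = {"love": 20, "heart": 10, "figure": 40, "et": 15, "al": 15}

divergence = jsd(decade_early, decade_late)
top_words = sorted(
    jsd_contributions(decade_early, decade_late).items(),
    key=lambda item: item[1],
    reverse=True,
)
```

Sorting the contributions, as in `top_words`, is one way to surface citation-like vocabulary of the kind the abstract attributes to scientific texts.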
- Philosophical transactions of the Royal Society of London. Series B, Biological sciences
- Published almost 6 years ago
Iconicity, a resemblance between properties of linguistic form (both in spoken and signed languages) and meaning, has traditionally been considered to be a marginal, irrelevant phenomenon for our understanding of language processing, development and evolution. Rather, the arbitrary and symbolic nature of language has long been taken as a design feature of the human linguistic system. In this paper, we propose an alternative framework in which iconicity in face-to-face communication (spoken and signed) is a powerful vehicle for bridging between language and human sensori-motor experience, and, as such, iconicity provides a key to understanding language evolution, development and processing. In language evolution, iconicity might have played a key role in establishing displacement (the ability of language to refer beyond what is immediately present), which is core to what language does; in ontogenesis, iconicity might play a critical role in supporting referentiality (learning to map linguistic labels to objects, events, etc., in the world), which is core to vocabulary development. Finally, in language processing, iconicity could provide a mechanism to account for how language comes to be embodied (grounded in our sensory and motor systems), which is core to meaningful communication.
Human language is composed of sequences of reusable elements. The origin of the sequential structure of language is a hotly debated topic in evolutionary linguistics. In this paper, we show that sets of sequences with language-like statistical properties can emerge from a process of cultural evolution under pressure from chunk-based memory constraints. We employ a novel experimental task that is non-linguistic and non-communicative in nature, in which participants are trained on and later asked to recall a set of sequences one-by-one. Recalled sequences from one participant become training data for the next participant. In this way, we simulate cultural evolution in the laboratory. Our results show a cumulative increase in structure, and by comparing this structure to data from existing linguistic corpora, we demonstrate a close parallel between the sets of sequences that emerge in our experiment and those seen in natural language.
The study of language evolution, and human cognitive evolution more generally, has often been ridiculed as unscientific, but in fact it differs little from many other disciplines that investigate past events, such as geology or cosmology. Well-crafted models of language evolution generate numerous testable hypotheses, and if the principles of strong inference (simultaneous testing of multiple plausible hypotheses) are adopted, there is an increasing amount of relevant data allowing empirical evaluation of such models. The articles in this special issue provide a concise overview of current models of language evolution, emphasizing the testable predictions that they make, along with overviews of the many sources of data available to test them (emphasizing comparative, neural, and genetic data). The key challenge facing the study of language evolution is not a lack of data, but rather a weak commitment to hypothesis-testing approaches and strong inference, exacerbated by the broad and highly interdisciplinary nature of the relevant data. This introduction offers an overview of the field, and a summary of what needed to evolve to provide our species with language-ready brains. It then briefly discusses different contemporary models of language evolution, followed by an overview of different sources of data to test these models. I conclude with my own multistage model of how different components of language could have evolved.
We present a new open source software tool called BEASTling, designed to simplify the preparation of Bayesian phylogenetic analyses of linguistic data using the BEAST 2 platform. BEASTling transforms comparatively short and human-readable configuration files into the XML files used by BEAST to specify analyses. By taking advantage of Creative Commons-licensed data from the Glottolog language catalog, BEASTling allows the user to conveniently filter datasets using names for recognised language families, to impose monophyly constraints so that inferred language trees are backward compatible with Glottolog classifications, or to assign geographic location data to languages for phylogeographic analyses. Support for the emerging cross-linguistic linked data format (CLDF) permits easy incorporation of data published in cross-linguistic linked databases into analyses. BEASTling is intended to make the power of Bayesian analysis more accessible to historical linguists without strong programming backgrounds, in the hopes of encouraging communication and collaboration between those developing computational models of language evolution (who are typically not linguists) and relevant domain experts.
Among the 7100 languages spoken on Earth, the Koreanic language is the 13th largest, with about 77 million speakers in and around the Korean Peninsula. In comparison to other languages of similar size, however, surprisingly little is known about the evolution of the Koreanic language. This is mainly due to two reasons. The first reason is that the genealogical relationship of Koreanic to neighboring languages remains uncertain, and thus inference from the linguistic comparative method provides only provisional evidence. The second reason is that, as the ancestral Koreanic speakers lacked their own writing system until around 500 years ago, there are scant historical materials to peer into the past, except for those preserved in Sinitic characters that we have no straightforward way of interpreting. Here I attempt to overcome these disadvantages and shed some light on the linguistic history of the Korean Peninsula, by analyzing the internal variation of the Koreanic language with methods adopted from evolutionary biology. The preliminary results presented here suggest that the evolutionary history of the Koreanic language is characterized by a weak hierarchical structure, and intensive gene/culture flows within the Korean Peninsula seem to have promoted linguistic homogeneity among the Koreanic variants. Despite the gene/culture flows, however, there are still three detectable linguistic barriers in the Korean Peninsula that appear to have been shaped by geographical features such as mountains, elevated areas, and ocean. I discuss these findings in an inclusive manner to lay the groundwork for future studies.
Explaining the diversity of languages across the world is one of the central aims of typological, historical, and evolutionary linguistics. We consider the effect of language contact (the number of non-native speakers a language has) on the way languages change and evolve. By analysing hundreds of languages within and across language families, regions, and text types, we show that languages with greater levels of contact typically employ fewer word forms to encode the same information content (a property we refer to as lexical diversity). Based on three types of statistical analyses, we demonstrate that this variance can in part be explained by the impact of non-native speakers on information encoding strategies. Finally, we argue that languages are information encoding systems shaped by the varying needs of their speakers. Language evolution and change should be modeled as the co-evolution of multiple intertwined adaptive systems: on one hand, the structure of human societies and human learning capabilities, and on the other, the structure of language.
- Journal of speech, language, and hearing research : JSLHR
- Published over 2 years ago
We aimed to study narrative skills in Mandarin-speaking children with language impairment (LI) to compare with children with LI speaking Indo-European languages.