Crowdsourcing linguistic phenomena with smartphone applications is relatively new. In linguistics, apps have predominantly been developed to create pronunciation dictionaries, to train acoustic models, and to archive endangered languages. This paper presents the first account of how apps can be used to collect data suitable for documenting language change: we created an app, Dialäkt Äpp (DÄ), which predicts users' dialects. For 16 linguistic variables, users select a dialectal variant from a drop-down menu. DÄ then geographically locates the user’s dialect by suggesting a list of communes where dialect variants most similar to their choices are used. Underlying this prediction are 16 maps from the historical Linguistic Atlas of German-speaking Switzerland, which documents the linguistic situation around 1950. Where users disagree with the prediction, they can indicate what they consider to be their dialect’s location. With this information, the 16 variables can be assessed for language change. Thanks to the playfulness of its functionality, DÄ has reached many users; our linguistic analyses are based on data from nearly 60,000 speakers. Results reveal a relative stability for phonetic variables, while lexical and morphological variables seem more prone to change. Crowdsourcing large amounts of dialect data with smartphone apps has the potential to complement existing data collection techniques and to provide evidence that traditional methods cannot, with normal resources, hope to gather. Nonetheless, it is important to emphasize a range of methodological caveats, including sparse knowledge of users' linguistic backgrounds (users only indicate age, sex) and users' self-declaration of their dialect. These are discussed and evaluated in detail here. Findings remain intriguing nevertheless: as a means of quality control, we report that traditional dialectological methods have revealed trends similar to those found by the app. This underlines the validity of the crowdsourcing method. We are presently extending DÄ architecture to other languages.
In contrast with animal communication systems, diversity is characteristic of almost every aspect of human language. Languages variously employ tones, clicks, or manual signs to signal differences in meaning; some languages lack the noun-verb distinction (e.g., Straits Salish), whereas others have a proliferation of fine-grained syntactic categories (e.g., Tzeltal); and some languages do without morphology (e.g., Mandarin), while others pack a whole sentence into a single word (e.g., Cayuga). A challenge for evolutionary biology is to reconcile the diversity of languages with the high degree of biological uniformity of their speakers. Here, we model processes of language change and geographical dispersion and find a consistent pressure for flexible learning, irrespective of the language being spoken. This pressure arises because flexible learners can best cope with the observed high rates of linguistic change associated with divergent cultural evolution following human migration. Thus, rather than genetic adaptations for specific aspects of language, such as recursion, the coevolution of genes and fast-changing linguistic structure provides the biological basis for linguistic diversity. Only biological adaptations for flexible learning combined with cultural evolution can explain how each child has the potential to learn any human language.
We contrasted the predictive power of three measures of semantic richness-number of features (NFs), contextual dispersion (CD), and a novel measure of number of semantic neighbors (NSN)-for a large set of concrete and abstract concepts on lexical decision and naming tasks. NSN (but not NF) facilitated processing for abstract concepts, while NF (but not NSN) facilitated processing for the most concrete concepts, consistent with claims that linguistic information is more relevant for abstract concepts in early processing. Additionally, converging evidence from two datasets suggests that when NSN and CD are controlled for, the features that most facilitate processing are those associated with a concept’s physical characteristics and real-world contexts. These results suggest that rich linguistic contexts (many semantic neighbors) facilitate early activation of abstract concepts, whereas concrete concepts benefit more from rich physical contexts (many associated objects and locations).
A word like Huh?-used as a repair initiator when, for example, one has not clearly heard what someone just said- is found in roughly the same form and function in spoken languages across the globe. We investigate it in naturally occurring conversations in ten languages and present evidence and arguments for two distinct claims: that Huh? is universal, and that it is a word. In support of the first, we show that the similarities in form and function of this interjection across languages are much greater than expected by chance. In support of the second claim we show that it is a lexical, conventionalised form that has to be learnt, unlike grunts or emotional cries. We discuss possible reasons for the cross-linguistic similarity and propose an account in terms of convergent evolution. Huh? is a universal word not because it is innate but because it is shaped by selective pressures in an interactional environment that all languages share: that of other-initiated repair. Our proposal enhances evolutionary models of language change by suggesting that conversational infrastructure can drive the convergent cultural evolution of linguistic items.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 2 years ago
How universal is human conceptual structure? The way concepts are organized in the human brain may reflect distinct features of cultural, historical, and environmental background in addition to properties universal to human cognition. Semantics, or meaning expressed through language, provides indirect access to the underlying conceptual structure, but meaning is notoriously difficult to measure, let alone parameterize. Here, we provide an empirical measure of semantic proximity between concepts using cross-linguistic dictionaries to translate words to and from languages carefully selected to be representative of worldwide diversity. These translations reveal cases where a particular language uses a single “polysemous” word to express multiple concepts that another language represents using distinct words. We use the frequency of such polysemies linking two concepts as a measure of their semantic proximity and represent the pattern of these linkages by a weighted network. This network is highly structured: Certain concepts are far more prone to polysemy than others, and naturally interpretable clusters of closely related concepts emerge. Statistical analysis of the polysemies observed in a subset of the basic vocabulary shows that these structural properties are consistent across different language groups, and largely independent of geography, environment, and the presence or absence of a literary tradition. The methods developed here can be applied to any semantic domain to reveal the extent to which its conceptual structure is, similarly, a universal attribute of human cognition and language use.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 1 year ago
It is widely assumed that one of the fundamental properties of spoken language is the arbitrary relation between sound and meaning. Some exceptions in the form of nonarbitrary associations have been documented in linguistics, cognitive science, and anthropology, but these studies only involved small subsets of the 6,000+ languages spoken in the world today. By analyzing word lists covering nearly two-thirds of the world’s languages, we demonstrate that a considerable proportion of 100 basic vocabulary items carry strong associations with specific kinds of human speech sounds, occurring persistently across continents and linguistic lineages (linguistic families or isolates). Prominently among these relations, we find property words (“small” and i, “full” and p or b) and body part terms (“tongue” and l, “nose” and n). The areal and historical distribution of these associations suggests that they often emerge independently rather than being inherited or borrowed. Our results therefore have important implications for the language sciences, given that nonarbitrary associations have been proposed to play a critical role in the emergence of cross-modal mappings, the acquisition of language, and the evolution of our species' unique communication system.
There would be little adaptive value in a complex communication system like human language if there were no ways to detect and correct problems. A systematic comparison of conversation in a broad sample of the world’s languages reveals a universal system for the real-time resolution of frequent breakdowns in communication. In a sample of 12 languages of 8 language families of varied typological profiles we find a system of ‘other-initiated repair’, where the recipient of an unclear message can signal trouble and the sender can repair the original message. We find that this system is frequently used (on average about once per 1.4 minutes in any language), and that it has detailed common properties, contrary to assumptions of radical cultural variation. Unrelated languages share the same three functionally distinct types of repair initiator for signalling problems and use them in the same kinds of contexts. People prefer to choose the type that is the most specific possible, a principle that minimizes cost both for the sender being asked to fix the problem and for the dyad as a social unit. Disruption to the conversation is kept to a minimum, with the two-utterance repair sequence being on average no longer that the single utterance which is being fixed. The findings, controlled for historical relationships, situation types and other dependencies, reveal the fundamentally cooperative nature of human communication and offer support for the pragmatic universals hypothesis: while languages may vary in the organization of grammar and meaning, key systems of language use may be largely similar across cultural groups. They also provide a fresh perspective on controversies about the core properties of language, by revealing a common infrastructure for social interaction which may be the universal bedrock upon which linguistic diversity rests.
Prior work decoding linguistic meaning from imaging data has been largely limited to concrete nouns, using similar stimuli for training and testing, from a relatively small number of semantic categories. Here we present a new approach for building a brain decoding system in which words and sentences are represented as vectors in a semantic space constructed from massive text corpora. By efficiently sampling this space to select training stimuli shown to subjects, we maximize the ability to generalize to new meanings from limited imaging data. To validate this approach, we train the system on imaging data of individual concepts, and show it can decode semantic vector representations from imaging data of sentences about a wide variety of both concrete and abstract topics from two separate datasets. These decoded representations are sufficiently detailed to distinguish even semantically similar sentences, and to capture the similarity structure of meaning relationships between sentences.
- World psychiatry : official journal of the World Psychiatric Association (WPA)
- Published 4 months ago
Language and speech are the primary source of data for psychiatrists to diagnose and treat mental disorders. In psychosis, the very structure of language can be disturbed, including semantic coherence (e.g., derailment and tangentiality) and syntactic complexity (e.g., concreteness). Subtle disturbances in language are evident in schizophrenia even prior to first psychosis onset, during prodromal stages. Using computer-based natural language processing analyses, we previously showed that, among English-speaking clinical (e.g., ultra) high-risk youths, baseline reduction in semantic coherence (the flow of meaning in speech) and in syntactic complexity could predict subsequent psychosis onset with high accuracy. Herein, we aimed to cross-validate these automated linguistic analytic methods in a second larger risk cohort, also English-speaking, and to discriminate speech in psychosis from normal speech. We identified an automated machine-learning speech classifier - comprising decreased semantic coherence, greater variance in that coherence, and reduced usage of possessive pronouns - that had an 83% accuracy in predicting psychosis onset (intra-protocol), a cross-validated accuracy of 79% of psychosis onset prediction in the original risk cohort (cross-protocol), and a 72% accuracy in discriminating the speech of recent-onset psychosis patients from that of healthy individuals. The classifier was highly correlated with previously identified manual linguistic predictors. Our findings support the utility and validity of automated natural language processing methods to characterize disturbances in semantics and syntax across stages of psychotic disorder. The next steps will be to apply these methods in larger risk cohorts to further test reproducibility, also in languages other than English, and identify sources of variability. This technology has the potential to improve prediction of psychosis outcome among at-risk youths and identify linguistic targets for remediation and preventive intervention. More broadly, automated linguistic analysis can be a powerful tool for diagnosis and treatment across neuropsychiatry.
The claim that Eskimo languages have words for different types of snow is well-known among the public, but has been greatly exaggerated through popularization and is therefore viewed with skepticism by many scholars of language. Despite the prominence of this claim, to our knowledge the line of reasoning behind it has not been tested broadly across languages. Here, we note that this reasoning is a special case of the more general view that language is shaped by the need for efficient communication, and we empirically test a variant of it against multiple sources of data, including library reference works, Twitter, and large digital collections of linguistic and meteorological data. Consistent with the hypothesis of efficient communication, we find that languages that use the same linguistic form for snow and ice tend to be spoken in warmer climates, and that this association appears to be mediated by lower communicative need to talk about snow and ice. Our results confirm that variation in semantic categories across languages may be traceable in part to local communicative needs. They suggest moreover that despite its awkward history, the topic of “words for snow” may play a useful role as an accessible instance of the principle that language supports efficient communication.