We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality.
The Pirahã language has been at the center of recent debates in linguistics, in large part because it is claimed not to exhibit recursion, a purported universal of human language. Here, we present an analysis of a novel corpus of natural Pirahã speech that was originally collected by Dan Everett and Steve Sheldon. We make the corpus freely available for further research. In the corpus, Pirahã sentences have been shallowly parsed and given morpheme-aligned English translations. We use the corpus to investigate the formal complexity of Pirahã syntax by searching for evidence of syntactic embedding. In particular, we search for sentences which could be analyzed as containing center-embedding, sentential complements, adverbials, complementizers, embedded possessors, conjunction or disjunction. We do not find unambiguous evidence for recursive embedding of sentences or noun phrases in the corpus. We find that the corpus is plausibly consistent with an analysis of Pirahã as a regular language, although this is not the only plausible analysis.
Do principles of language processing in the brain affect the way grammar evolves over time or is language change just a matter of socio-historical contingency? While the balance of evidence has been ambiguous and controversial, we identify here a neurophysiological constraint on the processing of language that has a systematic effect on the evolution of how noun phrases are marked by case (i.e. by such contrasts as between the English base form she and the object form her). In neurophysiological experiments across diverse languages we found that during processing, participants initially interpret the first base-form noun phrase they hear (e.g. she…) as an agent (which would fit a continuation like … greeted him), even when the sentence later requires the interpretation of a patient role (as in … was greeted). We show that this processing principle is also operative in Hindi, a language where initial base-form noun phrases most commonly denote patients because many agents receive a special case marker (“ergative”) and are often left out in discourse. This finding suggests that the principle is species-wide and independent of the structural affordances of specific languages. As such, the principle favors the development and maintenance of case-marking systems that equate base-form cases with agents rather than with patients. We confirm this evolutionary bias by statistical analyses of phylogenetic signals in over 600 languages worldwide, controlling for confounding effects from language contact. Our findings suggest that at least one core property of grammar systematically adapts in its evolution to the neurophysiological conditions of the brain, independently of socio-historical factors. This opens up new avenues for understanding how specific properties of grammar have developed in tight interaction with the biological evolution of our species.
We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of a given musical corpus. Instead of using the n-body interactions of (n-1)-order Markov models, traditionally used in automatic music generation, we use a k-nearest neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid over-fitting problems typical of Markov models. We show that long-range musical phrases don’t need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by contrasting how much the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. Our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, our scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.
One universal feature of human languages is the division between grammatical functors and content words. From a learnability point of view, functors might provide entry points or anchors into the syntactic structure of utterances due to their high frequency. Despite its potentially universal scope, this hypothesis has not yet been tested on typologically different languages and on populations of different ages. Here we report a corpus study and an artificial grammar learning experiment testing the anchoring hypothesis in Basque, Japanese, French, and Italian adults. We show that adults are sensitive to the distribution of functors in their native language and use them when learning new linguistic material. However, compared to infants' performance on a similar task, adults exhibit a slightly different behavior, matching the frequency distributions of their native language more closely than infants do. This finding bears on the issue of the continuity of language learning mechanisms.
- Proceedings of the National Academy of Sciences of the United States of America
- Published about 1 year ago
Although sentences unfold sequentially, one word at a time, most linguistic theories propose that their underlying syntactic structure involves a tree of nested phrases rather than a linear sequence of words. Whether and how the brain builds such structures, however, remains largely unknown. Here, we used human intracranial recordings and visual word-by-word presentation of sentences and word lists to investigate how left-hemispheric brain activity varies during the formation of phrase structures. In a broad set of language-related areas, comprising multiple superior temporal and inferior frontal sites, high-gamma power increased with each successive word in a sentence but decreased suddenly whenever words could be merged into a phrase. Regression analyses showed that each additional word or multiword phrase contributed a similar amount of additional brain activity, providing evidence for a merge operation that applies equally to linguistic objects of arbitrary complexity. More superficial models of language, based solely on sequential transition probability over lexical and syntactic categories, only captured activity in the posterior middle temporal gyrus. Formal model comparison indicated that the model of multiword phrase construction provided a better fit than probability-based models at most sites in superior temporal and inferior frontal cortices. Activity in those regions was consistent with a neural implementation of a bottom-up or left-corner parser of the incoming language stream. Our results provide initial intracranial evidence for the neurophysiological reality of the merge operation postulated by linguists and suggest that the brain compresses syntactically well-formed sequences of words into a hierarchy of nested phrases.
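The rise-and-drop pattern described above (one more unit of activity per word, a sudden drop when words merge into a phrase) can be illustrated with a toy count of unmerged nodes. This is only a sketch under stated assumptions, not the study's actual regression model: the function name `open_node_counts` and the bracketed-token input format are hypothetical, and the example bracketing is invented for illustration.

```python
def open_node_counts(tokens):
    """Return the number of unmerged nodes held after each word.

    `tokens` is a bracketed sentence split on whitespace, where "[" opens a
    phrase and "]" closes it. Each word adds one node; closing a phrase
    merges all of its children into a single node.
    """
    stack = [0]   # unmerged node count at each bracket depth
    counts = []   # one entry per word, recorded before any following merge
    for tok in tokens:
        if tok == "[":
            stack.append(0)           # start a new, empty phrase
        elif tok == "]":
            stack.pop()               # children collapse ...
            stack[-1] += 1            # ... into one phrase node in the parent
        else:
            stack[-1] += 1            # a word is a new unmerged node
            counts.append(sum(stack))
    return counts


sentence = "[ [ the dog ] [ chased [ the cat ] ] ]".split()
print(open_node_counts(sentence))
```

In this toy example the count climbs within "the dog", and the merge of that phrase shows up as a lower count at "chased" than a purely word-by-word tally would give.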
Event-related brain potentials (ERPs) have been instrumental for discerning the relationship between children’s aerobic fitness and aspects of cognition, yet language processing remains unexplored. ERPs linked to the processing of semantic information (the N400) and the analysis of language structure (the P600) were recorded from higher and lower aerobically fit children as they read normal sentences and those containing semantic or syntactic violations. Results revealed that higher fit children exhibited greater N400 amplitude and shorter latency across all sentence types, and a larger P600 effect for syntactic violations. Such findings suggest that higher fitness may be associated with a richer network of words and their meanings, and a greater ability to detect and/or repair syntactic errors. The current findings extend previous ERP research explicating the cognitive benefits associated with greater aerobic fitness in children and may have important implications for learning and academic performance.
We used eye-tracking to investigate if and when children show an incremental bias to assume that the first noun phrase in a sentence is the agent (first-NP-as-agent bias) while processing the meaning of English active and passive transitive sentences. We also investigated whether children can override this bias to successfully distinguish active from passive sentences, after processing the remainder of the sentence frame. For this second question we used eye-tracking (Study 1) and forced-choice pointing (Study 2). For both studies, we used a paradigm in which participants simultaneously saw two novel actions with reversed agent-patient relations while listening to active and passive sentences. We compared English-speaking 25-month-olds and 41-month-olds in between-subjects sentence structure conditions (Active Transitive Condition vs. Passive Condition). A permutation analysis found that both age groups showed a bias to incrementally map the first noun in a sentence onto an agent role. Regarding the second question, 25-month-olds showed some evidence of distinguishing the two structures in the eye-tracking study. However, the 25-month-olds did not distinguish active from passive sentences in the forced choice pointing task. In contrast, the 41-month-old children did reanalyse their initial first-NP-as-agent bias to the extent that they clearly distinguished between active and passive sentences both in the eye-tracking data and in the pointing task. The results are discussed in relation to the development of syntactic (re)parsing.
Autism spectrum disorder (ASD) is frequently associated with communicative impairment, regardless of intelligence level or mental age. Impairment of prosodic processing in particular is a common feature of ASD. Despite extensive overlap in neural resources involved in prosody and music processing, music perception seems to be spared in this population. The present study is the first to investigate prosodic phrasing in ASD in both language and music, combining event-related brain potential (ERP) and behavioral methods. We tested phrase boundary processing in language and music in neuro-typical adults and high-functioning individuals with ASD. We targeted an ERP response associated with phrase boundary processing in both language and music - i.e., the Closure Positive Shift (CPS). While a language-CPS was observed in the neuro-typical group, for ASD participants a smaller response failed to reach statistical significance. In music, we found a boundary-onset music-CPS for both groups during pauses between musical phrases. Our results support the view of preserved processing of musical cues in ASD individuals, with a corresponding prosodic impairment. This suggests that, despite the existence of a domain-general processing mechanism (the CPS), key differences in the integration of features of language and music may lead to the prosodic impairment in ASD.
Many species of animals deliver vocalizations in sequences presumed to be governed by internal rules, though the nature and complexity of these syntactical rules have been investigated in relatively few species. Here I present an investigation into the song syntax of fourteen male Cassin’s Vireos (Vireo cassinii), a species whose song sequences are highly temporally structured. I compare their song sequences to three candidate models of varying levels of complexity (zero-order, first-order and second-order Markov models) and employ novel methods to interpolate between these three models. A variety of analyses, including sequence simulations, Fisher’s exact tests, and model likelihood analyses, showed that the songs of this species are too complex to be described by a zero-order or first-order Markov model. The model that best fit the data was intermediate in complexity between a first- and second-order model, though I also present evidence that some transition probabilities are conditioned on up to three preceding phrases. In addition, sequences were shown to be predictable with more than 54% accuracy overall, and predictability was positively correlated with the rate of song delivery. An assessment of the time homogeneity of syntax showed that transition probabilities between phrase types are largely stable over time, but that there was some evidence for modest changes in syntax within and between breeding seasons, a finding that I interpret to represent changes in breeding stage and social context rather than irreversible, secular shifts in syntax over time. These findings constitute a valuable addition to our understanding of bird song syntax in free-living birds, and will contribute to future attempts to understand the evolutionary importance of bird song syntax in avian communication.
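Comparing Markov models of different orders, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's method: the helper names (`fit_markov`, `log_likelihood`, `prediction_accuracy`), the add-alpha smoothing, and the toy phrase sequence are all hypothetical, and the interpolation between model orders mentioned in the abstract is not reproduced.

```python
from collections import Counter, defaultdict
import math


def fit_markov(seq, order):
    """Count how often each symbol follows each context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[tuple(seq[i - order:i])][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, order, alphabet_size, start=None, alpha=1.0):
    """Log-likelihood of seq[start:] under the model, with add-alpha smoothing.

    Pass the same `start` for every model so they are scored on the same symbols.
    """
    start = order if start is None else start
    total = 0.0
    for i in range(start, len(seq)):
        ctx = counts.get(tuple(seq[i - order:i]), Counter())
        total += math.log((ctx[seq[i]] + alpha) /
                          (sum(ctx.values()) + alpha * alphabet_size))
    return total


def prediction_accuracy(seq, counts, order):
    """Fraction of symbols correctly guessed as the most frequent successor."""
    hits = trials = 0
    for i in range(order, len(seq)):
        ctx = counts.get(tuple(seq[i - order:i]))
        if not ctx:
            continue  # unseen context: no guess is made
        trials += 1
        hits += ctx.most_common(1)[0][0] == seq[i]
    return hits / trials if trials else 0.0


song = list("ABC" * 10)  # toy sequence of phrase types, not real data
for order in (0, 1, 2):
    model = fit_markov(song, order)
    print(order,
          round(log_likelihood(song, model, order, alphabet_size=3, start=2), 2),
          round(prediction_accuracy(song, model, order), 2))
```

The scores here are computed on the same sequence used for fitting, so they favor higher orders; a held-out recording or a penalized criterion would be needed for a fair comparison between orders.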