Concept: Speech processing
The technology for evaluating patient-provider interactions in psychotherapy (observational coding) has not changed in 70 years. It is labor-intensive, error-prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings was evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
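The two kinds of agreement reported above (a correlation for continuous scores and an F-score for high vs. low classifications) can be illustrated with a minimal sketch. The scores, the threshold, and the function names below are hypothetical and are not taken from the study. The F-score is computed here in its common form, the harmonic mean of precision and recall, which may differ from the exact formula the study used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def f1_score(true_labels, pred_labels):
    """F1 for boolean labels: harmonic mean of precision and recall."""
    tp = sum(t and p for t, p in zip(true_labels, pred_labels))
    fp = sum((not t) and p for t, p in zip(true_labels, pred_labels))
    fn = sum(t and (not p) for t, p in zip(true_labels, pred_labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-session empathy scores (illustrative values only).
human_scores = [2.0, 3.5, 4.0, 5.5, 6.0, 6.5]
model_scores = [2.5, 3.0, 4.5, 5.0, 6.5, 6.0]

# Hypothetical cut-off separating "high" from "low" empathy.
THRESHOLD = 4.25
human_high = [s >= THRESHOLD for s in human_scores]
model_high = [s >= THRESHOLD for s in model_scores]

correlation = pearson_r(human_scores, model_scores)
f_score = f1_score(human_high, model_high)
print(f"correlation={correlation:.2f}, F-score={f_score:.2f}")
```

The correlation measures how well the continuous scores track the human ratings, while the F-score only looks at whether each session lands on the same side of the cut-off.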
Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e. speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.
Dyslexia is associated with numerous deficits to speech processing. Accordingly, a large literature asserts that dyslexics manifest a phonological deficit. Few studies, however, have assessed the phonological grammar of dyslexics, and none has distinguished a phonological deficit from a phonetic impairment. Here, we show that these two sources can be dissociated. Three experiments demonstrate that a group of adult dyslexics studied here is impaired in phonetic discrimination (e.g., ba vs. pa), and their deficit compromises even the basic ability to identify acoustic stimuli as human speech. Remarkably, the ability of these individuals to generalize grammatical phonological rules is intact. Like typical readers, these Hebrew-speaking dyslexics identified ill-formed AAB stems (e.g., titug) as less wordlike than well-formed ABB controls (e.g., gitut), and both groups automatically extended this rule to nonspeech stimuli, irrespective of reading ability. The contrast between the phonetic and phonological capacities of these individuals demonstrates that the algebraic engine that generates phonological patterns is distinct from the phonetic interface that implements them. While dyslexia compromises the phonetic system, certain core aspects of the phonological grammar can be spared.
Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today’s researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.
- Proceedings of the National Academy of Sciences of the United States of America
- Published about 3 years ago
Temporal cues are important for discerning word boundaries and syllable segments in speech; their perception facilitates language acquisition and development. Beat synchronization and neural encoding of speech reflect precision in processing temporal cues and have been linked to reading skills. In poor readers, diminished neural precision may contribute to rhythmic and phonological deficits. Here we establish links between beat synchronization and speech processing in children who have not yet begun to read: preschoolers who can entrain to an external beat have more faithful neural encoding of temporal modulations in speech and score higher on tests of early language skills. In summary, we propose precise neural encoding of temporal modulations as a key mechanism underlying reading acquisition. Because beat synchronization abilities emerge at an early age, these findings may inform strategies for early detection of and intervention for language-based learning disabilities.
Because linguistic communication is inherently noisy and uncertain, adult language comprehenders integrate bottom-up cues from speech perception with top-down expectations about what speakers are likely to say. Further, in line with the predictions of ideal-observer models, past results have shown that adult comprehenders flexibly adapt how much they rely on these two kinds of cues in proportion to their changing reliability. Do children also show evidence of flexible, expectation-based language comprehension? We presented preschoolers with ambiguous utterances that could be interpreted in two different ways, depending on whether the children privileged perceptual input or top-down expectations. Across three experiments, we manipulated the reliability of both their perceptual input and their expectations about the speaker’s intended meaning. As predicted by noisy-channel models of speech processing, results showed that 4- and 5-year-old (but perhaps not younger) children flexibly adjusted their interpretations as cues changed in reliability.
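The noisy-channel idea described above can be sketched as a simple Bayesian update, in which a listener weighs a prior expectation against the reliability of the perceptual input. This is an illustrative toy model only, not the analysis used in the study: the two hypotheses, the `noise` parameter, and the numbers are assumptions chosen for the example.

```python
def posterior(prior, likelihood):
    """Normalize prior * likelihood over a shared set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


def prob_expected_reading(prior_expected, noise):
    """Posterior probability of the expectation-consistent reading.

    Two hypothetical readings of an ambiguous utterance:
      "literal":  the speaker said exactly what was perceived
      "expected": the speaker said what was expected and the
                  percept was corrupted by noise
    `noise` is the assumed probability that the percept is corrupted.
    """
    prior = {"expected": prior_expected, "literal": 1.0 - prior_expected}
    likelihood = {"expected": noise, "literal": 1.0 - noise}
    return posterior(prior, likelihood)["expected"]


# Same expectation, two levels of perceptual reliability (toy values).
clear_input = prob_expected_reading(prior_expected=0.8, noise=0.1)
noisy_input = prob_expected_reading(prior_expected=0.8, noise=0.4)
print(f"clear input: {clear_input:.2f}, noisy input: {noisy_input:.2f}")
```

In this sketch, the less reliable the perceptual input, the more the interpretation shifts toward the expected reading, which is the qualitative pattern an ideal-observer account predicts.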
The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action which begins as soon as it can.
There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.
Infants preferentially discriminate between speech tokens that cross native category boundaries prior to acquiring a large receptive vocabulary, implying a major role for unsupervised distributional learning strategies in phoneme acquisition in the first year of life. Multiple sources of between-speaker variability contribute to children’s language input and thus complicate the problem of distributional learning. Adults resolve this type of indexical variability by adjusting their speech processing for individual speakers. For infants to handle indexical variation in the same way, they must be sensitive to both linguistic and indexical cues. To assess infants' sensitivity to and relative weighting of indexical and linguistic cues, we familiarized 12-month-old infants to tokens of a vowel produced by one speaker, and tested their listening preference to trials containing a vowel category change produced by the same speaker (linguistic information), and the same vowel category produced by another speaker of the same or a different accent (indexical information). Infants noticed linguistic and indexical differences, suggesting that both are salient in infant speech processing. Future research should explore how infants weight these cues in a distributional learning context that contains both phonetic and indexical variation.
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
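The leave-one-speaker-out protocol and the average-recall metric mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: a nearest-centroid classifier stands in for the SVM so the example needs only the standard library, average recall is taken as the unweighted mean of per-class recalls, and the features, labels, and speaker IDs are made-up toy values.

```python
from collections import defaultdict


def unweighted_average_recall(true_labels, pred_labels):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(true_labels)):
        idx = [i for i, t in enumerate(true_labels) if t == cls]
        hits = sum(pred_labels[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def fit_centroids(features, labels):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    def squared_distance(y):
        return sum((a - b) ** 2 for a, b in zip(centroids[y], x))
    return min(centroids, key=squared_distance)


def leave_one_speaker_out(features, labels, speakers):
    """Predict each utterance with a model trained on all other speakers."""
    preds = [None] * len(labels)
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        centroids = fit_centroids(
            [features[i] for i in train], [labels[i] for i in train]
        )
        for i in test:
            preds[i] = predict(centroids, features[i])
    return preds


# Toy data: two hypothetical acoustic features per utterance.
features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1],
            [0.8, 0.9], [0.15, 0.25], [0.85, 0.75]]
labels = ["not_eating", "eating", "not_eating",
          "eating", "not_eating", "eating"]
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]

preds = leave_one_speaker_out(features, labels, speakers)
uar = unweighted_average_recall(labels, preds)
print(f"average recall: {uar:.2f}")
```

Holding out all utterances of one speaker at a time keeps the test speaker unseen during training, so the score reflects generalization to new speakers rather than memorization of voice characteristics.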