Concept: Speech recognition
The technology for evaluating patient-provider interactions in psychotherapy-observational coding-has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings were evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e. speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.
Clinical documentation has undergone a change due to the usage of electronic health records. The core element is to capture clinical findings and document therapy electronically. Health care personnel spend a significant portion of their time on the computer. Alternatives to self-typing, such as speech recognition, are currently believed to increase documentation efficiency and quality, as well as satisfaction of health professionals while accomplishing clinical documentation, but few studies in this area have been published to date.
The HCAHPS Survey obtains hospital patients' experiences using four modes: Mail Only, Phone Only, Mixed (mail/phone follow-up), and Touch-Tone (push-button) Interactive Voice Response with option to transfer to live interviewer (TT-IVR/Phone). A new randomized experiment examines two less expensive modes: Web/Mail (mail invitation to participate by Web or request a mail survey) and Speech-Enabled IVR (SE-IVR/Phone; speaking to a voice recognition system; optional transfer to an interviewer). Web/Mail had a 12% response rate (vs. 32% for Mail Only and 33% for SE-IVR/Phone); Web/Mail respondents were more educated and less often Black than Mail Only respondents. SE-IVR/Phone respondents (who usually switched to an interviewer) were less often older than 75 years, more often English-preferring, and reported better care than Mail Only respondents. Concerns regarding inconsistencies across implementations, low adherence to primary modes, or low response rate may limit the applicability of the SE-IVR/Phone and Web/Mail modes in HCAHPS and similar standardized environments.
Feasibility of automated speech sample collection with stuttering children using interactive voice response (IVR) technology
- International journal of speech-language pathology
- Published almost 4 years ago
Purpose: To investigate the feasibility of adopting automated interactive voice response (IVR) technology for remotely capturing standardized speech samples from stuttering children. Method: Participants were 10 6-year-old stuttering children. Their parents called a toll-free number from their homes and were prompted to elicit speech from their children using a standard protocol involving conversation, picture description and games. The automated IVR system was implemented using an off-the-shelf telephony software program and delivered by a standard desktop computer. The software infrastructure utilizes voice over internet protocol. Speech samples were automatically recorded during the calls. Video recordings were simultaneously acquired in the home at the time of the call to evaluate the fidelity of the telephone collected samples. Key outcome measures included syllables spoken, percentage of syllables stuttered and an overall rating of stuttering severity using a 10-point scale. Results: Data revealed a high level of relative reliability in terms of intra-class correlation between the video and telephone acquired samples on all outcome measures during the conversation task. Findings were less consistent for speech samples during picture description and games. Conclusions: Results suggest that IVR technology can be used successfully to automate remote capture of child speech samples.
Speech perception may be underpinned by a hierarchical cortical system, which attempts to match “external” incoming sensory inputs with “internal” top-down predictions. Prior knowledge modulates internal predictions of an upcoming stimulus and exerts its effects in temporal and inferior frontal cortex. Here, we used source-space magnetoencephalography (MEG) to study the spatiotemporal dynamics underpinning the integration of prior knowledge in the speech processing network. Prior knowledge was manipulated to i) increase the perceived intelligibility of speech sentences, and ii) dissociate the perceptual effects of changes in speech intelligibility from acoustical differences in speech stimuli. Cortical entrainment to the speech temporal envelope, which accounts for neural activity specifically related to sensory information, was affected by prior knowledge: This effect emerged early (∼50 ms) in left inferior frontal gyrus (IFG) and then (∼100 ms) in Heschl’s gyrus (HG), and was sustained until latencies of ∼250 ms. Directed transfer function (DTF) measures were used for estimating direct Granger causal relations between locations of interest. In line with the cortical entrainment result, this analysis indicated that prior knowledge enhanced top-down connections from left IFG to all the left temporal areas of interest - namely HG, superior temporal sulcus (STS), and middle temporal gyrus (MTG). In addition, intelligible speech increased top-down information flow between left STS and left HG, and increases bottom-up flow in higher-order temporal cortex, specifically between STS and MTG. Together these results provide a detailed view of how, where and when prior knowledge influences continuous speech perception and they are compatible with theories that explain this mechanism as a result of both ascending and descending cortical interactions, such as predictive coding.
The voice is the most direct link we have to others' minds, allowing us to communicate using a rich variety of speech cues [1, 2]. This link is particularly critical early in life as parents draw infants into the structure of their environment using infant-directed speech (IDS), a communicative code with unique pitch and rhythmic characteristics relative to adult-directed speech (ADS) [3, 4]. To begin breaking into language, infants must discern subtle statistical differences about people and voices in order to direct their attention toward the most relevant signals. Here, we uncover a new defining feature of IDS: mothers significantly alter statistical properties of vocal timbre when speaking to their infants. Timbre, the tone color or unique quality of a sound, is a spectral fingerprint that helps us instantly identify and classify sound sources, such as individual people and musical instruments [5-7]. We recorded 24 mothers' naturalistic speech while they interacted with their infants and with adult experimenters in their native language. Half of the participants were English speakers, and half were not. Using a support vector machine classifier, we found that mothers consistently shifted their timbre between ADS and IDS. Importantly, this shift was similar across languages, suggesting that such alterations of timbre may be universal. These findings have theoretical implications for understanding how infants tune in to their local communicative environments. Moreover, our classification algorithm for identifying infant-directed timbre has direct translational implications for speech recognition technology.
Several behavioural studies have shown that the interplay between voice and face information in audiovisual speech perception is not universal. Native English speakers (ESs) are influenced by visual mouth movement to a greater degree than native Japanese speakers (JSs) when listening to speech. However, the biological basis of these group differences is unknown. Here, we demonstrate the time-varying processes of group differences in terms of event-related brain potentials (ERP) and eye gaze for audiovisual and audio-only speech perception. On a behavioural level, while congruent mouth movement shortened the ESs' response time for speech perception, the opposite effect was observed in JSs. Eye-tracking data revealed a gaze bias to the mouth for the ESs but not the JSs, especially before the audio onset. Additionally, the ERP P2 amplitude indicated that ESs processed multisensory speech more efficiently than auditory-only speech; however, the JSs exhibited the opposite pattern. Taken together, the ESs' early visual attention to the mouth was likely to promote phonetic anticipation, which was not the case for the JSs. These results clearly indicate the impact of language and/or culture on multisensory speech processing, suggesting that linguistic/cultural experiences lead to the development of unique neural systems for audiovisual speech perception.
ABSTRACT Research with adults has shown that spoken language processing is improved when listeners are familiar with talkers' voices, known as the familiar talker advantage. The current study explored whether this ability extends to school-age children, who are still acquiring language. Children were familiarized with the voices of three German-English bilingual talkers and were tested on the speech of six bilinguals, three of whom were familiar. Results revealed that children do show improved spoken language processing when they are familiar with the talkers, but this improvement was limited to highly familiar lexical items. This restriction of the familiar talker advantage is attributed to differences in the representation of highly familiar and less familiar lexical items. In addition, children did not exhibit accent-general learning; despite having been exposed to German-accented talkers during training, there was no improvement for novel German-accented talkers.
Speech carries accent information relevant to determining the speaker’s linguistic and social background. A series of web-based experiments demonstrate that accent cues can modulate access to word meaning. In Experiments 1-3, British participants were more likely to retrieve the American dominant meaning (e.g., hat meaning of “bonnet”) in a word association task if they heard the words in an American than a British accent. In addition, results from a speeded semantic decision task (Experiment 4) and sentence comprehension task (Experiment 5) confirm that accent modulates on-line meaning retrieval such that comprehension of ambiguous words is easier when the relevant word meaning is dominant in the speaker’s dialect. Critically, neutral-accent speech items, created by morphing British- and American-accented recordings, were interpreted in a similar way to accented words when embedded in a context of accented words (Experiment 2). This finding indicates that listeners do not use accent to guide meaning retrieval on a word-by-word basis; instead they use accent information to determine the dialectic identity of a speaker and then use their experience of that dialect to guide meaning access for all words spoken by that person. These results motivate a speaker-model account of spoken word recognition in which comprehenders determine key characteristics of their interlocutor and use this knowledge to guide word meaning access.