Concept: Speech synthesis
The technology for evaluating patient-provider interactions in psychotherapy-observational coding-has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings were evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open to future speech BCI applications using such articulatory-based speech synthesizer.
Due to their periodic nature, neural oscillations might represent an optimal “tool” for the processing of rhythmic stimulus input [1-3]. Indeed, the alignment of neural oscillations to a rhythmic stimulus, often termed phase entrainment, has been repeatedly demonstrated [4-7]. Phase entrainment is central to current theories of speech processing [8-10] and has been associated with successful speech comprehension [11-17]. However, typical manipulations that reduce speech intelligibility (e.g., addition of noise and time reversal [11, 12, 14, 16, 17]) could destroy critical acoustic cues for entrainment (such as “acoustic edges” ). Hence, the association between phase entrainment and speech intelligibility might only be “epiphenomenal”; i.e., both decline due to the same manipulation, without any causal link between the two . Here, we use transcranial alternating current stimulation (tACS ) to manipulate the phase lag between neural oscillations and speech rhythm while measuring neural responses to intelligible and unintelligible vocoded stimuli with sparse fMRI. We found that this manipulation significantly modulates the BOLD response to intelligible speech in the superior temporal gyrus, and the strength of BOLD modulation is correlated with a phasic modulation of performance in a behavioral task. Importantly, these findings are absent for unintelligible speech and during sham stimulation; we thus demonstrate that phase entrainment has a specific, causal influence on neural responses to intelligible speech. Our results not only provide an important step toward understanding the neural foundation of human abilities at speech comprehension but also suggest new methods for enhancing speech perception that can be explored in the future.
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
We conducted a study of a motor imagery brain-computer interface (BCI) using electroencephalography to continuously control a formant frequency speech synthesizer with instantaneous auditory and visual feedback. Over a three-session training period, sixteen participants learned to control the BCI for production of three vowel sounds (/ textipa i/ [heed], / textipa A/ [hot], and / textipa u/ [who’d]) and were split into three groups: those receiving unimodal auditory feedback of synthesized speech, those receiving unimodal visual feedback of formant frequencies, and those receiving multimodal, audio-visual (AV) feedback. Audio feedback was provided by a formant frequency artificial speech synthesizer, and visual feedback was given as a 2-D cursor on a graphical representation of the plane defined by the first two formant frequencies. We found that combined AV feedback led to the greatest performance in terms of percent accuracy, distance to target, and movement time to target compared with either unimodal feedback of auditory or visual information. These results indicate that performance is enhanced when multimodal feedback is meaningful for the BCI task goals, rather than as a generic biofeedback signal of BCI progress.
It is a matter of debate, whether and how improved auditory discrimination abilities enable speeded speech comprehension in congenitally blind adults. Previous research concentrated on semantic and syntactic aspects of processing. Here we investigated phonologically mediated spoken word access processes by means of word onset priming. Blind adults and age- and gender-matched sighted adults listened to spoken word onsets (primes) followed by complete words (targets). Phonological overlap between primes and targets varied. Blind participants made faster lexical decision responses than sighted participants, yet their speeded responses were not restricted to phonologically overlapping trials. Furthermore, timing of Event Related Potential (ERP) results did not differ for blind and sighted participants. Together those results suggest that blind and sighted listeners are equally fast in implicit phonological encoding and lexical matching mechanisms. It appears that blind adults' speeded speech processing emerges when phonological analysis makes promising word candidates available for further processing. As one possible interpretation, we speculate that lexical selection processes in blind adults do not need to wait for information from the visual domain, while auditory-visual integration mechanisms are mandatorily implemented in speech recognition routines of sighted adults.
In language production research, the latency with which speakers produce a spoken response to a stimulus and the onset and offset times of words in longer utterances are key dependent variables. Measuring these variables automatically often yields partially incorrect results. However, exact measurements through the visual inspection of the recordings are extremely time-consuming. We present AlignTool, an open-source alignment tool that establishes preliminarily the onset and offset times of words and phonemes in spoken utterances using Praat, and subsequently performs a forced alignment of the spoken utterances and their orthographic transcriptions in the automatic speech recognition system MAUS. AlignTool creates a Praat TextGrid file for inspection and manual correction by the user, if necessary. We evaluated AlignTool’s performance with recordings of single-word and four-word utterances as well as semi-spontaneous speech. AlignTool performs well with audio signals with an excellent signal-to-noise ratio, requiring virtually no corrections. For audio signals of lesser quality, AlignTool still is highly functional but its results may require more frequent manual corrections. We also found that audio recordings including long silent intervals tended to pose greater difficulties for AlignTool than recordings filled with speech, which AlignTool analyzed well overall. We expect that by semi-automatizing the temporal analysis of complex utterances, AlignTool will open new avenues in language production research.
Voice Discrimination by Adults with Cochlear Implants: the Benefits of Early Implantation for Vocal-Tract Length Perception
- Journal of the Association for Research in Otolaryngology : JARO
- Published 5 months ago
Cochlear implant (CI) users find it extremely difficult to discriminate between talkers, which may partially explain why they struggle to understand speech in a multi-talker environment. Recent studies, based on findings with postlingually deafened CI users, suggest that these difficulties may stem from their limited use of vocal-tract length (VTL) cues due to the degraded spectral resolution transmitted by the CI device. The aim of the present study was to assess the ability of adult CI users who had no prior acoustic experience, i.e., prelingually deafened adults, to discriminate between resynthesized “talkers” based on either fundamental frequency (F0) cues, VTL cues, or both. Performance was compared to individuals with normal hearing (NH), listening either to degraded stimuli, using a noise-excited channel vocoder, or non-degraded stimuli. Results show that (a) age of implantation was associated with VTL but not F0 cues in discriminating between talkers, with improved discrimination for those subjects who were implanted at earlier age; (b) there was a positive relationship for the CI users between VTL discrimination and speech recognition score in quiet and in noise, but not with frequency discrimination or cognitive abilities; © early-implanted CI users showed similar voice discrimination ability as the NH adults who listened to vocoded stimuli. These data support the notion that voice discrimination is limited by the speech processing of the CI device. However, they also suggest that early implantation may facilitate sensory-driven tonotopicity and/or improve higher-order auditory functions, enabling better perception of VTL spectral cues for voice discrimination.
- Neural networks : the official journal of the International Neural Network Society
- Published about 1 year ago
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances.
Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners With Simulated Age-Related Hearing Loss
- Journal of speech, language, and hearing research : JSLHR
- Published 10 months ago
The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids.