Speech Interaction Technology
Privacy and security in speech technology
The performance of speech technology has improved rapidly over the past decade, enabling many useful applications such as personal voice assistants. Technology-led advancement, however, also creates new needs and risks: the improved capabilities of speech services increasingly expose users to threats to their privacy, and black-box technologies can make unethical choices without the users' or service providers' awareness. How can users ever trust such technologies?
We research and develop speech technologies that are privacy-preserving, secure, explainable, inclusive, and trustworthy, and that evoke a corresponding level of trust in users. In particular, we work on:
- Disentanglement of information categories in speech signals, such as phonetic content, style, affect, and the speaker's physical identity. The desired level of privacy can then be enforced individually for each category of information (a sketch of one disentanglement approach follows this list).
- Frameworks for explainability and theoretical proofs of privacy.
- Experience/perception of privacy and user-interface design: we need to understand how people experience privacy in order to design systems that respect and support users' boundaries, needs, and preferences.
- Inclusivity and equality in speech technology: the performance of speech technology should not depend on users' background, such as their gender, religion, education, ethnicity, political views, age, or physiological properties. We need methodologies for developing speech technologies that ensure performance for all populations. A central challenge is that we cannot always know in advance which population categories exist, and it is often unethical to label speakers into groups, such as those based on gender identity. Methodologies must therefore be blind to the labels of minority groups yet simultaneously equitable for all.
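As an illustration of one possible approach to the disentanglement item above, the following is a minimal PyTorch sketch of adversarial disentanglement with gradient reversal. The module names, dimensions, and the gradient-reversal technique itself are illustrative assumptions here, not a description of our published models.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DisentanglingEncoder(nn.Module):
    """Splits input features into a content code and a speaker code.
    An adversarial classifier tries to recover speaker identity from the
    content code; the reversed gradient pushes speaker cues out of it."""
    def __init__(self, n_feats=80, d_content=64, d_speaker=64, n_speakers=100):
        super().__init__()
        self.content_enc = nn.GRU(n_feats, d_content, batch_first=True)
        self.speaker_enc = nn.GRU(n_feats, d_speaker, batch_first=True)
        self.adversary = nn.Linear(d_content, n_speakers)  # hypothetical adversary head

    def forward(self, x):  # x: (batch, time, n_feats)
        content, _ = self.content_enc(x)    # frame-wise content code
        _, spk_h = self.speaker_enc(x)      # last hidden state as speaker code
        spk_logits = self.adversary(GradReverse.apply(content.mean(dim=1)))
        return content, spk_h.squeeze(0), spk_logits
```

Once speaker information is suppressed in the content code, privacy policies can be applied per category, for example by transmitting only the content code.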
A part of this research is funded by a research grant from the Strategic Research Council of the Academy of Finland, project “Trust-M: Designing Inclusive & Trustworthy Digital Public Services for Migrants in Finland”.
Parts of this work are associated with the ISCA Special Interest Group “Security and Privacy in Speech Communication” (ISCA SIG SPSC).
Contacts: Silas Rech and Tom Bäckström
Real-time speech enhancement
The use of speech services in noisy, reverberant conditions remains a challenge, especially when the user is far from the microphone or using low-cost microphones. While state-of-the-art speech enhancement methods can recover speech signals with high accuracy, they come at the cost of high complexity and/or high delay (offline operation). Real-world applications, however, require speech enhancement that operates in real time on affordable hardware.
We study speech enhancement challenges that occur in real-life scenarios (a minimal sketch of a low-delay enhancer follows this list), such as:
- Blind bandwidth extension to compensate for the low sampling rates of common microphones.
- Competing-speaker cancellation in multi-user and multi-device scenarios, such as open-plan and home offices. Note that this also addresses the privacy concerns prevalent in such shared-office scenarios.
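To illustrate the real-time constraint, below is a minimal sketch of a frame-by-frame spectral-subtraction enhancer with a single frame of algorithmic delay. The smoothing constant, gain floor, and noise-tracking rule are illustrative placeholders; practical systems use more robust noise estimators.

```python
import numpy as np

def enhance_stream(frames, alpha=0.98, gain_floor=0.1):
    """frames: iterator of windowed time-domain frames (e.g., 20 ms, 50% overlap).
    Yields enhanced frames, processing strictly one frame at a time."""
    noise_psd = None
    for frame in frames:
        spec = np.fft.rfft(frame)
        power = np.abs(spec) ** 2
        # Recursive noise-floor tracking; a real system would gate this update
        # with a voice activity detector or use minimum statistics.
        noise_psd = power if noise_psd is None else alpha * noise_psd + (1 - alpha) * power
        gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), gain_floor)
        yield np.fft.irfft(gain * spec, n=len(frame))
```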
Contacts: Silas Rech, Esteban Gomez, and Tom Bäckström
Machine learning for speech processing
Machine learning methodology has advanced rapidly and provides generic tools for a wide variety of use cases. While this approach has proven very successful, it does not take advantage of the long history of domain knowledge in speech processing. It is thus reasonable to expect that combining machine learning with speech signal processing will yield further significant improvements in performance.
Current projects include:
- Speech feature modeling by vector quantization; by reparametrizing the quantization, it can be made differentiable, enabling backpropagation (see the sketch after this list).
- Dynamic neural networks, where hyperparameters are trained online to reach an optimal trade-off between complexity and performance.
- Entropy coding with neural networks.
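For illustration, the following sketch shows the generic straight-through estimator for vector quantization in PyTorch. Our own reparametrization may differ in detail, so treat this as an assumed baseline rather than our exact method.

```python
import torch

def st_vector_quantize(x, codebook):
    """x: (batch, dim); codebook: (K, dim).
    The forward pass returns the nearest codebook entries; the backward pass
    treats quantization as the identity (straight-through estimator)."""
    dist = torch.cdist(x, codebook)   # (batch, K) pairwise distances
    idx = dist.argmin(dim=1)          # nearest-neighbour indices
    quantized = codebook[idx]
    # x + (q - x).detach() equals q in the forward pass, but gradients
    # flow to x as if no quantization had happened.
    return x + (quantized - x).detach(), idx
```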
Contacts: Mohammad Vali and Tom Bäckström
Dynamic interaction in spoken dialogue
Interactions with digital voice assistants are largely based on a question-and-answer paradigm. Human-to-human dialogue, however, is much more than that: we speak in incomplete sentences, sometimes overlapping with each other, and we use backchannel expressions like 'uh-huh' and 'hmm' to affirm, or to express disagreement, while the other is speaking. Dialogues are thus continuous, dynamic, fully bidirectional interactions, which lets speakers continuously adapt their communication. For example, when a listener tacitly communicates excitement, it supports and encourages the speaker to continue.
We want to model the dynamic interaction of human-to-human communication, both to better understand human interaction and to develop voice assistants that interact with humans more intuitively. Our hypothesis is that dynamic interaction features are communicated through the backchannel and speech style. We are therefore studying models of dynamic interaction and analyzing the communication conveyed by speaking style.
Contacts: Mariem Bouafif, Alexandra Craciun and Tom Bäckström
Speech processing for embedded devices
Most devices that feature speech processing capabilities are small, affordable, low-power embedded devices such as headsets, mobile microphones, and mobile phones. Many state-of-the-art deep learning methods, however, have billions of parameters and require substantial computational power to run online, making them infeasible in embedded environments. To enable speech processing on embedded devices, we therefore need the ability to scale and optimize resource requirements to match the available hardware and other system constraints.
We currently work on:
- Low-complexity speech processing.
- Optimization and dynamic allocation of resources in neural networks.
- Modeling of computational requirements, such that the complexity of embedded implementations of models developed in high-abstraction languages (e.g., Python, PyTorch, TensorFlow) can be reliably estimated in the high-abstraction environment. The objective is to restrict complexity already during development, so that we can be confident the embedded implementation will run in the target environment even before porting the code (see the sketch below).
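A minimal sketch of such complexity estimation in PyTorch is shown below. The per-layer multiply-accumulate (MAC) formulas cover only `nn.Linear` and `nn.Conv1d`, and the example model is an illustrative placeholder.

```python
import torch.nn as nn

def estimate_macs(model, seq_len):
    """Rough per-inference MAC and parameter count for a 1-D model.
    Only Linear and Conv1d layers are counted here; extend as needed."""
    macs = 0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            macs += m.in_features * m.out_features * seq_len
        elif isinstance(m, nn.Conv1d):
            out_len = seq_len  # assumes stride 1 and 'same' padding
            macs += m.in_channels * m.out_channels * m.kernel_size[0] * out_len
    params = sum(p.numel() for p in model.parameters())
    return macs, params

model = nn.Sequential(nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
                      nn.Conv1d(16, 1, 5, padding=2))
print(estimate_macs(model, seq_len=160))  # compare against the hardware budget
```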
Contacts: Esteban Gomez, Alexandra Craciun and Tom Bäckström
Speech enhancement with speech source models using post-filtering
In speech coding, we have learned that source models are especially important for improving efficiency. It is therefore reasonable to assume that such models would also be effective in speech enhancement. We have applied speech source models to the enhancement task to improve speech quality in noisy scenarios with low-complexity, low-delay methods. In particular, we have focused on speech coding scenarios, where speech is corrupted by quantization noise. A particular characteristic of combining speech coding and enhancement is that in speech coding it is not possible to use inter-frame information (i.e., information along the time axis, across processing windows), as any information shared across windows would either increase delay or jeopardize reconstruction in case of packet loss. Our application in the speech coding scenario is therefore implemented entirely on the decoder side: we treat quantization noise at the decoder as noise and use speech enhancement to improve quality.
Our method is based on predicting the distribution of the current frame from the recent past. This gives more accurate statistical priors for the reconstruction/enhancement task than reconstruction without past information, and consequently better output quality. We have explored such prediction with Gaussian, GMM, and neural network models, and concluded that a simple Gaussian is a reasonable approximation for the low-complexity approach. Improved neural network models remain future work.
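As a simplified numerical illustration, the sketch below combines a Gaussian prior predicted from recent frames with a noisy observation via the standard Gaussian posterior mean. The mean/variance predictors and the observation model are illustrative placeholders for the models in the papers listed below.

```python
import numpy as np

def gaussian_postfilter(observed, past_frames, obs_var):
    """observed: noisy log-spectrum of the current frame (per-bin vector);
    past_frames: array (n_past, n_bins) of recently decoded frames;
    obs_var: per-bin variance of the corruption (e.g., quantization noise)."""
    prior_mean = past_frames.mean(axis=0)      # prediction from the recent past
    prior_var = past_frames.var(axis=0) + 1e-6
    # Posterior mean under a Gaussian prior and Gaussian observation model:
    w = prior_var / (prior_var + obs_var)
    return w * observed + (1.0 - w) * prior_mean
```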
As source models, we have studied:
- Multi-channel estimation https://arxiv.org/pdf/2011.03810
- Spectral log-power envelope models https://research.aalto.fi/files/27812283/ELEC_das_et_al_Postfiltering_Using_Interspeech.pdf
- Fundamental frequency models in the log-power domain, optimized separately for each frequency https://isca-speech.org/archive/Interspeech_2020/pdfs/1067.pdf
- Phase models in the MDCT and STFT domains https://www.researchgate.net/profile/Tom_Baeckstroem/publication/327389332_Postfiltering_with_Complex_Spectral_Correlations_for_Speech_and_Audio_Coding/links/5be4116f299bf1124fc34e68/Postfiltering-with-Complex-Spectral-Correlations-for-Speech-and-Audio-Coding.pdf and http://www.essv.de/pdf/2020_109_116.pdf
- Prediction using GMM models (though this paper is not about enhancement) https://ieeexplore.ieee.org/abstract/document/8461527/
By Sneha Das and Tom Bäckström
Authentication of devices in the same room using acoustic fingerprints
We have plenty of devices, yet we typically use only one at a time. If devices were seamlessly interoperable, such that, for example, a teleconference could use all nearby microphones, interaction quality could improve in two ways: audio quality, through better sampling of the acoustic space, and service quality, since we would not need to think about which devices can handle which kinds of services.
A first step in this direction is to identify which devices may be allowed to interact. Typically, devices near each other can interact; however, devices can be near each other yet in different rooms. A better criterion for authentication is therefore that devices in the same room can interact, or better yet, that devices can interact when the acoustic distance between them is short enough.
For this purpose, we have designed an acoustic fingerprint that quantifies characteristics of the acoustic environment, such that we can determine whether two devices are in the same room by comparing fingerprints. The fingerprint should typically include both temporal and stationary information: temporal events (transients) are very often specific to a particular environment (no two rooms have the same activity and thus the same noises), while stationary information such as room reverberation characteristics is needed to discriminate against broadcast sounds such as TV audio. Our experiments show that in typical low-noise scenarios with microphones less than 5 m apart, fingerprints typically have 85% identical bits.
The final step is to compare fingerprints between devices. Observe that we cannot simply transmit the fingerprints, since that would leak potentially private information. Instead, we must use cryptographic primitives that allow comparison of private data, revealing only the result of the comparison and not the data itself.
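Stripped of the cryptographic layer, the comparison reduces to a bit-agreement test, as in the sketch below; in deployment this must run under a privacy-preserving protocol, which is omitted here. The threshold follows the ~85% same-room statistic quoted above.

```python
import numpy as np

def same_room(fp_a, fp_b, threshold=0.85):
    """fp_a, fp_b: equal-length binary fingerprint arrays (values 0/1).
    Declares 'same room' when the fraction of identical bits reaches
    the threshold."""
    agreement = np.mean(fp_a == fp_b)
    return agreement >= threshold
```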
- Fingerprints reviewed https://aaltodoc.aalto.fi/handle/123456789/46766
- Fingerprints (first paper) https://aaltodoc.aalto.fi/handle/123456789/40456
- Provable consent (check afterwards that consent was given) https://research.aalto.fi/files/51759985/Sigg_ProvableConsent.pdf
By Pablo Pérez Zarazaga, Tom Bäckström and Stephan Sigg
Speech and audio coding for ad-hoc sensor networks
As described above in the context of acoustic fingerprints, seamless interoperability between nearby devices could improve both audio quality (through better sampling of the acoustic space) and service quality in applications such as teleconferencing.
A necessary component of such interaction between devices is a method for efficient transmission of information; in essence, we need speech and audio coding methods to compress information for transmission. Conventional coding methods are clearly useful here, but we need to extend them to take advantage of multiple devices. In particular, observe that it would not be useful to send the same data from multiple devices: for example, if two devices send the same fundamental frequency, we could omit the second one to save bandwidth without loss of quality. However, choosing which information to send would require coordination among an unknown number of devices, which is potentially complicated. Instead, we have opted for a design based on independent devices, which do not share information but transmit their data using randomization (dithering) to make the sources as independent as possible (see the sketch below).
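A minimal sketch of subtractive dithering follows; the uniform dither, step size, and seed-sharing scheme are illustrative assumptions rather than the exact methods of the papers listed below.

```python
import numpy as np

def dithered_quantize(x, step, seed):
    """Subtractive dithering: a pseudo-random offset, reproducible from a
    per-device seed shared with the receiver, decorrelates quantization
    errors across devices."""
    rng = np.random.default_rng(seed)                  # per-device seed
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    quantized = step * np.round((x + dither) / step)   # what gets transmitted
    return quantized - dither                          # receiver-side subtraction
```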
- Dithered coding for ad-hoc sensor networks https://research.aalto.fi/files/27811883/dithering2.pdf
- Dithering methods https://ieeexplore.ieee.org/abstract/document/8052578/
- End-to-end optimization of source models for coding https://research.aalto.fi/files/37082504/ELEC_Backstrom_End_to_end_Interspeech.pdf
- Coding based on GMMs https://ieeexplore.ieee.org/abstract/document/8461527/
- Position paper https://arxiv.org/pdf/1811.05720
- Envelope modelling for super-wideband coding https://pdfs.semanticscholar.org/1f66/d8c5cc623d7a2e43a260672270d03274579e.pdf
- Optimal overlap-add windowing for coding https://arxiv.org/pdf/1902.01053
- See also above methods for decoder-side postfiltering
By Tom Bäckström, Johannes Fischer (International Audio Laboratories Erlangen), Sneha Das, Srikanth Korse (International Audio Laboratories Erlangen)
Experience of privacy
When you tell a secret to a friend, you whisper. People are naturally attuned to the level of privacy: they intuitively know how private a scenario is and subconsciously modify their behavior according to the perceived privacy of their surroundings. We cannot tell secrets in a public space; it is obvious to us. We call this effect the experience of privacy. Observe that such experience is correlated with the actual level of privacy, but not strictly bound to it: there could be an eavesdropper nearby without our knowledge, such that we unknowingly reveal secrets, or we could be overly paranoid and refuse to tell secrets even in the absence of eavesdroppers. Both the subjective and the objective sense of privacy therefore have to be taken into account.
To design speech interfaces that respect our privacy, they have to understand both subjective and objective privacy: which kinds of environments feel private to us, and which threats to privacy are real. Moreover, they have to be able to act according to those levels of privacy.
- Database for quantifying experience of privacy among users https://research.aalto.fi/files/34916518/ELEC_Zarazaga_Sound_Privacy_Interspeech.pdf and http://www.interspeech2020.org/uploadfile/pdf/Thu-3-3-2.pdf
- Popular science style review paper https://www.vde.com/resource/blob/1991012/07662bec66907573ab254c3d99394ec7/itg-news-juli-oktober-2020-data.pdf
- User-interface study of privacy in speech interfaces https://fruct.org/publications/acm27/files/Yea.pdf and https://trepo.tuni.fi/bitstream/handle/10024/120072/YeasminFarida.pdf?sequence=2
- Privacy in teleconferencing https://arxiv.org/pdf/2010.09488
- See also acoustic fingerprint papers above
By Sneha Das, Pablo Pérez Zarazaga, Anna Leschanowsky, Farida Yeasmin, Tom Bäckström and others
Speaker identification, verification and spoofing
Voice-quality Features for Deep Neural Network Based Speaker Verification Systems
Jitter and shimmer are voice-quality features that have been used successfully to detect voice pathologies and to classify speaking styles. We therefore investigate the usefulness of such voice-quality features in neural network based speaker verification systems. To combine the two feature sets, the cosine distance scores estimated from each are linearly weighted to obtain a single fused score, which is used to accept or reject a given speaker. Experimental results on the VoxCeleb-1 dataset demonstrate that fusing the cosine distance scores from the mel-spectrogram and voice-quality features yields an 11% relative improvement in equal error rate (EER) compared to a baseline based only on mel-spectrogram features.
The main contribution of this work is the proposal of voice-quality features for deep learning based speaker verification. Jitter and shimmer measurements show significant differences between speaking styles, and since these features have shown potential for characterizing pathological voices and linguistic abnormalities, they can also be employed to characterize a particular speaker. The voice-quality features are used alongside short-term mel-spectrogram features, with fusion carried out at the score level, i.e., the cosine distance scores from the mel-spectrogram and voice-quality models are linearly weighted (a minimal sketch follows).
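Below is a minimal sketch of this score-level fusion, assuming precomputed speaker embeddings; the fusion weight and decision threshold are illustrative and would in practice be tuned on development data. The same fusion pattern applies to the anti-spoofing work described further below.

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    """Cosine similarity between an enrollment and a test embedding."""
    return np.dot(emb_enroll, emb_test) / (
        np.linalg.norm(emb_enroll) * np.linalg.norm(emb_test))

def fused_decision(score_mel, score_vq, w=0.7, threshold=0.5):
    """Linearly weight the mel-spectrogram and voice-quality subsystem
    scores, then threshold to accept or reject the claimed speaker."""
    fused = w * score_mel + (1.0 - w) * score_vq
    return fused >= threshold  # True = accept the claimed identity
```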
- Voice-quality papers:
  - Speaker diarization: http://www.odyssey2016.com/papers/pdfs_stamped/18.pdf
  - Speaker clustering: https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0339.PDF
- Long-term features for diarization: https://link.springer.com/article/10.1186/s13636-018-0140-x
Speaker recognition and speaker diarization are closely interrelated.
By Abraham Woubie Zewoudie, Lauri Koivisto and Tom Bäckström
To What Extent do Voice-quality Features Enhance the Performance of Anti-spoofing Systems?
Automatic speaker verification (ASV) technology is currently widely used in a range of applications that require not only robustness to changes in the acoustic environment, but also resilience to intentional circumvention, known as spoofing. Replay attacks are a key concern among the possible attack vectors: they are easy to perform, and the threat they pose to ASV reliability has been confirmed in several studies. Replay attacks are mounted using recordings of a target speaker's voice, replayed to an ASV system in place of genuine speech; a prime example is using a smart device to replay a recording of a target speaker's voice to unlock a smartphone protected by ASV access control. In this work, we explore to what extent voice-quality features help ASV systems combat replay attacks. The impact of voice quality is analyzed by fusing the voice-quality features with state-of-the-art anti-spoofing features such as constant Q cepstral coefficients (CQCCs).
By Abraham Woubie Zewoudie and Tom Bäckström
Group Members
Teaching
Our department provides the following courses in speech and language technology:
- ELEC-E5500 Speech Processing
- ELEC-E5510 Speech Recognition
- ELEC-E5521 Speech and Language Processing Methods
- ELEC-E5550 Statistical Natural Language Processing
- ELEC-E5541 Special Assignment in Speech and Language Processing
- ELEC-C5341 Äänen- ja puheenkäsittely (Sound and Speech Processing, taught in Finnish)
Project topics for Bachelor theses, Master’s theses, and special assignments
We are always open to suggestions for project topics, especially those related to our current research described above. To help you find an exciting topic, we maintain a list of suggested project topics on the Special Assignment page. Note that although that page concerns special assignment projects, most topics can also be scaled to Bachelor's and Master's theses.
Resources
- YouTube page
- GitHub page
- Introduction to speech processing
Latest publications
- Privacy in Speech Technology
- Privacy preservation in audio and video
- Real-Time Joint Noise Suppression and Bandwidth Extension of Noisy Reverberant Wideband Speech
- Evaluating privacy, security, and trust perceptions in conversational AI: A systematic review
- User Perspective on Anonymity in Voice Assistants – A comparison between Germany and Finland
- Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector Quantization
- Optimizing the Performance of Text Classification Models by Improving the Isotropy of the Embeddings using a Joint Loss Function
- Low-complexity Real-time Neural Network for Blind Bandwidth Extension of Wideband Speech
- Privacy and Quality Improvements in Open Offices Using Multi-Device Speech Enhancement
- The Internet of Sounds: Convergent Trends, Insights and Future Directions