Speech Interaction Technology

Research Overview
We currently specialize in the following research areas:
- Speech and audio coding, distributed coding and acoustic sensor networks
- Security and privacy in speech communication
- Acoustic authentication methods
- Speaker recognition
Research topic descriptions
Speech enhancement with speech source models using post-filtering
In speech coding, we have learned that source models are central to improving coding efficiency. It is therefore reasonable to expect that such models would be effective in speech enhancement as well. We have applied speech source models to the enhancement task to improve speech quality in noisy scenarios using low-complexity and low-delay methods. In particular, we have focused on speech coding scenarios, where speech is corrupted by quantization noise. A particular characteristic of combining speech coding and enhancement is that in speech coding it is not possible to use inter-frame information (i.e., information over the time axis, across processing windows), since any information shared across windows would either increase delay or jeopardize reconstruction in case of packet loss. Our method is therefore implemented entirely on the decoder side: we treat quantization noise at the decoder as additive noise and apply speech enhancement to improve quality.
Our method is based on predicting the distribution of the current frame from the recent past. This gives us more accurate statistical priors for the reconstruction/enhancement task than reconstruction without past information, and consequently better quality at the output. We have explored such prediction with Gaussian, GMM and neural network models, and concluded that a simple Gaussian is a reasonable approximation for the low-complexity approach. Improved neural network models remain future work.
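To make the Gaussian variant concrete, the following is a minimal sketch of conditioning a prior for the current frame on the observed past frame and then applying an MMSE (Wiener-style) cleanup to the decoded frame. The function names, the diagonal noise model, and the random stand-in data are illustrative assumptions, not our actual implementation:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_past, d):
    """Condition a joint Gaussian over [past, current] frame features on
    the observed past frame. mu: (2d,), Sigma: (2d, 2d), x_past: (d,)."""
    mu_p, mu_c = mu[:d], mu[d:]
    S_pp, S_cp, S_cc = Sigma[:d, :d], Sigma[d:, :d], Sigma[d:, d:]
    gain = S_cp @ np.linalg.inv(S_pp)
    mu_cond = mu_c + gain @ (x_past - mu_p)
    Sigma_cond = S_cc - gain @ S_cp.T
    return mu_cond, Sigma_cond

def enhance_frame(y, mu_cond, Sigma_cond, noise_var):
    """MMSE (Wiener-style) estimate of the clean frame from the decoded
    frame y, under the predicted Gaussian prior and additive noise with
    per-coefficient variance noise_var (modelling quantization noise)."""
    prior_var = np.diag(Sigma_cond)
    wiener = prior_var / (prior_var + noise_var)
    return mu_cond + wiener * (y - mu_cond)

# Statistics would be trained offline on pairs of consecutive frames;
# random data stands in for real log-spectral features here.
d = 8
rng = np.random.default_rng(0)
pairs = rng.standard_normal((1000, 2 * d))
mu, Sigma = pairs.mean(axis=0), np.cov(pairs.T)
mu_c, Sigma_c = conditional_gaussian(mu, Sigma, pairs[0, :d], d)
x_hat = enhance_frame(pairs[0, d:], mu_c, Sigma_c, noise_var=0.1)
```

Note that the estimate uses only past frames already available at the decoder, so it adds no delay and no inter-frame dependency at the encoder.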
As source models, we have studied:
- Multi-channel estimation https://arxiv.org/pdf/2011.03810
- Spectral log-power envelope models https://research.aalto.fi/files/27812283/ELEC_das_et_al_Postfiltering_Using_Interspeech.pdf
- Fundamental frequency models in the log-power domain, optimized separately for each frequency https://isca-speech.org/archive/Interspeech_2020/pdfs/1067.pdf
- Phase models in the MDCT and STFT domains https://www.researchgate.net/profile/Tom_Baeckstroem/publication/327389332_Postfiltering_with_Complex_Spectral_Correlations_for_Speech_and_Audio_Coding/links/5be4116f299bf1124fc34e68/Postfiltering-with-Complex-Spectral-Correlations-for-Speech-and-Audio-Coding.pdf and http://www.essv.de/pdf/2020_109_116.pdf
- Prediction using GMM models (though this paper is not about enhancement) https://ieeexplore.ieee.org/abstract/document/8461527/
By Sneha Das and Tom Bäckström
Authentication of devices in the same room using acoustic fingerprints
We have plenty of devices, yet we typically use only one at a time. If devices were seamlessly interoperable, such that, for example, a teleconference could use all nearby microphones, interaction quality could be improved: audio quality would benefit from better sampling of the acoustic space, and service quality would improve because we would not need to think about which devices can handle which kinds of services.
A first step in this direction is to identify which devices are allowed to interact. Typically, devices near each other can interact. However, devices can be near each other and still be in different rooms. A better criterion for authentication is therefore that devices in the same room can interact; better yet, devices whose acoustic distance is short enough can interact.
For this purpose, we have designed an acoustic fingerprint which quantifies characteristics of the acoustic environment, such that we can determine whether two devices are in the same room by comparing their fingerprints. The fingerprint should typically include both temporal and stationary information: temporal events (transients) are very often specific to a particular environment (no two rooms have the same activity and thus the same noises), while stationary information such as room reverberation characteristics is needed to discriminate against broadcast sounds such as TV audio. Our experiments show that in typical low-noise scenarios with microphones less than 5 m apart, fingerprints typically have 85% identical bits.
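As an illustration of the general idea (not the fingerprint from the papers listed below, but the classic Haitsma-Kalker audio-fingerprint construction as a stand-in), the following sketch derives a binary fingerprint from the signs of band-energy differences along time and frequency, and compares two fingerprints by the fraction of matching bits:

```python
import numpy as np

def band_energies(x, n_bands=16, frame_len=1024, hop=512):
    """Log energy of linearly spaced frequency bands in each frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    E = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        spectrum = np.abs(np.fft.rfft(window * x[t * hop:t * hop + frame_len])) ** 2
        E[t] = [np.log(band.sum() + 1e-12) for band in np.array_split(spectrum, n_bands)]
    return E

def fingerprint(x):
    """Binary fingerprint: sign of the band-energy difference along both
    the time and the frequency axis (Haitsma-Kalker style)."""
    E = band_energies(x)
    bits = (E[1:, 1:] - E[1:, :-1]) - (E[:-1, 1:] - E[:-1, :-1]) > 0
    return bits.ravel()

def similarity(fp_a, fp_b):
    """Fraction of identical bits; same-room devices should score high."""
    return np.mean(fp_a == fp_b)

rng = np.random.default_rng(0)
mic_a = rng.standard_normal(16000)                 # stand-in for a recording
mic_b = mic_a + 0.1 * rng.standard_normal(16000)   # second mic, same room
print(similarity(fingerprint(mic_a), fingerprint(mic_b)))  # close to 1.0
```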
The final step is to compare fingerprints between devices. Observe that we cannot simply transmit the fingerprints, since that would leak potentially private information. Instead, we have to use cryptographic primitives which allow comparing private data, revealing only the result of the comparison and not the data itself.
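One primitive suited to noisy binary data like fingerprints is a fuzzy commitment. Below is a minimal sketch assuming a simple repetition code for error correction; a deployed system would use a stronger code and an authenticated exchange, and all names and parameters here are illustrative:

```python
import hashlib
import numpy as np

REP = 15  # repetition factor; corrects up to 7 flipped bits per key bit

def commit(fp, rng):
    """Device A binds a random key to its fingerprint fp (uint8 bits).
    Only the key hash and the masked codeword leave the device."""
    key = rng.integers(0, 2, len(fp) // REP, dtype=np.uint8)
    codeword = np.repeat(key, REP)
    mask = codeword ^ fp[:len(codeword)]
    return hashlib.sha256(key.tobytes()).hexdigest(), mask

def open_commitment(fp, key_hash, mask):
    """Device B recovers the key iff its fingerprint is close enough:
    majority voting decodes the repetition code despite bit errors."""
    noisy = mask ^ fp[:len(mask)]
    key = (noisy.reshape(-1, REP).sum(axis=1) > REP // 2).astype(np.uint8)
    return hashlib.sha256(key.tobytes()).hexdigest() == key_hash

rng = np.random.default_rng(1)
fp_a = rng.integers(0, 2, 300, dtype=np.uint8)
fp_b = fp_a ^ (rng.random(300) < 0.1).astype(np.uint8)  # ~90% identical bits
key_hash, mask = commit(fp_a, rng)
print(open_commitment(fp_b, key_hash, mask))            # True: same room
```

The code parameters would be tuned to the observed bit-error rates, so that same-room fingerprint agreement succeeds while across-room pairs fail.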
- Fingerprints reviewed https://aaltodoc.aalto.fi/handle/123456789/46766
- Fingerprints (first paper) https://aaltodoc.aalto.fi/handle/123456789/40456
- Provable consent (check afterwards that consent was given) https://research.aalto.fi/files/51759985/Sigg_ProvableConsent.pdf
By Pablo Pérez Zarazaga, Tom Bäckström and Stephan Sigg
Speech and audio coding for ad-hoc sensor networks
As noted above, we have plenty of devices yet typically use only one at a time; seamless interoperability between nearby devices, such as using all nearby microphones in a teleconference, could improve both audio quality and service quality.
A necessary component of such interaction between devices is a method for efficient transmission of information between them. In essence, we need speech and audio coding methods to compress the information for transmission. Conventional coding methods are clearly useful here, but we need to extend them to take advantage of multiple devices. In particular, observe that it would not be useful to send the same data from multiple devices; for example, if two devices send the same fundamental frequency, we could omit the second one to save bandwidth without loss of quality. However, choosing which information to send would require coordination between an unknown number of devices, which is potentially complicated. Instead, we have opted for a design based on independent devices, which do not share information with each other; they simply transmit their data, and we use randomization (dithering) to make the coding errors of the different sources as independent as possible.
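To illustrate the randomization, here is a minimal sketch of subtractive dithered quantization, where each device draws its own dither so that quantization errors become independent across devices and average out at the receiver. The parameter values and shared-seed dither generation are illustrative assumptions:

```python
import numpy as np

def dithered_quantize(x, step, rng):
    """Subtractively dithered uniform quantization: each device draws its
    own dither sequence, so quantization errors across devices are
    (approximately) independent."""
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    indices = np.round((x + dither) / step)   # this is what gets transmitted
    return indices, dither

def dithered_dequantize(indices, dither, step):
    """The receiver regenerates the dither from a shared random seed and
    subtracts it after reconstruction."""
    return indices * step - dither

rng_a, rng_b = np.random.default_rng(1), np.random.default_rng(2)
x = np.sin(np.linspace(0, 2 * np.pi, 1000))   # same signal at two devices
step = 0.25
x_a = dithered_dequantize(*dithered_quantize(x, step, rng_a), step)
x_b = dithered_dequantize(*dithered_quantize(x, step, rng_b), step)
# Averaging independent reconstructions roughly halves the error variance.
print(np.var(x - x_a), np.var(x - (x_a + x_b) / 2))
```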
- Dithered coding for ad-hoc sensor networks https://research.aalto.fi/files/27811883/dithering2.pdf
- Dithering methods https://ieeexplore.ieee.org/abstract/document/8052578/
- End-to-end optimization of source models for coding https://research.aalto.fi/files/37082504/ELEC_Backstrom_End_to_end_Interspeech.pdf
- Coding based on GMMs https://ieeexplore.ieee.org/abstract/document/8461527/
- Position paper https://arxiv.org/pdf/1811.05720
- Envelope modelling for super-wide band https://pdfs.semanticscholar.org/1f66/d8c5cc623d7a2e43a260672270d03274579e.pdf
- Optimal overlap-add windowing for coding https://arxiv.org/pdf/1902.01053
- See also the decoder-side postfiltering methods above
By Tom Bäckström, Johannes Fischer (International Audio Laboratories Erlangen), Sneha Das, Srikanth Korse (International Audio Laboratories Erlangen)
Experience of privacy
When you tell a secret to a friend, you whisper. People are naturally attuned to the level of privacy: they intuitively sense how private a scenario is and subconsciously modify their behavior in accordance with the perceived privacy of their surroundings. We cannot tell secrets in a public space; it is obvious to us. We call this effect our experience of privacy. Observe that this experience is correlated with the actual level of privacy, but not strictly bound to it: there could be an eavesdropper nearby without our knowledge, such that we unknowingly reveal secrets, or we could be overly paranoid and refuse to tell secrets even in the absence of eavesdroppers. Therefore, both the subjective and the objective sense of privacy have to be taken into account.
Speech interfaces which respect our privacy have to understand both such subjective and objective privacy. They have to understand which kinds of environments feel private to us, they have to try to identify real threats to privacy, and they have to be able to act according to those levels of privacy.
- Database for quantifying experience of privacy among users https://research.aalto.fi/files/34916518/ELEC_Zarazaga_Sound_Privacy_Interspeech.pdf and http://www.interspeech2020.org/uploadfile/pdf/Thu-3-3-2.pdf
- Popular science style review paper https://www.vde.com/resource/blob/1991012/07662bec66907573ab254c3d99394ec7/itg-news-juli-oktober-2020-data.pdf
- User-interface study of privacy in speech interfaces https://fruct.org/publications/acm27/files/Yea.pdf and https://trepo.tuni.fi/bitstream/handle/10024/120072/YeasminFarida.pdf?sequence=2
- Privacy in teleconferencing https://arxiv.org/pdf/2010.09488
- See also acoustic fingerprint papers above
By Sneha Das, Pablo Pérez Zarazaga, Anna Leschanowsky, Farida Yeasmin, Tom Bäckström and others
Speaker identification, verification and spoofing
Voice-quality Features for Deep Neural Network-Based Speaker Verification Systems
Jitter and shimmer are voice-quality features which have been successfully used to detect voice pathologies and to classify different speaking styles. We therefore investigate the usefulness of such voice-quality features in neural-network-based speaker verification systems. To combine the two feature sets, the cosine distance scores estimated from them are linearly weighted to obtain a single fused score, which is used to accept or reject a given speaker. Experimental results on the VoxCeleb-1 dataset demonstrate that fusing the cosine distance scores extracted from the mel-spectrogram and voice-quality features provides an 11% relative improvement in Equal Error Rate (EER) compared to a baseline system based only on mel-spectrogram features.
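For reference, jitter and shimmer measure cycle-to-cycle variability of the pitch period and of the peak amplitude, respectively. A minimal sketch of the local (first-order) variants, assuming pitch periods and cycle amplitudes have already been extracted by a pitch tracker:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    relative to the mean period (often reported in percent)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same measure applied to cycle peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# e.g., pitch periods in milliseconds from a voiced segment
print(local_jitter([8.0, 8.1, 7.9, 8.2, 8.0]))   # ~0.025, i.e. 2.5%
```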
The main contribution of this work is the use of voice-quality features in deep-learning-based speaker verification. Jitter and shimmer measurements show significant differences between speaking styles, and since these features have shown potential for characterizing pathological voices and linguistic abnormalities, they can also be employed to characterize a particular speaker. The voice-quality features are used together with short-term mel-spectrogram features, and the fusion is carried out at the score level, i.e., the cosine distance scores obtained with the mel-spectrogram and voice-quality models are linearly weighted.
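The score-level fusion itself is straightforward. A minimal sketch, where the embeddings are assumed to come from the two trained networks, and the weight and threshold are hypothetical values that would be tuned on development data:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between an enrollment and a test embedding."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def fused_score(mel_test, mel_enroll, vq_test, vq_enroll, w=0.8):
    """Linearly weighted fusion of the two cosine scores; the weight w
    (hypothetical value) would be tuned on a development set."""
    return (w * cosine_score(mel_test, mel_enroll)
            + (1 - w) * cosine_score(vq_test, vq_enroll))

# Accept the trial if the fused score exceeds a threshold chosen on
# development data, e.g. at the equal-error-rate operating point.
def verify(score, threshold=0.5):
    return score >= threshold
```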
- Voice-quality papers for:
  - Speaker diarization: http://www.odyssey2016.com/papers/pdfs_stamped/18.pdf
  - Speaker clustering: https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0339.PDF
- Long-term features for diarization
Speaker recognition and speaker diarization are closely interrelated.
By Abraham Woubie Zewoudie, Lauri Koivisto and Tom Bäckström
To What Extent do Voice-quality Features Enhance the Performance of Anti-spoofing Systems?
Automatic speaker verification (ASV) technology is currently used in a wide range of applications which require not only robustness to changes in the acoustic environment, but also resilience to intentional circumvention, known as spoofing. Replay attacks are a key concern among the possible attack vectors: they can be performed with ease, and the threat they pose to ASV reliability has been confirmed in several studies. Replay attacks are mounted using recordings of a target speaker's voice, which are replayed to an ASV system in place of genuine speech; a prime example is using a smart device to replay a recording of a target speaker's voice to unlock a smartphone protected by ASV access control. In this work, we explore to what extent voice-quality features help ASV systems combat replay attacks. The impact of voice-quality features is analyzed by fusing them with state-of-the-art anti-spoofing features such as Constant Q Cepstral Coefficients (CQCCs).
By Abraham Woubie Zewoudie and Tom Bäckström
Teaching
Our department provides the following courses on speech and language technology:
- ELEC-E5500 Speech Processing
- ELEC-E5510 Speech Recognition
- ELEC-E5550 Statistical Natural Language Processing
- ELEC-E5521 Speech and Language Processing Methods
Research group's YouTube page
