Department of Signal Processing and Acoustics

Speech Interaction Technology

Our objective is to improve spoken interaction, both in applications such as telecommunication and in speech interfaces. We develop methods that use resources efficiently and sustainably, and that provide high sound quality and intuitive interaction while preserving the privacy of users. A particular area of interest is environments where multiple people interact with multiple devices, which require advanced methods for communication, authentication and processing.

Research Overview

We currently specialize in the following research areas:

  • Speech and audio coding, distributed coding and acoustic sensor networks
  • Security and privacy in speech communication
  • Acoustic authentication methods
  • Speaker recognition

Research topic descriptions

Speech enhancement with speech source models using post-filtering

In speech coding, we have learned that source models are very important for improving efficiency. It is therefore reasonable to assume that such models would also be effective in speech enhancement. We have applied speech source models to the speech enhancement task to improve speech quality in noisy scenarios with low-complexity and low-delay methods. In particular, we have focused on speech coding scenarios, where speech is corrupted by quantization noise. A particular characteristic of the combination of speech coding and enhancement is that in speech coding it is not possible to use inter-frame information (i.e. information over the time axis, across processing windows), as any information shared across windows would either increase delay or jeopardize reconstruction in case of packet loss. Therefore our application in the speech coding scenario is implemented entirely on the decoder side. In other words, we treat the quantization error at the decoder as noise and use speech enhancement to improve quality.

Our method is based on predicting the distribution of the current frame from the recent past. This gives us more accurate statistical priors for the reconstruction and enhancement task than reconstruction without past information, and consequently better quality at the output. We have explored such prediction with Gaussian, GMM and neural network models, and concluded that a simple Gaussian is a reasonable approximation in the low-complexity approach. Improved neural network models remain future work.
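To illustrate why a more accurate prior helps, the following is a minimal sketch, not the method used in our publications. It assumes a scalar signal, a Gaussian prior whose mean and variance stand in for a prediction from the previous frame, and additive Gaussian noise as a stand-in for quantization noise. All names and parameter values (gaussian_posterior_mean, simulate, prior_mean, prior_var, noise_var) are hypothetical.

```python
import random


def gaussian_posterior_mean(y, prior_mean, prior_var, noise_var):
    """MMSE estimate of x from y = x + n, with x ~ N(prior_mean, prior_var)
    and n ~ N(0, noise_var). This is the scalar Wiener-type gain rule."""
    gain = prior_var / (prior_var + noise_var)
    return prior_mean + gain * (y - prior_mean)


def simulate(num_samples=20000, prior_mean=1.0, prior_var=0.25,
             noise_var=1.0, seed=0):
    """Compare the error of the noisy observation with the enhanced estimate.

    prior_mean and prior_var are assumed values that play the role of a
    prediction made from the recent past.
    """
    rng = random.Random(seed)
    err_noisy = 0.0
    err_enhanced = 0.0
    for _ in range(num_samples):
        x = rng.gauss(prior_mean, prior_var ** 0.5)      # clean value
        y = x + rng.gauss(0.0, noise_var ** 0.5)         # degraded value
        x_hat = gaussian_posterior_mean(y, prior_mean, prior_var, noise_var)
        err_noisy += (y - x) ** 2
        err_enhanced += (x_hat - x) ** 2
    return err_noisy / num_samples, err_enhanced / num_samples


mse_noisy, mse_enhanced = simulate()
print("MSE without prior:", mse_noisy)
print("MSE with Gaussian prior:", mse_enhanced)
```

Under these assumptions, the narrower the predicted prior is relative to the noise, the more the estimate is pulled towards the prediction and the lower the remaining error.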

As source models, we have studied:

By Sneha Das and Tom Bäckström

Authentication of devices in the same room using acoustic fingerprints

We have plenty of devices, yet we typically use only one at a time. If devices were seamlessly interoperable, such that, for example, we could use all nearby microphones in a teleconference, then the interaction quality could be improved. The audio quality would improve through better sampling of the acoustic space, and service quality would improve since we would not need to think about which devices can handle which kinds of services.

A first step in this direction is to identify which devices can be allowed to interact. Typically, devices which are near each other can interact. However, devices can be near each other even though they are in different rooms. Therefore, a better criterion for authentication is that devices in the same room can interact. Better yet, devices whose acoustic distance is short enough can interact.

For this purpose, we have designed an acoustic fingerprint, which quantifies characteristics of the acoustic environment, such that we can determine if two devices are in the same room by comparing fingerprints. The fingerprint should typically include both temporal and stationary information. Temporal events (transients) are very often specific to a particular environment (no two rooms have the same activity and thus the same noises), but stationary information such as room reverberation characteristics is also needed to tell rooms apart when the same broadcast sound, such as a TV program, is heard in both. Our experiments show that in typical low-noise scenarios with microphones at a distance of less than 5 m, fingerprints typically have 85% identical bits.

The final step is to compare fingerprints between devices. Observe that we cannot just transmit the fingerprints, since that would leak information which is potentially private. Instead, we have to use cryptographic primitives which allow comparison of private data, without revealing that data but only the result of the comparison.
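Setting the cryptographic protection aside, the comparison itself reduces to counting matching bits. The sketch below illustrates only that similarity measure, computed in the clear. It is not a privacy-preserving protocol, and the function names, the example bit patterns and the decision threshold are all hypothetical.

```python
def fraction_identical_bits(fp_a, fp_b):
    """Return the fraction of positions where two bit sequences agree."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / len(fp_a)


def same_room(fp_a, fp_b, threshold=0.75):
    """Decide whether two fingerprints are similar enough.

    The threshold is an assumed value chosen for illustration only.
    """
    return fraction_identical_bits(fp_a, fp_b) >= threshold


# Hypothetical 8-bit fingerprints from three devices.
device_1 = [1, 0, 1, 1, 0, 0, 1, 0]
device_2 = [1, 0, 1, 0, 0, 0, 1, 0]   # differs from device_1 in one bit
device_3 = [0, 1, 0, 0, 1, 1, 0, 0]   # differs from device_1 in seven bits

print(fraction_identical_bits(device_1, device_2))  # 0.875
print(same_room(device_1, device_2))                # True
print(same_room(device_1, device_3))                # False
```

In a deployed system, this comparison would have to be carried out without either device revealing its fingerprint to the other.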

By Pablo Pérez Zarazaga, Tom Bäckström and Stephan Sigg

Speech and audio coding for ad-hoc sensor networks

We have plenty of devices, yet we typically use only one at a time. If devices were seamlessly interoperable, such that, for example, we could use all nearby microphones in a teleconference, then the interaction quality could be improved. The audio quality would improve through better sampling of the acoustic space, and service quality would improve since we would not need to think about which devices can handle which kinds of services.

A necessary component of such interaction is a method for efficient transmission of information between devices. In essence, we need speech and audio coding methods to compress information for efficient transmission. Conventional coding methods are clearly useful here, but we need to extend them to take advantage of multiple devices. In particular, observe that it would not be useful to send the same data from multiple devices. For example, if two devices send the same fundamental frequency, then we could omit the second one to save bandwidth without loss of quality. However, choosing which information to send would require interaction between an unknown number of devices, which is potentially complicated. Instead, we have opted to base our design on independent devices, which do not share information with each other but only transmit their own data, and we use randomization to make the transmitted sources as independent as possible.
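One way to see why randomization helps is a toy experiment with subtractive dithered quantization. This is a generic textbook technique used here as a stand-in, not a description of our codec. The sketch assumes that every device observes exactly the same value, that each device adds its own random dither before quantizing, and that the decoder can regenerate that dither (for example from a shared seed). All names and parameter values are hypothetical.

```python
import random


def dithered_quantize(x, step, rng):
    """Quantize x with subtractive dither and return the decoded value.

    Only the integer index would be transmitted; the decoder is assumed
    to know the dither value.
    """
    dither = rng.uniform(-step / 2, step / 2)
    index = round((x + dither) / step)    # transmitted integer
    return index * step - dither          # decoder-side reconstruction


def simulate(num_devices=4, num_samples=5000, step=0.5, seed=0):
    """Mean squared error of one device versus the average of all devices."""
    rng = random.Random(seed)
    err_single = 0.0
    err_fused = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)           # value observed by every device
        decoded = [dithered_quantize(x, step, rng)
                   for _ in range(num_devices)]
        fused = sum(decoded) / num_devices
        err_single += (decoded[0] - x) ** 2
        err_fused += (fused - x) ** 2
    return err_single / num_samples, err_fused / num_samples


mse_single, mse_fused = simulate()
print("MSE, one device:", mse_single)
print("MSE, average of all devices:", mse_fused)
```

Because each device uses an independent dither, the quantization errors differ between devices and partly cancel when the decoded values are averaged. Without randomization, devices observing the same value would make the same error, and the additional transmissions would add nothing.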

By Tom Bäckström, Johannes Fischer (International Audio Laboratories Erlangen), Sneha Das, Srikanth Korse (International Audio Laboratories Erlangen)

Experience of privacy

When you tell a secret to a friend, you whisper. People are naturally attuned to the level of privacy: they intuitively know how private a scenario is and subconsciously modify their behavior in accordance with the perceived privacy of their surroundings. It is obvious to us that we cannot tell secrets in a public space. We call this effect our experience of privacy. Observe that this experience is correlated with the actual level of privacy, but not strictly bound to it. There could be a hidden eavesdropper nearby without our knowledge, such that we unknowingly reveal secrets, or we could be overly paranoid and refuse to tell secrets even in the absence of eavesdroppers. Therefore, both the subjective and the objective sense of privacy have to be taken into account.

To respect our privacy, speech interfaces have to understand both subjective and objective privacy. They have to understand which kinds of environments feel private to us, and they have to try to identify real threats to privacy. Moreover, they have to be able to act according to those levels of privacy.

By Sneha Das, Pablo Pérez Zarazaga, Anna Leschanowsky, Farida Yeasmin, Tom Bäckström and others

Speaker identification, verification and spoofing

Voice-quality Features for Deep Neural Network Based Speaker Verification Systems.

Jitter and shimmer are voice-quality features which have been successfully used to detect voice pathologies and classify different speaking styles. We therefore investigate the usefulness of such voice-quality features in neural-network-based speaker verification systems, alongside conventional mel-spectrogram features. To combine these two sets of features, the cosine distance scores estimated from the two sets are linearly weighted to obtain a single, fused score. The fused score is used to accept or reject a given speaker. Experiments on the VoxCeleb1 dataset demonstrate that fusing the cosine distance scores extracted from the mel-spectrogram and voice-quality features provides an 11% relative improvement in Equal Error Rate (EER) compared to the baseline system, which is based only on mel-spectrogram features.

The main contribution of this work is that we propose the use of voice-quality features for deep-learning-based speaker verification systems. We are interested in voice-quality features since jitter and shimmer measurements show significant differences between different speaking styles. Since these features have shown potential for characterizing pathological voices and linguistic abnormalities, they can also be employed to characterize a particular speaker. The voice-quality features are used together with the short-term mel-spectrogram features. The fusion of the voice-quality features with the mel-spectrogram features is carried out at the score level, i.e., the cosine distance scores extracted using the mel-spectrogram and voice-quality models are linearly weighted.
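The score-level fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published system: the embeddings are made-up vectors, the score is computed as cosine similarity between an enrolled and a test embedding, and the fusion weight and decision threshold are hypothetical values that would in practice be tuned on development data.

```python
import math


def cosine_score(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuse_scores(score_mel, score_vq, weight=0.8):
    """Linearly weight the mel-spectrogram and voice-quality scores.

    weight is an assumed value in [0, 1].
    """
    return weight * score_mel + (1.0 - weight) * score_vq


def accept(score, threshold=0.5):
    """Accept the claimed speaker if the fused score reaches the threshold."""
    return score >= threshold


# Hypothetical embeddings for one enrolled speaker and one test utterance,
# one pair per feature set.
enrolled_mel, test_mel = [0.2, 0.9, 0.4], [0.25, 0.8, 0.5]
enrolled_vq, test_vq = [0.7, 0.1], [0.6, 0.3]

score_mel = cosine_score(enrolled_mel, test_mel)
score_vq = cosine_score(enrolled_vq, test_vq)
fused = fuse_scores(score_mel, score_vq)
print("fused score:", fused, "accepted:", accept(fused))
```

Fusing at the score level keeps the two models independent, so either feature set can be retrained or replaced without changing the other.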

Speaker recognition and speaker diarization are very much interrelated.

By Abraham Woubie Zewoudie, Lauri Koivisto and Tom Bäckström

To What Extent do Voice-quality Features Enhance the Performance of Anti-spoofing Systems?

Automatic speaker verification (ASV) technology is currently used in a wide range of applications, which require not only robustness to changes in the acoustic environment but also resilience to intentional circumvention, known as spoofing. Replay attacks are a key concern among the possible attack vectors: they can be performed with ease, and the threat they pose to ASV reliability has been confirmed in several studies. Replay attacks are mounted using recordings of a target speaker’s voice, which are replayed to an ASV system in place of genuine speech. A prime example is the use of a smart device to replay a recording of a target speaker’s voice in order to unlock a smartphone that uses ASV for access control. In this work, we therefore explore to what extent voice-quality features help ASV systems to combat replay attacks. The impact of voice-quality features is analyzed by fusing them with state-of-the-art anti-spoofing features such as Constant Q Cepstral Coefficients (CQCCs).

By Abraham Woubie Zewoudie and Tom Bäckström

Teaching

Our department provides the following courses on speech and language technology: 

We are part of the major in Machine Learning, Data Science and Artificial Intelligence (Macadamia) in the Master's Programme in Computer, Communication and Information Sciences.

Research group's youtube page

http://www.youtube.com/channel/UC5eUH2UGJ7UjqJ_MsSOLIQQ

Latest publications

Evaluation of Zero Frequency Filtering based Method for Multi-pitch Streaming of Concurrent Speech Signals

Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri 2021 28th European Signal Processing Conference, EUSIPCO 2020 - Proceedings

Privacy in Speech Communication Technology

Tom Bäckström, Pablo Perez Zarazaga, Sneha Das, Stephan Sigg 2021

PyAWNeS-Codec: Speech and audio codec for ad-hoc acoustic wireless sensor networks

Tom Bäckström, Mariem Bouafif, Pablo Perez Zarazaga, Meghna Ranjit, Sneha Das, Zied Lachiri 2021 Proceedings of the European Signal Processing Conference 2021 (EUSIPCO)

Enhancement by postfiltering for speech and audio coding in ad-hoc sensor networks

Sneha Das, Tom Bäckström 2021 JASA Express Letters

Cancellation of Local Competing Speaker with Near-field Localization for Distributed Ad-Hoc Sensor Network

Pablo Perez Zarazaga, Mariem Bouafif, Tom Bäckström, Zied Lachiri 2021 Interspeech

Federated Learning for Privacy Preserving On-Device Speaker Recognition

Abraham Zewoudie, Tom Bäckström 2021 1st ISCA Symposium on Security and Privacy in Speech Communication

The Use of Audio Fingerprints for Authentication of Speakers on Speech Operated Interfaces

Abraham Zewoudie, Tom Bäckström, Pablo Perez Zarazaga 2021 1st ISCA Symposium on Security and Privacy in Speech Communication

Voice-quality Features for Deep Neural Network Based Speaker Verification Systems

Abraham Zewoudie, Lauri Koivisto, Tom Bäckström 2021 EUSIPCO 2021

Intuitive Privacy from Acoustic Reach: A Case for Networked Voice User-Interfaces

Tom Bäckström, Sneha Das, Pablo Perez Zarazaga, Johannes Fischer, Rainhard Findling, Stephan Sigg, Le Nguyen 2021 Proceedings of the 1st ISCA Symposium on Security and Privacy in Speech Communication
More information on our research in the Research database.