Speech Interaction Technology

Research Overview
We currently specialize in the following research areas:
- Speech and audio coding, distributed coding and acoustic sensor networks
- Security and privacy in speech communication
- Acoustic authentication methods
- Speaker recognition
Research topic descriptions
Speech enhancement with speech source models using post-filtering
In speech coding, we have learned that source models are central to improving coding efficiency. It is therefore reasonable to expect that such models would be effective in speech enhancement as well. We have applied speech source models to the enhancement task to improve speech quality in noisy scenarios using low-complexity and low-delay methods. In particular, we have focused on speech coding scenarios, where speech is corrupted by quantization noise. A particular characteristic of combining speech coding and enhancement is that in speech coding it is not possible to use inter-frame information (i.e., information over the time axis, across processing windows), since any information shared across windows would either increase delay or jeopardize reconstruction in case of packet loss. Our method is therefore implemented entirely on the decoder side: we treat quantization noise at the decoder as additive noise and apply speech enhancement to improve quality.
Our method is based on predicting the distribution of the current frame from the recent past. This gives us more accurate statistical priors for the reconstruction/enhancement task than reconstruction without past information, and consequently better quality at the output. We have explored such prediction with Gaussian, GMM and neural network models, and concluded that a simple Gaussian is a reasonable approximation for the low-complexity approach. Improved neural network models remain future work.
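To make the Gaussian variant concrete, the following is a minimal sketch of conditioning a prior for the current frame on the observed past frame and then applying an MMSE (Wiener-style) cleanup to the decoded frame. The function names, the diagonal noise model, and the random stand-in data are illustrative assumptions, not our actual implementation:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_past, d):
    """Condition a joint Gaussian over [past, current] frame features on
    the observed past frame. mu: (2d,), Sigma: (2d, 2d), x_past: (d,)."""
    mu_p, mu_c = mu[:d], mu[d:]
    S_pp, S_cp, S_cc = Sigma[:d, :d], Sigma[d:, :d], Sigma[d:, d:]
    gain = S_cp @ np.linalg.inv(S_pp)
    mu_cond = mu_c + gain @ (x_past - mu_p)
    Sigma_cond = S_cc - gain @ S_cp.T
    return mu_cond, Sigma_cond

def enhance_frame(y, mu_cond, Sigma_cond, noise_var):
    """MMSE (Wiener-style) estimate of the clean frame from the decoded
    frame y, under the predicted Gaussian prior and additive noise with
    per-coefficient variance noise_var (modelling quantization noise)."""
    prior_var = np.diag(Sigma_cond)
    wiener = prior_var / (prior_var + noise_var)
    return mu_cond + wiener * (y - mu_cond)

# Statistics would be trained offline on pairs of consecutive frames;
# random data stands in for real log-spectral features here.
d = 8
rng = np.random.default_rng(0)
pairs = rng.standard_normal((1000, 2 * d))
mu, Sigma = pairs.mean(axis=0), np.cov(pairs.T)
mu_c, Sigma_c = conditional_gaussian(mu, Sigma, pairs[0, :d], d)
x_hat = enhance_frame(pairs[0, d:], mu_c, Sigma_c, noise_var=0.1)
```

Note that the estimate uses only past frames already available at the decoder, so it adds no delay and no inter-frame dependency at the encoder.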
As source models, we have studied:
- Multi-channel estimation https://arxiv.org/pdf/2011.03810
- Spectral log-power envelope models https://research.aalto.fi/files/27812283/ELEC_das_et_al_Postfiltering_Using_Interspeech.pdf
- Fundamental frequency models in the log-power domain, optimized separately for each frequency https://isca-speech.org/archive/Interspeech_2020/pdfs/1067.pdf
- Phase models in the MDCT and STFT domains https://www.researchgate.net/profile/Tom_Baeckstroem/publication/327389332_Postfiltering_with_Complex_Spectral_Correlations_for_Speech_and_Audio_Coding/links/5be4116f299bf1124fc34e68/Postfiltering-with-Complex-Spectral-Correlations-for-Speech-and-Audio-Coding.pdf and http://www.essv.de/pdf/2020_109_116.pdf
- Prediction using GMM models (though this paper is not about enhancement) https://ieeexplore.ieee.org/abstract/document/8461527/
By Sneha Das and Tom Bäckström
Authentication of devices in the same room using acoustic fingerprints
We have plenty of devices, yet we typically use only one at a time. If devices were seamlessly interoperable, such that, for example, a teleconference could use all nearby microphones, interaction quality could be improved: audio quality would benefit from better sampling of the acoustic space, and service quality would improve because we would not need to think about which devices can handle which kinds of services.
A first step in this direction is to identify which devices are allowed to interact. Typically, devices near each other can interact. However, devices can be near each other and still be in different rooms. A better criterion for authentication is therefore that devices in the same room can interact; better yet, devices whose acoustic distance is short enough can interact.
For this purpose, we have designed an acoustic fingerprint which quantifies characteristics of the acoustic environment, such that we can determine whether two devices are in the same room by comparing their fingerprints. The fingerprint should typically include both temporal and stationary information: temporal events (transients) are very often specific to a particular environment (no two rooms have the same activity and thus the same noises), while stationary information such as room reverberation characteristics is needed to discriminate against broadcast sounds such as TV audio. Our experiments show that in typical low-noise scenarios with microphones less than 5 m apart, fingerprints typically have 85% identical bits.
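As an illustration of the general idea (not the fingerprint from the papers listed below, but the classic Haitsma-Kalker audio-fingerprint construction as a stand-in), the following sketch derives a binary fingerprint from the signs of band-energy differences along time and frequency, and compares two fingerprints by the fraction of matching bits:

```python
import numpy as np

def band_energies(x, n_bands=16, frame_len=1024, hop=512):
    """Log energy of linearly spaced frequency bands in each frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    E = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        spectrum = np.abs(np.fft.rfft(window * x[t * hop:t * hop + frame_len])) ** 2
        E[t] = [np.log(band.sum() + 1e-12) for band in np.array_split(spectrum, n_bands)]
    return E

def fingerprint(x):
    """Binary fingerprint: sign of the band-energy difference along both
    the time and the frequency axis (Haitsma-Kalker style)."""
    E = band_energies(x)
    bits = (E[1:, 1:] - E[1:, :-1]) - (E[:-1, 1:] - E[:-1, :-1]) > 0
    return bits.ravel()

def similarity(fp_a, fp_b):
    """Fraction of identical bits; same-room devices should score high."""
    return np.mean(fp_a == fp_b)

rng = np.random.default_rng(0)
mic_a = rng.standard_normal(16000)                 # stand-in for a recording
mic_b = mic_a + 0.1 * rng.standard_normal(16000)   # second mic, same room
print(similarity(fingerprint(mic_a), fingerprint(mic_b)))  # close to 1.0
```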
The final step is to compare fingerprints between devices. Observe that we cannot simply transmit the fingerprints, since that would leak potentially private information. Instead, we have to use cryptographic primitives which allow comparing private data, revealing only the result of the comparison and not the data itself.
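One primitive suited to noisy binary data like fingerprints is a fuzzy commitment. Below is a minimal sketch assuming a simple repetition code for error correction; a deployed system would use a stronger code and an authenticated exchange, and all names and parameters here are illustrative:

```python
import hashlib
import numpy as np

REP = 15  # repetition factor; corrects up to 7 flipped bits per key bit

def commit(fp, rng):
    """Device A binds a random key to its fingerprint fp (uint8 bits).
    Only the key hash and the masked codeword leave the device."""
    key = rng.integers(0, 2, len(fp) // REP, dtype=np.uint8)
    codeword = np.repeat(key, REP)
    mask = codeword ^ fp[:len(codeword)]
    return hashlib.sha256(key.tobytes()).hexdigest(), mask

def open_commitment(fp, key_hash, mask):
    """Device B recovers the key iff its fingerprint is close enough:
    majority voting decodes the repetition code despite bit errors."""
    noisy = mask ^ fp[:len(mask)]
    key = (noisy.reshape(-1, REP).sum(axis=1) > REP // 2).astype(np.uint8)
    return hashlib.sha256(key.tobytes()).hexdigest() == key_hash

rng = np.random.default_rng(1)
fp_a = rng.integers(0, 2, 300, dtype=np.uint8)
fp_b = fp_a ^ (rng.random(300) < 0.1).astype(np.uint8)  # ~90% identical bits
key_hash, mask = commit(fp_a, rng)
print(open_commitment(fp_b, key_hash, mask))            # True: same room
```

The code parameters would be tuned to the observed bit-error rates, so that same-room fingerprint agreement succeeds while across-room pairs fail.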
- Fingerprints reviewed https://aaltodoc.aalto.fi/handle/123456789/46766
- Fingerprints (first paper) https://aaltodoc.aalto.fi/handle/123456789/40456
- Provable consent (check afterwards that consent was given) https://research.aalto.fi/files/51759985/Sigg_ProvableConsent.pdf
By Pablo Pérez Zarazaga, Tom Bäckström and Stephan Sigg
Speech and audio coding for ad-hoc sensor networks
As noted above, we have plenty of devices yet typically use only one at a time; seamless interoperability between nearby devices, such as using all nearby microphones in a teleconference, could improve both audio quality and service quality.
A necessary component of such interaction between devices is a method for efficient transmission of information between them. In essence, we need speech and audio coding methods to compress the information for transmission. Conventional coding methods are clearly useful here, but we need to extend them to take advantage of multiple devices. In particular, observe that it would not be useful to send the same data from multiple devices; for example, if two devices send the same fundamental frequency, we could omit the second one to save bandwidth without loss of quality. However, choosing which information to send would require coordination between an unknown number of devices, which is potentially complicated. Instead, we have opted for a design based on independent devices, which do not share information with each other; they simply transmit their data, and we use randomization (dithering) to make the coding errors of the different sources as independent as possible.
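To illustrate the randomization, here is a minimal sketch of subtractive dithered quantization, where each device draws its own dither so that quantization errors become independent across devices and average out at the receiver. The parameter values and shared-seed dither generation are illustrative assumptions:

```python
import numpy as np

def dithered_quantize(x, step, rng):
    """Subtractively dithered uniform quantization: each device draws its
    own dither sequence, so quantization errors across devices are
    (approximately) independent."""
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    indices = np.round((x + dither) / step)   # this is what gets transmitted
    return indices, dither

def dithered_dequantize(indices, dither, step):
    """The receiver regenerates the dither from a shared random seed and
    subtracts it after reconstruction."""
    return indices * step - dither

rng_a, rng_b = np.random.default_rng(1), np.random.default_rng(2)
x = np.sin(np.linspace(0, 2 * np.pi, 1000))   # same signal at two devices
step = 0.25
x_a = dithered_dequantize(*dithered_quantize(x, step, rng_a), step)
x_b = dithered_dequantize(*dithered_quantize(x, step, rng_b), step)
# Averaging independent reconstructions roughly halves the error variance.
print(np.var(x - x_a), np.var(x - (x_a + x_b) / 2))
```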
- Dithered coding for ad-hoc sensor networks https://research.aalto.fi/files/27811883/dithering2.pdf
- Dithering methods https://ieeexplore.ieee.org/abstract/document/8052578/
- End-to-end optimization of source models for coding https://research.aalto.fi/files/37082504/ELEC_Backstrom_End_to_end_Interspeech.pdf
- Coding based on GMMs https://ieeexplore.ieee.org/abstract/document/8461527/
- Position paper https://arxiv.org/pdf/1811.05720
- Envelope modelling for super-wide band https://pdfs.semanticscholar.org/1f66/d8c5cc623d7a2e43a260672270d03274579e.pdf
- Optimal overlap-add windowing for coding https://arxiv.org/pdf/1902.01053
- See also the decoder-side postfiltering methods above
By Tom Bäckström, Johannes Fischer (International Audio Laboratories Erlangen), Sneha Das, Srikanth Korse (International Audio Laboratories Erlangen)
Experience of privacy
When you tell a secret to a friend, you whisper. People are naturally attuned to the level of privacy: they intuitively sense how private a scenario is and subconsciously modify their behavior in accordance with the perceived privacy of their surroundings. We cannot tell secrets in a public space; it is obvious to us. We call this effect our experience of privacy. Observe that this experience is correlated with the actual level of privacy, but not strictly bound to it: there could be an eavesdropper nearby without our knowledge, such that we unknowingly reveal secrets, or we could be overly paranoid and refuse to tell secrets even in the absence of eavesdroppers. Therefore, both the subjective and the objective sense of privacy have to be taken into account.
Speech interfaces which respect our privacy have to understand both such subjective and objective privacy. They have to understand which kinds of environments feel private to us, they have to try to identify real threats to privacy, and they have to be able to act according to those levels of privacy.
- Database for quantifying experience of privacy among users https://research.aalto.fi/files/34916518/ELEC_Zarazaga_Sound_Privacy_Interspeech.pdf and http://www.interspeech2020.org/uploadfile/pdf/Thu-3-3-2.pdf
- Popular science style review paper https://www.vde.com/resource/blob/1991012/07662bec66907573ab254c3d99394ec7/itg-news-juli-oktober-2020-data.pdf
- User-interface study of privacy in speech interfaces https://fruct.org/publications/acm27/files/Yea.pdf and https://trepo.tuni.fi/bitstream/handle/10024/120072/YeasminFarida.pdf?sequence=2
- Privacy in teleconferencing https://arxiv.org/pdf/2010.09488
- See also acoustic fingerprint papers above
By Sneha Das, Pablo Pérez Zarazaga, Anna Leschanowsky, Farida Yeasmin, Tom Bäckström and others
Speaker identification, verification and spoofing
Voice-quality Features for Deep Neural Network-Based Speaker Verification Systems
Jitter and shimmer are voice-quality features which have been successfully used to detect voice pathologies and to classify different speaking styles. We therefore investigate the usefulness of such voice-quality features in neural-network-based speaker verification systems. To combine the two feature sets, the cosine distance scores estimated from them are linearly weighted to obtain a single fused score, which is used to accept or reject a given speaker. Experimental results on the VoxCeleb-1 dataset demonstrate that fusing the cosine distance scores extracted from the mel-spectrogram and voice-quality features provides an 11% relative improvement in Equal Error Rate (EER) compared to a baseline system based only on mel-spectrogram features.
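For reference, jitter and shimmer measure cycle-to-cycle variability of the pitch period and of the peak amplitude, respectively. A minimal sketch of the local (first-order) variants, assuming pitch periods and cycle amplitudes have already been extracted by a pitch tracker:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    relative to the mean period (often reported in percent)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same measure applied to cycle peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# e.g., pitch periods in milliseconds from a voiced segment
print(local_jitter([8.0, 8.1, 7.9, 8.2, 8.0]))   # ~0.025, i.e. 2.5%
```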
The main contribution of this work is the use of voice-quality features in deep-learning-based speaker verification. Jitter and shimmer measurements show significant differences between speaking styles, and since these features have shown potential for characterizing pathological voices and linguistic abnormalities, they can also be employed to characterize a particular speaker. The voice-quality features are used together with short-term mel-spectrogram features, and the fusion is carried out at the score level, i.e., the cosine distance scores obtained with the mel-spectrogram and voice-quality models are linearly weighted.
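The score-level fusion itself is straightforward. A minimal sketch, where the embeddings are assumed to come from the two trained networks, and the weight and threshold are hypothetical values that would be tuned on development data:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between an enrollment and a test embedding."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def fused_score(mel_test, mel_enroll, vq_test, vq_enroll, w=0.8):
    """Linearly weighted fusion of the two cosine scores; the weight w
    (hypothetical value) would be tuned on a development set."""
    return (w * cosine_score(mel_test, mel_enroll)
            + (1 - w) * cosine_score(vq_test, vq_enroll))

# Accept the trial if the fused score exceeds a threshold chosen on
# development data, e.g. at the equal-error-rate operating point.
def verify(score, threshold=0.5):
    return score >= threshold
```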
- Voice-quality papers for:
  - Speaker diarization: http://www.odyssey2016.com/papers/pdfs_stamped/18.pdf
  - Speaker clustering: https://www.isca-speech.org/archive/Interspeech_2016/pdfs/0339.PDF
- Long-term features for diarization
Speaker recognition and speaker diarization are closely interrelated.
By Abraham Woubie Zewoudie, Lauri Koivisto and Tom Bäckström
To What Extent do Voice-quality Features Enhance the Performance of Anti-spoofing Systems?
Automatic speaker verification (ASV) technology is currently used in a wide range of applications which require not only robustness to changes in the acoustic environment, but also resilience to intentional circumvention, known as spoofing. Replay attacks are a key concern among the possible attack vectors: they can be performed with ease, and the threat they pose to ASV reliability has been confirmed in several studies. Replay attacks are mounted using recordings of a target speaker's voice, which are replayed to an ASV system in place of genuine speech; a prime example is using a smart device to replay a recording of a target speaker's voice to unlock a smartphone protected by ASV access control. In this work, we explore to what extent voice-quality features help ASV systems combat replay attacks. The impact of voice-quality features is analyzed by fusing them with state-of-the-art anti-spoofing features such as Constant Q Cepstral Coefficients (CQCCs).
By Abraham Woubie Zewoudie and Tom Bäckström
Teaching
Our department provides the following courses on speech and language technology:
- ELEC-E5500 Speech Processing
- ELEC-E5510 Speech Recognition
- ELEC-E5550 Statistical Natural Language Processing
- ELEC-E5521 Speech and Language Processing Methods
Research group's YouTube page
