Defence of dissertation in the field of Speech and Language Technology, Lauri Juvela, MSc (Tech.)

The title of the thesis is "Neural waveform generation for source-filter vocoding in speech synthesis".

The public defence will be held remotely. Link:

Speech synthesis, the artificial generation of speech from any given text, has long been one of the fundamental problems in speech communication technology. While early research on synthesis was driven by curiosity about human voice production, modern speech synthesis has found many applications, including screen readers, assistive devices and human-computer speech interfaces. With recent advances in statistical model-based synthesis using neural networks, speech synthesis has reached an unprecedented level of naturalness and flexibility that will make many new applications possible. A major contributor to these improvements has been the introduction of neural network waveform synthesis models, which take the role of the vocoder in a traditional speech synthesis system.

However, a gap remains between the recent raw-waveform neural vocoders and the classical model-based signal processing vocoders, both in how well the algorithms are understood and in their computational efficiency. A central motivation of the present dissertation has been to combine the emerging generative neural network models with classical speech signal processing concepts for efficient, high-quality synthesis that retains a degree of interpretability.

Specifically, this dissertation focuses on neural network modeling of the excitation signal related to the source-filter model of human voice production. Since the present signal processing techniques for modeling the spectral envelope of the vocal tract are highly developed, the spectral envelope can be parameterized and used directly as a part of neural vocoding schemes. The remaining task is then to develop neural network models for the residual excitation signal.
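The source-filter split described above can be illustrated with a minimal sketch. It is not the dissertation's method: it assumes plain linear prediction (LPC) as a stand-in for the spectral envelope model, uses a synthetic toy signal instead of speech, and all function names are hypothetical. Inverse filtering with the envelope yields the residual excitation, and passing that residual back through the all-pole envelope filter recovers the signal. In a neural vocoder of the kind discussed here, a network would generate the residual rather than obtain it by inverse filtering.

```python
import random

def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for the prediction-error filter A(z) = 1 + a[1]z^-1 + ... ."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def inverse_filter(x, a):
    """FIR filtering with A(z): removes the envelope, leaving the residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """All-pole filtering with 1/A(z): imposes the envelope on an excitation."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

# Toy "voiced" signal (an assumption for illustration): a noisy impulse
# train shaped by a single two-pole resonance.
random.seed(0)
excitation = [(1.0 if n % 80 == 0 else 0.0) + 0.01 * random.gauss(0.0, 1.0)
              for n in range(400)]
speech = synthesis_filter(excitation, [1.0, -1.6, 0.9])

order = 4
a = levinson_durbin(autocorrelation(speech, order), order)
residual = inverse_filter(speech, a)          # what the neural model would generate
reconstructed = synthesis_filter(residual, a)  # envelope applied to the residual

max_error = max(abs(u - v) for u, v in zip(speech, reconstructed))
print("max reconstruction error:", max_error)
```

Because the envelope filter carries the resonance structure, the residual has much lower energy than the signal itself, which is one intuition for why modeling only the residual can be an easier task than modeling the raw waveform.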

This dissertation presents an improved framework for representing the residual excitation waveform in a pitch-synchronous format, and applies generative adversarial networks to synthesize these waveforms without a parametric aperiodicity model. Furthermore, it proposes an autoregressive WaveNet-based excitation model, which explicitly uses only a spectral envelope model during synthesis. Finally, the two approaches are combined into a source-filter synthesizer that is capable of parallel inference and trainable in an end-to-end fashion.

Opponent: Assistant Professor Gustav Henter, KTH Royal Institute of Technology, Sweden.

Custos: Professor Paavo Alku, Aalto University School of Electrical Engineering, Department of Signal Processing and Acoustics.

Contact information: Lauri Juvela, Aalto University School of Electrical Engineering, tel. 0503790120, [email protected]

Electronic dissertation

The dissertation is publicly displayed 10 days before the defence:
