Public defence in Speech and Language Technology, M.Sc.(Tech.) Aku Rouhe

End-to-End language technology relies on compute and data, but structure and engineering are still relevant.
- Public defence from the Aalto University School of Electrical Engineering, Department of Information and Communications Engineering
An artificial intelligence's artistic rendition of an attention-based end-to-end model: an ear points a beam at speech bubbles.
The attention mechanism is a technique that neural networks use to focus on the relevant parts of their inputs.

The title of the thesis: Attention-based End-to-End Models in Language Technology

Doctoral student: Aku Rouhe
Opponent: Prof. Ralf Schlüter, RWTH Aachen University, Germany
Custos: Prof. Mikko Kurimo, Aalto University School of Electrical Engineering, Department of Information and Communications Engineering 

Speech recognition and the wider field of language technology research have recently focused on end-to-end models. These models discard the structure that human ingenuity and understanding have built into artificial intelligence systems, and rely instead on data and compute. This study asks whether the migration to end-to-end models is truly justified in light of the systems' performance. As speech and language technologies become a part of everyday life, it is vital to understand the choices that are made in building those technologies. End-to-end models work especially well when data and compute are abundant, but such resources are not available, for example, for many small languages.

The study focuses on speech recognition, where the central contribution is to create new matched comparisons between end-to-end models and their alternative, decomposed solutions. These comparisons equalise the data and certain key computational aspects. The main finding is that decomposed approaches remain competitive with end-to-end models in the classic performance metrics used in speech recognition. The study also shows how end-to-end models can be improved in ways that leverage the subtask data. All in all, the study emphasises that the largest improvements in recent years have come not from the migration to end-to-end models, but from new neural network architectures and increases in data and compute.

Additionally, the study touches on canonical morpheme segmentation and speech translation. Although these subjects are treated more briefly, it appears that the same questions posed for speech recognition are also relevant in the wider language technology field.

End-to-end models have clear use cases, for instance in mobile phones, but the study shows that decomposed solutions are still a viable approach, and that the performance improvements brought about by increased data and compute resources can still be augmented by human understanding in speech and language technology solutions.

Keywords: speech recognition, language technology, end-to-end models

Thesis available for public display 10 days prior to the defence at:

Contact information:

Email [email protected]
Phone 0408133607

Doctoral theses in the School of Electrical Engineering:
