Public defence in Speech and Language Technology, M.Sc.(Tech.) Aku Rouhe
- Public defence from the Aalto University School of Electrical Engineering, Department of Information and Communications Engineering
The title of the thesis: Attention-based End-to-End Models in Language Technology
Doctoral student: Aku Rouhe
Opponent: Prof. Ralf Schlüter, RWTH Aachen University, Germany
Custos: Prof. Mikko Kurimo, Aalto University School of Electrical Engineering, Department of Information and Communications Engineering
Speech recognition and the wider language technology research field have recently focused on end-to-end models. These end-to-end models discard the structure built into our artificial intelligence systems through human ingenuity and understanding, and instead rely on data and compute. This study asks whether the migration to end-to-end models is truly justified in light of the systems' performance. As speech and language technologies become a part of everyday life, it is vital to understand the choices that are made in building those technologies. End-to-end models work especially well when data and compute are abundant, but these are not available, for example, for many small languages.
The study focuses on speech recognition, where the central contribution is to create new matched comparisons between end-to-end models and their alternative, decomposed solutions. These comparisons equalise the data and certain key computational aspects. The main finding is that decomposed approaches remain competitive with end-to-end models in the classic performance metrics used in speech recognition. The study also shows how end-to-end models can be improved in ways that leverage the subtask data. All in all, the study emphasises that it is not the migration to end-to-end models that has brought the largest improvements in recent years, but the new neural network architectures and the increases in data and compute.
Additionally, the study touches on canonical morpheme segmentation and speech translation. Although these subjects are covered more briefly, it appears that the same questions posed for speech recognition are also relevant in the wider language technology field.
End-to-end models have clear use cases in, for instance, mobile phones, but the study shows that decomposed solutions are still a viable approach, and that the performance improvements brought about by increased data and compute resources can still be augmented by human understanding in speech and language technology solutions.
Keywords: speech recognition, language technology, end-to-end models
Thesis available for public display 10 days prior to the defence at: https://aaltodoc.aalto.fi/doc_public/eonly/riiputus/
Doctoral theses in the School of Electrical Engineering: https://aaltodoc.aalto.fi/handle/123456789/53