Public defence in Computer Science, M.Sc. Joonas Jälkö
Opponent: Professor Emiliano De Cristofaro, University College London, England
Custos: Professor Samuel Kaski, Aalto University School of Science, Department of Computer Science
The thesis is publicly displayed 10 days before the defence in the publication archive Aaltodoc of Aalto University.
Public defence announcement:
In the past decade, machine learning methods have become an increasingly important part of our lives. These methods are used, for example, to give users personalized recommendations in retail, but also in fields involving crucial decision making, such as medicine and health care. Training data is a key component in making these models perform well in such demanding tasks. However, when this data contains sensitive information, there is a concern that the models may unintentionally leak it. Another concern is how much we can rely on the learned models. When machine learning is used for crucial decision making, it is essential to quantify the uncertainty the models have in their predictions or analyses. This dissertation studies the intersection of differential privacy and approximate Bayesian inference, which, respectively, aim to address the privacy concerns raised by using sensitive data for training the models and to quantify the uncertainty of the learned results.
The thesis proposes several novel techniques for privacy-preserving Bayesian inference. Moreover, the thesis demonstrates how to generate privacy-preserving synthetic data sets from probabilistic models learned using the proposed inference techniques. We test the proposed inference methods with several real-world examples. Our results show that it is possible to learn probabilistic models under the strict constraints of differential privacy while retaining the usefulness and capturing the uncertainty of the learned models. In the synthetic data application, we demonstrate that it is possible to share privacy-preserving synthetic data that retains the main statistical properties of the sensitive data set, and furthermore that these properties are better retained when the data generation process is modeled accurately.