Department of Computer Science: MSc Thesis Presentations

"Applying natural language embeddings in phishing email classification tasks", author: Essi Tallgren.

Applying natural language embeddings in phishing email classification tasks

Author: Essi Tallgren
Supervisor: Juho Rousu
Time: Tuesday 13 August at 14:00-14:30
Place: online (Zoom)

Abstract: In this thesis, natural language embeddings were applied to phishing email classification tasks. Several classification methods, including SVM, Random Forest, Logistic Regression, and LightGBM, were investigated with the objective of obtaining reliable phishing email classification results. The research was motivated by the need to adapt to evolving trends in the cybersecurity threat landscape and to react quickly to high-risk phishing attacks. As attackers use emerging technologies to find new ways to compromise their victims’ computer systems, there is an increasing need to develop effective countermeasures.

The data used in this research was collected from real phishing emails reported by Hoxhunt’s users. The emails had been labelled by the company’s threat analysts. Using this labelled data, natural language embeddings were created with the BERT model and classification experiments were carried out. The dataset was highly imbalanced, so the applicability of the SMOTE oversampling method was also examined during the classification process.

The research in this thesis showed that choosing suitable labels for language-based classification is of utmost importance. The initial labels included both labels representing techniques used in phishing and labels representing topics that phishing emails often have. Of these two, the technique labels gave inferior results to the topic labels, because techniques are not as closely tied to the language of an email as topics are. LightGBM was also shown to be the most suitable classification method for the purpose of this thesis, as it combined high performance with properties that suited this study.

Based on these conclusions, the application of natural language embeddings to phishing email classification has potential. The model reached an accuracy of almost 80%, even though the language model used had not been fine-tuned and no comparison between language models was made.
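
For readers unfamiliar with SMOTE, the oversampling method mentioned in the abstract, the following is a minimal sketch of its core idea: new minority-class points are created by interpolating between an existing minority sample and one of its nearest minority neighbours. This is not the thesis code. The function name `smote`, its parameters, and the toy two-dimensional "embeddings" are assumptions made purely for illustration, and the abstract does not say which implementation was used.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; needs at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy 2-D stand-ins for embeddings; real BERT embeddings have hundreds of dimensions.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.2, 0.3)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data contains no synthetic samples.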
