
Public defence in the field of Mathematics and Statistics, M.Sc.(Tech.) Aleksi Avela

On imbalanced data and text classification

Public defence from the Aalto University School of Science, Department of Mathematics and Statistics.

Title of the thesis: On imbalanced data and text classification

Thesis defender: Aleksi Avela
Opponent: Professor Thomas Verdebout, Université libre de Bruxelles, Belgium
Custos: Professor Pauliina Ilmonen, Aalto University School of Science

Classification is a branch of statistics in which methods for predicting the classes of observations are developed and studied. A classification task could, for example, involve distinguishing images of dogs from images of cats. Machine learning provides some of the most prominent approaches to classification. The idea of machine learning is to feed a set of pre-labeled examples to an algorithm, which uses these examples to learn (i.e., to optimize) a classification rule. However, classifying the training data accurately is not the end goal: the aim is for the resulting classifier to generalize beyond the training set so that it also classifies future observations accurately.

In practice, it is often the case that the majority of data belong to some common class(es), and the class of interesting and important observations is relatively rare. This can happen, for instance, in medical testing, where at-risk patients are rare compared to the healthy population, but the cost of misclassifying an at-risk patient is much greater than that of misclassifying a patient with no risk. This phenomenon is referred to as the problem of imbalanced data. Machine learning algorithms often struggle with imbalanced data and typically favor the majority class strongly while neglecting the rare, so-called minority class.
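As a rough illustration of why imbalance is a problem, the sketch below uses hypothetical toy data (95 healthy patients, 5 at-risk patients; the numbers are assumptions, not figures from the thesis). A rule that always predicts the majority class looks accurate overall yet detects none of the rare cases.

```python
# Hypothetical toy data: 95 healthy patients (label 0) and 5 at-risk patients (label 1).
labels = [0] * 95 + [1] * 5

# A trivial "classifier" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: share of all observations that are classified correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of at-risk patients that are detected.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(y == 1 for y in labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
```

Here accuracy is 0.95 while minority-class recall is 0.00, which is why overall accuracy alone can be a misleading measure on imbalanced data.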

The problems of imbalanced data can also be amplified by other challenges related to the type of data considered. Text classification is an example of such a task. In text classification, the observations are natural language documents. A distinctive property of text data is that it does not inherently consist of measurements that could be used in classification. The first, nontrivial step of text classification is therefore to transform natural language into a form that a machine learning algorithm can process, which adds another layer of complexity to (imbalanced) text classification.
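One common way to turn documents into numbers is a bag-of-words count representation. The sketch below is only an illustration of that general idea, not necessarily the representation used in the thesis; the function name `bag_of_words` and the example sentences are hypothetical.

```python
from collections import Counter


def bag_of_words(documents):
    """Map each document to a vector of word counts over a shared vocabulary."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in the document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


documents = ["the cat sat", "the dog sat on the cat"]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length numeric vector that a standard classifier can consume. Real pipelines typically add steps such as punctuation handling and term weighting, which are omitted here.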

This thesis considers the challenges of imbalanced data both in general and in the context of text classification. The topics studied in the thesis include a practical application of text classification, a new approach tackling the problems of imbalanced text classification, and a theoretical study of issues related to the evaluation of classifiers when dealing with imbalanced data in general.

Keywords: classification, imbalanced data, text classification

Thesis available for public display 7 days prior to the defence at Aaltodoc

Doctoral theses of the School of Science

Doctoral theses of the School of Science are available in the open access repository maintained by Aalto, Aaltodoc.
