Title of the thesis: Addressing statistical and computational challenges in extreme multilabel classification with unbiased estimators, macro-averaged metrics, and hardware-aware implementations
Thesis defender: Erik Schultheis
Opponent: Staff Research Scientist Aditya Menon, USA
Custos: Professor Pekka Marttinen, Aalto University School of Science
Recommender systems, such as those behind web shops or search engines, need to select a small set of items for a given query. This can be framed as a multilabel classification problem with a very large set of possible labels, which has three important consequences. First, having human experts exhaustively label even a small set of examples for training an automatic system is very expensive, because a single example can have over a million candidate labels; training data will therefore be incompletely labeled. Second, a few very popular items (e.g., blockbuster movies) appear many times as labels, while most items have only a few relevant instances. If the success of the system were evaluated simply by averaging over all predictions, only accuracy on the popular items would matter, so a more sophisticated scoring is needed. Finally, the enormous label space makes training such classifiers computationally expensive.
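To make the averaging issue concrete, here is a toy computation with invented numbers (not taken from the thesis): pooling all predictions together (micro-averaging) is dominated by the single popular label, whereas averaging per-label scores (macro-averaging) gives rare labels equal weight.

```python
import numpy as np

# Toy counts: label 0 is the "blockbuster", the rest are niche labels.
hits     = np.array([950, 1, 0, 2, 1])       # correctly predicted relevant instances
relevant = np.array([1000, 10, 10, 10, 10])  # truly relevant instances per label

micro_recall = hits.sum() / relevant.sum()   # pools all instances together
macro_recall = np.mean(hits / relevant)      # per-label recall, then a plain average

print(f"micro-averaged recall: {micro_recall:.2f}")  # ~0.92, looks great
print(f"macro-averaged recall: {macro_recall:.2f}")  # 0.27, exposes the weak rare labels
```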
The thesis addresses these problems as follows. Under certain conditions, we show how the effect of missing labels can be corrected, as long as an estimate of how many labels are missing is available. To give more weight to underrepresented labels, we can calculate a score (such as the fraction of times a label was correctly identified) for each label individually and then average these scores, giving all labels the same weight. We prove that knowing the probability of each label (the information produced by traditional classifiers) is sufficient for optimizing this objective. Interestingly, the optimal predictions need not be the labels with the highest probabilities; sometimes it may be better to recommend the niche film over the blockbuster, even if the latter is more likely.
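A minimal sketch of these two ideas, assuming per-label probabilities p from a trained classifier, label priors, and label propensities q (the estimated chance that a truly relevant label was actually annotated); the variable names and the simple inverse-prior weighting are illustrative rather than the thesis's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, k = 8, 3

p     = rng.uniform(0.05, 0.9, size=num_labels)   # predicted P(label is relevant | instance)
prior = rng.uniform(0.001, 0.3, size=num_labels)  # how frequent each label is overall
q     = rng.uniform(0.3, 1.0, size=num_labels)    # propensity: P(relevant label gets annotated)

# 1) Correcting for missing labels: reweight observed positives by 1/q so that,
#    in expectation, they count as much as the true positives would.
observed = rng.random(num_labels) < p * q          # toy incomplete annotation
unbiased_count = (observed / q).sum()              # expectation equals p.sum(), the true count

# 2) Prediction: picking the k most probable labels optimizes an instance-averaged
#    score, but a macro-averaged objective favours rare-but-likely labels,
#    here illustrated by dividing out the label prior.
top_by_probability = np.argsort(-p)[:k]
top_macro_aware    = np.argsort(-(p / prior))[:k]

print("top-k by raw probability:  ", top_by_probability)
print("top-k favouring rare labels:", top_macro_aware)
```

Dividing by the prior is one simple way to capture the "niche film over the blockbuster" effect described above; the point of the proof is that per-label probabilities are all the information such a rule needs.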
We alleviate the computational cost by exploiting sparsity in several ways. First, for each instance, most labels will be trivially irrelevant. If those can be identified quickly, the computation can focus on the small set of labels for which the decision is genuinely difficult. Second, we show that it is possible to use only a small subset of model parameters and still achieve accuracy comparable to the full model. This makes it possible to run these learning algorithms on cheap gaming GPUs instead of requiring expensive datacenter hardware. With the growing prevalence of large language models and their extensive token vocabularies, similar techniques will be increasingly valuable for enabling efficient training and inference on consumer hardware.
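As a rough sketch of the first idea, scoring only a shortlist of candidate labels means that only a tiny fraction of the per-label parameters is ever touched for a given instance. The nearest-centroid lookup below is a stand-in for the faster candidate-selection structures actually used in such systems, so treat it purely as an illustration of the sparsity pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim, shortlist_size = 100_000, 64, 200

W         = rng.standard_normal((num_labels, dim)).astype(np.float32)  # one weight row per label
centroids = rng.standard_normal((1_000, dim)).astype(np.float32)       # coarse label "buckets"
bucket_of = rng.integers(0, 1_000, size=num_labels)                    # label -> bucket assignment

def predict_topk(x, k=5):
    # Step 1: a cheap coarse search over 1,000 centroids picks a few buckets,
    # yielding a small candidate set instead of all 100,000 labels.
    best_buckets = np.argsort(-(centroids @ x))[:4]
    candidates = np.flatnonzero(np.isin(bucket_of, best_buckets))[:shortlist_size]
    # Step 2: only the candidate rows of W are read and scored; the remaining
    # ~99.8% of the parameters are never touched for this instance.
    scores = W[candidates] @ x
    return candidates[np.argsort(-scores)[:k]]

print(predict_topk(rng.standard_normal(dim).astype(np.float32)))
```

The second idea follows the same pattern on the training side: if memory and compute scale with a small active subset of parameters rather than with the full label space, consumer-grade GPUs become sufficient.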
The thesis is available for public display at Aaltodoc 7 days prior to the defence.
Doctoral theses of the School of Science are available in the open access repository maintained by Aalto, Aaltodoc.