Department of Computer Science: MSc Thesis Presentations
Identifying mislabelled data in extreme multi-label text classification – applying confident learning to a medical coding dataset
Author: Akseli Anttonen
Supervisor: Juho Rousu
Abstract: Labels in datasets used for machine learning are often produced by human annotation or other noisy processes. Biases in data generation or processing may also introduce systematic label errors. As a result, the given labels in most datasets contain errors. Mislabels can reduce predictive performance and undermine the generalization ability of machine learning models. This thesis investigates mislabel detection in the context of an extreme multi-label text classification task, a setting where each text document is annotated with several labels chosen from a set of thousands of options. Experiments are carried out to test one mislabel detection method on a dataset used for automatic medical coding, which is the task of predicting medical diagnosis or procedure codes based on medical records. The employed method, confident learning, uses the predicted probabilities of a trained model: cases where the model confidently disagrees with a given label are flagged as potential label errors. The mislabel detection is evaluated against a keyword-search-based ground truth on a subset of labels. Furthermore, the effect of cleaning the training set is investigated by re-training the model after correcting label errors. The results suggest that confident learning can detect, with high precision, cases where an erroneous extra label is present. However, the method is too unreliable to clean the dataset fully automatically. The re-training results show that a model trained on cleaned data is more conservative, having a lower false positive rate, but performs worse overall.
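To make the idea of "confidently disagreeing with a given label" concrete, the following is a minimal sketch of a one-vs-rest simplification of confident learning for multi-label data. It is not the thesis's implementation: the function name `flag_extra_labels`, the per-label thresholds (the mean predicted probability among documents that carry or lack the label), and the toy data are all assumptions made for illustration. It only looks for erroneous extra labels, the case the abstract reports as detected with high precision.

```python
from statistics import mean


def flag_extra_labels(given, probs):
    """Flag (document, label) pairs where a given label looks erroneous.

    given: list of rows of 0/1 label indicators (hypothetical toy format).
    probs: list of rows of predicted probabilities, same shape as `given`.
    """
    n_labels = len(given[0])
    flagged = []
    for j in range(n_labels):
        pos = [i for i, row in enumerate(given) if row[j] == 1]
        neg = [i for i, row in enumerate(given) if row[j] == 0]
        if not pos or not neg:
            continue  # thresholds are undefined without both groups
        # Assumed thresholds: average self-confidence within each group.
        t_present = mean(probs[i][j] for i in pos)
        t_absent = mean(1.0 - probs[i][j] for i in neg)
        for i in pos:
            p = probs[i][j]
            # The label is given, but the model is confident it is absent
            # and not confident it is present.
            if (1.0 - p) >= t_absent and p < t_present:
                flagged.append((i, j))
    return flagged


# Toy data: document 2 carries label 0, but the model assigns it 0.05.
given = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.05, 0.9],
         [0.1, 0.8], [0.2, 0.85], [0.1, 0.1]]
print(flag_extra_labels(given, probs))
```

In this toy example only document 2 with label 0 is flagged. In practice the predicted probabilities would normally come from held-out (cross-validated) predictions, so that the model is not scoring labels it was trained on.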