Events

Machine Learning Coffee seminar: Jaakko Peltonen, University of Tampere "Exploring Large And Hierarchical Online Discussion Venues With Probabilistic Models"

Helsinki region machine learning researchers start the week with an exciting machine learning talk. Porridge and coffee are served at 9:00, and the talk begins at 9:15.
Machine Learning Coffee Seminar, image: Matti Ahlgren

Exploring Large And Hierarchical Online Discussion Venues With Probabilistic Models

Jaakko Peltonen
University of Tampere

Abstract:

In many domains, document sets are hierarchically organized, such as message forums with multiple levels of sections. Analysis of latent topics within such content is crucial for tasks like trend and user interest analysis. Nonparametric topic models are a powerful approach, but traditional Hierarchical Dirichlet Processes (HDPs) are unable to fully take into account topic sharing across a deep hierarchical structure. Moreover, in addition to the underlying trends of content and the structure of the venue, another key aspect of online discussion venues is the multitude of participants; authors may participate differently at multiple levels of sections, with different interests and contributions across the hierarchy. We introduce the Tree-structured Hierarchical Dirichlet Process (THDP), allowing Dirichlet process based topic modeling over a given tree structure of arbitrary size and height, where documents can arise at all tree nodes. We further introduce the Author Tree-structured Hierarchical Dirichlet Process (ATHDP), allowing Dirichlet process based topic modeling of both text content and authors over a given tree structure of arbitrary size and height. Experiments on hierarchical forums demonstrate better generalization performance of THDP than traditional HDPs in terms of ability to model new data and classify documents to sections, and better performance of ATHDP compared to traditional HDP based alternatives in terms of perplexity and authorship attribution accuracy. Lastly, we introduce a novel interactive system for visualizing and exploring a large hierarchical text corpus of online forum postings, based on large-scale scatter plots created by flexible nonlinear dimensionality reduction of posting contents, coupled with a coloring optimized to represent the forum hierarchy. We exploit the hierarchy to provide data-driven summaries of plot areas at multiple levels of detail, allowing the user to quickly compare both the content-based similarity of groups of posts and how near they arise in the forum hierarchy.

See the upcoming talks on the seminar webpage.

Please spread the news and join us for our weekly habit of beginning the week with an interesting machine learning talk!

Welcome!
