Machine Learning Coffee seminar: Jaakko Peltonen, University of Tampere, "Exploring Large And Hierarchical Online Discussion Venues With Probabilistic Models"
In many domains, document sets are hierarchically organized, as in message forums that have multiple levels of sections. Analysis of latent topics within such content is crucial for tasks such as trend and user interest analysis. Nonparametric topic models are a powerful approach, but traditional Hierarchical Dirichlet Processes (HDPs) cannot fully account for topic sharing across deep hierarchical structure. Moreover, beyond the underlying trends of content and the structure of the venue, another key aspect of online discussion venues is the multitude of participants: authors may participate differently at multiple levels of sections, with different interests and contributions across the hierarchy. We introduce the Tree-structured Hierarchical Dirichlet Process (THDP), which allows Dirichlet-process-based topic modeling over a given tree structure of arbitrary size and height, where documents can arise at all tree nodes. We further introduce the Author Tree-structured Hierarchical Dirichlet Process (ATHDP), which allows Dirichlet-process-based topic modeling of both text content and authors over a given tree structure of arbitrary size and height.
Experiments on hierarchical forums demonstrate that THDP generalizes better than traditional HDPs, both in modeling new data and in classifying documents to sections, and that ATHDP outperforms traditional HDP-based alternatives in perplexity and authorship attribution accuracy. Lastly, we introduce a novel interactive system for visualizing and exploring a large hierarchical text corpus of online forum postings. The system is based on large-scale scatter plots created by flexible nonlinear dimensionality reduction of posting contents, coupled with a coloring optimized to represent the forum hierarchy. We exploit the hierarchy to provide data-driven summaries of plot areas at multiple levels of detail, allowing the user to quickly compare both the content-based similarity of groups of posts and how close they are in the forum hierarchy.
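
For readers unfamiliar with perplexity, the measure mentioned above for comparing the models on new data, the following is a minimal sketch of the standard per-token definition (the exponential of the average negative log-likelihood). It is not taken from the talk: the function name `perplexity` and the toy probabilities are assumptions made up purely for illustration, and the actual evaluation protocol used in the experiments is not described in this abstract.

```python
import math


def perplexity(token_log_probs):
    """Per-token perplexity from natural-log probabilities of held-out tokens.

    Lower values mean the model assigns higher probability to unseen text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_lik = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_lik)


# Hypothetical probabilities that two models assign to the same four held-out
# tokens (illustrative numbers only, not results from the talk).
model_a_log_probs = [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)]
model_b_log_probs = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]

ppl_a = perplexity(model_a_log_probs)
ppl_b = perplexity(model_b_log_probs)
print(f"model A perplexity: {ppl_a:.3f}")
print(f"model B perplexity: {ppl_b:.3f}")
```

In this toy comparison the second model has the lower perplexity because it assigns higher probability to every held-out token, which is the sense in which a topic model is said to "model new data" better.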