Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams
Abstract
Clusters in document streams, such as online news articles, can be induced by their
textual content, as well as by the temporal dynamics of their arrival patterns.
Can we leverage both sources of information to obtain a better clustering of the
documents, and distill information that cannot be extracted from content
alone? In this paper, we propose a novel random process, referred to as the
Dirichlet-Hawkes process, that takes both sources of information into account in a
unified framework. A distinctive feature of the proposed model is that the preferential
attachment of items to clusters according to cluster sizes, present in Dirichlet
processes, is now driven by the intensities of cluster-wise self-exciting
temporal point processes, the Hawkes processes. This new model establishes a
previously unexplored connection between Bayesian nonparametrics and temporal point
processes, allowing the number of clusters to grow to accommodate the increasing
complexity of online streaming content while adapting to the ever-changing
dynamics of the corresponding continuous arrival times. We conduct
large-scale experiments on both synthetic and real-world news articles, and show
that Dirichlet-Hawkes processes can recover both meaningful topics and temporal
dynamics, leading to better predictive performance in terms of content
perplexity and the arrival times of future documents.
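
To make the intensity-driven preferential attachment concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes an exponential triggering kernel with hypothetical parameters alpha and beta, and a constant rate base_rate for opening a new cluster; the function names are illustrative. Under these assumptions, an arriving document joins an existing cluster with probability proportional to that cluster's current Hawkes intensity rather than to its size.

```python
import math


def hawkes_intensity(t, past_times, alpha=1.0, beta=1.0):
    """Self-exciting intensity at time t contributed by a cluster's earlier events.

    Assumes an exponential kernel alpha * exp(-beta * (t - s)); the kernel
    choice and its parameters are illustrative assumptions.
    """
    return sum(alpha * math.exp(-beta * (t - s)) for s in past_times if s < t)


def assignment_probabilities(t, clusters, base_rate=0.5, alpha=1.0, beta=1.0):
    """Return (probabilities of joining each existing cluster, probability of a new cluster).

    `clusters` is a list of lists of past arrival times, one list per cluster.
    `base_rate` (hypothetical) plays the role of the concentration parameter.
    """
    intensities = [hawkes_intensity(t, times, alpha, beta) for times in clusters]
    total = base_rate + sum(intensities)
    return [lam / total for lam in intensities], base_rate / total


# A large but old cluster versus a small but recently active one.
clusters = [[0.0, 0.1, 0.2], [4.8, 4.9]]
probs, p_new = assignment_probabilities(5.0, clusters)
print(probs, p_new)
```

In this toy stream, the second cluster receives the higher probability even though it contains fewer documents, because its events are recent. A size-based Dirichlet process prior would instead favor the first cluster.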
