Hierarchical Label Propagation and Discovery for Machine Generated Email
Venue
Proceedings of the International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 317-326
Publication Year
2016
Authors
James B. Wendt, Michael Bendersky, Lluis Garcia-Pueyo, Vanja Josifovski, Balint Miklos, Ivo Krka, Amitabh Saikia, Jie Yang, Marc-Allen Cartright, Sujith Ravi
Abstract
Machine-generated documents such as email or dynamic web pages are single
instantiations of a pre-defined structural template. As such, they can be viewed as
a hierarchy of template-specific and document-specific content. This hierarchical template
representation has several important advantages for document clustering and
classification. First, templates capture common topics among the documents, while
filtering out the potentially noisy variabilities such as personal information.
Second, template representations scale far better than document representations
since a single template captures numerous documents. Finally, since templates group
together structurally similar documents, they can propagate properties between all
the documents that match the template. In this paper, we use these advantages for
document classification by formulating an efficient and effective hierarchical
label propagation and discovery algorithm. The labels are propagated first over a
template graph (constructed based on either term-based or topic-based
similarities), and then to the matching documents. We evaluate the performance of
the proposed algorithm using a large donated email corpus and show that the
resulting template graph is significantly more compact than the corresponding
document graph, and that hierarchical label propagation is both efficient and
effective in increasing the coverage of the baseline document classification
algorithm. We demonstrate that the template label propagation achieves more than
91% precision and 93% recall, while increasing the label coverage by more than 11%.
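
The two-stage idea described above (propagate labels over a template graph, then push them down to matching documents) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the cosine similarity over bag-of-words term counts, the 0.3 edge threshold, the weighted-vote update rule, and all names and toy data (`build_template_graph`, `propagate_template_labels`, `label_documents`, templates `t1` to `t5`) are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_template_graph(template_terms, threshold=0.3):
    """Connect templates whose term-based similarity meets the threshold.

    The threshold value is an illustrative assumption.
    """
    graph = defaultdict(dict)
    ids = sorted(template_terms)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            w = cosine(template_terms[u], template_terms[v])
            if w >= threshold:
                graph[u][v] = w
                graph[v][u] = w
    return graph


def propagate_template_labels(graph, seeds, all_ids, iterations=10):
    """Give each unlabeled template the label with the largest total edge
    weight among its already-labeled neighbors, repeating until no template
    changes or the iteration limit is reached. Seed labels are never changed.
    """
    labels = dict(seeds)
    for _ in range(iterations):
        updates = {}
        for t in all_ids:
            if t in labels:
                continue
            votes = defaultdict(float)
            for neighbor, weight in graph.get(t, {}).items():
                if neighbor in labels:
                    votes[labels[neighbor]] += weight
            if votes:
                # sorted() makes ties resolve to the alphabetically first label.
                updates[t] = max(sorted(votes), key=votes.get)
        if not updates:
            break
        labels.update(updates)
    return labels


def label_documents(doc_to_template, template_labels):
    """Second stage: every document inherits the label of its template."""
    return {
        doc: template_labels[tpl]
        for doc, tpl in doc_to_template.items()
        if tpl in template_labels
    }


# Hypothetical toy data: five templates, two of which carry seed labels.
template_terms = {
    "t1": Counter("your order has shipped tracking number".split()),
    "t2": Counter("your order has been delivered tracking".split()),
    "t3": Counter("flight itinerary confirmation departure gate".split()),
    "t4": Counter("flight boarding pass departure gate".split()),
    "t5": Counter("weekly newsletter digest".split()),
}
seeds = {"t1": "purchase", "t3": "travel"}
doc_to_template = {"d1": "t1", "d2": "t2", "d3": "t2", "d4": "t4", "d5": "t5"}

graph = build_template_graph(template_terms)
template_labels = propagate_template_labels(graph, seeds, template_terms)
doc_labels = label_documents(doc_to_template, template_labels)
```

In this toy setup the graph has one node per template rather than one per document, which mirrors the compactness argument in the abstract: similarity is computed over five templates, and the labels then reach every matching document through a single dictionary lookup. A template with no sufficiently similar labeled neighbor (here `t5`) stays unlabeled, and so do its documents.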
