We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For details, please read our arXiv report.


Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.
Topic Modeling of Hierarchical Corpora.
Submitted for publication.

Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.
A Variational Approximation for Topic Modeling of Hierarchical Corpora.
In Proceedings of the 30th International Conference on Machine Learning (ICML 2013). Atlanta, GA.
[paper] [supplement] [bib]


Source code

You can find our implementation on GitHub.

Data sets

These are the data sets we used in the paper.
Flat corpora: [KOS] [Enron] [SubsetOfNYTimes]
Hierarchical corpora: [NIPS] [Freelancer] [BlackHatWorld]

Format of the data sets

The data sets above are represented in the following format.

First, the hierarchy of each corpus is described in 'hier_tree_structures.txt'. The first line of the file gives the number of categories, including the root node. Each subsequent line describes one category by listing its children, terminated by -1. These lines are composed as:
[category_id]\t[category_id_of_first_child] [category_id_of_second_child] ... -1\n
where the category id 0 is reserved for the root node.
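
As an illustration, the format above could be read with a short Python sketch like the one below. This is not part of our released implementation; the function name `parse_hier_tree` is hypothetical, and the sketch assumes that a category without children is written as its id followed directly by -1.

```python
def parse_hier_tree(text):
    """Parse the contents of 'hier_tree_structures.txt'.

    Returns (num_categories, children), where children maps each
    category id to the list of its child category ids.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    num_categories = int(lines[0])  # includes the root node (id 0)
    children = {}
    for line in lines[1:]:
        tokens = line.split()  # splits on the tab and on the spaces
        category_id = int(tokens[0])
        child_ids = []
        for token in tokens[1:]:
            value = int(token)
            if value == -1:  # -1 terminates the list of children
                break
            child_ids.append(value)
        children[category_id] = child_ids
    return num_categories, children
```

For a file on disk, the text can be obtained with `open('hier_tree_structures.txt').read()` and passed to the function.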

Second, each category has documents attached to it, and this mapping is given in 'category_to_docids.txt'. Each line lists the documents of one category, terminated by -1, and is composed as:
[document_id1] [document_id2] ... -1\n
where the document ids are 0-based. The first line corresponds to the root node (i.e. category id 0), the second line corresponds to category id 1, and so on.
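
A minimal Python sketch for reading this mapping is given below. The function name `parse_category_to_docids` is hypothetical, and the sketch assumes that a category with no attached documents appears as a line containing only -1, so that line positions stay aligned with category ids.

```python
def parse_category_to_docids(text):
    """Parse the contents of 'category_to_docids.txt'.

    Returns a list whose i-th entry is the list of 0-based document
    ids attached to category i (entry 0 is the root node).
    """
    mapping = []
    for line in text.splitlines():
        doc_ids = []
        for token in line.split():
            value = int(token)
            if value == -1:  # -1 terminates the list of document ids
                break
            doc_ids.append(value)
        mapping.append(doc_ids)
    return mapping
```

With this representation, `mapping[0]` holds the documents attached directly to the root node.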

Lastly, the document-term matrices are given in 'document_term_matrix.txt'. In some data sets, there is only one 'document_term_matrix.txt' that is shared across the different folds and splits; in the others, each fold and split has a separate 'document_term_matrix.txt'. Each 'document_term_matrix.txt' is composed as:
[# documents]
[# distinct terms]
[# distinct terms in the first document]
[term_id1]:[# occurrences]
[term_id2]:[# occurrences]
[# distinct terms in the second document]
[term_id1]:[# occurrences]
Note that the term ids are also 0-based.
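
The layout above could be read with a Python sketch along the following lines. The function name `parse_document_term_matrix` is hypothetical, and the sketch assumes that entries are separated by whitespace (including line breaks) and that each per-document count is followed by exactly that many `term_id:count` entries.

```python
def parse_document_term_matrix(text):
    """Parse the contents of 'document_term_matrix.txt'.

    Returns (num_distinct_terms, documents), where documents is a list
    with one dict per document mapping 0-based term id -> occurrences.
    """
    tokens = text.split()
    num_docs = int(tokens[0])
    num_distinct_terms = int(tokens[1])
    pos = 2
    documents = []
    for _ in range(num_docs):
        terms_in_doc = int(tokens[pos])  # distinct terms in this document
        pos += 1
        counts = {}
        for _ in range(terms_in_doc):
            term_id, occurrences = tokens[pos].split(":")
            counts[int(term_id)] = int(occurrences)
            pos += 1
        documents.append(counts)
    return num_distinct_terms, documents
```

Storing each document as a dict keeps the sparse structure of the file; a dense or sparse matrix can be built from it if needed.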