# Introduction

We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy.
We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation.
The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs).
For details, please read our arXiv report.

# Publication

Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.

Topic Modeling of Hierarchical Corpora.

Submitted for publication.

[arXiv]

Do-kyum Kim, Geoffrey M. Voelker and Lawrence K. Saul.

A Variational Approximation for Topic Modeling of Hierarchical Corpora.

In Proceedings of the 30th International Conference on Machine Learning (ICML 2013). Atlanta, GA.

[paper]
[supplement]
[bib]

# People

# Source code

You can find our implementation on GitHub.

# Data sets

These are the data sets we used in the paper.

Flat corpora:
[KOS]
[Enron]
[SubsetOfNYTimes]

Hierarchical corpora:
[NIPS]
[Freelancer]
[BlackHatWorld]

# Format of the data sets

The data sets above are represented in the following format.

First, the hierarchy of each corpus is described in 'hier_tree_structures.txt'.
The first line of the file gives the number of categories, including the root node.
Each subsequent line describes one category.
These lines have the form:

[category_id]\t[category_id_of_first_child] [category_id_of_second_child] ... -1\n

Category id 0 is reserved for the root node.
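
As an illustration of the layout above, here is a minimal Python sketch that reads the hierarchy into a dictionary of child lists. The function name `parse_hier_tree` and the sample data are hypothetical and not part of our released code. The sketch assumes that `-1` only marks the end of a child list and that a leaf category appears as a line with no children before the `-1`.

```python
def parse_hier_tree(lines):
    """Parse the lines of 'hier_tree_structures.txt'.

    Returns (num_categories, children), where children maps each
    category id to the list of its child category ids.
    """
    rows = [ln for ln in lines if ln.strip()]
    num_categories = int(rows[0])  # first line: category count, root included
    children = {}
    for row in rows[1:]:
        # Splitting on any whitespace handles both the tab and the spaces.
        tokens = [int(tok) for tok in row.split()]
        category_id, rest = tokens[0], tokens[1:]
        # Assumption: -1 terminates the child list.
        if -1 in rest:
            rest = rest[:rest.index(-1)]
        children[category_id] = rest
    return num_categories, children


# Hypothetical four-category hierarchy: root 0 -> {1, 2}, and 1 -> {3}.
sample_lines = ["4", "0\t1 2 -1", "1\t3 -1", "2\t-1", "3\t-1"]
num_categories, children = parse_hier_tree(sample_lines)
```

For a real data set, the same function could be applied to the lines of an opened file, e.g. `parse_hier_tree(open('hier_tree_structures.txt'))`.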

Second, the documents attached to each category are listed in 'category_to_docids.txt'.
Each line has the form:

[document_id1] [document_id2] ... -1\n

where the document ids are 0-based.
The first line corresponds to the root node (category id 0), the second line to category id 1, and so on.
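
The mapping described above could be read along the following lines. This is only a sketch: `parse_category_to_docids` and the sample data are hypothetical names introduced for illustration. It assumes that every category has its own line ending in `-1`, so a category without documents is a line holding only `-1`, and that blank lines carry no information.

```python
def parse_category_to_docids(lines):
    """Parse the lines of 'category_to_docids.txt'.

    Returns a list whose i-th entry is the list of (0-based) document ids
    attached to category id i; entry 0 is the root node.
    """
    docids_by_category = []
    for line in lines:
        tokens = [int(tok) for tok in line.split()]
        if not tokens:
            continue  # assumption: blank lines are not category lines
        # Assumption: -1 terminates the document list.
        if -1 in tokens:
            tokens = tokens[:tokens.index(-1)]
        docids_by_category.append(tokens)
    return docids_by_category


# Hypothetical example: the root has no documents, category 1 has two,
# and category 2 has one.
sample_mapping = ["-1", "0 2 -1", "1 -1"]
docids = parse_category_to_docids(sample_mapping)
```

With this representation, `docids[c]` gives the documents of category `c`, which lines up with the category ids used in 'hier_tree_structures.txt'.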

Lastly, the document-term matrices are given in 'document_term_matrix.txt'.
In some data sets, a single 'document_term_matrix.txt' is shared across all folds and splits;
in the others, each fold and split has its own 'document_term_matrix.txt'.
Each 'document_term_matrix.txt' file has the form:

[# documents]

[# distinct terms]

[# distinct terms in the first document]

[term_id1]:[# occurrences]

[term_id2]:[# occurrences]

...

[# distinct terms in the second document]

[term_id1]:[# occurrences]

...

Note that the term ids are also 0-based.
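
To make the layout of 'document_term_matrix.txt' concrete, the following Python sketch reads it into one sparse term-count dictionary per document. The name `parse_document_term_matrix` and the sample text are hypothetical. The sketch tokenizes on whitespace, so it does not depend on whether the entries of a document are separated by newlines or by spaces; that tolerance is a choice of this sketch, not a statement about the files.

```python
def parse_document_term_matrix(text):
    """Parse the contents of a 'document_term_matrix.txt' file.

    Returns (num_terms, docs), where docs[d] maps each (0-based) term id
    occurring in document d to its number of occurrences.
    """
    tokens = iter(text.split())
    num_docs = int(next(tokens))   # [# documents]
    num_terms = int(next(tokens))  # [# distinct terms]
    docs = []
    for _ in range(num_docs):
        num_distinct = int(next(tokens))  # distinct terms in this document
        counts = {}
        for _ in range(num_distinct):
            term_id, occurrences = next(tokens).split(":")
            counts[int(term_id)] = int(occurrences)
        docs.append(counts)
    return num_terms, docs


# Hypothetical corpus: 2 documents over a vocabulary of 5 terms.
sample_text = "2\n5\n2\n0:3\n4:1\n1\n2:7\n"
num_terms, docs = parse_document_term_matrix(sample_text)
```

Storing each document as a dictionary keeps the representation sparse; converting to a dense or compressed matrix is left to whatever downstream library is used.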