Michael Paul - Multi-Faceted Topic Modeling Package

Multi-Faceted Topic Modeling Package

This software includes implementations of cross-collection latent Dirichlet allocation (ccLDA) and the topic-aspect model (TAM), both introduced and described in the papers listed below, as well as an LDA implementation. See the included README for open licensing information, usage instructions, and input/output formatting guidelines.

TAM and ccLDA belong to the class of multi-faceted topic models, which learn how topics vary along some other variable, such as a document's collection, the author's perspective, or a time span. For example, topics in scientific literature might appear across multiple disciplines but in different ways in each field, so the topical words in a document would also depend on the paper's primary discipline. Similarly, topics found in reviews and editorials might be expressed differently depending on the author's perspective or sentiment. ccLDA captures these properties with explicit document labels, while TAM tries to learn this other dimension automatically as a latent variable. In more recent work we found TAM to be useful for unsupervised viewpoint clustering.
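The role of the explicit document label can be illustrated with a toy generative sketch in the spirit of ccLDA. This is not the packaged implementation: the function names, the argument layout, and the fixed switching probability `p_specific` are assumptions made for illustration, and the papers listed below define the actual models. Each token first picks a topic, then draws its word either from a distribution shared by all collections or from one specific to the document's labeled collection. In TAM the analogous second dimension would be latent and inferred rather than supplied as a label.

```python
import random


def sample_index(weights, rng):
    """Draw an index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_document(collection, theta, shared_topics, collection_topics,
                      p_specific, length, rng):
    """Generate one toy document under a ccLDA-style generative story.

    All argument names are hypothetical and chosen for this sketch.
    collection:              observed integer label of the document's collection
    theta:                   the document's topic proportions
    shared_topics[z]:        word weights for topic z shared by all collections
    collection_topics[c][z]: word weights for topic z specific to collection c
    p_specific:              probability that a token uses the
                             collection-specific distribution (assumed fixed here)
    """
    tokens = []
    for _ in range(length):
        z = sample_index(theta, rng)                 # choose a topic
        use_specific = rng.random() < p_specific     # choose shared vs. specific
        if use_specific:
            w = sample_index(collection_topics[collection][z], rng)
        else:
            w = sample_index(shared_topics[z], rng)
        tokens.append((z, use_specific, w))
    return tokens


if __name__ == "__main__":
    vocab = ["model", "data", "gene", "protein"]
    shared = [[5, 5, 1, 1]]                          # one topic, shared weights
    specific = [[[4, 4, 0, 0]], [[0, 0, 4, 4]]]      # two collections, one topic
    rng = random.Random(0)
    for c in (0, 1):
        doc = generate_document(c, [1.0], shared, specific, 0.5, 8, rng)
        print(c, [vocab[w] for _, _, w in doc])
```

Keeping the shared and collection-specific word weights in separate tables mirrors the idea described above: the same topic appears in every collection, but part of its vocabulary depends on the collection the document belongs to.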

This implementation provides fairly minimal functionality. In particular, it does not provide a method for running inference on new documents, and it does not allow asymmetric topic priors. I may or may not add these features in a future release. (Note: this implementation of ccLDA does not learn an asymmetric alpha matrix as in the original paper. We found that it mostly learned sparse priors and that this feature was not very important. If you need this functionality, it can be found in this older implementation of ccLDA.)

Please contact me if you find any bugs or errors. It may be a good idea to check back occasionally for updates, especially in case bugs are discovered and fixed.

Revision History

5/07/2011 - v0.16 - Fixed a bug related to the previous bug correction. TAM was not behaving correctly under the previous version. If you are using v0.15, please update to this fixed version.
4/14/2011 - v0.15 - Two bug corrections, both for the TAM model. First, an array for aspect counts was not allocated to the correct size, which would have caused the program to crash under certain parameter settings (if Z > Y). Second, two of the counts used in computing sampling probabilities for the x/l variables were switched. Many thanks to Shima Gerani for pointing these out.
11/11/2010 - v0.1 - First release.

References

Michael Paul and Roxana Girju. A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics. In the proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI-10), pages 545-550, Atlanta, Georgia. July 2010.
Michael Paul and Roxana Girju. Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models. In the proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 1408-1417, Singapore. August 2009.