Michael J. Paul - Homepage

Here is an archive of my academic work (see also Google Scholar):

Code

SPRITE
Factorial LDA
Topic modeling for document sequences

Java implementations of the Block HMM and Mixed Membership Markov Model (M4)

Multi-Faceted Topic Modeling Package

Java implementations of Cross-Collection LDA and the Topic-Aspect Model

Carmen: Geolocation for Twitter

Data

Zika Tweets, 2015-2016

Data used in Pruss et al., PLOS ONE 2019.

Tweets for Survey Prediction

Data used in Benton et al., AAAI 2016.

Weibo Air Pollution Dataset

Data used in Wang et al., JMIR 2015.

Health Tweets

Data used in Paul and Dredze, PLOS ONE 2014. Includes health-related tweet IDs and ATAM output.

Doctor Review Dataset

Data used in Wallace et al., JAMIA 2014.

Influenza Twitter Annotations

Data used in Lamb et al., NAACL 2013. Tweet IDs annotated with flu relevance.

Health Twitter Annotations

Data used in Paul and Dredze, ICWSM 2011. Tweet IDs annotated with health relevance.

Cross-Cultural Blog and Forum Dataset

Data used in Paul and Girju, EMNLP 2009.

Other Resources

Code from my group's recent projects is available on the GitHub pages of Xiaolei Huang and Yoshinari Fujinuma.
Some annotated tweets are available from the SMM4H shared tasks.
See Mark Dredze's website for other code and data that I have worked with.