Topic Modeling: New Software and a Wrap-up of our NEH-Sponsored Workshop

Topic modeling in (and on) the humanities has been the subject of a number of blog posts and online conversations over the past few weeks. These include this article by Andrew Goldstone and Ted Underwood, which provides a very clear introduction to the method and outlines a set of experiments on PMLA, and a series of posts by Jon Goodwin that walk through experiments on texts from JSTOR's Data for Research in useful detail. The comments on these blog posts are well worth reading, as are the parallel discussions on Twitter, such as these responses to a recent question by Matt Burton about appropriate techniques for measuring the similarity of documents or topics.

Both Jon and Matt were participants at MITH's NEH-funded topic modeling workshop last month, and Jennifer Guiliano and I would like to thank all of the speakers and attendees once more, and to point again to some of the many follow-up blog posts and other documents that came out of the workshop (please comment or contact me if I've missed something you've written that you'd like to see listed here):

Some reflections by Thomas Padilla
A wrap-up by Sarita Alami
Some questions by Trevor Owens (who was not at the workshop, although many of the commenters on this post were)
A collection of Tweets by Scott Kleinman
And another set by Natalia Cecire, with a focus on issues of labor
Detailed notes by Brian Croxall
A Digital Dialogue at MITH by Lisa Rhody on the use of topic modeling in her dissertation project

We'd also like to announce a small topic modeling library and toolkit that MITH is releasing and will continue to develop over the next few months. This library is written in the Scala programming language and currently serves primarily as a lightweight wrapper for MALLET. It pulls together bits of functionality and code that we at MITH found ourselves developing for various projects with a topic modeling component, including a graduate course project on the Gothic novel and science fiction, Lisa Rhody's work on ekphrastic poetry, Amanda Visconti's work on visualizing Digital Humanities Quarterly, and the Foreign Literatures in America project.

One simple piece of functionality that we've found widely useful is a command-line tool that exports data from a model file generated by MALLET to a spreadsheet that can be opened in Excel or LibreOffice. While Excel is in many ways a less sophisticated data analysis platform than tools like R, it is widely used and has a relatively shallow learning curve. For example, this tool has allowed us to train a topic model and hand the results as a spreadsheet to a group of undergraduate students, who can then easily identify the documents in their corpus most strongly associated with a particular topic, or find the documents that are most similar (according to the topic model) to a particular text they are reading.
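To give a sense of the kind of questions that spreadsheet makes it easy to ask, here is a minimal sketch in Python of the two tasks mentioned above. It is not part of our toolkit, and everything in it is an assumption made for illustration: the column layout (one row per document, one column per topic proportion), the file names and numbers in the sample data, and the helper function names are all hypothetical. The choice of cosine similarity is also just one reasonable option among several for comparing documents by their topic proportions.

```python
import csv
import io
import math

# Hypothetical export: one row per document, one column per topic proportion.
# The real spreadsheet produced by the tool may be laid out differently.
SAMPLE_CSV = """document,topic_0,topic_1,topic_2
frankenstein.txt,0.70,0.20,0.10
dracula.txt,0.60,0.30,0.10
time_machine.txt,0.10,0.20,0.70
war_of_the_worlds.txt,0.05,0.15,0.80
"""


def load_doc_topics(text):
    """Parse CSV text into a dict mapping document name to topic proportions."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["document"]: [float(v) for k, v in row.items() if k != "document"]
        for row in reader
    }


def top_documents(doc_topics, topic, n=2):
    """Return the n documents with the highest proportion of the given topic."""
    ranked = sorted(doc_topics, key=lambda d: doc_topics[d][topic], reverse=True)
    return ranked[:n]


def cosine(a, b):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(doc_topics, name, n=1):
    """Return the n documents closest to `name` by cosine similarity."""
    target = doc_topics[name]
    others = [d for d in doc_topics if d != name]
    ranked = sorted(others, key=lambda d: cosine(target, doc_topics[d]), reverse=True)
    return ranked[:n]


docs = load_doc_topics(SAMPLE_CSV)
print(top_documents(docs, topic=2))            # documents strongest in topic 2
print(most_similar(docs, "frankenstein.txt"))  # nearest neighbor of one text
```

In a spreadsheet, the first task amounts to sorting by a topic column; the second takes a little more work, which is one reason a small script can be a useful companion to the exported file.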

The project currently doesn't go out of its way to insulate users from the command line, but it is designed to be easy to install and use. It relies on the Maven build tool to manage dependencies, so you can, for example, run MALLET's topic modeling engine (with reasonable defaults) without manually installing MALLET on your machine. If you're on a Mac with OS X, you already have Maven installed, and if you run Windows or Linux, installing Maven is fairly painless.

If you're curious about topic modeling and are willing to roll up your sleeves and open a terminal, we'd encourage you to click the link above and try out this software. This is very much a work in progress, so let us know about features you'd like to see—either in a comment here or by creating a new issue in the GitHub repository—and we'll do our best to get them implemented. And watch this space—in the spring we'll be launching a sandbox environment that will allow users to run MALLET and other topic modeling tools without installing any software on their local machine.