In a recent post to this blog, Sayan Bhattacharyya described his contributions to the Woodchipper project in the context of a broader discussion about corpus-based approaches to humanities research. Topic modeling, the statistical technology undergirding Woodchipper, has garnered increasing attention as a tool of hermeneutic empowerment, a method for drawing structure out of a corpus on the basis of minimal critical presuppositions. In this post I map out a basic genealogy of topic modeling in the humanities, from the highly cited paper that first articulated Latent Dirichlet Allocation (LDA) to recent work at MITH.
The Story of Topic Modeling
The original LDA topic modeling paper, the one that defined the field, was published by Blei, Ng, and Jordan in 2003. The basic story is one of assumptions, and it goes like this: First, assume that each document is made up of a random mixture of categories, or topics. Now, suppose each category is defined by its preference for some words over others. Finally, let’s pretend we’re going to generate each word in each document from scratch. Over and over again, we randomly choose a category, then we randomly choose a word based on the preferences of that category.
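The generative story can be sketched in a few lines of Python. Everything here is invented for illustration: two toy topics with made-up word preferences, and a document mixture chosen by hand. The point is only the two nested random choices the story describes, topic first, then word.

```python
import random

random.seed(42)

# Two hypothetical topics, each a preference (probability) over a tiny
# vocabulary. These numbers are invented, not drawn from any real corpus.
topics = {
    "gardening": {"seed": 0.5, "plant": 0.3, "bank": 0.2},
    "finance":   {"bank": 0.6, "loan": 0.3, "seed": 0.1},
}

def generate_document(topic_mixture, length=8):
    """Generate a document word by word, per the LDA generative story:
    randomly choose a topic from the document's mixture, then randomly
    choose a word based on that topic's preferences."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        word_probs = topics[topic]
        word = random.choices(list(word_probs),
                              weights=list(word_probs.values()))[0]
        words.append(word)
    return words

# A document that is 70% gardening, 30% finance.
doc = generate_document({"gardening": 0.7, "finance": 0.3})
print(" ".join(doc))
```

Topic modeling runs this story in reverse: given only the documents, it infers which mixtures and word preferences would most plausibly have produced them.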
Obviously the corpus wasn’t actually generated this way. Barring cyborg intervention, it was probably written down by a person or group of people. However, topic modeling calls on us to suspend our disbelief. Let’s just suppose the corpus was generated entirely through this process. Then, given that the corpus is what it is, what are the most likely underlying affinities between words and between categories? Topic modeling infers a plausible answer under the assumption that the “generative story” I told a paragraph ago is true.
Skepticism is warranted, but the proof of the pudding is in the eating. As an aside to his work modeling Martha Ballard’s diary, Cameron Blevins (Ph.D. candidate in American History, Stanford University) frankly acknowledges that a newcomer can benefit from topic modeling by using a standard toolkit like MALLET (developed at the University of Massachusetts Amherst): “I don’t pretend to have a firm grasp on the inner statistical/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard’s diary, it worked. Beautifully.”
A tool like MALLET gives a sense of the relative importance of topics in the composition of each document, as well as a list of the most prominent words in each topic. The word lists define the topics – it’s up to practitioners to discern meaning in the topics and to give them names. For example, Blevins identifies the theme of “housework” in two topics, and then shows that the prevalence of these topics in the corpus increases over the life-span of Martha Ballard’s diary. Although a correlation between housework and age might seem counterintuitive, it turns out to corroborate the definitive critical commentary on the diary.
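The division of labor described above can be sketched in plain Python. The word-topic counts below are invented, loosely evoking the diary example; a real tool like MALLET would estimate such a table from a corpus. The sketch shows what the model actually hands back, ranked word lists, and why naming the topics remains the practitioner's job.

```python
# Hypothetical word-topic counts, as a fitted model might tabulate them
# (columns are two topics; all numbers invented for illustration).
word_topic_counts = {
    "wool":     [40, 2],
    "spinning": [35, 1],
    "garden":   [30, 3],
    "fever":    [2, 50],
    "pulse":    [1, 45],
    "dose":     [3, 38],
}

def top_words(topic_index, n=3):
    """Rank the vocabulary by its count under one topic."""
    ranked = sorted(word_topic_counts,
                    key=lambda w: word_topic_counts[w][topic_index],
                    reverse=True)
    return ranked[:n]

# The model only produces the ranked lists; discerning a theme in them
# ("housework", "illness", ...) is left to the reader.
for k in range(2):
    print(f"Topic {k}: {', '.join(top_words(k))}")
```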
In this or similar fashion, the “topic proportions” assigned to each document are often used, in conjunction with topic word lists, to draw comparative insights about a corpus. Boundary lines are drawn around document subsets and topic proportions are aggregated within each piece of territory. In chronological applications like Blevins’ study of the diary, it is often advantageous to draw boundary lines in time. Griffiths and Steyvers (2004) illustrate how to register temporal changes in topic composition using the output from basic LDA. The work of Newman and Block (2006) with the eighteenth-century Pennsylvania Gazette corpus was perhaps the first diachronic application of topic modeling in the humanities.
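Drawing boundary lines in time and aggregating within them can be sketched as follows. The per-entry topic proportions and topic names here are hypothetical, roughly in the spirit of the diary example; a real study would read them from a tool's output.

```python
from collections import defaultdict

# Hypothetical per-entry topic proportions, each tagged with a year
# (all values invented for illustration).
entries = [
    {"year": 1790, "housework": 0.20, "midwifery": 0.80},
    {"year": 1790, "housework": 0.30, "midwifery": 0.70},
    {"year": 1805, "housework": 0.60, "midwifery": 0.40},
    {"year": 1805, "housework": 0.70, "midwifery": 0.30},
]

def mean_proportion_by_year(topic):
    """Average one topic's proportion within each yearly 'territory'."""
    totals, counts = defaultdict(float), defaultdict(int)
    for entry in entries:
        totals[entry["year"]] += entry[topic]
        counts[entry["year"]] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

print(mean_proportion_by_year("housework"))
```

A rising trend in the yearly means is exactly the kind of pattern Blevins reads against the critical commentary on the diary.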
Into a DH Frame: Graphs, Maps, and Trees
Applications of topic modeling in the digital humanities are sometimes framed within a “distant reading” paradigm, for which Franco Moretti’s Graphs, Maps, Trees (2005) is the key text. Robert K. Nelson, director of the Digital Scholarship Lab and author of the Mining the Dispatch project, explains that “the real potential of topic modeling . . . isn’t at the level of the individual document. Topic modeling, instead, allows us to step back from individual documents and look at larger patterns among all the documents, to practice not close but distant reading, to borrow