Digging into Data with Topic Models

I am a graduate student from the University of Michigan, interning this summer at MITH on the topic modeling project that is underway here. In this post, I will describe the "what" and "why" of my work so far, and I will try to put it in the wider context of corpus-based approaches.

R&D software developer Travis Brown and others at MITH have developed Woodchipper, a visualization tool that runs the MALLET package (developed at the University of Massachusetts at Amherst) to perform topic modeling on a selected corpus, and then displays the results of a principal component analysis. An attractive feature of Woodchipper is that it is oriented towards "drilling down," a concept that is particularly relevant to the digital humanities. Those of us who "do" humanities pride ourselves on being close readers of texts. To appeal to humanists, topic modeling, insofar as we can think of it as a method of "distant reading," will need to be combined with close reading. Woodchipper lets the humanist scholar view individual texts by displaying each page of a text as a clickable data point on a two-dimensional graph; the spatial layout of the graph is determined by the results of the principal component analysis.
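For readers who want a feel for the layout step, here is a minimal sketch in Python of how per-page topic proportions could be projected onto two dimensions with principal component analysis. It is not Woodchipper's code: the function names, the power-iteration approach, and the toy topic proportions are all assumptions made for illustration, and a real tool would more likely call a linear algebra library.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def _mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def _leading_eigenvector(matrix, iterations=500):
    """Approximate the dominant eigenvector of a symmetric matrix by power iteration."""
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch).
    vec = _normalize([float(i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        nxt = _normalize(_mat_vec(matrix, vec))
        if all(x == 0 for x in nxt):
            break  # no variance left in any direction
        vec = nxt
    eigenvalue = sum(a * b for a, b in zip(vec, _mat_vec(matrix, vec)))
    return vec, eigenvalue


def project_to_2d(page_topics):
    """Hypothetical helper: map each page's topic proportions to (x, y) coordinates.

    page_topics is a list of equal-length lists, one per page (at least two pages).
    """
    n, dims = len(page_topics), len(page_topics[0])
    means = [sum(row[j] for row in page_topics) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in page_topics]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]

    axes = []
    for _ in range(2):
        axis, eigenvalue = _leading_eigenvector(cov)
        axes.append(axis)
        # Deflate: remove the component just found so the next pass finds the second axis.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(dims)]
               for i in range(dims)]

    return [[sum(a * b for a, b in zip(row, axis)) for axis in axes] for row in centered]


# Toy data (invented): four pages described by proportions over three topics.
pages = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
coords = project_to_2d(pages)
```

In this toy data, the first two pages have similar topic proportions, so they should land near each other on the graph, while the third and fourth pages should sit farther away.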

Visualizing the connection between the "distant" aspect of a text (its high-level attributes, or "topics") and its "close" aspect (its individual words) is crucial. The challenge is this: why should the researcher trust the high-level attributes that the model assigns to the text? Users are likely to trust the high-level properties discovered by the topic model only if the visualization bridges the gap between the topic level and the word level by clearly displaying how the two are related.

We therefore decided to make the visualization richer and more expressive. Previously, Woodchipper displayed only a specified number of topics, those judged by the algorithm to be the best topics for the page. The visualization represented each topic as a list of the first few words that were most representative of that topic. However, representing a topic by a fixed handful of words is misleading: even if those are the highest-probability words in the topic, the share of the topic's probability mass that they account for can differ greatly from one topic to another. It would be more logical, and more expressive, to represent a topic by those words which together add up to a specified fraction of the total probability mass. Doing so requires changing the server-side Scala code, which supplies these words to the Woodchipper client when it accesses a topic.
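The selection rule described above can be sketched in a few lines. The example below is in Python rather than the Scala used on the Woodchipper server, and the function name, the dictionary representation of a topic, and the word probabilities are all invented for illustration.

```python
def words_covering_mass(topic, mass_fraction):
    """Hypothetical helper: return the highest-probability words of a topic
    whose probabilities together reach `mass_fraction` of the topic's total mass.

    topic maps each word to its probability (or unnormalized weight).
    """
    if not 0 < mass_fraction <= 1:
        raise ValueError("mass_fraction must be in the interval (0, 1]")

    total = sum(topic.values())
    target = mass_fraction * total
    # Rank words from most to least probable.
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)

    selected, covered = [], 0.0
    for word, probability in ranked:
        selected.append(word)
        covered += probability
        # Small tolerance so floating-point rounding does not add an extra word.
        if covered >= target - 1e-12:
            break
    return selected


# Two invented topics with the same vocabulary size but different shapes.
peaked_topic = {"whale": 0.6, "sea": 0.2, "ship": 0.1, "boat": 0.1}
flat_topic = {"love": 0.25, "heart": 0.25, "grief": 0.25, "hope": 0.25}

peaked_words = words_covering_mass(peaked_topic, 0.5)
flat_words = words_covering_mass(flat_topic, 0.5)
```

With a fixed-length word list, both topics would be shown with the same number of words. With the mass-based rule, a sharply peaked topic needs fewer words than a flat one to cover the same fraction, so the length of the list itself tells the reader something about the topic.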

We also realized that a further change was needed. Each page was too small a unit, so that very often no word on the page actually matched the words of the topics assigned to that page. We concluded that we probably need to break the documents up into larger units, in order to show the user a more trustworthy picture of how the top level ("topics") connects with the bottom level ("words on a specific page") when we metaphorically "drill down" from top to bottom.
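To illustrate why larger units help, here is a small Python sketch that merges consecutive pages into bigger chunks and checks which topic words actually occur in a given chunk. The helper names, the whitespace tokenizer, and the sample pages are assumptions for this example, not part of Woodchipper.

```python
def merge_pages(pages, pages_per_unit):
    """Hypothetical helper: join consecutive pages into larger units of text."""
    if pages_per_unit < 1:
        raise ValueError("pages_per_unit must be at least 1")
    return [" ".join(pages[i:i + pages_per_unit])
            for i in range(0, len(pages), pages_per_unit)]


def matched_topic_words(text, topic_words):
    """Return the topic words that occur in the text (naive whitespace tokenizer)."""
    tokens = set(text.lower().split())
    return sorted(word for word in topic_words if word.lower() in tokens)


# Invented sample data: four very short "pages" and one topic's word list.
pages = [
    "the whale surfaced",
    "calm sea today",
    "the ship sailed",
    "home at last",
]
topic_words = ["whale", "sea", "ship"]

matches_per_page = [matched_topic_words(page, topic_words) for page in pages]
units = merge_pages(pages, 2)
matches_per_unit = [matched_topic_words(unit, topic_words) for unit in units]
```

In this sample, the last page contains none of the topic's words, whereas every two-page unit contains at least one, which is the kind of improvement we hope larger units will bring. A real implementation would need a proper tokenizer that handles punctuation and word forms.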

Stay tuned to the MITH blog for further posts over the course of the summer from me and my fellow graduate intern, Clay Templeton.