I recently came across a 1991 interview with the literary critic Harold Bloom (Sterling Professor of Humanities at Yale University) in The Paris Review, in which Bloom remarks:

"As far as I’m concerned, computers have as much to do with literature as space travel, perhaps much less."

Since coming here (to MITH) as an intern this summer, I have learned about several projects that just might make Bloom change his mind. In the last few weeks, I have been working with Travis Brown and Clay Templeton here at MITH on one such project, which draws on cutting-edge work on topic modeling currently being done in the University of Maryland’s Computer Science department. Clay has already written about the project in his previous blog post, so I will use this opportunity to share some reflections of my own.

The question of “scale” has been on my mind over the past couple of weeks. We are processing vast amounts of text data; indeed, topic modeling is the kind of approach whose power of discovery is predicated on the assumption that vast amounts of textual data will be available for it to run on. It makes me pause to reflect that the expectation that these approaches will continue to “scale up” quantitatively, and so become more prominent and visible in the coming years, rests on some deeper technological and social assumptions. Their continued success will depend on Moore’s Law continuing to hold (that is, on more and more processing power becoming available more and more cheaply), and also on the willingness, and the legal ability, of the libraries and institutions that own such vast repositories of texts to make them available in computer-readable formats.

Earlier, our group here at MITH was working with the “unsupervised” topic modeling approach, in which no knowledge of the content of the text is really needed: the algorithm simply cranks away at whatever text corpus it is given and discovers topics in it. For the last week or so, though, we have focused on the cutting-edge “supervised” topic modeling approach that is being developed by a research group in the Computer Science department here at Maryland. The idea in “supervised” topic modeling is to “train” the algorithm by making use of domain knowledge. For example, alongside the Civil War era newspaper archive we are working with, we are drawing on related pieces of knowledge that come from contemporaneous sources external to our corpus, such as the casualty rate for each week and the Consumer Price Index for each month. The hope is that the algorithm will discover more “meaningful” topics if it has a way to use feedback about how well the topics it discovers are associated with a parameter of interest. Thus, if we are trying to bias the algorithm toward topics that pertain more directly to the Civil War and its effects, it makes sense to align such external data (in our case, casualty figures and economic figures for the era, which have a provenance outside the text corpus) with the documents in the corpus. This is where the “qualitative” scale becomes important.
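To make the contrast concrete, here is a minimal sketch, in Scala (the language I picked up earlier in the internship), of the shape the training data takes in the two settings. The case classes, names, and numbers below are illustrative assumptions of my own, not the actual code, data, or API of the model being developed at Maryland.

```scala
import java.time.LocalDate

// A newspaper issue (or article) reduced to a bag of words, plus its publication date.
case class Document(date: LocalDate, tokens: Seq[String])

// Unsupervised topic modeling sees only the documents themselves: topics are
// inferred purely from patterns of word co-occurrence across the corpus.
val unsupervisedInput: Seq[Document] = Seq(
  Document(LocalDate.of(1862, 9, 20), Seq("battle", "casualties", "army", "general")),
  Document(LocalDate.of(1862, 9, 27), Seq("cotton", "prices", "market", "gold"))
)

// Supervised topic modeling additionally attaches a response variable to each
// document (for us, something like that week's casualty rate or that month's
// price index), so the model is nudged toward topics that help explain it.
case class LabeledDocument(doc: Document, response: Double)

val supervisedInput: Seq[LabeledDocument] = Seq(
  LabeledDocument(unsupervisedInput(0), response = 0.87), // placeholder response values
  LabeledDocument(unsupervisedInput(1), response = 0.12)
)
```

The point of the second form is simply that each document carries a number the model can try to account for, and that is what gives the supervised approach its extra leverage.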

The more intelligently we try to leverage the power of these approaches, the more the number of areas with which a successful practitioner of this kind of topic modeling must have at least a passing acquaintance “scales” up. This made me think about how people trained in information science, which is a truly interdisciplinary field, are well-positioned to do this kind of work. Over the last week, for example, I read several papers on the economic history of the Civil War, to which we were pointed by Robert K. Nelson, a historian at the University of Richmond who has worked on topic modeling and history. Who would have guessed that one would have to read Civil War papers in the course of a summer internship in Information Science? I aligned the economic data with the text corpus and, based on what the data seemed to be telling us, came up with a design for some experiments to test a few hypotheses, which we will carry out over the next few days.
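For anyone curious what that alignment step involves in practice, here is a rough Scala sketch of the kind of date-keyed join I mean. The week and month keys, lookup tables, and figures are placeholders of my own invention, standing in for the historical casualty and price series described above.

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

case class Issue(date: LocalDate, tokens: Seq[String])

// External time series, keyed by (year, week) and (year, month).
// The keys and values here are placeholders, not the real historical figures.
val weeklyCasualties: Map[(Int, Int), Double] = Map((1862, 38) -> 4500.0)
val monthlyCpi: Map[(Int, Int), Double] = Map((1862, 9) -> 104.2)

val weekFields = WeekFields.of(Locale.US)

def weekKey(d: LocalDate): (Int, Int) =
  (d.get(weekFields.weekBasedYear()), d.get(weekFields.weekOfWeekBasedYear()))

def monthKey(d: LocalDate): (Int, Int) = (d.getYear, d.getMonthValue)

// Pair each issue with the covariates for its week and month; issues whose
// dates fall outside the external series are simply dropped.
def align(issues: Seq[Issue]): Seq[(Issue, Double, Double)] =
  issues.flatMap { issue =>
    for {
      casualties <- weeklyCasualties.get(weekKey(issue.date))
      cpi        <- monthlyCpi.get(monthKey(issue.date))
    } yield (issue, casualties, cpi)
  }
```

The real alignment, of course, runs over the whole corpus and the full weekly and monthly series, but the shape of the join by date is the same.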

Also, in a piece of exciting news, the paper proposal that we (Travis, Clay, and I) submitted to the “Making Meaning” graduate student conference, organized by the Program in Rhetoric in the English Department at the University of Michigan, has been accepted. The presentation will reflect on how one might situate approaches like topic modeling in the context of literary theory and philosophy. This, too, is an example of how, as “information scientists,” we must see and think in terms of the “big picture”: that is, we must scale up to the big picture.

P.S. Now that this post has turned out to be a reflection on the question of scale, it has just occurred to me that it is also fitting that the programming language I learned during the earlier part of the internship was none other than Scala!