Topic Modeling

Agenda

9:00-9:30 am Registration and Refreshments

9:30-9:45 am Welcome and Setting of Workshop Goals: Jennifer Guiliano, Maryland Institute for Technology in the Humanities

Humanities Applications

9:45-10:15 am Presentation: Matthew Jockers, Department of English and Center for Digital Research in the Humanities, University of Nebraska

Title: Thematic Change and Authorial Innovation in the 19th Century Novel

Abstract: This presentation explores my use of latent Dirichlet allocation (topic modeling) to extract 500 themes from a corpus of 3,500 19th-century novels. By linking the derivative topical data to bibliographic metadata about the texts (their publication dates but also information about the gender and nationality of the authors), macroscale trends across time, nationality, and gender can be observed and employed to contextualize anecdotal readings of the individual texts in the corpus. The approach allows us to see and measure thematic innovation across the century and also according to author nationality and author gender. In the end, this larger context of thematic change allows us to better understand and assess the contributions of specific writers such as Jane Austen, Herman Melville, and Maria Edgeworth.

10:15-10:45 am Presentation: Robert Nelson, Department of American Studies and Digital Scholarship Lab, University of Richmond

Title: Analyzing Nationalism and Other Slippery “Isms”

Abstract: Franco Moretti’s concept of “distant reading” has been all but ubiquitous in the rationales historians, literary critics, and other digital humanists have provided for different text-mining techniques. Practitioners of text mining have seized upon both Moretti’s methodological challenge and his memorable phrase, claiming that text mining offers the tantalizing prospect of addressing hugely ambitious interpretive questions by using computation to analyze massive amounts of digitized text. This presentation will take a somewhat different tack. Using topic modeling to chart the relative prominence of nationalistic rhetoric in two newspapers during the course of the American Civil War, this presentation will use those charts to infer the particular instrumental political purposes of that rhetoric for different individuals and groups at different moments in time. In doing so, it will suggest that topic modeling enables us to develop subtler—one might even say closer—readings of something as abstract and amorphous as an ideological discourse like nationalism.

10:45-11:30 am Discussion (all): Facilitator, Katrina Fenlon, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

11:30-11:45 am Break

Models and Extensions

11:45-12:15 pm Presentation: Jordan Boyd-Graber, School of Information Studies and Institute for Advanced Computer Studies, University of Maryland

Title: Incorporating Human Knowledge and Insights into Probabilistic Models of Text

Abstract: Topic models aren’t perfect; errors hamper adoption, degrade performance in downstream computational tasks, and prevent users from making sense of large datasets. I begin by explaining that the answer is not simply to build fancier statistical models, as the usefulness of topic models does not always correlate with likelihood, the traditional objective function of statistical models. Given this understanding, what can be done to improve topic models?

As an attempt to answer this question, I will present models that incorporate human knowledge into topic models via ontologies, through direct interaction with users, or by mimicking social processes. After describing the statistical formalisms that allow these knowledge repositories to be seamlessly integrated and the associated computational challenges, I demonstrate how these models can contribute to real-world natural language processing tasks such as classifying documents, predicting sentiment, segmenting text by topic, and detecting who is influential in a conversation.

12:15-12:45 pm Discussion (all): Facilitator, Scott Weingart, School of Library and Information Science, Indiana University

12:45-2:00 pm Catered Lunch: 6137 McKeldin Library

Implementations and Tools

2:00-2:30 pm Presentation: Jo Guldi, Department of History, Brown University and Christopher Johnson-Roberson, Ethnomusicology, Brown University

Title: Paper Machines: A Tool for Analyzing Large-Scale Digital Corpora

Abstract: Paper Machines, developed in the summer of 2012 through a Harvard grant and a grant from Google Summer of Code, is designed to help scholars parse through large sets of information. Drawing upon current work in computer science on topic modeling and visualization, it generates iterative, time-dependent visualizations of what a hand-curated body of texts talks about and how that changes over time. At a conceptual level, we believe that this tool will be a powerful resource for examining large quantities of information and will allow knowledge seekers to consider a broader, richer, often ignored corpus of text. In doing so, we hope to enlist the power of digital humanities to tame the pile of paper, and redistribute the power that “official” paper took away.

2:30-3:00 pm Presentation: David Mimno, Department of Computer Science, Princeton University

Title: The details: how we train big topic models on lots of text

Abstract: It’s possible to treat a topic model as a black box: text goes in, topics come out. But to really understand what a model is telling you about your corpus, you need to understand what the algorithm is actually doing. In this talk I’ll cover some of the different choices we can make in training models and what their implications are for efficiency, scalability, and topic quality. As a running example, I’ll use the Mallet topic modeling package. Issues covered will include Gibbs sampling, variational inference, hyperparameters, and automated quality diagnostics. No particular mathematical background is required.

3:00-3:30 pm Discussion (all): Facilitator, Travis Brown, Maryland Institute for Technology in the Humanities

3:30-3:45 pm Break

3:45-5:00 pm Topic Modeling in the Humanities Roundtable Discussion (all): Facilitator, David Blei, Department of Computer Science, Princeton University

 

*Subject to change