Asking Questions of Lots of Text with Weka

Adrian Hamins-Puertolas and Adam Elrafei are students in Team POLITIC, an undergraduate research team in the University of Maryland’s GEMSTONE research-focused honors college, mentored by MITH Faculty Fellow Peter Mallios.

Our undergraduate research team uses newly developed technology to understand and quantify how American audiences received Russian authors in the early 1920s. One of the tools we’re exploring is Weka, a collection of machine-learning algorithms that can be used to mine datasets. MITH has helped us design and construct our database, which contains thousands of articles about Russian authors that appeared in American literary magazines during the 1920s. Each article in the database is associated with values indicating the frequency of words in the text, so we can trace how often a single word (a unigram) like “revolution” appears throughout our articles, or how often two words appear next to each other (a bigram), such as “Russian revolution.” Both features give us ways to describe and quantify word frequency and word proximity in our dataset.
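To make the unigram and bigram idea concrete, here is a minimal Python sketch of that kind of counting. It is not the code behind our database: the sample sentence is invented, and the helper names `tokenize` and `ngram_counts` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens (n=1: unigrams, n=2: bigrams)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample sentence, not taken from the article database.
sample = "The Russian revolution changed how readers saw the Russian novel."

tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams[("russian",)])              # how often "russian" appears
print(bigrams[("russian", "revolution")])  # how often the pair appears
```

Each count is keyed by a tuple of words, so a unigram and a bigram can be looked up in the same way.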

MITH's Travis Brown demonstrated how we could use Weka to train a machine-learning classifier that could assign labels to articles in the dataset that we had never read. To test this, we created a smaller training dataset of just 150 articles, few enough that we could read each text in full and label it by hand by answering questions ranging from “Is a given literary author a subject of debate in this article?” to “Is radical politics an issue in the article?” Given these hand-assigned labels, Weka can classify every other article in our dataset with some degree of accuracy.
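The train-then-label workflow can be sketched with a far simpler learner than anything Weka provides. The Python below builds a one-question decision tree (a "stump") that picks the single word whose presence best predicts a yes/no label. It is only an illustration of the idea: the four documents and their labels are invented, and `train_stump` and `classify` are hypothetical helpers rather than Weka functions.

```python
def train_stump(docs, labels):
    """Find the single word whose presence best separates the labels.

    Returns (word, label_if_present, label_if_absent, training_accuracy).
    """
    vocab = sorted({word for doc in docs for word in doc.split()})
    best = None
    for word in vocab:
        # Try both rules: "word present means yes" and "word present means no".
        for present, absent in ((True, False), (False, True)):
            correct = sum(
                (present if word in doc.split() else absent) == label
                for doc, label in zip(docs, labels)
            )
            if best is None or correct > best[3]:
                best = (word, present, absent, correct)
    word, present, absent, correct = best
    return word, present, absent, correct / len(docs)


def classify(stump, doc):
    """Label an unread document using the trained stump."""
    word, present, absent, _ = stump
    return present if word in doc.split() else absent


# Invented, hand-labeled toy data: "Is radical politics an issue?"
docs = [
    "bolshevik revolution and the novel",
    "the style of the novel",
    "revolution in the party press",
    "a portrait by the author",
]
labels = [True, False, True, False]

stump = train_stump(docs, labels)
print(stump)
print(classify(stump, "letters on the revolution"))
```

A real decision tree asks a sequence of such questions rather than one, and a perfect score on four toy documents says nothing about how well a rule would generalize to unread articles.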

Weka provided us with a decision tree that answers the question “Is literary style and artistry an issue in this article?” correctly for approximately 67% of our training set. This success rate can be improved as we add new measures for classifying and quantifying the text. One direction is to use MALLET (“an integrated collection of Java code useful for statistical natural language processing document classification”) to create topics: groups of words that MALLET finds to be closely related thematically. Topic modeling is fascinating because a preliminary examination of the generated topics has already revealed a variety of distinct themes and vocabularies in our dataset, ranging from religion to specific Russian authors. We’re in the process of running Weka’s classification system on the generated topics that include religious language in order to answer another of our questions: “Is religion an issue in the article?”

Our current Weka experiments, using a smaller training set of 46 articles, have already produced promising results. For example, when we run the J48 decision tree algorithm on our textual data filtered into unigrams, Weka correctly classifies 76% of our documents on the “Is politics an issue?” question. If we filter our data into both unigrams and bigrams, the correct classification rate drops to 67%. However, if we filter our data into unigrams and apply a stemmer (which reduces words to their root forms, ignoring prefixes and suffixes), our correct classification rate increases to 77%.
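To show why stemming changes the unigram counts a classifier sees, here is a deliberately naive suffix stripper in Python. It is a toy, not the stemmer used in the Weka experiments: the suffix list and the `naive_stem` helper are assumptions made for illustration, it does not touch prefixes, and real stemmers apply far more careful rules.

```python
from collections import Counter

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ary", "ing", "al", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ["revolution", "revolutions", "revolutionary", "politics", "political"]

raw_counts = Counter(words)
stemmed_counts = Counter(naive_stem(word) for word in words)

print(raw_counts)      # five distinct unigrams, each seen once
print(stemmed_counts)  # variants collapse onto shared stems
```

Before stemming, the five words are five separate features. Afterwards they collapse into two, so the evidence carried by related word forms is pooled into a single count.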

We look forward to expanding our experiments to an even larger subset of our data as we continue to learn more about natural language processing tools in the coming weeks.