Chasing the Great Data Whale

The first thing you hear, or at least that you should hear, when you present an idea for a digital humanities project to someone already familiar with the field is this: "That's great! What does your data set look like?" Actually, that's the reaction you'll get if whoever you're talking to is taking you seriously, so the reaction is a mixed blessing. On the one hand, you have their attention. On the other, they know enough to point out that projects without data go nowhere, and good data (not necessarily synonymous with big data) is hard to find. Data is truly the Moby Dick of the digital humanities. No data set is ever big enough, clean enough, or well-structured enough to achieve precisely what researchers would like to achieve. Just when you (and more likely your team) feel confident that the data set is "harpooned" (structured, refined, aptly tagged, and curated), the whale takes off again with a new standard: interoperability. Your data doesn't play well with other data, and the chase begins again. When your data set works for you, that has some value, but when your data set works with others, well, that means a wider audience and a broader impact for your work.

Most projects have lifespans determined by fellowships or grants or sabbaticals, and unlike Captain Ahab, we can't afford to have our hard work dragged out into the abyss and lost. The DH mantra may well be "project or perish." Hard decisions about data are often determined by two factors: intellectual value and time. First, your data needs to be thoughtfully selected, described (tagged with metadata), and clean enough both to work and to support a reasonable argument that what you've done with the data can be trusted. Second, you need to know when it's time to cut the rope and let go of what might still be done, a choice between good enough and great. When just a few more hours of tagging or a few more weeks of correcting OCR errors seem within your grasp, the choice feels mercenary. Deciding to stop improving your data, though, is like the difference between Faulkner and Melville: "you must kill all your darlings." Data sets, really, are Modernist objects: they are "what will suffice."

This is the lesson I learned during my first full month at MITH. As a Winnemore Dissertation Fellow, I have approximately four months to capitalize on MITH's resources and to produce a project that has a strong (but not perfect) data set, answers relevant humanities questions, and possesses enough sustainability, public outreach, and external support to become a viable, fundable project in the future. In these first five weeks, I have benefited from the wealth of experience and depth of knowledge that the team assigned to my project possesses. Jennifer Guiliano knows how to pull projects together, shepherding them through the steps of data management, publicity, and sustainability. Taking to heart her advice about managing successful DH projects, I feel that I have a much stronger grasp on what steps must be taken now and what can and must happen later. This is professional development knowledge that more graduate students should have when they venture into the alt-ac or tenure-track market. Trevor Muñoz's expertise with data sustainability prompted questions that have helped shape a future for my project even at its earliest outset. Few graduate students have the time or the resources to think about how adequate curation in the short term could mean greater benefit in the long term. Amanda Visconti and Kristin Keister have been helping me to shape a public portal for my work, and I know that the virtual presence they help create will contribute to the future success of the project, as well as its intellectual value.

The most salient lessons I've learned about the lure of the great data whale, however, have been from Travis Brown. I arrived in January with a large but unrefined data set of approximately 4,500 poems in a less-than-uniform collection of spreadsheets. I had a basic understanding of what I wanted the data to look like, but in consultation, Travis helped me to determine what would work better and might last longer. Travis created a data repository on MITH's space with a version control system called Git. (Julie Meloni's often-cited "Gentle Introduction to Version Control" on ProfHacker provides a useful explanation of what Git is, why it's valuable, and where to find resources if you'd like to try it.) Once I installed Git on my machine, replicated the repository, and "pushed" the data to it (basically, moved the data from my computer into the repository Travis created), Travis could take a look. We agreed to separate the text of the poems and their titles from the descriptive information about them (author, date, publication, etc.), to use a uniform identification number to name the file for each poem, and to track each poem's descriptive data in a separate spreadsheet. We realized at that point that there were some duplicates, and in conversation agreed that we would keep the duplicates (in case there was a reason for them, such as minor textual differences) and tag them, so that later we could come back to examine them, but in the meantime not include them in the tests I'd like to run.
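For readers who want to picture that layout, here is a minimal sketch in Python of the split described above: one text file per poem, named by a uniform ID, plus a separate metadata spreadsheet with a column that flags duplicates. It is not the code Travis and I used. The column names (`title`, `author`, `publication`, `text`), the `poem-00001` ID pattern, the folder names, and the whitespace-and-case normalization used to spot duplicates are all assumptions made for illustration, and this kind of matching would only catch copies that are identical after normalization, not poems with minor textual differences.

```python
import csv
import hashlib
from pathlib import Path

# Hypothetical metadata columns; the real spreadsheet may differ.
FIELDS = ["id", "title", "author", "publication", "duplicate_of"]


def normalize(text):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def split_corpus(rows, out_dir):
    """Write one text file per poem and a separate metadata.csv.

    `rows` is an iterable of dicts with the assumed keys
    'title', 'author', 'publication', and 'text'.
    Returns the list of metadata records that were written.
    """
    out_dir = Path(out_dir)
    text_dir = out_dir / "texts"
    text_dir.mkdir(parents=True, exist_ok=True)

    seen = {}      # fingerprint of normalized text -> ID of first copy
    metadata = []
    for number, row in enumerate(rows, start=1):
        poem_id = f"poem-{number:05d}"

        # The title stays with the poem text; everything else is metadata.
        (text_dir / f"{poem_id}.txt").write_text(
            f"{row['title']}\n\n{row['text']}\n", encoding="utf-8"
        )

        # Duplicates are kept and tagged, never deleted.
        fingerprint = hashlib.sha1(
            normalize(row["text"]).encode("utf-8")
        ).hexdigest()
        duplicate_of = seen.get(fingerprint, "")
        seen.setdefault(fingerprint, poem_id)

        metadata.append({
            "id": poem_id,
            "title": row["title"],
            "author": row["author"],
            "publication": row["publication"],
            "duplicate_of": duplicate_of,
        })

    with open(out_dir / "metadata.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(metadata)
    return metadata
```

Because the texts and the metadata end up as plain files, both can sit in the same version-controlled repository, and a change to a tag never touches the poem it describes.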

Then came the "darling killing." Metadata, the information that describes the poetic texts that make up my data set, is necessary for the tests I'd like to run: those that classify and sort texts based on the likelihood that words co-appear in particular poems. The amount of metadata that I include to describe the poems will determine both the kinds of tests I can run and how reliable they are. However, tagging 4,500 poems, even with simple identifiers, could take the whole four months of my fellowship if I let it. The hard choice was this: I would tag only the gender of the poet associated with each poem and whether the poem is ekphrastic (that is to say, written about, to, or for a work of visual art), not ekphrastic, or unknown. Some poems were easily tagged as ekphrastic, because I had sought them out specifically for that purpose; more often than not, however, I would need to make poem-by-poem determinations about the likely poetic subject. This takes time, and because of the tests I need to run to ask the questions I want to ask (e.g., Can MALLET distinguish between descriptions of art and descriptions of nature?), I will need to let go of other useful, informative tags I would like to have, like the date each poem was published, the artwork to which it refers, and so on.
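A deliberately small tag set is also easy to check by machine. The sketch below, again in Python and again only an illustration, validates the two tags described above and picks out the poems that are ready for testing. The field names (`id`, `poet_gender`, `ekphrastic`, `duplicate_of`) and the gender vocabulary are my assumptions for the example; only the three-way ekphrastic tag (yes, no, unknown) and the decision to leave duplicates out of the tests come from the project itself.

```python
from collections import Counter

EKPHRASTIC_VALUES = {"yes", "no", "unknown"}    # the three-way tag described above
GENDER_VALUES = {"female", "male", "unknown"}   # assumed vocabulary, for illustration


def check_tags(record):
    """Return a list of problems found in one poem's metadata record."""
    problems = []
    if record.get("ekphrastic") not in EKPHRASTIC_VALUES:
        problems.append(
            f"{record['id']}: unexpected ekphrastic tag {record.get('ekphrastic')!r}"
        )
    if record.get("poet_gender") not in GENDER_VALUES:
        problems.append(
            f"{record['id']}: unexpected poet_gender tag {record.get('poet_gender')!r}"
        )
    return problems


def tagging_summary(records):
    """Count how many poems carry each ekphrastic value."""
    return Counter(record.get("ekphrastic") for record in records)


def ready_for_tests(records):
    """Keep poems whose tags are valid and that are not flagged as duplicates."""
    return [
        record for record in records
        if not check_tags(record) and not record.get("duplicate_of")
    ]
```

Running `tagging_summary` over the metadata spreadsheet gives a quick count of how many poems are still marked unknown, which is one way to decide when the tagging is good enough to stop.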

I am keeping a record of all these things. The decision _not_ to be perfect is the right choice, but it isn't an easy one. I feel sometimes as though I have watched my whale slip from my grasp and sink back into the sea. My crew is safe. My project will continue on schedule, but not without the longing, nagging feeling that I need to return to this place again. Perfect data sets are a myth, one that often forms a barrier to new scholars who wish to begin DH projects. Rather than struggling for the perfect data set, I want to suggest that we place a much stronger emphasis on the more intellectual and more necessary components of data curation and data versioning. I would argue that we should judge projects not by the "completeness" or "perfection" of their data, but by how well the data's versioning has been documented, how thoroughly curatorial decisions such as what to tag, when, and why have been publicized and described, and how much the evolution of the data contributes to the development of other projects within the community. Much the same way that we know more about the value of an article by how often it has been cited, we should value a digital humanities project by how much the projects that follow it can learn from it. In this sense, we should all be called Ishmael.

Lisa Rhody is a Ph.D. candidate in English at the University of Maryland, a Spring 2012 MITH Winnemore Dissertation Fellow, and a lecturer on the arts for the Virginia Museum of Fine Arts. This post first appeared on Lisa's personal blog on March 1, 2012.