A digital humanities center is nothing if not a site of constant motion: staff, directors, fellows, projects, partners, tools, technologies, resources, and (innumerable) best practices all change over time, sometimes in quite unpredictable ways. As small, partly or wholly soft-funded units whose missions involve research, or teaching, or anchoring a local interest community, digital humanities centers face fundamental challenges involving the long-term digital stewardship of the work they help to produce.
The importance of stewarding digital scholarship will only grow, and the work will need to be shared by the entire digital humanities community. Founded sixteen years ago in 1999, MITH is proud of the way it has faced and continues to face these challenges. We would like to take this opportunity to document our practices in a series of blog posts, beginning with this one, in the hope of providing a clear and potentially useful record of our principles for digital stewardship, the issues we’ve faced, and our practices for dealing with them.
In this initial post, we’ll provide an overview of the actions MITH has taken to steward the variety of digital humanities work created here. In doing so, we’ll articulate the underlying principles that have guided our decisions and present some key lessons we’ve learned. Finally, we’ll point out some areas where further work is needed by stakeholders across the wider digital humanities community.
What are we stewarding?
To document digital stewardship practices effectively, we need to be as clear as possible about what we are stewarding. In MITH’s case, we are concerned with actions taken to care for and manage over time the digital humanities work publicly shared by staff, students, faculty, librarians and others who have been affiliated with MITH. Given the era of MITH’s founding, the World Wide Web has been the chief means for making such work public. So as to have a compact way of referring to “digital humanities work made publicly available via the technologies of the web”, we will use the term “website” throughout this initial discussion of MITH’s stewardship activities. However, we purposefully understand the referents of “website” to encompass things ranging from collections of a few hyperlinked documents to complex applications and virtual worlds. In other contexts, these things might also be discussed as “databases”, “digital archives”, “tools”, or “projects”. (In other words, the use of the term “website” is a convenience based on the fact that most of the digital work we’re interested in was delivered using the web.) We acknowledge, furthermore, that the “websites” we are stewarding may represent or document the bulk of the intellectual, emotional, and technical labor of a person or a project, or they may represent very little of that labor—like the varying portions of icebergs buoyant enough to rise above the water line.
At one level, then, the digital humanities work we’re discussing comprises aggregations of computer files: documents and data in various formats, layers of interlocking software, and so on. At another level, this work comprises tacit knowledge of its creators—including knowledge located in the interrelationships between people. This work also comprises elements of how it has been received since it was made public—its position in networks of (hyper-) links and citations, for example. In discussing stewardship we mean the process of assessing, accounting for, and making decisions about how to represent all of these elements of the digital work people affiliated with MITH have created, within the constraints of available resources and from whatever situated vantage point each of us inevitably occupies.
Principle: active records and archival records
One important principle for MITH’s stewardship activities is acknowledging that we manage digital materials of two different major types. To borrow from the parlance of archives and records management, we care for digital humanities websites as both “active records” and “archival records”. By active records, we mean websites that are still in regular use by creators or their successors to do digital humanities work. Such use could involve adding additional data, sharing with students in an ongoing course, or incorporating a site into a new project or publication. By archival records, we generally mean websites that are no longer the active project of the scholars who created them—for instance, if they are no longer being updated with new data—but which have ongoing value and interest from elements of the community beyond the original scholar or team. The actual age of a particular resource is not significant in this distinction: older work is not automatically “archival,” just as recent work can be active briefly and become “archival” in a short amount of time (work related to events like conferences is a good example).
One important implication of thinking of MITH’s digital materials as records with active and archival modes is that it makes explicit the balance that a digital humanities center must strike. Websites have a lifecycle. Different choices and actions are appropriate at different stages of that lifecycle. On the one hand, for websites as active records, we want to maintain working systems in place to enable digital humanities work by specific members of our community with whom we have ongoing collaborative relationships. For websites as archival records, on the other hand, our aim may be less to manage active systems than to see that representations of earlier work are preserved in some form as evidence of people’s scholarship and of MITH’s own history.
MITH’s stewardship in practice
The rubric of stewardship that MITH has outlined for itself encompasses both active management and digital preservation. The locus of this discussion of specific stewardship actions will be the MITH servers—as institutional spaces into which, over time, digital materials were transferred so as to be made public as part of digital humanities work.
Systems Administration and “Backups”
MITH has, for many years, as part of our active management, “backed up” the contents of our running server machines to guard against sudden data loss. At present, MITH pays for a service hosted by the University of Maryland Division of Information Technology for this purpose. These backups guard against data loss from a short-term failure of the servers, so that the full system can be restored to a recent running state. This backup service also ensures that the data on MITH’s machines are duplicated at an additional location on the University of Maryland campus and written to magnetic tape for storage at a location geographically remote from the campus (as current best practice suggests).
Active management to support digital materials necessarily involves change: for one thing, physical machines wear out and become obsolete. The longer a website’s life, the more certain it is that those digital materials will need to be migrated from one system to another in response to the obsolescence of the underlying physical machines. Over its sixteen-year history, MITH has used a succession of physical machines as servers. As machines approached their end of life, MITH would migrate the content of servers (the digital materials over which MITH exercised stewardship) to new hardware. As an example of this process and in the interest of precision, we’ll discuss actions undertaken on two generations of such systems: machines that hosted MITH’s projects from 2006 to 2009, and those that have hosted MITH’s projects since 2009. The two physical servers that hosted MITH projects from 2006 to 2009 also contain data representing an older generation of machines: copies of MITH’s digital work back to 2003. Given that the outgoing and the successor machines have run versions of the same operating system, migration refers to copying those parts of the system where users (rather than system utilities) have stored data. These include file directories associated with the accounts of people who were granted access to MITH machines, file directories utilized by the web server (containing the HTML and related documents and source code for various websites), as well as databases and configuration files. The copying (migration) of these key files was part of the management of the MITH servers as active systems. The goal was to replicate functionality from one generation of physical hardware to the next. As part of server migrations, MITH also created disk images (that is, complete copies) of the servers’ hard drives, and took other actions (about which we’ll say more below) to help ensure preservation of data.
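To make the copy-and-verify idea behind such a migration concrete, here is a minimal sketch in Python. It is an illustration only: the post does not say which tools MITH actually used, and the function names (`sha256_of`, `manifest`, `migrate`) and the use of SHA-256 checksums to confirm that files arrived intact are our assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def migrate(source: Path, destination: Path) -> bool:
    """Copy a directory tree (hypothetical helper) and verify the copy.

    Returns True only if every file in the source has an identical
    counterpart, by checksum, in the destination.
    """
    shutil.copytree(source, destination)
    return manifest(source) == manifest(destination)
```

A call such as `migrate(Path("old/www"), Path("new/www"))` (hypothetical paths) would copy a web directory and report whether the two trees match. A real migration would also have to carry over database dumps, file permissions, and server configuration, which this sketch does not attempt.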
For the majority of the digital work created at MITH, this represents the complete story. Basic active management has kept these websites, some quite old, online.
The life stories of digital materials can become more complex for a number of reasons. Security problems are one complication that faces stewards of digital materials. In the case of digital humanities websites, the most common security problems involve SQL injection attacks, in which interactive components of a site can be compromised to run malicious software code, often for the purpose of sending spam messages or disrupting regular network traffic. MITH has direct experience with these challenges. At the request of the University of Maryland’s network administrators and with assistance from staff of the Division of Information Technology and paid external computer security consultants, MITH conducted a thorough review of all its projects in 2009 after several years of progressively increasing instability due to security problems. This review was conducted as part of migrating to new, clean systems, and some sites were deemed too insecure to return to full, online availability. Compressed archives of projects, even those with security vulnerabilities, were copied to new machines and external storage media. No files were ever deleted, but the process of responding to security concerns nonetheless created a category of projects that could not simply be reproduced online following the server migrations. These projects have needed to be managed in alternative ways. By evaluating some sites as archival records, not solely as active projects, MITH staff decided to re-mount static resources (HTML pages, images, and so on) without the accompanying databases to lessen future security risk while preserving evidence of these projects. The 2008 Digital Diasporas Symposium is one example of this approach. A database had been part of the original site for the purpose of accepting registration information but no longer needed to be online after the end of the conference.
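For readers unfamiliar with SQL injection, the following sketch may help show why a database-backed form can be a liability. It uses Python's built-in `sqlite3` module with an invented `registrants` table and invented lookup functions; it does not reproduce any actual MITH site, and real attacks are usually more elaborate than this one.

```python
import sqlite3

# An in-memory database standing in for a hypothetical registration form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registrants (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO registrants VALUES (?, ?)",
    [("Ada", "ada@example.org"), ("Grace", "grace@example.org")],
)


def lookup_unsafe(name: str) -> list:
    """Vulnerable: user input is pasted directly into the SQL text."""
    query = f"SELECT email FROM registrants WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(name: str) -> list:
    """Safer: the driver passes the input as data, never as SQL."""
    return conn.execute(
        "SELECT email FROM registrants WHERE name = ?", (name,)
    ).fetchall()


# Input crafted so the WHERE clause becomes: name = 'x' OR '1'='1'
malicious = "x' OR '1'='1"
```

Here `lookup_unsafe(malicious)` returns every row in the table, because the crafted input rewrites the query's condition, while `lookup_safe(malicious)` returns nothing, because the same string is treated purely as a name to match. Serving only static files, with no database behind them, removes this class of risk entirely.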
In other cases, addressing security or other problems would require making choices or potentially altering features of sites. Where a site was no longer the active project of the scholar who created it (she/he has “completed” it and moved on), MITH has decided not to make substantial changes unilaterally.
Migrations and Transfers
Digital materials have also sometimes been migrated to other systems not managed by MITH, for example because a project director moved to another institution and wanted to transfer stewardship of the project. As a principle of stewardship, we believe that it is acceptable—even necessary—for the responsibility for digital work to change over time. This principle entails sharing copies of data with project creators and potentially others. At MITH, we consider sharing complete copies of digital materials (including the databases and all the other components) part of our stewardship in three specific ways: first, so that project creators have additional backup copies of their own, to manage as they see fit; second, so that those who request access can view materials from sites with security vulnerabilities offline; and third, so that stewardship of a site can be transferred to a new institution. MITH has made a practice of offering copies of digital material for personal backups when a funding period (for externally-supported projects), or a fellowship (for internally-funded projects), ends. Also, in 2007, those who had produced digital work at MITH up to that time were offered copies of their data to take for their own personal use. In these cases, project creators are provided with compressed archives of files copied from the MITH servers and MITH also retains copies. In the second case, of materials from sites that are offline due to security vulnerabilities, we have provided copies of files in response to specific requests on an ad hoc basis. When project directors or others choose to take over responsibility for a particular website, thereby ending MITH’s involvement in its active and ongoing management, MITH will link to or redirect traffic to the new versions of a digital project while also retaining copies of the earlier data created at MITH for preservation.
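As a rough illustration of what producing such a compressed archive can involve, here is a short Python sketch using only the standard library. The function names and the choice of the `.tar.gz` format with a SHA-256 checksum are assumptions made for the example, not a description of MITH's actual procedure.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_project(project_dir: Path, out_dir: Path) -> tuple[Path, str]:
    """Write project_dir to a .tar.gz in out_dir; return its path and SHA-256.

    Hypothetical helper: out_dir should sit outside project_dir so the
    archive does not try to include itself.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archive_path = out_dir / f"{project_dir.name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the project name.
        tar.add(project_dir, arcname=project_dir.name)
    digest = hashlib.sha256(archive_path.read_bytes()).hexdigest()
    return archive_path, digest


def list_archive(archive_path: Path) -> list[str]:
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Recording the checksum alongside the archive lets both the recipient and the retaining institution confirm later that their copies are still identical. The sketch reads the whole archive into memory to compute the checksum, so very large projects would call for chunked reading instead.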
MITH’s experience with providing access copies of digital projects suggests issues that the community as a whole needs to grapple with further. As a principle, project creators should have access to their own work to use or build upon. Yet almost all digital humanities projects are works of collaborative authorship and represent the efforts of not only project directors but staff members, graduate students, and others. Best practices for project charters probably need to indicate who can make decisions over the long term. It is unwieldy to ask, long afterwards, all the graduate students who have worked on a project, for example, how to handle providing copies of materials but, at the same time, MITH takes seriously principles of respect for all contributors, as expressed in initiatives like the Collaborators’ Bill of Rights. When third parties, not the original creators, want complete copies of a site, the prospects become more complex yet. Solutions to many of these kinds of conundrums involve using open licenses for content and source code. Other conundrums may be trickier: if materials are being transferred to someone other than the original creator, should project files be reviewed, as is the process in many analog archives, so that personal information (if any) or database credentials could be removed or reset?
Repositories and Archives
In addition to managing digital materials as part of active servers, MITH has taken steps to document projects created here as part of a broader stewardship strategy that also includes digital preservation activities. In 2012, MITH staff deposited a collection of materials with the institutional repository for the University of Maryland and we continue to make this a part of our project process. At around the same time, MITH moved its physical offices. In the course of the move, staff deposited boxes of physical materials documenting MITH’s activities with the University Archives. Also in 2012, MITH staff worked with the University of Maryland Libraries’ Special Collections and University Archives to ensure that internet addresses where most MITH projects are found would be regularly crawled by a web archiving tool (Archive-It) paid for through the Libraries’ collection budget. By adding an active partnership with the Libraries to MITH’s strategy we could be confident that our specific content would be collected and that collection would occur more regularly than could be expected from waiting for archiving software operated by the Internet Archive and other organizations to visit MITH sites.
To be sure, web archive versions of sites are not always identical to the original ones, but we think they represent an important element of digital stewardship planning. Maintaining a live website, keeping it online and accessible at its original location with its complete original functionality, is not digital preservation but active management. A stewardship strategy predicated entirely on active management is unsustainable. For one, such a strategy is too expensive and labor-intensive given the limited resources of a digital humanities center such as MITH. Larger memory organizations such as libraries should aim to collect widely and reflect the diversity of practice of digital humanities work. The economies of collecting and preserving such work at scale also militate against stewardship strategies that depend on managing each digital humanities website (only) as an active system. Finally, active management of digital humanities websites, where they are embedded in a working web server, exposes them to ongoing risk of corruption and human error. We cannot ignore, for the convenience of present use, the aspect of preservation that involves removing materials from their original situation and relocating them where actions can be taken for their long-term survival, even if this entails changing the experience of using them. A serious discussion of digital stewardship must incorporate consideration of how best to sunset projects as active sites. For all these reasons, the role of web archives as they relate to the future use and preservation of digital humanities work is an area where there is much work still to do, and it may be the subject of a future post.
Additional Digital Preservation Strategies
Finally, as we mentioned above, MITH has collaborated on a number of important digital preservation research projects that have affected how we preserve MITH’s own outputs. First, this research agenda has helped us recruit staff, students, and faculty interested in digital preservation challenges. MITH faculty and staff have authored books, articles, conference papers and reports, convened summit meetings, organized conferences, led training institutes, built tools, disseminated resources, and spoken widely on the importance of digital preservation and data curation. Second, by working on these digital preservation research projects, people at MITH have gained certain technical skills and conceptual approaches to problems. We’ll offer just one brief example of how this has played out. Though the physical machines that immediately preceded our current generation of servers were decommissioned about 6 years ago, MITH has retained this physical hardware, as well as various other internal and external hard drives of machines MITH staff and faculty have used. Credit this decision to the habits of thought and practice that MITH’s engagement with digital preservation has cultivated. Our work on digital preservation has caused us to consider the physical nature even of things like websites and to consider further how original hardware may represent a crucial element for preservation. A few months ago, MITH asked Porter Olsen, former Community Lead of the BitCurator project and a Graduate Assistant at MITH, to apply some of the skills he learned from his work here to help us evaluate whether new tools and capabilities could be applied to curating and preserving digital humanities materials from systems like MITH’s old web servers. 
We know from our research and our experience that no matter how thoughtfully we’ve crafted policies and procedures and no matter how carefully we’ve acted, digital stewardship is a complex and challenging endeavor that requires us to keep learning and trying to improve.
In the next post in the series, Porter will describe in more detail how he was able to use the BitCurator tools, which MITH helped develop, to collect and preserve additional information. In other future posts, Stephanie Sapienza, MITH’s Project Manager, will discuss work she’s been leading to revamp the section of our website where we document all of MITH’s work in order to make it easier to find information about previous projects. MITH’s Lead Developer Ed Summers will post some lessons learned about best practices for migrating complex and dynamic websites. And, since preserving computing hardware as well as software figures in the story of MITH’s digital stewardship practices thus far, we’ll also consider the preservation implications of the increasing move to “cloud” computing.
There are no ready-made solutions, no repositories ready to accept many of the kinds of complex objects digital humanists produce, so every stewardship strategy falls somewhere along a spectrum of benefits and tradeoffs. Complex work carried out by multiple people over long spans of time, during which our collective knowledge of best practices was itself evolving, is difficult to judge according to a binary of success or failure, presence or erasure. Attempting to do so likely vitiates the value of the open and detailed discussion of this work for the community genuinely interested in preserving the fullest history of digital humanities work. It also runs counter to the conclusions generated by MITH’s own active research in digital preservation, and the theoretical and methodological writings of several of its staff and administration.
We would not claim that this process of caring for the range of MITH’s digital humanities websites has been flawlessly executed at every step. During server migrations, in particular, sites have sometimes gone offline due to a misconfiguration of one kind or another. When we have discovered these errors, or when they have been pointed out to us, we have investigated and restored the online availability of materials where possible. Where this has not been possible because earlier work would need new investment to fix security vulnerabilities or make other substantial changes, we have documented projects and retained data, and even hardware. One thing the history of MITH’s digital work should suggest is that there are important distinctions between preserving data, maintaining or sustaining specific computing systems, and providing varying levels of access (online vs. offline, original vs. migrated).
The expectations of access for archival material differ from those for active material. Just as with analog collections, it may be necessary to find archival “websites” in different locations than their active counterparts; archival websites might take slightly different forms; additional requests or effort might be necessary to access these things. We at MITH recognize that it is a frustrating experience to find that digital work that was once available on the web is offline or that the experience of working with it has changed. As a community, our expectation of digital work has been that, in many cases, it’s present online or it’s gone. At MITH, where we have digital work that is preserved but not online, this ingrained expectation is a challenge. For more and more digital work, we think that there must be a middle, archival way. How do we begin to incorporate this into our practices in satisfying ways? Most researchers have encountered analog collections that are unprocessed or materials that are out for repair and preservation. At the moment, this is perhaps the best comparison for the state of a few of MITH’s digital outputs, yet we are continuously working to develop and improve our own practices of curation and preservation.
At MITH, we have chosen a stewardship strategy that entails both actively managing systems that still run a variety of websites and also taking actions to preserve copies and alternate representations of these sites including through retention of compressed offline archives of project data, through web archiving, and through deposit of supplementary materials and documentation with both our (digital) institutional repository and our (analog) university archives.
We value the tradition of work that has been accomplished at MITH over the last 16 years. Digital humanities has a history and indeed multiple histories, not just in terms of the intellectual pedigree of the phrase or the concept, but in the legacy of material work that has been performed in its name. MITH’s research has addressed itself to issues of gender and diversity among other core humanist concerns. MITH has also attracted and recruited people interested in digital preservation, and has actively collaborated on a range of digital preservation projects and initiatives in its research. These intertwined aspects of our work have particular resonance now, given that digital humanities practitioners are placing increased emphasis on recovering narratives and origin stories for the field that are more diverse than some stakeholders seem interested in acknowledging. Stewarding the collective history of work at MITH thus helps make visible the diverse history of the digital humanities.