Tools for Data-Driven Scholarship

Tools for Data-Driven Scholarship: Past, Present, Future

A Report on the Workshop of 22-24 October, 2008
Turf Valley Resort, Ellicott City, Maryland

by
Dan Cohen, Neil Fraistat, Matthew Kirschenbaum, Tom Scheinfeldt
Center for History and New Media, George Mason University
Maryland Institute for Technology and the Humanities

Overall Objectives of the Workshop

As documented in Our Cultural Commonwealth: The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, we are witnessing an extraordinary moment in cultural history when the humanities and social sciences are undergoing a foundational transformation. Tools are one essential element of this transformation, along with access to cultural data in digital form. The need to “develop and maintain open standards and robust tools” was one of eight key recommendations in the ACLS report, inspired by the NSF’s 2003 Atkins report on cyberinfrastructure.[Unsworth et al., Our Cultural Commonwealth: The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, American Council of Learned Societies, 2006]

Indeed, many humanities scholars and social scientists have already built valuable digital tools over the years, including software and systems for text processing, markup, visualization, metadata generation, cataloging, GIS, and a number of other humanities-related tasks. Such digital tools, though broadly categorized as ‘humanities-related,’ would also be of great use to many scientific disciplines. By the same token, many science tools funded by the NSF, Department of Energy, and others would certainly be of great interest to humanists. One of the goals of this meeting was to bring these groups together.

On October 22-24, 2008, nearly fifty scholars, librarians, museum professionals, computer scientists, software developers, and funders attended a workshop at Turf Valley Resort in Ellicott City, Maryland, to discuss the past, present, and future of tools that can assist scholarship in an age of massive digital resources. The workshop, co-funded by the National Science Foundation, the National Endowment for the Humanities, and the Institute of Museum and Library Services, brought these people together because of their active engagement with the production and use of these digital tools and their deep knowledge of associated issues such as scholarly communication and sustainability. (Please see Appendix A for a list of attendees.)

The discussion was pragmatic rather than ideological; the goal was to understand from experience and example how such tools could be created, disseminated, and built upon in a more effective way than has been the norm. In 2005, the NSF sponsored a Summit on Digital Tools for the Humanities in conjunction with the University of Virginia’s Institute for Advanced Technology in the Humanities. This new meeting was explicitly conceived as an opportunity to build on that earlier effort and make progress on some of the challenges it raised.[Summit on Digital Tools for the Humanities, http://www.iath.virginia.edu/dtsummit/SummitText.pdf]

The funders–as well as many of the attendees–came to the meeting with the sense that scores of these digital tools for academia, libraries, and museums had been created over the last two decades, and yet very few of them had been fully successful in making their way into widespread use or in using digital collections robustly. The leaders of the workshop (the authors of this report) distilled the problems into three general areas:

1. Tools need to work better with other tools.

2. Tools need to connect better with content and use that content in a more robust way.

3. Tools need better mechanisms for being found by the scholars who need them. They are not currently finding their audience(s).

There may be intellectual and even practical value in reinvention–in “recreating the wheel.” However, it seems that too few tool builders looked for existing tools that already did what they envisioned, or if they did in fact do an initial check before beginning their own development efforts, they often had trouble finding related tools.

Moreover, despite the existence of standards and methods for interoperability between tools, most tools that have been built are one-off, standalone web applications or pieces of software. Importing and exporting between tools, or connecting tools together, is often impossible or extremely difficult. Tools that have complementary strengths often cannot be paired for maximal utility by the end user, as can be seen, for example, in the proliferation of annotation tools. Over the last five years a variety of applications that assist in the critical scholarly activity of annotating resources have appeared, including Pliny, eComma, Co-Annotea, CommentPress, and Zotero. But each tool has a distinct way of marking up documents and there is no way to pass annotations back and forth among them.
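To make the annotation problem concrete, the following is a minimal sketch of what a shared interchange record might look like, written in Python. It is an illustration only: the `Annotation` fields, the function names, and the example URI are all hypothetical and do not reflect the format of Pliny, eComma, Co-Annotea, CommentPress, Zotero, or any other tool named above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Annotation:
    """A hypothetical tool-neutral annotation record (not any real tool's format)."""
    target: str   # URI or identifier of the annotated resource
    start: int    # character offset where the annotated span begins
    end: int      # character offset where the annotated span ends
    body: str     # the annotator's note
    creator: str  # who wrote the note


def export_annotations(annotations):
    """Serialize annotations to JSON text that another tool could read."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


def import_annotations(payload):
    """Rebuild Annotation objects from JSON produced by export_annotations."""
    return [Annotation(**item) for item in json.loads(payload)]


# Round trip: what one tool exports, another can import unchanged.
notes = [
    Annotation(
        target="http://example.org/texts/1",  # placeholder URI
        start=10,
        end=42,
        body="Compare with the 1818 edition.",
        creator="reader-a",
    )
]
round_tripped = import_annotations(export_annotations(notes))
```

If two tools agreed on even so small a record, annotations made in one could be opened in the other; the hard part, as the workshop discussion suggests, is the agreement rather than the code.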

Interaction between tools and the exponentially growing number of digital collections is scarcely better. Collections often lack methods for communicating with tools, for instance through interfaces such as application programming interfaces (APIs) that many commercial services consider essential. Even if they are willing, providers of digital scholarly content often have no way to graft a particular tool onto their collection. Countless standards that should solve this problem are failing to do so because of implementation issues or simple lack of technical know-how.
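As a rough illustration of the kind of interface at issue, the sketch below shows, in Python, a collection answering two request paths with JSON and a tool reading from it. Everything here is assumed for the example: the `COLLECTION` records, the `/items` paths, and the function names are hypothetical, and a real service would sit behind a web server rather than a plain function call.

```python
import json

# Hypothetical sample records standing in for a digital collection.
COLLECTION = {
    "ms-001": {"title": "Letter to a publisher", "year": 1816},
    "ms-002": {"title": "Travel journal", "year": 1822},
}


def handle_request(path):
    """Answer a request path with a (status, JSON body) pair.

    Supports "/items" (list identifiers) and "/items/<id>" (one record).
    """
    parts = [p for p in path.split("/") if p]
    if parts == ["items"]:
        return 200, json.dumps({"items": sorted(COLLECTION)})
    if len(parts) == 2 and parts[0] == "items" and parts[1] in COLLECTION:
        return 200, json.dumps(COLLECTION[parts[1]])
    return 404, json.dumps({"error": "not found"})


def fetch_titles():
    """What a tool might do: list the items, then read each record's title."""
    _, listing = handle_request("/items")
    titles = []
    for item_id in json.loads(listing)["items"]:
        _, record = handle_request("/items/" + item_id)
        titles.append(json.loads(record)["title"])
    return titles
```

The point of the sketch is that the tool never needs to know how the collection stores its records; it only needs the two documented paths and the shape of the JSON that comes back.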

Finally, and perhaps most distressingly for the attendees of this workshop, the vast majority of scholars who are not directly involved with the creation of digital tools and collections are not adopting these new applications and resources in the numbers one might anticipate this far into the digital revolution. Many traditional scholars remain ignorant of tools or cannot find a particular tool to suit their research need. To be sure, a digital tool is not always the solution to modern humanistic research. But for working with large-scale research corpora and operating on digital representations of sources, software can play a critical role.

A recent report from the Council on Library and Information Resources by Lilly Nguyen and Katie Shilton details the often immature and problematic state that most digital tools for scholarship are in. Major issues relating to the ease of finding tools, clarity of use and functionality, and documentation, support, and outreach persist. [Lilly Nguyen and Katie Shilton, "Tools for Humanists Project," Council on Library and Information Resources, April 18, 2008, Appendix F of Diane M. Zorich, A Survey of Digital Humanities Centers in the United States, Council on Library and Information Resources, May 19, 2008.]

Given these issues, the funders envisioned some kind of opportunity for creating more “visibility” for digital tools for advanced scholarship. But what would an entity that solved these problems look like? What would its function and functionality be? The funders asked the workshop leaders and the attendees to advise the agencies about ways in which they could enhance these crucial connections or encourage certain kinds of collaboration, interoperability, and standards. The National Endowment for the Humanities, the Institute of Museum and Library Services, and the National Science Foundation have been instrumental in funding digital tools for the humanities and social sciences and believe that some kind of curated infrastructure that supported sharing and reuse would help to make existing tools more widely available and new tools more viable and sustainable.

This report begins in Section I with a deeper analysis of the problems faced by tool creators, content providers, and scholars who want to use these digital tools and collections. In Section II we then cover specific, often narrow solutions the workshop attendees postulated for the problems detailed in Section I. Finally, we address the advantages and disadvantages of certain broad solutions and make overall recommendations.

Problems Elucidated by the Attendees

The first session of the workshop focused on laying out as specifically as possible the problems that tool builders have encountered in the creation, integration, and dissemination of their tools. The discussion was extremely wide-ranging, illuminating issues from community relations to training to data formats. The organizers of the workshop asked attendees to be as specific as possible in delineating these issues, with the understanding that not all of the problems could be solved by a single project.

It was in this initial session that we first ran into an overarching question about different constituencies. Neither the original vision for this workshop nor the proposed RFP that would come out of it specified an audience for the solution, and attendees often found it unclear who was meant to benefit. We identified five distinct constituencies, which often have conflicting interests:

1. digital humanists or researchers already actively using digital tools and resources

2. traditional humanists, who make up the vast majority of scholars (the first group is a minority)

3. software developers and technologists

4. content providers

5. funders/supporting organizations

Starting with the “Problems” session, we began to unpack what each of those audiences is really looking for. Throughout this report we try to show how different problems and solutions address the concerns of each of these constituencies. For instance, funders might be interested in issues of sustainability and preservation of code whereas developers might be focused on daily needs like project management.

An effective way of understanding all of the problems faced in trying to move tools into the mainstream of scholarship in a digital age is to break that effort down into a roughly chronological list of issues.

The first hurdle faced by those who create tools for data-driven scholarship is in the conceptualization of the tool they wish to develop. What kind of application should one build? Workshop attendees outlined several conflicting pressures at this initial stage. Do you try to build a comprehensive tool or one that does something very narrow? A killer app to be adopted broadly or a tool to solve a particular problem faced by a specific scholarly community? Is there such a thing as a killer app in the humanities, or are tools necessarily discipline-specific? Some attendees noted that much tool development has been “tech-centric” rather than “scholar-centric.” That is, because a tool is often created by a software developer, it tends to take as its goal the surmounting of technical issues rather than research issues that will lead to new discoveries in a field.

Furthermore, a number of workshop attendees felt that even the very notion of a tool was ambiguous. As one breakout group noted, there is “no clean line between the data and the tools,” thus forcing the tool creator to try to figure out if a tool must be fully integrated with a collection or if it can stand alone. Some didn’t know when time would be better spent creating a “workflow” or connecting existing tools together rather than creating something new. And all attendees agreed with the funders that they often had great difficulty finding existing tools and assessing the applicability of what was already out there. Registries of tools have been tried, but they have encountered a lack of tool builder participation and problems related to taxonomy–how do you categorize a complex tool so that it can be found? Even if a similar tool is found that might be reworked to solve a particular need, funding is often not available for these efforts. Most funding goes to efforts to create something from scratch.

Additional concerns with the conceptualization stage were figuring out which media types, data types, standards, and programming languages should be used; trying to predict where the community of scholars and tool builders would be in coming years so as to avoid creating an anachronistic tool; and, in general, understanding better what users really want and then crafting a tool with the correct functionality and user interface to meet those needs. Few tool builders professed to have done a full needs assessment or, later on, focus group testing of prototypes or the final product. Another issue that clouds this picture is whether researchers or content communities can accurately assess what they need ahead of time, or whether they are biased toward repeating modes and interfaces they have already seen but which will be less effective at digital scholarship than more innovative software.

Once a new tool is conceived, a software project faces major challenges related to staffing, participation, and project management. How do you find the programmers, designers, database administrators, and other people necessary for a particular effort? A survey of the existing digital tools for data-driven scholarship shows a chasm between projects that appear to have been done with a professional development staff and more amateurish efforts. Many software projects begin with just one or a few developers but as they grow face the problem of attracting new developers to take the project to a production-ready stage.

Related to that last staffing issue, workshop attendees noted a variety of problems in setting up a software project so that others could easily participate in its development. Few tools have roadmaps for development so others can see where the project is going and what features they might lend a hand to (and when). (Many project directors noted a fear of advertising a timeline that they might not meet and a desire to retain fluidity and flexibility in a project.) Important technical decisions made at the beginning of a project can make it difficult, if not impossible, for a developer from outside the project’s home institution to join the effort, or to take a tool and move it in a new direction (say, for a different discipline or methodology). Any project that is more than a solo effort faces issues of matching the skillsets of individuals with tasks and deliverables; this process is even harder when developers from more than one institution are involved. One breakout group noted that often the greatest innovation takes place at the fringes–i.e., by energetic contributors beyond a project’s home institution. How should a project account for these additional participants?

Workshop attendees candidly admitted that a great deal of tool building in the service of scholarship has been done with amateurish project management. Some of this problematic grounding of tools is probably related to the tradition of research and experimentation in academia, which is less than ideal for the production environment good software requires. Effective planning, housing of the code (e.g., version control), communication among developers, and distribution of work were all considered troubling in the majority of digital tool projects. In general, software lifecycle management principles are ignored in many academic efforts. Few tool projects take advantage of project management tools that combine team messaging, a breakdown of tasks, and calendaring/to-dos. Managers of digital tool projects fail to understand how projects are born, evolve, and reach closure; hence the great number of applications that are half-written, not ready for constant use, or that haven’t been updated in years (what one participant in the workshop called “abandonware”). Few projects have a transparent development process. And very few have a plan for sustainability beyond the initial grant period.

Once a tool has been developed sufficiently to be distributed, more problems await in the effort to attract, retain, and serve users. Many of these problems are the same as those encountered by developers. Users have no effective way to find tools and assess their utility for a particular domain or research project. The lack of transparency means that users might never come back to the website for a tool once they deem it not useful because they will never see that the tool might be evolving in a helpful direction for them. Detailed documentation, guides for new users, screencasts, user forums, and the like are rare, and often an afterthought in the development process. Several groups at the workshop also noted that internationalization/multilingualism was especially lagging in most tool projects, with a single language being hard-coded into interfaces, even when most potential users speak a different language.

Many at the workshop acknowledged the challenges of building a community of users around a tool. Scholars are scattered and have various needs and levels of technical sophistication. Although technology has become something most scholars use every day, few have ventured beyond the basic applications such as word processing and email. Two breakout groups noted that digital tool use is not part of the curriculum, and that few opportunities exist for training and education. (Graduate education, where students learn to do the kind of in-depth research the workshop focused on, seemed a particularly big gap.) Obviously these elements are necessary for tool adoption to flourish.

In short, if concerns about the creation and production of tools have to do with the supply of new digital methods, more has to be done on the other side of the equation: the demand for these digital methods and tools. User bases must be cultivated and are unlikely to appear naturally, and few projects do the necessary branding, marketing, and dissemination of their tool in the way that commercial software efforts do.

A related problem is that scholars often go in search of content rather than tools. But content providers at the workshop (i.e., libraries and museums) wondered how to communicate with the tool builders and how best to integrate their tools (especially when they sometimes have limited technical resources). One frequently voiced fear is that tools and content, unless done well, can become silos out of which it is hard to extract research, which makes users even more hesitant to commit to their use. Those who turn to a tool first are often unaware of content that works with that tool. Like tool registries, content registries have seen mixed success, at best.

Finally, most workshop attendees felt that the rewards system in academia did not currently give proper incentives either to participate in tool building or to use tools to produce digital scholarship. Moreover, addressing many of the problems delineated in this section involves relatively unglamorous work–e.g., writing documentation, training others, or assisting in the design of a tool–that is unlikely to ever be rewarded. Part of the issue here is a lack of assessment, either by peer-review or by good metrics of usage or the impact of a tool or resource.

Proposed Solutions

Not surprisingly, the “solutions” proposed were as diverse as the range of problems. The majority of discussion was clustered around cyberinfrastructure. While variously given labels as diverse as “repository,” “registry,” “consortium,” and even “invisible college,” the underlying premise can be summarized as economy of scale and the focusing of attention. A common environment for technological services ranging from Subversion to the obligatory blogs, wikis, and discussion lists would also provide a natural community focal point for related activities such as outreach, information-sharing, and peer review. Specifically, useful cyberinfrastructure in the form of a registry or repository could either provide or facilitate the following:

Provide a code depository
Provide basic development management tools, from team management to wikis to bug tracking
Provide an outreach function that explains tools, methods, and practices
Provide a discovery function so tools can be found by naive users
Provide documentation and encourage standardization of documentation
Run contests and exchanges
Provide discipline-specific textbooks, recipes, cookbooks, and review
Run training seminars and develop training materials
Lobby on behalf of tool development, open access to content (for tools), and innovative methods

Brett Bobley of the Office of Digital Humanities at the National Endowment for the Humanities had prepared and circulated a “draft strawman RFP” encompassing a number of the above aims and objectives. Discussion of this document aired important issues and challenges. The question of audience remained paramount: who is the audience for this kind of cyberinfrastructure? End-users? Developers? Is the goal to document the use of tools or to encourage re-use? Both?

It was suggested that repositories and the like are ultimately limited by their funding and what kind of buy-in can be purchased through awards to grantees; without clear incentives for participation and collaboration, any infrastructure will be partial and entropic. To that end, the “invisible college” model, which places its emphasis on communities rather than static resources, was suggested as an alternative. According to Wikipedia,

The idea of an invisible college became influential in seventeenth century Europe, in particular, in the form of a network of savants or intellectuals exchanging ideas (by post, as it would have been understood at the time). . . .The term now refers mainly to the free transfer of thought and technical expertise, usually carried out without the establishment of designated facilities or institutional authority, spread by a loosely connected system of word-of-mouth referral or localized bulletin-board system, and supported through barter (i.e. trade of knowledge or services) or apprenticeship.[http://en.wikipedia.org/wiki/Invisible_College]

In the specific case of the digital humanities tools community, an invisible college would foster symposia, expert seminars, and activities such as peer review. As an example of incentives and rewards, “membership” in the college would require that an institution assume responsibility for cataloging the tools that have been produced there and bringing them into whatever infrastructure has been created by the community. Such a community (or “college”) might or might not have a relationship to an emerging scholarly cyberinfrastructure project such as Project Bamboo [http://projectbamboo.org].

A number of other ideas converged around the rubric of outreach, education, and marketing. These included contests and “exchanges” (the latter along the lines of the MIREX Music Information Retrieval Evaluation eXchange, http://www.music-ir.org/mirex/2009/index.php/Main_Page); professional-grade manuals, textbooks, documentation, and “recipes” (like Bill Turkel’s The Programming Historian or TAPoR’s http://tada.mcmaster.ca/Main/TaporRecipes); an online journal devoted to publishing about tools and methods (it was suggested Digital Humanities Quarterly, among others, already meets this need); attempts to secure popular media coverage for successful tools in venues like The Chronicle of Higher Education, Wired, and Scientific American; and a collective effort to establish a digital humanities presence on sites like Slashdot. All of these, of course, require varying degrees of effort, coordination, and leadership.

Recommendations

As the previous sections of this report make clear, there is no single simple solution to the complex problems explored during the workshop. We believe, however, that the series of grant-based opportunities we articulate below could dramatically change the landscape for the production, evaluation, and use of digital tools for data-driven scholarship. Of course, not all solutions will require grants, and the energy of scholarly and technical communities will also help to drive innovation and adoption. Nevertheless, underlying all of these recommendations is the need to embed tool building and tool use deeply within scholarly communities of practice, which means that focused attention must be given both to the fostering of such communities and to making them international in their scope.

What we imagine is a dynamic site similar in some ways to SourceForge (“ToolsForge”?) that consists of (1) a tools development environment; (2) a curated tools repository that provides peer reviewing and discovery functions; and (3) a set of community building and marketing functions. We are aware that this is a tall order and have therefore tried to suggest below the ways it might be sliced into several related funding opportunities. We also recognize that some of these conclusions have been reached by Project Bamboo as well and are outlined in its 7-10 year initial project plan.[https://wiki.projectbamboo.org/display/BPUB/Bamboo+Program+Document+v.0.1]

Promotion of Sustainable, Interoperable Tools

Fund a program for training trainers. This would enable outreach to humanities centers, teaching them about the applications of tools developed at other centers, and elsewhere, to humanities research. It would then be the job of those trained to reach out to the “analogue” humanists in their faculty. This program could be a core activity of centerNet.

Fund a grant opportunity for collaborative work that would embed one or more significant tools within a significant body of humanities content. There was strong agreement from workshop participants that “analogue” humanists are most likely to discover and use digital tools if they are located within prominent sources of content.

Fund a grant opportunity for collaborative work that would make two or more significant tools interoperable, and/or establish strong grantee guidelines encouraging open standards and interoperability. This is a way to build cyberinfrastructure from the ground up by promoting the integration of tools that are already being used to good effect. Projects that are designed to develop specifications for tool interoperability should be preferred. Movement in this direction is already occurring, with the SEASR project exploring workflows and APIs for connecting large-scale content with analytics (and in its connection with Zotero), and a recent SSHRC grant to Bill Turkel and others to convene an international workshop on APIs in the digital humanities.

Infrastructure Creation and Support

Fund the creation of a shared tools development infrastructure, replete with developer tools, cookbooks, and recipes. This could be done through a CFP, a cooperative agreement, or a competitive grant. It need not be built from scratch, but it should not be associated with any one university or small group of universities who might be perceived as “owning” it; hosting could be done by centerNet or ADHO. Applicants would need to survey the most promising and cost-efficient ways to realize this goal. As John Unsworth has pointed out to us, for instance, such a site could be set up on something like Slicehost (http://www.slicehost.com) on which various Atlassian developer tools could be licensed (http://atlassian.com/software/), or perhaps grant funding could support something like the hosted Jira Studio solution at Atlassian, which combines Atlassian’s bug tracker, wiki, and development tools with Subversion source control. Such a site would be a boon for tool builders who do not have access at their own universities to a sophisticated development environment. Careful thought needs to be given to what would make this environment enticing enough for experienced tool developers to want to build within it, as well as how to best leverage their experience to benefit new tool builders.

Whatever solution is ultimately chosen, funding for its development should also include salary for a project management evangelist, someone with broad experience in software development projects who could visit various centers, conference venues, and universities to explain the benefits of the shared development platform, demo the software, and help developers get organized in the new environment. This site would need initial long-term funding to establish itself, and grantees should also be asked to provide a business plan for how to sustain it beyond the term of its initial grant funding.

Fund either through a grant or a cooperative agreement a Curated Tools Repository/Journal/Review Site. This site would at minimum peer review tools; keep an online copy of all tools that it accepts for publication; enable a community of users to review and tag any of the published tools; and provide a sophisticated set of discovery and recommender services that would help users easily find tools according to their needs. For such a site to run successfully, it would need a General Editor and an editorial board of distinguished digital humanists who would provide the first level of peer review. In order to provide a strong initial set of content, the General Editor could invite a number of already established tools to be submitted for peer review and potential publication. Beyond such solicitations, the content for the repository would come from tools that get promoted from the development infrastructure discussed in our first recommendation, from unsolicited submissions, and from collecting (within a dedicated subsection) tools of publishable quality that are no longer under development but that have useful code that others might build upon. Scholarly content of this site could also include peer-reviewed articles about tools, a Reviews section, and a crawl of news and notes. This Repository/Journal could become the source of a series of further community building initiatives, eventually taking the form of the “invisible college” discussed in our Solutions section above.
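The discovery function described above could start from something quite simple. The Python sketch below ranks registry entries by how many of a user's requested tags they carry. It is a minimal illustration under stated assumptions: the `ToolEntry` record, the `find_tools` function, the tag vocabulary, and the tool names are all hypothetical, and a real recommender service would draw on reviews, usage data, and a curated taxonomy rather than bare tag overlap.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEntry:
    """A hypothetical registry record: a tool name plus community-assigned tags."""
    name: str
    tags: set = field(default_factory=set)


def find_tools(registry, wanted_tags):
    """Rank entries by how many wanted tags they carry; drop entries with none.

    Ties are broken alphabetically by name so the ordering is stable.
    """
    wanted = {t.lower() for t in wanted_tags}
    scored = []
    for entry in registry:
        overlap = len(wanted & {t.lower() for t in entry.tags})
        if overlap:
            scored.append((overlap, entry.name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Placeholder entries; the names do not refer to real tools.
registry = [
    ToolEntry("ExampleAnnotator", {"annotation", "text"}),
    ToolEntry("ExampleMapper", {"GIS", "visualization"}),
    ToolEntry("ExampleConcordancer", {"text", "visualization"}),
]
matches = find_tools(registry, ["text", "visualization"])
```

Because tags are matched case-insensitively and entries may carry several of them, a complex tool can be found under more than one heading, which goes some way toward the taxonomy problem that earlier registries encountered.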

Increasing Recognition

Fund one, or possibly two, workshops focused on mechanisms for peer reviewing scholarly tools. The issues involving the credentialing of tools and tool builders are complex and need to be thought through carefully before tool builders can be assured of gaining suitable academic credit for their accomplishments and scholarly users can have confidence in the quality of the tools they discover online. The workshop(s) should be structured to provide the intellectual foundation for the funding opportunity we recommend next.

Fund an annual software prize for the most useful or interesting tool of the year. This could be a relatively modest amount ($10,000 or $15,000) that would raise the profile of tool building in general, highlight and market a number of important tools each year, further credential and incentivize tool developers, and help to build community. The award could be determined by centerNet in conjunction with the editorial board of “ToolsForge.” A good example of this kind of recognition is T-REX (http://tada.mcmaster.ca/trex/).

We offer these recommendations not as a panacea for all of the complex problems that were identified during the workshop, but as a wide-ranging and coordinated means for addressing some of the most fundamental and intransigent of them. Whatever their individual disagreements, all participants in the workshop shared a common sense of urgency: they believed that effective action needs to be taken immediately and that the future of digital tools for data-driven scholarship is now.

Final Report

Posted on March 25th, 2009 in Uncategorized
