During the weeks of April 14 – May 3, I was fortunate to spend my MLIS practicum at MITH under the supervision of Trevor Munoz. My main reasons for applying to MITH as a potential practicum site were my background in the digital humanities and library and information studies, and my interest in data curation and research data management in the humanities. I also wanted to apply the skills I learned in the “Data Curation for Digital Humanists” course at the Digital Humanities Winter Institute, hosted by MITH and the University of Maryland in January 2013 (taught by Trevor and Dorothea Salo from UW-Madison). I was pleased, then, to discover what a unique project Trevor chose for us during my time at MITH.
Our main project was to imagine ways to curate the publicly available data from the New York Public Library’s “What’s on the Menu?” (WOTM) special collection of digitized historical menus. The Library crowdsourced the transcription of the digitized menus via the project website and compiled the completed menu transcriptions into large spreadsheets, which are updated every two weeks as more data continue to be produced. Recently, NYPL Labs added a geo-tagging tool to add geospatial information to the historical menus, where available. An award-winning digital project, NYPL’s “What’s on the Menu?” is a great resource for exploring data curation in practice as it is free, openly accessible, and a subject of interest to humanists.
Our task as data curators, therefore, involved cleaning up, organizing, classifying, describing, structuring, representing and otherwise making more accessible the data provided on the “What’s on the Menu?” site. A user exploring the project site will quickly discover that, while over 26,000 menus have been digitized and displayed online, one can only browse them by the year the menus were produced, the name of the restaurant that produced the menu, and the total number of dishes that appeared on the menu. In other words, there is no thematic or categorical classification of these menus that would support the discovery of the rich historical and cultural information contained in this unique collection. We decided that categorizing and classifying the WOTM data to support richer browsing was a curatorial activity that might add value.
Before we began to categorize and curate this data, however, it was important to take stock of our data at hand. Essentially, we downloaded two spreadsheets from the WOTM site: one “Menus” set containing over 28,000 rows of data (the names and types of menu pages) and one “Dishes” set containing over 400,000 rows of data (the individual names of dishes that appeared on those menus, including their prices). After assessing the data, we wrote a data management plan (in best Digital Humanist practice) for the work we were about to begin in curating the Menus and the Dishes data.
While we could speculate on the potential usefulness of the data in this collection, it’s always better to have actual evidence of what users want. NYPL Labs graciously shared summaries of the requests they had received from people seeking credentials to use the WOTM application programming interface (API). These short descriptions served as evidence of user needs, which would help guide our decision-making process. Historians, social scientists, journalists, literary food scholars, chefs, novelists, teachers and students, as well as general enthusiasts all showed interest in access to this data. In order to make the menus more accessible for these potential users, we had to imagine what types of questions they might ask of this information. This would inform the general working taxonomy, or information architecture, of our data set.
First, however, we needed to clean it up. As you can imagine, crowdsourced data tends to be messy, containing many spelling variations, typos, ambiguities, and missing fields. We used OpenRefine (formerly Google Refine) software to cluster and rename data fields that we considered of particular use or interest to future users of this collection, principally the names of the businesses offering these menus and also, where present, the names of the categories (supplied by original cataloguers?) to which the menus had been assigned. While OpenRefine helped with the initial clustering, the large number of name and spelling variations meant some tedious line-by-line editing. This is also the data curator’s job.
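To give a sense of what this kind of clustering involves, here is a minimal Python sketch of a "fingerprint" key-collision approach, similar in spirit to one of the clustering methods OpenRefine offers. It is not the exact procedure we ran: the function names and the sample values are hypothetical and are not taken from the WOTM spreadsheets.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a name into a key so that near-identical spellings collide."""
    # Drop accents, surrounding whitespace, and letter case.
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = ascii_value.strip().lower()
    # Remove punctuation, then sort the unique whitespace-separated tokens.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical sample values for illustration only.
sample = ["Waldorf Astoria", "waldorf astoria.", "Astoria, Waldorf", "Hotel Astor"]
clusters = cluster(sample)
```

Each cluster is only a candidate for merging. A person still has to confirm that the variants refer to the same business and choose the preferred form, which is where the line-by-line editing comes in.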
Upon a general data cleanup, I was excited to launch into the categorizing! While working on my taxonomy of menus and dishes, I learned the difference between taxonomies, thesauri and ontologies, and have built up a nice working list of controlled vocabularies to consult for various subjects. Overall, when creating the taxonomy of the menus, I focused on what the potential users of this data might seek in this unique collection. As such, I developed three levels of classification with four broad categories by which to group the menus. These included the Hosting Organization that sponsored the menu, Type of Meal the menu reflected, the Restaurant Address where the meal was held, as well as the Type of Gathering for which the menu was produced. For example, we decided that users might be interested in exploring political and military meals held in the “What’s on the Menu?” collection, as well as browsing the menus created for special occasions, such as George Washington’s Birthday meals. While our taxonomy structure is not the definitive version of this data, Trevor and I nonetheless believe that providing users with the ability to browse all wedding menus, High Society Banquets or meals held for royal individuals, for example, was a value-added service that data curators can provide. In other words, our categories were additional ways to enter, explore and understand this special collection that lead to discovery and learning.
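For readers who think in data structures, a taxonomy like this can be sketched as a nested mapping. In the Python sketch below, the four top-level categories come from the description above, but the terms nested beneath them and their placement are hypothetical illustrations, not the taxonomy we actually produced.

```python
# The four top-level categories are from the post. Everything nested below
# them is a hypothetical illustration, not the actual practicum taxonomy.
TAXONOMY = {
    "Hosting Organization": {"Political": {}, "Military": {}},
    "Type of Meal": {},
    "Restaurant Address": {},
    "Type of Gathering": {
        "Special Occasion": {"George Washington's Birthday": {}, "Wedding": {}},
    },
}


def paths(tree, prefix=()):
    """Yield every term as a tuple path from its top-level category down."""
    for term, children in tree.items():
        path = prefix + (term,)
        yield path
        yield from paths(children, path)


def depth(tree):
    """Return the number of levels in the tree (0 for an empty tree)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


all_paths = list(paths(TAXONOMY))
levels = depth(TAXONOMY)
```

Storing each term with its full path is what makes browsing possible: a menu tagged with the path ending in "Wedding" can be listed under that term and under every broader term above it.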
User testing and evaluation was another important step in the curation process, as it allowed us to see whether we were on the right track with our curatorial practice. While we may have been following good metadata standards, referring to controlled vocabularies and linking URIs to our classification terms, none of this work would ultimately have mattered if the users could not navigate the data quickly and easily. I discovered a useful tool called TreeJack, which allows information architects to test their tree structure (the organization of information) by giving anonymous users several tasks to complete. We also asked our test subjects to answer three brief survey questions about the choice of terms used to create the categories for the menus. Based on the feedback we received, we changed the placement and labeling of certain categories. Ultimately, a final round of user testing would help evaluate this data curation project as a whole.
Finally, in my last week, we decided to develop a kind of proof-of-concept for our data curation work and display it online. Inspired by Aaron Straup Cope’s talk at the Library of Congress about his work on “Parallel Flickr”, Trevor and I tried to imagine our own “Parallel Menus” to experiment with our newly minted categories for this data. Initially, after loading the two spreadsheets into a MySQL database hosted on my University of Alberta web server, I wrote some PHP scripts that queried the database and displayed each category on its own static web page. Trevor, however, taught me about Bootstrap, a front-end design framework that allowed us to make our Parallel Menus a little more interactive through JavaScript. Finally, using the API key that Trevor obtained from NYPL, we were able to get thumbnail images of our menus from the “What’s on the Menu?” API service and insert them into our web code to entice the users of the taxonomy into browsing our site.
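The per-category query at the heart of those pages can be sketched in a few lines. The original work used MySQL and PHP; the sketch below swaps in Python with an in-memory SQLite database so that it runs on its own. The table names, column names, and sample rows are all hypothetical and do not reflect the actual schema.

```python
import sqlite3

# Stand-in for the MySQL database: an in-memory SQLite database with a
# hypothetical schema (table and column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE menus (id INTEGER PRIMARY KEY, sponsor TEXT);
    CREATE TABLE menu_categories (menu_id INTEGER, category TEXT);
    """
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO menus VALUES (?, ?)",
    [(1, "Hotel A"), (2, "Club B"), (3, "Hotel C")],
)
conn.executemany(
    "INSERT INTO menu_categories VALUES (?, ?)",
    [(1, "Wedding"), (2, "Military"), (3, "Wedding")],
)


def menus_in_category(connection, category):
    """Return (id, sponsor) rows for every menu assigned to one category."""
    cursor = connection.execute(
        """
        SELECT m.id, m.sponsor
        FROM menus AS m
        JOIN menu_categories AS c ON c.menu_id = m.id
        WHERE c.category = ?
        ORDER BY m.id
        """,
        (category,),
    )
    return cursor.fetchall()


wedding_menus = menus_in_category(conn, "Wedding")
```

Keeping the category assignments in their own table, rather than as a column on the menus table, lets one menu belong to several categories at once, which suits a taxonomy with more than one way into the collection.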
Clearly, my three weeks at MITH went by quickly, and I didn’t have a chance to work on the Dishes side of our WOTM data. Nonetheless, I kept documentation for my work as we went along, and even developed some categories and controlled vocabulary terms for organizing this much larger data set. Trevor will be teaching several Data Curation workshops in the coming months, and I hope that he will be able to expand on my practicum work by getting students to work on the Dishes part of this project. I hope that the “What’s on the Menu?” Data Curation project will eventually be handed over to NYPL to help make their site even more usable to the many visitors they receive each day. I would be honored if, upon handing it back to the New York Public Library, they used any of our suggestions with regard to organizing and classifying this data.
During my practicum, I appreciated participating both in the administrative duties of digital humanists, such as attending project meetings, policy and grant writing, and data audits as well as in more creative tasks such as attending talks at the Library of Congress, exploring new tools, and participating in the MITH Incubator project by helping librarians develop their own research projects. I especially valued the way Trevor combined the practical with the theoretical, getting me to think about the broader implications of the data curation process – its costs, its biases and limitations, its potential, its role in the creation of new knowledge. Working on this project gave me a whole new perspective on research data in the humanities, and I believe it will help me complete my thesis, as well as further my career. This was definitely a transformative experience.
Overall, I am incredibly grateful to have had the opportunity to put my skills and interests in digital humanities data curation to use at MITH under Trevor’s leadership. In addition to meeting the wonderful MITH team, many visiting scholars and the larger DC digital humanities community, I learned many exciting and useful things in my time at UMD: APIs, jQuery, LCSH Linked Data Service, OpenRefine and its RDF plug-in, TreeJack and tree testing, GitHub, Freebase, along with many others. I look forward to visiting MITH again in the future and launching DH curation projects of my own!