Tracking Changes With diffengine

Our most respected newspapers want their stories to be accurate because once the words are on paper, and the paper is in someone’s hands, there’s no changing them. The words are literally fixed in ink to the page, and mass produced into many copies that are pretty much impossible to recall. Reputations can rise and fall based on how well newspapers are able to report significant events. But of course physical paper isn’t the whole story anymore.

News on the web can be edited quickly as new facts arrive, and more is learned. Typos can be quickly corrected–but content can also be modified for a multitude of purposes. Often these changes instantly render the previous version invisible. Many newspapers use their website as a place for their first drafts, which allows them to craft a story in near real time, while being the first to publish breaking news.

News travels fast in social media as it is shared and reshared across all kinds of networks of relationships. What if that initial, perhaps flawed version goes viral, and it is the only version you ever read? It’s not necessarily fake news, because there’s no explicit intent to mislead or deceive, but it may not be the best, most accurate news either. Wouldn’t it be useful to be able to watch how news stories shift in time to better understand how the news is produced? Or as Jeanine Finn memorably put it: how do we understand the news before truth gets its pants on?

As part of MITH’s participation in the Documenting the Now project we’ve been working on an experimental utility called diffengine to help track how news is changing. It relies on an old and quietly ubiquitous standard called RSS. RSS is a data format for syndicating content on the Web. In other words it’s an automated way of sharing what’s changing on your website, and for following what changes on someone else’s. News organizations use it heavily. When you listen to a podcast you’re using RSS. If you have a blog or write on Medium an RSS feed is quietly being generated for you whenever you write a new post.

So what diffengine does is really quite simple. First it subscribes to one or more RSS feeds, for example the Washington Post, and then it watches to see if any articles change their content over time. If a change is noticed a representation of the change, or a diff, is generated, the new version is archived at the Internet Archive, and the diff is (optionally) tweeted.

We’ve been experimenting with an initial version of diffengine by having it track the Washington Post, the Guardian and Breitbart News which you can see on the following Twitter accounts: wapo_diff, guardian_diff and breitbart_diff. Nick Ruest at York University and Ryan Baumann at Duke University have been setting up their own instances of diffengine to track what is now 25 media outlets, which you can see in this list that Ryan is maintaining.

So here’s an example of what a change looks like when it is tweeted:

https://twitter.com/wapo_diff/status/819885771469553664

The text highlighted in red has been deleted and the text highlighted in green has been added. But you can’t necessarily take diffengine’s word for it right? Bots are sending all kinds of fraudulent and intentionally misleading information out on the web — especially in social media. So when diffengine notices new or changed content it uses Internet Archive’s save page now functionality to take a snapshot of the page, which it then references in the tweet. So you can see the original and changed content in the most trusted public repository we have for archived web content. You can see the links to both the before and after versions in the tweet above.

diffengine draws heavily on the work and example of two similar projects: NYTDiff and NewsDiffs. NYTdiff is able to create presentable diff images and tweet them for the New York Times. But it was designed to work specifically with the NYTimes API. diffengine borrows the use of phantomjs for creating tweetable images. NewsDiffs on the other hand provides a comprehensive framework for watching changes on multiple news sites (Washington Post, New York Times, CNN, BBC, etc). But you need to be a programmer to add a parser module for a website that you want to monitor. It is also a fully functional web application which requires considerable commitment to setup and run.

With the help of feedparser diffengine takes a different approach by working with any site that publishes an RSS feed of changes. This covers many news organizations, but also personal blogs and organizational websites that put out regular updates. And with the readability module diffengine is able to automatically extract the primary content of pages, without requiring special parsing to remove boilerplate material on a site-by-site basis.

To do its work diffengine keeps a small database of feeds, feed entries and version histories that it uses to notice when content has changed. If you know your way around a SQLite database you can query it to see how content has changed over time. This database could be a valuable source of research data, or small data, for the study of media production, or the way organizations or people communicate online. One possible direction we are considering is creating a simple web frontend for this database that allows you to navigate the changed content without requiring SQL chops.

Perhaps diffengine could also create its own private archive of the web content, rather than relying on a public snapshot at the Internet Archive. Keeping the archive private could help address ethical concerns around documenting particular individuals or communities when conducting research. If this sounds useful or interesting please get in touch with the Documenting the Now project, by joining our Slack channel or emailing us at info@docnow.io.

Installation of diffengine is currently a bit challenging if you aren’t already familiar with installing Python packages from the command line. If you are willing to give it a try let us know how it goes over on GitHub. Ideas for sites for us to monitor as we develop diffengine are also welcome!

Special thanks to Matthew Kirschenbaum and Gregory Jansen at the University of Maryland for the initial inspiration behind this idea of showing rather than telling what news is. The Human-Computer Interaction Lab at UMD hosted an informal workshop after the recent election to see what possible responses could be, and diffengine is one outcome from that brainstorming.

This page was originally published on the Documenting the Now blog