Chris Groskopf: On Sharing Data and the Value of Transparency
Chris Groskopf is a reporter on the Things team at Quartz. Previously, he worked on NPR’s visuals team and the news apps team at the Chicago Tribune. Throughout his career as a data journalist, Chris has spent a lot of time building tools for other journalists, from csvkit to the PANDA Project to his most recent project, agate, a data analysis library for Python. When not designing and refining data tools, he writes articles explaining issues ranging from why it’s difficult to rig a US election to how the modern stock market is like a badly designed computer system.
We spoke with Chris over Skype and asked him about what goes into his many data tool creations and the importance of transparency in data journalism.
In your own words, what is data journalism?
It seems like an easy question, but I don’t know that it actually is. I’ve increasingly wanted to not make a distinction between data journalism and regular journalism. I think that this distinction is still useful for hiring and for bringing people into journalism that might not have otherwise seen it as their calling, but from my perspective, data is a kind of evidence and all journalism requires evidence.
Increasingly, the facts of the matter take the form of data, whether that’s a CSV file or a dump of emails. In order to consider the facts the way that journalists should, we have to be able to process all that information. So that’s how I see my role as a data journalist: someone who has the skills from computer science, information science and other related disciplines to process this new kind of information that is so important for journalists to work with.
What are some of the problems that you’ve tried to solve with the tools you’ve designed?
csvkit was designed to solve the problem of constantly doing the same thing over and over again and having it take too long every time. I can write Python code to cut columns out of a CSV file and it won’t take me that long to do that, but I’m writing custom code every time. And if you’re doing that every day or a few times a week, that is just wasted time because you can easily generalize that problem. And that’s what the original csvkit did. As a computer scientist, I have a strong urge to automate those things and make sure I don’t reinvent the wheel every time.
So with csvkit, I’ve built a set of tools that are the general versions of things that I would’ve written code again and again to solve, whether that’s cutting a CSV by columns or filtering by rows or applying a regular expression or generating basic summary statistics. Those are all things that I was doing, and now I have these basic command line tools that do them in just a few seconds and I don’t have to worry about: Oh, did I introduce a bug this time? Did I make a copy and paste error? Where do I store all this code? All of those sorts of related problems you have when you write something from scratch.
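To make the idea concrete, here is a minimal sketch of the kind of one-off Python code Chris describes rewriting before he generalized it: cutting columns, filtering rows with a regular expression, and computing basic summary statistics. This is not csvkit’s implementation; the function names (cut_columns, grep_rows, summarize) and the sample data are hypothetical and exist only for illustration. It uses only Python’s standard library.

```python
import csv
import io
import re
import statistics


def cut_columns(rows, names):
    """Keep only the named columns from a list of dict rows."""
    return [{name: row[name] for name in names} for row in rows]


def grep_rows(rows, column, pattern):
    """Keep rows whose value in `column` matches the regular expression."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row[column])]


def summarize(rows, column):
    """Return min, max and mean of a numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Made-up sample data, for illustration only (not real figures).
SAMPLE = """city,state,population
Springfield,IL,114000
Chicago,IL,2700000
Madison,WI,270000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
illinois = grep_rows(rows, "state", r"^IL$")            # filter rows by regex
trimmed = cut_columns(illinois, ["city", "population"])  # cut by columns
stats = summarize(trimmed, "population")                 # summary statistics

print(trimmed)
print(stats)
```

Each helper is only a few lines long, which is the point: none of it is hard, but retyping it for every new file invites exactly the bugs and copy and paste errors described above. csvkit packages the general versions of these operations as command-line tools (for example csvcut, csvgrep and csvstat), so nothing has to be rewritten.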
You spoke about how part of the success of csvkit came from sharing the code base with others. In addition to collaborating on open source tools, do journalists also share code and the actual data sets?
Definitely. I think this community of data journalists as it exists now has only been around for six or seven years. It’s still fairly new but part of what attracted me to it was when I went to my first NICAR conference and I met all these people, there was no sense of proprietariness about it. It definitely helped in a roundabout way that journalism was sort of in crisis and people were looking for ways to do more with less and ways to do new things that might help offset some of the big layoffs that we saw at the end of the last decade.
There was a very collegial attitude that we’re all in this together and nobody was trying to hide things or keep things secret. And out of that grew, I think, a real spirit of sharing. I use not only tools written by other people, but also code written by other people and there’s a very strong ethos around sharing source data and sharing techniques for writing FOIAs. I feel confident that if I had a problem or needed a source on a particular topic that there’s a lot of people that I could talk to and they wouldn’t be cagey about that.
I think that whether it’s data or code or whatever it is, this industry has really benefited from the fact that so many people are willing to share and I hope that that continues to be the case.
In your introduction to Agate, you talk about how an important component of this tool is the ability to allow other people to check your analysis. Why is this transparency so important?
As journalists, we make claims about the world. We are in the business of coming out and saying that something is the case. Twenty or thirty years ago, we would sort of marshal the facts and carefully construct arguments that lead naturally to some conclusion that we want people to believe, and we still do that. But increasingly, a really important part of doing that work is marshaling the data and the analysis in a way that traditionally we would’ve only seen in the hard sciences. I can’t think of a really influential investigative project in the last few years that didn’t have some data component.
As we’re using data to present an argument, we have to be mindful of the fact that data is easy to manipulate and it’s easy to mislead and it’s easy to make mistakes. I think that for journalists who are working with data and trying to persuade the public of some important truth, it’s essential that we have a very high degree of transparency about our methods.
What’s needed for transparency to be effective? Is it enough to publish the data sets or do journalists need to mirror the hard sciences in providing and explaining their methodology?
It has to be something that other people can look at and say, “Yes, this follows.” If I needed to, I could go do this work again and I would reach the same answer. Or, “No, this doesn’t follow” — they got it wrong and I’m going to publish something that shows why they got it wrong and that’ll move the conversation forward.
It’s possible that journalists could skip this kind of data work and just leave it to scientists, and some people have suggested that, but I don’t think that’s a terribly realistic option. I think that increasingly, for these important social issues and political issues and others, we are going to be doing the data work because we are best situated to understand where that data is, the complexities of looking at it, who we can talk to, and all of those various factors.
I think it’s very important for journalistic ethics that we build reproducible processes and publish our process.
This interview is part of our Summer Data Journalism Series where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.