Abe Epton On Data Journalism: Pandas, Pain Points, and More
Abe Epton is a data reporter at KUOW-FM, an NPR member station in Seattle, and was previously on the news app team at the Chicago Tribune. Before getting into data journalism, Abe worked at Google as a technical specialist in News and Books publisher support. While in Chicago, he used data to shed light on topics including the city’s automated speed camera program, lead testing in tap water, and young victims of homicides and shootings.
We caught up with Abe over the phone to ask him about how he works with data, the tools he uses, and the efforts involved. We found out that geocoding data can be a real pain and that it might be a bad idea to publish a map of gang boundaries. Here are some highlights from our interview:
What is data journalism?
I tend to think of data journalism as just journalism. I think there is a distinction you could make but I don’t find it necessarily super helpful, especially because most journalism that I find really compelling includes some data. Some of the most successful stuff I’ve worked on was not just “here’s a data set, make a story out of it!” but “here’s a story, and in order to tell the story, we’re going to need to answer some questions, and some of those questions are really data-heavy questions.”
What are some of the data tools that you use to make it easier to work with data?
I’m a developer and I primarily live in Python-land. So there is stuff in Python that is really helpful, like pandas. Chris Groskopp, who used to be at the Chicago Tribune, has this Wireservice suite of tools that includes csvkit for processing CSV files, [and] there’s this visualization stuff on top of it as well. But there’s so many things; there’s a million tools that are useful.
Are there any common formats that you get data in? Excel, Word, PDF?
The mother of all these is CSV. CSV is kind of like the lingua franca for a lot of data journalism. I don’t know if they’re the most common just because so many things are in Excel spreadsheets and not every Excel spreadsheet can be reduced to a CSV, although hopefully most of them that we would actually want to use are reducible that way. But outside of CSV, really, if it involves maps, there’s JSON files.
Are there some data processing tasks that you have to do repeatedly?
There are two main things that are constant pain points and that I feel like some service could fix or solve.
One of them is geocoding. Geocoding [itself] is a solved problem: there are a lot of places to do it, and you can get free packages to do it yourself in a database. The problem is scale. Right now I’ve got 5 million addresses I need to translate into latitude-longitude coordinates. I have servers and I can do it on my own, [but] it’s just going to take a really long time. The free versions aren’t necessarily the most reliable. Google’s geocoder is amazing but you can’t use it for this purpose because you’re not supposed to store the geocoded results from their service on your server.
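One common way to take some pressure off a job like this is to geocode each distinct address only once and reuse the result. The snippet below is a minimal sketch of that idea, not the approach Abe describes using. The `geocode_one` callable is a hypothetical stand-in for whatever geocoder is available, and the sample address and coordinates are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coord = Tuple[float, float]


def normalize_address(raw: str) -> str:
    """Uppercase and collapse whitespace so trivial variants share one key."""
    return " ".join(raw.upper().split())


def geocode_all(
    addresses: Iterable[str],
    geocode_one: Callable[[str], Optional[Coord]],
    cache: Optional[Dict[str, Optional[Coord]]] = None,
) -> List[Optional[Coord]]:
    """Return one result per input address, calling geocode_one once per distinct key.

    geocode_one is an assumption: any function that maps an address string
    to (latitude, longitude), or None when no match is found.
    """
    if cache is None:
        cache = {}
    results: List[Optional[Coord]] = []
    for raw in addresses:
        key = normalize_address(raw)
        if key not in cache:
            cache[key] = geocode_one(key)  # the only place the slow lookup happens
        results.append(cache[key])
    return results


# Stand-in geocoder with illustrative coordinates; it records each call it receives.
calls: List[str] = []


def fake_geocoder(address: str) -> Optional[Coord]:
    calls.append(address)
    return {"435 N MICHIGAN AVE": (41.89, -87.62)}.get(address)


coords = geocode_all(
    ["435 N Michigan Ave", "435  n michigan ave", "1 Unknown St"],
    fake_geocoder,
)
```

Here three input rows lead to only two geocoder calls, because the first two normalize to the same key. Passing in a `cache` dictionary that has been loaded from disk would let a long job resume without repeating earlier lookups.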
The other thing that might even be more game changing is de-duplicating records. The folks at DataMade have this fantastic library called Dedupe and it’s really great and I think they’re starting to go down the road of making it more of a service for non-programmers to use. If you could build a cloud version of that, it would be really interesting and it could open up a lot of possibilities for folks.
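To give a sense of what de-duplicating records involves, here is a rough sketch that groups near-identical names using only Python's standard library. It is not how the Dedupe library works and does not use its API; the greedy grouping, the 0.9 threshold, and the sample names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the normalized strings have a similarity ratio at or above threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def cluster_duplicates(names: List[str], threshold: float = 0.9) -> List[List[str]]:
    """Greedy grouping: each name joins the first cluster whose first member it resembles."""
    clusters: List[List[str]] = []
    for name in names:
        for cluster in clusters:
            if similar(name, cluster[0], threshold):
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


# Made-up names for illustration only.
clusters = cluster_duplicates(
    ["John Q. Smith", "john q smith", "Jon Q Smith", "Maria Lopez"]
)
```

A pairwise comparison like this gets slow on large files and the threshold needs hand-tuning, which is part of why a dedicated tool or service is attractive for this kind of work.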
Have you ever had to clean up data by standardizing the format?
Sure, that happens all the time. For example, I worked on this project, maybe a year-and-a-half ago, where… the Independent Police Review Authority gets cases referred to them when there’s issues of police misconduct and they publish a report. And the report is just plain text and doesn’t really have anything machine readable about what the outcome of that complaint was, so you have to read the thing to see what they did. So we wrote a parser that looked for the summary structure of the document but it wound up being really manual. It was a pain to write the parser. [We] probably spent as much time writing the parser as it would take to just read the 500 documents and create a little spreadsheet with that data in it.
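A parser of this kind often comes down to a few regular expressions that pull labeled fields out of plain text. The sketch below shows the general shape under heavy assumptions: the interview does not describe the actual report layout, so the sample text, the field labels, and the `parse_reports` helper are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical layout, invented for illustration; the real reports may look nothing like this.
SAMPLE = """\
Log No. 1001
Allegation: Officer used excessive force during an arrest.
Finding: SUSTAINED

Log No. 1002
Allegation: Officer failed to file a required report.
Finding: Not Sustained
"""

# One record is a log number line, an allegation line, and a finding line.
RECORD = re.compile(
    r"Log No\.\s*(?P<log_no>\d+)\s*\n"
    r"Allegation:\s*(?P<allegation>.+?)\s*\n"
    r"Finding:\s*(?P<finding>[^\n]+)"
)


def parse_reports(text: str) -> List[Dict[str, str]]:
    """Return one dict per record, with the finding lowercased for consistency."""
    rows: List[Dict[str, str]] = []
    for match in RECORD.finditer(text):
        row = match.groupdict()
        row["finding"] = row["finding"].strip().lower()
        rows.append(row)
    return rows


rows = parse_reports(SAMPLE)
```

Real documents rarely stay this regular, which fits the experience described above: every deviation from the expected layout needs another special case, and the manual effort adds up quickly.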
Have you ever needed to look at data across a particular time frame or a repeated pattern over a time series?
I think this is one of the places where it gets to be more of an art instead of a science, because a lot of times you get a data set or some ideas and you don’t really know if there’s a story there, and if there is a story you don’t really know what it is, you just have this data dump. So I know this interesting thing happened in 2008 so I look around then [to] see if there’s anything to reflect that. Or, I’m curious if things are getting better or worse, so let me try to graph it.
This is one place where visualizations become really useful even if you’re never going to publish them, just because they make it kind of easy to see at a glance what happened. For example, at the Tribune, I was doing a story about the donors to the Republican party, [and] matched them to ZIP code, animated it over 16 years and you can kind of see trends, you can see times when a certain region got a really high density. Then you can go, “Oh, what was it about that election that caused Bellevue to kick in a lot more money than they usually did?” The visualizations make that apparent in a way that you would probably have to do some kind of fancy statistics if you wanted to identify it without just looking at a map.
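A quick first look at "is this getting better or worse?" can be as simple as counting records per year and eyeballing the result. The following is a minimal sketch of that kind of throwaway chart using only the standard library; the event dates are made up, and the function names are illustrative rather than taken from any tool mentioned in the interview.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def count_by_year(dates: Iterable[date]) -> Dict[int, int]:
    """Count how many records fall in each calendar year, sorted by year."""
    counts = Counter(d.year for d in dates)
    return dict(sorted(counts.items()))


def text_chart(counts: Dict[int, int]) -> str:
    """Render one line per year with a bar of '#' characters and the count."""
    return "\n".join(f"{year} {'#' * n} ({n})" for year, n in counts.items())


# Made-up event dates for illustration only.
events = [
    date(2007, 3, 1),
    date(2008, 1, 15),
    date(2008, 6, 2),
    date(2008, 11, 30),
    date(2009, 4, 9),
]

yearly = count_by_year(events)
chart = text_chart(yearly)
print(chart)
```

Even a crude chart like this makes a spike in a single year visible at a glance, which is often enough to decide whether a closer look is worthwhile.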
In the various data projects you’ve done, what were some considerations that went into gathering and analyzing the data?
The Chicago police used to do these street stops and we FOIA’d for the street stop logs. We got hundreds of thousands of records of street stops and one of the fields that the street stop data had was gang institutions. So we were able to build up a map of where all the gangs were in the city and where the gang boundaries were. It was like, “This is so cool, we should try to publish this!” And the other people in the paper were like “That would be a terrible idea!” There’s not really a use value there, but more importantly, you’re producing a map of these contested territories. That itself could become a problem. They [the gangs] would be like, “That’s not their [territory], I’m going to go take that back from them.”
Just because you can do it doesn’t mean it would be a good idea. You have to be sensitive to, “If I put this out in the world, what happens?” If I say these hundred people are all donating to this person, this is all public record, they can all find that, that’s fine. If I say these hundred people have all had abortions or some medical procedure, that’s really private and maybe I change the names or something like that. It depends on the story you’re telling.
This interview is part of our Summer Data Journalism Series where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.