Matt Kiefer: Anachronisms, Data Exclusivity and Teaching the Best Tools
Matt Kiefer is a data editor for The Chicago Reporter, an investigative news outlet that focuses on race, poverty, and income inequality in Chicago and beyond. He previously worked at the Chicago Sun-Times and as a Freedom of Information Act (FOIA) researcher at the Better Government Association, a Chicago-based investigative journalism nonprofit. Matt spent two years as a software programmer before returning to journalism, where he now contributes to data-driven reporting such as the Reporter’s recent “Settling for Misconduct.”
In your own words, what is data journalism?
Data is getting to be ubiquitous. I don’t necessarily think there will be a title of “data editor”, “data reporter” or “data journalist” – it’ll just become journalism. That’s not to minimize but rather to underscore the importance of data literacy.
I’ll step back and give an example. One of my favorite books is “All the President’s Men”, and in the beginning the cops catch the Watergate Hotel burglars and it all leads to the President resigning. I think the Washington Post reporter who first phoned in the story was a reporter who didn’t write. He just hung around the police station and phoned the Metro editor and told him what he found out. I don’t know why this guy didn’t write, but the days of the journalist who doesn’t write, that’s a bygone era. Now everyone is expected to be able to write copy. And so not too long from now, if not already, there’s just going to be an expectation that when data comes your way, you should be processing it.
So I think that “data journalism” is an anachronism waiting to happen. Just call it journalism.
What kind of work do you do with data?
Quite a lot of the work is cleaning the data or just verifying its integrity. So if you’re expecting a date to exist in a record but then in some records it’s null or blank, you have to figure out if you need to put in a default or accommodate that in your analysis. There’s a lot of exploratory work that goes along with cleaning and importing.
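[Editor’s note: the kind of integrity check described here might look something like the minimal Python sketch below. The sample records, the column names (`case_id`, `filed_date`), and the helper functions are hypothetical and chosen only for illustration; they are not from the Reporter’s own code.]

```python
import csv
import io
from datetime import datetime

# Hypothetical sample records; the column names are assumptions for illustration.
SAMPLE = """case_id,filed_date
A1,2015-03-02
A2,
A3,NULL
A4,2016-11-30
"""


def parse_date(value):
    """Return a date, or None when the field is blank or holds a null marker."""
    text = (value or "").strip()
    if text == "" or text.upper() == "NULL":
        return None
    return datetime.strptime(text, "%Y-%m-%d").date()


def split_by_date(rows, field="filed_date"):
    """Separate records with a usable date from those missing one."""
    dated, missing = [], []
    for row in rows:
        parsed = parse_date(row[field])
        if parsed is None:
            missing.append(row)
        else:
            dated.append({**row, field: parsed})
    return dated, missing


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
dated, missing = split_by_date(rows)
print(f"{len(dated)} records with dates, {len(missing)} without")
```

[Keeping the undated records in their own list, rather than silently dropping them or filling in a default, leaves the decision about how to treat them to the analysis step.]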
I’m familiar with relational databases. I use them to store the data, and then I’ll use Python to actually analyze the data.
Because the life cycle of a data project could be just several days, I’ll often just use SQLite. There’s also PostgreSQL, and that’s better when you need other people to connect to the database. We’ve had projects where we’ve had people throw data over the wall and be like, “Here, process it.” And that’s fraught with potential error, so I’ve told myself that next time when I do a project with many hands on the data, we’ll use something like PostgreSQL.
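[Editor’s note: a minimal sketch of the short-lived workflow described above, using Python’s built-in `sqlite3` module. The table name, columns, and sample rows are hypothetical; an in-memory database stands in for the single file a real project would use.]

```python
import sqlite3

# Hypothetical rows: (community_area, minutes_to_work).
records = [
    ("Area A", 42.0),
    ("Area A", 38.0),
    ("Area B", 25.0),
    ("Area B", None),  # a missing value, stored as NULL
]

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commutes (community_area TEXT, minutes REAL)")
conn.executemany("INSERT INTO commutes VALUES (?, ?)", records)

# AVG() skips NULLs, so the missing values are counted separately.
query = """
    SELECT community_area,
           AVG(minutes),
           COUNT(*) - COUNT(minutes)
    FROM commutes
    GROUP BY community_area
    ORDER BY community_area
"""
summary = conn.execute(query).fetchall()
conn.close()

for area, avg_minutes, missing_count in summary:
    print(area, avg_minutes, missing_count)
```

[SQLite needs no server, which suits a project handled by one person over a few days. When several people need to read and write the same data, a server-based database such as PostgreSQL is the more natural fit, as noted above.]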
Where did you get the data for your story on reverse commute times?
[The U.S.] Census primarily. I think we also got some data from the Bureau of Labor Statistics (BLS). The Census gave us the travel time, and the Bureau gave us where the jobs were. The Census does a better job tracking individuals and the work that they do, [and] it does do some publication of data on businesses and the types of jobs they hire for, but it wasn’t as precise.
When I say the BLS, I mean the Illinois Department of Employment Security (IDES), which is kind of like the state version of the BLS in a way — there’s a lot of overlap in their data. So the state happened to have really good up-to-date data that they were publishing.
On what he looks for in data analysis:
With my job, it’s to have geographic precision. Because we investigate race and poverty in Chicago, I’m curious about certain demographic and socio-economic conditions in certain parts of the city.
And so once you have data from different points – you’re looking at jobs, and transportation opportunity, and you have a common geographic element between them – it’s easier to start looking for correlations because you have the ability to walk across the data and say, “Ok, for each community area, give me the average travel time to work, the unemployment rate”, and do an index or a ranking or something to figure out whether your correlation makes any sense.
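[Editor’s note: a minimal sketch of “walking across” two datasets that share a geographic key, then ranking and correlating them. The area names and figures are hypothetical placeholders, not real Chicago data, and the Pearson coefficient is one common choice rather than the method used in the story.]

```python
from statistics import mean

# Hypothetical per-area figures, keyed by the shared geographic unit.
travel_minutes = {"Area A": 40.0, "Area B": 25.0, "Area C": 33.0, "Area D": 29.0}
unemployment_pct = {"Area A": 18.0, "Area B": 6.0, "Area C": 12.0, "Area D": 9.0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Keep only areas present in both datasets (the common geographic element).
areas = sorted(travel_minutes.keys() & unemployment_pct.keys())
xs = [travel_minutes[a] for a in areas]
ys = [unemployment_pct[a] for a in areas]

# Rank areas from longest to shortest average commute.
ranking = sorted(areas, key=lambda a: travel_minutes[a], reverse=True)
r = pearson(xs, ys)

print(ranking)
print(round(r, 3))
```

[A coefficient close to 1 or -1 on a handful of areas is only a starting point for reporting; it does not show that one condition causes the other.]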
On sharing code and data and why news organizations still face resistance to it:
The idea that we’ll share data and share code is still new to a lot of organizational leaders. Although there’s definitely a benefit to it, sometimes giving away data can feel like giving away the farm. It’s sort of like giving away your code on GitHub: you’re not necessarily giving away your story, just the means to researching it.
It all boils down to showing your work, and academics have been doing this for generations. So peer review can be a good thing. It’s just sort of a concern sometimes that we’ve worked so hard on this, and our business model, whether it’s for-profit or not, kind of depends on our exclusivity. So that’s a consideration as well.
Tips for students who are learning to work with data:
I don’t really think there’s a point in learning something twice, so I usually try to teach the best tool, regardless of how steep the learning curve may appear to be. If they know the right tool, they don’t have to learn it the second time.
Sometimes there’s a 200-line script in an intro class. A good introductory programming class should probably be 20 lines or fewer, and you should really get those 20 lines. Then you’re ready. If you can write 20 lines of code, you can write 200!
This interview is part of our Summer Data Journalism Series, where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.