Jonah Newman: Collaborating on Data Projects
Jonah Newman is a database reporter for the Chicago Reporter. His recent joint project on police misconduct settlements is an ambitious endeavor to make these cases more transparent by providing lawsuit data going back to 2012 in an easily searchable and sortable database. Jonah's criminal justice coverage in the City of Chicago has included stories on racial profiling by the University of Chicago police department, and data visualizations on youth opinions on police and gun violence. Previously a database reporter at the Chronicle of Higher Education in Washington, D.C., Jonah started out in Chicago, at Northwestern University's Medill School of Journalism.
We had the opportunity to speak with Jonah over the phone and ask him about his road to being a database reporter, the data challenges that he faces and how he gets around them. Here are some highlights from our conversation:
What got you started in data journalism?
Now there’s more emphasis on teaching data journalism and there are some cool resources out there like the Knight Lab. But when I was there [Medill School of Journalism], those opportunities either didn’t exist or they just weren’t as well publicized. So I really didn’t know much about it at all. I kind of got lucky. After college I applied for a data journalism job that I really, to be totally honest, was not at all qualified for. I was looking for a job in journalism back then, and it was at the Chronicle of Higher Education in DC. I applied and they said, “We’re not filling that job right now but we need a researcher to help out on a project.”
So, I got hired to do that and through that project, which was a big data journalism project, I learned a ton and learned what data journalism was and how it was useful and started picking up some skills including SQL and more advanced Excel skills. I ultimately got the data journalism job that I had applied for in the first place, but I kind of just fell into it by accident.
What kinds of data tools do you use, be it different languages or off-the-shelf tools?
So I use Excel a lot, and SQL for more advanced data analysis. I’m starting to learn R, but I’m still relatively new to it. I have tried to use Python, but never really picked it up. So those are the only languages I use.
In terms of other tools, I’ve used pretty much everything. I’ve used Google Charts, CartoDB for mapping things, and occasionally Google Fusion Tables. To clean data I’ve used OpenRefine. I’ve used Tabula to get data out of a PDF. I’ve started learning QGIS for mapping a little bit. Those are the main things. I tried out a couple off-the-shelf web-scraping tools, but none of them were particularly useful, so I’ve started learning R to try to do that on my own.
What are some of the most frustrating processes when you’re working with data?
One of the things I do a lot is changing field types to whatever I need them to be. So if it's a date that's not actually in date format, or an ID that's not really in number format – things like that I spend a lot of time cleaning and formatting.
And then another major thing is when there's just so much white space and things you have to clean. Getting rid of trailing and leading spaces is probably another one.
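The cleaning steps described above – fixing field types and stripping stray whitespace – can be sketched in a few lines of pandas. This is a minimal illustration, not Jonah's actual workflow, and the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with the problems described above:
# IDs stored as padded text, dates stored as text, stray whitespace.
raw = pd.DataFrame({
    "case_id": ["  00123", "00456 ", " 00789"],
    "filed_date": ["01/15/2015", "03/02/2014", "11/30/2013"],
    "plaintiff": ["  Smith, J. ", "Doe, A.", " Lee, K.  "],
})

# Strip leading and trailing whitespace from every text column.
for col in raw.select_dtypes(include="object"):
    raw[col] = raw[col].str.strip()

# Fix field types: text dates become datetimes, text IDs become integers.
raw["filed_date"] = pd.to_datetime(raw["filed_date"], format="%m/%d/%Y")
raw["case_id"] = raw["case_id"].astype(int)
```

The same operations have equivalents in OpenRefine (trim whitespace, transform to date/number) and in SQL (`TRIM`, `CAST`); which tool makes sense depends, as Jonah says below, on where the data needs to end up.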
How do you resolve these problems right now? Is it just done within Excel, or are you using SQL?
Yeah, kind of a combination depending on what my needs are and where I want it to end up ultimately. There are times there’s no need to use SQL because it’s just simple operations like adding.
You’ve written and analyzed a lot of data concerning criminal justice in your work at the Chicago Reporter. Are there particular challenges with using data in this beat vs. sports data, etc.?
A challenge is dealing with redactions. I need to work around that sometimes. There are certain data sets that I get from the police department or wherever that are de-identified, so they've already taken all the personal identifiers out. Just thinking about how to use that [is challenging] — how to match it to other records when figuring out how to [combine] it with other information.
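One common way to work with de-identified records is to join on whatever non-identifying fields both data sets retain. The sketch below is a hypothetical illustration of that idea, not a description of the Reporter's actual data; all field names and values are invented:

```python
import pandas as pd

# De-identified incident data: personal identifiers removed,
# but non-identifying fields (date, district) remain.
incidents = pd.DataFrame({
    "incident_date": ["2015-04-01", "2015-04-02"],
    "district": [7, 11],
    "charge": ["battery", "theft"],
})

# A second data set that shares those non-identifying fields.
settlements = pd.DataFrame({
    "incident_date": ["2015-04-01", "2015-04-03"],
    "district": [7, 9],
    "amount": [50000, 12000],
})

# Join on the fields both sets retain; indicator=True flags which
# rows found a match, a rough check on how complete the linkage is.
merged = incidents.merge(
    settlements, on=["incident_date", "district"],
    how="left", indicator=True,
)
```

Matches made this way are only as trustworthy as the fields are unique — two incidents on the same date in the same district would collide – so linkage like this usually needs manual verification before publication.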
When you’re doing data analysis at the Chicago Reporter, do you share the code you use for the analysis and/or the data used?
So yes. We do that. We’re a pretty small team. Matt [Kiefer] is the data editor, so he and I will sometimes share our analysis and work to check up on each other. But we don’t have a great mechanism for doing that at the moment.
In a recent story, I sent all or a ton of my analysis files and underlying data that I’d been keeping to Matt, and had him check through the steps and look over the conclusions I got to and make sure he could get to the same conclusions. But I just shared the files with him on our shared server here. We didn’t put it on GitHub or anywhere else because it was a story we were still working on and we wanted to still keep to ourselves. We haven’t developed a robust system for that yet.
If you were to develop a robust system for collaborating on a data set, what would that look like?
Something that makes it easy to share the data in a place where you could also do the analysis. Sometimes I have to export the data from whatever tool I'm using as a CSV, and then get it imported somewhere else and analyze it. As many steps as you could get rid of, that's what I would want.
So almost like a portal?
Yeah, like a portal where you could run SQL queries or do other things within the portal.
This interview is part of our Summer Data Journalism Series where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.