Andy Boyle: On Developing Data Apps and Presenting Data
Andy Boyle is a full-stack web developer for NBC’s BreakingNews.com and has worked for a range of publications, including the Chicago Tribune and New York Times. Andy specializes in developing data apps such as the Chicago Tribune Election Center, which tracks election data and news. Other notable work he’s done includes an interactive crime map for the City of Chicago, which tracks shootings and crime in the greater Chicagoland area, and a parole decision app for the Boston Globe that simulates the parole decision-making process for second-degree murderers.
We had a chance to speak with Andy over Skype and asked him about his experience being a developer, how he gets data, and the project that he’s most proud to have worked on. Here are some highlights from our conversation:
Have you ever had to compile your own data set for a project?
Pretty regularly, even as far back as college. I remember when I was doing a story on fire inspection reports for all the fraternities and sororities on campus to see how many times they’ve failed. I took written documents and turned them into an Excel spreadsheet to be able to say, “17% of them failed” or, “Here’s a total number of violations they had.”
Also, when I was at the Chicago Tribune, the City of Chicago did not give a full view of shooting data. At least when I was there, if seven people were shot in one shooting incident, they’d say one shooting occurred, whereas we cared more about how many people were shot. So we created a spreadsheet. To collect all that data, reporters had to go to the scene, talk to people, and find out how old the victims were, what their race was, where they were shot, and what hospital they were taken to. We had to create that ourselves to do data analysis on shootings because the police weren’t releasing that information.
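To make the incident-versus-victim distinction concrete, here is a minimal sketch of how a victim-level spreadsheet supports both counts. The rows, field names, and function names are hypothetical and invented for illustration; they are not the Tribune's actual data or code.

```python
from collections import Counter

# Hypothetical rows, one per person shot (not real data).
# An official tally that counts incidents would see only two rows' worth here.
victims = [
    {"incident_id": "A", "age": 19, "hospital": "Hospital 1"},
    {"incident_id": "A", "age": 24, "hospital": "Hospital 1"},
    {"incident_id": "A", "age": 31, "hospital": "Hospital 2"},
    {"incident_id": "B", "age": 45, "hospital": "Hospital 2"},
]


def count_incidents(rows):
    """Count distinct shooting incidents (the incident-level view)."""
    return len({row["incident_id"] for row in rows})


def count_victims(rows):
    """Count people shot (the victim-level view the reporters wanted)."""
    return len(rows)


def victims_per_incident(rows):
    """Map each incident to the number of people shot in it."""
    return Counter(row["incident_id"] for row in rows)


print(count_incidents(victims))  # 2
print(count_victims(victims))    # 4
```

Keeping one row per victim means the incident count can always be derived later, while the reverse is not possible from incident-level data alone.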
As a news app developer and programmer, how do you incorporate data into stories?
I don’t really work on the more editorial side of journalism now. But when I did that more often, I always tried to view [using data] as an extra tool to help figure out story ideas, as opposed to the end-all and be-all. I need to be part of the conversation because I’m crunching numbers to add extra components. A term that I like to use is “investigative paragraphs”, where you’ll have a single paragraph where it’s like, “We’ve crunched all these numbers and this is what we’ve figured out”, in this one paragraph.
When it comes to data visualizations, I think the problem is that sometimes at some news organizations they’re just like, “Oh, let’s add a graph!” or, “Let’s make a map!” You should be creating data visualizations that have a purpose or could tell a different story than you can with prose.
When building interactive data apps, what considerations do you have when thinking about how best to present the data?
The first thing always is, is it accurate? Are we taking the data and giving an accurate representation of what the data is telling us? Are we also pointing out the caveats of when the data is flawed?
Like when we’re using crime data: crime is notoriously underreported. Let’s say you have an entire community of people who do not trust the police and have a culture of not snitching. Then all of a sudden they’re like, “So crime’s down in this neighborhood”. When people use crime data to say either crime is up or down, is crime down or are people so scared of getting arrested that they’re not reporting crime? That can skew crime [data] in a different way.
Most police departments follow the FBI’s Uniform Crime Reporting (UCR) program. The basic idea is that, say, if a guy breaks into a house, steals a bunch of money, and then lights the house on fire, according to the FBI crime statistics one crime occurred and it was arson, because they go by the highest level of crime, [or] the highest level of felony. So if three misdemeanors occur, what they report to the FBI is the highest-level misdemeanor. It’s pretty standard for police departments to muck with the data a bit. So it’s not a real representation of what the crime data actually is.
There’s ways of making crime data seem less accurate. That’s something you always need to be aware of when dealing with any data – what are the limitations of what I’m working with?
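The highest-offense-only counting Andy describes can be sketched in a few lines. This is an illustration under stated assumptions: the `SEVERITY` ranking simply mirrors the example from the interview and is not the FBI's actual offense ordering, and the incidents are invented.

```python
from collections import Counter

# Hypothetical severity ranking (higher = more serious), taken from the
# interview's example only. It is NOT the official FBI ordering.
SEVERITY = {"theft": 1, "burglary": 2, "arson": 3}

# Invented incidents, each listing every offense that occurred.
incidents = [
    ["burglary", "theft", "arson"],
    ["theft"],
]


def reported_offense(offenses):
    """Keep only the most serious offense in a single incident."""
    return max(offenses, key=SEVERITY.__getitem__)


def summarize(all_incidents):
    """Return (reported counts, actual counts) for a list of incidents."""
    reported = Counter(reported_offense(o) for o in all_incidents)
    actual = Counter(o for offenses in all_incidents for o in offenses)
    return reported, actual


reported, actual = summarize(incidents)
print(sum(reported.values()))  # 2 offenses reach the summary statistics
print(sum(actual.values()))    # 4 offenses actually occurred
```

In this toy case the summary shows no burglary at all, which is the kind of limitation worth stating next to any chart built on such counts.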
What are the common sources that you get your data from?
A lot of times reporters have to file records requests. At the federal level there’s the Federal FOIA which you file to get data. States [and] sometimes even local cities have their own rules and regulations, but usually it’s state-run. You file a records request to get data, and usually that’s a really long fight where you go back and forth trying to get the data you want in the format you want it in. It doesn’t always work out for you.
Another way involves you going out into the community to get data. One of my favorite stories was in St. Petersburg, Florida. They passed some local ordinance that was [about] buildings that use public money for heating and cooling. They can’t be below a certain temperature because they didn’t want to waste a lot of money on cooling. So they just sent a reporter to a bunch of these buildings with a thermometer and they were able to go, “Well, this building is 68 degrees, this building is 71”.
A bunch of places are making their data more open. The City of Chicago has their data portal. But regardless of that, it’s sometimes just put out there without any context. So you have this huge chunk of data and you’re like, “I don’t know what these fields actually mean”. And if you don’t have what’s called a “data dictionary”, and can’t talk to someone about what these fields actually mean, then what can you do with it?
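As a rough illustration of why a data dictionary matters, the sketch below checks a downloaded file's columns against the field descriptions you have been given and lists the ones nobody has explained. The CSV content, column names, and descriptions are all hypothetical; they do not come from the City of Chicago data portal.

```python
import csv
import io

# Hypothetical export with cryptic column names (invented for illustration).
RAW_CSV = """id,dt,cd,bt
1001,2013-06-01,0486,1134
1002,2013-06-02,0820,0312
"""

# Hypothetical data dictionary: field name -> plain-language meaning.
DATA_DICTIONARY = {
    "id": "Unique record identifier",
    "dt": "Date the incident was reported",
    "cd": "Offense code assigned by the agency",
}


def undocumented_fields(csv_text, dictionary):
    """Return column names in the CSV that the data dictionary does not explain."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [name for name in reader.fieldnames if name not in dictionary]


print(undocumented_fields(RAW_CSV, DATA_DICTIONARY))  # ['bt']
```

Any field that shows up in that list is one to ask the agency about before using it in an analysis.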
What project have you been most proud of or found the most worthwhile?
This was years ago, but it was actually a project that I worked on that was cited in the 2012 Pulitzer Prize for Breaking News. In April 2011, a bunch of tornadoes hit Alabama, did a ton of damage, and killed hundreds of people. I helped build some websites that were super simple, basically like, “Are you OK?” or “Are you looking for someone?” And then they ran that stuff in print, which helped them find people and helped the rescue operations actually look for where these people were. It helped people know, “Oh we don’t need to look for this guy, because he’s written in to say that he’s OK”, and they said that it helped save lives.
I also built, in this 24 hour timeframe, this website that was a memorial for all the victims, as they came up. This other paper in Alabama just had this list of their names and ages, and that’s all they could do online. I was like, “This is 2011 and the best you can do is a list? Like this hasn’t updated since World War I. That’s how we listed the names in World War I. We can do a little better now!”
Finally, we were able to use Google satellite data to show a before and after. So I could take one image and make a little slider to show, “Here’s where a church was, and here’s what it looks like after the tornado went through”. We were able to make stuff that helped to show you the level of the tragedy. We covered it from all these areas, from helping to informing. I’ve always been pretty proud of that.
Do you publish your data or methodology, and if so, how?
When I worked at the Tribune, we always tried to have a link. And then usually, we’d write up something that would explain how we did it [the data analysis]. Sometimes we’d have the code online so you could literally walk through our step-by-step process to make sure we’re doing everything correctly.
There used to be a phrase in data journalism called “your nerd box” – it’s like, “here’s how we did this analysis.” And that would be something that was next to the print story that said, “We pulled data from here, and we matched it with this, and here’s how we got the results.” Now we can give you the data, walk you step-by-step, and ask you, “Please, duplicate our efforts. See if we’re not wrong.”
Sometimes we’ll also write up a blog post, going even more in depth into how we did stuff. I did that for a couple projects, not only with how we came up with design, but also the code that I wrote to run through 13 years of suburban Chicago crime data.
Have you ever had feedback from that, where people take the code and duplicate it?
Yeah, I don’t know if I had any findings where people have been like, “Hey, you messed up here”. But I know people that have. It’s rarely a major mistake, but it does help to shore up your project.
If you’re doing it behind closed doors, it’s like, “trust us”. Whereas, if you’re putting the data out there, you’re being very open about it and others can do the same analysis and come to the same conclusions that we did. So show your work, just like you used to have to do in algebra tests. You should be showing your work in journalism.
This interview is part of our Summer Data Journalism Series where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.