Daniel Hertz: Demographic Data and Slim Margins of Error
Daniel Hertz is a Senior Fellow at the City Observatory, a website and think tank devoted to data-driven analysis of cities and the policies that shape them. A recent graduate of the Harris School of Public Policy at the University of Chicago, Daniel writes extensively on urban issues in the City of Chicago. His original research and data visualizations have been cited by the Chicago Tribune and the Toronto Star, among others. Most recently, he’s launched The Chicago Dispatch, an online publication that publishes interviews, essays, and articles about the Windy City.
We sat down with Daniel to talk about his work on tracing demographic trends in Chicago, the pitfalls of margins of error when computing statistics, and his fascination with the city. Here are some highlights from our conversation:
How did you get started with data journalism?
I started writing and doing some freelancing several years ago and it really just began with questions I realized I could answer with data, especially publicly available data.
The first real thing that I did was about a little crime spike back in 2012 that was in the news. Something that frustrated me, at a superficial level, was that there were reports of this crime spike without the broader context [that] crime is down substantially over the last 20 years. I started thinking about it more and wondered: where is it down, when is it down? Is it down in the same way on the North Side as it is on the South and West Sides? And are people in South Shore and Austin experiencing that decline or not?
That seemed like a pretty obvious question to me. So I Googled around and couldn’t find anybody who’d written about it. I realized that I could probably just find the answers, so I went to the CPD annual reports, which they had on their website and which went back, I think, to the ’80s. They had both population and homicides per police district, so I could do per capita numbers. I ran the numbers [for Chicago] and, as you might expect, [the crime rate] has not declined evenly across the city.
It usually starts with a question. You realize that this is an answerable empirical question, and the numbers are publicly available, and nobody has put them together yet.
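The per capita calculation he describes is simple to reproduce. A minimal Python sketch, using made-up district figures rather than the actual CPD annual report data:

```python
# Made-up figures for illustration; the real numbers come from
# population and homicide counts in CPD annual reports.
districts = {
    # district name: (homicides, population)
    "District A": (60, 80_000),
    "District B": (5, 100_000),
}

# Expressing homicides per 100,000 residents makes districts of
# very different sizes directly comparable.
rates = {name: homicides / pop * 100_000
         for name, (homicides, pop) in districts.items()}

print(rates)
```

The same per-100,000 normalization is what lets one compare crime trends across neighborhoods, rather than just raw counts.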
Are there any tasks that you perform repeatedly or involve a lot of effort when you work with data?
Doing GIS joins is always sort of a pain, [such as] matching some geographic data in spreadsheets with different categories, or getting the shapefile and the data from two different places and they don’t quite have a good join column. You have to go through and manually change everything to make it join properly.
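Some of the manual cleanup he describes can be partly automated. A minimal pandas sketch, with made-up area names and column labels, that normalizes mismatched join keys before merging and flags the rows that still fail to match:

```python
import pandas as pd

# Made-up example: a shapefile attribute table and a stats spreadsheet
# that name the same areas slightly differently.
shapes = pd.DataFrame({"community": ["SOUTH SHORE", "AUSTIN", "HYDE PARK"]})
stats = pd.DataFrame({
    "Community Area": ["South Shore", " Austin", "Hyde Park"],
    "value": [10, 20, 30],
})

# Normalize case and stray whitespace in both join columns,
# instead of editing every row by hand.
shapes["key"] = shapes["community"].str.strip().str.upper()
stats["key"] = stats["Community Area"].str.strip().str.upper()

merged = shapes.merge(stats, on="key", how="left", indicator=True)

# Rows still marked "left_only" are the ones that need manual fixes.
unmatched = merged[merged["_merge"] == "left_only"]
```

The `indicator=True` flag is useful here because it surfaces exactly which geographies failed to join, rather than silently leaving gaps in the map.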
[For example], I’ve been thinking of trying to do something with schools recently on changing demographics of children in the city, and people going to private schools. It’d be useful to have in the data portals some school-specific things. CPS has all of this data on their website, but it’s complicated to attach that data to geo-tags or shapefiles. The shapefiles of school locations or attendance boundaries that are available on the data portal are usually only updated to a year or two ago. But in a city where things are constantly closing, attendance boundaries are changing, new charter schools are opening up, it’s really intimidating to make a map of school demographics and boundaries.
Data that has an obvious GIS component that is completely unlinked from the actual GIS data is a problem. It’s a hurdle in a lot of open data.
What are some of the data tools that you use in your work?
Census Reporter – it’s one of the only places I know of where you can seamlessly visualize the data before you download it into QGIS and figure out what’s wrong with it. So you can sort of do a first pass and be like, “Is there a story here? Is there something interesting here?” And then if you decide that there is, you can download the shapefiles with the data in it so you don’t have to do any joining, which is amazing.
The downside is that [Census Reporter] only has the most recent census, as opposed to something like Social Explorer, which has data going back forever.
What are some considerations you have when looking at time series data?
I think the biggest challenge with time series is whether they exist. How far back does the data go, publicly, or [available] at all? And secondly, especially because, again, I’m interested in neighborhood stuff, are the geographic units the same?
[For example], I would love to do an update of that crime analysis. The period I used was 2007 to 2011, because 2011 was the most recently available data when I did it. But in 2012 or 2013, CPD changed their district boundaries, and now there are 22 of them instead of 25. And so that means they’re not [directly] comparable.
A really great resource for getting around this issue is the Brown Longitudinal Tract Database, from Brown University. Tons of research about cities and neighborhoods happens at the Census tract level, but between 2000 and 2010, the Census changed the boundaries of the Census tracts. Brown basically reconciled the two geographies and created free, publicly downloadable census and tract files for the entire country with comparable geographies, so you can compare 2010 to 2000. But there are a lot of other data where [having that] would be really nice.
You’ve written and analyzed a lot of demographic and geographic data that reflect trends about segregation and desegregation. Is there anything in particular that you have to be cautious about when dealing with demographic data like this?
A lot of the time it’s what’s not there – and this is something that my boss is very big on. There’s so much data and because it is relatively new and there’s more stuff coming out all the time, there is so much low-hanging fruit. All the bandwidth in the conversation can be sucked up by processing the data that exists, but that means that we don’t talk as much about things for which data doesn’t exist or isn’t as clean.
So for example, something that we have really good data about is everything related to cars. We have really good data about traffic, traffic fatalities, all sorts of stuff. We have really bad data about pedestrians. We have really bad data about bikes. So we don’t write about it or talk about it. Look at what smart city programs are targeted at – they’re very heavily targeted at cars, and people just haven’t invested in getting data about other types of transportation in the same way.
Are there any issues with the demographic data itself?
Actually, that is one of the things – margins of error. People [do] not understand margins of error when using data, especially Census data, because it’s so easily accessible. A lot of well-meaning journalists or bloggers [use] Census data, especially at the tract level, where the margins of error swamp whatever differences they are ostensibly showing.
On Census Reporter, if you look at an indicator where the margin of error is more than 50% of the level, like when the median income is $50,000 but the margin of error is $25,000, it’ll put up a little mini icon somewhere, but not big enough! You know, it really should be big enough, like, “Do not use this. This is worthless, don’t use this.” And 50% is, like – I mean, frequently you’re talking about differences of 10% that you’re trying to show are meaningful, especially if you’re talking about change over time.
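The 50% threshold he mentions is easy to check yourself before trusting a tract-level number. A minimal pandas sketch, with made-up tract values and hypothetical column names (not actual ACS variable names):

```python
import pandas as pd

# Made-up tract-level table: an estimate column and its
# margin-of-error column.
df = pd.DataFrame({
    "tract": ["0101", "0102", "0103"],
    "median_income": [50_000, 62_000, 48_000],
    "median_income_moe": [25_000, 4_000, 30_000],
})

# Flag estimates whose margin of error is at least half the estimate,
# the rough "do not use this" threshold discussed above.
df["unreliable"] = df["median_income_moe"] >= 0.5 * df["median_income"]

suspect = df.loc[df["unreliable"], "tract"].tolist()
```

Running a screen like this first is a cheap way to avoid building a story on tract-level differences the data cannot actually support.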
So, the data is not infallible?
Yeah, the data is not infallible! Like, “Don’t use this.” I think that would actually be an invaluable service: “Don’t use this!”
This interview is part of our Summer Data Journalism Series, where we speak with data journalists based in Chicago and beyond about their work and their challenges with data. The interview has been edited for clarity and length.