Alex Richards: On Data Tools, Challenges, and Being Skeptical
Alex Richards is an investigative data reporter at NerdWallet, a personal finance website that empowers people to make and manage financial decisions. His previous work at Investigative Reporters and Editors (IRE) and at publications like the Chicago Tribune and the Las Vegas Sun has earned him accolades, including a Pulitzer Prize nomination in the local reporting category for his reporting on Las Vegas hospitals whitewashing life-threatening complications, and a shared FOI Medal from IRE for his more recent work on a truancy epidemic in Chicago.
We (Nate, Eileen, and Isabelle) had the opportunity to speak with Alex and covered a range of topics from what he looks out for when seeing a data set for the first time, to the roots of his story for the Las Vegas Sun, “No Harm Done”.
Here are some highlights from our fascinating conversation:
What is data journalism?
I think at its core, data journalism is about finding angles and information, and sort of being a tour guide for your audience about what’s important and what might not have been done. It may take on a more narrative form, sometimes it’s more visual in nature, but I think that’s what it all boils down to.
Data tools that Alex uses the most
The one program I always come back to is Excel. It’s a really flexible canvas for basic analysis work, whether you’re trying to trace trends over time or generate some summary statistics on a dataset. Excel is a really great “go-to”, especially with small data, which, as journalists, is usually what we’re dealing with.
For the bigger tasks, I use MySQL a lot. Database managers are great; you can do a lot of work joining different kinds of information. They can also do many of the same things Excel can, but when you’re dealing with hundreds of thousands or millions of records.
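As a sketch of the kind of join-and-summarize work a database manager handles, here is a minimal example using Python’s built-in sqlite3 as a stand-in for MySQL. The table names, columns, and data are hypothetical, not from Alex’s actual reporting:

```python
import sqlite3

# In-memory database as a stand-in for MySQL; tables and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE discharges (hospital_id INTEGER, diagnosis TEXT)")
conn.executemany("INSERT INTO hospitals VALUES (?, ?)",
                 [(1, "General"), (2, "Memorial")])
conn.executemany("INSERT INTO discharges VALUES (?, ?)",
                 [(1, "MRSA"), (1, "fracture"), (2, "MRSA")])

# Join the two tables and count discharges per hospital -- the kind of
# summary that scales to millions of records in a real database.
rows = conn.execute("""
    SELECT h.name, COUNT(*) AS n
    FROM discharges d
    JOIN hospitals h ON h.id = d.hospital_id
    GROUP BY h.name
    ORDER BY h.name
""").fetchall()
print(rows)  # [('General', 2), ('Memorial', 1)]
```

The same `JOIN ... GROUP BY` pattern carries over almost unchanged to MySQL; only the connection setup differs.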
On frustrating data issues – dealing with PDFs and geocoding
So one of the big ones to this day is the whole PDF issue. How do you unlock tabular information that sits inside a PDF document?
The only people who have really hit on a way to get that [information] out in any free sense is a group that’s put together an open source tool in Java called Tabula. It uses an algorithm to detect vertical and horizontal lines and extract tabular information from documents. It only works if there’s actually embedded text; it’s not set up to deal with documents that need to be OCRed.
As far as free tools go [for OCR], there’s not really anything out there. You can shell out a couple hundred bucks for something that does the job, but that’s not always possible for smaller news organisations.
There’s another issue with geocoding. We sometimes need to geocode records: you might have hundreds of thousands of addresses and not have the corresponding latitudes and longitudes. A lot of services that can do the geocoding have a cap at 2,500 or 5,000 records per day, so if you have half a million, you can’t really spend 100 days sending that piece by piece.
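The arithmetic behind that daily cap can be sketched in a few lines. The batching helper below is illustrative and not tied to any particular geocoding service; the addresses are generated placeholders:

```python
import math

def daily_batches(addresses, cap_per_day=2500):
    """Split a list of addresses into per-day batches under a service's cap."""
    return [addresses[i:i + cap_per_day]
            for i in range(0, len(addresses), cap_per_day)]

# Hypothetical workload: half a million addresses at 5,000 per day.
addresses = [f"{n} Main St" for n in range(500_000)]
batches = daily_batches(addresses, cap_per_day=5000)

print(len(batches))                # 100 -- days of requests at the cap
print(math.ceil(500_000 / 5000))   # 100 -- same answer, directly
```

At 2,500 per day the same workload would take 200 days, which is why capped services are impractical at this scale.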
Common data sources
The City of Chicago’s [open data portal] is actually pretty good compared to other places. I’d say the vast majority of cities and counties don’t have anything like that platform.
I used to work at the Chronicle of Higher Education and we made major use of IPEDS there, which is one of those central data repositories for information on colleges and universities.
The data-driven roots of “No Harm Done” in the Las Vegas Sun
We had negotiated access to administrative hospital data that spanned ten years and it had a lot of information in it about the types of inpatients these hospitals were seeing, what they were diagnosed with, what kind of treatments were done, what happened before they were discharged, etc.
But we didn’t know what we would find, so a lot of time was spent just doing some basic analysis seeing what we could come up with.
I was working with a guy who had been covering healthcare for quite a while, and he brought up that there had been some discussion at the state level about certain types of events involving Medicare patients that hospitals are no longer reimbursed for: things that happen in a healthcare setting, like people who get stuff left inside them or end up with a MRSA infection.
From there we started to focus on these at-risk events, taking the administrative data, doing these analyses, and then comparing the results to what the hospitals were reporting to the state. They’re required to report these things when they happen, in theory at least, and they were reporting just a fraction, a very, very small fraction of that. So it was looking at that comparison and thinking, “Oh, I think we might really have something here”.
First tasks when looking at a data set
I try to figure out what’s wrong with it. Because there’s always something wrong with it. Sometimes it’s small, sometimes it’s a huge problem.
I’m typically looking for a couple of things: Is there information missing? Are there things that just don’t pass any common sense test? Like any crazy outliers that are just too insane to be believed. I’m trying to figure out if the data is an accurate representation of reality, which is not always the case. Sometimes, it’s because people hand in pieces of information from paper into computer systems, and obviously mistakes are made. So before we start summarizing and running off with these incredible findings, we have to make sure the data is essentially real.
Alex’s advice to data journalism students
The number one piece of advice, when it comes to data journalism, I would say is: Don’t make any assumptions about what you have in front of you.
Data is quite good at the “What”, but it’s never going to get at the “Why”. So it definitely can’t live in isolation, and it’s complementary to the rest of the traditional reporting process. It’s either going to build the foundation for a story or create some context for a story, it’s never going to be the whole story.
This is the first interview in our Summer Data Journalism Series where we speak with data journalists based in Chicago and beyond about their work and challenges with data. The interview has been edited for clarity and length.