Are Data Visualizations Trustworthy? Check the Data.
It’s easier than ever to make your own data visualizations these days. Countless new charts are being posted to the internet every day, made with a growing number of tools and openly available data sets.
However, for every well-researched map, there are many more incorrect graphics. The data viz boom has created all sorts of pitfalls, from intentionally misleading axes to lazily filtered statistics. And it’s not always easy to tell when you’re being duped.
Fortunately, it’s also easier than ever to act on your suspicions. No, you can’t always tweet at the creator of a graph and ask to see their work. But what you can do is access an openly available data set, make your own visualization, and then compare the two graphs.
Ultimately, the presence of open data gives users a fighting chance against not just the media giants, but also the sheer amount of incorrect information on the Web. The opportunity is there for anyone to take a publicly accessible dataset, like the Census data cited as the source of a map, and make their own chart. What exactly do I mean? Let’s take a closer look.
Our story begins November 11, 2014, when the Washington Post (yes, that Washington Post) decided to run a map detailing gender breakdowns by state:
This map is problematic. Here are two of the more obvious issues:
- Ambiguous Numbers. What do those figures for North Dakota and Rhode Island represent? Based on the colors, you’d probably say they stand for % of men and % of women, respectively, and you’d be correct. But given the juxtaposition, those percentages look like two parts of a whole, so it’s rather confusing that they don’t add up to 100%. It also doesn’t help that the map gives percentages for only two states.
- Misleading Scale. All pink states are the same color, regardless of whether the state is 55% female or 75% female.
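To make the scale problem concrete, here is a minimal Python sketch contrasting a two-color scheme with a graded one. The function names, the color labels, and the 5-point display range are assumptions made for illustration, and the percentages are hypothetical rather than Census figures.

```python
def binary_color(pct_female: float) -> str:
    """Two-bucket scheme: any majority, however slim, gets a solid color."""
    return "pink" if pct_female > 50.0 else "blue"


def graded_shade(pct_female: float, center: float = 50.0,
                 half_range: float = 5.0) -> float:
    """Diverging scheme: signed intensity in [-1, 1], with 0 at the center.

    half_range is an assumed display range (center +/- 5 points);
    values beyond it are clamped to full intensity.
    """
    offset = (pct_female - center) / half_range
    return max(-1.0, min(1.0, offset))


# Hypothetical percentages, not Census figures.
for pct in (49.0, 50.5, 52.0):
    print(pct, binary_color(pct), round(graded_shade(pct), 2))
```

Under the two-color scheme, 50.5% and 52% are indistinguishable. The graded scheme gives them different intensities, which is the information a reader needs when every value sits close to 50%.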
Note that we haven’t even considered yet whether the data is accurate. But we can act on our suspicions with open data, and that’s exactly what one user in Reddit’s Data is Beautiful community did.
WaPo listed their source as the 2013 Census, so /u/onejoey was able to download the relevant dataset and make his own map. Using the source data, he was also able to identify factual errors in the article that accompanied the Post’s map. He talks about his experiences in his post on the issue.
Onejoey was able to clean up the Census data and publish an accurate version of WaPo’s map. He shared it to the Data is Beautiful subreddit, and it became the 20th-highest-rated submission in a community of 3.5 million users. WaPo soon took notice, and (eventually) updated their map to the accurate version.
Below is the new and improved graphic. Much better, right?
Right away, I noticed this map answers many questions we didn’t even think to ask originally. The numbers for many states were off in the original map, not just Alaska. Onejoey also points out the accompanying article erroneously lists Oklahoma and Maine as among the Top 5 Female-Dominated States.
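A claim like a "Top 5" list is easy to check once you have the raw counts. The sketch below shows one way to do it in Python; the region names, the counts, and the "claimed" list are placeholders invented for illustration, not Census values or anything from the Post's article.

```python
def percent_female(male: int, female: int) -> float:
    """Share of women in the total population, as a percentage."""
    return 100.0 * female / (male + female)


def top_n_by_percent_female(counts: dict[str, tuple[int, int]],
                            n: int = 5) -> list[str]:
    """Return the n region names with the highest percentage of women."""
    ranked = sorted(counts,
                    key=lambda name: percent_female(*counts[name]),
                    reverse=True)
    return ranked[:n]


# Placeholder regions with (male, female) counts; not real Census values.
counts = {
    "Region A": (480, 520),
    "Region B": (510, 490),
    "Region C": (495, 505),
}

top_two = top_n_by_percent_female(counts, n=2)
claimed = ["Region A", "Region B"]  # a hypothetical published claim
unsupported = [name for name in claimed if name not in top_two]
print(top_two, unsupported)
```

Any name that lands in `unsupported` is one the source data does not back up, which is the kind of discrepancy Onejoey flagged.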
In addition, all of the values are right around 50%, and this map shades them accordingly. The original map’s information wasn’t terribly useful, since all the male and female states used solid colors. Further, the new map uses one consistent metric, the percentage of women, and lists it on every state.
Most importantly, Onejoey included the source link to the dataset so that others in the community could access the data he used and create alternate maps, just as he did with WaPo’s. Below is an example where one user broke it down by county:
Community collaboration on data science, be it cleaning up a clunky CSV or fixing a faulty map, benefits the entire community and presents more refined and professional products to the rest of the Internet. Not only that, the openly accessible dataset allowed Onejoey to confirm his suspicions about the map’s accuracy. If it weren’t possible to access the exact same dataset that WaPo did, his map wouldn’t exist and we’d never see how cool it is.
Here’s a snapshot of a relevant comment from /u/from_dust, in this same WaPo thread, that sums up exactly what I mean:
Not all datasets are trustworthy, but before the availability of open data, we were next to powerless to do anything about it. Thanks to the rise of publicly available information, the average Joe, an individual dataviz geek, and the New York Times are on an even playing field–and this benefits all lovers of data.