The Rise of the Y-Axis-Zero Fundamentalists

On Friday, I read a Natalie Kitroeff Businessweek.com story on the declining appeal of law school, and was so struck by this chart that I shared it on Twitter:

[Chart: law schools]

The chart tells a dramatic story: all the gains in law school enrollment since the mid-1970s have been wiped out in just three years. Twitter responded to that drama with lots of retweets and favorites — but also with lots of disapproving remarks like this:

And this:

There were many, many more responses like that. A couple of them wielded the name of Edward Tufte, today’s leading authority on the visual presentation of data. Which is interesting, because after about five seconds of Googling I found Tufte’s actual views on the practice:

In general, in a time-series, use a baseline that shows the data not the zero point. If the zero point reasonably occurs in plotting the data, fine. But don’t spend a lot of empty vertical space trying to reach down to the zero point at the cost of hiding what is going on in the data line itself. (The book, How to Lie With Statistics, is wrong on this point.)

For examples, all over the place, of absent zero points in time-series, take a look at any major scientific research publication. The scientists want to show their data, not zero.

The urge to contextualize the data is a good one, but context does not come from empty vertical space reaching down to zero, a number which does not even occur in a good many data sets. Instead, for context, show more data horizontally!

Thanks to one of the offended responders on Twitter, Abhinav Agarwal, we can see what the Businessweek.com chart would have looked like with a zero base:

I love that he went to the effort to make that (thanks, Abhinav!) but … it is less informative than the original chart. Yes, in the new version it’s now crystal clear that law-school enrollment hasn’t gone to zero. But who looked at the original chart and thought it had? (Well, this guy says he did, but I think he’s kidding.) And the contrast between the herky-jerky rise of the past four decades and the straight-line drop since 2010 is much less clear in the zero-base chart. It hides the precipitousness of law schools’ change in fortunes.

Such arguments seem to carry little weight, though, among the legions of what BuzzFeed’s Matthew Zeitlin has dubbed y-axis-zero fundamentalists. I had somehow missed out on their rise, I guess because all of my HBR time-series charts over the past few years have for various reasons (the main one being that my Excel skills are so limited that I don’t know how to truncate the axes) featured y-axes that go to zero. But apparently now this is a thing. The Huffington Post’s Ben Walsh reported a similar experience with a recent (non-zero-based) chart on taxi medallions in New York. According to Walsh, “all the responses were like ‘rule violated. i refuse to consider your thesis’.”

When I checked the Twitter bios of the people who objected to the Businessweek.com chart, most of them were software programmers, so I wondered if it was some weird coder obsession. It might be, but a simpler explanation is that prominent programmer Jeff Atwood had retweeted it to his 152,000 followers.

Instead, I think it’s mainly just that more and more people have acquired some amount of statistical literacy, and have learned along the way that not basing your y-axis at zero can be misleading. As Duke sociology professor (and believer in non-zero-based charts) Kieran Healy tweeted when I asked him where he thought the reaction came from:

“Narrow axes can make small and inconsequential changes seem big,” Healy went on, “but—symmetrically—zero-axes can make big and real changes seem small. What matters isn’t some iron rule like ‘Always have a zero-base axis!’, it’s your prior commitment to being honest with the data.”
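Healy’s symmetry is easy to check with arithmetic. The sketch below is not from the original post: the function name and the axis limits are hypothetical, and the enrollment figures are the rounded approximations quoted by a commenter further down (about 52,000 in 2010 and about 40,000 in 2013). It computes how much of the plot’s height a given change occupies under a chosen y-axis range.

```python
def fraction_of_plot_height(start, end, y_min, y_max):
    """Return the share of the vertical axis span covered by the move from start to end."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return abs(end - start) / (y_max - y_min)


# Rounded, approximate figures; the axis limits are hypothetical.
peak_2010, level_2013 = 52_000, 40_000

zero_based = fraction_of_plot_height(peak_2010, level_2013, y_min=0, y_max=55_000)
truncated = fraction_of_plot_height(peak_2010, level_2013, y_min=38_000, y_max=54_000)

print(f"zero-based axis: drop spans {zero_based:.0%} of the plot height")
print(f"truncated axis:  drop spans {truncated:.0%} of the plot height")
```

Under these assumed limits, the same 12,000-student drop fills roughly a fifth of the zero-based plot and three quarters of the truncated one. Neither share is the “correct” one, which is Healy’s point: the choice of range is an editorial decision, not a rule to be looked up.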

It is easy enough to find examples of people using broken y axes to mislead. From a Media Matters compendium of Fox News chart outrages:

[Chart: fbn-cavuto-20120731-bushexpire]

This isn’t much of a time series, and I really can’t think of any good reason why the y-axis on a bar chart shouldn’t go to zero. But more important than any simple rule is that this chart was obviously crafted to deceive — there’s really no other reason to draw the chart this way.

The Businessweek.com chart, on the other hand, was crafted to show the data as fully as possible. Facebook “data visualization guru” Andy Kriebel recommends adding a note to any non-zero-based-y-axis chart explaining why you didn’t base it at zero. That’s not a bad idea, but I also think the overwhelming majority of those who read a chart like this one online (as opposed to those who see a chart flitting by on the TV screen) are able to figure out what’s going on. I love that so many people online are on the lookout for dodgy charts. But focusing on the data isn’t really dodgy.

Update: My brilliant colleague Scott Berinato, who is working on a book on data visualization for the HBR Press and created the cool Vision Statement “How to Lie with Charts” in the December issue of HBR, emailed me with his thoughts, which I don’t entirely agree with but seemed worth sharing given that he knows more than I do:

I have to agree with them about the Y axis. Not because it should be a hard and fast rule but because of the metaphor problem. Our brains create 0 when your line begins or ends at the bottom — a metaphorical zero as in “no one is going to law school because the line’s at the bottom.” This is exacerbated by the headline “Empty Classrooms,” which creates a textual cue that “empty” is what matters. 

There’s also the slope problem. Tufte is right and wrong. He’s right about “just show the data,” but a truncated axis doesn’t actually show the data. The data is not the line; the line divides space that represents the proportion of a (those enrolling) and b (those not). So by truncating the axis we not only create a more severe-looking slope, we literally hide representative space, and more on one side than the other.

Having said all that, this kind of thing is rampant, because of web design. This chart would be very tall otherwise. So we have to think about the tradeoffs. My developing sense for these situations is to go even simpler. The data that matters here is:

’74: low

’74-’10: Steady, rolling climb.

’10-’13: precipitous fall off

In theory we could build this same chart with three data points (’74, ’10, ’13) unless those three small humps on the climb matter to the story, which I don’t think they do. Basically start with as few data points as possible then add as necessary. Don’t even connect the lines necessarily; use points.

35 thoughts on “The Rise of the Y-Axis-Zero Fundamentalists”

  1. Yeah…charting fundamentalists… ggplot, which is probably the best charting package for the R statistical package, disallows separate left-right axes (without crazy workarounds). You ask how to do it, the creator says it’s a bad practice. I know why it can be misleading, but I can read an axis, and sometimes I just want to do a quick and dirty comparison of how two things move together. It’s like grammar nazis, the only way a language can evolve is by having a little flexibility…if people want to break rules, it’s for a reason, and if something is awkward or unclear, evolve new rules to make it clear.

    (Kind of a metaphor for using R, if someone has a package to do what you need, it’s amazing, otherwise you’re in a world of s***, but I digress).

  2. I suggest reading Stephen Few’s books. He has a thoughtful breakdown of when a zero base is or isn’t required.

    For example, bar charts use length to convey the message, so shortening the length artificially results in a different message. Note that the Fox chart is a bar chart.

  3. Another possible solution to this challenge is to create a zoom in, in which the full data with 0-y-axis is accompanied by a second chart that only focuses on the precipitous decline and truncates. This is often done ‘magnifying glass’ style with a line pointing to the area of the base chart you’re blowing up. That way you get both accurate and alarming while honoring the space that represents those enrolled, below the line. Love the post, Justin. Scott

  4. Mr. Fox,

    I may be mentally impaired from years of drug use as a young man, but I think you’re on safe ground not zeroing in. However it’s possible there’s a y-axis hiding under your bed waiting until you drift into sweet sleep, at which point it will bludgeon you with null values.

    Have a good night.

  5. Most people agree that the axis shouldn’t be broken in a bar chart, so I’ll not discuss them.

    In a line chart what matters is the slope, and you can change it by breaking the axis or by changing the aspect ratio. So, there is no “right” slope. That’s why you should always have at least two series, and read the chart by comparing the slopes.

    If you are using large numbers, change tends to be small (GDP, for example) and that forces you to break the axis to improve resolution. That’s what happens here, and it is not the right solution. The right solution is to display rates of change in a bar chart. You would have the best of both worlds: an unbroken axis and the precipitous decline.

    I’m not one of the “y axis zero fundamentalists”, but I believe that zero should be our default origin and that there must be a very good reason to break the axis. If you get a flat line you’re probably using the wrong measure.
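The “rates of change” suggestion in comment 5 can be sketched in a few lines. This is an illustration only: the function name is hypothetical and the series is made up, not the ABA enrollment data.

```python
def yearly_rates_of_change(values):
    """Return the percent change from each value to the next."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(values, values[1:])]


# Hypothetical enrollment series, not the ABA data.
enrollment = [50_000, 52_000, 48_000, 44_000, 40_000]
for rate in yearly_rates_of_change(enrollment):
    print(f"{rate:+.1f}%")
```

Because the rates are already relative, a bar chart of them sits on an axis that naturally includes zero, so nothing has to be truncated to make a run of declines visible.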

  6. I strongly disagree with the idea of “start with as few data points as possible then add as necessary.” Use all the data.

    I think it’s acceptable to make line charts (but not bar charts) that don’t start at zero, however, the more general rule is to not mislead the reader. I think some subtle tweaks could help this chart avoid doing that (I didn’t notice the truncated scale the first time I saw it). First, the line begins at the x-axis, implying that it starts from zero. Second, the vertical distance from the x-axis to 40k (those ticks are too precise, btw) is exactly the same as from 40k to 42k. If the difference was 1.5 times, then that would be a visual cue that something was different.

    There are more advanced ways to handle this, e.g. have the scale indicate peak values rather than regular intervals, but if the x-axis had been placed a quarter inch further down we probably wouldn’t be having this discussion.

    • This was exactly my thought. The horizontal axis is a dark line, and the plotted data starts exactly on this line. Further, the vertical axis is not labeled at its minimum, so it looks like the convention of using a hyphen rather than an actual zero (fine in a worksheet, but a bad practice on a chart). Fix these, and confusion is minimized.

      • I really like the removal/diminishment of the horizontal axis as a way to remove one of those metaphorical cues. But I would stick to my idea of showing just a few points, here’s why: The story we’re trying to tell is exactly as Justin describes: a precipitous drop off in the last three years has erased all the gains since 1974. So then I ask myself if I need data from 1987, 88, 89 etc. to tell that story. Probably not. More and more I find myself not starting with the data I have and plotting all of it but first asking, what am I trying to say and seeing how simply I can say it. No right or wrong here but lots of possible approaches. Good ideas both of you.

      • This is actually in reply to sberinato, whose comment I can’t reply to for some reason.

        I disagree with your use of only a few data points. By doing this you’ve decided what interpretation should be put on the data. Honest visualisations allow the viewer to interpret the message by making it clear. If there was some gain to leaving off the rest of the data you may have a point, but what does it cost you to have all of the data and therefore a more accurate chart? Not bandwidth, nor physical space within the graph. Nothing that I can think of.

        Put it all in and show the user that the increase between 1974 and 2010 was sporadic, not consistent. By choosing which data you put on the chart you apply your own priorities to the result.

      • I also disagree strongly with only showing three points. Do you then compress the X axis? That would just look like a line segment going up and another going down.

        Show the data you have. If it shows what you wanted to remove points to show, then you’re showing what you wanted to show. If it doesn’t, then removing points is probably lying.

  7. Another way to solve the issue is to normalize the data, either to a starting point like 1970 or to a mean. That is good because for organizations the percentage drop is far more significant than raw numbers. A lot of housing price data is presented that way, especially when aggregated.

    So
    1970 == 100 (39000)
    2011 == 134 (52200)
    2013 == 102 (40000)

    Unambiguously shows the drop of 32 index points (about 24 percent from the peak) in three years.

    • I like this as well, though for a broad magazine audience I wonder if adding a calculation helps or hinders? Housing data is done this way for people who use housing data all the time, that makes sense to me. Do most readers of Businessweek deal with normalized scales? Maybe, but it’s something to consider when making the decision.
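The indexing proposed in comment 7 amounts to dividing every value by the base-year value and multiplying by 100. The sketch below uses the rounded figures quoted in that comment, so the results may differ slightly from the commenter’s, which were presumably computed from the unrounded series; the function names are hypothetical.

```python
def index_to_base(values, base_value):
    """Rescale values so that base_value maps to 100."""
    return [100.0 * v / base_value for v in values]


def percent_drop(peak, trough):
    """Percentage decline from peak to trough, relative to the peak."""
    return 100.0 * (peak - trough) / peak


# Rounded figures quoted in the comment above; the real series may differ.
base, peak, latest = 39_000, 52_200, 40_000
indexed = index_to_base([base, peak, latest], base)
print([round(x, 1) for x in indexed])
print(f"drop in index points: {indexed[1] - indexed[2]:.1f}")
print(f"percent drop from peak: {percent_drop(peak, latest):.1f}%")
```

One thing the calculation makes visible: a fall measured in index points is a percentage of the base-year level, not of the peak, so it is larger than the percentage decline from the peak itself.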

  8. Of course you don’t need to slavishly scale the y axis from 0. The answer is – it depends. Relatively small changes in relatively large figures can mean a lot and these get lost when following the fundamentalist line. It really depends upon the magnitude of what you are trying to observe.

  9. I can’t seem to reply to Tim Smith and Jon Peltier directly but this is in response to them. My idea on the three points was to *start* there, and depending on your audience that may be enough. Or you may add more. If you are trying to show that “Enrollment rose steadily for four decades then dropped precipitously in three years” you can show that with just three data points. You may even opt for a simple table. If that is not enough to tell the story you want to, then you explore alternatives.

    You state “by doing this you’ve decided what interpretation to put on the data.” I agree with you. But that’s why we’re here, to put an interpretation on the data. Every visual abstraction is an interpretation of the data. It’s a matter of knowing your audience. So I ask myself, what are the consequences of my audience not seeing the undulating bumps between ’74 and ’10? Do they need to? Does it help tell my story? If it doesn’t I can simplify. In this case, thinking about a general news audience, it may help to include every year’s plot (no right answer here) but if you do, I think the axis has to be set to 0, possibly with a zoom in to create that downward slope that’s so dramatic, but which shouldn’t stand on its own.

    I also strongly disagree with the idea that you show the data you have, if some of that data isn’t needed to say what you want/need to say. I approach it as “show the data you have to.”

    Very good discussion here. Thanks.

    Scott

    • Is it possible that showing all of the data provides better context for the alarming drop at the end?

      Being able to see the data points between 1974 and 2010 reveals:
      1) the enrollment rate had been generally rising every decade
      2) the size and timing of the decrease from 2010-2013 is unprecedented (at least since ’74)
      3) The 2013 value is the lowest enrollment has been since 1976

      If you don’t show that data, and instead just show points for 1974, 2010, and 2013, you cannot draw these conclusions. If the data is:
      1974: ~38,000
      2010:~52,000
      2013:~40,000

      There could be multiple explanations for this without knowing what happened between 1974 and 2010. Two examples:
      1) maybe enrollment is always around 40k and 2010 was just an outlier or part of a series of outliers
      2) maybe enrollment is cyclical and every few years a giant drop in enrollment happens

      I’d argue that the data between 1974 and 2010 does aid the story because it illustrates how the last 3 years have been a dramatic departure from the historical trend.

      • @skokenes – I wrote and posted my reply before I saw yours. I think we’ve agreed on the same points regarding the omitted data points.

    • Depending on your audience, showing just three points can be a disaster. To me, a line chart with three points at 1974, 2010, and 2013 would look like the chart was created or the data collected in a very lazy fashion. Why such a large gap between the first two points? Was it really a nice steady rise followed by a faster steady decline?

      Showing all the points shows me that it was a more typical up and down behavior trending upwards, with every year accounted for and a resolution much smaller than the overall magnitude of the chart. The decline was much more sudden, with three straight years showing very similar declines.

      On another note, I prefer a line chart to actually show a marker for each discrete data point. This way I can see if there were any missing years, and if the author of the chart also committed the sin of smoothing the lines, I can see where the data came from that resulted in the smoothed line.

    • Scott, I think we have a fundamental difference in our approach and what we want to achieve. You seem to be taking a journalist’s approach by wanting to ‘tell your story’. That’s fair enough for journalists, but if you’re trying to present an objective chart and let the data tell the story, you’ll need to keep your own interpretation in check.

      skokenes has nicely articulated the reasons why three points don’t tell the whole story in this particular case, so I won’t repeat it.

      If you *start* with three data points and then put the rest of them in, we’re on the same page. If you start with three and leave it at three, you’ve told your story, not the story in the data.

  10. I agree with your overall point that whether or not to include zero on the y-axis depends on the goal of the visualization. If the story here is that all of the law-school enrollment gains since 1974 have been wiped out in two years, a relative scale is fine. But to assess the impact on law schools themselves (“Empty Classrooms!”) we need to see how large that change is in comparison to total enrollment. For that I think having a zeroed axis helps.

    You dismiss Abhinav’s chart because “it hides the precipitousness of law schools’ change in fortunes.” If you look closely at Abhinav’s chart, though, you’ll see that this is due to faulty plotting. The spatially equally-spaced x-axis points are separated by 5 years except for the last two, where the decline occurs. These are only separated by 2 and 1 years, artificially flattening the decline. Here’s a more accurate rendering: https://twitter.com/ebellm/status/545330343050366976/photo/1

    For those interested in the data itself, which is available back to 1947, it’s here: http://www.americanbar.org/content/dam/aba/administrative/legal_education_and_admissions_to_the_bar/statistics/jd_enrollment_1yr_total_gender.authcheckdam.pdf
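The plotting problem described in comment 10 (unevenly spaced years drawn at equal intervals) can be shown with a small calculation. The points below are made up for illustration, with five-year gaps followed by gaps of two years and one year; they are not the values from Abhinav’s chart, and the function names are hypothetical.

```python
def slopes_per_year(points):
    """Slope between consecutive (year, value) points, in value units per year."""
    return [(v2 - v1) / (y2 - y1) for (y1, v1), (y2, v2) in zip(points, points[1:])]


def slopes_per_slot(points):
    """Slope as drawn when every point gets an equal-width slot, ignoring the years."""
    return [v2 - v1 for (_, v1), (_, v2) in zip(points, points[1:])]


# Hypothetical points: five-year gaps, then gaps of two years and one year.
points = [(2000, 44_000), (2005, 48_000), (2010, 52_000), (2012, 44_000), (2013, 40_000)]
print(slopes_per_year(points))
print(slopes_per_slot(points))
```

With these made-up numbers, the decline is five times as steep as the rise when measured per year, but at most twice as steep when each point is given an equal slot on the x-axis. That is the flattening the commenter describes.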

    • Yes, I’d noticed the inaccuracies in Abhinav’s chart too, so thanks for taking the time to do an accurate version. Your chart shows the sharp decline clearly.

      Justin’s original chart exaggerates the decline, especially given that he starts the line on the bottom of the chart where there’s no label on the y-axis. To judge by his reaction on this page, he doesn’t like being asked to have his y-axis go to zero because it makes his chart look less dramatic. The decline is dramatic enough without having to be embellished in the way that Justin has.

      I agree that there are times when you don’t need to have your y-axis go all the way to zero. It’s just that this isn’t one of them. Calling people fundamentalists doesn’t change this.

      • Whoops, I’ve been inaccurate myself.

        Justin didn’t create the chart in question, so apologies to him for my accusation of embellishment.

        Nonetheless, I still think he dislikes the versions with a y-axis that goes to zero because they reduce the drama. He states that Abhinav’s chart ‘is less informative than the original chart’ but to me Eric B’s accurate version is clearly more informative than the original.

      • Yep, I agree that the full 1947-2013 version is the best.

        It’s interesting that the original author chose 1974 as the starting point. It conveniently starts just below the 2013 level and leaves out the equally precipitous increase between 1968 and 1971. Taken together with the ‘Empty Classrooms’ headline, it looks a little manipulative.

  11. No, maybe Scott’s right. If we cut out unimportant data points, we see that law school enrolment has grown dramatically, from 18,582 in 1947 to 39,675 in 2013, an amazing 113.5% increase.

    Uh, tongue-in-cheek, Scott.

  12. you can always truncate the axis on your graph using something like GIMP, simply select the x-axis then cut and paste it closer to your data along the y.

  13. “It hides the precipitousness of law schools’ change in fortunes.”

    LOL! That’s because there is no “precipitousness”. If you saw the second chart first, you would more accurately be telling us how the no-zero chart manufactured a false “precipitousness”.

    Doctor, heal thyself.

  14. Reblogged this on peakmemory and commented:
    This issue came up in a class I taught this Fall. The text book we used condemned the use of broken axis graphs. However, one of my students made a convincing case that, under some circumstances, you might be more interested in the fluctuations highlighted by the broken axis graphs, and, thus, justified in using them.

  15. It’s not a recent thing; it goes back to a classic that all should read, Darrell Huff’s _How to Lie With Statistics_, which came out in 1954.
