Image: William Caxton showing his printing press to King Edward IV and Queen Elizabeth (public domain)
It’s a phrase that gets no respect: “nearly significant”. Horrified tweets, tittering, and all the rest – a remarkably large number of people are convinced that when someone finds P = 0.06 and utters the phrase “nearly significant”, it betrays that person’s complete lack of statistical knowledge. Or maybe of ethics. It’s not true, of course. It’s a perfectly reasonable philosophy to interpret P-values as continuous metrics of evidence* rather than as lines in the sand that are either crossed or not. But today I’m not concerned with the philosophical justification for the two interpretations of P values – if you want more about that, there’s my older post, or for a broader and much more authoritative treatment, there’s Deborah Mayo’s recent book (well worth reading for this and other reasons). Instead, I’m going to offer a non-philosophical explanation for how we came to think “nearly significant” is wrongheaded. I’m going to suggest that it has a lot to do with our continued reliance on a piece of 15th-century technology: the printing press.
I came to this hypothesis while teaching graduate biostats this semester. Literally “while” teaching it, I mean – I was in front of the classroom with a chunk of a t-table projected on the screen when I thought, Huh. Why wouldn’t we all think about P-values as lines in the sand, when we keep teaching (and being taught) critical values?
It’s simple, really. For decades, roughly from Fisher’s time until fairly recently, the way we thought about statistics (I claim) was shaped by two technological constraints. We didn’t have the computing power to calculate an exact P-value for each test (whether analytically, by randomization, or something else**). Instead, we were forced to use lookup tables, which had to be printed on paper (the 15th-century technology of this post’s title). And if the paper lookup tables weren’t to be absolutely enormous, they had to show a limited number of critical values.
Of course, neither constraint holds today. We use powerful computers to do nearly unlimited computations, and to hold nearly unlimited lookup tables. Either approach, or a combination, can give us exact P values for almost any situation. But those disdaining “nearly significant” aren’t interested in those exact values; instead, they prefer to simply compare them to an alpha (0.05, or 0.01, or something else) and declare them bigger, or smaller, and that’s the end of it. In other words, they simply transplant the critical-value approach from the test statistic to the P value. Isn’t that odd?
Except it really isn’t that odd. For generations, students have been taught to use tables of critical values. Those students have then become the teachers, and have taught their students to use tables of critical values. We don’t have to do that any more, but the habit is ingrained (“intaught”?) to the point where we don’t question it. And, in a curious inversion, the critical-value approach has been so entirely normalized*** that I think we assume that it must have some overwhelming philosophical/logical primacy. But it doesn’t. We may think, consciously or unconsciously, that it must – but really, our preference (under my hypothesis) is just an artefact of the 15th-century technology that shaped all those years of statistical teaching.
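Just to make the contrast concrete, here’s a minimal sketch of the two habits side by side (in Python with SciPy, with a made-up t statistic and degrees of freedom – my illustration, not anything from the teaching materials I mention above): the table-driven habit compares the test statistic to a tabulated critical value, while the alternative simply computes and reports the exact P value.

```python
from scipy import stats

t_stat = 2.02   # hypothetical t statistic from some two-sided test
df = 20         # hypothetical degrees of freedom
alpha = 0.05

# The old, table-driven habit: compare the statistic to a printed critical value.
t_crit = stats.t.ppf(1 - alpha / 2, df)   # what the paper t-table would have supplied
verdict = "significant" if abs(t_stat) > t_crit else "not significant"
print(f"critical value {t_crit:.3f}: {verdict}")

# The modern alternative: compute the exact two-sided P value and report it.
p_exact = 2 * stats.t.sf(abs(t_stat), df)
print(f"exact P value: {p_exact:.3f}")    # a bit above 0.05 here: "nearly significant", if you like
```

Either way, the arithmetic is trivial today; the only real difference is whether you throw the exact value away once you’ve compared it to alpha.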
Look, I’m not saying the legacy of the printing press is the only reason people adopt the absolutist, line-in-the-sand interpretation of P values over the continualist, strength-of-evidence interpretation. I’ve no doubt that there are folks out there who have made that decision on carefully reasoned philosophical grounds (or on practical ones, as in the application of statistics to process control). But I don’t think this can explain the dominance of absolutist interpretations. In my experience, some of the most vociferous disdainers of “nearly significant” seem largely unaware of the historical and philosophical literature around it. That means we have to look elsewhere for explanations; so I’m looking at you, printing press.
© Stephen Heard December 4, 2018
*Someone will tweet in horror at that, outraged that I could say such a thing and insisting that a P-value has nothing to do with evidence. They will do so without defining what they mean by “evidence”, and in blissful ignorance of plenty of statistical and philosophical work to the contrary. I know this, because it’s happened before.
**Fisher’s exact test is an exception, although it wasn’t widely used because calculation was possible only for a narrowly constrained range of study designs and datasets.
***There has got to be a distributional joke there. If you can come up with it, please let us know in the Replies.