*Image: William Caxton showing his printing press to King Edward IV and Queen Elizabeth (public domain)*

It’s a phrase that gets no respect: “nearly significant”. Horrified tweets, tittering, and all the rest – a remarkably large number of people are convinced that when someone finds *P* = 0.06 and utters the phrase “nearly significant”, it betrays that person’s complete lack of statistical knowledge. Or maybe of ethics. It’s not true, of course. It’s a perfectly reasonable philosophy to interpret *P* values as continuous metrics of evidence* rather than as lines in the sand that are either crossed or not. But today I’m not concerned with the philosophical justification for the two interpretations of *P* values – if you want more about that, there’s my older post, or for a broader and much more authoritative treatment, there’s Deborah Mayo’s recent book (well worth reading for this and other reasons). Instead, I’m going to offer a *non*-philosophical explanation for how we came to think “nearly significant” is wrongheaded. I’m going to suggest that it has a lot to do with our continued reliance on a piece of 15th-century technology: the printing press.

I came to this hypothesis while teaching graduate biostats this semester. Literally “while” teaching it, I mean – I was in front of the classroom with a chunk of a t-table projected on the screen when I thought, *Huh*. *Why wouldn’t we all think about P-values as lines in the sand, when we keep teaching (and being taught) critical values?*

It’s simple, really. For decades, roughly from Fisher’s time until fairly recently, the way we thought about statistics (I claim) has been influenced by two technological constraints. We didn’t have the computing power to calculate an exact P-value for each test (whether analytically, by randomization, or something else**). Instead, we were forced to use lookup tables, which had to be printed on paper (the 15th-century technology of this post’s title). And if the paper lookup tables weren’t to be absolutely enormous, they had to show a limited number of critical values.

Of course, neither constraint holds today. We use powerful computers to do nearly unlimited computations, and to hold nearly unlimited lookup tables. Either approach, or a combination, can give us exact *P* values for almost any situation. But those disdaining “nearly significant” aren’t interested in those exact values; instead, they prefer to simply compare them to an alpha (0.05, or 0.01, or something else) and declare them bigger, or smaller, and that’s the end of it. In other words, they simply transplant the critical-value approach from the test statistic to the *P* value. Isn’t that odd?
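To see just how little computation the table-free route demands, here’s a minimal sketch in Python (the data are invented purely for illustration): a randomization test of the kind footnoted above, which yields an exact-style *P* value with nothing but a shuffle and a loop – no printed table, no critical value in sight.

```python
# A minimal sketch of the randomization route to a P-value.
# The data below are hypothetical, invented for illustration only.
import random
from statistics import mean

random.seed(1)
control   = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.2]
treatment = [4.9, 5.3, 4.6, 5.5, 5.1, 4.8, 5.4]

observed = mean(treatment) - mean(control)   # the observed difference in means
pooled = control + treatment
n = len(control)

# Shuffle the group labels many times; the two-tailed P-value is the
# fraction of shuffles giving a difference at least as extreme as observed.
n_iter = 10_000
extreme = 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = mean(pooled[n:]) - mean(pooled[:n])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_iter
print(f"observed difference = {observed:.3f}, P = {p_value:.4f}")
```

For these well-separated groups the estimated *P* comes out well below 0.05; change one or two data points and it can drift up toward the line – at which point you can report the number itself rather than deciding which side of the sand it fell on.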

Except it really isn’t that odd. For generations, students have been taught to use tables of critical values. Those students have then become the teachers, and have taught *their* students to use tables of critical values. We don’t have to do that any more, but the habit is ingrained (“intaught”?) to the point where we don’t question it. And, in a curious inversion, the critical-value approach has been so entirely normalized*** that I think we assume that it must have some overwhelming philosophical/logical primacy. But it doesn’t. We may *think*, consciously or unconsciously, that it must – but really, our preference (under my hypothesis) is just an artefact of the 15th-century technology that shaped all those years of statistical teaching.

Look, I’m not saying the legacy of the printing press is the *only* reason people adopt the absolutist, line-in-the-sand interpretation of *P* values over the continualist, strength-of-evidence interpretation. I’ve no doubt that there are folks out there who have made that decision on carefully reasoned philosophical grounds (or on practical ones, as in the application of statistics to process control). But I don’t think this can explain the dominance of absolutist interpretations. In my experience, some of the most vociferous disdainers of “nearly significant” seem largely unaware of the historical and philosophical literature around it. That means we have to look elsewhere for explanations; so I’m looking at you, printing press.

*© Stephen Heard December 4, 2018*

*Someone will tweet in horror at that, outraged that I could say such a thing and insisting that a P-value has nothing to do with evidence. They will do so without defining what they mean by “evidence”, and in blissful ignorance of plenty of statistical and philosophical work to the contrary. I know this, because it’s happened before.

**Fisher’s exact test is an exception, although it wasn’t widely used because calculation was possible only for a narrowly constrained range of study designs and datasets.

***There has *got* to be a distributional joke there. If you can come up with it, please let us know in the Replies.

Macrobe: Excellent!


John Pastor: I am old enough to have learned statistics mostly by hand and with calculators (my old HP actually had a sum of squares button!), so this essay made me think of the philosophical implications of these critical tables. But I have also stopped using the word “significant” in almost all my papers during the past ten years. One reason is the magical thinking of critical values (P = 0.05 is good but P = 0.055 is not). Another reason is that when we use this word, we often neglect to mention the direction of the effect of x on y. So now I write things like: “Adding N strongly increased plant growth (P = 0.015) but adding P only weakly increased growth (P = 0.065)”.


ScientistSeesSquirrel (post author): John – I agree with you that overemphasis on “significant” has led people to pay too little attention to effect size and direction. But I’ll push back a bit against your solution. I think your phrasing tempts readers to fall into a different trap: confusing level of significance with strength of effect. The P-value is a metric of strength of evidence that the effect is *real*; it doesn’t have anything to do (directly) with whether the effect is strong or weak. Which you know, of course, and presumably you would report by *how much* N increased plant growth separately. If you don’t like “significance” for “evidence effect is real”, you could substitute “credibly” or something, but I’m not sure that’s a big help?


John Pastor: Not sure I like “credibly” in place of “significant” but I know I don’t like “significant” any more at all. Giving the magnitude of the response per unit of x is one good way to get at strength of effect. Usually I report an r² value as well as the P value, which also gets to one aspect of the strength of the effect (predictability of seeing a particular value of y given a value of x). I prefer regressions instead of Anova because they get at the direction and shape of an effect, and I usually design my experiments to use regressions (many treatment levels, fewer replications within a treatment) rather than Anova (fewer treatment levels but more reps). The good thing about this approach is that I can usually fall back on an Anova if the response is a complicated shape, but I can’t usually do a regression if I don’t have many treatment levels. I am also reminded of something I saw posted on a bulletin board in Sweden: ‘Most people think “significant” means “important”, but scientists are the only people who think “significant” means “not random”.’


David Hunt: Exactly. P-values are a combination of effect size and sample size. I always report AT LEAST the following: effect size, sample size or df, which test I used, and the resulting exact p-value.

Adding N clearly increased plant growth by 35.6% (t-test, n=14, p=0.013), but additional P had an unclear effect (14.1% increase, t-test, n=14, p = 0.069).

When you say unclear, etc… well, it’s a judgement call. If all your other p-values are <0.0001, that's different than when they are all in the same order of magnitude, just on one side of 0.05 or the other.


John Pastor: Thanks, David. Good advice.


Philippe Marchand: “It’s a perfectly reasonable philosophy to interpret P-values as continuous metrics of evidence rather than as lines in the sand that are either crossed or not.”

Yes, but the phrase “nearly significant” suggests they’re trying to have it both ways, i.e. set a line in the sand for what effect they care about, then say they still care about it because it’s only just outside the line.


ScientistSeesSquirrel (post author): Or, alternatively: that the hegemony of line-in-the-sand thinking forces us to use constructions like “nearly significant”. (I agree, something like “modestly supported” would be better.)



Andrew Stoehr (@andrew_stoehr): A friend of mine, when reading “it is approaching significance”, said “How do you know it’s not running away from significance?” 🙂


Pavel Dodonov: I don’t show critical-value tables in my stats classes. I try to focus on the distribution of the test statistic (i.e. showing a probability density function, not a table; or, alternatively, showing a histogram for a randomized test), and on reporting the exact p-value. And for a couple of months I’ve been talking about the different views, and about the fact that scientists don’t agree that p<0.05 or any line in the sand is needed. Yet many students are totally fixed on the p<0.05 idea; I think this has been taught to them in Genetics. Anyway. What are your thoughts on how to teach this?
