Is “nearly significant” ridiculous?

Graphic: Parasitoid emergence from aphids on peppers, as a function of soil fertilization. Analysis courtesy of Chandra Moffat (but data revisualized for clarity).

“Every time you say ‘trending towards significance’, a statistician somewhere trips and falls down.” This little joke came to me via Twitter last month. I won’t say who tweeted it, but they aren’t alone: similar swipes are very common. I’ve seen them from reviewers of papers, audiences of conference talks, faculty colleagues in lab meetings, and many others. The butt of the joke is usually someone who executes a statistical test, finds a P value slightly greater than 0.05, and has the temerity to say something about the trend anyway. Sometimes the related sin is declaring a P value much smaller than 0.05 “highly significant”. Either way, it’s a sin of committing statistics with nuance.

Why do people think the joke is funny? Because of course we all know that a result can’t be “nearly significant”, and a “trend toward significance” (as for Experiment 2 in the graph above) isn’t evidence of anything except the statistical ignorance of the person who mentions it. We set a significance criterion (often α = 0.05), we conduct our test, and whether or not our calculated P value reaches that criterion is all that we should interpret, or even notice. Right?

Well, not so fast. It’s true that a lot of people are taught this way – I was, and odds are good that you were too. But in fact, this is only one of two ways we can think about the magnitude of a P value (I’ll call these, a bit loosely, two “philosophies”, although this usage is likely to bother philosophers*). People who make jokes at the expense of “nearly significant” P values are adopting one philosophy, but seem unaware that there’s another one.

What are these two philosophies? I’ll call them the absolutist and continualist philosophies**. They differ in what kind of consideration they give the magnitude of a P value, and there’s a case for and against each one. Neither is ridiculous.

It’s the absolutist philosophy that’s described in my second paragraph, and that dominates our statistical teaching (at least in biology, and at least to undergraduates). The absolutist significance criterion is a line in the sand: your result stands on one side or the other, and that’s it. Once you’ve adopted this philosophy, it’s nonsensical to describe the degree of significance of a P value: one can’t be “nearly” or “barely” or “highly” significant, only significant or not. This has one very big advantage: it forces you to make a decision about the strength of evidence*** that you’re looking for before you have the results of an analysis in sight. As a result, there’s no temptation to be lenient with your pet hypothesis but stringent with something you’re skeptical of. In a sense, absolutist statisticians are like drivers who use cruise control: they want to make a careful decision on strictly rational grounds, and deliberately give that decision primacy over their instincts in any particular situation.

The continualist philosophy is quite different. A continualist would hold that the job of the P value is to express the strength of evidence against the null, and that this is a naturally continuous thing. It follows that drawing different conclusions from patterns with P = 0.0498 and P = 0.0507 is pretty silly (one might be forced to do so in the graph above, and no, I did not make that result up). Those two patterns are, after all, almost equally unlikely under the null. Not only that: P values, just like means and just like test statistics, are influenced by sampling uncertainty (even if we don’t conventionally put standard errors on them). So at the risk of getting too meta, it’s entirely likely that P = 0.0498 and P = 0.0507 are not significantly different from each other, leaving the absolutist philosophy nicely hoist by its own petard****. A continualist statistician would say that you can’t make statistical analysis a line in the sand, because if strength of evidence is continuous, our inferential conclusions should be too. (Bayesian statisticians and model selectionists would presumably agree, since continualist interpretation of the magnitude of Bayes factors and AIC values is conventional.)
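
If you want to see that sampling-uncertainty point concretely, here’s a minimal simulation sketch (the true effect, group size, and seed are numbers I made up for illustration; this is not the data behind the figure). Replicate “experiments” on the same underlying effect, with the same design, give P values scattered on both sides of 0.05:

```python
# Minimal sketch: the same true effect, sampled repeatedly with the same
# design, yields P values on both sides of the 0.05 line. All numbers here
# are illustrative assumptions, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n, reps = 0.5, 30, 1000   # assumed effect size and group size

p_values = np.array([
    stats.ttest_ind(rng.normal(0.0, 1.0, n),                  # "control"
                    rng.normal(true_effect, 1.0, n)).pvalue   # "treated"
    for _ in range(reps)
])

print(f"median P across replicates: {np.median(p_values):.3f}")
print(f"fraction of replicates with P < 0.05: {np.mean(p_values < 0.05):.2f}")
# With this design, roughly half the replicates fall below 0.05 and half
# above it, even though every replicate estimates exactly the same effect.
```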

Note, by the way, that the continualist position is not that anything goes – that any P value is worth getting excited about. A remark that something is “trending towards significance” at P = 0.4 deserves all the scorn it’s likely to get. Rather, we can recognize and distinguish between results that provide weak evidence, moderate evidence, and strong evidence against the null. We don’t have to consider all propositions either supported or rejected; some we merely lean towards.
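
To make the contrast concrete, here’s a toy sketch of the two readings applied to the same results. The graded labels and cut-points below are arbitrary choices made for illustration, not conventions anyone has agreed on:

```python
# Toy sketch contrasting a binary (absolutist) reading of a P value with a
# graded (continualist) description. The cut-points in continualist_reading()
# are illustrative assumptions only.

def absolutist_reading(p, alpha=0.05):
    """Decision fixed in advance: the result is significant or it is not."""
    return "significant" if p < alpha else "not significant"

def continualist_reading(p):
    """Graded description of evidence against the null (illustrative cut-points)."""
    if p < 0.001:
        return "strong evidence against the null"
    if p < 0.01:
        return "moderate evidence against the null"
    if p < 0.10:
        return "weak evidence against the null"
    return "little or no evidence against the null"

for p in (0.0498, 0.0507, 0.4):
    print(f"P = {p}: {absolutist_reading(p)} / {continualist_reading(p)}")
# P = 0.0498 and P = 0.0507 get opposite binary labels but the same graded
# description; P = 0.4 gets no enthusiasm under either reading.
```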

So there are two ways one can think about the interpretation of P-value magnitudes, and each has a sensible rationale behind it. It’s perfectly sensible to be a committed absolutist, and to defend that decision (and here I’m disagreeing with the blistering takedown by Hurlbert and Lombardi 2009). It’s equally sensible to be a committed continualist, and to defend that decision. What’s not sensible is to think that people who hold one philosophy or the other are ridiculous.

So: back to the joke I started with. I hope it’s clear by now why it isn’t funny: it’s based on ignorance. People making the joke are unaware that continualist interpretations of P value magnitudes are perfectly sensible, and have both a long history and distinguished proponents. If you don’t know this, and can’t explain the case for and against each philosophy, then you probably shouldn’t make jokes about either of them. Doing so achieves irony, not comedy. After all, what could be more ironic than displaying your own ignorance by poking fun at what you think is somebody else’s?

But I’ll end with a confession. I was taught absolutist inference, and once made those “nearly significant” jokes myself. I’ve learned, though, and I’ve stopped. Pass this post on to someone you love, and maybe they’ll stop too.

© Stephen Heard (sheard@unb.ca) November 16, 2015

Thanks to Deborah Mayo for comments on two early drafts. I expect she’ll still disagree with some of my treatment here, but nonetheless her comments greatly improved my post. See her excellent Error Statistics blog for much more on the logic and philosophy of inference.

*^Because, I think, philosophers would say these are not really “philosophies” in the sense of well-formed, logically built epistemological structures. A more accurate label might be “informal, but commonly adopted, opinions about how to proceed”. That’s a little unwieldy, though, so I’ll stick with “philosophies”.

**^The absolutist philosophy is often labelled “Neyman-Pearsonian”, and the continualist philosophy “Fisherian”, but if you read Fisher, Neyman, and Pearson carefully these turn out to be misleading and confusing names for them. For starters, Fisher clearly held a “Neyman-Pearsonian” position (at least early in his career) and Neyman and Pearson arguably held a “Fisherian” one (at least for science, although perhaps not for process control). Later in his career, Fisher became more “Fisherian”, although perhaps only because he was feuding with Neyman. Finally, Deborah Mayo argues that there’s little difference between the original positions of Fisher, Neyman, and Pearson, and so no case for naming different philosophies after them. The bottom line is that the names of famous dead statisticians seem to generate more heat than light, and this obscures the very real differences between contemporary scientists who teach and practice statistics using one philosophy or the other.  UPDATE: a new and relevant post from Deborah Mayo here, with more on the history of Fisher and Neyman’s thinking in part through the lens of Neyman’s first student, Erich Lehmann.

***^By “strength of evidence” I mean the degree of inconsistency of data with the null hypothesis. It’s important not to think of a P value as indicating (on its own) any degree of confirmation of, or consistency with, the null. It’s also important not to confuse strength of evidence for an effect with strength of an effect (the latter being measurable by a regression coefficient or similar statistic, not by a P value).
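
Here’s a minimal numerical sketch of that distinction (the slope, sample sizes, and seed are assumptions made up for illustration): the same underlying regression slope, estimated at two very different sample sizes, gives essentially the same effect estimate but wildly different P values.

```python
# Sketch: strength of effect (the slope) versus strength of evidence (the P
# value). The true slope, sample sizes, and seed are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_slope, reps = 0.2, 500   # assumed small effect; replicate to smooth noise

for n in (20, 2000):
    slopes, pvals = [], []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = true_slope * x + rng.normal(size=n)
        fit = stats.linregress(x, y)
        slopes.append(fit.slope)
        pvals.append(fit.pvalue)
    print(f"n = {n:4d}: median slope = {np.median(slopes):.2f}, "
          f"median P = {np.median(pvals):.2g}")
# The effect is about 0.2 at both sample sizes; only the strength of evidence
# for it changes, which is why a P value is not an effect-size measure.
```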

****^A petard is a mine (a small bomb) used to destroy fortifications. To be hoist by one’s own petard is to be blown up by one’s own weapon. The expression comes from Hamlet, in which the moody prince discovers that his schoolmates Rosencrantz and Guildenstern are carrying letters ordering his murder. Hamlet modifies the letters to order the murders of Rosencrantz and Guildenstern instead, and is (perhaps pardonably) quite proud of himself: “For ’tis the sport to have the engineer / Hoist with his own petard; and it shall go hard” (Hamlet 3:4 206-207, spelling modernized). So that’s petard. The similar word pedant is unrelated, and describes someone who thinks he should explain the meaning of petard in a blog post about statistics.

22 thoughts on “Is “nearly significant” ridiculous?”

    1. jeffollerton

      Yes, I’ve just had that conversation with a real statistician and he pointed out that the 0.05 cut-off was purely a function of ease of calculation in pre-computer days. So I think I’d disagree slightly with Steve that an absolutist position is defensible, because I’ve only ever seen 0.05 defended in this way, not 0.04 or 0.06, etc. etc., all of which are equally defensible as an absolutist position.

      The cruise control metaphor is an interesting one here. There’s a story (probably an urban myth) about a rich foreign visitor on his first visit to the USA who hired a Winnebago for a long road trip. He managed to write off the vehicle on its first day when he set the cruise control and then went into the back to make himself a coffee…

  1. Chris

    I agree that neither stance is ridiculous, and I do think that the continualist approach makes intuitive sense. However, the issue with strictly interpreting the P value as a measure of the strength of the relationship (or effect size) is that the P value is a composite of effect size and sample size. Thus, the decision of whether or not p = 0.055 is significant needs to consider sample size and effect size as well as the P value. Of course, I suspect that most scientists who prefer a continualist approach already use these multiple statistics in their inference of significance.
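
    Here is a quick simulation sketch of that composite nature (the effect sizes and sample sizes are made up for illustration, not taken from anywhere): two designs with a tenfold difference in effect size can produce very similar P values.

    ```python
    # Sketch: the P value confounds effect size and sample size. A large effect
    # measured on few observations and a tiny effect measured on many give
    # similar P values. All numbers are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    reps = 1000

    for effect, n in ((1.0, 15), (0.1, 1500)):   # (true effect, n per group)
        pvals = [stats.ttest_ind(rng.normal(0.0, 1.0, n),
                                 rng.normal(effect, 1.0, n)).pvalue
                 for _ in range(reps)]
        print(f"effect = {effect}, n per group = {n}: "
              f"median P = {np.median(pvals):.3f}")
    # Both designs give median P values of the same order (around 0.01) despite
    # a tenfold difference in effect size, so a P value on its own says little
    # about how big the effect is.
    ```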

  2. separatinghyperplanes

    Couple thoughts:
    First, mathematically speaking, the “continualist” view is the correct one. The P value is continuous. We’d have bias if we picked the range based on the result, but again that is easily averted by simply declaring our ranges in advance: declaring 0.1 > p > 0.05 as “marginally significant” in advance is at least as valid as picking the p < 0.05 threshold in advance.

    That said, there are times when we must function within the "absolutist" philosophy. Even though there's no discontinuity at p = 0.05, our policy decision is typically binary: we must decide to do the intervention or not, regardless of how fuzzy the evidence is. For example, a doctor needs to know whether to give a treatment to a patient or not – you cannot "almost treat" a patient because the evidence for that treatment is "almost significant." In that respect, a binary interpretation of significance makes sense, though even then we've typically stacked the deck when we decide which hypothesis to use as the null (e.g., is it riskier to treat a non-sick patient or to fail to treat a sick patient?).

    1. ScientistSeesSquirrel Post author

      Nicely put. I think the distinction you’re getting at in your second paragraph is between inference (which can be continuous) and decision (which can’t be). The use of absolutist tests for process control, which goes back to Neyman and Pearson, is analogous to your medical example. I didn’t get into that in depth, but that was certainly something heavily discussed in the early stats literature!

  3. Richard D. Morey

    The problem is that yes, p values are continuous numbers, but the justification for using them as strengths of evidence — at least, by any commonly used definition of evidence — is lacking. I can take any continuous function of p that I like, and it is still continuous. That doesn’t mean that the number that comes out is meaningful or useful. The mathematical fact of p’s continuity is irrelevant from the point of view of its meaning.

    The absolutist philosophy gives a solid justification for p values in terms of error rates. Without that justification, it isn’t clear how they are to be used. Suppose p=.045. What does this mean? It means that the probability of a more extreme observation, if the null is true, is .045. What are we to *do* with this fact? For his own part, Fisher said that the p value was related to a “feeling of doubt”, but what does that mean? How much doubt? What if two people have different amounts of doubt? Should p=.045 lead to the same amount of “doubt” if N=10 as if N=1000? The absolutist view solves this problem, and this problem *needs* to be solved.

    1. Mayo

      Richard: You might want to see my post on p-values as error probabilities. An observed p-value is still an error probability, whether it’s the one you preset as a cut-off or the one reported as observed.
      http://errorstatistics.com/2014/08/17/are-p-values-error-probabilities-installment-1/
      Also, I now see that Heard qualifies what he means by degree of evidence in terms of extent of inconsistency. But Bayes ratios don’t supply degrees of evidence in this sense.

  4. Mayo

    Stephen: I’ve had no time to comment, and since we had that extended exchange, I figured I already did. But here’s a quick reaction to your latest:
    I don’t know about statisticians falling down, but what’s problematic is declaring any positive difference as “trending” in the direction you want it to trend. (It’s also trending insignificant.) I’m glad you now mention that’s what is being derided.
    But there are some crucial points that are being confused: the continuous measure that P-values offer and what is sought by “magnitude of Bayes factors” or even Bayesian posteriors shouldn’t be run together. Bayes factors are just a comparative notion like likelihood ratios, and are entirely relative to what alternative you decide to contrast with a null or other test hypothesis, and possibly a prior as well. There is no strength of evidence with a comparison, and there’s also no obvious error control offered. By contrast, a P-value gets its meaning in tandem with its being an error probability. With a Bayesian posterior you might say there’s an attempt to give a degree of plausibility, but that will still be very different from the continuous P-value. Heard makes it seem as if any two continuous measures involved in statistical inference are more similar than are reports of observed P-values, and reports of whether a cut-off has been reached. That’s not so. It’s fine to see the P-value as a continuous measure of evidence, but what’s being measured would be the extent of discrepancy or inconsistency (or consistency).

    The statistical significance of an (observed) difference, d-obs, is the probability of a more extreme difference resulting even if the null hypothesis adequately describes the data generation, in the respect modeled. One cannot ascertain the indicated inconsistency or discrepancy without knowing the sample size, and I don’t think Heard mentions this in the post (I apologize if I missed this). But the main point is that the P-value arises within a falsificationist or corroborationist account, and is not offering degrees of belief, support, plausibility or the like.

    As far as what Fisher and NP really thought, Fisher was what Heard is calling an “absolutist” (terrible term), while NP objected to rigid cut-offs. People are confused by the fact that experimental planning—advocated by Fisher and by NP—occurs pre-data. That’s what planning means. So, if you want to plan an experiment with reasonable power to detect discrepancies of interest (while controlling the probability of erroneously finding an effect), you need to set the worst-case error probabilities in advance. That does NOT mean you are stuck reporting merely whether or not your result made it to a cut-off used in doing the planning! NP and Lehmann (the main NP spokesperson and N’s student) explicitly advocated reporting the observed P-value, and opposed setting a rigid cut-off. Lehmann has written many accessible papers and a great short recent book on the history of NP and F.
    Unlike Fisher, NP recommended balancing the probabilities of type 1 and 2 errors depending on what discrepancies were of interest. In some places N recommended post-data power analysis to determine what may legitimately be “confirmed” post-data. Pearson (Egon) always rejected a behavioristic performance approach. NP developed tests at the same time N was developing confidence intervals. The test result corresponds to an indication of a lower or upper bound of a CI at a chosen level. Again, what’s continuous is the estimated or indicated discrepancy or inconsistency with a test hypothesis.
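
    As a small illustration of that pre-data planning step (alpha, the discrepancy of interest, and the candidate sample sizes below are all assumed numbers, chosen only for illustration): fix the error probabilities and the effect you care about first, then ask what sample size gives adequate power.

    ```python
    # Sketch of pre-data planning: alpha and the discrepancy of interest are
    # fixed in advance, and power at candidate sample sizes is estimated by
    # simulation. Nothing here forces you to report only "significant / not
    # significant" once the data are in. All numbers are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    alpha, effect, reps = 0.05, 0.5, 1000   # assumed planning targets

    def estimated_power(n):
        hits = sum(stats.ttest_ind(rng.normal(0.0, 1.0, n),
                                   rng.normal(effect, 1.0, n)).pvalue < alpha
                   for _ in range(reps))
        return hits / reps

    for n in (30, 64, 100):
        print(f"n per group = {n}: estimated power = {estimated_power(n):.2f}")
    # Power rises from roughly 0.5 at n = 30 to roughly 0.9 at n = 100 for this
    # assumed discrepancy; the planning happens before any data exist.
    ```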

    From a post on my blog: Pearson shows that treating the observed difference (between the two proportions) in one way yields an observed significance level of 0.052; treating it differently (along Barnard’s lines), he gets 0.025 as the (upper) significance level. But in scientific cases, Pearson insists, the difference in error probabilities makes no real difference to substantive judgments in interpreting the results. Only in an unthinking, automatic, routine use of tests would it matter:
    “Were the action taken to be decided automatically by the side of the 5% level on which the observation point fell, it is clear that the method of analysis used would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule.” (ibid., 192)
    http://errorstatistics.com/2015/08/14/performance-or-probativeness-e-s-pearsons-statistical-philosophy/
    Much, much more on this badly misunderstood business of N, P, NP, and F may be found by searching errorstatistics.com.

    This is bound to contain typos, sorry.

    1. ScientistSeesSquirrel Post author

      Thanks – this is a really useful supplement. I’d be interested to hear a committed Bayesian comment on “there’s no strength of evidence with [a Bayes factor]”. Could be entertaining… 🙂

      By the way, I completely agree that “absolutist” and “continualist” are terrible, ugly terms. Because they’re normally called “Neyman-Pearsonian” and “Fisherian”, I had used those in an early draft, but you convinced me they were historically inaccurate. I’m completely open to better suggestions!

    1. ScientistSeesSquirrel Post author

      Thanks, Jeff! I’d seen that but forgotten about it. It’s an excellent example of exactly the kind of thinking I argue here is overly simplistic – but it’s so amusingly written that it’s hard to criticize. Although it would be really fun to make a contrasting list of times people have run across p = 0.051 with big effect sizes and dismissed them – only to discover later there were real effects lurking beneath the type II error. Which I recognize is a bit off my main point, so I’d better stop before I get myself in trouble 🙂

  5. Pranab

    As public health doctors who have to make clinical as well as policy decisions, we are always taught that the p value is a line in the sand, and the results are either significant, or not. We are rarely taught about the continuous “philosophy”! Thanks for such a wonderfully written article that non-statisticians can also appreciate. Love the blog, subscribing to it and hoping to learn from it in the days ahead!

    1. ScientistSeesSquirrel Post author

      Thanks, Pranab. Welcome to the blog. I wonder if the clinical/policy “absolutist” stance is a borrowing from the kind of industrial process control where I think pretty much all statisticians would apply absolutist methods. One could see diagnosis, I suppose, as somehow analogous to testing widgets as they roll down the assembly line – but, of course, one probably shouldn’t! Anyway, glad you found this interesting.
