*(graphic by Chen-Pan Liao via wikimedia.org)*

The *P*-value (and by extension, the entire enterprise of hypothesis-testing in statistics) has been under assault lately. John Ioannidis’ famous “Why most published research findings are false” paper didn’t start the fire, but it threw quite a bit of gasoline on it. David Colquhoun’s recent “An investigation of the false discovery rate and the misinterpretation of *P*-values” raised the stakes by opening with a widely quoted and dramatic (but also dramatically silly) proclamation that “If you use *P*=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.”* While I could go on citing examples of the pushback against *P*, it’s inconceivable that you’ve missed all this, and it’s well summarized by a recent commentary in Nature News. Even the webcomic xkcd has piled on.

Now, the fact that this literature exists is understandable. After all, it’s rather fun to be iconoclastic (and on other topics I’ve been known to indulge). And Ioannidis and Colquhoun, among others, make very important points about ways that *P*-values can be misused. But it betrays some real misunderstanding of statistics (and what statistics is for) to suggest as a result that *P*-values are not valuable, or – worse – to suggest that we should stop teaching hypothesis testing entirely.

The problem is that too many people have lost track of what the *P*-value does, and why that needs to be done. The *P*-value tells us whether a pattern in our data is surprising, under the “null hypothesis” that patterns are produced only by random variation**. That is, calculating the *P*-value gives us a check against self-delusion; only when the *P*-value confirms that our data would be surprising given randomness should we consider them further. The *P*-value is the beginning, not the end, of a statistical analysis.
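One way to make “surprising under the null” concrete is a permutation test. The sketch below is illustrative only: the function name `permutation_p_value`, the made-up plant-height numbers, and the choice of a difference in group means as the “pattern” are all assumptions made for the example. It shuffles the group labels many times and asks how often random relabelling alone produces a difference at least as large as the one observed.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation P-value for a difference in group means.

    Hypothetical helper: estimates how often shuffled labels give a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel at random: this is the null hypothesis
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            as_extreme += 1
    # Add-one correction: the observed labelling counts as one arrangement.
    return (as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up heights (cm) for two hypothetical treatments.
    control = [10.0, 10.2, 9.9, 10.1, 10.3, 9.8]
    treated = [12.0, 12.2, 11.9, 12.1, 12.3, 11.8]
    print("P =", permutation_p_value(control, treated))
```

A small value here says only that the observed difference would be unusual if the labels were arbitrary. It says nothing about how large or how important the difference is, which is a separate question.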

This function of a check against self-delusion is absolutely essential, not because of math, but because of human psychology. Humans are very, very good at seeing pattern in nature, even when there’s no pattern to see (Nassim Taleb’s “Fooled by Randomness” explores this at length). We get excited about faces on Mars, runs in lotteries, clutch hitting streaks, and bunnies in clouds. We get similarly excited about patterns in our data, and we need a tool to counter our apophenia (yes, there’s a word for it, and an xkcd comic too). That tool is the *P*-value, and we can’t think clearly without it.

Criticisms of the use of *P*-values, when examined closely, are nearly always criticisms of *P*-values being used for other purposes. There are, unsurprisingly, quite a few things that *P*-values do poorly (since they were not designed to do them). For example, a *P*-value **does not**:

- Measure effect size. A very small *P*-value is “highly significant” because it provides strong evidence that there’s some real effect; it does not mean that effect must be large. This misunderstanding is frequent in the media, likely because in lay language, “significant” and “important” are synonyms. In statistics they are not. Measure effect sizes (and their uncertainties) and report them; the *P*-value can’t do that work for you.

- Rule out random “cause”. A “significant” *P*-value is permission to investigate further, not proof of an effect. If I ran 100 studies of a homeopathic remedy, about 5 of them would yield *P* < 0.05, but all five would be false positives. That’s not a flaw in the *P*-value, but thinking it might be is evidence of a flaw in one’s statistical thinking. (Colquhoun actually deals with this very nicely, if you ignore the dramatically silly proclamation I quoted above and read his paper carefully instead.)

- Give a yes/no answer to “should I believe this effect?”. The *P*-value expresses probability on a continuous scale. Despite widespread recommendation, deciding *a priori* to believe absolutely in effects for which *P* = 0.049 and to disbelieve absolutely in effects for which *P* = 0.051 is every bit as silly as it sounds. Report exact *P*-values, and don’t let reviewers bully you away from hedges like “weakly” significant or “suggestive”. Such hedges are perfectly accurate.

- Express the relative performance of two different (non-nested***) models. Tempted to compare predictive value of precipitation, temperature, or soil type for plant growth by asking which gives the smallest *P*-value? Don’t do it; that’s what techniques like the Akaike Information Criterion are for.

- Dice onions without tears. OK, maybe nobody thinks it does. But would this belief really be more foolish than thinking a *P*-value could measure an effect size?
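Two of the points in the list above can be checked with a few lines of Python. This is a hedged sketch under simplifying assumptions (normally distributed data with known unit variance, a two-sample z-test, and hypothetical function names), not a recipe for real data. The first part simulates many studies of a treatment with no effect and counts how many reach *P* < 0.05. The second part holds a tiny standardized effect fixed and shows how the *P*-value shrinks as sample size grows, which is why a small *P* cannot stand in for an effect size.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided P-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def null_study_p_value(rng, n_per_group):
    """One simulated study with NO real effect: both groups are N(0, 1)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = sum(a) / n_per_group - sum(b) / n_per_group
    # z-test with known unit variance: SE of the difference is sqrt(2 / n).
    z = diff / math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(z)


def fraction_significant_under_null(n_studies, n_per_group=20, alpha=0.05, seed=1):
    """Share of null studies that come out 'significant' (all false positives)."""
    rng = random.Random(seed)
    hits = sum(null_study_p_value(rng, n_per_group) < alpha for _ in range(n_studies))
    return hits / n_studies


def p_for_fixed_effect(standardized_effect, n_per_group):
    """P-value if the observed difference equals the given standardized effect."""
    z = standardized_effect * math.sqrt(n_per_group / 2.0)
    return two_sided_p_from_z(z)


if __name__ == "__main__":
    print("share of null studies with P < 0.05:",
          fraction_significant_under_null(100))
    for n in (100, 10_000, 1_000_000):
        print("n per group =", n, " P =", p_for_fixed_effect(0.02, n))
```

With many simulated null studies the share of “significant” results settles near the chosen alpha, and every one of them is a false positive. In the second loop the effect never changes; only the sample size does.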

Think about it this way: the *P*-value is a #2 Phillips-head screwdriver. Observing that it does a poor job of crack-filling drywall is not a reason to throw it away – it’s a reason to learn what it’s actually for (driving #2 Phillips-head screws) and to learn what tool actually *is* good for crack-filling drywall (an 8” drywall taping knife). Don’t abandon the *P*-value. Do use it for the crucial purpose it fulfills – and only for that purpose.

And please stop with the cheap shots at our poor beleaguered *P*-value.

*© Stephen Heard (sheard@unb.ca) Feb 9 2015*

UPDATE: When I wrote this I had somehow missed the excellent interchange in Ecology’s *Forum* section: *P values, hypothesis testing, and model selection: it’s deja vu all over again* (Ecology 95:609-653, 2014; hat tip, Daniel Lakens). This includes excellent arguments both pro and con. Feel free to poke fun at the irony of this given my earlier post When Not To Read The Literature…

*Not even Colquhoun really believes this – it depends entirely on things like the power of the test and the size of the true effect (if there is one). The false-positive fraction can be zero or it can be 100%, and you don’t get to know before you start. The body of his paper outlines this quite well, but that’s not what got quoted, tweeted, and blogged.
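The dependence on power and on how often real effects exist can be written out as simple arithmetic. The sketch below assumes the usual screening-test bookkeeping: a fraction `prior_true` of the hypotheses tested are real effects, real effects are detected with probability `power`, and true nulls are falsely rejected with probability `alpha`. The function name and the example inputs are illustrative assumptions, not values taken from any particular study.

```python
def false_positive_fraction(alpha, power, prior_true):
    """Fraction of 'significant' results that are false positives.

    alpha      -- probability of rejecting a true null
    power      -- probability of detecting a real effect
    prior_true -- fraction of tested hypotheses where the effect is real
    """
    false_positives = alpha * (1.0 - prior_true)
    true_positives = power * prior_true
    total = false_positives + true_positives
    if total == 0.0:
        raise ValueError("no significant results are expected at all")
    return false_positives / total


if __name__ == "__main__":
    # No real effects anywhere: every 'discovery' is false.
    print(false_positive_fraction(0.05, 0.8, 0.0))
    # Every tested effect is real: no 'discovery' is false.
    print(false_positive_fraction(0.05, 0.8, 1.0))
    # Illustrative middle case: 0.045 / (0.045 + 0.08) = 0.36
    print(false_positive_fraction(0.05, 0.8, 0.1))
```

The same alpha of 0.05 gives anything from 0% to 100% false discoveries depending on inputs that are usually unknown in advance, which is the point of the footnote.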

**More precisely, the *P*-value is the probability of drawing data the way we did and getting a pattern as strong as, or stronger than, the one in our actual data – given that all assumptions are met and that patterns arise only through sampling variation (that is, the null hypothesis is true). A statement like this appears in every statistics textbook, but the fraction of statistics students who don’t master it is distressingly large.

***For nested models, *P*-values can accomplish a function very close to this – for instance, when we test significance of a quadratic term in a model, to see if it explains variance better than the linear alternative. But I’m wandering away from my point.

jeffollerton: Nice to see a bit of common sense being applied here, in contrast to the hyperbole that’s accompanied some of these papers. As I pointed out in a letter to the Times Higher a few years ago, if, as Ioannidis suggests, most research is incorrect, then clearly that’s probably also true of his paper claiming that “most published research findings are false”!

ScientistSeesSquirrel (post author): Thanks, Jeff! (And nice twist on the Ioannidis claim!)

Andrew Hendry: The #2 Phillips analogy is great, since that is a pretty crappy head. Robertson is way better, as are many other options: http://en.wikipedia.org/wiki/List_of_screw_drives

markusq: And there’s an xkcd for that, too: https://xkcd.com/1474/

Alex Etz: Hi Stephen, I appreciate you posting this and sticking your head “above the battlements” as you said on Twitter. I would normally tweet to you with comments but this would be a little too long for that.

“Criticisms of the use of P-values, when examined closely, are nearly always criticisms of P-values being used for other purposes.”

There is a great book by Michael Oakes (1986), called “Statistical Inference: A Commentary for the Social and Behavioural Sciences” that critiques p-values and significance tests for exactly what they are. It’s pretty short and straightforward (and if you are interested I can share a few digital chapters with you). While it does, of course, address how they are often misused, it also has a broad range of critiques against the general theory and idea of significance testing.

Also, I would be interested to know which p-value procedures you are advocating for here, as there are *at least* 3 major divisions of significance tests (I wrote about this here: https://nicebrain.wordpress.com/2014/12/11/are-all-significance-tests-made-of-the-same-stuff/). From what I can tell it would fall under number 2 in the post I linked, but I would be interested to hear your thoughts. The bit of your post that I quote below makes me think you definitely aren’t a subscriber to Neyman-Pearson testing.

Depending on which one you want to use, your point “[P-values don’t g]ive a yes/no answer to “should I believe this effect?”. The P-value expresses probability on a continuous scale.” and the rest of that bullet is either a helpful reminder (for Fisher’s version) or way off base (Neyman and Pearson’s version).

ScientistSeesSquirrel (post author): Thanks, Alex, and nice post – while I don’t agree with your final take-home (as I suspect you don’t agree with mine), it’s clearly laid out and you distinguish the “flavours” of P-value very usefully. Readers of my post should read yours too.

You are correct that I am siding with Fisher and not Neyman-Pearson in terms of interpreting P-values: your #2. (But I am not completely convinced that your #3 is different; it may just be the endpoint of a continuum of strength of evidence against the null).

Cheers,

Steve

Alex Etz: Thanks, Stephen. I agree with you that 3 and 2 are very similar and I wouldn’t oppose anyone lumping them together. The extra layer of making decisions is what I feel justified the split, but that’s just my interpretation.

It makes it hard to make recommendations for what p-values can and can’t do when there are so many flavors. Neyman and Pearson would say there is no weak or suggestive evidence from borderline ps, since there is no evidence in their procedure at all! So when reviewers hound you about using them they probably aren’t being pedantic; they are just in framework 1 whereas you’re in framework 2 or 3.

Chris Fonnesbeck (@fonnesbeck): I agree with most of this, but I would phrase your toolbox analogy differently. It’s more like a Robertson head screwdriver, in that it’s very rare that you will encounter the sort of screw that it is useful for. What I mean by this is that the “under the null hypothesis” condition is typically a trivial hypothesis that is not worth rejecting. Effect sizes, regression parameters, and differences between groups are always non-zero, and hence testing them (via a p-value or otherwise) is not helpful. So, bad null hypotheses (and those who formulate them) should be the target of our ire. I always encourage the use of estimation in place of hypothesis testing: estimate the size of the difference and provide an estimate of uncertainty in that estimate, and leave it to subject matter experts to tell us whether it is relevant (or go and collect more data). Non-trivial null hypotheses are like Robertson head screws: so rare that it may not be worth buying the screwdriver, and if it’s in your toolbox, it doesn’t need to be near the top.

ScientistSeesSquirrel (post author): Thanks for the comment, Chris. Actually, in Canada, Robertson heads are very common! And I think so are non-trivial nulls. Precisely because humans are so very good at seeing differences and patterns (whether real or not), I think we should not a priori assume that “Effect sizes, regression parameters, and differences between groups are always non-zero”. But you are very, very right that we should be estimating and reporting effect sizes; I just would like them to complement P-values rather than replace them.

I didn’t put this in my post, but I think it’s very interesting that P-values get criticized for underdetecting (“Effect sizes are always non-zero, so nonsignificant Ps are just low power”) and simultaneously, albeit by different people, for overdetecting (Ioannidis, Colquhoun, etc.). These two criticisms seem logically incompatible to me. Like a political stance detested by both left and right, a statistical procedure assailed from both sides seems pretty good to me!

Chris Fonnesbeck (@fonnesbeck): Yeah, I’m a Canuck also, but I didn’t realize who I was speaking to!

It’s a simple mathematical truth that any two real-valued population parameters will not be exactly the same. One of the issues is that scientists play fast and loose with mathematical notation: mu_0 = mu_1 doesn’t mean “about” the same, it means **exactly** the same. And if they are not exactly the same, you can always collect data until the null hypothesis is rejected. One of the problems with the p-value is that it is so sensitive to n. With analyses of very large datasets it is common for multiple regression parameters to be statistically significant (because, as I contend, there is no such thing as a regression parameter that is zero-valued), even when the underlying effect size is trivially small.

So, maybe my correction to your analogy did not go far enough — when there is a Robertson head screw, the Robertson head screwdriver is the best tool for the job. I can’t think of a scenario where statistical hypothesis testing using p-values is the most suitable choice, so I’d be excited to see if you can provide an example where it is.

Tobi: As Steve nicely pointed out, there are things that cannot be estimated by p-values. But the abuse of p-values stems from institutional problems (e.g. statistical training). More importantly, the success of a study is largely evaluated by p-values. That is, the burden of evidence is generally light, as the p-value is often the central evidence and it’s almost certain that one would be able to find citations that can convince reviewers of the veracity of the findings. While I think that p-values can lend credibility to research findings, I would like to see them discussed in a more logical and system-specific manner.

Furthermore, as a friend once pointed out to me, there can be orderliness in randomness. Because p-values are sensitive to N, it seems sometimes that all they can tell us is that nature is not digital. Going back to Chris Fonnesbeck’s comments: how useful is the null hypothesis? For instance, what can we really learn from heights of plant populations with P < 0.0001 or P = 0.51 when we have no a priori knowledge of the population structure (age etc.)? To my knowledge, an inference has been drawn from such observations before.

While Ioannidis and Colquhoun’s claims may be superficial to some degree, the understanding of p-values displayed by Steve and others is not particularly common. Therefore, there’s something to worry about.

grumble10: Thanks for this nice post, it is a very important discussion that started a while back and is still not resolved (if it ever will be).

I also feel more and more uneasy with p-values, for two main reasons. The first is their dependency on sample size: an effect or term is not significant? Just go back and sample 100 more replicates and it likely will be. The second is the arbitrary cut-off that you mentioned, which makes us show only results below 0.05, skewing the distribution of reported p-values, which should (theoretically, under the null) be uniformly distributed but most certainly peaks just below 0.05.

I find it much more informative to report (standardized) coefficients if we are in a regression framework or something like % of sum of squares if we are in an ANOVA framework.

I wonder how other fields of science are dealing with these issues …

Alex Etz: Grumble10, just curious: why do you think p’s dependency on sample size is a problem?

grumble10: Let’s say a team does an experiment with 20 replicates looking at the effect of X on Y. They apply whatever stats are adequate for their data and find no significant effect of X, and they write a paper: “No effect of X on Y”. A few years later another team does the same thing with 100 replicates, finds significant results, and writes a paper: “Strong effect of X on Y”. Who is wrong? They have been measuring the same effect; the only difference comes from the sample size, and all their interpretation and discussion will be affected by this. Bottom line: if you collect enough data everything will be significant, so it is not so interesting to talk about significance.

jeffollerton: This seems to me to be a very mechanistic way to think about p-values that only applies if there is likely to be some effect of x on y. There are many examples where x will never have an effect on y, no matter how many times an experiment is performed. If a replication effect is suspected, isn’t that the time to use corrections to the p-value such as Bonferroni?

Jeff Houlahan: Hi Steve, I’m enjoying the blog. My growing philosophy on this is that we get buried in arguments about model selection (and null hypothesis testing is just another model selection approach – just on two imprecise models. That’s my biggest complaint about null hypothesis testing…models as imprecise as those you normally find in null hypothesis tests should be relatively rare for questions that have moved beyond the earliest stages) – whether to use p-values or AIC or effect sizes or cross-validation etc – when we should be seeing any of these methods as coarse filters that allow us to identify a short list of plausible models/hypotheses. The real test of the models on the short list comes when we make predictions on out-of-sample data. What I would really be interested in knowing is – of the long list of model selection techniques including p-values – which is most likely to identify the model that makes the best predictions on out of sample data.

If null hypotheses are a good way of learning how the natural world works then I’m on board. But I expect that when we are forced to view all the processes of the world as dichotomous, and it seems to me that null hypothesis testing kind of forces us to do that, we probably aren’t getting a very nuanced understanding of how the world works. Jeff Houlahan

Allie Hunter (@AllieLHunter): Late to the discussion, but saw your tweet calling the ASA’s statement “very sensible” and I agree, and appreciate this blog post! The statement is worth reading for anyone who hasn’t seen it yet (http://www.amstat.org/newsroom/pressreleases/P-ValueStatement.pdf).

When you say, “The P-value tells us whether a pattern in our data is surprising, under the “null hypothesis” that patterns are produced only by random variation.**”, the ** and associated footnote is to me the key part of the sentence – as you note, it’s random variation given the assumptions of your model for the test (ASA principle 2 is meant to address this misconception). I think people are sometimes unaware of the assumptions they are making and don’t think to assess whether they are appropriate.

I appreciate Alex Etz’s link; I don’t always think these differences are made clear to students. I would also argue that 2 and 3 are conceptually quite different, though I would be interested in a discussion about whether they are different in practice. I certainly did not see these different viewpoints made explicit in either my 1-semester Stats for Psychologists class or in my year-long Stats & Probability courses from the math department when I was a student. I certainly see 2 being used all the time, which is interesting to me because it was always emphasized that you needed to choose your significance level of 0.05, 0.01, etc., ahead of time (while at the same time, reading literature that operates under viewpoint 2).

ScientistSeesSquirrel (post author): Thanks, Allie! With regards to pre-specifying the threshold vs. degrees of rejection – I expanded on this at some length in another post: https://scientistseessquirrel.wordpress.com/2015/11/16/is-nearly-significant-ridiculous/. I agree with you that those viewpoints are frequently muddled!

Justin: Re: Colquhoun’s “If you use P=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.”

He is looking at a single experiment to make a conclusion, which is an odd thing to do (especially for frequentist methods) in my opinion. In fact, if we use updating on the P(H0|data), even with one replication things get statistically significant very quickly. I show this here using his numbers: http://www.statisticool.com/fdr.JPG

And of course Fisher told us back in the like 1930s or something of the need to do more than one experiment,

Justin

http://www.statisticool.com
