(graphic by Chen-Pan Liao via wikimedia.org)
The P-value (and by extension, the entire enterprise of hypothesis-testing in statistics) has been under assault lately. John Ioannidis’ famous “Why most published research findings are false” paper didn’t start the fire, but it threw quite a bit of gasoline on it. David Colquhoun’s recent “An investigation of the false discovery rate and the misinterpretation of P-values” raised the stakes by opening with a widely quoted and dramatic (but also dramatically silly) proclamation that “If you use P=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.”* While I could go on citing examples of the pushback against P, it’s inconceivable that you’ve missed all this, and it’s well summarized by a recent commentary in Nature News. Even the webcomic xkcd has piled on.
Now, the fact that this literature exists is understandable. After all, it’s rather fun to be iconoclastic (and on other topics I’ve been known to indulge). And Ioannidis and Colquhoun, among others, make very important points about ways that P-values can be misused. But it betrays some real misunderstanding of statistics (and what statistics is for) to suggest as a result that P-values are not valuable, or – worse – to suggest that we should stop teaching hypothesis testing entirely.
The problem is that too many people have lost track of what the P-value does, and why that needs to be done. The P-value tells us whether a pattern in our data is surprising, under the “null hypothesis” that patterns are produced only by random variation**. That is, calculating the P-value gives us a check against self-delusion; only when the P-value confirms that our data would be surprising given randomness should we consider them further. The P-value is the beginning, not the end, of a statistical analysis.
This function of a check against self-delusion is absolutely essential, not because of math, but because of human psychology. Humans are very, very good at seeing pattern in nature, even when there’s no pattern to see (Nassim Taleb’s “Fooled by Randomness” explores this at length). We get excited about faces on Mars, runs in lotteries, clutch hitting streaks, and bunnies in clouds. We get similarly excited about patterns in our data, and we need a tool to counter our apophenia (yes, there’s a word for it, and an xkcd comic too). That tool is the P-value, and we can’t think clearly without it.
Criticisms of the use of P-values, when examined closely, are nearly always criticisms of P-values being used for other purposes. There are, unsurprisingly, quite a few things that P-values do poorly (since they were not designed to do them). For example, a P-value does not:
- Measure effect size. A very small P-value is “highly significant” because it provides strong evidence that there’s some real effect; it does not mean that effect must be large. This misunderstanding is frequent in the media, likely because in lay language, “significant” and “important” are synonyms. In statistics they are not. Measure effect sizes (and their uncertainties) and report them; the P-value can’t do that work for you.
- Rule out random “cause”. A “significant” P-value is permission to investigate further, not proof of an effect. If I ran 100 studies of a homeopathic remedy (which has no real effect to find), about five of them would yield P < 0.05, but all five would be false positives. That’s not a flaw in the P-value, but thinking it might be is evidence of a flaw in one’s statistical thinking. (Colquhoun actually deals with this very nicely, if you ignore the dramatically silly proclamation I quoted above and read his paper carefully instead.)
- Give a yes/no answer to “should I believe this effect?”. The P-value expresses probability on a continuous scale. Despite widespread recommendation, deciding a priori to believe absolutely in effects for which P = 0.049 and to disbelieve absolutely in effects for which P = 0.051 is every bit as silly as it sounds. Report exact P-values, and don’t let reviewers bully you away from hedges like “weakly” significant or “suggestive”. Such hedges are perfectly accurate.
- Express the relative performance of two different (non-nested***) models. Tempted to compare predictive value of precipitation, temperature, or soil type for plant growth by asking which gives the smallest P-value? Don’t do it; that’s what techniques like the Akaike Information Criterion are for.
- Dice onions without tears. OK, maybe nobody thinks it does. But would this belief really be more foolish than thinking a P-value could measure an effect size?
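The “100 studies” arithmetic in the list above is easy to check by simulation. What follows is only a minimal sketch in Python, under assumptions of my own choosing: each study draws 30 observations from a standard normal distribution (so the null hypothesis is true by construction), and the test is a simple two-sided z-test with known standard deviation. The function names are made up for illustration.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test P-value for H0: true mean is 0 (sigma assumed known)."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_null_studies(n_studies, n_per_study=30, seed=1):
    """Return one P-value per study; every study has no real effect."""
    rng = random.Random(seed)
    return [
        two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n_per_study)])
        for _ in range(n_studies)
    ]


# Many more than 100 studies, so the long-run fraction is easy to see.
p_values = simulate_null_studies(10_000)
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"Fraction of null studies with P < 0.05: {false_positive_rate:.3f}")
```

The printed fraction should land close to 0.05, and every one of those “significant” results is a false positive, because the simulation contains no effect to find.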
Think about it this way: the P-value is a #2 Phillips-head screwdriver. Observing that it does a poor job of crack-filling drywall is not a reason to throw it away – it’s a reason to learn what it’s actually for (driving #2 Phillips-head screws) and to learn what tool actually is good for crack-filling drywall (an 8” drywall taping knife). Don’t abandon the P-value. Do use it for the crucial purpose it fulfills – and only for that purpose.
And please stop with the cheap shots at our poor beleaguered P-value.
© Stephen Heard (firstname.lastname@example.org) Feb 9 2015
UPDATE: When I wrote this I had somehow missed the excellent interchange in Ecology’s Forum section: P values, hypothesis testing, and model selection: it’s deja vu all over again (Ecology 95:609-653, 2014; Hat tip, Daniel Lakens). This includes excellent arguments both pro and con. Feel free to poke fun at the irony of this given my earlier post When Not To Read The Literature…
*Not even Colquhoun really believes this – it depends entirely on things like the power of the test and the size of the true effect (if there is one). The false-positive fraction can be zero or it can be 100%, and you don’t get to know before you start. The body of his paper outlines this quite well, but that’s not what got quoted, tweeted, and blogged.
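To make the dependence in the footnote above concrete, here is a rough sketch of the arithmetic in Python. The function name and the particular input values are illustrative assumptions of mine, not numbers taken from this post; `prior_true` stands for the fraction of tested hypotheses in which the effect is real, which in practice you don’t get to know.

```python
def false_positive_fraction(alpha, power, prior_true):
    """Expected share of 'significant' results that are false positives.

    alpha: significance threshold (chance of a false positive when H0 is true)
    power: chance of detecting an effect when it is real
    prior_true: fraction of tested hypotheses where the effect is real
    """
    false_pos = alpha * (1.0 - prior_true)  # null true, yet P < alpha
    true_pos = power * prior_true           # effect real and detected
    return false_pos / (false_pos + true_pos)


# Same threshold and power, very different answers as the prior changes.
for prior in (0.0, 0.1, 0.5, 1.0):
    share = false_positive_fraction(alpha=0.05, power=0.8, prior_true=prior)
    print(f"prior_true={prior:.1f}: false-positive share = {share:.2f}")
```

With these made-up inputs, the share runs from 100% (when no tested effect is real) through about 36% (when one in ten is real) down to zero (when every tested effect is real), which is why no single figure like “30%” can be right in general.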
**More precisely, the P-value is the probability of drawing data the way we did and getting a pattern as strong as, or stronger than, the one in our actual data – given that all assumptions are met and that patterns arise only through sampling variation (that is, the null hypothesis is true). A statement like this appears in every statistics textbook, but the fraction of statistics students who don’t master it is distressingly large.
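One way to see the definition in the footnote above at work is a permutation test, which estimates that probability directly by reshuffling the data. This is a minimal sketch in Python, not a recommendation of any particular test: the two groups of measurements are invented for illustration, the function name is hypothetical, and “pattern” is taken to mean the absolute difference between group means.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Estimate P for a difference in means by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are arbitrary,
        # so any reshuffling of them is as likely as the real one.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        # Small tolerance so ties are not lost to floating-point rounding.
        if diff >= observed - 1e-12:
            as_extreme += 1
    # Count the observed arrangement itself, so the estimate is never 0.
    return (as_extreme + 1) / (n_shuffles + 1)


# Invented measurements, for illustration only.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
treated = [5.6, 6.1, 5.2, 6.4, 5.9, 5.5]
p_value = permutation_p_value(control, treated)
print(f"Estimated P-value: {p_value:.4f}")
```

The returned value is the fraction of label shufflings that produce a difference at least as large as the observed one, which is the “as strong as, or stronger than” part of the definition made literal.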
***For nested models, P-values can accomplish a function very close to this – for instance, when we test significance of a quadratic term in a model, to see if it explains variance better than the linear alternative. But I’m wandering away from my point.