*Comic: xkcd #892, by Randall Munroe*

For some reason, people seem to love taking shots at null-hypothesis/significance-testing statistics, despite its central place in the logic of scientific inference. This is part of a bigger pattern, I think: it’s fun to be iconoclastic, and the more foundational the icon you’re clasting (yes, I know that’s not really a word), the more fun it is. So the *P*-value takes more than its share of drubbing, as do decision rules associated with it. The null hypothesis may be the most foundational of all, and sure enough, it also takes abuse.

I hear two complaints about null hypotheses – and I’ve been hearing the same two since I was a grad student. That’s mumble-mumble years listening to the same strange but unkillable misconceptions, and when *both* popped their heads up again within a week, I gave myself permission to rant about them a little bit. So here goes.

**(1) “The null hypothesis is often uninteresting”** (sometimes followed by *“if it were true, we wouldn’t care about it”*). Well, sure, of course the null is “uninteresting”. By definition, the null expresses the *lack* of pattern, and usually, it’s that pattern we’re really interested in. So we frame and test a null because *rejecting* it would be interesting.

It’s true enough that if we can’t reject, we may be disappointed. We’re human, and often emotionally invested in our ideas. But so what? It isn’t Nature’s responsibility to make itself interesting for us; and if one of our pet hypotheses turns out not to have compelling evidence behind it, well, so it goes. Don’t worry, something else cool will be along in a moment.

**(2) “The null is never actually true, so rejecting it isn’t helpful”**. This complaint is usually accompanied by a declaration that any two groups of things (let’s say) always differ somehow, at least a little, so distinguishing them is simply a matter of gathering a large enough sample size to get a significant *P*. Here’s an example among many: Andrew Gelman, as an aside in an otherwise excellent post about multiple regression (etc.):

*“I don’t think there are zero effects, so I think it’s just a mistake overall to be saying that some predictors matter and some don’t.”**

The objection that *“the null is never actually true”* is a strange one. It’s also a bit slippery, because I think it can be wrong in any of three ways. The first two seem like open-and-shut cases, stemming from clearcut misconceptions about how null hypothesis testing works. The third is much more interesting, and contains a striking claim about the nature of the universe. I’ll save the best for last.

- First, in some hands the objection seems to betray confusion about samples vs. populations. It’s true that even when two populations are identical, sampling from them will nearly always produce a small difference; and so if you make the mistake of p-hacking by checking “significance” of many repeated samplings, you’ll always eventually reject the null. You do need the p-hacking part, because of course even when two populations are very different, you’ll sometimes get two samples that are identical. Keep going, though, and if you reason this way your belief that ‘the null is never true’ will be sustained. When the null actually is true, you may have to keep going longer; but with a bit of patience you’ll always get two samples that “differ”. But of course this doesn’t make the null true (about the populations); it just makes you guilty of not understanding sampling.
- Second, the objection sometimes arises because people pay attention only to *P*-values and neglect effect sizes. Let’s assume for the moment that large enough sample sizes always reveal differences between groups, not because of p-hacking but because the null actually isn’t ever true (more about this soon, in #3). Those effects that turn up for large *n* are tiny (otherwise the line about “large enough sample size” wouldn’t be there), and there’s no problem with declaring such an effect real but unimportant. To do so, we use the degree of the pattern’s departure from the null hypothesis – and when that degree of departure is very small, we may find the null hypothesis a useful representation of nature even though we find it false. Here, the degree to which we reject the null matters – as long as we’re paying attention.
- Third, and most interestingly, sometimes the objection seems to be that there *really* *aren’t* any zero effects (that is, the null really is always false), and that these universally-real effects matter. But what would it mean if this were true? I think it would mean we were making a deep but completely unfounded claim about the nature of the universe. That claim is that *all* explanatory variables matter, that *all* possible causes exist. For each such possible cause, making this claim is a strong statement that you know the true nature of the universe – and that you know this without the need to gather evidence. Does fish body size respond to environmental phosphates? Yes. Does it respond to environmental silica? Yes. Does it respond to environmental xenon? Yes. Does it respond to the number of supernova remnants in the sector of sky defined by the Unicode number for the second letter of the fish’s Latin name? Yes**. This claim that all causes exist is a breathtaking one, and it seems to reduce science to an exercise in mensuration. But I’m not sure why its claimants stop there. If we know without the need for evidence that causes exist, why don’t we similarly know their magnitudes? How is it that we can reject (without evidence) a value of zero for an effect size, but no other value? If we claim that we know these things without evidence, do we need evidence to know that we know them? (Ouch.)
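For readers who like to see the arithmetic behind the first two points, here is a minimal sketch in Python. Everything in it is an assumption made for illustration rather than anything from a real dataset: the function names are made up, the test is a simple two-sample z-test with equal group sizes and a known standard deviation of 1, and the "tiny" effect is set to 0.01 standard deviations. The first function shows how a fixed, tiny difference in means becomes "significant" once *n* is large enough; the second shows how quickly the chance of at least one spurious rejection grows when you test a true null over and over on independent samples.

```python
import math


def two_sample_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided P-value for a two-sample z-test.

    Illustrative assumptions: two groups of equal size, each with a known
    standard deviation of 1, and an observed difference in means equal to
    `effect_size` (so the effect is already in standard-deviation units).
    """
    # Standard error of the difference in means is sqrt(1/n + 1/n) = sqrt(2/n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def chance_of_spurious_rejection(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `n_tests` independent tests
    rejects a null hypothesis that is actually true."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    # Point 1: keep resampling under a true null and a "significant"
    # result becomes nearly inevitable.
    for k in (1, 10, 100):
        print(f"{k:>3} tests: P(at least one rejection) = "
              f"{chance_of_spurious_rejection(k):.3f}")

    # Point 2: the same tiny effect, at ever larger sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(f"n = {n:>9,} per group: P = {two_sample_p_value(0.01, n):.3g}")
```

The effect size passed to `two_sample_p_value` never changes in the second loop; only *n* does. That is the sense in which a small *P* by itself says nothing about whether the departure from the null is large enough to care about.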

Now, I’ll forgive you if you think all of my arguments are silly caricatures. But they’re the arguments we’re inevitably pushed to, once we decide to pay serious attention to the claim that null hypotheses don’t matter. The universe is a complex place***, and unassisted humans are really good at misinterpreting it. Null hypotheses and our apparatus for testing them are something science needs. Could we please stop clasting this particular icon?

*© Stephen Heard (sheard@unb.ca) April 3, 2017*

*^This is not *at all* to pick on Andrew Gelman, who is an excellent statistician and whose work is thought-provoking and invaluable – look further into his blog for a sampling. In fact, the objection turning up from someone whose work I respect so much, when I had this post half-written, really made me think.

**^You may object that I made this last hypothesis up to be deliberately ridiculous. You’re right, I did (and it was fun). You could respond with the claim that it’s only *plausible* causes that always exist. But then you’re making *another* deep and unfounded claim about the nature of the universe: that plausible hypotheses are always true, and implausible ones always false, and that we can distinguish plausible-and-thus-true hypotheses from implausible-and-thus-false ones – without (again) the need for evidence. I may have guided you gently toward this bottomless rabbithole, but I didn’t put it there.

***^Darn it, now *I’m* the one making deep claims about the nature of the universe without evidence.

Harald von Waldow: A very nice and insightful post that I’ll retweet in a second. But I think you hide the problem with objection #2 a bit with “… and when that degree of departure is very small, …”. Because there is no objective judgment about whether that departure is large or small – at least not from the results of the test.


ScientistSeesSquirrel (post author): Not sure I understand you, but the fault may be mine. When I say “degree of departure is small” I mean the effect size. So there is certainly an “objective judgement about whether that departure is large or small”, as long as you count an effect size estimate. Or do you mean that deciding whether an effect size of 5 (let’s say) is “large” or “small” isn’t objective? If so, well, sure, but then nothing is objective 🙂


crowther: When teaching intro bio last fall, I actually wrote a short song/rap about objection #1! Let’s see if WordPress will let me leave the URL here…. http://faculty.washington.edu/crowther/Misc/Songs/MP3/Null_Hypothesis.mp3


ScientistSeesSquirrel (post author): See, this kind of thing is why I love having people comment!


Harald von Waldow: You are right, I did not think about the lucky researchers who regularly can report an effect size of 5. That is large, no doubt. I rather had in mind those poor sobs who get numbers between 0.1 and 0.8, which mean not much, unless you have in-depth domain knowledge that would allow you a judgement.


Harald von Waldow: And then it turns out your distribution is not normal. What measure for effect-size do you choose, objectively, and what is its interpretation, again?

*“If so, well, sure, but then nothing is objective”*

That is right. Almost nothing that comes from statistical considerations only, in particular if you analyze observations made outside a tightly controlled experimental setting. Statistical tests are overrated.


ScientistSeesSquirrel (post author): As should probably be obvious, my “5” example is not a *standardized* effect size, but in the biologically relevant units! But I didn’t make that clear.

If indeed you are bothered that no statistical or scientific procedure is objective – well, I’m not sure what I have to offer you!


Harald von Waldow: *“but in the biologically relevant units!”*

Ok, I see. Misunderstanding. Good that you have a clearly defined “biologically relevant unit”!

*“If indeed you are bothered that no statistical or scientific procedure is objective”*

Fine difference: The procedure might well be and ideally should be objective (difficult enough, think about unconscious p-hacking). Most of the time the conclusions that can be drawn from that are far from objective, but often enough are presented as such.


Jeff Houlahan: My problem with null hypothesis testing, Steve, is that when null hypothesis testing is the default approach then most questions get approached as if they are a null hypothesis question – as if the answer is binary. Is there an effect or not? This is reasonable in the early stages of study but as a question matures it should become about how large the effect is and how the effect size varies in different contexts and how the effect size interacts with other drivers. This point is not a new one but I still don’t think it’s been given enough weight. I believe that one index of a discipline’s maturity is how much it moves away from null hypothesis testing for longstanding questions. I would say that ecology’s track record on this is spotty at best – it is still possible to get research published that asks “Does predation have an effect on abundance or not”? Jeff


ScientistSeesSquirrel (post author): Sure, this is indeed a problem. But it’s kind of like saying “My problem with hammers is that people sometimes hit the nail with the wrong end”, isn’t it?


ScientistSeesSquirrel (post author): Wait, no, that’s not the right analogy. It’s more like “The problem with hammers is that people don’t hit nails hard enough with them”. Is it the fault of the tool if people harness only part of its power?


Jeff Houlahan: I don’t think the analogy works, Steve, because making a decision about whether to accept or reject the null hypothesis uses all the power of null hypothesis testing. The fact that statistical packages will include parameter estimates along with p-values isn’t relevant to what null hypothesis testing is intended to do – null hypothesis testing sets up two choices and then chooses between them.

I think the traditional saying is closer to the truth – that when all you have is a hammer everything looks like a nail. So, when all you have is null hypothesis testing everything looks like a binary decision. In my opinion, there are only two contexts where null hypothesis testing makes sense – (1) where the outcomes are truly binary (very rare in ecology) and (2) early on when you are attempting to see if you have a promising idea for explaining some pattern/phenomenon. Otherwise, we should be trying to develop models that contain increasingly large amounts of information (i.e. by identifying more of the true drivers of a process, by getting closer to the true functional relationships among variables and by making more accurate and precise parameter estimates) – null hypothesis testing is just about the worst way to do that.

So, I’m all for using null hypothesis testing when it makes sense – I just think that it makes sense almost never.

Jeff


Mary P: *“This is reasonable in the early stages of study but as a question matures it should become about how large the effect is and how the effect size varies in different contexts and how the effect size interacts with other drivers.”*

If indeed you are bothered that no statistical or scientific procedure is objective – well, I’m not sure what I have to offer you!


Jeff Houlahan: Hi Mary, I’m not sure what I wrote that suggested that I’m bothered that no statistical or scientific procedure is objective – it certainly wasn’t my intention to suggest that, although I may have inadvertently done it. The only point I intended from the sentence that you quoted was that I think disciplines, as they mature, should move to approaches/models that make much more precise predictions than those generally associated with null hypothesis testing. Jeff


Gavin Simpson: You’re a little unfair to Gelman here. Yes, in the linked blog post, he does say *exactly* what you quote. However, he alludes to this being a long-standing position of his, and I recalled this being discussed earlier in connection with the “bet on sparsity” principle and discussions by Gelman of the Lasso (for example, here and the linked manuscript).

There, Gelman specifically addresses social science problems (and environmental ones) that he works on. He’s not claiming that all situations are dense (have non-zero effects). He’s also, I believe, not saying that […] – he gives an example of his thinking in the ensuing discussion. My interpretation here is that he’s talking about plausible effects, ones that pertain to variables that we’ve already decided to include in our model or world view of how the data were created.

Furthermore, in the discussion of the Whither the Bet on Sparsity Principle in a Nonsparse World, Gelman makes the following distinction […]. This speaks directly to your point 2.iii – whilst Gelman might believe that the truth is dense, he’ll still fit sparse models to samples from truth.

Your 2.i also gets addressed by Gelman somewhere and he takes the opposite view to the one you present; whilst he believes the truth to be dense, he sees sampling as a means to generate a sparse model with some zero or effectively zero effects. Crucially he’s stating that he anticipates sparse solutions (some zero effects) in samples, especially small ones.

Finally on Gelman, it’s a little perverse to discuss p-values and Gelman in the same breath; I get the feeling he wouldn’t be caught dead doing the sort of hypothesis testing you describe 🙂

In 2.iii aren’t you in danger of drawing causal conclusions from the model? In the kind of setting I believe you are envisaging, there may well be some small, but crucially non-zero, relationship between xenon and fish body size, for many reasons that have nothing to do with any *causal* relationship between that noble gas and the size of fish. (I know little of note about either of these.)

I’ll admit to struggling with this post. I find your point 1 problematic because I often deploy such tests in situations where I have fitted a wiggly “effect” and the null is one of absolutely no effect. A better test would be one that asked whether this wiggly “effect” is better than (or different from) a linear effect. (One solution to that, ironically, is to apply a form of shrinkage and assume that the data generating process approximated by my model is sparse… which gets us back to the “bet on sparsity” argument…). In that sense, the zero-effect null is both trivial and a weak test for the purposes of my science, and we should challenge ourselves with stronger nulls than those that simply assume zero effect.

I say the above as someone that uses frequentist methods all the time though I rarely perform hypothesis testing sensu Neyman-Pearson. I’m more pragmatic Fisherian when it comes to p-values (or closet Bayesian when I happen to be fitting GAMs)


ScientistSeesSquirrel (post author): Gavin, thanks for these thoughtful comments. I’m actually pleased to hear that Gelman may not quite mean what he says. I don’t think I’m quoting him unfairly, although as I say in the footnote, he’s an excellent statistician and I was surprised to see 2.iii coming from him!

You have a point about causality, at least the way I wrote the post. I kind of elided over the distinction between a significant causal correlation and a significant association driven by some indirect causal pathway. Although I’m not so sure it matters. Is the claim that “all pairs of variables are significantly associated because there is always some nonzero causal pathway between them” really that much less improbable than “there are no null effects”? I’m not sure.


Gavin Simpson: Stephen, fair point on the “unfair” quoting – I considered “misrepresented” but that sounded too negative also.

There’s a problem in the second paragraph of your reply and I’m not sure if you meant this literally; “all pairs of variables are significantly associated…” Gelman doesn’t consider all non-zero effects “significant”. You are placing that description on these relationships because of the frequentist framework you are describing and the potential for infinite, or at least sufficiently large, samples such that any non-zero effect no matter how small will be rendered “statistically significant”. Or perhaps you didn’t mean “significant == statistically significant”?

Anyway, I certainly don’t see Gelman’s approach or beliefs here as accepting that all non-zero effects are significant. Far from it. (Gelman doesn’t see the point in testing at all if effects are zero or not.) He assumes all effects are non-zero but that many will be trivially small, just non-zero. In that sense, he’s considering the utility of a statistical approach or mind-set that is encompassed in the “bet on sparsity” principle.

What I see Gelman as meaning with his “no zero effects” statement is really that there are no truly orthogonal variables in the systems he studies, that no two variables have *exactly* 0 association. He accepts that most non-zero effects will be trivially small (not “significant”) and therefore in small samples assuming a sparse model can be an expedient approach. He also doesn’t mean that the true causal relationships cannot be discerned from data with appropriate techniques. But that’s different from an observational setting and a regression model where all we are estimating are relationships, not causal mechanisms.

I don’t see the problem myself with this as a philosophical viewpoint. It’s a reasonable position to view testing whether some tiny effect is significantly different from 0 as a pointless exercise, in part because one views all associations/relationships (but not causal mechanisms) as non-zero.


CuriousGeorge: Your Argument #3 doesn’t make any sense.

The more information you get (i.e., higher n), the more likely you are to reject the null. To the point where if you had perfect knowledge (i.e., infinite n), you would be able to detect any effect whatsoever, no matter how small. So p-values don’t help you at all in determining whether effects are “really non-zero” or not. If you had perfect knowledge, almost every single p-value you tested would be “significant”.

Since everything probably has SOME effect on what you’re studying (and here’s the key point: **it’s especially likely to have an effect if it’s something that you a priori considered potentially relevant to your research question**), it makes far more sense to estimate effect sizes of the things you’re interested in. Then, based on the effect sizes (and their s.e.), make a subjective decision if it’s an effect that you think is worth examining further. Allow your readers to agree or disagree with your subjective decision… If the effect sizes are published, then they can do that.

There’s a reason that no one uses p-values in model selection anymore.
