Warning: wonkish. Also long (but there’s a handy jump).
Over the course of a career, you become accustomed to reviewers raising strange objections to your work. As sample size builds, though, a few strange objections come up repeatedly – and that’s interesting. Today: the bizarre notion that one shouldn’t do significance testing with simulation data.
I’ve used computer simulation models as a tool many times over my career. The point of a simulation model is to compare the behaviour of a modeled system with and without a particular bit of modeled biology. I routinely use standard significance testing to make that comparison – to ask whether I have reason to believe that the bit of modeled biology makes a difference to the results. And several times now, I’ve run into the same strange objection to this.* Here’s one reviewer (paraphrased for length and anonymity):
It seems inappropriate to do significance tests on simulations, because P-values depend on sample size. You can, therefore, get as small a P-value as you want simply by running more simulations.
I hope you can see immediately why this objection doesn’t make any sense. It’s absolutely true that with simulations, one can “get as small a P-value as you like” by running more simulations – but critically, *only* if the null hypothesis is false. If there is no real effect, you can run all the simulations you like, and you’ll get significant P-values at the expected rate α, but never more.**
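If you’d like to check that claim in a few lines of R before going any further, something like this will do. It’s a minimal sketch – the means, sample sizes, and replicate counts are arbitrary choices of mine:

#under a true null, the rejection rate stays near alpha no matter how big each comparison gets
for (n in c(10, 1000)) {
  pvals <- replicate(1000, t.test(rnorm(n, 10), rnorm(n, 10))$p.value)
  cat("n =", n, " proportion of P < 0.05:", mean(pvals < 0.05), "\n")
}

Both proportions should hover right around 0.05, no matter how large you make n.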
If this is obvious to you, great – and you’ll probably want to take this jump past my demonstration that it’s true. But in case it isn’t obvious, I ran – appropriately enough – some simulations.
Here’s what I did. I ran ordinary two-sample t-tests, with each individual test comparing means of X and Y, where I drew n values of X and then n values of Y from normal distributions. What I’m doing here is, essentially, simulating simulations! You can think of each t-test comparing Xs and Ys as a comparison between two sets of simulation runs with and without that focal bit of modeled biology. (Actually, you pretty much have to think of my t-tests that way; otherwise, what follows is nothing more than a straight-up demonstration of how P-values work.)
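In code, a single comparison is nothing fancier than this (a sketch with placeholder means and n; imagine each rnorm() call standing in for a batch of model runs):

#one comparison: n runs without the focal biology (X) vs. n runs with it (Y)
n <- 100
X <- rnorm(n, mean = 10, sd = 1)    #stand-in for runs without the focal biology
Y <- rnorm(n, mean = 10.1, sd = 1)  #stand-in for runs with it
t.test(X, Y, var.equal = TRUE)$p.value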
I ran my t-tests-that-represent-simulations with sample sizes of 10 per group, then 100, then 1,000, and so on. This is exactly what reviewers object to: “simply running more simulations” (including more Xs and more Ys in each comparison by t-test). To see clearly what happens to P-values as we run more and more simulations, I repeated the whole process 1,000 times for each sample size, tracking how the distribution of P-values changes – or doesn’t.*** Here are the results:
Look at the red lines first. Those are for simulations in which the null hypothesis is false: the n values of X were drawn from a normal distribution with mean 10 and standard deviation 1, while the n values of Y were drawn from a normal distribution with mean 10.1 and standard deviation 1. (In other words, the focal bit of simulated biology changes the model outcome, although not very much – only from 10 to 10.1, or about 1%.) With a comparison based on just a few simulations (n = 10; red dotted line), power is very low. Individual runs aren’t very likely to yield a small P-value, and the mean P-value is quite large. If we run a few more simulations (n = 100; red dashed line), we start to see small P-values cropping up more frequently. If we run more simulations still (n = 1,000; red solid line), the P-value is usually small (75% are < 0.01). If we go all-out and run a ton of simulations (n = 10,000; red vertical line), the P-value is always small (the distribution can’t be plotted on this scale, but all 1,000 were < 0.001 and the mean was 2 × 10⁻⁷). So, when the null hypothesis is false, everything works as the reviewer suggests: running more simulations yields smaller P-values. And so it should: with more simulations, we get data that are increasingly unlikely under the null, and we’re more and more confident that we’re seeing a real effect.
Now the black lines. These are for simulations in which the null hypothesis is true: the n values of X and the n values of Y are both drawn from a normal distribution with mean 10 and standard deviation 1. (In other words, the focal bit of simulated biology does not change the model outcome.) This time, running more simulations makes absolutely no difference: whether you run 10 simulations (n = 10; black dotted line), or 1,000 (n = 1,000; black dashed line), or 100,000 (n = 100,000; black solid line), the P-values are consistent with a distribution uniform on [0,1]. (We can breathe a sigh of relief, because if they weren’t, something would be seriously wrong with the universe.)
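If you’d like to regenerate distributions like these, here’s a compact sketch of the whole procedure (not my original script – a simpler single-scenario version of that is in the footnote – and with the sample sizes truncated to keep the run time short):

#P-value distributions when the null is false (mean2 = 10.1) and when it is true (mean2 = 10)
numtests <- 1000
for (mean2 in c(10.1, 10)) {
  for (n in c(10, 100, 1000)) {
    pvals <- replicate(numtests, t.test(rnorm(n, 10), rnorm(n, mean2), var.equal = TRUE)$p.value)
    cat("mean2 =", mean2, " n =", n, " median P =", signif(median(pvals), 2), "\n")
  }
}

When the null is false, the median P-value collapses as n grows; when it’s true, it sits near 0.5 at every n.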
So: with simulation data, running more and more simulations and seeing what happens to P-values is in fact very nicely diagnostic. P-values shrink under the alternative hypothesis, but they do not do so under the null. There’s nothing surprising about this, of course: it’s just frequentist statistics working exactly as it should.
I think the more interesting question here is not whether you can use P-values with simulation data (of course you can!) – it’s what might lead someone to think that you can’t. Where does this strange objection come from? I can think of four possibilities, arranged here from least to most interesting (to me).
First, I may simply be running into reviewers who lack an intuitive understanding of basic statistics. That’s more common than you might think, even among professional biologists who use statistics to make inference; I suspect it’s partly because the subject is often abysmally taught. I may simply have had a reviewer who doesn’t realize that when the null is true, the expected distribution of P-values doesn’t depend on sample size. [Note: in the first version of the post I wrote “when the null is false…”, which is about as embarrassing as a brain misfire can get! I knew what I meant…grrr.]
Second, I may be running into reviewers who don’t understand the distinction between P-values and effect sizes. If you (mistakenly) believe that a small P-value indicates an important effect, then it would indeed be worrisome that P-values depend on sample sizes. But that’s not what P-values do. Running more simulations can make you more and more sure of an effect you’re seeing, but it won’t affect your estimate of how large that effect is.
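A quick sketch makes the distinction concrete (the means, sample sizes, and seed here are my choices):

#more simulations shrink the P-value, but not the estimated effect size
set.seed(42)
for (n in c(1000, 100000)) {
  X <- rnorm(n, 10)
  Y <- rnorm(n, 10.1)
  cat("n =", n, " estimated difference =", signif(mean(Y) - mean(X), 2),
      " P =", signif(t.test(X, Y, var.equal = TRUE)$p.value, 2), "\n")
}

Both runs should estimate a difference close to the true 0.1 (the bigger run just estimates it more precisely); it’s the P-value that shrinks dramatically with n.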
Third, I may be running into reviewers who subscribe to a common but strange philosophical position about inference. The reviewer’s objection makes perfect sense if they believe that the null hypothesis is always false – in which case the null-hypothesis case in my demonstration (black lines in the plot) is irrelevant and there’s no useful diagnosis to be had from running more simulations. I hear this objection a lot, and it’s always mystified me, because I can, with trivial ease, come up with a null that I’m 100% confident is true (astrology might be involved, or in simulations a throw-away variable that isn’t actually used to calculate results). If those examples seem silly, well, they’re meant to be. If you’re skeptical that more relevant nulls are ever true, please consult a longer exposition here.
Fourth, I may be running into reviewers who think simulation models are fundamentally different from experiments. They aren’t. An experiment – whether it’s in the lab or in the field – is always a simplified model of the real world. So is a simulation model. The only difference is that in a simulation model the connection between inputs and outputs involves electrons in chips (and is usually fully specified); in an experiment it involves neurons firing, or DNA replicating, or plant roots taking up nutrients, or whatever. Sure, you can run more simulations merely by changing a parameter and waiting longer, while running more experimental replicates may be harder (involving money, ethics, space, you name it). But in terms of how inference works there just isn’t any difference.
I suspect that the fourth of these lurks behind my reviewer’s objection, and here’s why. The first three explanations account for an objection to P-values – but not for an objection to P-values for simulations. In fact, the more-replicates-will-shrink-the-P-value objection applies to experiments just as much as it applies to simulations (if, that is, it applies at all – which it doesn’t).
So it’s strange. This repeated objection, upon just a little close examination, makes no sense at all – and it betrays one of four very peculiar beliefs about the universe. Or more likely, it simply betrays the lack of that “close examination”. Which maybe isn’t strange at all. Most of us – probably all of us – hold a few beliefs that would crumble rapidly under close examination. Yes, even scientists (and I describe one of my own here).
Why did I write this post? Well, I’ve seen the “you can’t use P-values with simulations” objection often enough that I’m pretty sure I’m going to see it again. When I do, my response can simply point here. If you find yourself in the same situation, yours can too.
© Stephen Heard May 7, 2019
Writing clearly about statistics is hard (which is one reason I admire Whitlock and Schluter so much). If something here is confusing, please let me know!
*^Most recently, it happened last month in the reviews of a paper that’s now in press in Conservation Biology. Until I can link to the definitive version, the preprint is here. In case you’re wondering: yes, that paper will be published with P-values for its simulations, after a Response to Reviews that ended up being a dry run for this blog post.
**^Unless you P-hack by using a stopping rule dependent on P, of course.
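If you’re curious how badly a P-dependent stopping rule inflates the false-positive rate, here’s a sketch (the batch size and cap are arbitrary choices of mine):

#optional stopping: add runs in batches and stop as soon as P < 0.05 (do not do this!)
set.seed(1)
falsepos <- replicate(1000, {
  x <- rnorm(10, 10); y <- rnorm(10, 10)  #the null is true
  while (t.test(x, y)$p.value >= 0.05 && length(x) < 500) {
    x <- c(x, rnorm(10, 10)); y <- c(y, rnorm(10, 10))  #keep adding batches of runs
  }
  t.test(x, y)$p.value < 0.05
})
mean(falsepos)  #noticeably above the nominal 0.05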
***^If you’d like to do this yourself, here’s the very simple R code:
#set the two group means (make them equal to simulate a true null)
mean1 <- 10
mean2 <- 10.1
#set a sample size (n per group, within each t-test)
samplesize <- 10000
#set a number of replicate t-tests
numtests <- 1000
#run the replicate t-tests, keeping each P-value
tps <- replicate(numtests, t.test(rnorm(samplesize, mean1), rnorm(samplesize, mean2), var.equal = TRUE)$p.value)
#quick-and-dirty plot of the P-value distribution
hist(tps)
write.csv(tps, file = '/data/hofalse-10000.csv')
Notice that I include quick-and-dirty plots, but I didn’t make the illustrative figure above in R. I didn’t have twelve hours to spare, so I used Excel and SigmaPlot (an actual graphing program) and made it in about 10 minutes. Don’t @ me.