Image: Reproducible bullseyes, by andresmp via publicdomainfiles.com. Public domain.
You read a lot about reproducibility these days. In psychology, it’s a full-blown crisis – or so people say. They may even be right, I suppose: while it’s tempting to dismiss this as melodrama, a surprising number of well-known results in psychology really do seem to be irreproducible. In turn, this has given rise to fervent calls for us to do “reproducible science”, which has two main elements. First, we’re asked to publish detailed and meticulous accounts of our methodologies and analytical procedures, so that someone else can replicate our experiments (or analyses). Second, we’re asked to actually undertake such replications. Only this way, we’re told, can we be sure that our results are reproducible and therefore robust*.
Being reproducible, though, makes a result robust in only one of two possible senses of the word, and I think it’s the less interesting of the two. What do I mean by that? Consider some conclusion we’ve drawn about the way the world works. That conclusion might be “narrowly robust” or “broadly robust”:
- A conclusion is “narrowly robust” if you can confirm it by precisely replicating the experiment that suggested it. That is, the result is reproducible as long as the experiment is done just so.
- A conclusion is “broadly robust” if you can confirm it by performing different experiments, to test the same hypothesis in slightly different circumstances, with different covariates and sources of noise and so on.
The reproducibility movement seems to be aimed almost entirely at making sure conclusions are narrowly robust. That’s all right, I guess; it’s just that I have a hard time getting very excited about effects that reproduce only if you do the experiments just so. Such effects just don’t seem likely to matter much in the real (noisy and complex) world. Or equivalently, hypotheses supported by narrowly robust results seem likely to be hypotheses of distinctly limited explanatory power. Results that are broadly robust, on the other hand, seem likely to reveal fundamental things about the world, things with reach, things that matter not just if you hold your tongue so and scrinch your left eye thus – but no matter who you are or what else is happening around you.
So why are we so focused on narrow robustness? I can think of three possibilities.
The first possibility is that it’s best to test all findings for narrow robustness – reproducibility – before trying to extend them to other circumstances. This sounds like a plodding, meticulous-but-slow way to work, but maybe in the long run it’s the most efficient way. It would certainly spare us getting all worked up over arsenic life or the hyperdilution memory of water**. We could see this as a workflow in which narrow robustness is a filter to reduce the number of results that require checking for broader robustness: if something won’t reproduce in a precise replication, there’s no point looking further. But while this might avoid false starts, there’s also a cost to meticulous replication: every old experiment we repeat means a new experiment we aren’t doing. How these costs balance is an empirical question, albeit one that could only be answered by setting up two completely independent scientific enterprises and watching both for a century or so. Twice, of course, so we could see if the effect of reproducibility is reproducible.
The second possibility is that it’s the narrow robustness of a particular finding that matters to us. This could be true, for example, in engineering, when we need to know whether this particular bridge will fall down, or in medicine, when we need to know whether this particular drug will shrink this particular tumour in this particular type of patient. There are, I bet, such narrowly important results in almost any branch of the sciences, and my relative lack of interest in them may just be a character flaw.
The third possibility is that sometimes, results that are only narrowly robust are the only shots we have at testing a hypothesis. Sometimes, Nature holds her cards very close to her chest (as seems to be the case with string theory). We might have no choice but to seek a genuinely fussy result, because only in a narrow set of circumstances is any result informative at all. How common is this situation? I have no idea.
What’s interesting to me is how often our discussions of reproducibility seem to skip any real engagement with these three quite different possibilities. Instead, it’s widely depicted as axiomatic that we should replicate every experiment, checking every result for narrow robustness before moving on. Perhaps it’s only my experience in a field (evolutionary ecology) in which true replications are very rare*** that makes me question this at all. But I think not. Much of science, for much of its history, has worked a different way: by looking for broad robustness via consilience of induction; by looking for effects that show up in many places; by looking for hypotheses that survive not only one repeated test, but different tests. This approach has certainly taken science down a few blind alleys, but most of them have been short, and it’s hard to see how it’s been a disaster.
© Stephen Heard (firstname.lastname@example.org) April 17, 2017
*We tell each other, and we teach undergraduates, that a properly-written Methods section should let someone else replicate our experiments. We’ve been telling each other this for nearly 400 years, largely without heeding our own advice. This is actually for the best, I think. Most readers would be done a real disservice by a Methods section that really was that detailed. Put the details in an online supplement, where most readers can conveniently ignore them.
**Remember that paper? It claimed that antiserum diluted too far to plausibly contain anything but water still retained properties of the molecules that were once there. This, of course, is how homeopathic medicine is supposed to work. Believe me, it doesn’t.
***Replication of entire experiments, I mean, as opposed to replication of experimental units within studies, which is of course a sine qua non.