Reproducibility and robustness

Image: Reproducible bullseyes, by andresmp. Public domain.

You read a lot about reproducibility these days.  In psychology, it’s a full-blown crisis – or so people say.  They may even be right, I suppose: while it’s tempting to dismiss this as melodrama, in fact a surprising number of well-known results in psychology do actually seem to be irreproducible.  In turn, this has given rise to fervent calls for us to do “reproducible science”, which has two main elements. First, we’re asked to publish detailed and meticulous accounts of methodologies and analytical procedures, so that someone else can replicate our experiments (or analyses).  Second, we’re asked to actually undertake such replications.  Only this way, we’re told, can we be sure that our results are reproducible and therefore robust*.

Being reproducible, though, makes a result robust in only one of two possible senses of the word, and I think it’s the less interesting of the two.  What do I mean by that?  Consider some conclusion we’ve drawn about the way the world works.  That conclusion might be “narrowly robust” or “broadly robust”:

  • A conclusion is “narrowly robust” if you can confirm it by precisely replicating the experiment that suggested it. That is, the result is reproducible as long as the experiment is done just so.
  • A conclusion is “broadly robust” if you can confirm it by performing different experiments, to test the same hypothesis in slightly different circumstances, with different covariates and sources of noise and so on.

The reproducibility movement seems to be aimed almost entirely at making sure conclusions are narrowly robust.  That’s all right, I guess; it’s just that I have a hard time getting very excited about effects that reproduce only if you do the experiments just so.  Such effects just don’t seem likely to matter much in the real (noisy and complex) world. Or equivalently, hypotheses supported by narrowly robust results seem likely to be hypotheses of distinctly limited explanatory power.  Results that are broadly robust, on the other hand, seem likely to reveal fundamental things about the world, things with reach, things that matter not just if you hold your tongue so and scrinch your left eye thus – but no matter who you are or what else is happening around you.

So why are we so focused on narrow robustness?  I can think of three possibilities.

The first possibility is that it’s best to test all findings for narrow robustness – reproducibility – before trying to extend them to other circumstances.  This sounds like a plodding, meticulous-but-slow way to work, but maybe in the long run it’s the most efficient way.  It would certainly spare us getting all worked up over arsenic life or the hyperdilution memory of water**.  We could see this as a workflow in which narrow robustness is a filter to reduce the number of results that require checking for broader robustness: if something won’t reproduce in a precise replication, there’s no point looking further.  But while this might avoid false starts, there’s also a cost to meticulous replication: every old experiment we repeat means a new experiment we aren’t doing.  How these costs balance is an empirical question, albeit one that could only be answered by setting up two completely independent scientific enterprises and watching both for a century or so. Twice, of course, so we could see if the effect of reproducibility is reproducible.
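To put rough numbers on the filter idea, here is a minimal sketch using Bayes' rule. Everything in it is an assumption made for illustration: the function name `posterior_true`, the share of tested hypotheses that are true (`prior`), the chance a study detects a real effect (`power`), and the false-positive rate (`alpha`). It also assumes the replication is independent of the original study and comes out positive. It speaks only to the benefit side of the ledger and says nothing about the new experiments forgone.

```python
def posterior_true(prior, power, alpha):
    """Probability a hypothesis is true after one more positive result (Bayes' rule)."""
    true_positive = prior * power            # hypothesis true and study detects it
    false_positive = (1.0 - prior) * alpha   # hypothesis false but study is positive anyway
    return true_positive / (true_positive + false_positive)


# Illustrative assumptions only; these are not estimates for any real field.
prior = 0.10   # assumed share of tested hypotheses that are actually true
power = 0.80   # assumed chance that a study detects a true effect
alpha = 0.05   # assumed false-positive rate per study

after_original = posterior_true(prior, power, alpha)
after_replication = posterior_true(after_original, power, alpha)

print(f"after the original study:   {after_original:.2f}")
print(f"after one good replication: {after_replication:.2f}")
```

Under these made-up numbers, a single positive result leaves roughly a one-in-three chance of a false start, and one successful precise replication removes most of that. Different assumptions would shift the balance, which is rather the point: the answer depends on quantities we mostly don't know.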

The second possibility is that it’s the narrow robustness of a particular finding that matters to us.  This could be true, for example, in engineering, when we need to know if this particular bridge will fall down, or in medicine, when we need to know if this particular drug will shrink this particular tumour in this particular type of patient.  There are, I bet, such narrowly important results in almost any branch of the sciences, and my relative disinterest in them may be just a character flaw.

The third possibility is that sometimes, results that are only narrowly robust are the only shots we have at testing a hypothesis.  Sometimes, Nature holds her cards very close to her chest (as seems to be the case with string theory).  We might have no other choice but to seek a genuinely fussy result, because only in a narrow set of circumstances is any result informative at all. How common is this situation?  I have no idea.

What’s interesting to me is how often our discussions of reproducibility seem to skip any real engagement with these three quite different possibilities.  Instead, it’s widely depicted as axiomatic that we should replicate every experiment, checking every result for narrow robustness before moving on.  Perhaps it’s only my experience in a field (evolutionary ecology) in which true replications are very rare*** that makes me question this at all.  But I think not.  Much of science, for much of its history, has worked a different way: by looking for broad robustness via consilience of induction; by looking for effects that show up in many places; by looking for hypotheses that survive not only one repeated test, but different tests.  This approach has certainly taken science down a few blind alleys, but most of them have been short, and it’s hard to see how it’s been a disaster.

© Stephen Heard, April 17, 2017

UPDATE: Dave Harris points me to this old blog post by Ed Yong about a failed replication in psychology.  It has arguments based on both narrow and broad robustness. It’s also completely fascinating.

*^We tell each other, and we teach undergraduates, that a properly written Methods section should let someone else replicate our experiments.  We’ve been telling each other this for nearly 400 years, largely without heeding our own advice.  This is actually for the best, I think. Most readers would be done a real disservice by a Methods section that really was that detailed.  Put the details in an online supplement, where most readers can conveniently ignore them.

**^Remember that paper?  It claimed that antiserum diluted too far to plausibly contain anything but water still retained properties of the molecules that were once there.  This, of course, is how homeopathic medicine is supposed to work.  Believe me, it doesn’t.

***^Replication of entire experiments, I mean, as opposed to replication of experimental units within studies, which is of course a sine qua non.


11 thoughts on “Reproducibility and robustness”

  1. David Mellor

    The reproducibility movement focuses on this definition of robustness not because it is more important than the broadly robust concept, but rather because it is important* and is conducted so infrequently.

    These narrowly robust, direct replications are definitely the mundane, boring, and unrewarding parts of the process of science. The tragedy is that only through direct replications can we have any real measure of the rigor of a claim before we start to do the more interesting and rewarding conceptual replications, which are broadly robust and help to generalize a finding into wider ecological contexts.

    The difficulty in funding and publishing direct replications** keeps us chasing zombie ideas for generations when we could instead be adequately focusing on both finding new testable hypotheses and confirming them through a priori hypothesis testing.

    *I initially wrote “equally important” but changed my mind. I haven’t had enough coffee yet to think through the relative importance of direct v. conceptual replications. But I can say that they are important, necessary, and almost never done.
    **If the results are positive, they are deemed too boring to publish. If they are negative, there is overwhelming incentive to attack methodology and experimenter competence. The result is substantial ignorance on the degree and scale of the problem.


    1. ScientistSeesSquirrel Post author

      Thanks, David. I think you are taking the position that direct replication is important because of my first possibility (narrow robustness as a filter). And you’ve put it very well. But what’s your argument for the total pace of progress being increased if we divert some fraction F of resources into direct replications? And what is the optimal value of F? Of course, I don’t really expect you to have solid answers to that – nobody does 🙂


      1. David Mellor

        The current size of F is €3 million*, which is pretty close to 0%. Also, I think a lot of direct replication attempts occur but fail and do not get circulated outside of lab meetings. An order of magnitude guess for an ideal frequency would be something between 1 and 10%, along with an assurance that results are disseminated regardless of outcome. In order to avoid the nefarious waste bin of the Journal of Null Results, methods should be peer reviewed before results are known, which addresses the strong biases that occur after seeing results. I’d also recommend the Pottery Barn Rule, “you break it you buy it”, where journals face an obligation to publish the results of direct replications of work that they originally published (given a large number of safety checks to assure the replications are faithful and of high quality).

        The result of this ecosystem would be more rigorous confirmation through the narrow robustness filter and less chasing down dead ends. Then, even if failed replications still become the “seed” for experimentation into a different context (the broader robustness filter), at least the experimenter goes into that journey with a more accurate assessment of the likelihood of success.



  2. Peter Apps

    In chemistry it is common (and often required) for methods to be tested for repeatability, reproducibility and robustness, to give confidence in results of measurements using those methods. Very little of that goes on in field biology.


  3. Nick Collins

    I enjoyed your thoughts on this topic, Stephen. I agree that seeking broad reproducibility is more appealing scientifically than more narrow approaches, and in the psychology literature it has become a dominant format for papers. A single paper usually consists of an experiment that yields an interesting result, plus a number of additional related experiments that seek to broaden the applicability and demonstrate the replicability of the general result of the first experiment. However, this approach has been criticized because the secondary experiments are being designed and performed by experimenters who are hoping for a particular result, and there are many experimenter degrees of freedom in the combined design and execution processes of the secondary experiments. That is, there are many opportunities for unconscious or conscious bias or HARKing to occur, and extensive analysis of the psychology and biomedical literature has demonstrated that these conditions will lead to biased results unless the experimenters invoke the most rigorous procedures, such as registering their studies in advance and insisting on double blind procedures. In short, one of the reasons we require more replication is that it’s a lot harder to produce truly unbiased work than we once thought. I think there is a much bigger proportion of false positives floating around the ecology/evolutionary biology literature than we are currently willing to believe, and excess false positives continue to be generated because we don’t really take seriously enough the possibility and aren’t willing to expend the effort needed to seriously evaluate it in these disciplines.


  4. Jeff Houlahan

    Hey Steve, I like your distinction between narrow and broad robustness – it seems similar to the idea of out-of-sample tests (same test and same context but new samples) versus transferability (same/similar test but new context and new samples). But I’m completely unconvinced by your argument that whether to test for narrow robustness before we check for broad robustness is an empirical question. The argument for among-experiment replication strikes me as similar to the one for within-experiment replication – and few scientists would argue that the need for within-experiment replication is an empirical question. For any question that matters (will this vehicle stay in the air, will this drug do more good than harm) an experiment would have to be replicated many times before we would ever accept the truth of a conclusion. That may be why we’re so comfortable with not worrying about among-experiment replication in ecology – nobody who matters gives a s**t.
    But even at a pedagogical level it concerns me – anybody that’s ever taught a course in introductory ecology has probably taught the Paramecium–Didinium story, or Huffaker’s predator-prey mites story, or Park’s flour beetle work…any idea if those results are reproducible? They are a big chunk of the core of ecology curriculum – any clue if we could get qualitatively the same result if we tried to reproduce the identical experiment? As an aside, several years ago I toyed with the idea of getting 8 or 10 researchers at different institutions to actually try and repeat some of these classic experiments just to see how reproducible they were. I even contacted a few folks – but I ended up getting distracted (I think I might have spotted a squirrel). It’s the kind of work that could probably be done as part of an Honours in a couple of semesters – it still seems like not a bad idea. In any case, it seems to me that narrow robustness has to be sorted out first. What I will give you is that what constitutes enough certainty about narrow robustness is going to be subjective – I’m reasonably sure we’re way below any defensible standard in ecology. Jeff


    1. ScientistSeesSquirrel Post author

      Nice squirrel reference! And I find myself very sympathetic to the idea of replicating Huffaker (e.g.).

      Re: whether the need for narrow replication is an empirical issue or not – I don’t see what else it can be. If we make a claim that X is the optimal way to do science, surely the tools of science are what we use to test that claim? And I would say that’s true for within-study replication, too. Power analysis is an empirical thing, based on estimates of error variance that either come from data, or are guesstimates based on related data. Returning to replicability – if the need for X amount of replicability isn’t an empirically testable claim, then it’s just people trumpeting opinions and the loudest one wins. Surely we don’t want that?

      Cheers, Jeff!


  5. Jeff Houlahan

    Fair enough, Steve. I’m a pretty devout empiricist so I’ll take back the comment that it’s not an empirical question. I guess my point is that the question – certainly, when we use power analysis – for within-study replication is “how many samples?”, not “do we need replication?” The question for among-study replication should be “how many studies?”, with the default – as for within-study replication – that it’s likely more than one. And I say the default is more than one because we do millions of studies every year and whatever we use as evidence for conclusions (p-values, AIC scores, kappa scores) there are a whole bunch of times that the data we’ve collected are going to mislead us about what is going on in the world, even if we don’t consider issues like p-hacking and publication bias.
    So, if you’re saying that the question of how often we should replicate studies is a critically important one and one we should figure out methods for answering then I’m on board. If you’re saying that it’s an open question as to whether among-study replication is necessary – I don’t think we need more data on that. In the same way that I don’t think we need more data to conclude that the vast majority of studies need an N > 1.
    And I get that an argument can be made that maybe science progresses best if we aren’t overly concerned about whether or not any particular study makes the right conclusion as long as there is a general tendency in the right direction. But even here, the underlying implication is that science is self-correcting…that somewhere along the way the research will get repeated and eventually our erroneous conclusions corrected. Because if a study that reached the wrong conclusion was never repeated the erroneous conclusion would be unlikely to be corrected.
    So, I don’t believe that the implication of your position is that among-study replication isn’t necessary…only that we shouldn’t do it in a considered and deliberate way. So, if we agree on two assumptions (and I’m not sure we do) – (1) that we won’t be able to do the experiment to compare directed among-study replication and undirected among-study replication, and (2) it is necessary to repeat studies to correct mistaken conclusions then I think there is a very strong argument for directed replication over undirected – aim for your target rather than wandering eventually towards it.


  6. Manu Saunders

    As a field ecologist, I’ve mostly assumed reproducibility meant the ‘broadly robust’ option you mention above. As you say, for field-based disciplines it is essentially impossible to reproduce an experiment exactly: even if you replicate a field experiment at the exact location and time of year, the system will have changed in some way since the previous experiment. So perhaps the obsession with narrow robustness is an artefact of the field/lab divide in life sciences?
    As an interesting side, reading this post made me realise that when I peer review papers with vague methods, I often find myself using the argument “to improve reproducibility/replication” when I ask authors to provide more detail, because it’s impossible to judge the validity of results if they haven’t explained clearly how they collected and analysed data. So I guess I am using the concept (rather than the actual practice) of reproducibility as a justification for clarity…


  7. Pingback: Friday links: the most famous ecologist, March for Science, and more | Dynamic Ecology

  8. Pingback: Originality is over-rated. (Including by me.) | Scientist Sees Squirrel
