Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (book review)

Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, by Deborah G. Mayo.  Cambridge University Press, 2018.

If there’s one thing we can all agree on about statistics, it’s that there are very few things we all agree on about statistics.  The “statistics wars” that Deborah Mayo would like to help us get beyond have been with us for a long time; in fact, the battlefield and the armies shift but they’ve been raging from the very beginning.  Is inference about confidence in a single result or about long-term error rates?  Is the P-value essential to scientific inference or a disastrous red herring holding science back?  Does model selection do something fundamentally different from null-hypothesis significance testing (NHST), and if so, what?  If we use NHST, is the phrase “nearly significant” evidence of sophisticated statistical philosophy or evil wishful thinking?  Is Bayesian inference irredeemably subjective or the only way to convert data into evidence?  These issues and more seem to generate remarkable amounts of heat – sometimes (as with Basic and Applied Social Psychology’s banning of the P-value) enough heat to seem like scorched-earth warfare*.

In Statistical Inference as Severe Testing (henceforth, “SIST”) Mayo attempts the Herculean (Solomonesque?) task of making it clear what each battle is actually about, what light can be shed by a historical perspective, and how some philosophical thinking can let opposing camps talk with each other rather than past each other.  Mayo is a philosopher of statistics, or more broadly of science, and so it makes sense that if I had to sum up the book in a single sentence, it would be “OK, everyone, what exactly are we saying when we say X about statistics?”.  I find it fascinating (and frustrating) that people often hold strong opinions about statistics that don’t seem to be based on careful thinking – in particular, careful thinking about the basic logic of what statistics are doing and what we want them to do.  It’s clearly also fascinating and frustrating to Mayo, as a philosopher of science; after all, “careful thinking about logic” is pretty much what the philosophy of science is.

What, by the way, is the “severe testing” of the title?  Mayo argues that we want statistical methods that test hypotheses severely; and by that, she means:

If [a claim] C passes a test that was highly capable of finding flaws or discrepancies from C [if they existed], and yet none or few are found, then the passing result is…evidence for C.

Just as one example, a conventional NHST test is a severe test under a set of important conditions (assumptions met, sample size set before experiment, only a single pre-decided test run, etc.) but not otherwise (given p-hacking, etc.).  But the same lens can be used to examine any kind of inferential tool.
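To make the “single pre-decided test” condition concrete, here is a minimal sketch (mine, not taken from SIST) of how quickly a test stops being severe when an analyst runs several tests and reports only the best one.  It assumes every null hypothesis is true, that the tests are independent, and that each P-value is therefore uniform on (0, 1); the function names are hypothetical.

```python
import random


def false_positive_rate_analytic(alpha: float, n_tests: int) -> float:
    """Chance that at least one of n_tests independent tests of true nulls gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def false_positive_rate_simulated(alpha: float, n_tests: int,
                                  n_experiments: int = 20000, seed: int = 1) -> float:
    """Estimate the same chance by simulating an analyst who reports only the smallest P-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Assumption: under a true null, each P-value is an independent Uniform(0, 1) draw.
        smallest_p = min(rng.random() for _ in range(n_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k,
              round(false_positive_rate_analytic(0.05, k), 3),
              round(false_positive_rate_simulated(0.05, k), 3))
```

With a threshold of 0.05, one pre-decided test errs 5% of the time when the null is true, but picking the best of 20 tests yields a “significant” result about 64% of the time.  A passing result from that procedure had little capacity to reveal that the claim was wrong, which is why it would not count as severe.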

Why do we need severe testing?  Because we want to find things out (we want to test claims), but we’re far too good, as humans, at seeing pattern where none exists (apophenia).  Mayo works hard to ask which statistical approaches achieve severe testing, and how.  Those who know her past work will not be surprised to find a spirited defence of NHST methods, but there’s much more to the book than that.  In SIST, severe testing turns out to help not just practically but also with some deep philosophical problems, such as the annoying fact that there’s no non-circular justification for induction.

SIST begins with basics – Fisher’s lady tasting tea, and some simple but pernicious ways that people corrupt statistical testing.  Mayo covers a lot of ground, though, moving through Bayesianism (in its several flavours), problems of induction, reproducibility [her discussion of the “reproducibility crisis” in psychology (pp 97ff) is excellent], objectivity and subjectivity, most-published-results-are-false, Popper, power, bias, and more.  There are some thought-provoking surprises along the way; for instance, Mayo’s argument that “it is rare for any interesting scientific hypothesis to be logically falsifiable” (p 60; and take that, shallow readings of Popper!).

Much of SIST requires, but also rewards, careful reading.  Partly that’s because a lot of philosophical thinking has gone into statistics, and a lot of that thinking is sophisticated and subtle.  (That doesn’t mean that teaching or using basic statistics has to be hard.)  It’s also because, as a statistician and philosopher, Mayo knows her ground very, very well, and she sometimes assumes that her reader does too.  But at her best, she forces a careful reader to think as rigorously and clearly as she does.  A less careful reader may skim many sections, but will still gain a lot.

Mayo’s real-world examples are both entertaining and instructive, even for the reader who skims.  Some show inference at its best: Einstein’s 1919 light-bending predictions, or the lady tasting tea**.  Others show it at its worst, and are entertaining in a head-desk kind of way: for instance, the Duke University personalized-medicine study (p 6) or the TSA’s screening performance at airports (p 363).

Given Mayo’s ambitious goals, it’s not surprising if SIST falls short in a few places.  Mayo’s text can be clear and engaging; sometimes, though, sentences sag under the combined weight of their philosophical freight and their structure – this one, for instance:

You might say that even if some people deny that selection effects actually alter the “evidence”, the question of whether they can be ignored in interpreting data in legal or policy setting is not open for debate. 

 I’m carping, though, around the edges, because I’m not sure one can write a book that interrogates statistics and philosophy of science so thoroughly without having material that requires careful reading.  We’re asking for careful thought, after all; that’s the whole point.

At a broader scale, Mayo’s book uses a somewhat odd structure, with sections dubbed “excursions”, “tours”, “exhibits”, and “souvenirs”.  This museum metaphor seemed strained to me, and not very helpful, although I’ll acknowledge that no structure is likely optimal for everyone.

There’s reward in SIST for the reader who skims, or who dives deep in just occasional sections.  There’s more reward for a reader who engages thoroughly and carefully, even if that’s not a small task.  This, of course, is as it should be.

Should you read SIST?  Yes, if you want to “get beyond the statistics wars”, and if, to do that, you’re willing to do some careful thinking about what a statistical analysis is doing and what we want it to be doing.  No, if you’re in the crowd that prefers incensed sloganry along the lines of “NHST makes no sense because all nulls are false anyway” (one of many canards that SIST tackles).  But if you’re in that crowd, you aren’t still reading this review anyway.

 © Stephen Heard  November 13 2018

You can find some teasers from SIST on Mayo’s blog, Error Statistics Philosophy.  



*Although by now I’m starting to tire of the military metaphor.  I’ll try to stop.

**It’s interesting that my two favourite examples of ingenious inference lie at opposite ends of the exotic-mundane continuum.  But that’s the fundamental cleverness of statistics, in a way.


16 thoughts on “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (book review)”

  1. Jeremy Fox

    I am jealous that you’ve read this book already. It’s on my list.

    Curious how you found the bits about subjective Bayesianism. Did you find it relevant to you as a scientist, or did it come off more as an intra-philosophy fight disconnected from science? My impression is that subjective Bayesianism has mostly only ever attracted philosophers, not scientists or statisticians. In practice, even scientists who *say* they’re subjective Bayesians (e.g., Jim Clark, if memory serves) don’t really *act* like subjective Bayesians when it comes to doing and interpreting statistics. I find Mayo’s critiques of subjective Bayesianism very interesting, but not because I feel like they have any practical connection to real-world scientific practice. I recall back in the mid-90s that Brian Dennis was very worried about subjective Bayesianism infiltrating ecology, but I don’t think it ever happened (and not because Dennis’ polemics prevented it). It’s only other sorts of Bayesianism that have been attractive to ecologists (https://dynamicecology.wordpress.com/2013/06/19/why-saying-you-are-a-bayesian-is-a-low-information-statement/).

    Also curious if/how Mayo’s book affected your thinking on widely-discussed statistical issues of the moment. The replication crisis in psychology, “most published research findings are false”, etc.

    Unanswerable questions: how good are ecologists at severe testing? And what are the biggest obstacles to us becoming better at it?

    1. ScientistSeesSquirrel Post author

      Have to admit I don’t have a useful position on subjective Bayesianism. I found myself skimming those bits of the book, and maybe that’s your answer: for me the frequentist-subjectivist debate (whether Bayesian or not) is interesting but doesn’t have much bearing on what I actually do as a scientist – I think.

      On replication, most-published-results-are-wrong, etc: if I’m the reader, Mayo is preaching to the choir on these issues, so no, I don’t think they moved my opinion. Which makes me worry about confirmation bias, of course!

    2. Mayo

      You’re referring to EGEK (in what you say about subjective Bayesianism). That book was mainly focused on how philosophers of science use statistics (to represent scientific inference, to solve problems about evidence, to perform a meta-methodological critique, e.g., does the novelty of evidence matter?). But the discussion of statistics in this new book (Statinfasst) grows directly from statistical practice and the existing statistics wars. Only Excursion 2 deals with philosophical approaches. There are still cases of subjective Bayesians in practice (Kadane, O’Hagan, Lindley, Dawid are some that arise in Statinfasst). Most Bayesians are either default/reference/non-subjective or murky mixtures of subjective, default, and empirical Bayesian. That’s one reason it’s so hard to interpret any resulting posteriors. However, it’s an important reason that the situation is so unsatisfactory: the main selling points of today’s Bayesian approaches are in tension with each other: the prior is purported to capture background info while at the same time being as data dominant as possible.

  2. Emilie Champagne (@MissEmilieC)

    Now I want to read it, but I’m a bit scared about what you said regarding the structure of sentences (for example, I really don’t understand the sentence you cited). I still have trouble understanding complex writing in English (like James Joyce’s Ulysses…I tried, but I failed). Would you recommend the book to an ESL reader with a good level of proficiency in reading?

      1. Mayo

        I gave a lot of attention to the writing in this book. I completely rewrote the book after a third draft. I devised an entirely new way to frame and approach the issues. Precisely why I chose this way is something that I will not delve into fully now, but I think readers who are aware of just how unusual and even pathological some of these debates are (hence, calling for chutzpah) will reap the benefits, even if they scratch their heads: why would she do something so strange as suggest we’re on one of those intellectual cruises?

        Now I’m the first to say that despite the agony of writing, I would have wished for a copy-editor who was more intrusive. (Someone like you.) But I don’t think it’s fair to grab a sentence out of context (on p. 271) and point fingers at it. Remember that from the very, very first excursion (Tour I & II), the idea that is front and center is that many accounts of inference distinguish the “evidence” that data x afford a claim–whether the claim resulted from multiple testing, optional stopping or other selection effects–from subsequent interpretations, or actions taken as a result. Logics of evidence, and accounts that follow the Likelihood Principle, make this distinction (e.g., pp. 48-9). The sentences before and after the one you point out, the entire imaginary trial of Dr. Hack in this Tour, are about this distinction. The error statistician, and surely the severe tester (a proper subset), thinks that selection effects DO matter to the evidence.

        You might think that no one could get away with insisting that selection effects do not matter when it comes to legal and policy settings. That’s the gist of the unhappy sentence.

        I’m describing what some people say: don’t worry, they would never get away with ignoring multiplicity or trying and trying again when it comes to policy. There are statutes the FDA and other agencies have in place that require taking such things into account. That is what the next several sentences say. An example from the reference manual for lawyers follows. Then:
        “Nevertheless statutes can be changed if their rationale is overturned.” Then Note 3 on the same page gives an example from the Supreme Court in which some argue that selection effects do not require any adjustments, and you should be free to explore and use p-values in just the same way as with preregistered trials.

        So, having said all that, please tell me how you would rewrite my sentence. Surely I can think of improvements, but in context, and given all that has happened in the 271 pages before, it’s not that unclear. Is it?

    1. Pavel Dodonov

      Emilie: based on my own experience – in order to fully understand complex sentences, you have to read things with complex sentences in them 🙂 You may have to read some of them ten times to understand, but it’s part of the process. And I don’t think this is different for people who have English as their first or as their second language – I’m pretty sure a lot of native speakers would fail in reading Joyce (and perhaps even Tolkien or Frank Herbert).

      PS: My copy of this book is supposed to arrive today! Yey

      1. Emilie Champagne (@MissEmilieC)

        Definitely, practice makes perfect. But I have to say that I hesitate because I’ll have to buy it. Borrowing a book from the library and not being able to understand it is one thing, but when it takes a part of your book budget, it’s another thing. I might still try that one!

        1. Mayo

          There are some book groups on twitter, and I’m more or less following their pace. https://errorstatistics.com/2018/11/08/souvenir-c-a-severe-testers-translation-guide-excursion-1-tour-ii/
          Form a book group and collect your questions and send them to me weekly. If you care about these issues, the current challenges to science, and confusions about statistics, it’s worth your time. Check my blog for excerpts. We worked hard to keep the paperback under $30. If $ is keeping you from getting it, send me your address by email and I’ll send you a copy of the book for free.

      2. Mayo

        Pavel: As I just explained in a long note, I think context and what came before should prevent having to reread the same sentence over and over. When you get your copy, begin with Excursion 1, both parts. Then you can skip around a bit. Please also survey the notes and mementos I put on my blog, errorstatistics.com. Write to me with questions or place them in comments on the blog.

    2. Mayo

      Don’t be scared. Take a look at my blogpost which links to all of the first Tour (chapter): https://errorstatistics.com/2018/09/08/excursion-1-tour-i-beyond-probabilism-and-performance-severity-requirement/
      The book, in addition to being an intellectual journey, is a resource. Once you get through the first 3 excursions (parts), say, you may look up the particular problems you’re interested in. You’ll find a wealth of core references explained right in front of you.

  3. Pingback: Friday links: a remarkable cv, why be an EiC, Andrew Gelman vs. Richard Levins, and more | Dynamic Ecology

  4. Pingback: 15th century technology and our disdain for “nearly significant” | Scientist Sees Squirrel
