This semester, I’m coteaching a graduate/advanced-undergraduate level course in biostatistics and experimental design. This is my lecture on how to present statistical results when writing up a study. It’s a topic I’ve written about before, and what I presented in class draws on several older blog posts here at Scientist Sees Squirrel. However, I thought it would be useful to pull this together into a single (longish) post, with my slides to illustrate it. If you’d like to use any of these slides, here’s the PowerPoint – licensed CC BY-NC 4.0.
(Portuguese translation here, for those who prefer.)
How should you present statistical results, in a scientific paper?
Well, to start with, there are actually two different things we might mean by “presenting statistical results” – presenting data, or presenting summary statistics, test statistics, P values, and the like. In this post I’ll largely limit myself to the latter. For coverage of the former, see Chapter 12 of my book, The Scientist’s Guide to Writing; or better (of course!) Edward Tufte’s The Visual Display of Quantitative Information.
So imagine that you’ve done an experiment – on the slide above, a simple one comparing the density of cabbage looper (a damaging caterpillar) on plots of kale either sprayed with a new biological insecticide or left unsprayed as a control*. We want to know if the spray is effective, so we compare looper densities between treatments using a Welch’s t-test. We can generate a whole pile of numbers in doing so – some of them, although by no means all, are shown in the box at the lower left of the slide. What do we do with them? I’ll break that question down into six smaller ones. The key to answering each of them is the same thing that’s key to answering all questions about writing: what does the reader need to understand and accept the story your paper is telling?
- Which numbers do you present?
There are a lot of numbers you could present, to communicate the results of your test. Fortunately, for each common statistical test, there’s usually a consensus practice.
Assuming we’re dealing with a null-hypothesis-testing approach**, the consensus will usually include a test statistic (in our case, t); the degrees of freedom (in our case, 9); a P value (in our case, 0.022); and some measure of effect size (in our case, 63 vs. 84 loopers). So we might write “there were 25% fewer loopers on sprayed kale (t(9) = 2.77, P = 0.022)”.
That’s for our t-test; if we had done an ANOVA or a regression or any other test, there would be equivalents (see the slide above). Notice, by the way, that the omission of the effect size is a common but unfortunate mistake.
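As a minimal sketch of how those four pieces can be assembled into one sentence, here is a small Python helper. The function name `report_t_test` and its parameter names are hypothetical, invented for illustration; the sketch also assumes that 63 is the sprayed mean and 84 the control mean. The percentage is computed from the two means rather than typed by hand, which keeps the effect size in the sentence consistent with the numbers behind it.

```python
def report_t_test(t, df, p, control_mean, treated_mean):
    """Build a pattern-first sentence reporting effect size, t, df, and P.

    Hypothetical helper for illustration only; not from any statistics library.
    """
    # Effect size expressed as the percent reduction relative to the control mean.
    pct_fewer = 100 * (control_mean - treated_mean) / control_mean
    return (
        f"there were {pct_fewer:.0f}% fewer loopers on sprayed kale "
        f"(t({df}) = {t:.2f}, P = {p:.3f})"
    )


# Values from the kale example; which mean belongs to which treatment is an assumption.
print(report_t_test(t=2.77, df=9, p=0.022, control_mean=84, treated_mean=63))
# prints: there were 25% fewer loopers on sprayed kale (t(9) = 2.77, P = 0.022)
```

The sentence leads with the biological pattern and tucks the statistics into parentheses, so the effect size can't be accidentally left out.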
- Where do you present these numbers?
When you have to decide where in your paper to present the statistical numbers, you’ll have four main choices: in the text, in a figure, in a table, or in an online supplement.
Placement in text works well with the simplest statistics (as on the slide above – original paper here). But only with the simplest statistics, because as sentences become more and more liberally festooned with test statistics and P-values, they get harder and harder to read – and “harder” reaches “impossible” long before we run out of statistical sophistication.
For moderately complicated stats, you might consider placing the relevant numbers right on a figure – as I did in the slide above (original paper here). Of course, figures get cluttered too, so be careful.
For the most complex statistics, tables tend to be most effective. The one on the slide above reports multiple ANCOVAs in a reasonably compact way (original paper here). Note the usual caveats about tables, though: publishers hate them (they’re expensive to typeset), and readers hate them too (if they’re not designed carefully, they’re difficult to read). They’re a necessary evil, but like all necessary evils, they should be indulged in with moderation.
Finally, what about the online supplement? Those have become so routine and so easy to include with a paper that it’s tempting to shovel every number you’ve ever generated into one. I think the key to understanding the online supplement is the fact that almost nobody ever reads one. (Yes, I know some people read some online supplements; hence the word “almost”. But I would bet money that the average online supplement is read by less than 0.1% of the readers of its paper.) So, use online supplements for stats that most readers don’t need, but a very few might want.
- Stats first or pattern first?
A common mistake is to think your reader cares about the statistics more than they care about the biology. This mistake accounts for awful sentences (and I’ve written them!) like “The Welch’s t-test produced significant results (t = 2.77, df = 9, P = 0.022); see Figure 1”. That sentence tells the reader nothing of interest other than that there’s some kind of pattern, and that for some reason you think it’s the reader’s job to figure out what it is. It isn’t! So consider instead one of the stronger alternatives on this slide:
People often claim that “the data speak for themselves”. Maybe in one sense, they do; but that sense is not a useful one when you’re writing a scientific paper. Respect your reader by leading them through the story you want to tell. Don’t worry: a reader who wants to remain critical of your interpretation will have no problem doing so.
- “P < 0.05” or “P = 0.022”?
Let’s say your analysis produces a result with P = 0.022. Do you report it that way, or report “P < 0.05”? A decision to do the latter is based on a line-in-the-sand, or “absolutist”, philosophy of statistical inference. Under that philosophy, one should set a significance criterion α before beginning analysis, and then care only about whether the obtained P value is larger or smaller than α. That absolutist philosophy isn’t stupid, but it’s also not the only one. Plenty of your readers will believe that P = 0.022 and P = 0.00000022 tell us different things (using a “continualist” or “strength-of-evidence” philosophy). (More about this later, in “6. What about P = 0.051?”) Either way, I’d report the exact P value, for three reasons:
- P values can be used in meta-analysis (this post explains how, using Fisher’s method for combining P-values).
- Even a line-in-the-sand reader may prefer a different line in the sand than you do; providing the exact P value accommodates any reader’s choice of α.
- A line-in-the-sand reader can always ignore P’s exact value, but a strength-of-evidence reader can’t magically reconstitute it if you’ve thrown it away.
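To make the meta-analysis point concrete, here is a minimal sketch of Fisher’s method in Python. It assumes the P values come from independent tests of the same null hypothesis. The function name `fisher_combined_p` is hypothetical, and the three P values in the usage line are invented for illustration, not taken from any real study. Fisher’s statistic is −2 Σ ln(P), which follows a chi-square distribution with 2k degrees of freedom under the null; because 2k is always even, the upper tail has a closed form and no statistics library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (illustrative sketch)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P value")
    if any(not (0 < p <= 1) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    # Fisher's statistic is X = -2 * sum(ln p); we work with X / 2 directly.
    half_x = -sum(math.log(p) for p in p_values)
    # Chi-square upper tail for even df = 2k:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    tail = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail


# Hypothetical exact P values from three independent studies.
print(fisher_combined_p([0.022, 0.051, 0.30]))
```

None of this works if the papers being combined reported only “P < 0.05” or “n.s.”: the calculation needs each exact value.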
- “P = 0.022” or “P = 0.022823511”?
OK, you should report the exact value of P – but how exact? P = 0.022823511? No – and this is a special case of the more general principle of significant digits.
You wouldn’t report a seed mass to 8 digits, so why would you report the P value for a seed mass comparison to 8 digits? Statistical software often reports all those digits, making it tempting to cut-and-paste, but don’t. I’ve explored this issue in more detail elsewhere, discussing the two relevant issues of “data-significant digits” and “reader-significant digits”; but in general, 2 or 3 digits (not decimal places, digits) should normally be plenty for test statistics and P values.
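The distinction between digits and decimal places is easy to get wrong when trimming software output, so here is a minimal Python sketch of rounding to significant digits. The helper name `round_sig` is hypothetical. It leans on scientific-notation formatting, which keeps a fixed number of significant digits however small the value is.

```python
def round_sig(x, digits=2):
    """Round x to `digits` significant digits (not decimal places)."""
    if x == 0:
        return 0.0
    # "e" formatting keeps one digit before the point plus (digits - 1) after it.
    return float(f"{x:.{digits - 1}e}")


print(round_sig(0.022823511))      # P value from software output, 2 significant digits
print(round_sig(2.7712, 3))        # a test statistic, 3 significant digits
print(round_sig(0.00000022, 2))    # a very small P value keeps both of its digits
```

Rounding to a fixed number of decimal places would instead turn that last value into 0.00, throwing away exactly the information a strength-of-evidence reader wants.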
- What about P = 0.051?
And finally, the question that seems to provoke stronger (and more ill-informed) reactions than almost anything else in statistics. How do you report a statistical test that yielded P = 0.051?
The world seems to be divided into two kinds of people: those who are comfortable describing such a result as “nearly significant” (or something similar), and those who react to such phrasing with horror and smugly certain belief in their superior statistical virtue. But the only way to have that smugly certain belief is to be unaware of a lot of statistical history and philosophy. I’ve developed this in detail elsewhere, so only a brief summary here.
This is where the two alternative philosophies of P-value testing come in. To an absolutist, P = 0.051 means the same as P = 0.851, and both should simply be declared nonsignificant. But to a continualist, P = 0.051 suggests stronger evidence against the null than P = 0.851; perhaps even enough evidence to find interesting.
The absolutist view suits statistical process control well (when you’re testing samples from batches of potato chips off a production line, you have to either package or trash each batch; continuous measures of strength-of-evidence are of no use). It also lines up easily with Popper, falsifiability, and strong inference, at least for those who don’t think too carefully about it.
But there’s in fact no reason that the absolutist view is a better way to think about statistical inference at the level of a single experiment – and arguably, it’s ill-suited to that function. After all, P = 0.049 and P = 0.051 are really not different outcomes to an experiment, either logically (both show data in moderate discord with the null hypothesis) or statistically (P values have uncertainty, and are unlikely to be precise enough that we can separate values so close). This argument is developed more fully here; it is also consistent with the American Statistical Association’s Statement on the P Value.
So: P = 0.051? Go ahead and describe it as “marginally significant”, or words to that effect, and know that your practice has impeccable philosophical foundations. That won’t stop reviewers from objecting anyway, of course. (Feel free to cite this blog post in your rebuttal letter.)
Still with me? We’re done. Well, almost. First, one slightly meta thought. I’ve taken three statistics courses in my career, and have reviewed syllabi and curricula for quite a few more. Not one has had an explicit module in how to write about statistical results. Isn’t that peculiar? What else do statistics instructors think their students will do with the analyses they run?
© Stephen Heard October 2, 2018
This post is based in part on material from The Scientist’s Guide to Writing, my guidebook for scientific writing.
*Unpopular opinion: actually, the best thing that could possibly happen to a plot of kale is for it to be devastated by cabbage looper attack. Why on Earth do humans eat the stuff? It’s like grass clippings coated in Bitrex.
**Other approaches, such as model-selection or Bayesian techniques or strictly descriptive statistics, will have different consensuses.