Why do we mention stats software in our Methods?

Image: Excerpt from Heard et al. 1999, Mechanical abrasion and organic matter processing in an Iowa stream. Hydrobiologia 400:179-186.

Nearly every paper I’ve ever written includes a sentence something like this: “All statistical analyses were conducted in SAS version 8.02 (SAS Institute Inc., Cary, NC).”*  But I’m not quite sure why.

Why might any procedural detail get mentioned in the Methods?  There are several answers to that, with the most common being:

  1. because it allows someone else to replicate our experiment;
  2. because it establishes the authority of the work, as the credible product of an authentic scientist; or
  3. because it helps the reader understand the Results.

(There’s a 400-year history of hair-rending and teeth-gnashing over which of these matters most, and there’s a good case that we don’t act consistently with our beliefs about this.  I’ve explored the issue in an older blog post and in more depth in The Scientist’s Guide to Writing.)

Let’s apply these criteria. There’s a good reason you’ve never seen this sentence in a paper:

Data were recorded with a Pentel™ #2 0.5 mm mechanical pencil on quad-ruled 8.5 × 11” Rite-in-the-Rain™ paper.

Knowing this doesn’t meaningfully help someone replicate your experiment (although drawing a clear line under the list of details that are needed to “replicate” points out how peculiar our discourse can get around replication).  It won’t persuade anyone that our work should have authority, and it won’t help the reader understand the Results. So we keep these details to ourselves**.

How does our stats-software sentence fit with these possibilities? It certainly doesn’t help the reader understand the Results: a 2-way ANOVA is a 2-way ANOVA, whether it’s executed in R, SAS, SPSS, or (gasp) Excel.  There are probably readers for whom it can help establish authority – but there shouldn’t be.  These readers are the people who believe that if you aren’t using R (for example), you’re not an authentic scientist.  I hope we can all agree that nobody should believe this***.  Finally, what about replication?  This point is a little more nuanced, so let’s think about it a bit.

If I ran some “standard stats”, like a 2-way ANOVA, a principal components analysis, or a logistic regression, then you don’t need to know which software package I used in order to replicate my work.  Standard stats are like pencils and paper: I use them, but you don’t care that I did; if you want to replicate what I did, you’ll use your own and it won’t make any difference.  “Exotic stats” are different.  If I ran “exotic stats”, perhaps I invented a new test, or used a method recently published for which there remains some doubt about its performance or even correctness – BiSSE-class models, for example. Here someone really might get different answers using one R package versus another (for example); and so we really do need to report the stats software we used.  (By the way, one mark of good experimental design is that it puts the weight as much as possible on the simplest, and thus most standard, stats.)

Where’s the line between standard stats and exotic stats?  Ah, that’s a bit tricky – but we make judgements like this all the time with respect to other methods.  Does it matter what our vials were made of?  No, if we’re storing insect specimens for morphological measurement; but yes, if we’re storing them for analysis of cuticular hydrocarbons.  We’re smart; authors can decide, reviewers can question, and we can get this right.

But that isn’t what we do.  Instead, in ecology and evolution we seem to have a de facto standard that we report our stats software, even if we used nothing fancier than one-way ANOVA.  That’s weird.  Just as weird, in cell biology, the standard seems to be not to mention stats software.  Neither makes any sense.

So am I going to put my money where my mouth is, and stop bothering my reader with trivia like whether I calculated correlation coefficients in R or in Excel?  No, because there’s another completely different reason one might mention a stats package, and it’s one I can get behind.  Mentioning a stats package gives me an opportunity to cite it.  Citation has a number of functions, but the one I care about here is as a currency of appreciation.  When I cite a stats package, I thank its authors and give them a tiny but tangible reward (a CV boost).  This doesn’t matter to me for commercial packages, so if I do stats in SAS I don’t feel the need to cite for appreciation.  It does matter to me for software written by one of my colleagues as part of their contribution to science, so when I do stats in R I cite both base R and any add-on packages.  This practice may be bizarrely inconsistent seen through the lens of the function of the Methods, but it’s entirely logical seen through the lens of science as a community.  Getting the right lens in place makes all the difference.
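
(For R users, here’s what that looks like in practice – a minimal sketch, with lme4 standing in for whatever add-on packages you actually used; the exact references printed will depend on your installed versions:)

    # Recommended citation for base R itself
    citation()

    # Recommended citation for an add-on package (lme4 is just a stand-in example,
    # assuming it's installed; substitute the packages you actually used)
    citation("lme4")

    # Worth recording the exact versions you report, too
    R.version.string
    packageVersion("lme4")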

© Stephen Heard (sheard@unb.ca) November 7, 2016

This post is based, in part, on material from The Scientist’s Guide to Writing, my guidebook for scientific writers. You can learn more about it here.


*I’m trolling you a little.  These days most of my papers specify “R version something-or-other” instead. Trust me, really, I belong to the 21st century – even if it does cost me the occasional bottle of wine supplied to younger, R-hipper members of the lab.  Of course, my results are unaffected.

**Although I’ve reviewed manuscripts specifying the brand of calculator used to make calculations, the brand of -80 °C freezer used to preserve specimens, and the size of vial used to hold specimens preserved in ethanol.  I was as mystified as you surely are about each of these.

***Cults are cults (speaking of trolling), in science as in everything else. I once had a reviewer argue that my work wasn’t publishable because I did analyses using code I wrote in Microsoft Visual Basic™ – on the grounds that even though I provided the code, the compiler isn’t open-source. Sigh. This is not, of course, to say there’s anything wrong with R as a statistical tool; only that there’s nothing uniquely right with it, either.


22 thoughts on “Why do we mention stats software in our Methods?”

  1. Pavel Dodonov

    Nice post 🙂
    I just wish to emphasize a bit more the need to cite open-source software when we use it – after all, it’s made mostly through volunteer work or public funding. Not citing it would be like failing to cite a paper on which a study was based. Plus, the authors sometimes ask you to cite the software and provide the precise reference – this is the case for Past and R; apparently not for QGIS, which I’m never sure how to cite.
    I also usually mention the brand of measurement devices used, because different brands may have different errors… But I think it’s a fine line between what should and what should not be mentioned.

  2. Yolanda Morbey

    Hi Steve… I have found that even some pretty standard statistical analyses give different results in R & SAS (e.g., p-values for mixed models and standard output from a PCA come to mind). And I believe Type III sums of squares is a (wonderful) SAS thing… lm’s in R don’t give Type III SS by default; you have to code for them. This seemingly small difference can make a huge difference in interpretation. As for mixed models, I’m sure statistical experts understand why SAS & R would give different p-values, but it’s not easy for the statistical practitioner in biology to figure this out!

    1. ScientistSeesSquirrel Post author

      Yolanda – PCA differences may be due to different defaults about centering and scaling (just a guess based on some stuff one of my students was doing). Mixed models, good question… but your point about type III sums of squares is a good one. I’d say it really makes my point well: what matters is that you report whether you used type I, II, III, or IV sums of squares; but if you do that, it shouldn’t matter if you say which package you used. Subject, of course, to our being able to figure out where the line is between “standard” and “exotic” stats!! Which you are arguing is not trivial…
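
      (For concreteness, here’s a minimal sketch of one common way to get Type III sums of squares in R – assuming the car package is installed, and with hypothetical data and variable names:)

          # Minimal sketch: Type III sums of squares in R (car package assumed installed;
          # mydata, response, factorA, and factorB are hypothetical)
          options(contrasts = c("contr.sum", "contr.poly"))  # sum-to-zero contrasts, needed for sensible Type III tests
          library(car)
          m <- lm(response ~ factorA * factorB, data = mydata)
          Anova(m, type = "III")  # Type III sums of squares
          anova(m)                # base R default: sequential (Type I) sums of squares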

      1. Sam

        For mixed models, there’s a ton of different methods to estimate parameters, and SAS implements some that some R packages do not, and vice versa. I always appreciate when papers say whether they used, for example, Laplace estimation versus penalized quasi-likelihood, since in some cases it is important for replicating results. Same with stating how p-values are calculated (estimating degrees of freedom is tough in GLMMs!), if that’s something one wants to do.

        Bolker et al. (2009), “Generalized linear mixed models: a practical guide for ecology and evolution”, TREE, is a good overview of what some different GLMM software can do. I found it useful at least, and I’m certainly no statistician.
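
        (A minimal sketch of what I mean, in R – Laplace via lme4 and penalized quasi-likelihood via MASS; the data and variable names are hypothetical:)

            # Minimal sketch: two GLMM estimation routes (lme4 and MASS assumed installed;
            # mydata, survived, treatment, and site are hypothetical)
            library(lme4)
            library(MASS)
            # Laplace approximation (lme4's default, nAGQ = 1)
            m_laplace <- glmer(survived ~ treatment + (1 | site),
                               data = mydata, family = binomial, nAGQ = 1)
            # Penalized quasi-likelihood
            m_pql <- glmmPQL(survived ~ treatment, random = ~ 1 | site,
                             data = mydata, family = binomial)
            # Fixed-effect estimates (and especially the p-values) can differ between the two
            summary(m_laplace)
            summary(m_pql)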

        1. ScientistSeesSquirrel Post author

          Yes – agreed, I’d place mixed-model analyses in my “exotic stats” basket (which is probably not well named), precisely because there are a lot of ways to do them. (It’s an interesting question how often it makes a difference which way you pick – there are certainly datasets for which it matters, and others for which it doesn’t).

  3. T.J. Benson

    However, Type III sums of squares results in R are not necessarily equivalent to those reported by SAS unless you first change a few options in R. If you tell me the software as well as procedure/function, I’m much more likely to be able to figure out what you’ve done.

  4. Margaret Kosmala

    Hmmm… disagree a bit. Software is a black box to most ecologists. And there are almost certainly bugs in statistical software. Saying which you used, yes, increases replication, but also allows everyone to know which results are untrustworthy when bugs *are* discovered. I prefer to do as much of my stats “by hand” as possible, or else only use tried-and-tested stats packages that are generally trusted. It’s all about trust, and you can’t know if you trust someone’s methods (including stats software) unless you know what they did/used.

    1. ScientistSeesSquirrel Post author

      That’s an interesting point, Margaret, thanks! So, for example, if someone did a 2-way fixed-factor ANOVA “by hand”, you might know to be cautious about trusting their result! (Not sniping at you, I promise – but surely for many folks, that would introduce lots of possibilities for error).

      I’d analogize this to a digital caliper – also a “black box” to most ecologists. Do there exist makes of calipers that measure distances wrong? If yes, then it matters to report which one you used. If not, then it doesn’t. For stats: does commonly used software exist that does a t-test wrong? If so, then it matters that you report what you did; if not, it doesn’t.

      Seems like you’re essentially arguing that _all_ stats fall into my “exotic stats” bucket. Would you agree?

      1. Margaret Kosmala

        Yes, for sure about technical hardware. We end up trusting measurements based on the reputation of the companies that make them. I report that I got my digital calipers from Cheepest Everz Scientific and you might worry a bit about my data…

        Yes, I’d put all stats in the “exotic stats” bucket — not based on the stats themselves, but based on what I know about software development. There are bugs in even the most trusted software environments. No doubt about it. If you’re “just” doing a t-test and you report a reputable software package, I’m going to assume you got a reasonable answer (which assumes I believe you know how to use the software, understand its limitations, and extracted the proper numbers — all of which are big assumptions) — but not necessarily a correct one. But only because the t-test code has been well-tested and heavily exercised by many people, so if there’s a bug in it, the error is small enough not to affect scientific inference. I trust it enough.

        1. ScientistSeesSquirrel Post author

          This discussion (not just your comments, but more generally) is really interesting. Apparently I’m much more sanguine about technical hardware/software than many of my readers. Do we *really* have examples of where crappy calipers, or doing basic stats in an unusual program, led science astray? Am I just Pollyanna?

          1. Margaret Kosmala

            The real issue is that no one is checking most of the time. There is no incentive for testing scientific software. And we well know all the bugs that still ship with commercial software, even though there’s a massive monetary incentive to get them out.

            A quick search will find stories of bugs in scientific software that have been caught and that have an unknown, but potentially sizeable, impact on research results. Here’s one:
            http://www.sciencealert.com/a-bug-in-fmri-software-could-invalidate-decades-of-brain-research-scientists-discover

            1. ScientistSeesSquirrel Post author

              Ooh, yeah, I’d forgotten about the fMRI one. That one is huge! Although: squarely in the realm of “exotic stats”, I’d say. That’s not the same risk as finding a bug in Excel’s ANOVA routine (for example).

  5. Chris MacQuarrie (@CMacQuar)

    “Let’s apply these criteria. There’s a good reason you’ve never seen this sentence in a paper:

    Data were recorded with a Pentel™ #2 0.5 mm mechanical pencil on quad-ruled 8.5 × 11” Rite-in-the-Rain™ paper.

    Knowing this doesn’t meaningfully help someone replicate your experiment (although drawing a clear line under the list of details that are needed to “replicate” points out how peculiar our discourse can get around replication). It won’t persuade anyone that our work should have authority, and it won’t help the reader understand the Results. So we keep these details to ourselves”

    I’m not sure this is the correct analogy.

    Consider that we often do report the equipment that we use to collect data. This is because equipment can have biases or may not be appropriate to the task (for instance, if I reported that I used my bathroom scale to collect weights and reported values to 0.01 g, you might question my findings). It’s for these same reasons that we report the software we use to do analyses. Software has errors, biases, or may not be appropriate (as others have pointed out for the specific example of type 2 SS in SAS vs. R).

    The difference between reporting what pencil and paper I used and what software and equipment I used is that the primary source of error in the first case is me, the wielder of the pencil. With software and hardware I can be the primary source of error, but just as likely it may be the equipment. We report this because it helps us understand where errors and biases may have crept into the paper. Knowing this helps put the results into context.

  6. ScientistSeesSquirrel Post author

    I don’t disagree. But: I don’t report which brand of balance I used either, and for exactly the reason you suggest: I’m unaware of any suggestion that Mettler balances produce different results from Sartorius ones. (Agreed on the bathroom scale, though…)

      1. ScientistSeesSquirrel Post author

        Absolutely – but that has nothing to do with the make or model. You need to report “I measured mass to a precision of +/- X”, but I’d argue that nobody should care which model of balance you did it on. (I mean, if people aren’t going to believe you that your balance can do that, why would they believe you didn’t just make up your data out of whole cloth?)

        1. Peter Apps

          The reason that nobody should care what (reputable) manufacturer’s balance you used is that laboratory balances are calibrated against standards that trace all the way up to the international standard kilogram, and the sources of uncertainty in weighing are very well understood. A current calibration is more important than the make, unless check weighings on standard mass pieces were done as part of the weighing process.

          Since the conditions in the lab are known to contribute to weighing uncertainty, there are strict guidelines on e.g. temperature control and bench stability, and for critical weighings the location of the balance is part of the calibration certificate. Operator error is always the elephant in the room, and so-called “blunders” are not even included in uncertainty budgets. To try to keep operator error under adequate control, laboratory accreditation in e.g. analytical chemistry involves both method validation and requirements for personnel skills, demonstrated by audits of actual procedures and vertical audits from sample to reported result. Check weighings give additional assurance that the weighing results are accurate.

          Volumes are much more difficult to measure properly than masses, and much more vulnerable to operator error.

          Rather than the make of the hardware, it makes more sense to report that balances and pipettes were calibrated (and how) and what the repeatability on 5 replicate samples was – but I hardly ever see that in papers.

  7. Peter Apps

    I think that a large part of the explanation is herd behaviour – stats packages get named in papers because they get named in other papers; it has become part of the culture that nobody (until this post) gives much thought to – a bit like the convoluted strings of long words that pass for scientific writing. A simple experiment would be to leave out the mention of the package and see whether the referees ask for it to be included.

    1. Ben Bolker

      FWIW the second link you cite talks about issues with floating-point numbers (i.e., any non-integer numbers) that are common to practically *all* computer systems (SAS, R, etc etc), not just Excel …

  8. Mike Koontz

    Thanks for the post! I totally agree that more credit is due to colleagues who develop software/analysis pipelines/data products that contribute to science and influence the way we develop and advance our understanding of a particular facet of it.

    One part stuck out to me that I wasn’t sold on:

    “This doesn’t matter to me for commercial packages, so if I do stats in SAS I don’t feel the need to cite for appreciation.”

    Why doesn’t this matter for commercial packages? The statisticians who work for SAS Institute Inc. are still scientists contributing to science, even though they aren’t in academia. Citations for the work they do may not affect the reward structure in their domain, but isn’t the ultimate motivation to give credit where it’s due, not just where appreciation has historically been lacking?

    Or perhaps I’m reading too much into it and you’d still cite the commercial package, but the motivation wouldn’t primarily be the need for the software to be more broadly appreciated?

    Thanks for your thoughts!

    1. ScientistSeesSquirrel Post author

      My rationale is that (current) SAS development is essentially work for hire. I don’t feel the need to cite the developers of Microsoft Word. So yes, I agree that SAS developers contribute to science; but they do so as a paid technical service. BUT: as I write this I’m aware that my own papers are in some sense work for hire too, since I’m also paid. This feels different to me, but I’m having some trouble expressing why.
