Good uses for fake data (part 2)

In Good uses for fake data (part 1), I expounded on the virtues of fake – or “toy” – datasets for understanding statistical analyses. But that’s not the only good use for fake data. Fake data (this time, maybe a better term would be “model data”) can also be extremely useful in planning and writing up research. Once again, let me assure you that I’m – of course – not advocating data fakery for publication! Instead, fake data can help you think through how you’re going to present and interpret results of an experiment or an analysis (or perhaps, even if you can interpret them), before you actually spend effort getting data in hand.

I’d put this in the context of “early writing”, which is a strategy that interweaves the writing of science with the doing of science. The alternative is doing the science first and writing it up when it’s “done”, an approach that always seemed so obvious to me that I never thought to question it. Early writing makes writing easier, and can help you spot problems with your work’s design before it’s too late.

The advantages of early writing are obvious enough with a Methods section – there’s no easier time to write up your methods than when you’re planning the work, or (nearly as good) when you’re executing it. And seeing what you plan down on paper can alert you to gaps in your logic, or other problems, while there’s still time to make changes*. But other sections can be written early too – for example, the Introduction and Discussion while you’re reading literature and thinking about knowledge gaps in order to plan and justify your study. If you can’t sell your study to readers (an important function of both sections), it’s best to know that before, not after, doing the work.

Have I wandered away from fake data? Perhaps, but I’m circling back. What about the Results section? Surely you can’t write the Results section before you have any results? Actually, even large portions of a Results section can be drafted early, if you use some fake data (or model data) to make mock-ups of tables and figures. Doing this before you’ve taken a single datum is an excellent way to test-drive the design of your study, and to discover design features that might make data analysis or presentation needlessly complicated. Similarly, you can learn whether the results you expect can let you tell a convincing story. You’ll want these fake data to be pretty realistic – you might even make mock-ups of raw data and take them through all the steps of analysis and data display that you plan. Doing this gives you a Results section that you can judge as effective (or not), and when you’ve gathered the real data, you can cut-and-paste them into the mock-ups (a huge time-saver).

Perhaps you’re still uneasy with the notion of fake data – but remember you’re using them only for the mock-ups and the test-drive and they’ll never see the light of day. It’s worth labelling each mock-up figure or table with a “simulated data only” banner to avoid confusion later.
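
As a rough sketch of what such a mock-up might look like in practice, here is a small Python example. Everything in it is an assumption made for illustration: the two-group design, the group names, the means, the standard deviation, the sample size, and the function names are all hypothetical, and Welch's t is simply one plausible stand-in for whatever analysis you actually plan. It generates fake raw data, runs them through the analysis, and prints a banner-labelled draft table.

```python
import random
import statistics


def simulate_raw_data(n_per_group, control_mean, treatment_mean, sd, seed=1):
    """Generate fake raw measurements for a hypothetical two-group experiment."""
    rng = random.Random(seed)  # fixed seed so the mock-up is reproducible
    return {
        "control": [rng.gauss(control_mean, sd) for _ in range(n_per_group)],
        "treatment": [rng.gauss(treatment_mean, sd) for _ in range(n_per_group)],
    }


def welch_t(a, b):
    """Welch's t statistic for the difference in means (b minus a)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se


def mock_results_table(data):
    """Format a summary table as it might appear in a draft Results section."""
    lines = [
        "*** SIMULATED DATA ONLY ***",  # banner so the mock-up is never mistaken for real results
        f"{'group':<10}{'n':>4}{'mean':>8}{'sd':>8}",
    ]
    for group, values in data.items():
        lines.append(
            f"{group:<10}{len(values):>4}"
            f"{statistics.mean(values):>8.2f}{statistics.stdev(values):>8.2f}"
        )
    lines.append(f"Welch t = {welch_t(data['control'], data['treatment']):.2f}")
    return "\n".join(lines)


# Assumed, made-up design values: change them to match the effect you expect.
fake = simulate_raw_data(n_per_group=20, control_mean=10.0, treatment_mean=12.0, sd=3.0)
print(mock_results_table(fake))
```

Rerunning this with a smaller expected difference or a smaller sample is a quick way to see whether the table would still tell a convincing story. Once real measurements exist, they can go through the same functions in the same dictionary shape, with the banner line removed.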

Of course, your real data may not turn out the way you imagine they will (imagine that!). You have to be willing to let your early writing go – to revise, rewrite, or even discard some of it when the data are in. But this doesn’t make early writing wasted effort. It’s always easier to revise a draft than to write from scratch, and that’s on top of the fact that even early writing that’s completely discarded will have helped polish the science itself.

Fake data used malevolently (as in, published as a fraud) can certainly muddy and retard science**. But fake data can clarify and accelerate science when you use them carefully to think through the interpretation of your expected results, and to get writing off to a running start.

© Stephen Heard (sheard@unb.ca) March 29, 2016

This post is based in part on material from The Scientist’s Guide to Writing, my guidebook for scientific writers. You can learn more about it here.


*^A particularly useful rule of thumb: if you find you’re struggling to explain a particular piece of Methods, or that you’re writing convoluted or defensive prose, your writing self is probably sending strong hints that you’ve chosen the wrong approach to the problem.

**^Although it’s debatable whether they often retard science for very long, which is one reason I’m not that concerned about the “reproducibility crisis”, at least in biology. I can’t think of many examples where fraudulent data held a field back more than briefly. They might even sometimes accelerate it, as people strive to explain inconsistency with (real) data from other labs (I wonder if this was the case, for example, for the Hwang Woo-suk human stem-cell cloning fraud). This does not, of course, make the fraud a good thing.

3 thoughts on “Good uses for fake data (part 2)”

  1. David Mellor

    Not only does this help hammer down methods and results sections, but making those decisions about the analysis and results before seeing your real data is a great way to preserve the diagnosticity of traditional statistical tests. Preregistration (https://cos.io/prereg) takes this one small step further and puts those decisions in a read-only format so that pesky biases can’t get in the way and convince us to modify those analyses in the face of the data.

    On using “real” fake data for exploratory work and hypothesis generation: if you have a dataset that is larger than you need (I know, what a great problem to have…) you can randomly split it in half and tweak your analysis and hypotheses as much as you want using real, but exploratory data. Then, when you think you’ve found something interesting, unlock the really real data and confirm the results. This prevents one from using the same dataset to both create and test hypotheses.
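
The random split described in this comment could be sketched in Python roughly as follows. The function name, the seed, and the stand-in list of observations are hypothetical, chosen only for illustration; a real dataset might also call for stratifying the split, which this sketch does not attempt.

```python
import random


def split_exploratory_confirmatory(rows, seed=0):
    """Randomly split rows into an exploratory half and a held-back confirmatory half."""
    shuffled = list(rows)  # copy, so the original order is left untouched
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


observations = list(range(100))  # stand-in for real records
exploratory, confirmatory = split_exploratory_confirmatory(observations)

# Explore freely on `exploratory`; analyze `confirmatory` only once,
# after the hypotheses and analysis are fixed.
```

Fixing the seed makes the split reproducible, so the confirmatory half stays the same no matter how many times the exploratory analysis is rerun.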

    1. ScientistSeesSquirrel Post author

      The split dataset is a great idea – when (as you say) you have the surprising luxury of that much data. It’s sometimes really useful to just analyze the heck out of a dataset in an exploratory way (we used to joke about SAS needing a “PROC FISH”), but then of course you need protection against the expected false positives. If only data were cheap enough to do that all the time!

  2. Pingback: Is this blog a “science blog”? If not, what is it? | Scientist Sees Squirrel