In Good uses for fake data (part 1), I expounded on the virtues of fake – or “toy” – datasets for understanding statistical analyses. But that’s not the only good use for fake data. Fake data (this time, maybe a better term would be “model data”) can also be extremely useful in planning and writing up research. Once again, let me assure you that I’m – of course – not advocating data fakery for publication! Instead, fake data can help you think through how you’re going to present and interpret results of an experiment or an analysis (or perhaps even whether you can interpret them), before you actually spend effort getting data in hand.
I’d put this in the context of “early writing”, which is a strategy that interweaves the writing of science with the doing of science – as opposed to doing the science first and writing it up when it’s “done”, an order of operations that always seemed to me so obvious I never thought to question it. Early writing makes writing easier, and can help you spot problems with your work’s design before it’s too late.
The advantages of early writing are obvious enough with a Methods section – there’s no easier time to write up your methods than when you’re planning the work, or (nearly as good) when you’re executing it. And seeing what you plan down on paper can alert you to gaps in your logic, or other problems, while there’s still time to make changes*. But other sections can be written early too – for example, the Introduction and Discussion while you’re reading literature and thinking about knowledge gaps in order to plan and justify your study. If you can’t sell your study to readers (an important function of both sections), it’s best to know that before, not after, doing the work.
Have I wandered away from fake data? Perhaps, but I’m circling back. What about the Results section? Surely you can’t write the Results section before you have any results? Actually, even large portions of a Results section can be drafted early, if you use some fake data (or model data) to make mock-ups of tables and figures. Doing this before you’ve taken a single datum is an excellent way to test-drive the design of your study, and to discover design features that might make data analysis or presentation needlessly complicated. Similarly, you can learn whether the results you expect can let you tell a convincing story. You’ll want these fake data to be pretty realistic – you might even make mock-ups of raw data and take them through all the steps of analysis and data display that you plan. Doing this gives you a Results section that you can judge as effective (or not), and when you’ve gathered the real data, you can cut-and-paste them into the mock-ups (a huge time-saver).
Perhaps you’re still uneasy with the notion of fake data – but remember you’re using them only for the mock-ups and the test-drive and they’ll never see the light of day. It’s worth labelling each mock-up figure or table with a “simulated data only” banner to avoid confusion later.
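If you like to see this sort of thing in code, here is a minimal sketch of a test-drive in Python. Everything in it is my own illustration rather than a prescription: the two-group design, the sample sizes, the means and standard deviations, and the function names (`simulate_group`, `welch_t`, `mock_results_table`) are all assumptions, and you'd swap in whatever design and analysis you actually plan.

```python
import random
import statistics

# Banner text is an illustrative choice; use whatever wording you can't miss.
BANNER = "*** SIMULATED DATA ONLY ***"


def simulate_group(n, mean, sd, rng):
    """Draw n fake observations from a normal distribution (an assumed model)."""
    return [rng.gauss(mean, sd) for _ in range(n)]


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def mock_results_table(groups):
    """Format a mock-up summary table, with the banner above and below it."""
    lines = [BANNER, f"{'Group':<12}{'n':>4}{'Mean':>10}{'SD':>10}"]
    for name, values in groups.items():
        lines.append(
            f"{name:<12}{len(values):>4}"
            f"{statistics.mean(values):>10.2f}"
            f"{statistics.stdev(values):>10.2f}"
        )
    lines.append(BANNER)
    return "\n".join(lines)


# Hypothetical design: 20 replicates per group, with the effect size you expect.
rng = random.Random(42)  # fixed seed so the mock-up is repeatable
groups = {
    "control": simulate_group(20, 10.0, 2.0, rng),
    "treatment": simulate_group(20, 12.0, 2.0, rng),
}

print(mock_results_table(groups))
print(f"Welch t = {welch_t(groups['treatment'], groups['control']):.2f}")
```

When the real data arrive, only the `groups` dictionary needs to change; the table-building and analysis steps stay as they are, and the banner comes off.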
Of course, your real data may not turn out the way you imagine they will (imagine that!). You have to be willing to let your early writing go – to revise, rewrite, or even discard some of it when the data are in. But this doesn’t make early writing wasted effort. It’s always easier to revise a draft than to write from scratch, and that’s on top of the fact that even early writing that’s completely discarded will have helped polish the science itself.
Fake data used malevolently (as in, published as a fraud) can certainly muddy and retard science**. But fake data can clarify and accelerate science when you use them carefully to think through the interpretation of your expected results, and to get writing off to a running start.
© Stephen Heard (firstname.lastname@example.org) March 29, 2016
This post is based in part on material from The Scientist’s Guide to Writing, my guidebook for scientific writers. You can learn more about it here.
*^A particularly useful rule of thumb: if you find you’re struggling to explain a particular piece of Methods, or that you’re writing convoluted or defensive prose, your writing self is probably sending strong hints that you’ve chosen the wrong approach to the problem.
**^Although it’s debatable whether they often retard science for very long, which is one reason I’m not that concerned about the “reproducibility crisis”, at least in biology. I can’t think of many examples where fraudulent data held a field back more than briefly. They might even sometimes accelerate it, as people strive to explain inconsistency with (real) data from other labs (I wonder if this was the case, for example, for the Hwang Woo-suk human stem-cell cloning fraud). This does not, of course, make the fraud a good thing.