Good uses for fake data (part 1)

Graphic: A fake regression. You knew those were fake data, right? I may spend my entire career without getting a real regression that tight.

If you clicked on this post out of horror, let me assure you, first off, that it isn’t quite what you fear. I don’t – of course – endorse faking data for publication. That happens, and I agree it’s a Very Bad Thing, but it isn’t what’s on my mind today.

What I do endorse, and in fact encourage, is faking data for understanding. Fake data (maybe “toy data” would be a better term) can help us understand real data, and in my experience this is a tool that’s underused. I’m an ecologist, and ecological data are complex. Consequently, our analyses are complex and growing ever more so (sometimes perhaps ill-advisedly). It’s very easy to run an analysis that declares a pattern – or the lack of one – and yet to feel a little uncomfortable with your ability to explain just why the data do or don’t show that pattern. Yes, of course you can and should plot your data, and inspect pattern by eye. But the human brain is really, really good at finding pattern when none is there (which is why we still need the P-value and the null-hypothesis significance testing toolkit). So if your eye suggests one thing, and a fancy but unfamiliar analysis suggests something different, what do you do? Blind faith in fancy analysis is just as bad as blind faith in your eye. If, instead, your eye agrees with the fancy analysis, are you sure that isn’t just confirmation bias? Here’s where fake data can come to the rescue.

How can fake data help? When I’m trying to get comfortable with a new kind of analysis, I know what kind of pattern I’m testing for, and I can build some fake data that definitely has that pattern (or, just as usefully, that definitely doesn’t). Sometimes I can do this right off the top of my head; other times I’ll do a more formal simulation*. Then I can run an analysis in which I know what should happen – and if it doesn’t, I know I can’t trust the results with real data. Let me give you two concrete examples.

  • Back before SAS priced its way out of the academic market, I used to use PROC CATMOD a lot (for log-linear analysis of categorical data). I could never remember whether a positive parameter value meant the dependent variable increased from group 1 to group 2, or the other way around. My fix was a little fake dataset I kept on hand for which I knew which direction the effect went; I’d run that one through every time before tackling my real data.
  • When I was a grad student, a paper came out suggesting a new analysis that worked with body-size data and purported to reveal a new property of ecological communities. Let’s call it blargishness (to keep it simpler than the real thing). The paper detected blargishness all over the place. I was pretty excited, but also a bit skeptical, so I decided some fake data were called for. I had two bookshelves of paperback science-fiction novels in my living room, so I took the number of pages in each one and ran that dataset through the paper’s new analysis. My bookshelves showed strong blargishness, and I instantly became unimpressed with the new analysis and with the conclusion of widespread blargishness in ecological communities**.
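The first of those examples (checking which way a parameter's sign points) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original CATMOD workflow: the function names, the group means, and the simple difference in means standing in for "the analysis" are all invented here. The point is that the fake data are built so that group 2 is known to be larger, so the sign of the reported effect reveals the convention.

```python
import random
import statistics


def make_oriented_fake_data(n_per_group=50, shift=5.0, seed=1):
    """Build fake data in which group 2 is larger than group 1 by construction."""
    rng = random.Random(seed)
    group1 = [rng.gauss(10.0, 1.0) for _ in range(n_per_group)]
    group2 = [rng.gauss(10.0 + shift, 1.0) for _ in range(n_per_group)]
    return group1, group2


def group_effect(group1, group2):
    """Hypothetical stand-in for the analysis whose sign convention is unclear.

    Here it simply returns mean(group2) - mean(group1); in practice this
    would be whatever parameter the real software reports.
    """
    return statistics.mean(group2) - statistics.mean(group1)


g1, g2 = make_oriented_fake_data()
effect = group_effect(g1, g2)

# We know group 2 > group 1, so the sign of `effect` tells us the convention.
if effect > 0:
    print("A positive parameter means an increase from group 1 to group 2.")
else:
    print("A positive parameter means a decrease from group 1 to group 2.")
```

Swapping in the real analysis for `group_effect` is the part that matters; the fake dataset stays the same and can be kept on hand.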

I could give you many more examples, but those will do to make my point: fake data can be your friend. They can orient you to an analysis, so you understand what reported parameter values are telling you. They can reassure you that you can trust an analysis, or warn you that you can’t trust one. They can give you an intuitive feel for what an analysis is doing – and this is important given that most users won’t (or can’t) gain that understanding by working through an analysis by hand. (Doing so is a great idea with a 2-way ANOVA, but simply isn’t going to happen with the kind of complex and often multivariate analyses that we all need to deal with.)
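To make the "reassure or warn" idea concrete, here is a rough Python sketch that builds one fake dataset with a pattern and one without, then runs both through the same analysis. Everything in it is an assumption chosen for illustration: the permutation test on a Pearson correlation stands in for whatever fancy analysis you are trying to understand, and the slope, noise level, and sample size are arbitrary.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient, computed by hand."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p(x, y, n_perm=999, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any real x-y association
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # Count the observed arrangement itself so p is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def fake_data(slope, n=30, seed=42):
    """Fake data: y = slope * x + noise. slope=0 means no pattern at all."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, 1) for xi in x]
    return x, y


x_pat, y_pat = fake_data(slope=2.0)    # pattern definitely present
x_null, y_null = fake_data(slope=0.0)  # pattern definitely absent

p_pattern = permutation_p(x_pat, y_pat)
p_null = permutation_p(x_null, y_null)

print(f"p-value with a built-in pattern: {p_pattern:.3f}")
print(f"p-value with no pattern:         {p_null:.3f}")
```

If the analysis fails to flag the dataset with the built-in pattern, that is a warning sign. A single no-pattern dataset will still produce a small p-value about 5% of the time by chance, so a fairer check of false positives would repeat the no-pattern case over many seeds.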

So: by all means fake some data! Don’t take it beyond your quest for understanding, of course (I feel the need to say that again, lest I be taken horribly out of context), but you knew that. There’s no substitute for a good intuitive feel for what an analysis is doing, and watching one operate when you know the “right” result is one of the best ways to get that. Think of it as some time in the sandbox before you build a real castle; and with that attitude, you can learn a lot from fake data.

© Stephen Heard, March 10, 2016

*Formally simulated “fake data” are an important tool and publishable in their own right, of course, in contexts like simulation studies of newly designed statistical tests. This is how we verify that tests perform as designed, or how we assess how they don’t perform as designed, especially when assumptions are violated. But what I’m talking about today is fake data used for exploration, without any intention of them being a product for publication.

**I should have run some completely random data, too, of course. I used my bookshelves because I wanted some “body-size” data that were completely unrelated to processes that might work in biological communities, and that was all I could think of. But of course lengths of novels are probably non-random in other ways. I still think I was right to dismiss the blargishness analysis on the grounds that my bookshelves showed it.


8 thoughts on “Good uses for fake data (part 1)”

  1. Catherine Scott (@Cataranea)

    Another great use for fake data is when you are designing an experiment and figuring out what kind of data to collect and how it will be analyzed. Too often students dive in to an experiment, collect a bunch of data, and only then start thinking about how to analyze it. At this point they often realize that they have no idea what analysis to run, or that the analysis they thought they were going to run doesn’t actually allow them to answer the question they wanted to answer. Making a fake dataset of the kind of data you intend to collect (with and maybe also without the pattern you expect to see), and then analyzing it with the test you think is appropriate/figuring out what test will be appropriate BEFORE you ever start your experiment is an excellent way to avoid all kinds of issues down the line.


  2. Jeremy Fox

    Yep. Checking what your newfangled analytical approach does with fake data is always a good idea.

    Can’t find it now, but a while back over at Dynamic Ecology we linked to an amazing story about fake data in physics. The folks in charge of one of those big physics collaborations using some hugely expensive detector to look for some subtle physical phenomenon that may not exist routinely create fake data to test the reliability of their data processing and analytic methods. Without warning, the person in charge of the project will literally *move bits of the detector around* so that the detector will behave *exactly* as it would if it were detecting something. Then you see if the physicists in charge of data processing and analytics find the “signal”. At least once, those physicists were allowed to go so far as to *write a draft Nature paper* before it was announced to them that the data were fake.

    The only caution I’d raise is that you need to follow the example of those physicists and make sure that the features of your fake data have a 1:1 correspondence to whatever it is you’re trying to detect. Too often in ecology, we’ve mistaken data that are non-random in some statistical sense for a signal of some particular process. Which they aren’t, because many other ecological processes might have generated the same “non-random” signal. I’m thinking for instance of trying to treat non-random patterns in species x site presence-absence matrices as a signal of interspecific competition. When of course many other processes besides competition can generate the same non-random patterns in those matrices, and competition doesn’t necessarily generate the patterns it’s widely thought to generate.


  3. ScientistSeesSquirrel Post author

    Yes, the long-running debate over checkerboard patterns is a really good example. It’s one thing to know if you can detect the expected pattern; quite another to know if the expected pattern necessarily means what you think it does. My “blargishness” example is a case in point!


  4. Morgan Maryk

    This post is a bit beyond my understanding; however, I found it interesting, since this was all that my stats teacher discussed at Simon Fraser University: how data can be manipulated and spun in any way you want, really. Drop this outlier, or include that one, and you get the outcome that is more to your liking.

    A good lesson to read between the lines I suppose!

    Nice post.


    1. ScientistSeesSquirrel Post author

      Morgan – thanks for commenting. Manipulation is mostly a different issue, I think – you’re quite right that one can manipulate all kinds of ways. That’s not good, of course – except when you do it on purpose to tweak a dataset and see what happens. That can help you understand things like sensitivity to outliers, as long as you know you aren’t fishing for the preferred answer!


