Graphic: A fake regression. You knew those were fake data, right? I may spend my entire career without getting a real regression that tight.
If you clicked on this post out of horror, let me assure you, first off, that it isn’t quite what you fear. I don’t – of course – endorse faking data for publication. That happens, and I agree it’s a Very Bad Thing, but it isn’t what’s on my mind today.
What I do endorse, and in fact encourage, is faking data for understanding. Fake data (maybe “toy data” would be a better term) can help us understand real data, and in my experience this is a tool that’s underused. I’m an ecologist, and ecological data are complex. Consequently, our analyses are complex and growing ever more so (sometimes perhaps ill-advisedly). It’s very easy to run an analysis that declares a pattern – or the lack of one – and yet to feel a little uncomfortable with your ability to explain just why the data do or don’t show that pattern. Yes, of course you can and should plot your data, and inspect pattern by eye. But the human brain is really, really good at finding pattern when none is there (which is why we still need the P-value and the null-hypothesis significance testing toolkit). So if your eye suggests one thing, and a fancy but unfamiliar analysis suggests something different, what do you do? Blind faith in fancy analysis is just as bad as blind faith in your eye. If, instead, your eye agrees with the fancy analysis, are you sure that isn’t just confirmation bias? Here’s where fake data can come to the rescue.
How can fake data help? When I’m trying to get comfortable with a new kind of analysis, I know what kind of pattern I’m testing for, and I can build some fake data that definitely has that pattern (or, just as usefully, that definitely doesn’t). Sometimes I can do this right off the top of my head; other times I’ll do a more formal simulation*. Then I can run an analysis in which I know what should happen – and if it doesn’t, I know I can’t trust the results with real data. Let me give you two concrete examples.
- Back before SAS priced its way out of the academic market, I used to use PROC CATMOD a lot (for log-linear analysis of categorical data). I could never remember whether a positive parameter value meant the dependent variable increased from group 1 to group 2, or the other way around. My fix was a little fake dataset I kept on hand for which I knew which direction the effect went; I’d run that one through every time before tackling my real data.
- When I was a grad student, a paper came out suggesting a new analysis that worked with body-size data and purported to reveal a new property of ecological communities. Let’s call it blargishness (to keep it simpler than the real thing). The paper detected blargishness all over the place. I was pretty excited, but also a bit skeptical, so I decided some fake data were called for. I had two bookshelves of paperback science-fiction novels in my living room, so I took the number of pages in each one and ran that dataset through the paper’s new analysis. My bookshelves showed strong blargishness, and I instantly became unimpressed with the new analysis and with the conclusion of widespread blargishness in ecological communities**.
I could give you many more examples, but those will do to make my point: fake data can be your friend. They can orient you to an analysis, so you understand what reported parameter values are telling you. They can reassure you that you can trust an analysis, or warn you that you can’t. They can give you an intuitive feel for what an analysis is doing – and this is important, given that most users won’t (or can’t) gain that understanding by working through an analysis by hand. (Doing so is a great idea with a 2-way ANOVA, but simply isn’t going to happen with the kind of complex and often multivariate analyses that we all need to deal with.)
So: by all means fake some data! Don’t take it beyond your quest for understanding, of course (I feel the need to say that again, lest I be taken horribly out of context), but you knew that. There’s no substitute for a good intuitive feel for what an analysis is doing, and watching one operate when you know the “right” result is one of the best ways to get that. Think of it as some time in the sandbox before you build a real castle; and with that attitude, you can learn a lot from fake data.
© Stephen Heard (firstname.lastname@example.org) March 10, 2016
*Formally simulated “fake data” are an important tool and publishable in their own right, of course, in contexts like simulation studies of newly designed statistical tests. That’s how we verify that tests perform as designed – or discover where they don’t, especially when assumptions are violated. But what I’m talking about today is fake data used for exploration, with no intention of making them a product for publication.
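As a hypothetical miniature of such a simulation study, here’s a sketch in Python (stdlib only) that checks whether a two-sample t-test, computed by hand with a pooled variance, holds its nominal 5% Type I error rate when its assumptions are met. Both samples are drawn from the same distribution, so roughly 5% of runs should “detect” a difference at alpha = 0.05.

```python
# Sketch: a tiny simulation study of a test's Type I error rate.
# Both groups come from the same Normal(0, 1), so any "significant"
# result is a false positive; the false-positive rate should be ~5%.
import random
from statistics import mean, variance

random.seed(1)
n, runs, crit = 30, 2000, 2.00  # crit ~ two-sided t critical value, df = 58

rejections = 0
for _ in range(runs):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    pooled = (variance(a) + variance(b)) / 2        # pooled variance, equal n
    t = (mean(a) - mean(b)) / (2 * pooled / n) ** 0.5
    if abs(t) > crit:
        rejections += 1

print(f"observed Type I error rate: {rejections / runs:.3f}")  # expect ~0.05
```

A real simulation study would then violate the assumptions on purpose (unequal variances, skewed data, small n) and watch how far the observed rate drifts from the nominal one.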
**I should have run some completely random data too, of course. I used my bookshelves because I wanted “body-size” data completely unrelated to any process that might operate in biological communities, and that was all I could think of. But of course the lengths of novels are probably non-random in other ways. I still think the fact that my bookshelves showed blargishness was good grounds for dismissing the analysis.