From the outside, it seems like data is impartial. It’s cold, objective, accurate.
In reality, though, it’s more complicated. In the hands of someone with an agenda, data can be weaponized to back up that viewpoint. Even in the hands of someone benevolent, data can be misinterpreted in dangerous ways.
Someone who wants to win an argument using data can usually do so.
“I like data because it helps me win arguments” – Never has a phrase better revealed someone who doesn’t get value from data
— Andrew Anderson (@antfoodz) January 6, 2015
Pro Tip: Be Skeptical
In 1954, Darrell Huff wrote a bestselling book called “How to Lie With Statistics,” so this stuff isn’t new to our age of #bigdata. Most of the same lies, cheats, and misrepresentations still exist today (there’s also a whole Wikipedia page on “misuse of statistics”).
Data deception can occur for a variety of reasons, some benevolent and some not. Wikipedia lists a few possible causes here:
- The source is a subject matter expert, not a statistics expert.
- The source is a statistician, not a subject matter expert.
- The subject being studied is not well defined.
- Data quality is poor.
- The popular press has limited expertise and mixed motives.
- “Politicians use statistics in the same way that a drunk uses lamp-posts—for support rather than illumination” – Andrew Lang
And to add one, as Andrew Anderson’s tweet mentioned above, sometimes people are simply motivated to prove their points. Data can be a trump card when it comes to certain debates, so the message gets skewed by the messenger.
This goes beyond misinterpreting A/B testing statistics (though you should certainly brush up on the basics there). There are wider and broader offenses of data deception. And marketers don’t just spread mistruths to others. We use data to lie to ourselves as well.
To be a better consumer and user of data, you should know these misdirections.
Here are some of the most prevalent mistakes I’ve seen.
Always Check the Sample
When presented with an interesting statistic, one must examine how the data was collected.
For example, in A/B testing, since we can’t measure the ‘true conversion rate,’ we have to select a sample that is statistically representative of the whole. This applies to all methods of data collection, including surveys. Sampling is used to infer answers about the whole population.
To explain this, Matt Gershoff gives the example of cups of coffee. Say we have two cups of coffee and we want to know which one is hotter and by how much. All you need to do is measure the temperature of the two cups and subtract the lower temperature from the higher one to see the difference. Very simple.
But if you wanted to discover, “which place in my town has the hotter coffee, McDonald’s or Starbucks?” you’d have a statistics question. Essentially, you’d want to collect a representative sample that is large enough to infer the results of the whole population.
The more cups we measure, the more likely it is that the sample is representative of the actual temperature. With a larger sample size, the variance of the sample mean shrinks, so it’s more likely that our estimate of the average will be close to the true value.
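To make the coffee example concrete, here is a minimal simulation sketch. The cafe, its average temperature (70 °C), and the cup-to-cup spread (5 °C) are made-up assumptions for illustration, not measurements from any real store. It repeatedly draws samples of a given size and checks how much the sample average bounces around.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=2000, seed=0):
    """Standard deviation of the sample mean across many repeated samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Hypothetical cafe: cup temperatures ~ Normal(mean=70 C, sd=5 C).
        cups = [rng.gauss(70.0, 5.0) for _ in range(sample_size)]
        means.append(statistics.fmean(cups))
    return statistics.stdev(means)


spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(100)

print(f"  5 cups per sample: averages vary by about {spread_small:.2f} C")
print(f"100 cups per sample: averages vary by about {spread_large:.2f} C")
```

Under these assumptions, the averages from 100-cup samples cluster far more tightly than the averages from 5-cup samples, which is the sense in which a bigger sample gives a more trustworthy mean.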
So sampling can be a primary source of problems in bad data. It usually comes down to samples that are too small or unrepresentative of a population. In addition, it’s easy to cherry pick your sample and your data to get the answer you’d like.
Small Sample Sizes
In conversion optimization, it’s easy to be fooled by small sample size. Often it comes in the form of celebratory case studies where the company “lifted conversions by 400%.” As Twyman’s Law suggests, if the data is too surprising, there’s probably something wrong with it.
That’s usually the case in experimentation. As Peep Laja, founder of CXL, put it in a previous blog post:
That’s why you’ll see smart marketers calling out case studies with missing data or that seem too absurd to be true. That’s why case studies on WhichTestWon or LeadPages were such a problem, especially for new marketers who have yet to develop a hardened, cynical worldview that actually questions the results.
To combat this problem, first, check your own data and don’t publish rubbish case studies. But on the consumption side, always be skeptical about other people’s test results, especially if they seem too good to be true.
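One rough way to sanity-check a headline lift is to put an interval around each conversion rate. The sketch below uses the Wilson score interval with invented numbers (1 conversion out of 20 visitors versus 5 out of 20), which produce exactly the kind of “400% lift” described above. The visitor counts and the helper name `wilson_interval` are assumptions for illustration, and overlapping intervals are only a heuristic, not a formal significance test.

```python
import math


def wilson_interval(conversions, visitors, z=1.96):
    """Approximate 95% Wilson score interval for a conversion rate."""
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p + z**2 / (2 * visitors)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical tiny test: (conversions, visitors) for each arm.
control = (1, 20)
variant = (5, 20)

control_rate = control[0] / control[1]
variant_rate = variant[0] / variant[1]
lift = (variant_rate - control_rate) / control_rate

control_low, control_high = wilson_interval(*control)
variant_low, variant_high = wilson_interval(*variant)
intervals_overlap = control_high >= variant_low

print(f"Reported lift: {lift:.0%}")
print(f"Control rate plausibly between {control_low:.1%} and {control_high:.1%}")
print(f"Variant rate plausibly between {variant_low:.1%} and {variant_high:.1%}")
print(f"Intervals overlap: {intervals_overlap}")
```

With so few visitors, both intervals are wide and they overlap, so the dramatic lift is compatible with a much smaller real difference.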
Surveys are a major culprit of using unrepresentative “convenience” samples (and small samples). That’s why you should be especially skeptical when you’re viewing the results of attitudinal surveys (“X percent of people say Y”).
Tomi Mester, who writes the Data36 blog, gave an example of a fictional character, Clara, who needs to do research for a university class:
Cherry Picking Segments or Biasing Samples
This is pretty similar to the above case of unrepresentative samples, but it’s a bit more conscious. Essentially, if you want to make a point, you can pollute the sample with biased measurements, or you can cherry pick after the fact to prove your point.
Market researchers – well, marketers in general – can be deceiving from the start. If you choose a sample that is likely to be skewed attitudinally in your favor, it’s very easy to come up with nice marketing sound bites.
- If you only survey your best customers, it’s easy to find that most of them prefer your software to others.
- If you only analyze top cohorts, it’s easy to prove your campaign is effective.
- If you only look at top performing segments in an experiment, it’s easy to call it a winner.
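The last point is easy to demonstrate with a simulated A/A test, where both arms share the same true conversion rate, so any “winner” is noise. The segment names, the 10% rate, and the traffic numbers below are all invented for this sketch.

```python
import random

rng = random.Random(7)

TRUE_RATE = 0.10        # identical for both arms: there is no real effect
VISITORS_PER_ARM = 200  # per segment, per arm
segments = [f"segment_{i}" for i in range(10)]  # hypothetical segments


def simulate_conversions(visitors, rate):
    """Count how many simulated visitors convert at the given rate."""
    return sum(rng.random() < rate for _ in range(visitors))


lifts = {}
total_a = total_b = 0
for name in segments:
    a = simulate_conversions(VISITORS_PER_ARM, TRUE_RATE)
    b = simulate_conversions(VISITORS_PER_ARM, TRUE_RATE)
    total_a += a
    total_b += b
    # Absolute difference in conversion rate (B minus A) for this segment.
    lifts[name] = (b - a) / VISITORS_PER_ARM

overall_lift = (total_b - total_a) / (VISITORS_PER_ARM * len(segments))
best_segment = max(lifts, key=lifts.get)

print(f"Overall difference: {overall_lift:+.2%}")
print(f"Best-looking segment: {best_segment} at {lifts[best_segment]:+.2%}")
```

Because the best segment is chosen after the fact, its difference can never be smaller than the overall one, and it will often look like a real win even though nothing changed between the arms.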
Proper sampling is hard to do, especially the further you venture out into the real world as opposed to lab-controlled experiments. When data simply sounds weird, question the sampling. When you analyze your own data, be careful not to cherry pick to prove your points. In summary, know that much of the questionable data you read about in news stories has a solid chance of being affected by bad sampling. From How to Lie with Statistics:
Correlations ≠ Causation
One of the easiest ways to be fooled by data is to assume that correlation implies causation. Just because two variables have a high correlation coefficient does not mean they’re related in a meaningful way, let alone causally.
Some of my favorite examples come from a website that chronicles spurious correlations. This one shows that Nicolas Cage movies are highly correlated with swimming pool drownings:
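It helps to see how cheaply a high correlation coefficient can be manufactured. The sketch below is an illustration under made-up assumptions: it generates one random “target” series of 10 yearly values and 1,000 unrelated random series, then reports the strongest correlation found. None of the series has anything to do with the others.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


rng = random.Random(42)
YEARS = 10

# One arbitrary series plus 1,000 completely independent ones.
target = [rng.gauss(0, 1) for _ in range(YEARS)]
candidates = [[rng.gauss(0, 1) for _ in range(YEARS)] for _ in range(1000)]

correlations = [pearson(target, series) for series in candidates]
strongest = max(correlations, key=abs)

print(f"Strongest correlation among 1,000 unrelated series: {strongest:+.2f}")
```

With short series and enough candidates to search through, a strong-looking correlation is nearly guaranteed to turn up by chance alone.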
Correlational data can be valuable, especially in experiment ideation. Say you find that people who download a certain PDF are worth much more money to you over the long term. Well, a simple experiment would attempt to get more people to download the PDF and see what the results are.
The problem, though, is when you take these correlational observations at face value. Ronny Kohavi, Distinguished Engineer at Microsoft, gave the following example in a recent presentation:
The larger your palm, the shorter you will live, on average (with high statistical significance).
You wouldn’t believe there’s any causality in this case, right? Of course not. There’s a common cause: women have smaller palms and live six years longer on average.
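A small simulation can show how a common cause produces this pattern. In the sketch below, the six-year gap in average lifespan comes from the example above; the palm sizes, the spreads, and the specific lifespan averages are invented purely for illustration. Within each sex, palm size and lifespan are generated independently.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


rng = random.Random(1)
people = []
for _ in range(4000):
    sex = rng.choice(["F", "M"])
    if sex == "F":
        # Assumed values: smaller palms, longer average lifespan.
        palm, lifespan = rng.gauss(17.0, 1.0), rng.gauss(81.0, 8.0)
    else:
        # Assumed values: larger palms, average lifespan six years shorter.
        palm, lifespan = rng.gauss(19.0, 1.0), rng.gauss(75.0, 8.0)
    people.append((sex, palm, lifespan))


def palm_lifespan_correlation(rows):
    return pearson([r[1] for r in rows], [r[2] for r in rows])


overall = palm_lifespan_correlation(people)
women_only = palm_lifespan_correlation([r for r in people if r[0] == "F"])
men_only = palm_lifespan_correlation([r for r in people if r[0] == "M"])

print(f"Everyone:   r = {overall:+.2f}")
print(f"Women only: r = {women_only:+.2f}")
print(f"Men only:   r = {men_only:+.2f}")
```

Across everyone, larger palms go with shorter lives, yet the relationship all but vanishes once you look within each sex, because sex is driving both variables.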
As Kohavi put it, “obviously you wouldn’t have believed that palm size is causal, but how about observational studies about features in products reducing churn?”
In addition, these sorts of correlations turn up all the time in popular media. “X is associated with Y.”
Jordan Ellenberg gave an example in How Not to Be Wrong: say you have two binary variables “are you a smoker?” and “are you married?”
You find, after doing this research (with a proper, representative sample), that smokers are less likely than the average person to be married. This gets reported as such, and that’s where the confusion starts. As Ellenberg puts it, you can safely express this by saying, “if you’re a smoker, you’re less likely to be married.”
But one small change to this sentence would make the meaning very different, “if you were a smoker, you’d be less likely to be married.”
The second statement implies causality, which the original study did nothing to confirm. But when reading a sound bite like that, many would understand it to be the latter statement: “if you smoke, you’re more likely to be single.”
Post Hoc and Other Storytelling Methods
Post hoc ergo propter hoc, or “after this, therefore because of this,” is a fallacy that asserts causation where there is only a sequence of events. It looks backward in time and says, “this happened earlier, therefore it caused what followed to happen.”
It’s essentially a narrative fallacy, a method of storytelling by which you can explain past events, though your explanations likely have no bearing on reality.
The best explanation I’ve found of this comes from The West Wing: