I was recently assigned to give a 25-minute presentation on a subject of my choice. After choosing “scientific dishonesty and fraud,” I happened upon a paper by John Ioannidis claiming that “Most Published Research Findings Are False.”
After skimming through Ioannidis’ paper and reading some of the references, I quickly changed my presentation’s title to “Statistical Significance Testing is the Devil’s Work,” and dug up some 23 papers on the subject. What follows is a reformatted version of my presentation: why you should never trust papers that claim “statistically significant” results. It’s split into several parts for length. (The slides I used are available here.)
(Note to statisticians: Please do send me comments if I screwed something up. I’d rather not preach falsehoods while preaching against preaching falsehoods.)
Why Statistical Significance Means Everything’s Wrong
(well, not everything, but an alarming number of things)
First, let’s address an important question: What is statistical significance, anyway, and what does it mean?
Well, consider a medical study. Suppose I have a fantastic new cold medication that should make the average cold a day or two shorter. Now, I have to design an experiment to test if my medication works, so I get a bunch of people with colds, and give half of them my magic medication and the other half some sort of placebo. We then follow the group for a week or two until the colds are over and see how long their colds lasted.
However, we all know that colds are never the same length. Sure, a cold may average to four days long,[ref]A completely arbitrary number I just made up.[/ref] but there are eight-day colds and two-day colds too. Hence, if I take ten people and see what the average cold length is, it might be 4.2 days, or 3.6, or 2.4, or anywhere in a large range.
This is difficult if I’m doing an experiment — there’s a huge random fluctuation that I have to distinguish from the real effect of the medicine. So, I take a larger sample, and pay a few hundred undergrads $15 for their time. The more people I sample, the more the random fluctuations balance out, and the closer to the “true” average my numbers get.
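To make those fluctuations concrete, here’s a quick simulation sketch. Everything in it is invented for illustration, in the spirit of the arbitrary four-day average above: cold lengths averaging four days with a spread of about two, clipped so no cold is shorter than a day.

```python
import random
import statistics

random.seed(42)

def random_cold_length():
    """One person's cold, in days: a made-up distribution averaging
    ~4 days with lots of individual variation, never under 1 day."""
    return max(1.0, random.gauss(4, 2))

def average_cold_length(n):
    """Average cold length observed in a sample of n people."""
    return statistics.mean(random_cold_length() for _ in range(n))

# Small samples bounce around; big samples settle near the true average.
for n in (10, 100, 1000):
    averages = [round(average_cold_length(n), 2) for _ in range(5)]
    print(f"n={n}: {averages}")
```

Run it and the ten-person averages scatter widely, while the thousand-person averages barely budge: exactly the balancing-out that bigger samples buy you.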
So far, so good. Now, when I evaluate my results, I have to decide if the difference I observed was caused by random fluctuation or by the medication’s effects. That’s where statistical significance comes in.
Statistical Significance and p-values
There’s a whole variety of statistical tests I won’t bore you with, but they all test things like “is the difference between these populations caused by chance?” and “are these two variables correlated?” Many of them give you a “p-value.” I asked a graduate student who’s taken some statistics courses, and he told me this is what a p-value is:
the probability that the statistic you just derived happened by chance, essentially
So if the p-value is less than, say, 0.05, there’s less than a 5% chance that my results happened because of random fluctuations. Not bad! If p is 0.7, though, there’s a huge chance my results happened by chance.
Sound good so far?
I hope not, because that’s all wrong.
Here’s what p-value really means:
the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed[ref]Goodman, Steven N. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Ann Intern Med 130, no. 12 (1999): 995-1004. http://www.annals.org/cgi/content/abstract/130/12/995.[/ref] [emphasis added]
Note the bolded part carefully. When you calculate the p-value, you ask, “If my medication had no effect, what are the odds I’d see this result?” You assume there was no effect or no significant difference, and then make some calculations.
I hope we can all see now that you can’t assume the difference occurred by chance alone, and then do some math to calculate the probability the difference occurred by chance alone. The p-value is the probability of seeing your data if the null hypothesis is true, not the probability that the null hypothesis is true given your data, and those are two very different numbers. “Assuming this happened by chance, what are the odds this happened by chance?” is nonsensical. And it leads us to some major problems.
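That conditional direction is easier to see in code. Here’s a sketch of a p-value computed by brute force: simulate many experiments in a world where the medication truly does nothing, and count how often chance alone produces a difference as big as the one observed. (The distributions and numbers are my own made-up illustration, not anything from the cited papers.)

```python
import random
import statistics

random.seed(1)

def null_trial_difference(n):
    """Difference in mean cold length between two groups of n people
    in a world where the medication does nothing: both groups come
    from the same made-up distribution. This IS the null hypothesis."""
    control = [random.gauss(4, 2) for _ in range(n)]
    treated = [random.gauss(4, 2) for _ in range(n)]
    return statistics.mean(control) - statistics.mean(treated)

def p_value(observed_diff, n, trials=10_000):
    """P(result at least this extreme | no effect). Note the direction:
    we assume no effect, then ask how surprising the data would be.
    This is NOT P(no effect | result)."""
    null_diffs = [null_trial_difference(n) for _ in range(trials)]
    extreme = sum(abs(d) >= abs(observed_diff) for d in null_diffs)
    return extreme / trials

# How often would chance alone give a 1-day gap between groups of 50?
print(p_value(1.0, n=50))
```

Notice that the "no effect" assumption is baked into `null_trial_difference` before any counting happens; the p-value never says anything about how likely that assumption itself is.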
False Negatives and Statistical Insignificance
A false negative occurs when there is a real difference, but my study misses it. In our example, this would happen when the cold medication works, but somehow my study concludes it doesn’t. How would that happen?
Well, remember those random fluctuations. If the random fluctuations are bigger than the actual effect of the medication, there’s almost no way to tell if the medication worked — so we need to study more people until the random fluctuations balance each other out.
Statisticians have a number to describe this problem: statistical power. Statistical power is the probability that our study will detect the difference, assuming there really is one. If my sample size is too small, I’ll never be able to detect a small difference, and my statistical power is too low. If I sample every person in the entire country who has a cold, my statistical power will be excellent.
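Power, too, can be estimated by simulation: invent a world where the drug really does shorten colds, run the experiment many times, and count how often a significance test actually notices. The setup below (a true one-day effect, a crude normal-approximation z-test) is my own illustration, not anything from the studies cited.

```python
import random
import statistics

random.seed(2)

def detects_effect(n, true_effect, z_crit=1.96):
    """One simulated trial with n people per group, where the drug
    truly shortens colds by true_effect days. Returns True if a crude
    two-sided z-test on the difference in means reaches p < 0.05."""
    control = [random.gauss(4, 2) for _ in range(n)]
    treated = [random.gauss(4 - true_effect, 2) for _ in range(n)]
    diff = statistics.mean(control) - statistics.mean(treated)
    se = ((statistics.variance(control) + statistics.variance(treated)) / n) ** 0.5
    return abs(diff / se) > z_crit

def power(n, true_effect, trials=2000):
    """Fraction of trials that detect the (real!) effect."""
    return sum(detects_effect(n, true_effect) for _ in range(trials)) / trials

# Bigger samples -> better odds of catching a genuine 1-day improvement.
for n in (10, 30, 100):
    print(f"n={n} per group: power ~ {power(n, true_effect=1.0):.2f}")
```

With ten people per group the effect is real in every single trial, yet most trials still come back “insignificant” — that is low power in action.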
What, then, would be a good statistical power to aim for when testing medicines? You don’t want to miss a perfectly good drug, right?
The average statistical power of a medical study is 50%.[ref]Sterne, J. A., and G. Davey Smith. “Sifting the evidence - what’s wrong with significance tests?” BMJ (Clinical research ed.) 322, no. 7280 (January 2001): 226-31. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1119478&tool=pmcentrez&rendertype=abstract.[/ref]
That means the average medical study has a 50% chance of completely missing the effect it’s looking for, and then concluding there was a “statistically insignificant” difference and that the drug is useless.
“Statistically insignificant difference” does not mean “no difference.” It simply means you could not detect the difference. Science news articles often say “the difference was statistically insignificant, so age could not have been a factor,” but that is simply false.
Next in part 2: False positives and multiple comparisons