False positives are somewhat scarier than false negatives. What if my research claims a drug cures cancer, giving thousands of sufferers hope, but it turns out to be useless?
(Before you go on, make sure you’ve read part 1 in my series on statistical significance, since it has some important detail.)
If we followed the conventional (erroneous) wisdom that “statistical significance” means less than a 5% chance that a result occurred by random fluctuations alone, we might think that false positives are rare. After all, a statistically significant result is very unlikely to occur by chance — less than 5% odds. But as we pointed out, the conventional wisdom is very wrong. Let’s find out why.
Suppose I’m a geneticist and I want to find genes that cause cancer. I’ll pick out a bunch of candidate genes and do cellular or animal tests to see which are correlated with higher cancer risk. Now, it’s fairly safe to assume that most genes do not cause cancer; I’m just trying to find the few that do. For ease of mathematics, let’s say I test 1,000 genes and that 90% of them don’t cause cancer. I want to find the 10% (100 genes) that do.
How do I fare? Well, remember that in part 1 we found the average statistical power to be 50%, so the average study has a 50% chance of missing a real correlation. (It can be difficult to perform animal studies with large sample sizes — ethics committees don’t like it.) And with a statistical power of 50%, that means we only find 50 genes out of the 100 that cause cancer. We miss fully half of the cancer-causing genes.
What about false positives? Remember that the p-value represents the probability of getting an outcome at least as extreme as the one observed, under the assumption that there is no difference between the populations. For the 900 genes that have nothing to do with cancer, there is indeed no difference between the two populations, so a significance threshold of p < 0.05 produces false positives in about 5% of the 900 genes. Hence, we find 45 statistically significant correlations with genes that have nothing to do with cancer![ref name=”sterne”]Sterne, J A, and G Davey Smith. “Sifting the evidence-whatʼs wrong with significance tests?” BMJ (Clinical research ed.) 322, no. 7280 (January 2001): 226-31. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1119478&tool=pmcentrez&rendertype=abstract.[/ref]
For those doing the math, that means we found correlations for 95 genes, and 45 of them are false positives. That’s nearly half! And we missed half of the genes that really do cause cancer as well. All because we tested hundreds of genes until we got some random correlations.
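To check the arithmetic, here is a minimal Python sketch of the gene screen described above. It assumes 1,000 candidate genes (the total implied by the 100 and 900 figures), 10% real effects, 50% power, and a 0.05 significance threshold. The function name `screen_outcomes` and its parameters are made up for illustration, and the numbers are expected values, not a simulation.

```python
def screen_outcomes(n_genes=1000, frac_real=0.10, power=0.50, alpha=0.05):
    """Expected outcomes when every candidate gene gets one significance test."""
    n_real = n_genes * frac_real      # genes that truly cause cancer
    n_null = n_genes - n_real         # genes with no effect at all
    true_pos = power * n_real         # real effects we detect
    false_neg = n_real - true_pos     # real effects we miss
    false_pos = alpha * n_null        # null genes flagged as significant anyway
    significant = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "share_of_significant_that_is_false": false_pos / significant,
    }


if __name__ == "__main__":
    for name, value in screen_outcomes().items():
        print(f"{name}: {value:.3f}")
```

With these inputs, about 47% of the “significant” genes are false positives. Try lowering `frac_real` to see how quickly that share grows when true effects are rarer.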
Multiple Comparisons of Doom
The root cause of the problem is multiple comparisons. If you test enough genes, you will eventually find one that randomly correlates with cancer, even though it does not cause it. If you test enough socioeconomic factors, you will eventually find one that randomly correlates with test scores. And so on.
So how many factors is enough? What does it take to get a false positive? Well, the math is tricky, but if you make just 20 comparisons — testing 20 genes, for example — you have a 64% chance of getting a false positive that is statistically significant.[ref name=”multiple”]Smith, Dg, J Clemens, W Crede, M Harvey, and Ej Gracely. “Impact of multiple comparisons in randomized clinical trials.” The American Journal of Medicine 83, no. 3 (September 1987): 545-550. http://www.ncbi.nlm.nih.gov/pubmed/3661589.[/ref]
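The 64% figure can be reproduced with a short calculation, sketched here in Python. It assumes every comparison is independent, is tested at the 0.05 level, and that none of the tested effects is real. Real trials often have correlated outcomes, so treat this as a rough guide only. The function name `familywise_error_rate` is my own label.

```python
def familywise_error_rate(n_comparisons, alpha=0.05):
    """Chance of at least one false positive among independent null comparisons."""
    # Each comparison avoids a false positive with probability (1 - alpha);
    # all of them must avoid it for the study to come out clean.
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for m in (1, 5, 20):
        print(m, round(familywise_error_rate(m), 2))
```

For 20 comparisons this comes out to about 0.64, matching the number above.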
How many comparisons does the average medical trial make? 30.[backref name=”multiple”] Remember, medical trials don’t just test whether the drug worked — they test for side effects, vital signs, drug interactions, and all sorts of things. And with 30 comparisons, 75% of medical trials have a statistically significant conclusion that may well have happened by chance, simply because they kept testing different factors until they found one that correlated.
“Statistically significant” does not mean “real effect.” It does not mean there is a real correlation. The odds of a false positive depend heavily on what fraction of the hypotheses you test are true, and on how many hypotheses you test.
But Surely Experimental Replication Saves Us!
The easiest response is to point out that science does not rely on single studies: scientists insist on replicating important results in multiple experiments before accepting their conclusions. Right?
Not really. Take, for example, medical trial studies that were cited more than one thousand times in the 1990s: highly influential and important studies, clearly. Of those studies, only 44% were replicated by later studies, and some 32% were either contradicted or found to be exaggerated by later studies. The remaining 24% didn’t have a second study big enough to either contradict or confirm their results![ref]Ioannidis, J P. “Contradicted and initially stronger effects in highly cited clinical research.” JAMA 294, no. 2 (2005): 218-228. http://jama.ama-assn.org/cgi/content/abstract/294/2/218.[/ref]
Mind you, these were medical studies cited more than a thousand times, so they were presumably among the most important and influential of their time.
(Similarly, in genetics it is not uncommon for studies to be contradicted by later research. It appears initial results are often too enthusiastic.[ref]Ioannidis, J P, E E Ntzani, T a Trikalinos, and D G Contopoulos-Ioannidis. “Replication validity of genetic association studies.” Nature genetics 29, no. 3 (November 2001): 306-9. http://www.ncbi.nlm.nih.gov/pubmed/11600885.[/ref])
So, What Do We Do?
Just kidding. There’s a simple answer:
The description of differences as statistically significant is not
Saying “this result is statistically significant” is meaningless. It does not help us decide if the result is true. You must report your statistical power and confidence intervals, and you must stop assuming that “statistically significant” and “statistically insignificant” are the last word.
There is also one other option: Bayesian statistics. Bayesian statistics take previous knowledge into account when evaluating new results, so if I know that only 10% of the genes I test cause cancer, I can use Bayesian statistics to see whether my experimental results are strong enough to overcome that low prior likelihood.
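As a rough illustration of that idea, here is a minimal Python sketch that applies Bayes’ theorem to the gene example. It treats “significant or not” as the only evidence, which is a big simplification; a full Bayesian analysis would work with the likelihood of the actual data. The function name `posterior_real` and the example thresholds are assumptions for illustration.

```python
def posterior_real(prior, power, alpha):
    """Probability an effect is real given a significant result (Bayes' theorem)."""
    true_pos = power * prior          # P(significant and effect is real)
    false_pos = alpha * (1 - prior)   # P(significant and no real effect)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # 10% prior, 50% power, usual 0.05 threshold
    print(round(posterior_real(0.10, 0.50, 0.05), 3))
    # Same prior and power, but a much stricter threshold
    print(round(posterior_real(0.10, 0.50, 0.001), 3))
```

With a 10% prior and 50% power, a result that clears p < 0.05 leaves only about a 53% chance the gene really matters; a much stricter threshold is what it takes to overcome the low prior.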
But first, stop it with the statistical significance stuff. It’s just wrong.