Low p-values are good for you (or at least unavoidable)

Why at least 5% of p-values are ≤5% (even if the null hypothesis is true)

Imagine that 1001 laboratories – L1, L2, ..., L1001 – are enlisted in a phony study, ostensibly to test the mutagenic strength of a new chemical. Each laboratory is to test the chemical using the same standard protocol, by spreading a fixed amount of diluted treated bacteria on several Petri dishes, and counting the total number of mutant colonies.

(Null) hypothesis = "The tested chemical has no effect."

Test statistic = # of mutated colonies on several Petri dishes

The study is phony because each laboratory assumes (perhaps because they are told) that they, and only they, are testing the new mutagen; all the other laboratories are merely controls, testing water. In reality, every laboratory is testing inert, healthy water.

Nonetheless, there is a certain rate of mutations even among control colonies, and it has a random element, so some labs will report more mutations than others. At the end of the experiment, put yourself in the shoes of laboratory L1 and ask what the p-value is for the null hypothesis.

Lab L1 came up with some test statistic value – say 7942.

The p-value answers the question: What is the chance of observing a test statistic at least this high, if the null hypothesis is really true?

Restated: What is the chance of observing 7942 or more mutant colonies if the chemical has no effect?

Since water has no effect, and "chance" means "probability", that is the same as asking:

What is the probability of observing 7942 or more mutant colonies when the bacteria are treated with water?

That is: What is the probability of observing 7942 or more mutant colonies, given exactly the experiment that was performed 1000 times, by laboratories L2, L3, ..., L1001? In other words, from L1's point of view, the other labs' purpose is to calibrate the test statistic.

Since "probability" means long-run frequency given repeated trials, and each laboratory's work can be regarded as a trial, that's essentially the same as asking: What percentage of the 1000 test statistics reported are greater than or equal to 7942? (1)
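The question above can be sketched in a few lines of Python. This is only an illustration: the function name empirical_p_value and the ten control values are made up for the example, and a real calibration would use all 1000 control statistics.

```python
def empirical_p_value(observed, control_stats):
    """Fraction of control statistics that are >= the observed one.

    The observing lab's own statistic is NOT part of control_stats,
    in keeping with footnote (1).
    """
    at_least_as_high = sum(1 for s in control_stats if s >= observed)
    return at_least_as_high / len(control_stats)


# Hypothetical control results (a toy stand-in for labs L2..L1001).
controls = [7, 120, 249, 50, 100, 8000, 12122, 14229, 21000, 92929]

# Five of the ten controls are >= 7942, so the estimate is 0.5.
print(empirical_p_value(7942, controls))
```

Passing only the other labs' statistics is what keeps the lab's own result out of the count.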

Imagine that we arrange all the test statistics in order of size, and assign ordinal ranks from largest to smallest:
Lab:               L211  L592  L88   ...  L1    ...  L916   L147   L18    L666
Test stat:         7     120   249   ...  7942  ...  12122  14229  21000  92929
Ordinal position:  1000  999   998   ...  108   ...  3      2      1      0
p-value:           p=1   .999  .998  ...  .108  ...  .003   .002   .001   p=0

In this way we get an empirical estimate p=0.108 as the p-value for L1's score. It just means that L1's score lies at the 10.8th percentile mark (counting from 0=largest), among a large set of "control" scores (scores obtained or expected assuming the null hypothesis).
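The ranking procedure can be sketched as follows, assuming each lab's p-value is its ordinal position (the number of other labs with a test statistic at least as large) divided by the number of other labs. The function name p_values_by_rank and the five-lab data set are hypothetical, chosen only to keep the example small.

```python
def p_values_by_rank(stats_by_lab):
    """Map each lab to (number of OTHER labs with stat >= its own) / (number of other labs)."""
    n_others = len(stats_by_lab) - 1
    p_values = {}
    for lab, stat in stats_by_lab.items():
        # Ordinal position counted from 0 = largest statistic.
        position = sum(
            1 for other, s in stats_by_lab.items() if other != lab and s >= stat
        )
        p_values[lab] = position / n_others
    return p_values


# Toy data: five hypothetical labs instead of 1001.
toy = {"A": 10, "B": 40, "C": 20, "D": 50, "E": 30}
print(p_values_by_rank(toy))
# {'A': 1.0, 'B': 0.25, 'C': 0.75, 'D': 0.0, 'E': 0.5}
```

With 1001 labs the denominator is 1000, which reproduces the positions and p-values in the table.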

But what goes for L1 goes for every other lab as well. Remember, they all used water. Each of those labs ends up with a p-value corresponding to their ordinal position.

Five percent of the labs occupy the 5% extreme right end of the picture, and all of these labs therefore necessarily have a p-value≤5%. So, assuming the null hypothesis, a p-value≤5% occurs 5% of the time – which is what was to be shown. (If the null hypothesis is false, then small p-values occur even more often.)

Of course, what is true for 5% is equally true for any other number. When the null hypothesis is true, the % of labs that will report a p-value ≤ x is exactly x, for any probability x. Or in symbols,

Pr(p-value ≤ x | null hypothesis) = x.
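A small simulation can make this identity concrete. The sketch below draws 1001 test statistics under the null hypothesis, gives each lab its rank-based p-value, and reports what fraction of labs have a p-value at or below several thresholds x. The normal distribution with mean 8000 and standard deviation 2000 is an arbitrary assumption, not something taken from the study described here; any distribution shared by all labs should give the same picture. The function names are also made up for the illustration.

```python
import random
from bisect import bisect_left


def null_p_values(stats):
    """Rank-based p-value for each statistic, excluding the lab itself."""
    ordered = sorted(stats)
    n = len(stats)
    # n - bisect_left(...) counts statistics >= s; subtract 1 for the lab itself.
    return [(n - bisect_left(ordered, s) - 1) / (n - 1) for s in stats]


def fraction_at_or_below(p_values, x):
    """Fraction of p-values that are <= x."""
    return sum(1 for p in p_values if p <= x) / len(p_values)


random.seed(0)  # fixed seed so the run is repeatable
# Assumed null distribution for illustration only.
stats = [random.gauss(8000, 2000) for _ in range(1001)]
p_values = null_p_values(stats)

for x in (0.01, 0.05, 0.25, 0.5):
    print(x, round(fraction_at_or_below(p_values, x), 3))
```

Each printed fraction should come out close to its x. The small discrepancy arises because 1001 labs share p-values spaced 1/1000 apart.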

1. To avoid ascertainment bias, we shouldn't count the lab itself as one of the labs with a "greater or equal test statistic."

Go to home page of Charles H. Brenner