Today I want to talk about the problem that the significance testing system is trying to solve.

Please note that this piece is framed entirely in Fisher's vocabulary. I have no intention of dragging the Neyman-Pearson framework into the discussion and complicating everything. I understand that for many readers, juggling two sets of conditional probabilities is painful, and there is no reason to invite that trouble here. If you want to discuss this piece with an LLM, paste this note in full and strictly instruct your LLM not to introduce Neyman-Pearson terminology partway through.

The Problem

Consider the following scenario: you want to know whether middle school students at sports academies are taller than the theoretical national average. Measuring these students is not particularly difficult. You recruit some colleagues, enlist students from sports academies across various provinces and cities, measure their heights, and collect the data. You then compute the mean and find that it is indeed higher, though not by much: just one centimeter. You are unsure whether that one centimeter means anything, so you look for an answer.

To answer this, we have two tools available: effect size and significance level. Today we focus on the latter, which we can understand as "how reliable is the result we obtained."

To understand this "confidence," we first need to understand why we lack it. What confidence describes is whether, if we ran the experiment again, we would obtain similar data.

You can probably imagine that as long as you do not measure every sports academy student in the country, the numbers you get each time will differ to some degree. Even if you did measure every one of them, your ruler is not accurate down to the atomic level, so measurement error still enters the picture. If that variation is large, the entire study is unreliable, and we have "no confidence" in it.

Yes or No

There are two ways to quantify this confidence: Say Yes and Say No.

Your intuitive idea might be to split the data at the theoretical model mean, one part above it and the other below, and then treat the percentage falling above that mean as a measure of "confidence."

If you were measuring every sports academy student in the country, I could not say your idea is wrong, but in practice that is not feasible. Once we want to keep research costs within an acceptable range, we inevitably turn to sampling: inferring things about the heights of all sports academy students from our sample. And the moment we start inferring, we introduce a probability of being wrong.

In statistics, we use the sampling distribution to express this probability. This distribution describes what pattern would emerge across the infinitely many means you would obtain if you ran the experiment infinitely many times, each time using exactly the same method: recruit a group of students, measure their heights, compute the mean.1

This connects to the parametric tests you have learned about: under a correct experimental design and properly functioning measurement instruments, if you replicate the experiment infinitely many times, the resulting collection of means forms a bell-shaped curve centered on the true population mean (the national mean, if the sports academy label makes no difference), with a spread that depends on sample size.
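If you want to see this with your own eyes, a few lines of simulation will do. The sketch below (Python; the national mean of 160 cm, the standard deviation of 8 cm, and the sample sizes are all numbers I made up for illustration) replays the experiment tens of thousands of times in a world where the sports academy label makes no difference:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed (made-up) population parameters, for illustration only:
# a national mean height of 160 cm with a standard deviation of 8 cm.
NATIONAL_MEAN = 160.0
NATIONAL_SD = 8.0

def replicate_means(sample_size, n_replications=20_000):
    """Rerun the 'recruit, measure, average' experiment many times
    and collect the sample means."""
    samples = rng.normal(NATIONAL_MEAN, NATIONAL_SD,
                         size=(n_replications, sample_size))
    return samples.mean(axis=1)

for n in (25, 100, 400):
    means = replicate_means(n)
    # The means pile up around the national mean, and their spread
    # shrinks with sample size, roughly as SD / sqrt(n).
    print(f"n={n:4d}: center={means.mean():6.2f}, "
          f"spread={means.std():.3f}, theory={NATIONAL_SD / np.sqrt(n):.3f}")
```

The printed spread shrinking like SD/√n is exactly the "depends on sample size" part of the sentence above.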

The sampling distribution is currently the only tool we have for describing the uncertainty introduced by sampling. It maps "a definite population difference" onto "the probability of observing a particular sample result." In other words, if you assume that the difference between the sports academy mean and the national student mean is 1 centimeter, it gives you a distribution telling you, if that assumption is true, what the probability is of drawing each possible sample mean in a single experiment.

The awkward part is that the sampling distribution requires us to supply a precise center value rather than a fuzzy range of center values. This is where things get troublesome. If we want to use this tool to answer a broad question like "sports academy students are taller than the national average," we would have to separately compute the probability for the case where they are 1 centimeter taller, 2 centimeters taller, 3 centimeters taller, 4 centimeters taller, and so on until the end of time. We would have to figure out the probability of each scenario and add them all up; otherwise our data cannot be translated into the answer we want. Making matters worse, height is a continuous variable, and between 0.1 and 0.2 centimeters alone there are infinitely many values, which makes the whole thing extraordinarily complicated. You need integrals; you need Bayesian methods (a statistical approach that is divine, extremely power-hungry, and not eco-friendly).
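For the curious, here is a toy sketch of what the Say Yes route looks like when you dodge the integrals with a crude grid approximation. Every number in it (the observed difference, the standard error, the grid, the flat prior) is invented for illustration, and a real analysis would use proper priors and proper machinery:

```python
import numpy as np
from scipy import stats

# Every number below is invented for illustration.
observed_diff = 1.0   # the observed mean difference, in cm
se = 0.8              # an assumed standard error for that difference

# Candidate true differences on a fine grid, -5 cm to +5 cm.
grid = np.linspace(-5, 5, 2001)
prior = np.ones_like(grid)                              # flat prior
likelihood = stats.norm.pdf(observed_diff, loc=grid, scale=se)
posterior = prior * likelihood
posterior /= posterior.sum()

# The Say Yes answer: total probability mass on "the difference is positive".
print(posterior[grid > 0].sum())
```

The grid is the poor man's integral: we weigh every candidate difference at once, which is precisely the "add them all up" burden described above.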

To be clear, this is not an inoperable approach. There is already a wealth of mature statistical methodology and software packages that help us use this line of thinking to answer questions. But the challenge of applying it is cognitive overhead. Most researchers in the social sciences and medicine cannot form a correct mental model of it, and it is difficult to teach, which is why the Say Yes approach has not become the mainstream data analysis method in those fields.

From a philosophy of science perspective, Say No is far simpler than Say Yes. To prove that every swan in a pond is white, you have to pull out every swan and check its color one by one. To prove that not every swan in the pond is white, you only need to find one black one.

Building on this idea, Fisher designed the significance testing system, whose core logic is proof by contradiction. The framework constructs a narrative: if directly asserting "the thing exists" is difficult, we can instead argue "it is impossible for it not to exist." In our sports academy height example, translating "the difference does not exist" into statistical language gives us "the difference equals zero." That hypothesis points to a single, precise distribution rather than a sprawling collection of joint probabilities. And "impossible" in this context means "the probability is very small."

The Distribution

Now let us look at how the "difference equals zero" distribution is actually constructed. In Fisher's framework, we begin by establishing a premise: assume that the "sports academy" factor has no effect whatsoever on students' height (equivalent to a difference of zero).2 Let us also assume that the "national baseline" is a fixed number already looked up from years of compiled statistics.

When this assumption holds, the "sports academy" classification carries no information for the height variable. A sports academy student, with respect to height, gains nothing from the "sports academy" label. Their true average is simply the number you can look up directly from historical national statistics. If we replicated the experiment infinitely many times within the sports academy population, sampling students, measuring heights, and computing means, all the sample means would fluctuate around the national mean. Nothing surprising there.

But suppose an unexpected event occurs: we obtain a mean that the imagined world would produce only with very small probability. This signals that the data in hand are fundamentally incompatible with that world. In other words, the assumption that treats the national baseline as the true mean of sports academy students cannot hold. From this we conclude, by contradiction: the sports academy classification does have a distinguishing effect on height.
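Here is a minimal sketch of that imagined world, again with made-up numbers (a national mean of 160 cm, a standard deviation of 8 cm, a sample of 100 students, an observed mean of 161 cm). We write the zero-effect rule into the world, replay the experiment inside it, and check how often it produces a mean as extreme as ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# The rule of this world: 'sports academy' has zero effect, so its
# students are drawn straight from the national distribution.
# The parameters and the observed mean are made up for illustration.
NATIONAL_MEAN, NATIONAL_SD, N = 160.0, 8.0, 100
observed_mean = 161.0  # the '1 cm taller' result from our study

# Replicate the experiment many times inside that world.
null_means = rng.normal(NATIONAL_MEAN, NATIONAL_SD,
                        size=(100_000, N)).mean(axis=1)

# How often does this world produce a mean at least as extreme as ours?
frac = (null_means >= observed_mean).mean()
print(f"Fraction of null-world means >= {observed_mean}: {frac:.4f}")
```

With these invented numbers the fraction comes out around 0.1, which is not especially surprising; the imagined world produces a result like ours fairly often.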

Statisticians dress up this inference as "a statistically significant difference." It is not hard to see that "significant difference" is simply a packaging of the proof by contradiction described above.

The Assumption

One common misreading of the p-value is interpreting it as the probability that the research result is wrong.

In fact, the p-value does not measure "the probability that the research hypothesis is true," nor does it measure "the probability that the data were produced by chance alone." It is closer to "how incompatible the data are with a specified statistical model." That is the spirit of proof by contradiction: if we treat "no difference" as an exact, fixed premise along with all the experimental design premises and hold them constant as established facts, how often would data this extreme appear in that world?

In summary, the p-value does three things:

First, it sets up a computable ideal world in which the sports academy factor has no effect and the sampling, randomization, and measurement assumptions of the experiment are sufficiently satisfied. This world is not the actual truth; it is simply the computational stage on which the test depends.

Second, within that premise, we ask: if this stage were real and randomization were perfect, and we repeated the entire sampling experiment infinitely many times (each time randomly drawing a sports academy sample of the same size from the national student population), what pattern of fluctuation would the sample mean differences follow?

Finally, we place the observed value onto the distribution of means: does the 1-centimeter difference you observed fall near the center of this map or out in the tail? The p-value is the tail probability occupied by your result and everything more extreme. A smaller p-value only means that the data are less compatible with that "no-difference, premises-satisfied" computational world, from which we infer "the sample in our hands did not come from this ideal world."
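In code, the three steps above collapse into a single familiar call. A hedged sketch, with fabricated heights standing in for real measurements (and assuming a reasonably recent SciPy for the alternative argument):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Fabricated measurements standing in for the real study (cm).
heights = rng.normal(161.0, 8.0, size=100)
NATIONAL_MEAN = 160.0  # the fixed baseline looked up from national statistics

# Steps 1 and 2 are baked into the test's null model (difference = 0);
# step 3 reads off the tail probability of the observed mean.
# alternative='greater' matches the one-tailed question "are they taller?".
t_stat, p_value = stats.ttest_1samp(heights, popmean=NATIONAL_MEAN,
                                    alternative='greater')
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Everything interesting happened before this call: the ideal world and its premises are already inside the test, and the p-value is just the tail being read off.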

Another very common, and rather interesting, misreading of the p-value is: "the p-value is the probability that the data were caused by random error."

Return to the stage we just built. On that stage, "sports academy" has been forcibly set to have zero effect on height. This is a rule written into the world at the moment of its creation, not a conclusion we later derived. Under this rule, regardless of whether the sample produces a difference of 0.1 centimeters or 10 centimeters, "sports academy" cannot be responsible for that difference. It has no standing to carry an effect. The only things left to explain the differences are sampling chance, measurement disturbance, and natural individual variation, all of which we collectively call "random."

In other words, on the hypothetical stage we have constructed, "the data were produced by randomness" is not a proposition with any uncertainty. It is a fact carved into the stone tablets of this world at the outset. Its "probability" is permanently 100 percent. If the p-value truly measured that, then any dataset would yield a p-value of 1, which is obviously not what statistical software produces.

To put it even more plainly: we are rolling a fair die. If we ask "what is the probability that this outcome came from a random process," the answer is always 100 percent, because outcomes from rolling a fair die are random by construction. But if you ask "what is the probability of rolling a 6 on this die," the answer is 1/6. The p-value is clearly addressing the latter question.

These two questions look similar, but one asks about the origin of the result and the other asks about the probability of a specific result.
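The two questions are easy to tell apart once you put them in code. A throwaway sketch:

```python
import random

rolls = [random.randint(1, 6) for _ in range(60_000)]

# "Did this outcome come from a random process?" -- yes for every roll,
# by construction; on this stage that 'probability' is permanently 1.
p_from_random_process = 1.0

# "What is the probability of rolling a 6?" -- about one in six.
p_six = sum(r == 6 for r in rolls) / len(rolls)
print(p_from_random_process, round(p_six, 3))  # 1.0  ~0.167
```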

I strongly resist interpreting the p-value as a "probability," even though it technically is one. The word invites too many misreadings and misunderstandings. If we call it a "compatibility index" instead, things become much simpler. Or, if you like your definitions with a bit of fun, reading it as a "surprise index" is not bad either, since it really is a number describing "how surprised I would be by this result if the effect did not exist."

Further Considerations

When interpreting research results, researchers must be very careful. All statistical inference takes place within the ideal world we have constructed, but the real world contains many additional influences, such as: whether sampling was truly random, whether measurement instruments were valid, whether the shape of the data matched the assumptions of the statistical method, whether there were experimenter effects, whether the research design itself was sound, whether the data were uncontaminated, and whether coding was error-free.

In the philosophy of science, this is known as the Duhem-Quine thesis (confirmation holism): we can never test a hypothesis in isolation. We always test a hypothesis together with all of its auxiliary assumptions as a whole. When we see a very small p-value, logically all we can say is: given the ideal stage of the Say No hypothesis, and given that all of the above assumptions simultaneously hold, the probability of obtaining data this extreme is very low.

In practice, however, a risk is just a risk. If we penalized every study for every potential confounding factor, almost nothing in the world would come out "significant."

Fisher himself, who invented significance testing, was partially aware of this. He viewed significance testing as one step in inductive reasoning, and opposed treating it as a "judgment machine without feelings." He emphasized that a single significant result proves nothing and that independent replication is required. Significance testing is for "learning something," not for issuing final verdicts. The researcher's judgment is indispensable in interpretation.

In practice, however, independent replications are rarely published, and researchers have little incentive to conduct them. When everyone is doing "new research" and chasing "new frontiers," both Fisher's statistical system and the Bayesian one carry a systematic risk.

This is not to say that science is useless or that statistics is useless. It is, rather, a hope I hold for readers: that we stay alert to risk and maintain a sense of humility toward statistical methods and the world. Reporting confidence intervals, pre-registering experiments, and encouraging replication are all expressions of that humility, and virtues worth cultivating.

1

After standardization, this distribution is the t-distribution, though skipping the standardization does not really prevent you from reaching the "correct" conclusion. There is a historical reason for standardizing. In the early days of statistical modeling, people genuinely had to crunch numbers with calculators. Anyone who has looked at the magnificent formula for the t-distribution knows it is not something you can punch into a calculator, so people standardized their statistics and consulted large precomputed lookup tables instead. Today this is simply a convention. Leaving your results unstandardized in a report looks strange, and statistical software will not cooperate with that intention anyway.
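For reference, the magnificent formula in question, the density of the t-distribution with ν degrees of freedom in its standard textbook form:

$$
f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}
$$

Not something you want to evaluate by hand.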

2

Note that this means exactly zero, not a range, not approximately zero.