Let's start with a game. Suppose we want to study the factors that influence a person's body weight. I'll invite you to come up with three possible factors. Here are mine: daily caloric intake, physical activity level, and basal metabolic rate.

Now take each of those three factors and break them down further, three sub-factors each.

Daily caloric intake might be shaped by total energy consumed, macronutrient ratios, and meal timing and frequency. Physical activity level might break into aerobic exercise, resistance training, and non-exercise activity thermogenesis. Basal metabolic rate might correspond to age, sex, and muscle mass.

We can keep going indefinitely. Macronutrient ratios depend on how many vegetables and how much meat you ate today, and vegetable intake can be broken down further into leafy greens and colored produce, then into specific varieties and who grew them. Body weight is also shaped by whether you are happy, whether your office orders milk tea every day, whether you are willing to go out for late-night grilled skewers. Countless factors, layered upon one another, each contributing to the final result.

What we observe as body weight is the end product of infinitely many factors acting through layer upon layer of influence. In statistics, this is called a data generating process, or DGP.

Note: All sample code in this article uses random draws, so each run produces different numbers. The author encourages you to run each script a few times to get a feel for the dynamic nature of the data generating process.

Variation

Faced with such a sprawling generative system, a natural question is: among all these densely packed factors, which one is most closely tied to body weight? Does body weight swing wildly when diet fluctuates, or does it barely budge when exercise habits change?

To answer that, we first need to be clear about something: if everyone in a group weighs exactly the same, there is nothing to study. It is precisely because some people are heavier and others lighter that research becomes possible. Statisticians call this degree of difference between individuals variation.

This is a concept we use every day. When a doctor at a checkup tells you "your child's weight is on the heavier side for their age group," the basis for that statement is the variation in body weight among all children of the same age. Without variation, if every child weighed exactly the same, the phrase "on the heavier side" would not exist.

Saying "there is variation" is not enough. We want to know how much. Are all the weights clustered between 60 and 65 kilograms, or do they range from 40 to 120? To measure variation, we construct a statistic called variance.

Variance works like this: compute the difference between each person's weight and the group average, square those differences (so that deviations above and below the mean do not cancel each other out), and take the average. A larger number means the group's weights are more spread out.
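
If you want to see that recipe as code, here is a minimal sketch; the weights below are invented purely for illustration. Note that R's built-in var() divides by n − 1 rather than n, so the two numbers differ slightly for a small group.

# Variance by hand: average squared deviation from the group mean.
# The weights below are made up purely for illustration.
weights <- c(62, 58, 71, 65, 90, 55, 68, 74)
deviations <- weights - mean(weights)

cat("Average squared deviation:", round(mean(deviations^2), 4), "\n")
cat("var() in R (divides by n - 1):", round(var(weights), 4), "\n")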

With this ruler in hand, the next question becomes tractable: the total variance of body weight is one number. Can we slice it into pieces and see how much belongs to diet, how much to exercise?

As it turns out, variance has a particularly useful property: when several factors combine additively, the total variance can be decomposed along those same factors. This is called the additivity of variance.

The Additivity of Variance

Let's get more concrete. Suppose we add dietary intake and exercise level together to form a composite "lifestyle variable." Is the variance of that composite equal to the sum of the variances of its two components? This question is exactly where the additivity of variance comes in.

Conceptually, the additivity of variance says: given a dataset, if you add two variables (say A and B) to get a third (C), then the variance of C should equal the sum of the variances of A and B.

Let's run an experiment to see what the numbers actually say.

n <- 100
a <- runif(n)   # two variables drawn independently from Uniform(0, 1)
b <- runif(n)
c <- a + b      # the composite variable

cat("Sample size:", n, "\n")
cat("Variance of A:", round(var(a), 4), "\n")
cat("Variance of B:", round(var(b), 4), "\n")
cat("Variance of C:", round(var(c), 4), "\n")
cat("Var(A) + Var(B):", round(var(a) + var(b), 4), "\n")

Most social science textbooks stop roughly here. But if you look at the output, you will notice that while the numbers are close, the variance of C does not always exactly equal the sum of the variances of A and B. Why not?

Think carefully: when a system is built from several variables, what is left out if you only account for their individual effects?

The answer, which is a bit hard to guess, is the part the two variables share with each other: covariance.

Let's put the full formula together:

n <- 100
a <- runif(n)
b <- runif(n)
c <- a + b

cat("Sample size:", n, "\n")
cat("Variance of A:", round(var(a), 4), "\n")
cat("Variance of B:", round(var(b), 4), "\n")
cat("Variance of C:", round(var(c), 4), "\n")
cat("Cov(A, B):", round(cov(a, b), 4), "\n")
cat("Var(A) + Var(B) + 2*Cov(A, B):", round(var(a) + var(b) + 2 * cov(a, b), 4), "\n")

You will find that it is "Var(A) + Var(B) + 2*Cov(A, B)" that matches the variance of C. In fact, every variance has this structure built into it.

In practice, body weight is not shaped by just two factors. If there are four factors W, X, Y, and Z, every ordered pair, including each factor paired with itself, potentially has a covariance, giving sixteen entries in total. We can organize them in a table. If variable V is composed of W, X, Y, and Z, the full variation structure inside V can be written as:

        W            X            Y            Z
W   Cov(W, W)    Cov(W, X)    Cov(W, Y)    Cov(W, Z)
X   Cov(X, W)    Cov(X, X)    Cov(X, Y)    Cov(X, Z)
Y   Cov(Y, W)    Cov(Y, X)    Cov(Y, Y)    Cov(Y, Z)
Z   Cov(Z, W)    Cov(Z, X)    Cov(Z, Y)    Cov(Z, Z)

In this table, the covariance of a variable with itself is simply its own variance. Covariance captures the variation that two variables jointly produce; variance captures the variation a variable produces jointly with itself. Only by combining all the entries in the table can we recover the complete variation structure.

In other words, the additivity of variance describes how all the elements within a variation system interact with one another, not merely how "pure variances" add up.
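
To make the table concrete, here is a minimal sketch with four invented components. Summing every entry of the covariance table, diagonal and off-diagonal alike, recovers the variance of the composite.

# Four invented components and their composite.
n <- 100
w <- runif(n); x <- runif(n); y <- runif(n); z <- runif(n)
v <- w + x + y + z

# cov() on a matrix returns the full 4 x 4 covariance table shown above.
cov_table <- cov(cbind(W = w, X = x, Y = y, Z = z))
print(round(cov_table, 4))

cat("Sum of all 16 entries:", round(sum(cov_table), 4), "\n")
cat("Variance of V:", round(var(v), 4), "\n")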

Covariance

For many readers, covariance is an unfamiliar and abstract term.

Let's start from something you already know. Variance measures how a variable disperses relative to itself: how spread out a group's body weights are, with weight compared against weight. It captures a single variable's internal spread.

Now introduce a second variable. Say we have dietary intake and body weight. Each one spreads on its own, but something more complex is happening between them: they spread together. A person who eats above average tends to weigh above average; someone who eats less tends to weigh less. The deviations of these two variables are happening in sync.

This is exactly what covariance is trying to capture: the joint dispersion of two variables as a system.

How do we quantify this "deviating together"? Breaking it down:

Cov(X, Y) = ρ · σ_X · σ_Y

Here ρ is the correlation coefficient and σ_X, σ_Y are the standard deviations of the two variables, reflecting their individual levels of variation.

ρ answers "how in step are these two variables," always between −1 and 1. σ_X and σ_Y answer "how much room does each variable have to spread." Multiplying their variation amplitudes together builds a stage on which they can move together.

Covariance is the product of two things: how consistently the two variables move in the same direction, and the combined magnitude of their individual dispersions. If both variables barely vary, their covariance stays small even when they move in perfect lockstep; conversely, two highly dispersed variables can produce a large covariance from only a modest degree of co-movement.
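
Here is a quick numerical check of that decomposition, a sketch using two invented variables that share part of their variation. The last line anticipates the self-covariance case discussed next.

# Check Cov(X, Y) = rho * sigma_X * sigma_Y on invented data.
n <- 100
x <- runif(n)
y <- x + runif(n)   # y shares part of its variation with x

cat("cov(x, y):", round(cov(x, y), 4), "\n")
cat("cor(x, y) * sd(x) * sd(y):", round(cor(x, y) * sd(x) * sd(y), 4), "\n")
cat("cov(x, x) versus var(x):", round(cov(x, x), 4), "vs", round(var(x), 4), "\n")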

Now let's test some boundary cases against intuition.

If Y = X, then σ_X = σ_Y, and the formula becomes Cov(X, X) = ρ · σ_X². Since a variable's correlation with itself is always 1, this simplifies to Cov(X, X) = σ_X². Congratulations, you have just reinvented variance. This confirms what we saw in the covariance table earlier: variance is a special case of covariance, the joint dispersion of a variable with itself.

When ρ = 0, covariance is zero. The two variables have no systematic co-movement. Knowing that someone eats above average tells you nothing about whether their weight is above or below average. The two variables each spread independently, with no connection between them.

When ρ < 0, covariance is negative. The two variables move in opposite directions: when one is above average, the other tends to be below. Exercise frequency and resting heart rate, for example, tend to have this kind of relationship.
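
As a sketch of the negative case, here is an invented exercise-and-resting-heart-rate example; the numbers are made up, but the construction guarantees that the two variables tend to move in opposite directions.

# Invented negative co-movement: more weekly exercise, lower resting heart rate.
n <- 100
exercise <- runif(n, 0, 7)                          # sessions per week
resting_hr <- 75 - 2 * exercise + rnorm(n, sd = 3)  # beats per minute

cat("Correlation:", round(cor(exercise, resting_hr), 4), "\n")
cat("Covariance:", round(cov(exercise, resting_hr), 4), "\n")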

Independent Variables

Look carefully at the data generating process in the code above. A and B are generated completely independently. From the perspective of ground truth, from "God's stone tablet," the two have no connection whatsoever. But if we compute their correlation:

n <- 100
a <- runif(n)
b <- runif(n)

cat("Sample size:", n, "\n")
cat("Mean of A:", round(mean(a), 4), "\n")
cat("Mean of B:", round(mean(b), 4), "\n")
cat("Correlation of A and B:", round(cor(a, b), 4), "\n")

You will find that the result is never exactly zero. This is not because A and B are related. It is because correlation, as a tool for estimating the association between two variables, is always subject to noise. Two variables whose true correlation is zero are said to be orthogonal in mathematical terms.

Pay close attention to this: from the data generating process perspective, even when two variables are generated completely independently, noise alone will almost always prevent the data from being perfectly orthogonal. In practice, you will almost never collect two variables with a correlation coefficient of exactly zero.
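
One way to see this, sketched below, is to repeat the independent-draw experiment many times and record the sample correlation from each run: the values scatter around zero but essentially never land on it exactly.

# Repeat the independent-draw experiment and collect the sample correlations.
n_reps <- 1000
cors <- replicate(n_reps, cor(runif(100), runif(100)))

cat("Mean correlation across runs:", round(mean(cors), 4), "\n")
cat("Smallest correlation in absolute value:", round(min(abs(cors)), 6), "\n")
cat("Runs with a correlation of exactly zero:", sum(cors == 0), "\n")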

Closing

At this point, you should understand that "additivity of variance" holds strictly only when all variables are orthogonal. Since perfectly orthogonal data is almost never what you get in practice, covariance cannot be ignored. This matters a great deal when dealing with multicollinearity in generalized linear models, so make sure you understand what it means.

As for the opening example, using A − B to construct C gives the same picture:

n <- 100
a <- runif(n)
b <- runif(n)
c <- a - b

cat("Sample size:", n, "\n")
cat(sprintf("Mean of A: %.4f   Variance of A: %.4f\n", mean(a), var(a)))
cat(sprintf("Mean of B: %.4f   Variance of B: %.4f\n", mean(b), var(b)))
cat(sprintf("Mean of C: %.4f   Variance of C: %.4f\n", mean(c), var(c)))
cat("Cov(A, B):", round(cov(a, b), 4), "\n")
cat("Var(A) + Var(B) - 2*Cov(A, B):", round(var(a) + var(b) - 2 * cov(a, b), 4), "\n")

Because variance involves squaring, Var(−B) is the same as Var(B): subtracting B contributes exactly as much variance as adding it, and only the sign in front of the covariance term flips. You can probably work out why on your own.

Finally, among all statistics that describe variation, variance is the one that most naturally satisfies the additivity principle. The squaring operation eliminates sign cancellation and enables clean orthogonal decomposition; the algebraic properties of the square function allow terms to separate neatly after expansion. This is why statistical modeling (t-tests, ANOVA, and linear regression all count as statistical modeling) favors variance as the standard measure of error.

If that last part does not quite click, do not worry. It will not stop you from understanding statistical methods.

One more thing: most large language models cannot reliably distinguish "independent" from "orthogonal." If you feed the first code block in this article to a language model, most will attribute the discrepancy to the sample size being too small, or to not having drawn enough repeated samples to observe the expected value. Neither explanation points to the real culprit, the covariance term in the additivity of variance. Keep that in mind when using language models to study statistics.