This concept has been ruined by almost every statistics teacher.

Here is a simple piece of common sense: a system of linear equations in three unknowns needs three equations (three conditions) to have any chance of a unique solution. When there are more parameters to solve for than conditions, you end up with infinitely many solutions.
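A minimal sketch of that with NumPy (the matrices and numbers are invented for illustration):

```python
import numpy as np

# Three independent conditions on three unknowns: a unique solution exists.
A = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 1.0]])
b = np.array([8.0, 13.0, 6.0])
print(np.linalg.solve(A, b))  # exactly one answer

# Drop one condition: two equations in three unknowns are underdetermined,
# so infinitely many solutions fit. lstsq silently picks one of them
# (the minimum-norm one); the others are just as valid.
x, *_ = np.linalg.lstsq(A[:2], b[:2], rcond=None)
print(x)
```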

Here is another familiar idea you may never have paid much attention to: independent information. A system of three linear equations in three unknowns needs three independent equations to have a chance at a unique solution. If two of those equations are really saying the same thing (identical formulas, or one equation with both sides multiplied by two and passed off as a new condition), then together they still count as only one condition. That is not independent information.
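NumPy can count the independent conditions for you. Here the second row is just the first multiplied by two, so the rank comes out as 2, not 3 (the matrix is again invented):

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],   # row one times two: not new information
              [1.0, 1.0, 1.0]])
print(np.linalg.matrix_rank(A))  # 2: only two independent conditions
```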

These two points carry over to the hypothesis testing framework. Here the unknowns become the model parameters, and the conditions become the amount of data (the sample size). When the number of samples is larger than the number of parameters, the system is called overdetermined. When there are fewer samples than parameters, it is called underdetermined.
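A rough sketch of that counting with random design matrices (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5  # parameters to estimate

X_over = rng.normal(size=(100, p))   # 100 samples, 5 parameters: overdetermined
X_under = rng.normal(size=(3, p))    #   3 samples, 5 parameters: underdetermined

print(np.linalg.matrix_rank(X_over))   # 5: enough conditions for all 5 parameters
print(np.linalg.matrix_rank(X_under))  # 3: at most 3 conditions, 5 unknowns left dangling
```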

But there is a catch. The data in a hypothesis testing framework are noisy. (When you study the relationship between calorie intake and body weight, you can never exhaust all the variables: who grew the vegetables, how much seasoning was added, where you ate, whether the food was hot or cold, how happy you were while eating, and so on.) We are always trying to infer some feature of a population from a sample, and we never get an exact solution. We are only trying to find an optimal solution. This is fundamentally different from solving an ordinary system of equations.
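A small simulation of what "optimal, not exact" means, assuming an invented linear signal buried in normal noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 3.0 + rng.normal(scale=2.0, size=n)  # true signal plus noise

# No line passes through all 50 noisy points exactly. Least squares returns
# the optimal slope and intercept, not an exact solution.
X = np.column_stack([x, np.ones(n)])
beta, residuals, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)       # close to (2, 3), but never exactly
print(residuals)  # nonzero: the leftover the line cannot explain
```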

In most cases, if you have fewer samples than parameters, the conventional hypothesis testing framework collapses completely. (Machine learning, of course, has its own clever little tricks.)
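One of those tricks is regularization. A minimal sketch of ridge regression's closed form, which manufactures a unique answer even when there are fewer samples than parameters (the data and penalty strength here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50                 # 10 samples, 50 parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ordinary least squares has no unique answer here (X'X is singular).
# Ridge adds a penalty lam * I that makes the system solvable again:
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta.shape)  # (50,): one unique, shrunken estimate despite n < p
```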

Degrees of freedom can therefore be seen as a kind of token: the data hand you a fixed budget, and every parameter you estimate spends from it. They tell you how that maximum capacity for parameters is distributed across the whole model.

In an overdetermined system, what we are doing is finding an optimal solution for the unknown parameters inside a noisy system. In other words, most of the data analysis you do is about hunting for clues of a signal within a noisy system. This is where ideas like the signal-to-noise ratio come in. It is fairly easy to grasp: your sample size must be sufficiently larger than the number of parameters, so that the noise (which most analytical methods assume follows a normal distribution) averages itself out as observations pile up and the true shape of the signal can emerge.
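A quick simulation of that averaging-out, with an invented signal and noise level:

```python
import numpy as np

rng = np.random.default_rng(3)
signal = 5.0

# The more independent noisy observations you pile up, the more the
# (assumed normal) noise averages away and the signal shows through.
for n in (10, 100, 10_000):
    sample = signal + rng.normal(scale=3.0, size=n)
    print(n, sample.mean())  # drifts toward 5.0 as n grows
```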

That is why you must report degrees of freedom when you do a statistical analysis. They tell you how much data redundancy you have. That redundancy lets you weigh the risk of overfitting the whole model (overfitting means you have fitted noise as if it were signal) and whether you are hacking R-squared by brute-forcing extra parameters into the model.
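A sketch of the R-squared hack with made-up data: plain R-squared always rises when you stuff in junk parameters, while adjusted R-squared, which charges each parameter against the degrees of freedom, pushes back:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=(n, 20))])  # 20 junk predictors

r2_small, r2_big = r_squared(X_small, y), r_squared(X_big, y)
print(r2_small, r2_big)  # the junk model "wins" on raw R^2

def adj_r2(r2, n, p):  # p = predictors excluding the intercept
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adj_r2(r2_small, n, 1), adj_r2(r2_big, n, 21))  # and loses once df is charged
```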

Integer degrees of freedom assume that you can cleanly count "how many independent pieces of information are in this dataset." But with something like a mixed-effects model, the data structure itself is complicated (the same person measured repeatedly, students within a class resembling one another). You cannot simply count independent pieces of information, because the data points are correlated. Each data point no longer gets a fair one-person-one-vote weight; some pieces carry only half a vote's worth of information, others maybe 0.8 of a vote.
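One standard way to put a number on those partial votes is the design-effect formula for clustered data (the numbers below are invented):

```python
# 20 classes, 10 students each, and students within a class are correlated
# with an intraclass correlation of rho = 0.3.
k, m, rho = 20, 10, 0.3
n_raw = k * m
n_eff = n_raw / (1 + (m - 1) * rho)  # design effect: 1 + (m - 1) * rho
print(n_raw, n_eff)  # 200 raw observations, about 54 "full votes" of information
```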

Since we cannot count them as integers, statisticians use a set of formulas to work backwards to an "effective" amount of information. That effective value is usually not an integer, so the degrees of freedom become a decimal. It estimates: if these data had come from some idealized simple situation, roughly how many independent observations would they be equivalent to?
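A well-known example is the Welch-Satterthwaite formula behind Welch's t-test, whose back-calculated degrees of freedom are almost never whole (the samples are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, size=12)   # two groups with unequal variances
b = rng.normal(0.5, 3.0, size=40)

# Welch-Satterthwaite: back out an "effective" df from the two sample
# variances. The answer is a decimal, not an integer.
va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
print(df)  # somewhere between min(n) - 1 and n_a + n_b - 2, and fractional
```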

The explanation that "degrees of freedom are the number of values free to vary" is unintuitive and even misleading. Although it is technically correct, that correctness is not very valuable, because almost nobody outside pure mathematics can really understand what it means.