A practical primer on p-values

Noah Haber
There is a recent trend among health statisticians to discourage the use of p-values, which are commonly used to define a threshold at which something is "statistically significant." Statistical significance is often viewed as a proxy for "proof" of something, which in turn is used as a proxy for success. The thought goes that the obsession with statistical significance encourages poorly designed (i.e. weak and misleading causal inference), highly sensationalized studies that have significant, but meaningless, findings. As a result, many have called for reducing the use of p-values, if not outright banishing them, as a way to improve health science.

However, as usual, the issue is complicated in important ways, touching on issues of technical statistics, popular preference, psychology, culture, research funding, causal inference, and virtually everything else. To understand the issue and why it’s important (but maybe misguided) takes a bit of explaining.

We’re doing a three-part set of posts, explaining 1) what p-values do (and do not) mean, 2) why they are controversial, and 3) an opinion on why the controversy may be misguided. Here we go!

What does a p-value tell us?

A p-value is a measure that helps researchers distinguish their estimates from random noise. P-values are probably easiest to understand with coin flips. Let's say your buddy hands you a coin that you know nothing about, and bets you $10 that it'll flip heads more than half the time. You accept, flip it 100 times, and it's heads 57 times and tails 43. You want to know if you should punch your buddy for cheating by giving you a weighted coin.

To start answering that question, we usually need to have a hypothesis to check. For the coin case, our hypothesis might be "the coin is weighted towards heads," which is another way of saying that the coin will flip heads more than our baseline guess of an unweighted 50/50-flipping coin. So now we have a starting place (a theoretical unweighted coin, or our "null hypothesis"), some real data to test with (our actual coin flips), and something to test for ("the coin is weighted towards heads," or our "alternative hypothesis").

One way we could compare these hypotheses is if we had a coin which we knew was perfectly balanced and we 1) flipped it 100 times, 2) recorded how many were heads, 3) repeated 1 and 2 an absurd number of times (each of which being one trial), and 4) counted how many of our trials had 57 or more heads flipped and divided that count by the total number of trials. Boom, there is an estimated p-value. Alternatively, if you know some code, you could simulate that whole process in a program without knowing any probability theory at all.
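That simulation is only a few lines of code. Here's one possible sketch in Python (function and parameter names are mine, not from any particular library): it runs many 100-flip trials of a perfectly fair coin and counts how often a trial lands on 57 or more heads.

```python
import random

def simulate_pvalue(observed_heads=57, n_flips=100, n_trials=50_000, seed=0):
    """Estimate P(heads >= observed_heads) for a fair coin by brute force.

    Each trial flips a fair coin n_flips times; the p-value estimate is the
    fraction of trials at least as extreme as what we actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    extreme_trials = 0
    for _ in range(n_trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme_trials += 1
    return extreme_trials / n_trials

print(simulate_pvalue())  # hovers around 0.097, i.e. roughly 1 in 10 trials
```

With 50,000 simulated trials the estimate bounces around the true value by a few thousandths, which is plenty for a punching decision.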

Even better, we can calculate that number directly without having to flip a bajillion coins, using what we know about probability theory. In our case, p=0.0968. That means that, had you done this 100-flip test infinite times with a perfectly balanced coin, 9.68% of those 100-flip tests would have had 57+ heads.
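The direct calculation just adds up binomial probabilities: the chance of exactly 57 heads, plus 58, and so on up to 100, out of the 2^100 equally likely flip sequences of a fair coin. A minimal sketch using only Python's standard library (the function name is mine):

```python
from math import comb

def exact_pvalue(observed=57, n=100):
    """P(X >= observed) for X ~ Binomial(n, 0.5), computed exactly.

    comb(n, k) counts the flip sequences with exactly k heads; each
    sequence has probability 1 / 2**n under a fair coin.
    """
    favorable = sum(comb(n, k) for k in range(observed, n + 1))
    return favorable / 2**n

print(exact_pvalue())  # matches the ~0.0968 quoted in the text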

To interpret a p-value, you might say "If the coin wasn't weighted, the expected probability that we would have gotten 57 or more heads out of 100 had we repeated the experiment is 9.68%," or "9.68% of trials of 100 flips would have resulted in the same or greater number of heads than we found in our experiment."

Should you punch your buddy? That depends on whether that number above is meaningful enough for you to make a punching decision. Ideally, you would set a decision threshold a priori, such that if you were sufficiently sure the game was sufficiently rigged (note the two “sufficiently”s), you would punch your buddy.

The default threshold for most research is p≤0.05, or 95% "confidence," to indicate what is "statistically significant." With our p=0.0968, it isn't "significant" at that level, but it is significant if we make our threshold p≤0.10, or 90% confidence.

If you wanted to interpret your p-value in terms of the threshold, you might say something like “We did not find sufficient evidence to rule out the possibility that the coin was unweighted at a statistical significance threshold of 95%.”

A p-value is a nice way of indirectly helping answer the question "how sure can I be that these two statistical measures are different?" by comparing what we actually observed with an estimate of what we would have expected to observe if they weren't. That's it.

Things get a bit more complicated when you step outside of coin flips into multivariable regression and other models, but the same intuition applies. Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest, what is the probability that you would have gotten an estimated value at least as big as you actually did if the process of producing that data was null?

What DOESN’T a p-value tell us?

Did you notice that I snuck something in there? Here it is again:

“Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest…”

A p-value doesn't tell you much about the model itself, except that you are testing some kind of difference. We can test if something is sufficiently large that we wouldn't expect it to be due to random chance, but that doesn't tell you much about what that something is. For example, suppose we find that chocolate consumption is statistically significantly associated with longevity. What that practically means is that people who ate more chocolate lived longer (usually conditional on / adjusted for a bunch of other stuff). What that does NOT mean is that eating more chocolate makes you live longer, no matter how small your p-value is, because the model itself is generally not able to inform that question.

The choice of threshold is also arbitrary. A p-value of .049 vs. 0.051 are practically exactly the same, but one is often declared to be “statistically significant” and the other not. There isn’t anything magical about the p-values above and below our choice of threshold unless that threshold has a meaningful binary decision associated with it, such as “How sure should I be before I do this surgery?” or “How much evidence do I need to rule out NOT cheating before I punch my buddy?” 95% confidence is used by default for consistency across research, but it isn’t meaningful by itself.

P-values don't tell us whether or not our two statistical measures are actually different from each other in the real world. In our coin example, the coin is either weighted (probability that it is weighted is 100%) or it isn't (probability is 0%). Our estimate of the probability of getting heads gets more precise the more data we have. If the coin were weighted towards heads, our p-values would get closer to 0 if the weighting were larger (i.e. more likely to flip heads), but ALSO if we used the same weighted coin and just flipped it more. The weight didn't change, but our p-value did.

Here's another way to think about it: the coin in the example was almost certainly weighted at least slightly, if only because it's near impossible to make a fully unweighted coin. Choosing whether or not to punch your buddy depends both on how much weighting you believe is acceptable AND how sure you want to be. A p-value indirectly informs the latter, but not the former. In almost all cases in health statistics, the "true" values that you are comparing are at least a little different, and you'll see SOME difference with a large enough sample size. That doesn't make it meaningful.

P-values can be really useful for testing a lot of things, but in a very limited way, and in a way that is prone to misuse and misinterpretation. Stay tuned for Part II, where we discuss why many stats folks think the p-value has got to go.

Thoughts and comments welcome