Chat: Do p-values get a bad rap?

Editor’s note: The transcript below is lightly edited from the original for clarity. For a human-oriented explainer on what p-values are (and are not), see A Practical Primer on P-Values.

Noah: New chat time!

Darren: The answer to the question is yes, they do. Are we done?

Mollie: **flexes fingers, cracks knuckles**

Andrew: Good to see you all. Prepare to die.

Darren: You don’t need me if Andrew is here.

Noah: Good to have you both, maybe we can get you two into a fight

Darren: Have you seen him lift weights?

Noah: And with that excellent, excellent tone setting. LET’S GO!

Andrew: (starting gun fires, klaxon sounds)

Noah: Unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, what should we do about p-values? Scale of 0 (ban ALL the p-values) to 100 (I love p-values, especially in table 1). Everyone must give a number!

Mollie (postdoc in repro/perinatal pharmacoepi): hoo boy. 50.

Andrew (professor at University of Pittsburgh, clinical trial statistician): 67

Darren (epidemiologist and statistician, University College Cork): 100 – a time and a place

Laure (epidemiologist and statistician, Maastricht University): 70

Noah (postdoc in HIV/AIDS/causal inference econometrics/meta-science at UNC): I’m going 49

Alexander (postdoc in epidemiology at UNC): I’m going to give a 50, but I feel weird giving anything, since I think it’s easy to confuse p-values with null hypothesis significance testing (NHST)

Darren: +1 to Alexander

Andrew: Cripes, Alexander, we’ll get to NHST in a minute.

Alexander: Fine, then count me as a 50.

Noah: NEW CHALLENGE! Define a p-value in a single sentence (no semicolon cheats), that is both technically correct AND understandable by a human.

Darren: Irresponsible to even try. Grrr

Andrew: Agreed with Darren. Not possible to define in a single sentence without at least one semicolon or hanging clause.

Alexander: A p-value is the probability of observing a test statistic at least as large as that observed, given the null hypothesis is true, the statistical model is correct, and the data can be trusted.

Laure: I would go for “The chance of seeing your data (or a more extreme finding) if in reality there were no effect” (under the assumption that the model used to compute is correct and null hypothesis is no effect)

Mollie: The chance that you would see that magnitude of difference in groups if the real difference was zero, just randomly

Andrew: I’ll cede to Alex’s definition

Darren: Assuming that the data were generated by some null model, which includes a hypothesis and some assumptions, a p-value gives the tail area probability for the sampling distribution implied by said null model.

Mollie: You guys don’t talk to real humans much, huh?

Darren: Nope.

Mollie: Good move

Alexander: I don’t leave my windowless office much…

Noah: I’m still in PJs <editor’s note: it was 11:30am>

Andrew: OK. Now, for one that will be more understandable by a human: in the context of a single trial comparing a treatment against a placebo, a p-value of 0.03 means that there was a 3% chance of observing a difference equal to or larger than the difference we actually observed in the data if the treatment has no real effect.
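
Editor’s note: Andrew’s definition can be made concrete with a small simulation. The sketch below is a permutation test, which is one way (not the only way) to get a p-value for a treatment-versus-placebo comparison: if the treatment has no real effect, the group labels are arbitrary, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one. The outcome scores and the function name `permutation_p_value` are made up for illustration.

```python
import random

def permutation_p_value(treated, placebo, n_shuffles=10_000, seed=0):
    """Share of label-shuffled datasets whose absolute mean difference is
    at least as large as the difference actually observed."""
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(placebo) / len(placebo))
    pooled = list(treated) + list(placebo)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_shuffles):
        # Under "no real effect", any assignment of labels is equally plausible
        rng.shuffle(pooled)
        a, b = pooled[:n_treated], pooled[n_treated:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical outcome scores, invented for this example only
treated = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5, 6.8, 6.4]
placebo = [4.9, 5.2, 5.6, 4.7, 5.9, 5.0, 5.3, 5.4]

p = permutation_p_value(treated, placebo)
print(f"permutation p-value: {p:.3f}")
```

The printed number is an estimate of the tail probability in Andrew’s sentence, and it inherits every caveat raised in the chat: it says nothing about data quality, bias, or whether the model of “no effect” is the right comparison.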

Darren: But p-values aren’t for real humans. They are a single piece of information about a single estimate in a single study, from which nothing of substance should ever be drawn.

Noah: Why shouldn’t people ever draw from them anything of substance?

Laure: Because they are a random variable themselves. Get a new sample, you get a new p-value.
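
Editor’s note: Laure’s point is easy to see by simulation. The sketch below repeats the “same” study many times with an identical true effect and records the p-value each time. The effect size, sample size, and the known-standard-deviation z-test are simplifying assumptions chosen for illustration, not anything taken from the chat.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means, assuming a known
    common standard deviation (a simplification for this sketch)."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def replicate_p_values(true_effect, n_per_group, n_studies, seed=1):
    """Run the same two-group study many times and collect the p-values."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        a = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(a, b))
    return p_values

# Hypothetical design: a true effect of half a standard deviation, 30 per group
p_values = replicate_p_values(true_effect=0.5, n_per_group=30, n_studies=1000)
share_below = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p: {min(p_values):.4f}, largest p: {max(p_values):.4f}")
print(f"share of studies with p < 0.05: {share_below:.2f}")
```

Nothing about the underlying effect changes between runs, yet the p-values range from very small to quite large, which is why a single one is weak ground for a firm conclusion.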

Darren: To Alexander’s definition, are the data of good quality? What are the risks of bias? There are plenty of things to consider about any estimate. A p-value is just one of these.

Mollie: Darren, I’m very curious about the 100-endorsement of p-values (given appropriate time and place) vs. nothing of substance should ever be drawn

Noah: So if everyone is roughly agreed that it’s a pretty limited statistic, how did it become THE statistic?

Darren: Once people wrote books with sampling distributions (the really hard part), then calculating p-values was relatively easy.

Andrew: Born from a demand to have a simple, “easy to understand” (ha!) way to make a conclusion about whether an effect is ‘real’ or not.

Laure: It became THE statistic because it is practical and using it with an arbitrary cut-off leaves no room for uncertainty (in users’ minds)

Andrew: People are uncomfortable embracing uncertainty, and much happier to have a yes/no answer from any given statistical test

Mollie: I mean, they’re very useful if you want to talk about things like long runs of heads or tails on successive coin tosses

Alexander: People couldn’t understand the writings of Neyman and Pearson (NP), and bastardized the writings of Fisher, and they came out with our current use of p-values.

Darren: To Mollie’s point – all of these inferential tools are useful fictions. I think Alexander probably hit on it. (accidental mashing of Fisher and NP)

Noah: Let’s get wonky! What’s the mashup?

Alexander: Fisher: 1) Specify a null hypothesis (that can be anything, doesn’t have to be no effect), 2) Compute p-value, 3) P-value tells you compatibility of data with hypothesis. Neyman-Pearson: 1) Specify a null and an alternative hypothesis, 2) Calculate p-values under each hypothesis, 3) Accept hypothesis with greater p, reject hypothesis with smaller p

Darren: In my mind, I think NP makes some sense if you do all the work up front around your alternative hypothesis. Fisher’s continuous interpretation of p-values makes sense post-data. But treating the post-data p like a hard decision rule without engaging in the NP manner is problematic.

Alexander: I agree with Darren. The big thing with NP is they emphasize carefully choosing alpha, beta, null and alternative hypotheses to be question-specific. Basically, NP want you to carefully consider the cost/benefits of false positives and false negatives, and to design the hypothesis test accordingly

Laure: What about NP with composite hypotheses? Even that becomes complicated rather quickly…

Alexander: To add to Andrew’s point, they are not only uncomfortable embracing uncertainty, but they are also uncomfortable making what may seem like subjective decisions. We thus get everyone using alpha=0.05, etc.

Darren: But this is again moving from p-values to null hypothesis significance tests

Andrew: Good point, Darren. I just can’t help myself, apparently.

Noah: Why is it that we can’t seem to separate p-values and NHST? Just to point out, several of you used/implied NHST in your definitions.

Andrew: Noah, that’s a good question. I think the answer lies in Alexander’s earlier comment: “they are not only uncomfortable embracing uncertainty, but they are also uncomfortable making what may seem like subjective decisions”

Laure: I agree

Alexander: People can call me Alex

Andrew: Can I call you Al?

Noah: You can call me Betty.

Darren: *groans*

Noah: What is NHST? And why are the two ALWAYS together?

Laure: Why? Because that’s how it is being taught?

Andrew: NHST (human definition): testing the compatibility of data against a null hypothesis (such as “the difference between the two groups = 0”), and drawing a conclusion based on the probability of observing some data under the assumed conditions of your null hypothesis. If the probability of the observed data is low, we reject the null hypothesis, and conclude that the difference between the two groups ≠ 0. If the probability of the observed data is not low, we “fail to reject” the null hypothesis, though this is commonly confused with “accepting” the null hypothesis.

Good lord, our “human definitions” need some work.

Mollie: We have GOT to get out more

Noah: I think that’s the key though. P-values are really really hard to get a handle on, the complication is important, and they are REALLY easy to misunderstand

Darren: The only way to make any of it human readable IMO is to have humans who understand the useful fiction that is a sampling distribution.

Noah: So back to Laure’s earlier point: How is it being taught?

Mollie: I think you have to also add “and to whom?” Because huge numbers of people get their statistics education in 2 weeks in medical or pharmacy school.

Darren: How is it being taught? Hard to know. But it is obviously passed down from senior scientist to junior researcher all over the place in medicine.

Andrew: Because of the interweaving of p-values and NHST, a p-value is often viewed as a dichotomous piece of information (small p-value  -> evidence of ‘real’ effect, not-small-p-value -> no evidence of ‘real’ effect) rather than being viewed as a continuous piece of information about the compatibility of some observed data with a hypothesis.

Mollie: I got a lot of “the p-value isn’t perfect but it’s too hard to explain other methods so let’s go with it” explanations in my MPH program, for example, especially around model building and variable selection

Laure: Oh yes, variable selection! Very true…

Noah: <squints menacingly at potential thread drift>

Mollie: No no, I can keep this on topic! I think. There are, I think, lots of people who avoid p-values for the main effect estimate because they’ve heard the p-value is weird, but who see no issue with using it to decide how to retain variables in a model.

Alex: I think the problem is that it seems like we all agree to some degree on what p-values and NHST are, but there are so many places where these things are used incorrectly so it’s easy to go down rabbit holes.

Darren: Mollie has a good point though. Epidemiologists in good programs learn that you don’t select covariates based on p-values (for prediction or causal problems). So they are somewhat inoculated from the start.

Noah: We’ve talked a lot about how RESEARCHERS get it wrong. What about “humans?”

Darren: This is the most human contact I’ve had since Christmas. But why should they need to get it right? It’s complicated and hard.

Mollie: I feel like this group will appreciate this anecdote about talking to humans in which I was telling a very smart, PhD in something not-science about a 70% increased risk of cancer in some study and he said “70% of people got cancer??”

Andrew: Humans (readers of research, anyway) typically look at a paper and hunt for p-values; when they see small p-values, they take that as evidence that there is a ‘real’ effect; when they don’t see small p-values, they conclude the paper has minimal interesting information.

Noah: Is it just normal people who do that? Or do researchers, journal editors, doctors, policy wonks, etc do that too?

Mollie: No! lots of people do that

Darren: Doctors for certain.

Andrew: Overall understanding of probabilities is generally poor.  I’ve had stories very similar to Mollie’s, even working with people who have spent decades working in some form of biomedical research.

Laure: I’m afraid editors do it too. I wish I was making this up, but I attended a talk of an editor of a pretty important journal for the doctoral school of a medical faculty on how to write papers. He admitted that editors who decide on sending out papers for peer review largely skip the methods and results section (except for some figures and tables) and go straight to conclusions. That leaves them vulnerable to whatever spin the authors put in there.

Andrew: Cripes, and I thought *I* was the cynic of this group.

Alex: Even people who produce the research themselves do it. David Spiegelhalter had an example of a paper with a 25% reduction in mortality (a huge effect) and p=0.06, and the researcher said there was no effect in the paper and presentation.

Mollie: So one of the threads I see coming out of these discussions a lot is, “well we have to decide on actions somehow,” which is true. Docs have to decide on treatments, governments need to decide where to allocate policy money, etc

Alex: Yeah, but to say they make those decisions just based on a p-value is silly.

Mollie: Of course it is! But the p-value is so ingrained, and so many people *think* they understand it

Andrew: Great point, Mollie. That is a very common defense of the NHST paradigm: “we have to make decisions somehow” – combined with Alex’s earlier point that people are uncomfortable making decisions that seem subjective.

Mollie: I am 100% in Camp Estimate & Replicate

Andrew: Many people take comfort in using p<0.05 as the rule for making those decisions because…well, just because.

Alex: Because you don’t have to justify it!

Mollie: Right, so the p-value, and especially p<.05 has become a kind of fig leaf of objectivity

Alex: It’s just accepted. If I chose p<0.10 as my cutoff the reviewers would have a field day. But p<0.05 there isn’t even a comment.

Darren: Even if you roll NP, people skip consideration of the relative costs of false + and –

Noah: I roll Neyman Pearson is an excellent, excellent bumper sticker

Andrew: Very much agreed with Darren (mindless use of p=0.05 and 80% power for all trials instead of considering the relative costs of a wrong decision in each direction) and Alex (that 0.05 is automatically defensible because everyone else does it, but god forbid I suggest p=0.10 for an alpha level in an academic journal)

Laure: Trying to interpret CI limits is rather difficult, takes more space in journals… hard to get the message across

Darren: And CI limits are just a set of tests. There. I said it.
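
Editor’s note: Darren’s remark can be checked directly. The sketch below tests a whole grid of hypothesized values and keeps every value that a two-sided test does not reject at alpha = 0.05. The kept values run from the lower to the upper limit of the usual 95% interval. The estimate, standard error, and normal approximation are hypothetical choices for illustration.

```python
from statistics import NormalDist

def p_value_for(hypothesized, estimate, se):
    """Two-sided z-test p-value for one hypothesized value of the parameter."""
    z = (estimate - hypothesized) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def interval_from_tests(estimate, se, alpha=0.05, half_width=5.0, steps=20_000):
    """Scan a grid of hypothesized values and keep those NOT rejected at alpha."""
    start = estimate - half_width * se
    step = 2 * half_width * se / steps
    kept = []
    for i in range(steps + 1):
        value = start + i * step
        if p_value_for(value, estimate, se) >= alpha:
            kept.append(value)
    return kept[0], kept[-1]

# Hypothetical estimate and standard error
estimate, se = 0.40, 0.25

test_lo, test_hi = interval_from_tests(estimate, se)

# The familiar "estimate plus or minus 1.96 standard errors" interval
z_crit = NormalDist().inv_cdf(0.975)
wald_lo, wald_hi = estimate - z_crit * se, estimate + z_crit * se

print(f"values not rejected: {test_lo:.3f} to {test_hi:.3f}")
print(f"95% CI by formula:   {wald_lo:.3f} to {wald_hi:.3f}")
```

The two intervals agree up to the grid spacing, which is the sense in which a frequentist confidence interval is a set of tests. It also means that checking whether the interval covers the null is NHST by another route.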

Mollie: Didn’t a journal recently respond to the replicability crisis by proposing a new alpha of .001 or something?

Laure: John Ioannidis proposed to set the threshold to 0.005. I think he saw it more as a way to sift through published articles claiming significance, following his piece on Why most published research findings are false.

Darren: How many authors was that one?

Andrew: hahahahahaha

Mollie: ZING

Andrew: Though in the same paper, John Ioannidis also basically said that this wasn’t a very good solution, either, just something easy to do right away. Basic and Applied Social Psychology straight banned p-values altogether.

Darren: And BASP was apparently a trainwreck afterwards

Andrew: Right. A year or two later, it was just chaos. People had no clue how to write papers without p-values.

Alex: Changing the cutoff is fine, I guess, as long as they also publish null findings.

Alex: It won’t fix any replication crises, but it won’t hurt either I suppose.

Noah: Is it that people don’t know how to write papers without p-values, or papers without NHST?

Mollie: I get into fights with reviewers about it a lot.

Darren: Epidemiologists do it all the time. Not to quote my cv, but I have two stats/model focused papers in the International Journal of Epidemiology, neither mentions “statistical significance”, per journal standards. Epidemiology and American Journal of Epidemiology are the same.

Alex: The last two papers I published in clinical journals don’t include anything about statistical significance, and the confidence intervals (barely) include the null. Amazingly I didn’t even get a comment from the reviewers (though I did get a comment from a coauthor)

Mollie: A student’s manuscript, where she reported RR 2.0, 95% CI 0.95 to 3.5 or whatever, got a comment about how she couldn’t say there was an increased risk. And she didn’t say significantly increased! Just increased.

Darren: Ugh

Mollie: BUT, to refute my own point slightly, if the RR had been, say, 1.15, how wide would CI need to be before we’d all call it null?

Andrew: For many people the two concepts are inextricable. I have an interesting story of helping analyze and publish a trial with a secondary endpoint with a result of (something like) HR=0.85, 95% CI 0.68-1.02, p=0.07, and my phrase in the results, “(Event) was lower in Arm A versus Arm B (HR=0.85, 95% CI 0.68-1.02, p=0.07),” was hammered by 2 of the 3 reviewers with “you can’t say that the risk was lower.” Basically, the same exact story that Mollie is telling above. I wanted to include the p-value because, as we’ve touched on several times, I think it’s a continuous piece of information about how incompatible the data are with a null hypothesis of no effect. Perhaps that means I can’t write a paper without NHST, either. Blargh.

Alex: Charlie Poole proposes reporting the p-value function, which I doubt anyone will do, but that gives compatibility with ALL hypotheses. Which makes it clear to see that the point estimate is most supported by data. And that there are lots of points more supported than the null.
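
Editor’s note: a rough sketch of the p-value function idea Alex describes, under stated assumptions. Instead of one p-value against the null, it computes a p-value for each of several hypothesized hazard ratios, using a normal approximation on the log scale. The point estimate and standard error are hypothetical and are not taken from any study mentioned in the chat. The name `p_value_function` is ours.

```python
import math
from statistics import NormalDist

def p_value_function(log_estimate, log_se, hypothesized_ratios):
    """Two-sided p-value for each hypothesized ratio, using a normal
    approximation on the log scale."""
    results = {}
    for ratio in hypothesized_ratios:
        z = (log_estimate - math.log(ratio)) / log_se
        results[ratio] = 2 * (1 - NormalDist().cdf(abs(z)))
    return results

# Hypothetical inputs: a hazard ratio of 0.85 with a log-scale standard error of 0.10
log_estimate = math.log(0.85)
log_se = 0.10

curve = p_value_function(
    log_estimate, log_se, [0.60, 0.70, 0.80, 0.85, 0.90, 1.00, 1.10]
)
for ratio, p in curve.items():
    print(f"hypothesized HR = {ratio:.2f}: p = {p:.3f}")
```

The p-value peaks at the point estimate and falls away on both sides, and in this made-up example several hazard ratios below 1 are more compatible with the data than the null value of 1.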

Noah: That brings us to the elephant in the room

Darren: Hey! I’m on a diet


Darren: Oh, that

Mollie: Oh my. OK, who signed? (I did not, but I would have if I’d remembered to email them)

Alex: <raises hand>

Noah: <raises hand, waits for someone to high five>

Darren: Nope

Laure: I did

Andrew: I also did not sign. After some deliberation.

Noah: <is left hanging, puts hand down and pretends it’s cool> What was, and importantly, WAS NOT, in the letter?

Alex: What was in it: a call to stop NHST, what was not: a call to stop using p-values

Noah: Was it just me, or did we see a ton of people treating it like a ban on p-values, when it explicitly, unambiguously said otherwise?

Alex: Yeah I saw that, but nobody reads things

Andrew: Yes, I think a lot of the reaction treated it as a proposed ban on p-values

Laure: People just read the title?

Mollie: You mean people were dichotomizing the message unnecessarily?

Noah: BOOM goes the dynamite

Andrew: Score one for Mollie

Darren: **applause**

Mollie: thanks, I’ll be here all week, try the veal

Noah: So it seems like the two ideas are inseparable in most people’s minds, even when we go through great efforts to avoid it. So even if we all agree (ish) that NHST is the bigger problem. What happens if we actually banned p-values? Would NHST go with it? Just get replaced? Would the replacement change anything?

Darren: People would do the same stuff with Bayes factors or posterior credible intervals.

Alex: I would guess people would just report confidence intervals and use them for implicit NHST

Andrew: +1 to Darren. Was going to say exactly the same thing. One thing for us all to chew on:

This guy is an editor of a major journal and highly willing to engage in discussions about stats and methods on twitter. A real problem is that there seems to be a thirst to just get rid of the p-value and replace it with something else (this ONE WEIRD TRICK TO MAKE YOUR STATS BETTER!) which we all know is a nonsense task.

Laure: Even cost-based analyses…

Alex: Or yes, as Darren says, posterior credible intervals using noninformative priors

Noah: I have a slightly dissenting opinion here (surprising no one): there is something psychologically different about things like CIs which actually might change things over the long run. Plus the usual “wrong” interpretation of a CI is way less wrong than the usual “wrong” interpretation of a p-value.

Mollie: I do think the CI (whatever the C stands for) is an improvement over the p-value, even if it’s just used for NHST

Andrew: Which is why I remain a (tepid) defender of the p-value in general, though I’d like to see it eradicated from some obvious misuses (p-values in Table 1 of RCTs, for one). Banning p-values will not get rid of the desire to make NHST-like decisions.

Darren: Even with informative or regularizing priors…people can/will still find ways to make clear cut conclusions based on an estimate. From a background in epi, I was *shocked* to learn there were experimentalists who *just* report p-values with no effect sizes.

Alex: Good point, Darren

Laure: Well it just comes back to what we discussed already: people want to know what to do and hence will oversimplify.

Alex: Maybe we can say what the alternative to NHST would be? Reporting point estimates with a reasonable measure of uncertainty, embracing the uncertainty, and being willing to make decisions with uncertainty present?

Laure: Yes. I doubt whether the scientific community will be prepared to do that. This reminds me of Latour’s work in the 80s. He did ethnographic work in labs and described how science basically is messy, inconclusive findings. And he contrasted that with ready-made science: after peer review and publishing, everything was treated as if the evidence always spoke for itself. It became a “scientific fact.” He got fierce critique from the scientific community, because they could no longer claim objectivity and to hold the truth.

Noah: This is one of the key, central difficulties in meta-science: It’s extremely tough to communicate critique of scientific institutions and practice vs science as a whole.

Laure: Other question: would CIs lead to less reporting and publication bias and better meta-analyses?

Noah: I think no in the short run, yes in the long-run

Darren: Frequentist CIs are still just sets of tests…so I don’t think so.

Mollie: There’s an added complication, possibly, in that journals that explicitly prefer CIs also seem to lean less on NHST

Noah: I don’t think that’s either coincidence or confounding. CIs bring with them the idea of ranges and uncertainty. P-values don’t do that quite as well.

Mollie: I guess I was thinking, do those journals embrace CIs because they’re already more comfortable with uncertainty, or does using more CI lead to more comfort with using ranges? What direction does the arrow go?

Darren: But I object to the often-touted idea that the 95% CI says everything an exact p-value says, because it doesn’t.

Alex: Are there journals that still don’t make you give the point estimate but rather just a simple yes/no to significance?

Darren: I don’t know. Probably still common in in vivo experimental work.

Andrew: I am currently grading class assignments for our clinical trials course (which means I get exposed to a ton of articles that I may not otherwise read) and I’ve seen even some RCTs that include p-values for the primary treatment comparison but no summary data on the outcome itself or the difference between groups

Darren: I think another major factor is the role of randomization. In fields where it can’t be used, the bulk of the hard work is on dealing with bias.

Alex: YES Darren!!

Andrew: Agreed, again. One element that we’ve left out of this discussion (we can only cover so much in one chat…) is that the use of p-values (and even NHST) in more controlled settings may be more acceptable. One problem that some statisticians point out is that p-values aren’t really p-values any more when the entire data generation procedure and choice of statistical model was modifiable along the way.

Noah: The question that never gets asked: p-value of what? P-values tell you little to nothing about the model itself. And as in your example, the model is the important bit in these kinds of things. So, you often get a “significant” p-value of a meaningless parameter.

Andrew: One of the reasons we’ve all grown so cranky about stepwise regression, for example.  Do the p-values in a stepwise-built regression model retain their original meaning?

Mollie: Ugh, that. Taken out of context, that would be a nice way to promote p-hacking.

Andrew: A single p-value for a single NHST of the primary outcome in a single randomized controlled trial with a prespecified analysis plan at least does retain its original meaning (though you may still argue about the absurdity of treating p=0.045 and p=0.055 as dramatically different results; that’s a separate problem). I’m not even sure what the p-values mean in the setting of a multivariable regression model where the authors may have peeked at the results a couple times, changed which variables to keep in or remove from the model, changed which people to include in the analysis, etc.
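
Editor’s note: a simplified illustration of why p-values lose their meaning once the analysis is allowed to move. This is not stepwise regression itself. It is a stand-in in which an analyst compares two groups on several pure-noise variables and reports only the smallest p-value. All settings (10 variables, 20 people per group, a known-variance z-test) are assumptions made for the sketch.

```python
import random
from statistics import NormalDist, mean

def smallest_p_among_noise(n_variables, n_per_group, rng):
    """Compare two groups on several pure-noise variables and return
    only the smallest p-value, as a selective analyst might."""
    se = (2 / n_per_group) ** 0.5  # known unit variance in both groups
    smallest = 1.0
    for _ in range(n_variables):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(a) - mean(b)) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        smallest = min(smallest, p)
    return smallest

rng = random.Random(2)
n_analyses = 2000

# One prespecified comparison per analysis
one_look = sum(
    smallest_p_among_noise(1, 20, rng) < 0.05 for _ in range(n_analyses)
) / n_analyses

# Ten comparisons per analysis, keeping only the best-looking one
ten_looks = sum(
    smallest_p_among_noise(10, 20, rng) < 0.05 for _ in range(n_analyses)
) / n_analyses

print(f"share 'significant' with 1 prespecified test: {one_look:.3f}")
print(f"share 'significant' with best of 10 tests:    {ten_looks:.3f}")
```

With one prespecified test, about 5% of these no-effect analyses come out “significant,” as intended. With ten independent tests the chance that at least one does is 1 − 0.95^10, roughly 40%, so the reported p-value no longer means what its definition says.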


Darren: But I haven’t goaded anyone into a fight about CIs and “precision” yet!

Noah: Darren and Alex: quick summary of the problems w/ technical stuff.

Alex: *gulp* how did I get selected to do technical!?

Noah: because you uttered the names Neyman, Pearson, Fisher, and Poole. You and Darren dug your graves.

Darren: Can I share that I almost failed Charlie Poole’s epi class at UNC? Well…I did.

Mollie: A lot of science is constructing a narrative around uncertain estimates. P-values, and particularly their use in NHST, allow people to make decision rules– it feels like a way to let science speak for itself. But P-values are routinely misunderstood by both producers and consumers of research, NHST leads to publication bias, and we’re all spiraling into a meta-science pit of despair from which there is no escape.

Alex: Technical summary: P-values are most often used for NHST, which, as used today, is a bastardization of the approaches originally laid out by Fisher and NP, which emphasized that p-values are continuous (Fisher) and that cost/benefits should be considered when designing hypothesis tests (NP). Instead, people often test against the null hypothesis of no effect with, typically, alpha=0.05, and make dichotomous yes/no decisions based on the results.

Laure: NHST is still omnipresent in introductory stats courses, to the point that researchers do not even realize it is contested.

Darren: The technical problem with p-values is that they are in fact technical. Their use requires careful consideration. 100+ years of conversations have been polluted with “gotcha” talking points among statisticians, leaving researchers to fend for themselves. In the face of poor incentives, people have taken the path of least resistance, turning an absolute genius of an idea into a silly caricature of objective reasoning from data. This problem won’t be sorted until we stop looking for things like “p-values that everyone can understand easily”. Fin.

Andrew: My goodness, I think my work is done.  Darren, Mollie, and Alexander all have covered things so beautifully.

Noah: 😍 It’s . . . glorious

Andrew: Oh, boy, and Laure just hit on another point that we ought to talk about more. NHST is still omnipresent in introductory stats courses, both inside and outside of stats departments, and a nontrivial number (read: significant majority) of researchers that I work with or encounter are not even aware that this is a controversy.

Noah: Yes, indeed, but not today!

Noah: ONE MORE TIME! Unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, what should we do about p-values? Scale of 0 (ban ALL the p-values) to 100 (I love p-values, especially in table 1).

Mollie: 48 from 50

Darren: Not fair with the table 1 quip

Noah: Literally the first word is “unfair.” I’m going from 49 to 51.

Laure: I’m going from 70 to 60

Alex: I’ll go from 50 to 51

Andrew: I’ll stick with 67. There are certain situations where they’re clearly nonsense and ought not to be used.

Laure: regression to the mean?

Darren: 100 – a time and a place

Mollie Wood is a postdoctoral researcher in the Department of Epidemiology at the Harvard School of Public Health. She specializes in reproductive and perinatal pharmacoepidemiology, with methods interests in measurement error, mediation, and family-based study designs. Tweets @Anecdatally

Andrew Althouse is an Assistant Professor at the University of Pittsburgh School of Medicine.  He works principally as a statistician on randomized controlled trials in medicine and health services research.  Tweets @ADAlthousePhD

Darren Dahly is the Principal Statistician of the HRB Clinical Research Facility Cork, and a Senior Lecturer in Research Methods at the University College Cork School of Public Health. He works as an applied statistician, contributing to both epidemiological studies and clinical trials, and is interested in better understanding how we can improve the nature of statistical collaboration across the health sciences. Tweets @statsepi.

Laure Wynants is Assistant Professor at the Department of Epidemiology at Maastricht University (The Netherlands) and post-doctoral fellow of the Research Foundation Flanders at KU Leuven (Belgium). She works on methodological and applied research focusing on risk prediction modeling. Tweets @laure_wynants

Alex Breskin is a postdoctoral researcher in epidemiology at the UNC Gillings School of Global Public Health. His methodological research focuses on study designs and methods to estimate practice- and policy-relevant effects from observational data. Substantively, he focuses on the evolving HIV epidemic in the United States. Tweets @BreskinEpi

Noah Haber is a postdoctoral researcher at the Meta-Research Innovation Center at Stanford University (METRICS), specializing in meta-science from study generation to social media, causal inference econometrics, and applied statistical work in HIV/AIDS in South Africa. He is the lead author of CLAIMS and XvY and blogs here. Tweets @NoahHaber.
