Proposed framework for peer review in open science

A few weeks ago, I was approached to guide public-facing strength of evidence assessments for two separate projects related to COVID-19. I got hold of a few folks (main credit to Emily Smith here) to create a framework that accomplishes two main goals: 1) frame strength of evidence review against a near-universal fixed standard, and 2) translate strength of evidence into its applicability to decision-making for public audiences.

In the interest of getting these ideas circulating and getting critique, feedback, and suggestions, I wanted to make this framework public. If you like the ideas, feel free to steal, remix, or revise them. If you don’t, let us know why!

Framework for peer review in open science

Consider the main claims and question of interest of the paper, drawing particularly from the abstract and title. Consider a hypothetical ideal study to test that question of interest in the population of greatest interest without regard to whether such a study is feasible, ethical, or even physically possible. 

Examples (in drop-down, mouseover, etc.) For example, if the study is about the impact of city masking orders, the hypothetical ideal study might be a cluster randomized controlled trial where every city was randomly assigned to have masking orders. However, for the question of whether masks are protective, the hypothetical ideal study might involve random assignment of actually using masks on an individual level (with perfect compliance and adherence). A hypothetical ideal study for a diagnostic or screening test might involve testing against a fictional perfect test with no false positives or negatives. The hypothetical ideal study will be specific to each question of interest.

Now consider the following categories for strength of evidence:

Strong: The main study claims are very well-justified by its methods and data. There is little room for doubt that the study produced results and conclusions very similar to those of the relevant hypothetical ideal study. The study’s main claims should be considered conclusive and actionable without reservation.

Reliable: The main study claims are generally justified by its methods and data. The results and conclusions are likely to be similar to the hypothetical ideal study. There are some minor caveats or limitations, but they would/do not change the major claims of the study. The study provides sufficient strength of evidence on its own that its main claims should be considered actionable, with some room for future revision.

Potentially informative: The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.

Not informative: The flaws in the data and methods in this study are sufficiently serious that the data and methods do not substantially justify the claims made. It is not possible to say whether the results and conclusions would match those of the hypothetical ideal study. The study should not be considered as evidence by decision-makers.

Misleading: Serious flaws and errors in the methods and data render the study conclusions misinformative. The hypothetical ideal study is at least as likely to contradict this study’s results and conclusions as to agree with them. Decision-makers should not consider this evidence in any decision.

At-a-glance table

| Category | Claims are _ by the methods and data | Decision-makers should consider the claims in this study _ based on the methods and data |
| --- | --- | --- |
| Strong | very well-justified | actionable without reservation |
| Reliable | generally justified | actionable with limitations |
| Potentially informative | not strongly justified, but may yield some insight | not actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it |
| Not informative | not substantially justified | not actionable |
| Misleading | not at all justified | misinformative |
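
For anyone who wants to build the framework into a review form or tool, here is one minimal sketch in Python of the at-a-glance categories as a lookup structure. The names `Rating`, `RATINGS`, and `summarize` are illustrative assumptions, not part of the framework itself; the wording of each entry is copied from the categories above.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    label: str          # category name
    justification: str  # how well the claims are justified by methods and data
    guidance: str       # how decision-makers should treat the claims


# Ordered from strongest to weakest evidence.
RATINGS = [
    Rating("Strong", "very well-justified", "actionable without reservation"),
    Rating("Reliable", "generally justified", "actionable with limitations"),
    Rating(
        "Potentially informative",
        "not strongly justified, but may yield some insight",
        "not actionable, unless the weaknesses are clearly understood and "
        "there is other theory and evidence to further support it",
    ),
    Rating("Not informative", "not substantially justified", "not actionable"),
    Rating("Misleading", "not at all justified", "misinformative"),
]


def summarize(label: str) -> str:
    """Return a one-sentence summary for a category name (case-insensitive)."""
    wanted = label.strip().lower()
    for rating in RATINGS:
        if rating.label.lower() == wanted:
            return (
                f"{rating.label}: claims are {rating.justification} by the "
                f"methods and data; decision-makers should consider them "
                f"{rating.guidance}."
            )
    raise ValueError(f"Unknown category: {label!r}")
```

Calling `summarize("reliable")` returns the sentence for the Reliable category. A reviewer would still need to write out the "why" in free text, since the category alone does not carry the reasoning.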

Which of these categories best represents your view of the methods, data, results, and claims from the study, and why?

[Review of study goes here, with reference to the above question]

Chat: Trial by Virus

This was an organized chat about how clinical trials are impacting and are impacted by COVID-19. The original chat took place on April 15, 2020. Since then, several major trials have had partially or fully published results. The transcript below is revised and edited from the original for clarity.

Andrew (professor, clinical trial statistician, at University of Pittsburgh): Let’s get ready to rummmmmmmble

Noah (postdoc in meta-science, METRICS): We are starting off the day with the space jam theme song. Not bad. Where’s Darren?

Mollie (postdoc in repro/perinatal pharmacoepi): He’s probably still fighting about random confounding on Twitter.

Darren (epidemiologist and statistician, University College Cork), several minutes later: Wut up.

Noah: Here we go! As is tradition, we start off with an unfair, possibly meaningless question: Given all the clinical trial evidence we have seen to date, how well does any combination of hydroxychloroquine, remdesivir, azithromycin, or any other pharmaceutical work for treating COVID-19? The scale goes from 0 (does absolutely nothing) to 100 (instant cure, COVID-19 solved). Everyone must answer.

Andrew: If I have to give a number, 14.

Darren: Is this scale called a “probability”?

Mollie: Yeesh. Is 50 equipoise?

Darren: Yeah, do over.

Sarah (postdoc and clinical ethicist, philosophy, METRICS): Mutiny on question 1.

Mollie: We are all Spartacus.

Sarah: So why are we all so uncomfortable answering?

Andrew: I think that reaction, itself, kind of sets the tone for this: the whole point is that to date the evidence accumulated for or against any therapy is generally low quality and scattershot.

Darren: Based on existing evidence…7

Noah: What part of “possibly meaningless” was unclear :)?

Emily (professor at the George Washington University, epidemiologist): Yes, it’s a meaningless question. I’ll go with a 10.

Andrew: We’re four months into this, with hundreds of thousands of confirmed cases of the disease, and minimal quality evidence to tell whether any of it works. Inasmuch as we can ever know that something “works”

Darren: But it could be 68 6 weeks from now.

Noah: Slightly revised question then: What is your projection for what your answer will be two weeks from now, within what you would call reasonable bounds (e.g. 10-45)

Emily: 8-10

Darren: 6-8

Sarah: 10-20

Andrew: Two weeks from now? Still 14. Two months from now, maybe we’ll have meaningful movement. Oh, wait, bounds like a “confidence interval” – uh, 11-17.

Mollie: 10-20

Emily: Given the short timeline of two weeks, we’re unlikely to have much additional (high quality) information. The timeline for clinical trial readouts is unfortunately a bit longer than this.

Noah: Well, since no one else is gonna do it, I will: -10 to 40

Mollie: No fair!

Sarah: What am I supposed to make of -10?

Noah: Worse than useless. -10 could happen if no therapies are found to be at all effective for COVID-19, and have at least some negative side effects. So all net negative is totally plausible. <editor’s note: since this chat was recorded on 4/15, early results have mixed impact, some suggesting net harm, and unconfirmed hints at maybe some positive results.>

Darren: So not a probability then ok

Noah: Some of us are fine with probability estimates less than 0 <wink>

Andrew: Oh I see it came from the linear probability model.

Mollie: There’s also individual-level good/harm and population-level good/harm.

Noah: Right! But before we dive into that, what even is a trial?

Sarah: A risk that human subjects take on in order to produce a social good.

Andrew: NIH defines a clinical trial as “A research study in which one or more human subjects are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes”

Emily: That’s great. And I like to think of a trial in 4 parts — PICO. Population. Intervention. Comparison group. Outcomes.

Darren: On the twitter I defined trials as a tool for changing people’s minds. The key is that the trial must be trustworthy, so we don’t need to rely on trust in people.

Emily: I would argue that ideally we (clinical trialists) don’t start with something that we want people to do. But rather we start with something we aren’t sure if people should do. Of course we start with a reasonable hypothesis/evidence about why it might be a good idea to do something/use this drug. But we aren’t sure. However, I 100% agree with your point that trials must be trustworthy and transparent.

Noah: So is this what a trial is SUPPOSED to do, vs. how people actually do them?

Andrew: Supposed to do: improve our knowledge of whether the benefits of that drug/device/intervention outweigh the harms relative to some alternative course of action. And, in large part, I think that generally is what they do. But, there are some structural issues that can make them ill equipped to respond quickly to something like a pandemic situation. 

Noah: Which brings us to our current moment. Some scene setting: When I checked last night, Cochrane’s registry had 568 studies listed as “interventional studies” related to COVID-19. We’ve had a few trials, at least a few of which (all of questionable quality) have made their way into the mainstream. The biggest and most elephant-in-the-roomiest: the Raoult study.

Darren: Not a trial.

Noah: Damn, these takes are getting served up hot.

Emily: 26 patients got hydroxychloroquine (HCQ) (6 also got azithromycin because their doctor suspected infection). 6 of 26 were excluded from analysis bc they were transferred to ICU, died, or stopped treatment. Ultimately, 20 treated patients were compared to 16 patients who refused treatment or were in another facility. The study concludes that subjects treated with HCQ tested negative for SARS-CoV-2 sooner than 16 people who refused treatment or were at a different facility. No other clinical data was included.

Darren: A group of people were selected through unknown means, and given something. Their outcomes were compared to another group of people, also mysteriously selected. Outcomes in the 2 groups were compared. Then they did 9 other wrong things, and we were all rolling our eyes about the study. Then the President of the United States tweeted about it and here we are.

Noah: So one of the key things: it wasn’t randomized. Why is that so, so important here?

Andrew: Randomization ensures unbiased allocation to treatments, minimizing risk that “sicker” or “healthier” patients are preferentially assigned to one treatment to give appearance that it is better than the other.

Darren: It’s really the only way to feel confident you are comparing like to like

Andrew: Alongside randomization lives the importance of “concurrent controls” – meaning that the people in one group are also in the same hospitals, season, etc versus the other group – so any observed differences are more likely due to the treatment and less likely to be explained by those other factors. What do you call them again? Starts with a “c”

Darren: Colliders

Noah: Cookies. <editor’s note: the correct answer is confounders>

Mollie: One of the Very Special Pharmacoepidemiology terms is “channeling bias”, which occurs when physicians preferentially assign or avoid treatments based on the prognosis of the patient in front of them.

Darren: Nice! I’m adding that to Noise Mining as go-to phrases
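
Editor’s note: the point about randomization and channeling bias can be illustrated with a small simulation. This is a minimal sketch with made-up numbers, not a model of any real study: the hypothetical drug does nothing at all, recovery depends only on how sick the patient is, and the only thing that changes is how patients end up on the drug. The function name `apparent_benefit` and every probability in it are assumptions chosen for illustration.

```python
import random


def apparent_benefit(n_patients: int, channeling: bool, seed: int = 0) -> float:
    """Recovery rate in treated minus untreated patients for a useless drug."""
    rng = random.Random(seed)
    recovered_by_arm = {True: [], False: []}
    for _ in range(n_patients):
        severity = rng.random()  # 0 = healthiest, 1 = sickest
        if channeling:
            # Physician preferentially gives the drug to healthier patients.
            treated = rng.random() < 1.0 - severity
        else:
            # Coin flip: allocation ignores prognosis entirely.
            treated = rng.random() < 0.5
        # The drug has NO effect: recovery depends on severity only.
        recovered = rng.random() < 0.9 - 0.6 * severity
        recovered_by_arm[treated].append(recovered)

    def rate(outcomes):
        return sum(outcomes) / len(outcomes)

    return rate(recovered_by_arm[True]) - rate(recovered_by_arm[False])


if __name__ == "__main__":
    print("channeled: ", round(apparent_benefit(20000, channeling=True), 3))
    print("randomized:", round(apparent_benefit(20000, channeling=False), 3))
```

Under these assumptions, the channeled comparison shows a sizeable apparent benefit for a drug that does nothing, while the randomized comparison hovers near zero.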

Sarah: So Raoult study = worse than useless?

Emily: Yes.

Darren: As it turned out? Very much so it seems.

Mollie: Actively harmful, I’d say.

Sarah: Without randomizing, is there any way they could have at least been useful? Or does it all hang on the failure to randomize?

Darren: This is where trust comes in.

Emily: Yes, and I’ll add one more thing to the list of problems — people who died or went to the ICU were excluded from the final analysis. And this leaves me wondering if HCQ was actually harmful. But we can’t know for sure given the way the information was presented.

Mollie: Also important to note that all the COVID attention to HCQ has made it difficult for patients using it for other conditions (e.g., lupus) to get treatment.

Darren: The PI was advocating HCQ months before. It is a treatment he has long advocated for all kinds of maladies.

Noah: Holding aside issues with the participants in that study themselves, isn’t having some study better than no study?

Andrew: Whether “something is better than nothing” is a big picture question that hangs on some of the other issues here. As noted, it’s created huge demand for the drug based on extremely shoddy evidence, in some cases preventing people who take this drug regularly (with proven benefit for their condition!) from getting their medication.

Emily: No – something (HCQ) isn’t better than nothing – when we don’t know about the safety of the ‘something’ (HCQ).

Andrew: If this study was being treated as “Hey this might work, enroll your patients in trials so we can generate better quality evidence” – it would arguably be positive. But it’s become a polarizing (and politicized) issue where people have swallowed the weak evidence and started openly advocating that all patients with COVID be treated with this therapy. So, that’s how in some cases I’m not sure something is better than nothing in terms of evidence.

Noah: What about for the patients that participated?

Sarah: Yes! Thank you for bringing this up. Health Care Systems reallocating this drug from patients who have been taking it for years in order to use it in COVID trials on the basis of this study was very very concerning.

Andrew: The net effect of the Raoult study (globally) has been taking a drug away from patients who are known to benefit from it so it will be available for patients that may or may not benefit from it.

Emily: They may or may not benefit from it. And they may be harmed from it in fact!

Andrew: And to toss a grenade out there that gets very hot very fast, some very smart physicians on Twitter have basically said “We don’t even have enough of this drug to treat everyone anyway; at least if we randomize while the supply chain is short, we’ll know if it works”

Noah: Alright, clearly more study is not always better. And a bad study can be (and clearly was in this case) worse than nothing. What do we make about the MASSIVE number of new trials that are happening right now?

Darren: It’s good, but they aren’t nearly well coordinated enough

Sarah: I am worried about duplication of efforts and lack of coordination.

Andrew: A reflection of the bizarre structural inefficiencies built into our entire research infrastructure, which then get exacerbated during a pandemic when everyone feels like they have to do something.

Emily: There are definitely organizations (like WHO, the Gates Foundation, COVID therapeutics accelerator, etc.) that are coordinating some of these efforts to make sure they are complementary and informative. But there are many more that are not coordinated. And coordination is key to make sure that studies are harmonized across each part of the PICO. That will allow us to learn faster. To learn better. <editor: harder, stronger>

Sarah: Some institutions have many more proposed trials than relevant participants, which has led to some interesting conversations about allocation which is not about vents for a change.

Andrew: I have heard of a few trials where they are even planning to have the Data and Safety Monitoring Boards – which review interim results from a trial and determine whether the trial should continue or stop based on the accrued data – from separate trials talk to one another, which I believe is quite uncommon.

Noah: I’ve skimmed through the registries, and had a few big takeaways: the vast majority are tiny, single center studies, extremely few are even measuring the same outcomes.

Andrew: Noah, your point is bang on. Tons of “trials” being started, the vast majority of which are likely to be too small and too slow to be informative.
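
Editor’s note: to put a rough number on “too small,” here is a minimal sketch of an approximate power calculation for comparing two proportions, using a normal approximation. The event rates (20% vs. 15%) and sample sizes are hypothetical, picked only to illustrate the scale of the problem, and `approx_power` is an illustrative name rather than anyone’s actual trial design tool.

```python
from statistics import NormalDist


def approx_power(p_control: float, p_treated: float, n_per_arm: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation with unpooled variance and ignores the
    negligible chance of rejecting in the wrong direction.
    """
    std_normal = NormalDist()
    standard_error = (
        p_control * (1 - p_control) / n_per_arm
        + p_treated * (1 - p_treated) / n_per_arm
    ) ** 0.5
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(abs(p_control - p_treated) / standard_error - z_critical)


if __name__ == "__main__":
    for n in (50, 500, 1500):
        print(n, "per arm:", round(approx_power(0.20, 0.15, n), 2))
```

With these assumed rates, a trial with 50 patients per arm has only about a 10% chance of detecting the difference, while it takes on the order of 1,500 per arm to get to about 95%.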

Darren: Yet I’ve already spotted a systematic review, lolz!

Noah: There are meta-analyses in the works for sure, maybe even published by now?

Darren: Also…a lot of the registered Chinese trials have stopped recruiting, for good reason of course: too few patients to recruit. So they had what, a 2 or 3 month window.

Andrew: Funny, when the pandemic dies out in your part of the world, no more cases means a trial that doesn’t finish…which brings us to a potentially interesting tangent: I have seen some folks on Twitter argue that COVID trials need to be finished in <12 months to mean anything because of the likelihood that we’ll have a vaccine in approximately that time frame.

Noah: Also worth noting that a huge proportion of the trials are from hospitals, not necessarily research centers.

Mollie: It seems like there’s a prevailing sense that we should be able to relax certain strict rules for trials because this is an emergency. But a lot of those rules, like the ethics of who is included/eligible, seem to be perceived as a lot more malleable than others.

Sarah: Relaxation of standards is the wrong move. An emergency is exactly the time for high standards. Sloppy studies can be more harmful than nothing.

Andrew: Building from this, an important question is which components of clinical research machinery that we perceive as “red tape” in normal circumstances can be meaningfully expedited without compromising standards. One good example, I think, would be whether we can have a more streamlined process for broad-spectrum approvals across the nation in a pandemic situation rather than requiring every individual investigator starting a trial to get their own regulatory approval. Or, making the Institutional Review Board process more efficient for multicenter trials.

Mollie: My gut reaction on the speed of review is “yikes.” After all this, I don’t think prioritizing IRB speed, absent an emergency, is necessary, and we shouldn’t expect it.

Darren: Anyone arguing to lower standards doesn’t deserve our trust. That’s why we have standards. Trustworthy studies, so we don’t have to trust people…

Andrew: Most trials that are led by academics take a lot of time. We have to apply for funding (often from the federal government or foundations), wait a couple months for the grant to get reviewed, then if it gets funded, we bring the team together and actually start building the infrastructure to do the thing. It takes months (if not years!) to get a study started during normal working conditions. For trials to start rapidly, a lot of the infrastructure has to be in place – existing networks of sites, databases ready to capture the relevant information, framework to enroll and randomize patients. When a trial has to be done in 3 months to mean something, that can only happen if a lot of the framework already exists or if barriers can be removed to allow things to happen very rapidly due to the pandemic.

Mollie: I think this leads into some important points about: who is included in trials (generally, and specifically for COVID), and who is not included. First: institutions with existing trials infrastructure will get up and running faster, and they’ll treat the patient pool they have (usually). This means wealthier institutions treating wealthier and often healthier patients.

Sarah: YES. There seems to be a mis-match between places who have the institutional support and places that have the patients.

Mollie: For example: last I looked, Massachusetts General Hospital and Brigham and Women’s (private, Harvard-affiliated) in Boston both had COVID trials, but Boston Medical Center (Boston’s public safety-net hospital) did not. Especially considering that this pandemic is hitting poor people, and black and brown communities much, much harder, this is a big deal.

Sarah: So are there instances from the global pandemic trial world where work was done to get the institutional support and the patients together?

Andrew: Fantastic points, Sarah & Mollie. It can be challenging for the so-called “community” hospitals (not affiliated with academic medical centers) to get involved in research studies. This comes back a bit to the structural/incentive problem: faculty in academic centers who commonly lead trials (who are generally kind-hearted people fighting to do good! Not blaming them for this) may not have connections with the community hospitals; or if they do, there are often hurdles to overcome to get them started as a trial site.

Mollie: These institutional structural inequities in trial location could exacerbate what are already alarming differences in outcome for different patient pools. I also want to note that as of right now, pregnant women will be ineligible for trials, including vaccine trials, unless specific exemptions are made. Same for prisoners, possibly same for other vulnerable groups.

Sarah: ^This is ethically concerning. In the world pre-COVID I thought we were making progress on explaining why efforts to avoid including pregnant women in trials actually contributed to harms to pregnant women because it resulted in a harmful lack of information about pregnant women.

Noah: Are we just moving too much, too fast?

Darren: We aren’t prepared to go this fast. And too many would rather be first than be correct.

Sarah: I think there is a really strong impulse that needs to be fought here. Everyone wants to do something. And not everyone can be the ones leading THE trial.

Mollie: Someone brought up the incentives in academia and how they might (ha) be damaging here. Darren?

Darren: What’s that quote? ~ It’s amazing what you can accomplish when nobody cares about who gets the credit.

Andrew: I think this comes back a bit to the overall incentive/structure for much of academic medicine. Ideally, for crisis-level situations that require a large, coordinated response, there would be an existing framework ready to implement one or more large-scale trials more or less “ready to be activated” when called upon. But that doesn’t exist.

Noah: But we didn’t have that set up.

Andrew: No. We all work at separate universities, apply for competitive grants, and as Sarah just pointed out, not everyone can be the ones leading THE big and glamorous trial. So instead of hundreds of people signing up to be part of 5 really big trials, we have 568 small trials.

Emily: We need coordinated plans for how to do trials in a pandemic, and we need these plans in place before the pandemic. WHO has done exactly this with the R&D blueprint established in 2016. It seems (from outside view) this has mostly been applied to the COVID vaccine trials. The COVID-19 Therapeutics Accelerator (funded/coordinated by The Gates Foundation, Wellcome Trust, and Mastercard) is playing a similar role in the therapeutics trials. All of this to say that I’m more optimistic than everyone else that we will get useful information faster than usual (although we’re still talking in terms of months, not weeks).

Darren: In the UK, they had REMAP-CAP up pretty fast.

Andrew: REMAP-CAP could be a whole chat of its own, but it’s important to discuss briefly here for “what can we learn about trials” I think. There is a large network of clinical trials units.

Noah: Yup. Things like REMAP-CAP have been proposed before, but maybe the idea behind it is important more now than ever. Give a go at explaining?

Andrew: It’s a bit complicated, but the idea is the entire trial is embedded in routine clinical care in a highly adaptable way. So all the data collection and treatment decisions are part of the clinical process, including the randomization of treatment decisions to get the benefits of RCTs.

Noah: The super interesting part to me is the adaptation. It’s not a “normal” trial; the randomization probabilities themselves change as more data come in.

Andrew: Right. The randomization chances change to allocate patients to arms that appear to be performing better as the trial goes on. And it’s multifactorial, so it’s not just one decision (i.e. a drug), but lots of decisions all at once, so you get some “bang for your buck” instead of setting up a whole big trial system to answer just one question.
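
Editor’s note: for readers curious what “the randomization chances change” can look like mechanically, here is a minimal sketch of one simple form of response-adaptive randomization (Thompson sampling with Beta posteriors). To be clear, this is a generic textbook technique swapped in for illustration, not REMAP-CAP’s actual algorithm; real adaptive platform trials typically update allocation at scheduled interim analyses and include safeguards this sketch omits. The function name and the true success rates are assumptions.

```python
import random


def adaptive_trial(true_success_rates, n_patients, seed=0):
    """Simulate response-adaptive allocation; return patients per arm."""
    rng = random.Random(seed)
    n_arms = len(true_success_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_patients):
        # Draw one plausible success rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior). Arms that look better get picked more
        # often, but every arm keeps some chance of being assigned.
        draws = [
            rng.betavariate(1 + successes[arm], 1 + failures[arm])
            for arm in range(n_arms)
        ]
        arm = draws.index(max(draws))
        # Observe the (simulated) outcome and update that arm's record.
        if rng.random() < true_success_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [successes[arm] + failures[arm] for arm in range(n_arms)]


if __name__ == "__main__":
    # Hypothetical arms: 30% vs. 60% true success rate.
    print(adaptive_trial([0.3, 0.6], n_patients=1000))
```

Early on, allocation is close to even; as outcomes accumulate, more and more patients are assigned to the arm that appears to be performing better.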

Emily: It’s also not just one kind of treatment, there are several domains of treatment including antivirals, corticosteroid strategies, immune modulation strategies, and more being added.

Noah: It’s a type of trial that has been experimented with in the last few years, but is a totally different way to do a trial than is normally done. I would never have thought we’d see something like this set up so quickly.

Sarah: Sounds like this has benefits ethically in that it avoids research waste and has these critical thresholds for stopping randomization built in beforehand. It may be difficult to untangle obtaining informed consent for participation in such a trial, given so many possibilities for the treatment of each individual patient.

Noah: We are doing lots of treatment trials, but what do we not know right now that trials won’t be able to help us with?

Darren: Immunity…case projections…masks…social inequities…

Andrew: Trials will be able to help us with “Does hydroxychloroquine / azithromycin / remdesivir do any damn good at all” – just about anything else, trials probably won’t be able to help with. Do masks work in a community setting to reduce transmission? No trial likely to be done in a meaningful time for this pandemic. When can we go back to work? No trial gonna solve that one. How do we prevent this from hurting the most vulnerable populations? Trials aren’t going to help with that

Sarah: You all covered it- masks, timeline, inequalities. Some of the questions we most want answers to.

Noah: Alright, so the trials world is totally on its head, just like everything else. To close out, here’s a scenario. You have 5 minutes to speak to any group of people you want about trials as they relate to COVID. Could be anyone (general population, docs, patients, elected officials), but you have 5 minutes. Who do you pick, and what do you tell them?

Darren: I’ll take docs, because that’s what I try to do anyway. I would help them (those that need it, certainly not all) understand the limits of their own understanding of trial methods; urge them not to put stock in a trial just because it’s in Lancet or NEJM; and remind them of all the common medical practices that were later shown to not work in rigorous trials.

Sarah: I pick trialists. Can my magical five minutes be that we all get in on zoom and do priority setting across institutions?

Noah: Only if it has a cool space background

Sarah: I worry that without communication across institutions, we will focus on testing one treatment and have many competing under-powered inconclusive studies of that treatment. I worry that without communication across institutions, we will (as discussed earlier) see that the places with lots of sick patients will not be linked with places with lots of study expertise to work together, since these institutions tend not to be the same. 

Andrew: I’ll speak to elected officials; I will plead with them that this scenario is why we should have a standing National Pandemic Trials Unit that includes a large number of sites who can “opt in” to join the study with an expedited review process and framework of a database that can be activated quickly for a new study. A core of experienced trial investigators will be prepared to lead these trials, pulling in relevant content experts if necessary for a new outbreak. In non-pandemic times, the unit can be used to run trials of, well, other stuff that’s of national health interest.

Darren: Yeah, let’s do that!

Sarah: I donate my 5 minutes to Andrew.

Andrew: Sarah has a good point, of course, which is really what inspires my comment. Of course this would be better if we had a few dozen people hopping into bigger, better coordinated trials rather than 568 (mostly single center) trials

Mollie: Good thing no one is talking about withdrawing funding from WHO, then

Noah: Nope. Nobody would do something so stupid in the middle of a pandemic.

Mollie: I’m trying to decide between the general population and elected officials

Noah: So are the elected officials

Mollie: So I’ll take the teeming masses. And I guess I’d try to talk about how, often, your doctor is good at picking patients who will respond to treatments, but isn’t good at (and isn’t trained for) deciding whether treatments themselves work. And how trials are really the only way we can answer that last question.

Andrew: Since everyone is (justifiably) worried about the economic impacts of a prolonged shutdown for social distancing purposes, you can argue that the ROI on these things would be there if doing effective trials quicker = less need for broad mitigation strategies if we can discover effective therapies faster

Noah: Bring it home, Emily

Emily: I’ll also opt for elected officials. And if I have to get specific, I’ll go for Trump and his advisors. When elected officials advocate for unproven interventions, people get hurt. And we’re seeing that on a major scale here during the COVID-19 pandemic. For example, there was a run on Tylenol/paracetamol after a French health minister declared ibuprofen might be unsafe, and there have been marked increases in chloroquine/hydroxychloroquine prescriptions since officials advocated for this. People have become sick and even died from treatments advocated for without evidence from our top officials. All of this is to say that it’s extremely important for public figures to understand that if we’re doing a clinical trial – it’s because we truly don’t know whether the treatment will work or not. It’s never a good idea to bet on the outcome of a trial, and it’s certainly a bad idea to make public health recommendations before the data comes in.

Noah: And that’s a wrap! Thanks for sticking around, see y’all next time!

Mollie Wood is a postdoctoral researcher in the Department of Epidemiology at the Harvard School of Public Health. She specializes in reproductive and perinatal pharmacoepidemiology, with methods interests in measurement error, mediation, and family-based study designs. Tweets @Anecdatally

Sarah Wieten is the Clinical Ethics Fellow at Stanford Health Care and a postdoctoral researcher at the Stanford Center for Biomedical Ethics.  She specializes in interdisciplinary projects at the intersection of epistemology and ethics in health care. Tweets @SarahWieten.

Emily Smith is an epidemiologist and an Assistant Professor of Global Health at The George Washington University, Milken Institute School of Public Health in Washington D.C. Her research focuses on clinical trials aimed at generating evidence for global public health practice and policy. Tweets @emily_ers

Andrew Althouse is an Assistant Professor at the University of Pittsburgh School of Medicine.  He works principally as a statistician on randomized controlled trials in medicine and health services research.  Tweets @ADAlthousePhD

Darren Dahly is the Principal Statistician of the HRB Clinical Research Facility Cork, and a Senior Lecturer in Research Methods at the University College Cork School of Public Health. He works as an applied statistician, contributing to both epidemiological studies and clinical trials, and is interested in better understanding how we can improve the nature of statistical collaboration across the health sciences. Tweets @statsepi.

Noah Haber is a postdoctoral researcher at the Meta-Research Innovation Center at Stanford University (METRICS), specializing in meta-science from study generation to social media, causal inference econometrics, and applied statistical work in HIV/AIDS in South Africa. He is the lead author of CLAIMS and XvY and blogs here at Tweets @NoahHaber.

Chat: Do p-values get a bad rap?

Editor’s note: The transcript below is lightly edited from the original transcript for clarity. For a human-oriented explainer on what p-values are (and are not), see A Practical Primer on P-Values

Noah: New chat time!

Darren: The answer to the question is yes, they do. Are we done?

Mollie: **flexes fingers, cracks knuckles**

Andrew: Good to see you all. Prepare to die.

Darren: You don’t need me if Andrew is here.

Noah: Good to have you both, maybe we can get you two into a fight

Darren: Have you seen him lift weights?

Noah: And with that excellent, excellent tone setting. LET’S GO!

Andrew: (starting gun fires, klaxon sounds)

Noah: Unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, what should we do about p-values? Scale of 0 (ban ALL the p-values) to 100 (I love p-values, especially in table 1). Everyone must give a number!

Mollie (postdoc in repro/perinatal pharmacoepi): hoo boy. 50.

Andrew (professor at University of Pittsburgh, clinical trial statistician): 67

Darren (epidemiologist and statistician, University College Cork): 100 – a time and a place

Laure (epidemiologist and statistician, Maastricht University): 70

Noah (postdoc in HIV/AIDS/causal inference econometrics/meta-science at UNC): I’m going 49

Alexander (postdoc in epidemiology at UNC): I’m going to give a 50, but I feel weird giving anything, since I think it’s easy to confuse p-values with null hypothesis significance testing (NHST)

Darren: +1 to Alexander

Andrew: Cripes, Alexander, we’ll get to NHST in a minute.

Alexander: Fine, then count me as a 50.

Noah: NEW CHALLENGE! Define a p-value in a single sentence (no semicolon cheats), that is both technically correct AND understandable by a human.

Darren: Irresponsible to even try. Grrr

Andrew: Agreed with Darren. Not possible to define in a single sentence without at least one semicolon or hanging clause.

Alexander: A p-value is the probability of observing a test statistic at least as large as that observed, given the null hypothesis is true, the statistical model is correct, and the data can be trusted.

Laure: I would go for “The chance of seeing your data (or a more extreme finding) if in reality there were no effect” (under the assumption that the model used to compute is correct and null hypothesis is no effect)

Mollie: The chance that you would see that magnitude of difference in groups if the real difference was zero, just randomly

Andrew: I’ll cede to Alex’s definition

Darren: Assuming that the data were generated by some null model, which includes a hypothesis and some assumptions, a p-value gives the tail area probability for the sampling distribution implied by said null model.

Mollie: You guys don’t talk to real humans much, huh?

Darren: Nope.

Mollie: Good move

Alexander: I don’t leave my windowless office much…

Noah: I’m still in PJs <editor’s note: it was 11:30am>

Andrew: OK. Now, for one that will be more understandable by a human: in the context of a single trial comparing a treatment against a placebo, a p-value of 0.03 means that there was a 3% chance of observing a difference equal to or larger than the difference we actually observed in the data if the treatment has no real effect.
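
Editor’s note: Andrew’s definition can be checked by brute force. The sketch below is one possible illustration, not anything the panel ran: it uses a permutation test, which shuffles the group labels to mimic a world where the treatment has no real effect, and counts how often the shuffled difference is at least as large as the observed one. The scores and the function name permutation_p_value are invented for this example.

```python
import random

def permutation_p_value(treated, placebo, n_shuffles=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(treated) / len(treated) - sum(placebo) / len(placebo))
    pooled = list(treated) + list(placebo)
    n_treated = len(treated)
    n_placebo = len(pooled) - n_treated
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # If the treatment has no real effect, group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_treated]) / n_treated
                   - sum(pooled[n_treated:]) / n_placebo)
        # Small tolerance guards against floating-point rounding in the sums.
        if diff >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Hypothetical symptom scores (lower is better), invented for illustration.
treated = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.0, 4.3]
placebo = [4.9, 5.2, 4.6, 5.5, 4.8, 5.1, 4.4, 5.3]
p = permutation_p_value(treated, placebo)
print(f"share of shuffles with a difference at least as large: p = {p:.4f}")
```

The number printed is only the share of "no effect" shuffles that look at least as extreme as the data. It says nothing about data quality or bias.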

Darren: But p-values aren’t for real humans. They are a single piece of information about a single estimate in a single study, from which nothing of substance should ever be drawn.

Noah: Why shouldn’t people ever draw from them anything of substance?

Laure: Because they are a random variable themselves. Get a new sample, you get a new p-value.
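
Editor’s note: Laure’s point can be seen in a small simulation. This is a sketch under stated assumptions (a true effect of 0.3 standard deviations, 50 people per arm, a known outcome SD of 1, and a simple z-test); none of these numbers come from the chat. The same study is repeated 1,000 times and the p-values are collected.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def one_study_p_value(rng, true_effect=0.3, n_per_arm=50):
    """Simulate one two-arm study (outcome SD assumed to be 1) and return a two-sided z-test p-value."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    se = (2.0 / n_per_arm) ** 0.5  # standard error of the difference, SD known to be 1
    z = (mean(treated) - mean(control)) / se
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

rng = random.Random(2019)
p_values = sorted(one_study_p_value(rng) for _ in range(1000))
share_below = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"smallest p: {p_values[0]:.4f}")
print(f"median p:   {p_values[500]:.4f}")
print(f"largest p:  {p_values[-1]:.4f}")
print(f"share of studies with p < 0.05: {share_below:.2f}")
```

Every simulated study has the same true effect, yet the p-values range from very small to very large, which is why a single p-value from a single study carries limited information.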

Darren: To Alexander’s definition, are the data of good quality? What are the risks of bias? There are plenty of things to consider about any estimate. A p-value is just one of these.

Mollie: Darren, I’m very curious about the 100-endorsement of p-values (given appropriate time and place) vs. nothing of substance should ever be drawn

Noah: So if everyone is roughly agreed that it’s a pretty limited statistic, how did it become THE statistic?

Darren: Once people wrote books with sampling distributions (the really hard part), then calculating p-values was relatively easy.

Andrew: Borne from a demand to have a simple, “easy to understand” (ha!) way to make a conclusion about whether an effect is ‘real’ or not.

Laure: It became THE statistic because it is practical and using it with an arbitrary cut-off leaves no room for uncertainty (in users minds)

Andrew: People are uncomfortable embracing uncertainty, and much happier to have a yes/no answer from any given statistical test

Mollie: I mean, they’re very useful if you want to talk about things like long runs of heads or tails on successive coin tosses

Alexander: People couldn’t understand the writings of Neyman and Pearson (NP), and bastardized the writings of Fisher, and they came out with our current use of p-values.

Darren: To Mollie’s point – all of these inferential tools are useful fictions. I think Alexander probably hit on it. (accidental mashing of Fisher and NP)

Noah: Let’s get wonky! What’s the mashup?

Alexander: Fisher: 1) Specify a null hypothesis (that can be anything, doesn’t have to be no effect), 2) Compute p-value, 3) P-value tells you compatibility of data with hypothesis. Neyman-Pearson: 2) Calculate p-values under each hypothesis, 3) Accept hypothesis with greater p, reject hypothesis with smaller p

Darren: In my mind, I think NP makes some sense if you do all the work up front around your alternative hypothesis. Fisher’s continuous interpretation of p-values makes sense post-data. But treating the post-data p like a hard decision rule without engaging in the NP manner is problematic.

Alexander: I agree with Darren. The big thing with NP is they emphasize carefully choosing alpha, beta, null and alternative hypotheses to be question-specific. Basically, NP want you to carefully consider the cost/benefits of false positives and false negatives, and to design the hypothesis test accordingly
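
Editor’s note: to make the alpha/beta trade-off concrete, here is a rough sketch using the standard normal-approximation sample size formula for comparing two means, n per arm = 2 × ((z for alpha + z for power) × SD / effect)². The effect size of 0.5 SD and the alpha/power combinations are arbitrary choices for illustration, not recommendations.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(alpha, power, effect, sd=1.0):
    """Approximate per-arm sample size for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # controls false positives
    z_beta = NormalDist().inv_cdf(power)               # power = 1 - beta, controls false negatives
    return ceil(2.0 * ((z_alpha + z_beta) * sd / effect) ** 2)

for alpha, power in [(0.05, 0.80), (0.10, 0.80), (0.05, 0.90), (0.01, 0.95)]:
    n = n_per_arm(alpha, power, effect=0.5)
    print(f"alpha={alpha:.2f}, power={power:.2f}: about {n} per arm")
```

Tolerating more false positives (a larger alpha) buys a smaller trial, while demanding fewer false negatives (higher power) costs participants. Which trade is worth it depends on the question, which is the part that defaulting to 0.05 and 80% skips.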

Laure: What about NP with composite hypotheses? Even that becomes complicated rather quickly…

Alexander: To add to Andrew’s point, they are not only uncomfortable embracing uncertainty, but they are also uncomfortable making what may seem like subjective decisions. We thus get everyone using alpha=0.05, etc.

Darren: But this is again moving from p-values to null hypothesis significance tests

Andrew: Good point, Darren. I just can’t help myself, apparently.

Noah: Why is it that we can’t seem to separate p-values and NHST? Just to point out, several of you used/implied NHST in your definitions.

Andrew: Noah, that’s a good question. I think the answer lies in Alexander’s comment above: “they are not only uncomfortable embracing uncertainty, but they are also uncomfortable making what may seem like subjective decisions”

Laure: I agree

Alexander: People can call me Alex

Andrew: Can I call you Al?

Noah: You can call me Betty.

Darren: *groans*

Noah: What is NHST? And why are the two ALWAYS together?

Laure: Why? Because that’s how it is being taught?

Andrew: NHST (human definition): testing the compatibility of data against a null hypothesis (such as “the difference between the two groups = 0”), and drawing a conclusion based on the probability of observing some data under the assumed conditions of your null hypothesis. If the probability of the observed data is low, we reject the null hypothesis, and conclude that the difference between the two groups ≠ 0. If the probability of the observed data is not low, we “fail to reject” the null hypothesis, though this is commonly confused with “accepting” the null hypothesis.

Good lord, our “human definitions” need some work.

Mollie: We have GOT to get out more

Noah: I think that’s the key though. P-values are really really hard to get a handle on, the complication is important, and they are REALLY easy to misunderstand

Darren: The only way to make any of it human readable IMO is to have humans who understand the useful fiction that is a sampling distribution.

Noah: So back to Laure’s earlier point: How is it being taught?

Mollie: I think you have to also add “and to whom?” Because huge numbers of people get their statistics education in 2 weeks in medical or pharmacy school.

Darren: How is it being taught? Hard to know. But it is obviously passed down from senior scientist to junior researcher all over the place in medicine.

Andrew: Because of the interweaving of p-values and NHST, a p-value is often viewed as a dichotomous piece of information (small p-value  -> evidence of ‘real’ effect, not-small-p-value -> no evidence of ‘real’ effect) rather than being viewed as a continuous piece of information about the compatibility of some observed data with a hypothesis.

Mollie: I got a lot of “the p-value isn’t perfect but it’s too hard to explain other methods so let’s go with it” explanations in my MPH program, for example, especially around model building and variable selection

Laure: Oh yes, variable selection! Very true…

Noah: <squints menacingly at potential thread drift>

Mollie: No no, I can keep this on topic! I think. There are, I think, lots of people who avoid p-values for the main effect estimate because they’ve heard the p-value is weird, but who see no issue with using it to decide how to retain variables in a model.

Alex: I think the problem is that it seems like we all agree to some degree on what p-values and NHST are, but there are so many places where these things are used incorrectly that it’s easy to go down rabbit holes.

Darren: Mollie has a good point though. Epidemiologists in good programs learn that you don’t select covariates based on p-values (for prediction or causal problems). So they are somewhat inoculated from the start.

Noah: We’ve talked a lot about how RESEARCHERS get it wrong. What about “humans?”

Darren: This is the most human contact I’ve had since Christmas. But why should they need to get it right? It’s complicated and hard.

Mollie: I feel like this group will appreciate this anecdote about talking to humans in which I was telling a very smart, PhD in something not-science about a 70% increased risk of cancer in some study and he said “70% of people got cancer??”
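
Editor’s note: the arithmetic behind Mollie’s anecdote, as a tiny sketch. A “70% increased risk” is a relative change, so what it means in absolute terms depends entirely on the baseline risk. The 1% baseline below is a made-up number, since the anecdote does not give one.

```python
def risk_after_relative_increase(baseline_risk, relative_increase):
    """Absolute risk in the exposed group, given a baseline risk and a relative increase."""
    return baseline_risk * (1.0 + relative_increase)

baseline = 0.01  # hypothetical: 1 in 100 unexposed people develop the cancer
exposed = risk_after_relative_increase(baseline, 0.70)

print(f"unexposed risk: {baseline:.1%}")
print(f"exposed risk:   {exposed:.1%}")
print(f"absolute difference: {exposed - baseline:.1%}")
```

Under that assumed baseline, a 70% increase moves the risk from 1% to 1.7%, which is a long way from 70% of people getting cancer.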

Andrew: Humans (readers of research, anyway) typically look at a paper and hunt for p-values; when they see small p-values, they take that as evidence that there is a ‘real’ effect; when they don’t see small p-values, they conclude the paper has minimal interesting information.

Noah: Is it just normal people who do that? Or do researchers, journal editors, doctors, policy wonks, etc do that too?

Mollie: No! lots of people do that

Darren: Doctors for certain.

Andrew: Overall understanding of probabilities is generally poor.  I’ve had stories very similar to Mollie’s, even working with people who have spent decades working in some form of biomedical research.

Laure: I’m afraid editors do it too. I wish I was making this up, but I attended a talk by an editor of a pretty important journal, given to the doctoral school of a medical faculty, on how to write papers. He admitted that editors who decide on sending out papers for peer review largely skip the methods and results section (except for some figures and tables) and go straight to conclusions. That leaves them vulnerable to whatever spin the authors put in there.

Andrew: Cripes, and I thought *I* was the cynic of this group.

Alex: Even the people who produce the research themselves do it. David Spiegelhalter had an example of a paper with a 25% reduction in mortality (a huge effect) and p=0.06, where the researcher said there was no effect in the paper and presentation.

Mollie: So one of the threads I see coming out of these discussions a lot is, “well we have to decide on actions somehow,” which is true. Docs have to decide on treatments, governments need to decide where to allocate policy money, etc

Alex: Yeah, but to say they make those decisions just based on a p-value is silly.

Mollie: Of course it is! But the p-value is so ingrained, and so many people *think* they understand it

Andrew: Great point, Mollie.  That is a very common defense of NHST paradigm: “we have to make decisions somehow” – combined with Alex’s earlier point that people are uncomfortable making decisions that seem subjective.

Mollie: I am 100% in Camp Estimate & Replicate

Andrew: Many people take comfort in using p<0.05 as the rule for making those decisions because…well, just because.

Alex: Because you don’t have to justify it!

Mollie: Right, so the p-value, and especially p<.05 has become a kind of fig leaf of objectivity

Alex: It’s just accepted. If I chose p<0.10 as my cutoff the reviewers would have a field day. But p<0.05 there isn’t even a comment.

Darren: Even if you roll NP, people skip consideration of the relative costs of false + and –

Noah: I roll Neyman Pearson is an excellent, excellent bumper sticker

Andrew: Very much agreed with Darren (mindless use of p=0.05 and 80% power for all trials instead of considering the relative costs of a wrong decision in each direction) and Alex (that 0.05 is automatically defensible because everyone else does it, but god forbid I suggest p=0.10 for an alpha level in an academic journal)

Laure: Trying to interpret CI limits is rather difficult, takes more space in journals… hard to get the message across

Darren: And CI limits are just a set of tests. There. I said it.
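
Editor’s note: Darren’s claim can be checked numerically. The sketch below assumes a normally distributed estimate with a known standard error (a z-test), and the estimate and standard error are hypothetical. It tests a grid of candidate null values and keeps every value that a 5%-level test would not reject; the surviving range should match the usual 95% interval.

```python
from statistics import NormalDist

def two_sided_p(estimate, se, null_value):
    """Two-sided z-test p-value for H0: true value == null_value."""
    z = (estimate - null_value) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

estimate, se = 1.2, 0.5  # hypothetical mean difference and standard error

# Test a fine grid of candidate null values; keep those NOT rejected at the 5% level.
grid = [i / 1000 for i in range(-1000, 4001)]
not_rejected = [h for h in grid if two_sided_p(estimate, se, h) >= 0.05]

z_crit = NormalDist().inv_cdf(0.975)
lower, upper = estimate - z_crit * se, estimate + z_crit * se
print(f"values not rejected: {not_rejected[0]:.3f} to {not_rejected[-1]:.3f}")
print(f"estimate +/- z*SE:   {lower:.3f} to {upper:.3f}")
```

The two ranges agree up to the grid spacing, so reading "does the interval include the null?" is the same act as running the test.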

Mollie: Didn’t a journal recently respond to the replicability crisis by proposing a new alpha of .001 or something?

Laure: John Ioannidis proposed to set the threshold to 0.005. I think he saw it more as a way to sift through published articles claiming significance, following his piece on Why most published research findings are false.

Darren: How many authors was that one?

Andrew: hahahahahaha

Mollie: ZING

Andrew: Though in the same paper, John Ioannidis also basically said that this wasn’t a very good solution, either (just something easy to do right away). Basic and Applied Social Psychology straight banned p-values altogether.

Darren: And BASP was apparently a trainwreck afterwards

Andrew: Right. A year or two later, it was just chaos. People had no clue how to write papers without p-values.

Alex: Changing the cutoff is fine, I guess, as long as they also publish null findings.

Alex: It won’t fix any replication crises, but it won’t hurt either I suppose.

Noah: Is it that people don’t know how to write papers without p-values, or papers without NHST?

Mollie: I get into fights with reviewers about it a lot.

Darren: Epidemiologists do it all the time. Not to quote my cv, but I have two stats/model focused papers in the International Journal of Epidemiology, neither mentions “statistical significance”, per journal standards. Epidemiology and American Journal of Epidemiology are the same.

Alex: The last two papers I published in clinical journals don’t include anything about statistical significance, and the confidence intervals (barely) include the null. Amazingly I didn’t even get a comment from the reviewers (though I did get a comment from a coauthor)

Mollie: A student’s manuscript, where she reported RR 2.0, 95% CI 0.95 to 3.5 or whatever, got a comment about how she couldn’t say there was an increased risk. And she didn’t say significantly increased! Just increased.

Darren: Ugh

Mollie: BUT, to refute my own point slightly, if the RR had been, say, 1.15, how wide would CI need to be before we’d all call it null?

Andrew: For many people the two concepts are inextricable. I have an interesting story of helping analyze and publish a trial with a secondary endpoint with a result of (something like) HR=0.85, 95% CI 0.68-1.02, p=0.07, and my phrase in the results, “(Event) was lower in Arm A versus Arm B (HR=0.85, 95% CI 0.68-1.02, p=0.07),” was hammered by 2 of the 3 reviewers with “you can’t say that the risk was lower.” Basically, the same exact story that Mollie is telling above. I wanted to include the p-value because, as we’ve touched on several times, I think it’s a continuous piece of information about how incompatible the data are with a null hypothesis of no effect. Perhaps that means I can’t write a paper without NHST, either. Blargh.

Alex: Charlie Poole proposes reporting the p-value function, which I doubt anyone will do, but that gives compatibility with ALL hypotheses. Which makes it clear to see that the point estimate is most supported by data. And that there are lots of points more supported than the null.
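
Editor’s note: a minimal sketch of what a p-value function looks like, assuming a hazard ratio whose log is approximately normal. The estimate (HR = 0.85) echoes the stories above, but the standard error of 0.09 on the log scale is a made-up value chosen for illustration, not a number from any real trial. Instead of testing only HR = 1, the same test is run against a range of hypothesized hazard ratios.

```python
from math import log
from statistics import NormalDist

def p_value_function(log_estimate, se, hypothesized_ratio):
    """Two-sided p-value for H0: true ratio == hypothesized_ratio (normal approximation on the log scale)."""
    z = (log_estimate - log(hypothesized_ratio)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

log_hr, se = log(0.85), 0.09  # hypothetical estimate and standard error

for hr in [0.60, 0.70, 0.80, 0.85, 0.90, 1.00, 1.10]:
    p = p_value_function(log_hr, se, hr)
    print(f"H0: HR = {hr:.2f}  ->  p = {p:.3f}")
```

The p-value peaks at the point estimate and falls away on both sides, so hazard ratios such as 0.80 or 0.90 come out as more compatible with these hypothetical data than the null value of 1.00.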

Noah: That brings us to the elephant in the room

Darren: Hey! I’m on a diet


Darren: Oh, that

Mollie: Oh my. OK, who signed? (I did not, but I would have if I’d remembered to email them)

Alex: <raises hand>

Noah: <raises hand, waits for someone to high five>

Darren: Nope

Laure: I did

Andrew: I also did not sign. After some deliberation.

Noah: <is left hanging, puts hand down and pretends it’s cool> What was, and importantly, WAS NOT, in the letter?

Alex: What was in it: a call to stop NHST, what was not: a call to stop using p-values

Noah: Was it just me, or did we see a ton of people treating it like a ban on p-values, when it explicitly, unambiguously said otherwise?

Alex: Yeah I saw that, but nobody reads things

Andrew: Yes, I think a lot of the reaction treated it as a proposed ban on p-values

Laure: People just read the title?

Mollie: You mean people were dichotomizing the message unnecessarily?

Noah: BOOM goes the dynamite

Andrew: Score one for Mollie

Darren: **applause**

Mollie: thanks, I’ll be here all week, try the veal

Noah: So it seems like the two ideas are inseparable in most people’s minds, even when we go through great efforts to avoid it. So even if we all agree (ish) that NHST is the bigger problem. What happens if we actually banned p-values? Would NHST go with it? Just get replaced? Would the replacement change anything?

Darren: People would do the same stuff with Bayes factors or posterior credible intervals.

Alex: I would guess people would just report confidence intervals and use them for implicit NHST

Andrew: +1 to Darren. Was going to say exactly the same thing. One thing for us all to chew on:

This guy is an editor of a major journal and highly willing to engage discussions about stats and methods on twitter. A real problem is that there seems to be a thirst to just get rid of the p-value and replace it with something else (this ONE WEIRD TRICK TO MAKE YOUR STATS BETTER!) which we all know is a nonsense task.

Laure: Even cost-based analyses…

Alex: Or yes, as Darren says, posterior credible intervals using noninformative priors

Noah: I have a slightly dissenting opinion here (surprising no one): there is something psychologically different about things like CIs which actually might change things over the long run. Plus the usual “wrong” interpretation of a CI is way less wrong than the usual “wrong” interpretation of a p-value.

Mollie: I do think the CI (whatever the C stands for) is an improvement over the p-value, even if it’s just used for NHST

Andrew: Which is why I remain a (tepid) defender of the p-value in general, though I’d like to see it eradicated from some obvious misuses (p-values in Table 1 of RCT’s, for one). Banning p-values will not get rid of the desire to make NHST-like decisions.

Darren: Even with informative or regularizing priors…people can/will still find ways to make clear cut conclusions based on an estimate. From a background in epi, I was *shocked* to learn there were experimentalists who *just* report p-values with no effect sizes.

Alex: Good point, Darren

Laure: Well it just comes back to what we discussed already: people want to know what to do and hence will oversimplify.

Alex: Maybe we can say what the alternative to NHST would be? Reporting point estimates with a reasonable measure of uncertainty, embracing the uncertainty, and being willing to make decisions with uncertainty present?

Laure: Yes. I doubt whether the scientific community will be prepared to do that. This reminds me of Latour’s work in the 80s. He did ethnographic work in labs and described how science basically is messy inconclusive findings. And he contrasted that with ready-made science: after peer review and publishing, everything was treated as if the evidence always spoke for itself. It became a “scientific fact.” He got fierce critique from the scientific community, because they could no longer claim objectivity and to hold the truth?

Noah: This is one of the key, central difficulties in meta-science: It’s extremely tough to communicate critique of scientific institutions and practice vs science as a whole.

Laure: Other question: would CIs lead to less reporting and publication bias and better meta-analyses?

Noah: I think no in the short run, yes in the long-run

Darren: Frequentist CIs are still just sets of tests…so I don’t think so.

Mollie: There’s an added complication, possibly, in that journals that explicitly prefer CIs also seem to lean less on NHST

Noah: I don’t think that’s either coincidence or confounding. CI’s bring with them the idea of ranges and uncertainty. P-values don’t do that quite as well.

Mollie: I guess I was thinking, do those journals embrace CIs because they’re already more comfortable with uncertainty, or does using more CI lead to more comfort with using ranges? What direction does the arrow go?

Darren: But I object to the often-touted idea that the 95% CI says everything an exact p-value says, because it doesn’t.

Alex: Are there journals that still don’t make you give the point estimate but rather just a simple yes/no to significance?

Darren: I don’t know. Probably still common in in vivo experimental work.

Andrew: I am currently grading class assignments for our clinical trials course (which means I get exposed to a ton of articles that I may not otherwise read) and I’ve seen even some RCT’s that include p-values for the primary treatment comparison but no summary data on the outcome itself or the difference between groups

Darren: I think another major factor is the role of randomization. In fields where it can’t be used, the bulk of the hard work is on dealing with bias.

Alex: YES Darren!!

Andrew: Agreed, again. An element that we’ve left out of this discussion (we can only cover so much in one chat…) is that the use of p-values (and even NHST) in more controlled settings may be more acceptable. One problem that some statisticians point out is that p-values aren’t really p-values any more when the entire data generation procedure and choice of statistical model was modifiable along the way.

Noah: The question that never gets asked: p-value of what? P-values tell you little to nothing about the model itself. And as in your example, the model is the important bit in these kinds of things. So, you often get a “significant” p-value of a meaningless parameter.

Andrew: One of the reasons we’ve all grown so cranky about stepwise regression, for example.  Do the p-values in a stepwise-built regression model retain their original meaning?

Mollie: Ugh, that. Taken out of context, that would be a nice way to promote p-hacking.

Andrew: A single p-value for a single NHST of the primary outcome in a single randomized controlled trial with a prespecified analysis plan at least does retain its original meaning (though you may still argue about the absurdity of treating p=0.045 and p=0.055 as dramatically different results; that’s a separate problem). I’m not even sure what the p-values mean in the setting of a multivariable regression model where the authors may have peeked at the results a couple times, changed which variables to keep in or remove from the model, changed which people to include in the analysis, etc.
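
Editor’s note: a simplified simulation of why flexible analyses break the p-value’s meaning. It assumes every null hypothesis is true and that each alternative analysis is an independent test, which is a deliberate simplification (real re-analyses of the same data are correlated). The “researcher” tries k analyses and reports the smallest p-value.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def false_positive_rate(k, n_sims=5000, seed=42):
    """Share of all-null studies where the smallest of k independent p-values is below 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        smallest = 1.0
        for _ in range(k):
            z = rng.gauss(0.0, 1.0)  # test statistic when there is truly no effect
            smallest = min(smallest, 2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
        if smallest < 0.05:
            hits += 1
    return hits / n_sims

for k in (1, 5, 10, 20):
    simulated = false_positive_rate(k)
    print(f"{k:>2} analyses tried: simulated {simulated:.2f}, theory {1 - 0.95 ** k:.2f}")
```

With one prespecified analysis the false positive rate sits near the nominal 5%, but it climbs quickly as more analyses are tried and only the best-looking one is reported.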


Darren: But I haven’t goaded anyone into a fight about CIs and “precision” yet!

Noah: Darren and Alex: quick summary of the problems w/ technical stuff.

Alex: *gulp* how did I get selected to do technical!?

Noah: because you uttered the names Neyman, Pearson, Fisher, and Poole. You and Darren dug your graves.

Darren: Can I share that I almost failed Charlie Poole’s epi class at UNC? Well…I did.

Mollie: A lot of science is constructing a narrative around uncertain estimates. P-values, and particularly their use in NHST, allow people to make decision rules– it feels like a way to let science speak for itself. But P-values are routinely misunderstood by both producers and consumers of research, NHST leads to publication bias, and we’re all spiraling into a meta-science pit of despair from which there is no escape.

Alex: Technical summary: P-values are most often used for NHST, which, as used today, is a bastardization of the approaches originally laid out by Fisher and NP, which emphasized that p-values are continuous (Fisher) and that cost/benefits should be considered when designing hypothesis tests (NP). Instead, people often test against the null hypothesis of no effect with, typically, alpha=0.05, and make dichotomous yes/no decisions based on the results.

Laure: NHST is still omnipresent in introductory stats courses, to the point that researchers do not even realize it is contested.

Darren: The technical problem with p-values is that they are in fact technical. Their use requires careful consideration. 100+ years of conversations have been polluted with “gotcha” talking points among statisticians, leaving researchers to fend for themselves. In the face of poor incentives, people have taken the path of least resistance, turning an absolute genius of an idea into a silly caricature of objective reasoning from data. This problem won’t be sorted until we stop looking for things like “p-values that everyone can understand easily”. Fin.

Andrew: My goodness, I think my work is done.  Darren, Mollie, and Alexander all have covered things so beautifully.

Noah: :heart_eyes: It’s . . .  glorious

Andrew: Oh, boy, and Laure just hit on another point that we ought to talk about more. NHST is still omnipresent in introductory stats courses, both inside and outside of stats departments, and a nontrivial number (read: significant majority) of researchers that I work with or encounter are not even aware that this is a controversy.

Noah: Yes, indeed, but not today!

Noah: ONE MORE TIME! Unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, what should we do about p-values? Scale of 0 (ban ALL the p-values) to 100 (I love p-values, especially in table 1).

Mollie: 48 from 50

Darren: Not fair with the table 1 quip

Noah: Literally the first word is “unfair.” I’m going from 49 to 51.

Laure: I’m going from 70 to 60

Alex: I’ll go from 50 to 51

Andrew: I’ll stick with 67. There are certain situations where they’re clearly nonsense and ought not to be used.

Laure: regression to the mean?

Darren: 100 – a time and a place

Mollie Wood is a postdoctoral researcher in the Department of Epidemiology at the Harvard School of Public Health. She specializes in reproductive and perinatal pharmacoepidemiology, with methods interests in measurement error, mediation, and family-based study designs. Tweets @Anecdatally

Andrew Althouse is an Assistant Professor at the University of Pittsburgh School of Medicine.  He works principally as a statistician on randomized controlled trials in medicine and health services research.  Tweets @ADAlthousePhD

Darren Dahly is the Principal Statistician of the HRB Clinical Research Facility Cork, and a Senior Lecturer in Research Methods at the University College Cork School of Public Health. He works as an applied statistician, contributing to both epidemiological studies and clinical trials, and is interested in better understanding how we can improve the nature of statistical collaboration across the health sciences. Tweets @statsepi.

Laure Wynants is Assistant Professor at the Department of Epidemiology at Maastricht University (The Netherlands) and post-doctoral fellow of the Research Foundation Flanders at KU Leuven (Belgium). She works on methodological and applied research focusing on risk prediction modeling. Tweets @laure_wynants

Alex Breskin is a postdoctoral researcher in epidemiology at the UNC Gillings School of Global Public Health. His methodological research focuses on study designs and methods to estimate practice- and policy-relevant effects from observational data. Substantively, he focuses on the evolving HIV epidemic in the United States. Tweets @BreskinEpi

Noah Haber is a postdoctoral researcher at the Meta-Research Innovation Center at Stanford University (METRICS), specializing in meta-science from study generation to social media, causal inference econometrics, and applied statistical work in HIV/AIDS in South Africa. He is the lead author of CLAIMS and XvY and blogs here at Tweets @NoahHaber.

Chat: Should we placebo control in late phase clinical trials?

Welcome to the first of a series of chat posts, in which we get a handful of experts together to chat about important issues in the world of health science and statistics. We’ll be hosting these on a regular basis for the foreseeable future. First up:

Should we placebo control in late phase clinical trials?

<Editor’s note: The transcript below is lightly edited from the original for clarity. Additional edits noted at bottom of page>

Noah Haber (postdoc in HIV/AIDS, causal inference econometrics, meta-science): Let’s kick off with an unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, on a scale of 0 (never placebo) to 100 (always placebo), should late phase clinical trials control the treatment of interest against a placebo?
Everyone must give a number!

Mollie Wood (postdoc in repro/perinatal pharmacoepi): Can we assume participants are spheres?

Noah: Spheres of infinitely small size

Andrew Althouse (professor at University of Pittsburgh, clinical trial statistician): I’ll toss 80 out there.

Boback Ziaeian (professor at UCLA in the Division of Cardiology): 90

Emily R Smith (research at Gates Foundation and HSPH in global health, MNCH, and nutrition): 90 (Assuming a placebo is ethical / appropriate)

Mollie: also, 80

Noah: Looks like I get to play devil’s advocate today!

Andrew: Fascinating. Round number bias. I’ll revise my answer to 77.3

I think “assuming a placebo is ethical” is a given.

Mollie: I have a quick possible-digression about this, prompted from a twitter thread the other day.
Are we too quick to say “setting ethics aside” or similar?

Noah: Probably, but for now, let’s set ethics aside 🙂

Mollie: OK, but I’m gonna harp on this later

Noah: Noted, harping will occur. We need to do some defining before we go much further. Who wants to give a definition of what “late phase clinical trial” actually means?

Emily: I equate late phase trials to Phase III / IV clinical trials. The Phase III/IV delineation is commonly used in the drug development and regulatory space. It indicates something about the size of the study and the purpose of the study. These later phase trials are larger (e.g. 300 to 3,000 people perhaps) and are meant to show efficacy and monitor potential adverse outcomes.

Boback: For prescription or device therapy it’s typically a phase III trial meant to evaluate efficacy or “non-inferiority.”

Emily: To put it simply, it is a trial to demonstrate whether a new ‘treatment’ is as good as or better than the existing standard of care.

Andrew: A layperson’s definition, maybe, is “the trial that would be strong enough evidence to make people start using the drug if positive.” I’ll go with “Phase III / pivotal trial that would grant drug approval” for pharmaceuticals in development / not yet FDA approved.

Mollie: Maybe also good to note that the trial could be for a new indication for an existing approved drug or device- right?

Emily: Good point

Andrew: Agree, Mollie.

Noah: What’s the usual logic behind placebo controls?

Boback: Control arm by default is always usual care +/- placebo or sham control.

Emily: For now, this isn’t a debate about whether to use a placebo or other control. This assumes the standard of care is currently nothing or something without evidence.

Why do we use placebos? There are so many good reasons!

Noah: Gimme a list!

Boback: Participants and study administrators are blinded to treatment arms.

Andrew: Generally, avoidance/minimization of bias (in assessors and participants). So participants report their symptoms honestly without knowledge of their treatment assignment. And likewise, assessors treat the patients the same / assess the outcome the same way, rather than letting their knowledge of what the patient is getting influence them in some way.

Boback: Allow for equal Hawthorne effects in both arms. The Hawthorne effect is the realization that by studying people in a research setting their behavior may naturally shift. Perhaps they become more adherent and avoid toxic habits like smoking etc. with study participation.

Emily: And, we want to avoid the ‘placebo effect’. We are all susceptible to feeling better when we are given something. For example, your toddler feels better when you kiss their ouchie.

Andrew: Boback, don’t use such big words. This is a chat for the people

Mollie: And disease course changes over time- we want to know if it changes more for active treatment than control. Or less! sometimes treatment halts progression

Noah: Imma start off my devil’s advocating strong with Andrew Althouse on the “bias” part. Or at least get more specific. Bias against what?

Mollie:  Here’s one: if you’re a scientist and you’ve started a drug trial, you probably believe your drug works

Emily: Assessor or participant bias is a major concern when the outcomes of interest rely heavily on self perception. For example, is your pain better or worse? Can you concentrate at school more easily? Have your child’s motor or language skills improved? I personally feel less concerned about assessor bias when the outcome of interest is static/objective/easy to measure. For example, mortality is a clear study endpoint. It’s harder to imagine bias creeping into this assessment.

Boback: The consent process for a study typically requires full disclosure of what the study is designed for. “We are evaluating fish oil to see if you don’t develop coronary artery disease. If you consent, we will randomize you to treatment or placebo for the next 5 years.” If you consent someone and say “sorry you didn’t get the drug, we are going to just give you nothing and check in on you every 6 months,” the participant may say forget you, I’ll buy over the counter fish oil myself. Or they may feel so depressed that their stress hormones go up and they eat more Cheetos, which increases their coronary artery disease risk.

Noah: Boom, let’s use Boback’s example. Do love me some Cheetos.

So, let’s give a scenario. I am treating a patient (which should NEVER HAPPEN) for coronary artery disease. I have heard of such a thing as a “placebo effect.” Now I decide what treatment to give you. Drug A or ______. Where _____ is almost never placebo.

Andrew: You can play doctor in this chat, Noah. I am pretty pro-placebo-controlled-trial, but I see where Noah’s going with this.

Noah: My job is, in general, to pick the thing that is most likely to make my patient better. If the clinical trial was randomized, placebo controlled, and double blinded, that doesn’t look AT ALL like what my clinical decision is. Because the patient’s response in the real world INCLUDES all of those things we just got rid of. Correct?

Emily: That’s true – clinical trials look very different than clinical practice. And that’s the point! We didn’t eliminate the real world in a clinical trial. We made sure the ‘real world’ things happening aren’t the causes of your good/bad health.

Boback: The trial isn’t meant to mimic reality, it’s meant to neutralize confounding factors and estimate a direct treatment effect.

Andrew: Right. Your point is, the real world is either “We’re going to start you on this drug” or “Have a good year. We’ll see you next year”

Mollie: Hopefully, we eliminated the clinician saying “patient A is really sick, he’d better get active treatment.” Patient B is doing great, let’s give him the placebo. Wait, the drug is killing people, what happened?

Andrew: So, Boback is opening the “efficacy or effectiveness” door

Emily: An efficacy clinical trial is meant to know if a treatment works or not (in the ideal setting). In contrast, an effectiveness trial is meant to know if the treatment will work in the real world context.

Andrew: Difference between “try to figure out if this drug is biologically active” and “will adding this drug to clinical practice be a net benefit / net harm”

Boback: Physicians and hopefully patients want to know treatment effects of therapies. “What happens to this patient in front of me if I treat them with X?”

Emily: So here we’re talking about ‘late phase’ clinical trials – efficacy trials. Where we are trying to learn if “the drug is biologically active”

Andrew: In a placebo-controlled trial, we avoid the bias introduced by participant/assessor knowledge of what the patient is getting, and get a good estimate of the “true biological effect” of the drug. But Noah’s point is that people acting differentially based on knowledge of whether they’re getting a drug or not is going to be part of what happens in the real world

Noah: Exactly. And all we care about in the end is what happens to our patients

Andrew: The question, then, is for a late-phase trial where the “fate” of a new drug hangs in the balance, which estimate do we care about more

Emily: But we can’t know if it’s a good idea (e.g. efficacious, safe) to proceed to the ‘real world’ until we have some evidence!

Andrew: Right, that’s why I have trouble fully embracing Noah’s idea

Noah: Isn’t that evidence enough? If it doesn’t work because of some effect of knowing what treatment you are on, won’t that happen to all patients too?

Emily: Noah, it’s not that it doesn’t work because you know you’re on treatment. It’s that you might ‘feel better’ thinking you got a treatment

Mollie: Is this still a frictionless plane?

Noah: Friction mode restored.

Mollie: Well, treatments have costs. Trying treatment A means you didn’t try treatment B, and approving treatments that work because of positive thinking means resources are going to those treatments that would be better spent on others

Emily: Mollie brings up a new point here – we have to allocate resources. Doctors have to make choices. Health systems have to provide supplies. How do we make those choices?

Mollie: I don’t want to go into a whole cost-effectiveness Thing here, I just wanted to point out that introducing a drug into the formulary that genuinely does nothing is not a neutral act– it means that patients who might have benefited from a different drug instead go untreated, and finite resources are wasted.

Noah: Knowing you got a treatment is free. If it makes a big improvement on outcomes, aren’t we losing something if we don’t include knowing you got treatment in the cost-effectiveness analysis?

Emily: Everyone in a placebo-controlled trial thinks they might have gotten the treatment!

Boback: We want to estimate the treatment effect of the intervention without the belief in the intervention. Most trials probably bias towards the null based on the intention to treat principle. We quantify treatment effects based on what group a person was randomized to not whether they adhered to treatment. For almost all trials, patients are not fully adherent or drop out at some rate which makes estimating the treatment effect not exactly what we hoped to measure. The beneficial effects if measured are underestimated.

The whole concern with all the randomized trials of stenting for angina was that the control arms were never blinded to the procedure until the Orbita trial was published last year. Prior to Orbita, trials compared invasive angiograms to medical therapy and claimed stenting improved symptoms more. However, Orbita did angiograms on everyone and patients did not know whether they had a stent for 30 days, at which point they were unblinded. No significant difference was noted in exercise time or other primary endpoints.

Same with sham arthroscopy of the knee. Patients post-procedure naturally get better and the treatment effects were confounded by the act of receiving a procedure itself.
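Boback's intention-to-treat point above can be illustrated with a little arithmetic. The sketch below is not from the chat: it assumes the simplest possible dilution model, in which non-adherent participants in the treatment arm get no benefit and nobody in the control arm gets the drug. The function name and all numbers are hypothetical.

```python
def itt_effect(true_effect: float, adherence: float) -> float:
    """Expected intention-to-treat effect under a simple dilution model.

    Assumptions (illustrative only): participants who do not adhere get
    zero benefit, and no one in the control arm receives the treatment.
    """
    if not 0.0 <= adherence <= 1.0:
        raise ValueError("adherence must be between 0 and 1")
    # Only the adherent fraction of the treatment arm contributes any effect.
    return adherence * true_effect


# Hypothetical: a 10-point improvement among people who fully adhere.
true_effect = 10.0
for adherence in (1.0, 0.8, 0.5):
    print(f"adherence={adherence:.0%}  ITT estimate={itt_effect(true_effect, adherence):.1f}")
```

Under these assumptions the measured effect shrinks in proportion to adherence, which is one way to read the claim that intention-to-treat analyses tend to bias towards the null.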

Andrew: I think we need Noah to go into more detail about the direction that he believes this works. Because we’re all pretty clearly grounded in the idea that using a placebo is meant to wash out the “knowledge you got a treatment makes you feel better” effect.

Noah: Ok, let me lay out the case. The central idea is that we SHOULDN’T wash out the “knowledge you got a treatment makes you feel better” effect. Because that effect is part of (maybe a HUGE part of) the total effect the clinician and patient face when they decide to treat or not to treat. So not including that is its own kind of bias, in the context of the treatment decision.

Emily: This can be captured AFTER we know whether there is a meaningful biological effect of the treatment itself. For example. If you ‘feel better’ after consuming arsenic / rat poison — should we give it to everyone? NO! It’s dangerous!

Noah: I don’t think I want to be in that trial

Emily: Me either, and this is why we need to carefully prove there is a ‘real’ biological effect of a treatment by using a placebo/control

Andrew: Using Boback’s example: if sham-controlled trials show that knee arthroscopy’s benefit is entirely explained by “knowledge that you got a procedure” then why bother doing any knee arthroscopes? Just do sham procedures and save everyone the trouble. Send them to the OR, have everyone stand around looking serious, give them a local anesthetic, stand there awhile, tell them the procedure went well and we’re good. Same thing could be applied to drug trials – if the drug can’t outperform a sugar-pill as placebo (even if part of the benefit is “belief that they’re getting a new drug makes them feel better”) why bother with the expensive drug? Just give them Placebonium.

Emily: Well said

Boback: So Noah are you saying you want to include the “placebo effect” as a benefit for a prescribed treatment for a patient?

Noah: I’ll say yes, or at least there’s an argument to be made for it.

Emily’s point about separation is important. If we can know both, separately, we obviously should. Practically, do we ever have the time and money to do these giant, expensive trials both ways (with a placebo control and with a “do nothing” control)?

Boback: Well, then it’s probably just worth quantifying the placebo effect. Compare placebo to no treatment and active treatment.
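The three-arm comparison Boback describes can be sketched as simple arithmetic. This is an illustration rather than anything the participants wrote: it assumes the placebo effect and the drug-specific effect simply add up, and the class name, field names, and numbers (mean improvement on some symptom score) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreeArmResult:
    """Mean improvement in each arm of a hypothetical three-arm trial."""
    no_treatment: float
    placebo: float
    active: float

    def placebo_effect(self) -> float:
        # What believing you were treated adds over doing nothing.
        return self.placebo - self.no_treatment

    def drug_specific_effect(self) -> float:
        # What the drug adds over placebo (the usual blinded-trial estimate).
        return self.active - self.placebo

    def total_effect(self) -> float:
        # What a clinician choosing "treat" vs "do nothing" would see.
        return self.active - self.no_treatment


# Hypothetical numbers, chosen only to make the decomposition visible.
result = ThreeArmResult(no_treatment=2.0, placebo=5.0, active=9.0)
print("placebo effect:      ", result.placebo_effect())
print("drug-specific effect:", result.drug_specific_effect())
print("total effect:        ", result.total_effect())
```

Under the additivity assumption, the total effect is the sum of the other two, so a trial with all three arms recovers both the estimate a placebo-controlled trial gives and the "treat versus do nothing" contrast.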

Noah: The “placebo effect” changes for every treatment. If we can only do it one way, shouldn’t we do it the way most relevant for the clinical decision?

Boback: There are plenty of reasons why randomized trials are expensive and time consuming to perform. I do not think the placebo issue is the main problem. There’s more to be said for using our statistical understanding better and building a pragmatic trial infrastructure.

Emily: Maybe we can talk about other reasons you wouldn’t want to use a placebo? Andrew and Mollie said 80%. What’s happening in the other 20%?

Noah: I am the 20%.

Andrew: Re: when a placebo control wouldn’t be needed: I think basically any Phase III (drug approval) study should be placebo controlled (i.e. either there is no accepted therapy for the condition so we’re truly talking therapy vs. nothing, or if there’s an accepted therapy, the patient is blinded so they don’t know if they’re getting standard-of-care or new-experimental-drug, which may or may not require a “placebo” to achieve said blinding depending on what SOC is). I’m a *little* more ambivalent for studies of drugs that are already approved or in use.

Boback: The other point is that placebos are probably not necessary for “hard endpoints.” For all-cause mortality, it’s hard to imagine that getting placebo vs. not getting it would influence your risk of dying.

Andrew: But I guess that depends if we’re including real-world, head-2-head CER trials of existing things as “late stage trials”

Emily: I tend to agree Boback. My only caveat — in my line of work — there aren’t vital registration programs and we rely on research staff to find ‘missing’ patients. So some people still worry about staff/assessor bias. For example – this child got treatment, so maybe they didn’t come to clinic because they are on vacation. Whereas, this child got no treatment, so maybe they didn’t come to clinic because they’ve passed away – I will go visit the household to find out.

Mollie: Yeah, I think my 80% comes from my bias as a postmarketing surveillance researcher- none of the treatments I deal with are pre-approval

Noah: What’s different about postmarketing surveillance?

Mollie: So I work almost entirely on drug safety in pregnancy, and most of the time, the relevant clinical question is “should this woman planning to get pregnant discontinue her methadone or switch to buprenorphine?” or similar.

This is maybe a little too detailed, but there have been SO MANY studies of antidepressant safety in pregnancy, some showing harm, some not, that I am almost ready to say you could ethically do a placebo controlled trial to get the right answer, but man, clinicians do not seem to agree.

Emily: Mollie this is a great point. It can be quite controversial as to whether or not there is equipoise (genuine uncertainty about which treatment is better) for a placebo.

Mollie: Equipoise is hard! You have to be truly unsure about the possible benefit or risk of the drug.

Mollie: I think in the postmarketing space, equipoise for placebo control is almost never really there.

Andrew: Mollie’s last comment brings up an interesting example, which makes me wonder if we’re just talking about “placebo” or more broadly about “blinding.” I’ve seen some trials that were h2h CER trials of 2 active drugs that didn’t look like one another where both arms had to take their active drug and a placebo that LOOKED like the other one to achieve full blinding. So is our issue just about using “placebo controls” where the choice is “something versus nothing” or is it a broader debate about making sure the participant/assessor isn’t sure who’s getting what (even in setting of 2 active agents?) i.e. if one active drug dosed once/daily and the other is dosed twice/daily, each arm had to take a “placebo” on the schedule of the other drug so they wouldn’t know which drug they were on

Boback: Placebo is very frequently broken in trials. If you are getting a cholesterol lowering drug, it’s hard not to realize your lipid values are dropping on follow-up testing.

Emily: If this is the case, then I move my original estimate to say that I think that trials should be blinded 95%+ of time!

Noah: Wait! I go the other way!

Emily: You do?!

Noah: If blinding/placebo is going to be broken anyway, why are we doing it in the first place?

Andrew: It sounds like the world Noah describes is that these trials shouldn’t be blinded.  

Boback: I’ve always wished trials routinely reported at the end of trials what portion of participants believed they received active treatment.

Mollie: It would be nice if it were routine

Andrew: But that brings up something Boback said at the very beginning. If you’re in a high-mortality space (cancer) and the patient isn’t blinded, they’re probably walking out of the trial immediately.

Emily: Yes, agree with Andrew and Boback – sometimes including a placebo means that you can’t recruit a representative sample.

Boback: But it’s always interesting to see all the side effects in placebo arms.

Andrew: In theory, patients shouldn’t be enrolling in RCT to get access to experimental treatments, but they’re still probably not hanging around that trial once they’re assigned & told they’re in the placebo arm.

Boback: If someone is going through the trouble of running an RCT, there’s probably a need or a large market for what they are proposing.

Emily: My #1 practical reason for including blinding/placebo/control is that it’s the hallmark of high quality evidence. And in order for evidence to make it through regulatory / policy processes – it needs to be high quality! And why are we generating evidence if not to change policy and practice?! Thus, I vote placebo for president.

Mollie: I’d prioritize randomizing and blinding over placebo. Unless it’s a totally new drug to treat a disease with no current treatment

Emily: Agree with Mollie. If we’re in the real world – then placebo likely not appropriate in many cases. For ethical reasons. (A placebo isn’t ethical when there is an existing treatment or practice that is either recommended (by governing/regulatory bodies) or is commonly practiced by physicians.)

Andrew: Right, a “placebo” is kind of inextricable from blinding. If a placebo isn’t needed for “blinding” then fine – no placebo. But even in some trials with 2 active agents the placebo is needed to preserve the blind. (the earlier example of one drug given 1x daily vs another given 2x daily)

Mollie: Yeah, no argument there

Andrew: So a placebo’s main function is a means to preserve blinding

Noah: And to my point earlier, my argument is really more generally against blinding, by way of placebo controls

Emily: To preserve blinding AND to account for placebo effect. (I think they are two separate points?)

Andrew: Right, Noah just thinks that blinding is problematic because it doesn’t = “real world treatment effect”

Noah: Yup, and in the end, the “why” doesn’t matter so much as what you get in the end

Mollie: Noah, are you happy with a pre/post measurement on just the treated group?

Andrew: (shrieks in horror)


Mollie: Nuke it from orbit. It’s the only way to be sure.

Mollie: But seriously, that’s the effect a doc sees when they treat a patient, right? In the real world.

Mollie: …historical controls?

You’re scaring me, man.

Andrew: I reluctantly admit that I’m giving more thought to historical controls as a viable option in some situations, though the statistician in me still hates it

Noah: HA

Emily: I am also thinking a lot about historical controls!

Noah: This is probably why I’m not a real doctor, which we can all be thankful for. But to clarify, my version is to control against “don’t treat at all,” which, going back to earlier point, would be tough to recruit with as an option. I mean real current trial controls, where half the people just don’t get treated. But sometimes historical controls can be useful. . .

Mollie: Nooooooo, guys

Emily: In the context of adaptive trial design … it starts to make some sense.

Andrew: Specifically, very-high-mortality where there is no known effective treatment option, with novel device/drug as the only real option available. I might be able to stomach comparison against historical controls. This is a bit off topic, perhaps can be picked up in a future chat

Noah: Let’s switch topics a bit, and talk about the research science meta-verse

Mollie: My favorite meta-verse!

Noah: Then let’s get super meta. If all of our past research now had been with placebo / blinded controls vs none of it was. What might we know more or less of now? Would we understand more about biological mechanisms?

Andrew: if we just replaced all placebo controlled trials that have been done with Noah Haber style trials?

Noah: Exactly. What would we gain/lose. Except lose, because Noah style trials are perfect <editor’s note: I regret tacitly agreeing that this should be called “Noah Haber style”>

Emily: There would be even more molecules approved for treatment of depression. (This is an area of research known to be especially sensitive to the placebo effect).

Mollie: I assume we’d be using Zicam for cancer treatment

Noah: But only if believing in zicam (to do anything at all) had an actual clinical effect, right?

Andrew: I mean, we’d certainly know less about true biological effects

Noah: True. So, quick recap: the main argument against placebo controls is super tied in with the idea that we also shouldn’t blind because we are sufficiently far from real world conditions (which include placebo effects) that our measured effects aren’t realistic. HOWEVER:

Andrew: With placebo control (and blinding), we can be *reasonably* certain that the effect observed in the trial is actually an effect of the drug being tested and not simply the effect of feeling like you get something better. Without placebo control (i.e. trial is “drug versus nothing”), in theory you may get a better estimate of the “real world” effect since that’s what the “real world” will be (drug or…whatever else), but you run the risks of patient dropout and other behaviors influencing their outcomes.

Mollie: I see two major ethical concerns. First, approving treatments that don’t “work” (beyond placebo effect) takes away opportunities for patients to be treated with drugs that DO work. Second, resources are finite and we shouldn’t be spending limited funds on ineffective treatments. Removing placebo controls risks violations of both of these. (Third, I do not think we need to help pharma companies any more than we already do).

Boback: The only proper way to measure treatment effects is to keep patients and clinicians in the setting of an RCT blinded to the intervention to avoid introducing confounding factors. RCTs are meant to measure treatment effects. Randomization is our best tool for breaking confounding and for most endpoints, blinding is required to preserve patient and clinician behavior. It also allows for objective measures of adverse events/side effects.

Noah: Great, ok, last thing! In an ideal world, in a frictionless plane in a vacuum, on a scale of 0 (never placebo) to 100 (always placebo), should late phase clinical trials control the treatment of interest against a placebo?

Boback: 90

Noah: 45

Andrew: I’ll stick with my 80. Majority of the time. But there may be some settings where I could be convinced that a non-placebo-controlled design is appropriate

Mollie: In this frictionless plane, are there other treatments available?

Noah: Yes, but also frictionless.

Mollie: Then I’m sticking with my 80

Emily: (I’m still unsure if placebo means control here) But I’m increasing to 96% assuming we’re talking placebo/control!

Noah: Ha. I’ll call that a day! Thanks y’all!

Emily: Thanks, friends!

Mollie: Thanks everyone!

Andrew: Thanks everyone. this was fun, good to kick things around with other smart people.

Boback: Thanks for including a manatee.

Boback Ziaeian is an Assistant Professor at UCLA and the VA Greater Los Angeles in the Division of Cardiology. As an outcomes/health services researcher and cardiologist his primary interest is improving the receipt of high-value care and reducing disparities for cardiovascular patients. Tweets @boback.

Andrew Althouse is an Assistant Professor at the University of Pittsburgh School of Medicine.  He works principally as a statistician on randomized controlled trials in medicine and health services research.  Tweets @ADAlthousePhD

Mollie Wood is a postdoctoral researcher in the Department of Epidemiology at the Harvard School of Public Health. She specializes in reproductive and perinatal pharmacoepidemiology, with methods interests in measurement error, mediation, and family-based study designs.  Tweets @anecdatally

Emily R. Smith is a Program Officer at the Bill & Melinda Gates Foundation and a Research Associate at the Harvard School of Public Health. Her research focuses on the design and conduct of clinical trials to improve maternal and child health globally. Tweets @emily_ers

Noah Haber is a postdoc at UNC, specializing in meta-science from study generation to social media, causal inference econometrics, and applied statistical work in HIV/AIDS in South Africa. He is the lead author of CLAIMS and XvY, blogs here, and tweets @NoahHaber.

Edit notes: Made an edit to change “late stage clinical trial” to “late phase clinical trial” to clarify that this was not specific to late-stage cancer trials.

Discussion: Inferring when Association Implies Causation

Noah Haber
Standard health research dogma dictates that the “correct” way for authors to deal with weak causal inference is to just call it association. Papers that say that coffee is “associated/linked/correlated” with cancer are acceptable for publication, even if they don’t give any useful inference about the actual impact of drinking coffee, as long as they don’t use the word “caused.” While it is extremely difficult or even impossible to estimate the causal effect of coffee on cancer, it is relatively easy to publish a paper about the association between the two. As others have noted, this creates a serious issue where a huge number of misleading studies are published, get distributed to the public, distract from good studies, and do real harm.

Association is a powerful research tool to answer the right questions with the right methods, but not for the kinds of questions and methods for which you need causal inference. Stranger yet, the culture of “don’t use the word cause” is so strong that there are even papers which find really strong evidence of causality, but stay on the “conservative” side and just say association.

In CLAIMS, reviewers found that 34% of authors in our sample used stronger technical causal language than was appropriate given the methods. Most academic authors follow the technical rules, if not their spirit. What about the remaining 66%? How many of those implied causality through means other than “technical” language? Can we reasonably infer how these studies might have misled through sloppy methods, hints, nudges, and reasonable misinterpretation?

If technical language is an unreliable method of determining whether the study implied causality, how can we infer those implications? I have a few ideas below for discussion, but would LOVE to hear your thoughts on where I get this wrong, better explanations, general disagreements, etc.

Decision implications

Go straight to the discussion section and read what the authors say people should do or change based on their results. In almost all cases in which the authors recommend changing the main exposure to change the outcome of interest, they implied causality. A study about the association between coffee and cancer that concludes that you should drink more or less coffee to avoid cancer, or even simply says coffee is “safe” to drink, relies on estimating the causal effect of coffee on cancer. If their methods weren’t up for the task, the study is misleading.

In general, if the study was truly useful for association only, changing the exposure of interest will usually not be the main action implication. If the question of interest is disparities in outcomes between groups (such as race), the authors would, in general, not suggest that people switch groups. Similarly, finding associations to better target interventions doesn’t imply that we need to change the exposure, but rather that the exposure is a useful metric for identifying targets of interventions.

This can get tricky, particularly when the exposure of interest is a proxy for changing something else that is harder to measure, such as laws as a proxy for the causal impact of the political and cultural circumstances that bring about a change in the law, plus the impact of the law itself. As usual, there is no simple rule or formula to follow.

Question of interest

In the great words of Randall Munroe of XKCD (hidden in the mouseover): “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.'”

Some associations inherently imply causality. Virtually every study in which individual consumption of something is the exposure and some health effect is the outcome of interest implies causality. One way in which the association might inherently imply causation is simply the lack of useful alternative interpretations. For example, there is little plausible reason why merely studying the association between coffee and cancer is useful for anything except when you have identified causal effects of coffee on cancer.

I find it helpful to try to think about plausible ways that the association between X and Y can be useful, firstly in my head and secondly from those the authors describe. For each item, I strike out ones that require causality to be inferred. If I have no items remaining and/or if the remaining items seem implausible, that may hint that the question of interest has inherent causal implications. Even then, there are two caveats: 1) my inability to come up with a good non-causal use does not mean one does not exist, and 2) even if one does exist, the association could still inherently imply causation.

Look for language in the grey zone

The list of words that are taboo because they mean causality is short, consisting mainly of “cause” and “impact.” The list of words in the grey zone is much longer, and not always obvious. My personal favorite is the word “effect.” For some reason, the phrase “the effect of X on Y” is more often considered technically equivalent to “the association between X and Y” than “the causal impact of changing X on Y.” While “effect” is sometimes used purely as shorthand, I find that it is more often used when authors want to imply causality but can’t say it. Curiously, “confounding”/“confounders” is not on the causal language taboo list, even though it implies causation by definition.

Statistical methods

Some statistical methods and data scenarios strongly imply causality. In many cases, this is simply because the methods eliminate all alternative interpretations, such as when authors control for dozens of “confounding” covariates. Some methods are developed specifically to estimate causal effects, and have limited application outside of causal inference.

This one is unfortunately in the statistical/causal inference experts only zone, since it requires a fairly deep understanding of what the statistics actually do and assume to tease out implications of causality.

Intent vs. implication

It is important to understand that the study authors making these implications aren’t generally bad people, and may genuinely not have intended to imply causality when inappropriate. In some cases, they may simply not mean to make causal implications. In other cases, they may have been led to certain uses of language by reviewers, editors, co-authors, or media writers. Alternatively, the most misleading articles are simply the ones that will be most likely to be published and written about, and therefore most likely to be seen.

However, as always, some of the blame and responsibility lies with us, the researchers. We should be careful generating studies where causation is implied, regardless of what the technical dogma tells us is right and wrong. We should learn to be more honest about what we are studying, embrace the limitations of science and statistics, and fight to create systems that allow us to do so.

Getting (very) meta part 4: How did the media do covering CLAIMS?

At the time of this writing, five unique media articles about the CLAIMS study, four unique press releases, and a few copies in different outlets have been published. We have them all listed here and at the bottom of this page, which we will be updating on a continuous basis.

So, how (in our opinion) did the media do?

TL;DR: The handful of media outlets which covered CLAIMS did a pretty decent job.

1) Coverage was limited, and mostly from small outlets

No huge surprise there. CLAIMS is a bit of a niche study, albeit one designed to be the foundation of studies which are not-so-niche. It involves academia, media, and social media, but without providing a clear narrative of what we are supposed to do about it. The study caught on a bit in smaller outlets, but none of the giant mainstream ones, which is roughly what we expected to see, noting that very few research articles receive any coverage at all. The largest media outlet that covered our article is probably, which mostly covers and critiques news media coverage of health studies. Limited exposure is almost certainly for the best given the slightly complicated nature of our results, but it definitely limits what we can infer about how the article was covered. That being said…

2) Most outlets had a (very) slight preference for a particular narrative

As above, CLAIMS doesn’t, and can’t, say that any particular party – like academia, researchers, journals, news editors, journalists, or social media sharers – is more to “blame” for our result than any other. However, most of the articles had a bit of a focus on one particular party over the others. Some focused a bit more on the media side, and others a bit more on the academic side. These were typically fairly small leanings, and probably not a big deal. We did not observe anything close to extreme skew, like claiming that our study finds that academia, media, or social media are “broken” or similar.

3) Most (but not all) journalists contacted the team for quotes and pre-publication clarifications

It’s nice when we can be involved with the way that our articles are being communicated, particularly when we go through the efforts that we do to explain our study here on Our approval is not required by any means, and we respect journalistic independence, but sometimes it helps. Science is complicated and easy to mistranslate. Most authors reached out to us for quotes, and of those, most gave us a copy of their article before they published it to check if we had corrections. That probably helped with accuracy.

So, all in all, pretty good job. Some more minor notes below:

4) Some sites reported (wrongly) that only RCTs can produce strong causal inference

This was a bit of an odd one, and in one case we wrote specifically to the media authors in an effort to correct this mistake (to no avail). RCTs certainly make causal inference much easier, but they aren’t the only way you can get strong causal inference. This mistake appears in articles from outlets that are more critical of news media and health research, which is slightly ironic. Sometimes simplifications are necessary, but in my opinion, this one can only do harm.

5) Don’t read the comments on news articles

It’s the internet. As most have learned by now, comments on news articles are terrible, and these are not particularly exceptional. The comments on these articles vacillate between reasonable discussion and absolute nonsense. I looked so you don’t have to. You’re welcome.

Media coverage:

Internet media

| Title | Organization | Type | Publication date | Notes/Disclosures |
| Findings in science, health reporting often overstated on social media | Harvard Gazette / Harvard TH Chan School of Public Health | Press release | June 5, 2018 | Study authors worked with press office for this press release |
| Can’t say we didn’t warn you: Study finds popular health news stories overstate the evidence | HealthNewsReview.org | News / blog article | June 13, 2018 | Article author interviewed Noah Haber before publication |
| Health misinformation in the news: Where does it start? | KevinMD.com | News / blog article | June 20, 2018 | Nearly identical to article |
| Study examines the state of health research as seen in social media | University of North Carolina at Chapel Hill Gillings School of Global Public Health | Press release | June 19, 2018 | Study authors worked with press office for this press release |
| Overdrijven de media gezondheidsnieuws? | EIS Wetenschap | News / blog article | June 20, 2018 | |
| Karra publishes study in PLOS One | Boston University | Press release | June 4, 2018 | |
| UNC Study Examines the State of Health Research As Seen in Social Media | Association of Schools & Programs of Public Health | Press release | June 28, 2018 | Appears to be a direct copy of UNC press release |
| Redes sociales han alterado la forma en que se presentan las noticias de salud | FNPI | News / blog article | Unknown, published at least before June 28 | |
| Echoing the network | Nieman Lab | News / blog article | August 6, 2018 | |
| Health News In Crisis? | European Journalism Observatory | News / blog article | July 18, 2018 | |
| 'A large grain of salt': Why journalists should avoid reporting on most food studies | CBC News | News / blog article | September 6, 2018 | Article author interviewed Noah Haber before publication |
| Il giornalismo sanitario è in crisi? | European Journalism Observatory | News / blog article | September 21, 2018 | Appears to be a reposting of a previous EJO article |

A practical primer on p-values

Noah Haber
There is a recent trend among health statisticians to discourage the use of p-values, which are commonly used to define a threshold at which something is “statistically significant.” Statistical significance is often viewed as a proxy for “proof” of something, which in turn is used as a proxy for success. The thought goes that the obsession with statistical significance encourages poorly designed (i.e. weak and misleading causal inference), highly sensationalized studies that have significant, but meaningless, findings. As a result, many have called for reducing the use of p-values, if not outright banishing them, as a way to improve health science.

However, as usual, the issue is complicated in important ways, touching on issues of technical statistics, popular preference, psychology, culture, research funding, causal inference, and virtually everything else. To understand the issue and why it’s important (but maybe misguided) takes a bit of explaining.

We’re doing a three-part set of posts, explaining 1) what p-values do (and do not) mean, 2) why they are controversial, and 3) an opinion on why the controversy may be misguided. Here we go!

What does a p-value tell us?

A p-value is a measure that helps researchers distinguish their estimates from random noise. P-values are probably easiest to understand with coin flips. Let’s say your buddy hands you a coin that you know nothing about, and bets you $10 that it’ll flip heads more than half the time. You accept, flip it 100 times, and it’s heads 57 times and tails 43. You want to know if you should punch your buddy for cheating by giving you a weighted coin.

To start answering that question, we usually need to have a hypothesis to check. For the coin case, our hypothesis might be “the coin is weighted towards heads,” which is another way of saying that the coin will flip heads more than our baseline guess of an unweighted 50/50-flipping coin. So now we have a starting place (a theoretical unweighted coin, or our “null hypothesis”), some real data to test with (our actual coin flips), and something to test (that the coin is weighted towards heads, or our “alternative hypothesis”).

One way we could compare these hypotheses is if we had a coin which we knew was perfectly balanced and we 1) flipped it 100 times, 2) recorded how many were heads, 3) repeated 1 and 2 an absurd number of times (each repetition being one trial), and 4) counted how many of our trials had 57 or more heads flipped and divided that count by the total number of trials. Boom, there is an estimated p-value. Alternatively, if you know some code, you could simulate that whole process in a program without knowing any probability theory at all.
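The simulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the original analysis: the function name `simulate_p_value`, the number of trials, and the fixed random seed are all assumptions chosen for the example.

```python
import random

def simulate_p_value(observed_heads=57, flips=100, trials=20_000, seed=1):
    """Estimate the share of fair-coin trials with at least `observed_heads` heads."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    extreme = 0
    for _ in range(trials):
        # One trial: flip a perfectly balanced coin `flips` times and count heads.
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if heads >= observed_heads:
            extreme += 1
    # Share of trials that were at least as head-heavy as what we observed.
    return extreme / trials

estimate = simulate_p_value()
print(f"Simulated p-value: {estimate:.3f}")
```

The estimate wobbles from run to run if you change the seed, and it settles down as the number of trials grows.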

Even better, we can calculate that number directly without having to flip a bajillion coins, using what we know about probability theory. In our case, p=0.0968. That means that, had you done this 100-flip test an infinite number of times with a perfectly balanced coin, 9.68% of those 100-flip tests would have had 57+ heads.
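For readers who want to see the probability-theory route, here is a hedged sketch that adds up the exact binomial probabilities of 57, 58, ..., 100 heads from a fair coin. The function name `upper_tail_p` is made up for this example, and the exact sum is an assumption about the method; it should land very close to the figure quoted above, with any difference in the last digit plausibly coming from the use of an approximation.

```python
from math import comb

def upper_tail_p(heads: int, flips: int) -> float:
    """Probability of at least `heads` heads in `flips` flips of a fair coin."""
    # Count the head/tail sequences with `heads` or more heads,
    # then divide by the total number of equally likely sequences.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

p_value = upper_tail_p(57, 100)
print(f"One-sided p-value for 57+ heads in 100 flips: {p_value:.4f}")
```

Note that this is a one-sided calculation (only results at least as head-heavy as ours count), which matches the bet in the example.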

To interpret a p-value, you might say “If the coin wasn’t weighted, the expected probability that we would have gotten 57 or more heads out of 100 had we repeated the experiment is 9.68%,” or “9.68% of trials with 100 flips would have resulted in having the same or greater number of heads than we found in our experiment.”

Should you punch your buddy? That depends on whether that number above is meaningful enough for you to make a punching decision. Ideally, you would set a decision threshold a priori, such that if you were sufficiently sure the game was sufficiently rigged (note the two “sufficiently”s), you would punch your buddy.

The default threshold for most research is p≤0.05, or 95% “confidence,” to indicate what is “statistically significant.” With our p=0.0968, it isn’t “significant” at that level, but it is significant if we make our threshold p≤0.10, or 90% confidence.

If you wanted to interpret your p-value in terms of the threshold, you might say something like “We did not find sufficient evidence to rule out the possibility that the coin was unweighted at a statistical significance threshold of 95%.”

A p-value is a nice way of indirectly helping answer the question: “how sure can I be that these two statistical measures are different?” by comparing what we observed with an estimate of what we would have expected to happen if they weren’t. That’s it.

Things get a bit more complicated when you step outside of coin flips into multivariable regression and other models, but the same intuition applies. Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest, what is the probability that you would have gotten an estimated value at least as big as you actually did if the process of producing that data was null?

What DOESN’T a p-value tell us?

Did you notice that I snuck something in there? Here it is again:

“Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest…”

A p-value doesn’t tell you much about the model itself, except that you are testing some kind of difference. We can test if something is sufficiently large that we wouldn’t expect it to be due to random chance, but that doesn’t tell you much about what that something is. For example, say we find that chocolate consumption is statistically significantly associated with longevity. What that practically means is that people who ate more chocolate lived longer (usually conditional on / adjusted for a bunch of other stuff). What that does NOT mean is that eating more chocolate makes you live longer, no matter how small your p-value is, because the model itself is generally not able to inform that question.

The choice of threshold is also arbitrary. P-values of 0.049 and 0.051 are practically the same, but one is often declared to be “statistically significant” and the other not. There isn’t anything magical about the p-values above and below our choice of threshold unless that threshold has a meaningful binary decision associated with it, such as “How sure should I be before I do this surgery?” or “How much evidence do I need to rule out NOT cheating before I punch my buddy?” 95% confidence is used by default for consistency across research, but it isn’t meaningful by itself.

P-values neither mean nor even imply whether or not our two statistical measures are actually different from each other in the real world. In our coins example, the coin is either weighted (probability that it is weighted is 100%) or it isn’t (probability is 0%). Our estimation of the probability of getting heads gets more precise the more data we have. If the coin were weighted to heads, our p-values would become closer to 0 both if the weighting was larger (i.e. more likely to flip heads), but ALSO if we used the same weighted coin and just flipped it more. The weight didn’t change, but our p-value did.
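To make that last point concrete, here is a small sketch that holds the observed share of heads at 57% and only changes the number of flips. The scenario is hypothetical (the flip counts 400 and 1600 are not from the coin example, and `upper_tail_p` is a name invented for this illustration); the calculation is an exact one-sided binomial tail against a fair coin.

```python
from math import comb

def upper_tail_p(heads: int, flips: int) -> float:
    """Probability of at least `heads` heads in `flips` flips of a fair coin."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

# Hypothetical scenario: the coin keeps coming up heads 57% of the time,
# and the only thing that changes is how many times we flip it.
results = {}
for flips in (100, 400, 1600):
    heads = 57 * flips // 100  # 57% of the flips, in whole numbers
    results[flips] = upper_tail_p(heads, flips)
    print(f"{flips} flips, {heads} heads: p = {results[flips]:.2g}")
```

The observed weighting is identical in every row, yet the p-value shrinks as the flips pile up, which is why a tiny p-value by itself says nothing about how large a difference is.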

Here’s another way to think about it: the coin in the example was definitely weighted, if only because it’s near impossible to make a fully unweighted coin. To choose whether or not you want to punch your buddy depends both on how much weighting you believe is acceptable AND how sure you want to be. A p-value indirectly informs the latter, but not the former. In almost all cases in health statistics, the “true” values that you are comparing are at least a little different, and you’ll see SOME difference with a large enough sample size. That doesn’t make it meaningful.

P-values can be really useful for testing a lot of things, but in a very limited way, and in a way that is prone to misuse and misinterpretation. Stay tuned for Part II, where we discuss why many stats folks think the p-value has got to go.

Opinion: When all publications are public speech, all scientists are public speakers

Noah Haber
The following is the opinion of the author, and does not necessarily reflect scientific findings or theory.

Yesterday, a large group of national science funders across Europe announced that they were making open access mandatory for their funding recipients. That effectively bans nearly a continent’s worth of researchers and their co-authors from publication in traditional paywalled journals, and rapidly hastens movement towards open access models of scientific publishing.

Open access is simply the idea that all people, regardless of who you are, should be able to easily access scientific publications without having to pay fees or jump through hoops. Giving everyone access to scientific publications has the potential to vastly increase collaborative efforts, spread scientific findings, and improve science education. Open access is also inevitable given the power of communications technology. Arguably, we’ve already had open access for years, albeit through a questionably legal science equivalent of Napster. That doesn’t in any way take away from the impact of this announcement, which in many ways forces others to hasten their moves to open access.

Before I move on, I need to be absolutely, all-caps-in-bold, clear about one thing: I AM FIRMLY IN FAVOR OF FULL OPEN ACCESS OF SCIENTIFIC PUBLICATIONS AND DATA, with some generally agreed-on ethical and logistical constraints. However, open access also comes with a few caveats. While some would point to how open access impacts publication funding incentives, the biggest issues may be institutional and cultural. They may even be serious enough to do harm if we in the scientific, media, and popular communities don’t adapt and embrace this change. To understand why, we need to dive a bit into a (slightly fictitious) model of how publication works.

Back in the day, publication was very limited. Scientists scienced, and publishers published. Publications were on physical paper, and almost entirely read and debated within the scientific community. That information would make its way into professional organizations and scientific societies, where it was debated and rehashed, and eventually consensus was synthesized and passed to practitioners. A layperson would almost never come in direct contact with research.

While slow and tedious, this old (and, again, slightly fictitious) model had one feature often taken for granted: consensus was built slowly among a community of experts. That, by no means, made those scientific communities immune to popular whim and often deeply flawed conclusions, but it did provide some insulation, which in turn provided some breathing room for debate and consensus-building. A study isn’t the absolute truth; it’s an argument with data, one which can be overturned, backed up, revised, or rejected. It’s made explicitly to be read by “peers,” by which we mean other scientists in the same field, who are more likely to treat studies with skeptical debate. The traditional “peer review” is really only the first step. The real peer review happens through other people doing their own studies and debating, comparing, rejecting, and sharing them.

Jump cut to today: if someone publishes an article about the “link” between chocolate and Alzheimer’s, that goes almost straight to Twitter, where all opinions are roughly equal, for everyone to see. While I, and hopefully readers of this blog, understand why chocolate studies rarely if ever have any bearing on our lives, most people aren’t privileged to be equipped with the kind of time and education it takes to understand these issues. Science involves complicated theory with conflicting data, and jargon that’s hardly understandable or means something totally different to people on the outside. Science is hard, and it’s a privilege to have the resources and time to understand it. Most do not have that privilege.

Scientific research is increasingly discussed, consumed, used, and abused outside of gated scientific communities, but our institutions and culture are made for a time when it wasn’t. That comes with some danger if we fail to adapt. There have always been paths by which popular preference has impacted science, both positively and negatively, and modern communication tends to both catalyze these processes and bypass some of the checks and balances.

When all publications are public speech, all scientists are public speakers. Public discussion is, and should be, a major part of scientists’ jobs, and one which we should embrace. That means adapting scientific culture as well as institutions to meet these needs, and avoiding some of its pitfalls. We run real risks if we cannot adapt to this environment. Research which is poorly communicated, easily mistranslated, or otherwise misleading can cause real harm, both directly and indirectly. The results of the CLAIMS study are at least partially a result of this new open scientific environment.

If it sounds like I am being vague about what that means in practical terms, it’s because I don’t know. We have ideas on what new models might look like for performing and communicating science, but we won’t know what does and doesn’t work unless we try. And with announcements like this one, it looks like we need to try harder, faster.