Chat: Should we placebo control in late phase clinical trials?

Welcome to the first of a series of chat posts, in which we get a handful of experts together to chat about important issues in the world of health science and statistics. We’ll be hosting these on a regular basis for the foreseeable future. First up:

Should we placebo control in late phase clinical trials?

<Editor’s note: The transcript below is lightly edited from the original for clarity. Additional edits noted at bottom of page>

Noah Haber (postdoc in HIV/AIDS, causal inference econometrics, meta-science): Let’s kick off with an unfair, possibly meaningless question: In an ideal world, in a frictionless plane in a vacuum, on a scale of 0 (never placebo) to 100 (always placebo), should late phase clinical trials control the treatment of interest against a placebo?
Everyone must give a number!

Mollie Wood (postdoc in repro/perinatal pharmacoepi): Can we assume participants are spheres?

Noah: Spheres of infinitely small size

Andrew Althouse (professor at University of Pittsburgh, clinical trial statistician): I’ll toss 80 out there.

Boback Ziaeian (professor at UCLA in the Division of Cardiology): 90

Emily R Smith (research at Gates Foundation and HSPH in global health, MNCH, and nutrition): 90 (Assuming a placebo is ethical / appropriate)

Mollie: also, 80

Noah: Looks like I get to play devil’s advocate today!

Andrew: Fascinating. Round number bias. I’ll revise my answer to 77.3

I think “assuming a placebo is ethical” is a given.

Mollie: I have a quick possible-digression about this, prompted by a Twitter thread the other day.
Are we too quick to say “setting ethics aside” or similar?

Noah: Probably, but for now, let’s set ethics aside 🙂

Mollie: OK, but I’m gonna harp on this later

Noah: Noted, harping will occur. We need to do some defining before we go much further. Who wants to give a definition of what “late phase clinical trial” actually means?

Emily: I equate late phase trials to Phase III / IV clinical trials. The Phase III/IV delineation is commonly used in the drug development and regulatory space. It indicates something about the size of the study and the purpose of the study. These later phase trials are larger (e.g. 300 to 3,000 people perhaps) and are meant to show efficacy and monitor potential adverse outcomes.

Boback: For prescription or device therapy it’s typically a phase III trial meant to evaluate efficacy or “non-inferiority.”

Emily: To put it simply, it is a trial to demonstrate whether a new ‘treatment’ is as good as or better than the existing standard of care.

Andrew: A layperson’s definition, maybe, is “the trial that would be strong enough evidence to make people start using the drug if positive.” I’ll go with “Phase III / pivotal trial that would grant drug approval” for pharmaceuticals in development / not yet FDA approved.

Mollie: Maybe also good to note that the trial could be for a new indication for an existing approved drug or device- right?

Emily: Good point

Andrew: Agree, Mollie.

Noah: What’s the usual logic behind placebo controls?

Boback: Control arm by default is always usual care +/- placebo or sham control.

Emily: For now, this isn’t a debate about whether to use a placebo or other control. This assumes the standard of care is currently nothing or something without evidence.

Why do we use placebos? There are so many good reasons!

Noah: Gimme a list!

Boback: Participants and study administrators are blinded to treatment arms.

Andrew: Generally, avoidance/minimization of bias (in assessors and participants). So participants report their symptoms honestly without knowledge of their treatment assignment. And likewise, assessors treat the patients the same / assess the outcome the same way, rather than letting their knowledge of what the patient is getting influence them in some way.

Boback: Allow for equal Hawthorne effects in both arms. The Hawthorne effect is the realization that by studying people in a research setting their behavior may naturally shift. Perhaps they become more adherent and avoid toxic habits like smoking etc. with study participation.

Emily: And, we want to avoid the ‘placebo effect’. We are all susceptible to feeling better when we are given something. For example, your toddler feels better when you kiss their ouchie.

Andrew: Boback, don’t use such big words. This is a chat for the people

Mollie: And disease course changes over time- we want to know if it changes more for active treatment than control. Or less! sometimes treatment halts progression

Noah: Imma start off my devil’s advocating strong with Andrew on the “bias” part. Or at least get more specific. Bias against what?

Mollie: Here’s one: if you’re a scientist and you’ve started a drug trial, you probably believe your drug works.

Emily: Assessor or participant bias is a major concern when the outcomes of interest rely heavily on self perception. For example, is your pain better or worse? Can you concentrate at school more easily? Have your child’s motor or language skills improved? I personally feel less concerned about assessor bias when the outcome of interest is static/objective/easy to measure. For example, mortality is a clear study endpoint. It’s harder to imagine bias creeping into this assessment.

Boback: The consent process for a study typically requires full disclosure of what the study is designed for. “We are evaluating fish oil to see if you don’t develop coronary artery disease. If you consent, we will randomize you to treatment or placebo for the next 5 years.” If you consent someone and then say “sorry, you didn’t get the drug, we are just going to give you nothing and check in on you every 6 months,” the participant may say forget you, I’ll buy over-the-counter fish oil myself. Or they may feel so depressed that their stress hormones go up and they eat more Cheetos, which increases their coronary artery disease risk.

Noah: Boom, let’s use Boback’s example. Do love me some Cheetos.

So, let’s give a scenario. I am treating a patient (which should NEVER HAPPEN) for coronary artery disease. I have heard of such a thing as a “placebo effect.” Now I decide what treatment to give you. Drug A or ______. Where _____ is almost never placebo.

Andrew: You can play doctor in this chat, Noah. I am pretty pro-placebo-controlled-trial, but I see where Noah’s going with this.

Noah: My job is, in general, to pick the thing that is most likely to make my patient better. If the clinical trial was randomized, placebo controlled, and double blinded, that doesn’t look AT ALL like what my clinical decision is like. Because the patient’s response in the real world INCLUDES all of those things we just got rid of. Correct?

Emily: That’s true – clinical trials look very different than clinical practice. And that’s the point! We didn’t eliminate the real world in a clinical trial. We made sure the ‘real world’ things happening aren’t the causes of your good/bad health.

Boback: The trial isn’t meant to mimic reality, it’s meant to neutralize confounding factors and estimate a direct treatment effect.

Andrew: Right. Your point is, the real world is either “We’re going to start you on this drug” or “Have a good year. We’ll see you next year”

Mollie: Hopefully, we eliminated the clinician saying “patient A is really sick, he’d better get active treatment.” Patient B is doing great, let’s give him the placebo. Wait, the drug is killing people, what happened?

Andrew: So, Boback is opening the “efficacy or effectiveness” door

Emily: An efficacy clinical trial is meant to know if a treatment works or not (in the ideal setting). In contrast, an effectiveness trial is meant to know if the treatment will work in the real world context.

Andrew: Difference between “try to figure out if this drug is biologically active” and “will adding this drug to clinical practice be a net benefit / net harm”

Boback: Physicians and hopefully patients want to know treatment effects of therapies. “What happens to this patient in front of me if I treat them with X?”

Emily: So here we’re talking about ‘late phase’ clinical trials – efficacy trials. Where we are trying to learn if “the drug is biologically active”

Andrew: In a placebo-controlled trial, we avoid the bias introduced by participant/assessor knowledge of what the patient is getting, and get a good estimate of the “true biological effect” of the drug. But Noah’s point is that people acting differentially based on knowledge of whether they’re getting a drug or not is going to be part of what happens in the real world

Noah: Exactly. And all we care about in the end is what happens to our patients

Andrew: The question, then, is for a late-phase trial where the “fate” of a new drug hangs in the balance, which estimate do we care about more

Emily: But we can’t know if it’s a good idea (e.g. efficacious, safe) to proceed to the ‘real world’ until we have some evidence!

Andrew: Right, that’s why I have trouble fully embracing Noah’s idea

Noah: Isn’t that evidence enough? If it doesn’t work because of some effect of knowing what treatment you are on, won’t that happen to all patients too?

Emily: Noah, it’s not that it doesn’t work because you know you’re on treatment. It’s that you might ‘feel better’ thinking you got a treatment

Mollie: Is this still a frictionless plane?

Noah: Friction mode restored.

Mollie: Well, treatments have costs. Trying treatment A means you didn’t try treatment B. And approving treatments that work because of positive thinking means resources are going to those treatments that would be better spent on others.

Emily: Mollie brings up a new point here – we have to allocate resources. Doctors have to make choices. Health systems have to provide supplies. How do we make those choices?

Mollie: I don’t want to go into a whole cost-effectiveness Thing here, I just wanted to point out that introducing a drug into the formulary that genuinely does nothing is not a neutral act– it means that patients who might have benefited from a different drug instead go untreated, and finite resources are wasted.

Noah: Knowing you got a treatment is free. If it makes a big improvement on outcomes, aren’t we losing something if we don’t include knowing you got treatment in the cost-effectiveness analysis?

Emily: Everyone in a placebo-controlled trial thinks they might have gotten the treatment!

Boback: We want to estimate the treatment effect of the intervention without the belief in the intervention. Most trials probably bias towards the null based on the intention to treat principle. We quantify treatment effects based on what group a person was randomized to not whether they adhered to treatment. For almost all trials, patients are not fully adherent or drop out at some rate which makes estimating the treatment effect not exactly what we hoped to measure. The beneficial effects if measured are underestimated.

The whole concern with all the randomized trials of stenting for angina was that the control arms were never blinded to the procedure until the Orbita trial was published last year. Prior to Orbita, trials compared invasive angiograms to medical therapy and claimed stenting improved symptoms more. However, Orbita did angiograms on everyone and patients did not know whether they had a stent for 30 days, at which point they were unblinded. No significant difference was noted in exercise time or other primary endpoints.

Same with sham arthroscopy of the knee. Patients post-procedure naturally get better and the treatment effects were confounded by the act of receiving a procedure itself.
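
Editor’s note: Boback’s intention-to-treat point above (effects get underestimated when people randomized to a drug don’t take it) can be illustrated with a few lines of Python. This is only a sketch under deliberately crude assumptions that are ours, not anything from a real trial: non-adherent participants get zero benefit, adherent ones get the full effect, and nobody in the control arm gets the drug. The function name and the numbers are made up for illustration.

```python
def expected_itt_effect(true_effect: float, adherence: float) -> float:
    """Expected intention-to-treat (ITT) estimate under a deliberately simple model.

    Assumptions (ours, for illustration only):
      * adherent participants in the treatment arm get the full `true_effect`
      * non-adherent participants in the treatment arm get no effect at all
      * nobody in the control arm gets the treatment
    """
    if not 0.0 <= adherence <= 1.0:
        raise ValueError("adherence must be a proportion between 0 and 1")
    # The treatment arm is analyzed as randomized, so its average benefit is a
    # mix of adherers (full effect) and non-adherers (zero effect).
    return adherence * true_effect + (1.0 - adherence) * 0.0


# Made-up numbers: a drug that truly improves the outcome by 8 points,
# taken as prescribed by 75% of the people randomized to it.
diluted = expected_itt_effect(true_effect=8.0, adherence=0.75)
print(diluted)
```

Under these assumptions the ITT estimate is pulled toward zero in proportion to non-adherence. Real trials are messier (partial adherence, dropout related to side effects, control participants obtaining the drug elsewhere), so treat this as intuition rather than a formula to apply.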

Andrew: I think we need Noah to go into more detail about the direction that he believes this works. Because we’re all pretty clearly grounded in the idea that using a placebo is meant to wash out the “knowledge you got a treatment makes you feel better” effect.

Noah: Ok, let me lay out the case. The central idea is that we SHOULDN’T wash out the “knowledge you got a treatment makes you feel better” effect. Because that effect is part of (maybe a HUGE part of) the total effect the clinician and patient face when they decide to treat or not to treat. So not including that is its own kind of bias, in the context of the treatment decision.

Emily: This can be captured AFTER we know whether there is a meaningful biological effect of the treatment itself. For example. If you ‘feel better’ after consuming arsenic / rat poison — should we give it to everyone? NO! It’s dangerous!

Noah: I don’t think I want to be in that trial

Emily: Me either, and this is why we need to carefully prove there is a ‘real’ biological effect of a treatment by using a placebo/control.

Andrew: Using Boback’s example: if sham-controlled trials show that knee arthroscopy’s benefit is entirely explained by “knowledge that you got a procedure” then why bother doing any knee arthroscopies? Just do sham procedures and save everyone the trouble. Send them to the OR, have everyone stand around looking serious, give them a local anesthetic, stand there awhile, tell them the procedure went well and we’re good. Same thing could be applied to drug trials – if the drug can’t outperform a sugar-pill as placebo (even if part of the benefit is “belief that they’re getting a new drug makes them feel better”) why bother with the expensive drug? Just give them Placebonium.

Emily: Well said

Boback: So Noah are you saying you want to include the “placebo effect” as a benefit for a prescribed treatment for a patient?

Noah: I’ll say yes, or at least there’s an argument to be made for it.

Emily’s point about separation is important. If we can know both, separately, we obviously should. Practically, do we ever have the time and money to do these giant, expensive trials both ways (with a placebo control and with a “do nothing” control)?

Boback: Well, then it’s probably just worth quantifying the placebo effect. Compare placebo to no treatment and active treatment.
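
Editor’s note: the three-arm comparison Boback describes can be written down as simple arithmetic. Below is a minimal sketch in Python. The function name, the arm labels and the numbers are ours and come from no actual trial; it also assumes the arms are randomized, that higher scores are better, and that the placebo response and the drug-specific effect simply add up, which may well not hold in practice.

```python
def decompose_three_arm(mean_no_treatment: float,
                        mean_placebo: float,
                        mean_active: float) -> dict:
    """Split the benefit seen on active drug into two pieces.

    Assumes randomized arms, an outcome where higher is better, and an
    additive relationship between placebo response and drug-specific effect.
    """
    return {
        # What a blinded, placebo-controlled trial estimates
        "specific_effect": mean_active - mean_placebo,
        # Response to the act/belief of being treated
        "placebo_effect": mean_placebo - mean_no_treatment,
        # What an open-label "drug versus nothing" trial estimates
        "total_effect": mean_active - mean_no_treatment,
    }


# Hypothetical mean improvement scores in each arm
effects = decompose_three_arm(mean_no_treatment=2.0,
                              mean_placebo=5.0,
                              mean_active=6.0)
for name, value in effects.items():
    print(f"{name}: {value}")
```

In this invented example most of the total benefit (3 of 4 points) is placebo response and only 1 point is specific to the drug, which is the kind of situation the two sides of this chat weigh very differently.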

Noah: The “placebo effect” changes for every treatment. If we can only do it one way, shouldn’t we do it the way most relevant for the clinical decision?

Boback: There are plenty of reasons why randomized trials are expensive and time consuming to perform. I do not think the placebo issue is the main problem. There’s more to be said for using our statistical understanding better and building a pragmatic trial infrastructure.

Emily: Maybe we can talk about other reasons you wouldn’t want to use a placebo? Andrew and Mollie said 80%. What’s happening in the other 20%?

Noah: I am the 20%.

Andrew: Re: when a placebo control wouldn’t be needed: I think basically any Phase III (drug approval) study should be placebo controlled (i.e. either there is no accepted therapy for the condition so we’re truly talking therapy vs. nothing, or if there’s an accepted therapy, the patient is blinded so they don’t know if they’re getting standard-of-care or new-experimental-drug, which may or may not require a “placebo” to achieve said blinding depending on what SOC is). I’m a *little* more ambivalent for studies of drugs that are already approved or in use.

Boback: The other point is that placebos are probably not necessary for “hard endpoints.” For all-cause mortality, it’s hard to imagine that getting placebo vs. not getting it would influence your risk of dying.

Andrew: But I guess that depends if we’re including real-world, head-2-head CER trials of existing things as “late stage trials”

Emily: I tend to agree, Boback. My only caveat: in my line of work, there aren’t vital registration programs and we rely on research staff to find ‘missing’ patients. So some people still worry about staff/assessor bias. For example – this child got treatment, so maybe they didn’t come to clinic because they are on vacation. Whereas, this child got no treatment, so maybe they didn’t come to clinic because they’ve passed away – I will go visit the household to find out.

Mollie: Yeah, I think my 80% comes from my bias as a postmarketing surveillance researcher- none of the treatments I deal with are pre-approval

Noah: What’s different about postmarketing surveillance?

Mollie: So I work almost entirely on drug safety in pregnancy, and most of the time, the relevant clinical question is “should this woman planning to get pregnant discontinue her methadone or switch to buprenorphine?” or similar.

This is maybe a little too detailed, but there have been SO MANY studies of antidepressant safety in pregnancy, some showing harm, some not, that I am almost ready to say you could ethically do a placebo controlled trial to get the right answer, but man, clinicians do not seem to agree.

Emily: Mollie this is a great point. It can be quite controversial as to whether or not there is equipoise (genuine uncertainty about which treatment is better) for a placebo.

Mollie: Equipoise is hard! You have to be truly unsure about the possible benefit or risk of the drug.

Mollie: I think in the postmarketing space, equipoise for placebo control is almost never really there.

Andrew: Mollie’s last comment brings up an interesting example, which makes me wonder if we’re just talking about “placebo” or more broadly about “blinding.” I’ve seen some trials that were h2h CER trials of 2 active drugs that didn’t look like one another where both arms had to take their active drug and a placebo that LOOKED like the other one to achieve full blinding. So is our issue just about using “placebo controls” where the choice is “something versus nothing” or is it a broader debate about making sure the participant/assessor isn’t sure who’s getting what (even in setting of 2 active agents?) i.e. if one active drug dosed once/daily and the other is dosed twice/daily, each arm had to take a “placebo” on the schedule of the other drug so they wouldn’t know which drug they were on

Boback: Placebo is very frequently broken in trials. If you are getting a cholesterol lowering drug, it’s hard not to realize your lipid values are dropping on follow-up testing.

Emily: If this is the case, then I move my original estimate to say that I think that trials should be blinded 95%+ of time!

Noah: Wait! I go the other way!

Emily: You do?!

Noah: If blinding/placebo is going to be broken anyway, why are we doing it in the first place?

Andrew: It sounds like the world Noah describes is that these trials shouldn’t be blinded.  

Boback: I’ve always wished trials routinely reported at the end of trials what portion of participants believed they received active treatment.

Mollie: It would be nice if it were routine

Andrew: But that brings up something Boback said at the very beginning. If you’re in a high-mortality space (cancer) and the patient isn’t blinded, they’re probably walking out of the trial immediately.

Emily: Yes, agree with Andrew and Boback – sometimes including a placebo means that you can’t recruit a representative sample.

Boback: But it’s always interesting to see all the side effects in placebo arms.

Andrew: In theory, patients shouldn’t be enrolling in RCT to get access to experimental treatments, but they’re still probably not hanging around that trial once they’re assigned & told they’re in the placebo arm.

Boback: If someone is going through the trouble of running an RCT, there’s probably a need or a large market for what they are proposing.

Emily: My #1 practical reason for including blinding/placebo/control is that it’s the hallmark of high quality evidence. And in order for evidence to make it through regulatory / policy processes – it needs to be high quality! And why are we generating evidence if not to change policy and practice?! Thus, I vote placebo for president.

Mollie: I’d prioritize randomizing and blinding over placebo. Unless it’s a totally new drug to treat a disease with no current treatment

Emily: Agree with Mollie. If we’re in the real world – then placebo likely not appropriate in many cases. For ethical reasons. (A placebo isn’t ethical when there is an existing treatment or practice that is either recommended (by governing/regulatory bodies) or is commonly practiced by physicians)

Andrew: Right, a “placebo” is kind of inextricable from blinding. If a placebo isn’t needed for “blinding” then fine – no placebo. But even in some trials with 2 active agents the placebo is needed to preserve the blind. (the earlier example of one drug given 1x daily vs another given 2x daily)

Mollie: Yeah, no argument there

Andrew: So a placebo’s main function is a means to preserve blinding

Noah: And to my point earlier, my argument is really more generally against blinding, by way of placebo controls

Emily: To preserve blinding AND to account for placebo effect. (I think they are two separate points?)

Andrew: Right, Noah just thinks that blinding is problematic because it doesn’t = “real world treatment effect”

Noah: Yup, and in the end, the “why” doesn’t matter so much as what you get in the end

Mollie: Noah, are you happy with a pre/post measurement on just the treated group?

Andrew: (shrieks in horror)


Mollie: Nuke it from orbit. It’s the only way to be sure.

Mollie: But seriously, that’s the effect a doc sees when they treat a patient, right? In the real world.

Mollie: …historical controls?

You’re scaring me, man.

Andrew: I reluctantly admit that I’m giving more thought to historical controls as a viable option in some situations, though the statistician in me still hates it

Noah: HA

Emily: I am also thinking a lot about historical controls!

Noah: This is probably why I’m not a real doctor, which we can all be thankful for. But to clarify, my version is to control against “don’t treat at all,” which, going back to earlier point, would be tough to recruit with as an option. I mean real current trial controls, where half the people just don’t get treated. But sometimes historical controls can be useful. . .

Mollie: Nooooooo, guys

Emily: In the context of adaptive trial design … it starts to make some sense.

Andrew: Specifically, very-high-mortality where there is no known effective treatment option, with novel device/drug as the only real option available. I might be able to stomach comparison against historical controls. This is a bit off topic, perhaps can be picked up in a future chat

Noah: Let’s switch topics a bit, and talk about the research science meta-verse

Mollie: My favorite meta-verse!

Noah: Then let’s get super meta. If all of our past research now had been with placebo / blinded controls vs none of it was. What might we know more or less of now? Would we understand more about biological mechanisms?

Andrew: if we just replaced all placebo controlled trials that have been done with Noah Haber style trials?

Noah: Exactly. What would we gain/lose. Except lose, because Noah style trials are perfect <editor’s note: I regret tacitly agreeing that this should be called “Noah Haber style”>

Emily: There would be even more molecules approved for treatment of depression. (This is an area of research known to be especially sensitive to the placebo effect).

Mollie: I assume we’d be using Zicam for cancer treatment

Noah: But only if believing in Zicam (to do anything at all) had an actual clinical effect, right?

Andrew: I mean, we’d certainly know less about true biological effects

Noah: True. So, quick recap: the main argument against placebo controls is super tied in with the idea that we also shouldn’t blind because we are sufficiently far from real world conditions (which include placebo effects) that our measured effects aren’t realistic. HOWEVER:

Andrew: With placebo control (and blinding), we can be *reasonably* certain that the effect observed in the trial is actually an effect of the drug being tested and not simply the effect of feeling like you get something better. Without placebo control (i.e. the trial is “drug versus nothing”), in theory you may get a better estimate of the “real world” effect, since that’s what the “real world” will be (drug or…whatever else), but you run the risks of patient dropout and other behaviors influencing their outcomes.

Mollie: I see two major ethical concerns. First, approving treatments that don’t “work” (beyond placebo effect) takes away opportunities for patients to be treated with drugs that DO work. Second, resources are finite and we shouldn’t be spending limited funds on ineffective treatments. Removing placebo controls risks violations of both of these. (Third, I do not think we need to help pharma companies any more than we already do).

Boback: The only proper way to measure treatment effects is to keep patients and clinicians in the setting of an RCT blinded to the intervention to avoid introducing confounding factors. RCTs are meant to measure treatment effects. Randomization is our best tool for breaking confounding and for most endpoints, blinding is required to preserve patient and clinician behavior. It also allows for objective measures of adverse events/side effects.

Noah: Great, ok, last thing! In an ideal world, in a frictionless plane in a vacuum, on a scale of 0 (never placebo) to 100 (always placebo), should late phase clinical trials control the treatment of interest against a placebo?

Boback: 90

Noah: 45

Andrew: I’ll stick with my 80. Majority of the time. But there may be some settings where I could be convinced that a non-placebo-controlled design is appropriate

Mollie: In this frictionless plane, are there other treatments available?

Noah: Yes, but also frictionless.

Mollie: Then I’m sticking with my 80

Emily: (I’m still unsure if placebo means control here) But I’m increasing to 96% assuming we’re talking placebo/control!

Noah: Ha. I’ll call that a day! Thanks y’all!

Emily: Thanks, friends!

Mollie: Thanks everyone!

Andrew: Thanks everyone. This was fun, good to kick things around with other smart people.

Boback: Thanks for including a manatee.

Boback Ziaeian is an Assistant Professor at UCLA and the VA Greater Los Angeles in the Division of Cardiology. As an outcomes/health services researcher and cardiologist his primary interest is improving the receipt of high-value care and reducing disparities for cardiovascular patients. Tweets @boback.

Andrew Althouse is an Assistant Professor at the University of Pittsburgh School of Medicine.  He works principally as a statistician on randomized controlled trials in medicine and health services research.  Tweets @ADAlthousePhD

Mollie Wood is a postdoctoral researcher in the Department of Epidemiology at the Harvard School of Public Health. She specializes in reproductive and perinatal pharmacoepidemiology, with methods interests in measurement error, mediation, and family-based study designs.  Tweets @anecdatally

Emily R. Smith is a Program Officer at the Bill & Melinda Gates Foundation and a Research Associate at the Harvard School of Public Health. Her research focuses on the design and conduct of clinical trials to improve maternal and child health globally. Tweets @emily_ers

Noah Haber is a postdoc at UNC, specializing in meta-science from study generation to social media, causal inference econometrics, and applied statistical work in HIV/AIDS in South Africa. He is the lead author of CLAIMS and XvY, blogs here, and tweets @NoahHaber.

Edit notes: Made an edit to change “late stage clinical trial” to “late phase clinical trial” to clarify that this was not specific to late-stage cancer trials.

Discussion: Inferring when Association Implies Causation

Noah Haber
Standard health research dogma dictates that the “correct” way for authors to deal with weak causal inference is to just call it association. Papers that say that coffee is “associated/linked/correlated” with cancer are acceptable for publication, even if they don’t give any useful inference about the actual impact of drinking coffee, as long as they don’t use the word “caused.” While it is extremely difficult or even impossible to estimate the causal effect of coffee on cancer, it is relatively easy to publish a paper about the association between the two. As others have noted, this creates a serious issue where a huge number of misleading studies are published, get distributed to the public, distract from good studies, and do real harm.

Association is a powerful research tool to answer the right questions with the right methods, but not for the kinds of questions and methods for which you need causal inference. Stranger yet, the culture of “don’t use the word cause” is so strong that there are even papers which find really strong evidence of causality, but stay on the “conservative” side and just say association.

In CLAIMS, reviewers found that 34% of authors in our sample used stronger technical causal language than was appropriate given the methods. Most academic authors follow the technical rules, if not their spirit. What about the remaining 66%? How many of those implied causality through means other than “technical” language? Can we reasonably infer how these studies might have misled through sloppy methods, hints, nudges, and reasonable misinterpretation?

If technical language is an unreliable method of determining whether the study implied causality, how can we infer those implications? I have a few ideas below for discussion, but would LOVE to hear your thoughts on where I get this wrong, better explanations, general disagreements, etc.

Decision implications

Go straight to the discussion section and read what the authors say people should do or change based on their results. In almost all cases where the authors recommend changing the main exposure to change the outcome of interest, they implied causality. A study about the association between coffee and cancer that concludes that you should drink more or less coffee to avoid cancer, or even one that simply says coffee is “safe” to drink, relies on estimating the causal effect of coffee on cancer. If their methods weren’t up for the task, the study is misleading.

In general, if the study was truly useful for association only, changing the exposure of interest will usually not be the main action implication. If the question of interest is disparities in outcomes between groups (such as race), the authors would, in general, not suggest that people switch groups. Similarly, finding associations to better target interventions doesn’t imply that we need to change the exposure, but rather that the exposure is a useful metric for identifying targets of interventions.

This can get tricky, particularly when the exposure of interest is a proxy for changing something else that is harder to measure, such as laws as a proxy for the causal impact of the political and cultural circumstances that bring about a change in the law, plus the impact of the law itself. As usual, there is no simple rule or formula to follow.

Question of interest

In the great words of Randall Munroe of XKCD (hidden in the mouseover): “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.'”

Some associations inherently imply causality. Virtually every study in which individual consumption of something is the exposure and some health effect is the outcome of interest implies causality. One way in which the association might inherently imply causation is simply the lack of useful alternative interpretations. For example, there is little plausible reason why merely studying the association between coffee and cancer is useful for anything except when you have identified causal effects of coffee on cancer.

I find it helpful to try to think about plausible ways that the association between X and Y can be useful, firstly in my head and secondly from those the authors describe. For each item, I strike out ones that require causality to be inferred. If I have no items remaining and/or if the remaining items seem implausible, that may hint that the question of interest has inherent causal implications. Even then, there are two caveats: 1) my inability to come up with a good non-causal use does not mean one does not exist, and 2) even if one does exist, the association could still inherently imply causation.
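
The strike-out exercise above can be sketched as a small filter. This is only an illustration in Python: the class name, field names and example uses are invented here, and the judgment calls (does this use require causality? is it plausible?) still have to be made by a human reader. The two caveats above apply to the output just as much as to the mental version.

```python
from dataclasses import dataclass


@dataclass
class CandidateUse:
    """One way an X-Y association might be useful (hypothetical structure)."""
    description: str
    requires_causality: bool  # reader's judgment, not something computed
    plausible: bool           # reader's judgment, not something computed


def strike_out(uses):
    """Return (hint, remaining): the uses that survive the strike-out, and
    whether the empty remainder hints at inherent causal implications."""
    remaining = [u for u in uses if not u.requires_causality and u.plausible]
    hints_at_causality = len(remaining) == 0
    return hints_at_causality, remaining


# Invented list of uses for "the association between coffee and cancer"
coffee_uses = [
    CandidateUse("Advise people to drink more or less coffee", True, True),
    CandidateUse("Declare coffee safe to drink", True, True),
    CandidateUse("Use coffee intake to target cancer screening", False, False),
]

hint, survivors = strike_out(coffee_uses)
print("Hints at inherent causal implication:", hint)
print("Surviving non-causal uses:", [u.description for u in survivors])
```

A result of no surviving uses is a hint, not a verdict: it only reflects the uses and judgments that went into the list.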

Look for language in the grey zone

The list of words that are taboo because they mean causality is short, consisting mainly of “cause” and “impact.” The list of words in the grey zone is much longer, and not always obvious. My personal favorite is the word “effect.” For some reason, the phrase “the effect of X on Y” is more often considered technically equivalent to “the association between X and Y” than “the causal impact of changing X on Y.” While “effect” is sometimes used purely as shorthand, I find that it is more often used when authors want to imply causality but can’t say it. Curiously, “confounding”/“confounders” is not on the causal language taboo list, even though it implies causation by definition.

Statistical methods

Some statistical methods and data scenarios strongly imply causality. In many cases, this is simply because the methods eliminate all alternative interpretations, such as when authors control for dozens of “confounding” covariates. Some methods are developed specifically to estimate causal effects, and have limited application outside of causal inference.

This one is unfortunately in the statistical/causal inference experts only zone, since it requires a fairly deep understanding of what the statistics actually do and assume to tease out implications of causality.

Intent vs. implication

It is important to understand that the study authors making these implications aren’t generally bad people, and may genuinely not have intended to imply causality when inappropriate. In some cases, they may simply not mean to make causal implications. In other cases, they may have been led to certain uses of language by reviewers, editors, co-authors, or media writers. Alternatively, the most misleading articles are simply the ones that will be most likely to be published and written about, and therefore most likely to be seen.

However, as always, some of the blame and responsibility lies with us, the researchers. We should be careful about generating studies where causation is implied, regardless of what the technical dogma tells us is right and wrong. We should learn to be more honest about what we are studying, embrace the limitations of science and statistics, and fight to create systems that allow us to do so.

Getting (very) meta part 4: How did the media do covering CLAIMS?

At the time of this writing, five unique media articles, four unique press releases, and a few copies in different outlets have been published about the CLAIMS study. We have them all listed here and at the bottom of this page, which we will be updating on a continuous basis.

So, how (in our opinion) did the media do?

TL;DR: The handful of media outlets which covered CLAIMS did a pretty decent job.

1) Coverage was limited, and mostly from small outlets

No huge surprise there. CLAIMS is a bit of a niche study, albeit one designed to be the foundation of studies which are not-so-niche. It involves academia, media, and social media, but without providing a clear narrative of what we are supposed to do about it. The study caught on a bit in smaller outlets, but none of the giant mainstream ones, which is roughly what we expected to see, noting that very few research articles receive any coverage at all. The largest media outlet that covered our article is probably, which mostly covers and critiques news media coverage of health studies. Limited exposure is almost certainly for the best given the slightly complicated nature of our results, but it definitely limits what we can infer about how the article was covered. That being said…

2) Most outlets had a (very) slight preference for a particular narrative

As above, CLAIMS doesn’t, and can’t, say that any particular party – like academia, researchers, journals, news editors, journalists, or social media sharers – is more to “blame” for our result than any other. However, most of the articles had a bit of a focus on one particular party over the others. Some focused a bit more on the media side, and others a bit more on the academic side. These were typically fairly small leanings, and probably not a big deal. We did not observe anything close to extreme skew, like claiming that our study finds that academia, media, or social media are “broken” or similar.

3) Most (but not all) journalists contacted the team for quotes and pre-publication clarifications

It’s nice when we can be involved with the way that our articles are being communicated, particularly when we go through the efforts that we do to explain our study here. Our approval is not required by any means, and we respect journalistic independence, but sometimes it helps. Science is complicated and easy to mistranslate. Most authors reached out to us for quotes, and of those, most gave us a copy of their article before they published it to check if we had corrections. That probably helped with accuracy.

So, all in all, pretty good job. Some more minor notes below:

4) Some sites reported (wrongly) that only RCTs can produce strong causal inference

This was a bit of an odd one, and in one case we wrote specifically to the media authors in an effort to correct this mistake (to no avail). RCTs certainly make causal inference much easier, but they aren’t the only way you can get strong causal inference. This mistake appears in articles from outlets that are more critical of news media and health research, which is slightly ironic. Sometimes simplifications are necessary, but in my opinion, this one can only do harm.

5) Don’t read the comments on news articles

It’s the internet. As most have learned by now, comments on news articles are terrible, and these are not particularly exceptional. The comments on these articles vacillate between reasonable discussion and absolute nonsense. I looked so you don’t have to. You’re welcome.

Media coverage:

Internet media

| Title | Organization | Type | Publication date | Notes/Disclosures |
|---|---|---|---|---|
| Findings in science, health reporting often overstated on social media | Harvard Gazette / Harvard TH Chan School of Public Health | Press release | June 5, 2018 | Study authors worked with press office for this press release |
| Can’t say we didn’t warn you: Study finds popular health news stories overstate the evidence | HealthNewsReview.org | News / blog article | June 13, 2018 | Article author interviewed Noah Haber before publication |
| Health misinformation in the news: Where does it start? | KevinMD.com | News / blog article | June 20, 2018 | Nearly identical to article |
| Study examines the state of health research as seen in social media | University of North Carolina at Chapel Hill Gillings School of Global Public Health | Press release | June 19, 2018 | Study authors worked with press office for this press release |
| Overdrijven de media gezondheidsnieuws? | EIS Wetenschap | News / blog article | June 20, 2018 | |
| Karra publishes study in PLOS One | Boston University | Press release | June 4, 2018 | |
| UNC Study Examines the State of Health Research As Seen in Social Media | Association of Schools & Programs of Public Health | Press release | June 28, 2018 | Appears to be a direct copy of UNC press release |
| Redes sociales han alterado la forma en que se presentan las noticias de salud | FNPI | News / blog article | Unknown, published at least before June 28 | |
| Echoing the network | Nieman Lab | News / blog article | August 6, 2018 | |
| Health News In Crisis? | European Journalism Observatory | News / blog article | July 18, 2018 | |
| 'A large grain of salt': Why journalists should avoid reporting on most food studies | CBC News | News / blog article | September 6, 2018 | Article author interviewed Noah Haber before publication |
| Il giornalismo sanitario è in crisi? | European Journalism Observatory | News / blog article | September 21, 2018 | Appears to be a reposting of a previous EJO article |

A practical primer on p-values

Noah Haber
There is a recent trend among health statisticians to discourage the use of p-values, commonly used to define a threshold at which something is “statistically significant.” Statistical significance is often viewed as a proxy for “proof” of something, which in turn is used as a proxy for success. The thought goes that the obsession with statistical significance encourages poorly designed (i.e. weak and misleading causal inference), highly sensationalized studies that have significant, but meaningless, findings. As a result, many have called for reducing the use of p-values, if not outright banishing them, as a way to improve health science.

However, as usual, the issue is complicated in important ways, touching on issues of technical statistics, popular preference, psychology, culture, research funding, causal inference, and virtually everything else. To understand the issue and why it’s important (but maybe misguided) takes a bit of explaining.

We’re doing a three-part set of posts, explaining 1) what p-values do (and do not) mean, 2) why they are controversial, and 3) an opinion on why the controversy may be misguided. Here we go!

What does a p-value tell us?

A p-value is a measure that helps researchers distinguish their estimates from random noise. P-values are probably easiest to understand with coin flips. Let’s say your buddy hands you a coin that you know nothing about, and bets you $10 that it’ll flip heads more than half the time. You accept, flip it 100 times, and it’s heads 57 times and tails 43. You want to know if you should punch your buddy for cheating by giving you a weighted coin.

To start answering that question, we usually need to have a hypothesis to check. For the coin case, our hypothesis might be “the coin is weighted towards heads,” which is another way of saying that the coin will flip heads more than our baseline guess of an unweighted 50/50-flipping coin. So now we have a starting place (a theoretical unweighted coin, or our “null hypothesis”), some real data to test with (our actual coin flips), and something to test (that the coin is weighted towards heads, or our “alternative hypothesis”).

One way we could compare these hypotheses is if we had a coin which we knew was perfectly balanced and we 1) flipped it 100 times, 2) recorded how many were heads, 3) repeated 1 and 2 an absurd number of times (each repetition being one trial), and 4) counted how many of our trials had 57 or more heads flipped and divided that count by the total number of trials. Boom, there is an estimated p-value. Alternatively, if you know some code, you could simulate that whole process in a program without knowing any probability theory at all.
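As a rough illustration of that simulation idea, here is a minimal Python sketch. The function name, the number of trials, and the random seed are arbitrary choices made for this example, not anything from the original analysis.

```python
import random

def simulate_p_value(n_flips=100, observed_heads=57, n_trials=20_000, seed=0):
    """Estimate P(heads >= observed_heads) for a fair coin by brute force."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so reruns match
    extreme_trials = 0
    for _ in range(n_trials):
        # One trial: flip a fair coin n_flips times and count the heads.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme_trials += 1
    # Share of trials at least as extreme as what we actually observed.
    return extreme_trials / n_trials

estimate = simulate_p_value()
print(f"Simulated p-value: {estimate:.4f}")
```

With more trials the estimate settles down, but it will always wobble a little from run to run (or seed to seed), which is exactly the kind of noise the p-value is meant to describe.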

Even better, we can calculate that number directly without having to flip a bajillion coins, using what we know about probability theory. In our case, p=0.0968. That means that, had you done this 100-flip test infinite times with a perfectly balanced coin, 9.68% of those 100-flip tests would have had 57+ heads.

To interpret a p-value, you might say “If the coin wasn’t weighted, the expected probability that we would have gotten 57 or more heads out of 100 had we repeated the experiment is 9.68%,” or “9.68% of trials with 100 flips would have resulted in having the same or greater number of heads than we found in our experiment.”
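For the curious, one way to do that calculation is to add up the binomial probabilities of every outcome from 57 to 100 heads. The sketch below is only one possible approach (the helper name is made up for this example); it should land very close to the figure quoted above, with any difference in the last digit coming down to rounding or the approximation used.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """P(X >= k) when X counts heads in n flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_value = binomial_upper_tail(100, 57)
print(f"P(57 or more heads in 100 fair flips) = {p_value:.4f}")
```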

Should you punch your buddy? That depends on whether that number above is meaningful enough for you to make a punching decision. Ideally, you would set a decision threshold a priori, such that if you were sufficiently sure the game was sufficiently rigged (note the two “sufficiently”s), you would punch your buddy.

The default threshold for most research is p≤0.05, or 95% “confidence,” to indicate what is “statistically significant.” With our p=0.0968, it isn’t “significant” at that level, but it is significant if we make our threshold p≤0.10, or 90% confidence.

If you wanted to interpret your p-value in terms of the threshold, you might say something like “We did not find sufficient evidence to rule out the possibility that the coin was unweighted at a statistical significance threshold of 95%.”

A p-value is a nice way of indirectly helping answer the question: “how sure can I be that these two statistical measures are different?” by comparing what we observed with an estimate of what we would have expected to happen if they weren’t. That’s it.

Things get a bit more complicated when you step outside of coin flips into multivariable regression and other models, but the same intuition applies. Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest, what is the probability that you would have gotten an estimated value at least as big as you actually did if the process of producing that data was null?

What DOESN’T a p-value tell us?

Did you notice that I snuck something in there? Here it is again:

“Within the specific statistical model in which you are working, assuming it is correctly built to answer the question of interest…”

A p-value doesn’t tell you much about the model itself, except that you are testing some kind of difference. We can test if something is sufficiently large that we wouldn’t expect it to be due to random chance, but that doesn’t tell you much about what that something is. For example, we find that chocolate consumption is statistically significantly associated with longevity. What that practically means is that people who ate more chocolate lived longer (usually conditional on / adjusted for a bunch of other stuff). What that does NOT mean is that eating more chocolate makes you live longer, no matter how small your p-value is, because the model itself is generally not able to inform that question.

The choice of threshold is also arbitrary. P-values of 0.049 and 0.051 are practically the same, but one is often declared to be “statistically significant” and the other not. There isn’t anything magical about the p-values above and below our choice of threshold unless that threshold has a meaningful binary decision associated with it, such as “How sure should I be before I do this surgery?” or “How much evidence do I need to rule out NOT cheating before I punch my buddy?” 95% confidence is used by default for consistency across research, but it isn’t meaningful by itself.

P-values neither mean nor even imply whether or not our two statistical measures are actually different from each other in the real world. In our coins example, the coin is either weighted (probability that it is weighted is 100%) or it isn’t (probability is 0%). Our estimation of the probability of getting heads gets more precise the more data we have. If the coin were weighted to heads, our p-values would become closer to 0 both if the weighting was larger (i.e. more likely to flip heads), but ALSO if we used the same weighted coin and just flipped it more. The weight didn’t change, but our p-value did.
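To make that last point concrete, here is a small sketch under a made-up scenario: the observed share of heads stays fixed at 57% while the number of flips grows. The flip counts and the helper name are assumptions chosen purely for illustration.

```python
from math import comb

def fair_coin_upper_tail(n, k):
    """P(X >= k) for n fair coin flips, computed with exact integer arithmetic."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Hypothetical scenario: the coin keeps coming up heads 57% of the time,
# and the only thing that changes is how many flips we observe.
results = {}
for n_flips in (100, 400, 1600):
    heads = 57 * n_flips // 100
    results[n_flips] = fair_coin_upper_tail(n_flips, heads)
    print(f"{n_flips:>5} flips, {heads:>4} heads: p = {results[n_flips]:.2g}")
```

The p-value shrinks as the flips pile up even though the apparent weighting never moves, which is why a tiny p-value on its own says little about how big or important a difference is.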

Here’s another way to think about it: the coin in the example was definitely weighted, if only because it’s near impossible to make a fully unweighted coin. To choose whether or not you want to punch your buddy depends both on how much weighting you believe is acceptable AND how sure you want to be. A p-value indirectly informs the latter, but not the former. In almost all cases in health statistics, the “true” values that you are comparing are at least a little different, and you’ll see SOME difference with a large enough sample size. That doesn’t make it meaningful.

P-values can be really useful for testing a lot of things, but in a very limited way, and in a way that is prone to misuse and misinterpretation. Stay tuned for Part II, where we discuss why many stats folks think the p-value has got to go.

Opinion: When all publications are public speech, all scientists are public speakers

Noah Haber
The following is the opinion of the author, and does not necessarily reflect scientific findings or theory.

Yesterday, a large group of national science funders across Europe announced that they were making open access mandatory for their funding recipients. That effectively bans nearly a continent’s worth of researchers and their co-authors from publication in traditional paywalled journals, and rapidly hastens movement towards open access models of scientific publishing.

Open access is simply the idea that all people, regardless of who you are, should be able to easily access scientific publications without having to pay fees or jump through hoops. Giving everyone access to scientific publications has the potential to vastly increase collaborative efforts, spread scientific findings, and improve science education. Open access is also inevitable given the power of communications technology. Arguably, we’ve already had open access for years, albeit through a questionably legal science equivalent of Napster. That doesn’t in any way take away from the impact of this announcement, which in many ways forces others to hasten their moves to open access.

Before I move on, I need to be absolutely, all-caps-in-bold, clear about one thing: I AM FIRMLY IN FAVOR OF FULL OPEN ACCESS OF SCIENTIFIC PUBLICATIONS AND DATA, with some generally agreed-on ethical and logistical constraints. However, open access also comes with a few caveats. While some would point to how open access impacts publication funding incentives, the biggest issues may be institutional and cultural. They may even be serious enough to do harm if we in the scientific, media, and popular communities don’t adapt and embrace this change. To understand why, we need to dive a bit into a (slightly fictitious) model of how publication works.

Back in the day, publication was very limited. Scientists scienced, and publishers published. Publications were on physical paper, and almost entirely read and debated within the scientific community. That information would make its way into professional organizations and scientific societies, where it was debated and rehashed, and eventually consensus was synthesized and passed to practitioners. A layperson would almost never come in direct contact with research.

While slow and tedious, this old (and, again, slightly fictitious) model had one feature often taken for granted: consensus was built slowly among a community of experts. That, by no means, made those scientific communities immune to popular whim and often deeply flawed conclusions, but it did provide some insulation, which in turn provided some breathing room for debate and consensus-building. A study isn’t the absolute truth; it’s an argument with data, one which can be overturned, backed up, revised, or rejected. It’s made explicitly to be read by “peers,” by which we mean other scientists in the same field, who are more likely to treat studies with skeptical debate. The traditional “peer review” is really only the first step. The real peer review happens through other people doing their own studies and debating, comparing, rejecting, and sharing them.

Jump cut to today: if someone publishes an article about the “link” between chocolate and Alzheimer’s, that goes almost straight to Twitter, where all opinions are roughly equal, for everyone to see. While I, and hopefully readers of this blog, understand why chocolate studies rarely if ever have any bearing on our lives, most people aren’t privileged to be equipped with the kind of time and education it takes to understand these issues. Science involves complicated theory with conflicting data, and jargon that’s hardly understandable or means something totally different to people on the outside. Science is hard, and it’s a privilege to have the resources and time to understand it. Most do not have that privilege.

Scientific research is increasingly discussed, consumed, used, and abused outside of gated scientific communities, but our institutions and culture are made for a time when it wasn’t. That comes with some danger if we fail to adapt. There have always been paths by which popular preference has impacted science, both positively and negatively, and modern communication tends to both catalyze these processes and bypass some of the checks and balances.

When all publications are public speech, all scientists are public speakers. Public discussion is, and should be, a major part of scientists’ jobs, and one which we should embrace. That means adapting scientific culture as well as institutions to meet these needs, and avoiding some of its pitfalls. We run real risks if we cannot adapt to this environment. Research which is poorly communicated, easily mistranslated, or otherwise misleading can cause real harm, directly or indirectly. The results of the CLAIMS study are at least partially a result of this new open scientific environment.

If it sounds like I am being vague about what that means in practical terms, it’s because I don’t know. We have ideas on what new models might look like for performing and communicating science, but we won’t know what does and doesn’t work unless we try. And with announcements like this one, it looks like we need to try harder, faster.

The 10,000 Octopus Problem

Noah Haber
Meet Paul the Octopus. Paul is famous. When Paul lived in an aquarium in Germany in 2008, his handlers decided to play a game. They would give him two boxes with some food in them, one for each team in Germany’s next match. Paul (mostly) predicted the outcomes of the German European Championship matches in 2008 by eating from the box of the winning team first, just happening to choose Germany each time. But some doubted Paul. They said he was lucky. So Paul stepped up his game, buckled down, studied up, and waited for his chance in the 2010 World Cup. He correctly predicted every single match the German team played, and then went on to predict the finals between the Netherlands and Spain. Don’t believe me? It’s all on Wikipedia.

Unfortunately, Paul has sadly shuffled off this mortal tentacoil, so we can’t do a “real” test of his skills. But we CAN review a few theories on just how Paul, like his namesake, was so prescient.

Theory #1: Paul is really good at predicting football matches

This probably isn’t the blog for you.

Theory #2: Paul (or his handler) loves Germany

In the 2008 matches, Paul chose the box representing his home, Germany, each time, and was mostly right, choosing correctly 4 times and wrongly 2. In 2010, Paul changed things up. He chose Germany only 5/7 times, and was correct in each instance. So Paul chose Germany 11 times out of 13 (let’s ignore the 2010 finals match, which Germany didn’t play in, for a moment). Maybe Paul’s handlers (who were German, remember) put tastier food in the Germany box. Or maybe Paul prefers the black, red, and gold stripes of the German flag. Who knows? More importantly, who cares when we have a far simpler explanation?

Theory #3: Paul got lucky

Paul got it right 12/14 times (counting the 2010 finals). Let’s assume for a moment that Paul’s prediction is basically a coin flip, and that he just got lucky. How lucky does Paul have to be for this to work? We can predict the probability of getting 12/14 coin flips right using a simple binomial distribution. Assuming that these 14 trials are all independent (we’ll get to that), the probability that Paul would have gotten exactly 12/14 matches right is roughly 0.6%. That’s not great, but that’s not bad either. An easier way to think about that probability is by its inverse. If you wanted one octopus to get 12/14 boxes right by random chance, you would need 180 octopodes. So it’s plausible that Paul got lucky.
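If you want to check that arithmetic, here is a quick sketch of the binomial calculation, treating every pick as an independent fair coin flip (an assumption, as noted above). The variable names are just for this example. It also prints the chance of doing at least that well, which is the version a p-value would use.

```python
from math import comb

n_matches, n_correct = 14, 12

# Chance that a coin-flipping octopus gets exactly 12 of 14 picks right.
p_exact = comb(n_matches, n_correct) / 2**n_matches
# Chance of 12 or more correct picks.
p_at_least = sum(comb(n_matches, k) for k in range(n_correct, n_matches + 1)) / 2**n_matches

octopuses_needed = 1 / p_exact
print(f"P(exactly {n_correct} of {n_matches}) = {p_exact:.2%}")
print(f"P(at least {n_correct} of {n_matches}) = {p_at_least:.2%}")
print(f"Roughly 1 in {octopuses_needed:.0f} coin-flipping octopuses gets exactly {n_correct}")
```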

Theory #4: Paul got lucky, and we’re bad at understanding uncertainty

#3 works if we had done this before Paul predicted matches. The problem is, we (mostly) didn’t. We know about Paul because he got a little lucky, retroactively. But we don’t know about all the other octopi whose handlers did the same thing, but failed. Most importantly, you don’t know about them because they failed. If you have enough octopuses that know nothing about football, one of them is going to just happen to get it right by chance. The coin that happened to flip the right combination isn’t special just because it happened to flip the right combination. Enough octopuses with typewriters will eventually write an exact copy of 20,000 Leagues under the Sea.

Now, of course, things aren’t really quite that simple, and we’re glossing over some important details. Paul didn’t really have two independent trials, he had two sets of trials (or three, depending if you count the 2010 final separately). He started becoming famous after the 2008 trials. But he didn’t get REALLY famous until 2010.

We can give Paul the benefit of the doubt and say he was learning the rules of the game in 2008 and look only at 2010, in which case the chances of predicting correctly all 8 matches in 2010 is 1/256, slightly more remarkable than our 1/180 above. On the other hand, maybe he truly did have a German flag preference, or the handler helped a little by offering better food in higher chance boxes, which would make Paul more likely to be correct by non-prescient influence.

To get all 13 of the Germany matches right, you would need 8,192 octopuses (2^13), or 16,384 (2^14) to have one get all 14 matches right if we include the 2010 final. If you had a few other advantages (like color / food preference) that number is lower. Let’s call it 10,000.
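As a back-of-the-envelope check, the sketch below asks a slightly different question: if 10,000 octopuses each picked purely at random, how likely is it that at least one of them gets a perfect record? It assumes independent 50/50 picks with no flag or food preference, and the function name is made up for this example.

```python
def prob_at_least_one_perfect(n_octopuses, n_matches):
    """Chance that at least one of n_octopuses coin-flippers gets every match right."""
    p_single = 0.5**n_matches  # one octopus going perfect by luck alone
    return 1 - (1 - p_single) ** n_octopuses

for n_matches in (13, 14):
    chance = prob_at_least_one_perfect(10_000, n_matches)
    print(f"{n_matches} matches: 1 in {2**n_matches:,} per octopus, "
          f"{chance:.0%} chance that at least one of 10,000 is perfect")
```

Having "enough" octopuses for one perfect record on average is not a guarantee, but it makes a lucky Paul far from surprising.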

Of course, there weren’t 10,000 octopodes picking matches. After all, there are only a few hundred aquariums in the world. But the comparison isn’t just other octopuses. There are thousands of other “low probability” events happening all the time, from other animals and other sports to anything else. We only know about things that happen, and not those that don’t, and tend to think those things that happen are more remarkable than they are.

Even if you repeat the experiment a bunch of times, streaks are random too. If you’ve ever heard of the Sports Illustrated cover jinx, that’s an extension of this problem. You get on the cover because you (randomly) had an anomalously good streak. You have much better chances of getting that streak if you are a better player, but it’s unlikely that you’ll repeat it a second time. You tend to need a LOT of experiments to tease out what is luck (noise), and what is skill (signal), and even then there is a chance you randomly get misleading results.

Most of the time, this problem is relatively harmless, like our probably not-so-prescient friend Paul here. Sometimes it is deeply harmful.

It matters

You’ve seen this before: Person has a deadly metastatic cancer, and is told they have 6 months to live. They take some supplement, and boom, cancer magically cured. Or someone tells you to punch sharks in the nose to avoid being eaten during an attack. Let’s just assume for a moment that all of that is literally, actually true. The problem is simple: you never hear about all the people who took supplements and punched sharks but died anyway. Some people just get lucky.

Most importantly, this happens EVERYWHERE. It’s the main reason why most of those studies that find near miraculous sounding cures for diseases don’t pan out, and why anecdotes make bad evidence, and why you shouldn’t pick your stocks on who made the most money last year. Statisticians aren’t people who make decisions with certainty; we’re people who spend a lot of time understanding and dealing with UNcertainty.

Updates: Corrected 2008 being the European Championship, not the World Cup. Credit to Matthew Rogers for finding this error. Corrected English because I am bad at copy-editing, credit to Dan Larremore.

Opinion: Misleading coffee studies have hidden consequences

Noah Haber
The following is the opinion of the author, and does not necessarily reflect scientific findings or theory.

A few weeks ago, Alex and I broke down why a study on coffee and its related media articles were misleading. While it might seem obvious that bad studies are bad for our health, the real damage that studies like this do is much deeper, though harder to see and measure. To understand why, we need to start from the obvious.

Direct impact: weak and misleading medical science leads to bad medical decisions

In 2015, two studies came out claiming that they found that statins, drugs typically used for high cholesterol, cause deadly side effects. The papers were both severely misleading, later resulting in retractions to statements in both papers, but before that happened media ran with their claims. A 2016 study led by Anthony Matthews looked at statin prescription and refill rates in the UK, and found compelling evidence that these two studies and their media coverage caused huge disruptions in statin refills and prescriptions, resulting in over 200,000 people ceasing taking their statins for a few months. I have plenty of nits to pick with this study, but my biggest is that they probably underestimated the total impact.

It is remarkably difficult to find the causal effect of weak and misleading causal evidence, but occasionally we get some hints. The example of statins is a particularly dramatic story for which we have the rare privilege of having strong evidence, and you can imagine that this sort of thing happens all the time and goes unmeasured.

Which brings us back to the coffee study in question. You would be right in thinking that coffee studies probably do little to no direct harm or help. It’s just coffee. However, you would be wrong to think the problem stops there.

Weak and misleading articles crowd out rigorous ones

That headline space is precious. In principle, every one of those articles could have been about better studies that could be more useful to decision makers. Even better, those media articles could have been written about topics on which there is scientific consensus. Similarly, the time and funding those researchers spent on this misleading coffee article probably could have been put to better scientific use, although it is worth noting that many of the proposed mechanisms to achieve more intense control of scientific studies would probably do more harm than good.

However, headline space, scientific progress, funding, and consumer exposure are not really zero sum games. Taking away one headline does not automatically mean that it will be replaced with a better one, or replaced with anything at all. Further, while this may be a particularly expensive coffee study due to the genetics aspect, most are dirt cheap. If I had to guess, putting the time and money spent on those to other studies would probably not result in a huge net gain for public health.

The most important and impactful reasons why these studies and their media coverage are damaging are far more subtle, and far more insidious.

Weak and misleading science erodes public trust and discourse in science

As usual, a comedian is the one that described it best: Lewis Black’s late 90’s rant on scientific studies flip-flopping on eggs.

As we showed in CLAIMS, the majority of what people see of health science is weak, misleading, and/or inaccurate. These headlines make up nearly 100% of almost everyone’s exposure to health science. While that represents only a fraction of health science, extremely few are privileged with getting to see the big picture, and most of those people are not writing for mainstream news. If the near entirety of what people see of studies looks like scientists flip-flopping on eggs, it shouldn’t be surprising that trust in scientific institutions is cracking. If people only see the least reliable health science, distrust is a reasonable response.

Unfortunately, many of us were indeed caught by surprise over the last few years as we watched severe backlash against scientific thought and institutions coming from news outlets and political rhetoric. When it is difficult for people to distinguish scientific strength, and people are used to weak science, it allows anyone with sufficient lack of knowledge and/or willingness to take advantage of the situation to more easily reject scientific consensus without cause.

We own this, and we need to fix it

A study like the one we saw a few weeks ago should never have entered into the public sphere, and maybe should not have been done at all. It adds little to nothing of note to our scientific knowledge, misleads health decisions, and continues the erosion of public trust in our institutions. Other studies, such as the statins example, have more immediate consequences.

We have a responsibility as scientists to educate, collaborate with, listen to, and intervene in public discussion. We also have a responsibility to encourage our best science, and reject our worst. Sometimes, that means trying things that are uncomfortable and risky.

The family of projects starting with CLAIMS are explicitly intended to be used to intervene and change the interaction of scientific institutions, media, and social media. Many of them are based in critique. Critical scientific review is an unusual thing to be doing at a time when trust in scientific institutions is low, and it makes for some strange (and severely mistaken) bedfellows. It’s a risk which we hope produces net positives for scientific progress and its impact on human lives.

Count the covariates: A proposed simple test for research consumers

Noah Haber
Trying to determine if a study shows causal effects is difficult and time consuming. Most of us don’t have that kind of time or training (yes, that includes almost all medical professionals too). I have a proposed idea for a potential test that anyone can do for any article linking some X to some health Y, and I want to hear your thoughts: count the covariates.

TL;DR: You may be able to get a decent idea of whether or not the study you just saw on social media about X linking to some Y shows a causal relationship by counting the number of covariates needed for the main analysis. The fewer variables controlled for, the more likely the study is to be interpretable as having strong causal inference. The more covariates, the more likely it is to be misleading.

A few important caveats: 1) THIS IS CURRENTLY UNTESTED, but we are working on formally testing a pilot of the idea; 2) It will certainly be imperfect, but it might be a good guideline; 3) This probably only works for studies shared on social media; and 4) This is an idea intended for people who don’t have graduate degrees in epidemiology, econometrics, biostats, etc., but the more you know, the better.

Why it could work:

The key intuition here is twofold. A study that is “controlling” for a lot of variables 1) is usually trying to isolate a causal effect, regardless of the language used; but 2) can’t.

Let’s see why this might work, using that coffee study from last week as an example.

Controlling for a lot of variables implies estimating causal effects

The logic comes down to what it means to “control” for something. Take smoking, for example. The reason the authors control for smoking is that smoking messes with their estimation of the effect of coffee on mortality. People who drink more coffee are also more likely to smoke. Smoking is bad for you. One reason, then, that people who drink more coffee might have different life expectancies is that they are likely to die earlier from smoking. So it makes sense to “control” for smoking then, right?

It does make sense, if you are trying to isolate the effect of drinking coffee on mortality. If you don’t care about that causal effect, and have some other reason to want to know this association, you generally don’t need or want to control for other variables. The more variables you control for, the less plausible it is that you are doing anything other than estimating a causal effect.
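For readers who like to see this in action, here is a minimal simulation sketch of the smoking example. Everything in it is made up for illustration: the function names, the probabilities, and the assumption that coffee has no effect on death at all. It is not the method used in the coffee study. It compares the raw difference in death risk between coffee drinkers and everyone else with the same difference after "controlling" for smoking by comparing within smokers and within non-smokers.

```python
import random

def simulate(n=200_000, seed=0):
    """Made-up world: smoking drives both coffee drinking and death.

    Coffee itself has NO effect on death here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.3
        # Smokers are more likely to drink a lot of coffee (assumed numbers).
        coffee = rng.random() < (0.7 if smoker else 0.3)
        # Death risk depends on smoking only, never on coffee.
        died = rng.random() < (0.20 if smoker else 0.05)
        rows.append((smoker, coffee, died))
    return rows

def risk(rows):
    """Share of people in `rows` who died."""
    return sum(died for _, _, died in rows) / len(rows)

def crude_difference(rows):
    """Death risk among coffee drinkers minus death risk among everyone else."""
    drinkers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return risk(drinkers) - risk(others)

def adjusted_difference(rows):
    """Same comparison made within smokers and within non-smokers,
    then averaged with weights equal to each group's share of the sample."""
    total = 0.0
    for smoker in (True, False):
        stratum = [r for r in rows if r[0] == smoker]
        total += crude_difference(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
crude = crude_difference(rows)
adjusted = adjusted_difference(rows)
print(f"crude: {crude:+.3f}  adjusted for smoking: {adjusted:+.3f}")
```

In this toy world the crude comparison makes coffee look harmful, and the smoking-adjusted comparison lands close to zero, which is the true (assumed) effect. The adjustment only makes sense as a step toward a causal answer.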

Controlling for a lot of variables implies inadequate methods to estimate a causal effect

Some research strategies get you great causal effect estimation without having to control for much of anything at all, such as randomized controlled trials, “natural experiments,” and many kinds of observational data analysis methods in the right scenarios. You can’t always do this successfully. Sometimes, you have to control or adjust for alternative explanations.

The problem is when you have to control for a LOT of alternative explanations. That generally means that there was no “cleaner” way to go about the study that didn’t require controlling for so many variables. That also means that there are probably a thousand other variables that they didn’t control for, or even have the data on those variables to start with. It only takes one uncontrolled for factor to ruin the effect analysis, and there are too many to count. There are also some slightly weirder statistical issues when you imperfectly control for something, and that’s more likely to happen when you are controlling for a lot of stuff.

In that coffee study, the authors controlled for the kitchen sink. However, coffee is related to basically everything we do. People from different cultural backgrounds have different coffee drinking habits. People with different kinds of jobs drink coffee differently. Fitness. Geographic region. Genetics. Social attitudes. You name it, and it is related to coffee. That’s not a problem by itself. What IS a problem is that all of those things ALSO impact how long you are going to live. If you have to control for everything, you can’t.
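A rough sketch of why "almost everything" is not good enough, again with invented numbers and hypothetical function names. In this made-up world there are three background traits, each of which raises both coffee drinking and the risk of death, and coffee itself does nothing. We then pretend we only measured some of the traits.

```python
import random
from collections import defaultdict

def simulate(n=200_000, n_traits=3, seed=1):
    """Made-up world: each background trait raises both coffee drinking and death risk."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        traits = tuple(rng.random() < 0.5 for _ in range(n_traits))
        k = sum(traits)
        coffee = rng.random() < 0.2 + 0.2 * k   # more traits, more coffee
        died = rng.random() < 0.05 + 0.10 * k   # more traits, more deaths; coffee plays no role
        people.append((traits, coffee, died))
    return people

def coffee_risk_difference(people, measured):
    """Risk difference for coffee after stratifying on the traits listed in `measured`."""
    # cells[stratum][coffee] = [deaths, people]
    cells = defaultdict(lambda: {False: [0, 0], True: [0, 0]})
    for traits, coffee, died in people:
        stratum = tuple(traits[i] for i in measured)
        cells[stratum][coffee][0] += died
        cells[stratum][coffee][1] += 1
    total = 0.0
    for by_coffee in cells.values():
        (d0, n0), (d1, n1) = by_coffee[False], by_coffee[True]
        # Within-stratum difference, weighted by the stratum's share of the sample.
        total += (d1 / n1 - d0 / n0) * (n0 + n1) / len(people)
    return total

people = simulate()
crude = coffee_risk_difference(people, measured=[])
partial = coffee_risk_difference(people, measured=[0, 1])
full = coffee_risk_difference(people, measured=[0, 1, 2])
print(f"no controls: {crude:+.3f}  two of three: {partial:+.3f}  all three: {full:+.3f}")
```

Controlling for two of the three traits shrinks the spurious "effect" of coffee but does not remove it. Only controlling for all three gets close to the true answer of zero, and in a real study nobody hands you the full list.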

Count the covariates

To review: controlling for a lot of variables implies that you are looking for a causal effect, but ALSO implies that there is more that needed to be controlled for to actually have estimated a causal effect. See the catch-22?

We can also take a look at causal language here as well. Studies are often considered acceptable in scientific circles (i.e. peer review in journals) as long as they use “technically correct” language with regard to causality. We think that is seriously misleading, but that doesn’t stop those studies from hitting our newsfeeds.

The most likely scenario for most people seeing a study that uses strong causal language and controls for very little is that it’s one of those studies that actually can estimate causality, such as most randomized controlled trials. On the other hand, a study that uses weak causal language and controls for very little probably isn’t actually trying to estimate a causal effect, and our proposed rule doesn’t really say much about whether or not these studies are misleading.

We can also look at the language used, where studies may use stronger (effect/impact/cause) or weaker (association/correlation/link) causal language. It’s also worth considering how the authors state their evidence can be used, as that can also imply that their results are causal. The kinds of studies that control for a lot of variables and state it as such are a strange bunch and unlikely to be seen in your social media news feed. This rule just doesn’t work as well for them, but most are unlikely to see them anyway, so the rule is still mostly ok.

Important considerations and discussion

Multiple specifications can make this hard to deal with. In the phrase “number of covariates required for the main analysis,” there are two tricky words: required and main. Most studies have several ways of going at the same problem, and it’s difficult to determine what the “main” one is. It is common that a study might have both a “controlled” and “uncontrolled” version, which may or may not produce very different numbers. If the numbers don’t change much between those two versions (or, even better, you have the background to know what is required and not), controlling for them probably wasn’t “required,” so they may not need to count. It is notable that the coffee study we keep talking about doesn’t do anything of the kind. All plausible main analyses are heavily controlled, and as such would fail any version and interpretation of this test.

There is probably a paradox that occurs here (credit to Alex Breskin for pointing this out). In the case of multiple studies on the same topic using roughly the same methods, observational studies controlling for more covariates probably do better with regard to causal inference. But because we are not selecting among studies in that way, and we are intending this as a guideline for ALL studies on social media, the opposite may be true.

It is also worth noting that this may end up being mostly indistinguishable from RCT vs. everything else, which is not the intent.

There are also some sets of methods which do require moderate numbers of covariates to work, and occasionally these articles appear in our news feeds. One example from Ellen Moscoe is difference-in-difference studies for causal effects of policies. These typically need controls for time and place, which is at minimum two covariates.
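To make the difference-in-difference idea concrete, here is a small worked sketch with made-up numbers and hypothetical place names. The two "covariates" are place (the state with the policy vs. a comparison state) and time (before vs. after). The estimate is the change in the policy state minus the change in the comparison state, and it only reads as a causal effect if the two places would have followed parallel trends without the policy.

```python
from statistics import mean

# Hypothetical outcome data (made-up numbers): (place, period) -> observed outcomes.
outcomes = {
    ("policy_state", "before"): [10.0, 11.0, 12.0],
    ("policy_state", "after"): [15.0, 16.0, 17.0],
    ("comparison_state", "before"): [8.0, 9.0, 10.0],
    ("comparison_state", "after"): [10.0, 11.0, 12.0],
}

def diff_in_diff(data, treated, control):
    """Change over time in the treated place minus change over time in the control place."""
    change_treated = mean(data[(treated, "after")]) - mean(data[(treated, "before")])
    change_control = mean(data[(control, "after")]) - mean(data[(control, "before")])
    return change_treated - change_control

estimate = diff_in_diff(outcomes, "policy_state", "comparison_state")
print(estimate)  # 3.0: the policy state rose by 5, the comparison state by 2
```

A study like this would show up with at least two covariates under the counting rule even though the design can be a reasonable one, which is why it is listed here as an exception.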

We also just don’t know if this idea actually works. But it might, and we can test it.


Any thoughts on why this might fail? Alternative proposed tests? Let us know in the comments or get in touch!

Do coffee studies make causal inference statisticians die earlier?

Alexander Breskin

Noah Haber
This week, yet another article about the association between coffee and mortality plastered our social media feeds. This trope is so common that we used it as an example in our post on LSE’s Impact Blog, which happened to be released the very same day this study was published. We helped comment on the study and reporting for a post in Health News Review, which focused on how the media misinterpreted this study. Most news media made unjustifiable claims that suggested that drinking more coffee would increase life expectancy. The media side, however, is only half of the story. The other half is what went wrong on the academic side.

In order to have estimated a causal effect, the researchers would have needed to find a way to account for all possible reasons that people who drink more coffee might have higher/lower mortality that aren’t the direct result of coffee. For example, maybe people who drink a lot of coffee do so because they have to wake up early for work. Since people with jobs tend to be healthier than those who don’t, people who drink coffee may be living longer because they are healthy enough to work. However, this study can’t control for everything, so what they find is an association, but not an association that is useful for people wondering whether they should drink more or less coffee.

The study is very careful to use language which does not technically mean that they found that drinking more coffee causes longer life. That makes them technically correct, because their study is simply incapable of rigorously estimating a causal effect, and they don’t technically claim they do. Unfortunately, in the specific case of this study, hiding behind technically correct language is at least mildly disingenuous. Here is why:

1) The authors implied causation in their methodological approach

The analytic strategy provides key clues that the study was designed to answer a causal question. Remember above where we talked about controlling for alternative explanations? If you are only interested in association (and there might be some reasons why you might want this, albeit a bit contrived), you don’t need to control for alternative explanations. As soon as you start trying to eliminate/control for alternative explanations, you are, by definition, trying to isolate the one effect of interest. This study tries to control for a lot of variables, and by doing so, tries to rule out alternative explanations for the association they found. There is no reason to eliminate “alternatives” unless you are interested in a specific effect.

2) The authors implied causality in their language, even without technically saying so

The authors propose several mechanistic theories for why the association was found, including “reduced inflammation, improved insulin sensitivity, and effects on liver enzyme levels and endothelial function.” Each of those theories implies a causal effect. When interpreting their results, they state that “coffee drinking can be a part of a healthy diet.” Again, that is a conclusion which is only relevant if they were looking at the causal effect of coffee on health, which they cannot make. How can you say if coffee is ok to drink if you didn’t tell me anything about the effect of drinking coffee?

3) Alternative purposes of this study are implausible or meaningless

Effect modification by genetics

The stated purpose of the study and its contribution to the literature is about the role of genetics in regulating the impact of coffee on mortality. The problem here, again, is that in order to determine the impact of genetics on regulating the effect of coffee on mortality, you first have to have isolated the effect of coffee on mortality. You can not have “effect modification” without first having an “effect.” That’s a shame, because it is totally plausible that there was some neat genetics science in this study that we aren’t qualified to talk about.
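Here is one hedged illustration of why that matters, using an entirely invented scenario (the variant, the probabilities, and the function name are all assumptions of this sketch, not anything from the study). Coffee never affects death in this toy world. Smoking does, and coffee drinking is assumed to be tied to smoking only among carriers of a hypothetical gene variant.

```python
import random

def association_by_genotype(n=200_000, seed=2):
    """Crude coffee-death risk difference within each genotype.

    Made-up world: coffee never affects death. Smoking does, and the link
    between smoking and coffee is assumed to exist only in variant carriers.
    """
    rng = random.Random(seed)
    # counts[carrier][coffee] = [deaths, people]
    counts = {g: {c: [0, 0] for c in (False, True)} for g in (False, True)}
    for _ in range(n):
        carrier = rng.random() < 0.5
        smoker = rng.random() < 0.3
        if carrier:
            p_coffee = 0.9 if smoker else 0.3   # coffee tracks smoking in carriers
        else:
            p_coffee = 0.5                      # coffee unrelated to smoking otherwise
        coffee = rng.random() < p_coffee
        died = rng.random() < (0.20 if smoker else 0.05)
        cell = counts[carrier][coffee]
        cell[0] += died
        cell[1] += 1
    result = {}
    for carrier, by_coffee in counts.items():
        risk_coffee = by_coffee[True][0] / by_coffee[True][1]
        risk_none = by_coffee[False][0] / by_coffee[False][1]
        result[carrier] = risk_coffee - risk_none
    return result

by_genotype = association_by_genotype()
print(by_genotype)
```

The coffee-mortality association differs sharply between carriers and non-carriers here, yet there is no effect of coffee for genetics to modify. A difference in associations across genotypes can simply be a difference in confounding.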

Contribution to a greater literature

In general, we should ignore individual studies, and look at the consensus of evidence that is built up by many studies. However, there are literally hundreds of studies about coffee and mortality, almost all of which commit the exact same errors with regard to causation. One more study that is wrong for the same reason that all the other studies are wrong gives a net contribution of nearly nothing. They may be contributing to the genetics literature, but this study does not add any meaningful evidence to the question of whether or not I should have another coffee.

4) Duh.

Studying whether coffee is linked to mortality is inherently a causal question. To pretend otherwise is like a batter missing a swing, and then claiming they didn’t want to hit the ball anyway. Just by conducting this study, a causal effect is implied, but as we already noted this kind of study is not useful for causal inference. This specific issue is unfortunately common for studies in our media feeds, and was one of the reasons we did the CLAIMS study in the first place. We contend that researchers need to be upfront about the fact that they want to estimate causal effects, and to then consider whether or not it is reasonable to do so for the exposures and outcomes they are considering.

We also cannot stress enough a more general point: the authors of this study and the peer review process made a lot of mistakes, but this study does not represent all of academic research. It is a shame that studies like these are what make the top headlines time and time again instead of the excellent work done elsewhere.

Can’t we please just accept coffee (and wine and chocolate) for what it is: delicious?