GiveWell’s Uncertainty Problem

Noah A. Haber 
noahhaber@gmail.com
ORCiD: 0000-0002-5672-1769

Links:
Code and data repository
Modified model sheet for PSA demo
Web app: Two programs with uncertainty
Web app: Conceptual demonstration of overall selection problem
Web app: Selection rules for individual candidate programs

Background

GiveWell’s key decision point for its models is whether the intervention being modeled provides at least 3x as much expected benefit per dollar spent as a direct cash transfer, using the expected modeled effectiveness of GiveDirectly as a proxy. The key value of interest is the “best guess” point estimate for both inputs and outputs of the model. While there are some subjective assessments and adjustments dealing with uncertainty in GiveWell’s models, these are relatively rare, suggesting that GiveWell’s assessments are designed to be fairly risk (or uncertainty) neutral, or at least very risk tolerant.

The driving assumption is that this neutrality would give the best possible mix of recommendations. Some recommended interventions would have underestimated cost-effectiveness and some overestimated, but the recommendations as a group would be nearly as good as possible given the information available.

Unfortunately, the particular circumstances of GiveWell’s decision-making process yield some major unintended (but fixable) consequences. Specifically, GiveWell’s decision process favors programs that have more unreliable evidence, are overestimated, and have weaker underlying cost-effectiveness.

This essay proceeds as follows:

  • A set of stylized simulation-based demonstrations that show how:
    • Individual programs with unreliable evidence have a selection advantage
    • Individual programs that are worse have a (limited) selection advantage
    • The selected group of programs will be strongly biased toward more unreliable and weaker programs, at the expense of better programs
  • A real implementation of a miniature probabilistic sensitivity analysis (PSA) demonstrating that:
    • Uncertainty is a serious and largely ignored problem in GiveWell’s models
    • That probabilistic sensitivity analysis is implementable within GiveWell’s existing workflow
  • A set of simulation-based demonstrations of alternative decision rules based on a PSA infrastructure

The major impact of these suggestions is that they change how GiveWell decides which programs are cost-effective and by what measures. That is expected to change the mix of programs selected. In particular, it is likely that if these suggestions are taken:

  • GiveWell will be able to recommend some of the programs it would otherwise reject due to improved uncertainty-forward metrics for evaluation.
  • GiveWell will no longer recommend some of the programs in its recommendation portfolio due to the evidence being too unreliable.

While it is unclear exactly how many programs would change status with an uncertainty-forward decision-making process, it is almost certain that at least one program will change status in both directions (not recommended to recommended and recommended to not recommended). It is plausible that many interventions would have their statuses switched. Even one proposed intervention switching statuses would have a major impact on GiveWell’s recommendations.

Conceptual underpinnings

In the following section, I generate a number of simulation-based conceptual demonstrations. While these closely resemble GiveWell’s decision rules, they are here to develop conceptual intuition on the mechanics of the problem(s) addressed. Regarding technical language, see disclaimer. I recommend opening the provided apps in a second window as you read; they are there for you to play with.

GiveWell’s decision-making framework

GiveWell’s decision-making framework, like many cost-effectiveness-based frameworks, is based on a threshold, where programs are considered highly cost-effective if the estimated CE value is ≥ 3. However, these estimates are made with some amount of uncertainty, where we cannot know the “true” cost-effectiveness but can get a rough idea of how much uncertainty there may be.

Warm up exercise

https://noahhaber.shinyapps.io/GW_two_dists/

To build some intuition on where problems start to arise, consider two comparable hypothetical programs being reviewed by GiveWell. Program A is fairly cost-effective and well-measured, with a true value of 2.8, but the value we actually measure varies by chance alone, following a normal distribution with an SD of 0.5. Program B is slightly worse overall (true value = 2.5), and is more poorly measured, so that there is a wider range of values we might have measured by chance (SD = 1.5).

As shown in the above app, 34% of the time, Program A would randomly be measured as being ≥ 3, while Program B would randomly be measured as being ≥ 3 and selected 37% of the time. In other words, despite Program B being overall worse and more poorly measured, GiveWell is more likely to include Program B in its selected portfolio. As a general rule, the more uncertainty there is, the closer a program’s selection comes to a coin flip. If most programs evaluated are truly below our threshold, then the more uncertainty there is in a program’s evaluation, the more likely it is to be selected.
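
The two percentages above follow directly from the normal distribution. As a minimal sketch (Python, standard library only; the function name is illustrative and is not taken from the app), they can be reproduced like this:

```python
from math import erf, sqrt

def prob_measured_at_least(threshold, true_value, sd):
    """P(measured >= threshold) when the measurement is Normal(true_value, sd)."""
    z = (threshold - true_value) / sd
    normal_cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z
    return 1.0 - normal_cdf

# Values from the warm up exercise: threshold 3, Program A (2.8, SD 0.5), Program B (2.5, SD 1.5)
p_a = prob_measured_at_least(3.0, true_value=2.8, sd=0.5)
p_b = prob_measured_at_least(3.0, true_value=2.5, sd=1.5)
print(f"Program A measured >= 3: {p_a:.0%}")  # 34%
print(f"Program B measured >= 3: {p_b:.0%}")  # 37%
```

Raising `sd` for a program whose true value is below the threshold pushes its selection probability up toward 50%, which is the coin-flip behavior described above.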

As a result, GiveWell’s decision rule isn’t just “risk neutral” or “expected value maximizing,” it selects and biases FOR uncertainty. This is a major issue by itself; some programs will get lucky and be anomalously overestimated (i.e. the “winner’s curse”). Unfortunately, as demonstrated in subsequent sections, the problem becomes more serious in aggregate, and much more serious when we consider how uncertainty and program quality are connected.

Arguably, one of the reasons that GiveWell’s rule is 3x GiveDirectly in the first place is as a partial hedge against uncertainty. In other words, if we were very sure of GiveWell’s estimates, the rule should be 1x. However, because the threshold rule operates only on the point estimate and ignores uncertainty, the threshold rule has to hedge against both the mean (which it does well) and the amount of uncertainty (which it does poorly). As will be shown later, a rule that incorporates the uncertainty directly lets us relax that tension.

Sources of uncertainty in cost-effectiveness modeling

In this essay, I am going to be somewhat vague about what kinds of uncertainty I am dealing with. The clearest one is statistical uncertainty from sampling, which is typically what is provided by the literature in the form of confidence intervals, standard errors, p-values, etc. However, there are many more forms of uncertainty in modeling, including but not limited to:

  • Internal identification and study design issues in the studies from which model parameters are taken
  • External validity/generalizability of source studies relative to the target population of interest
  • Design choices and parameterizations in CE models
  • Unknown unknowns

For the sake of brevity, I am putting it all under the label “uncertainty,” and keeping it vague. Deciding what kinds of uncertainty to consider, and how, is an exercise in and of itself, and one that must be done very carefully.

Importantly, uncertainty in models grows rapidly as more assumptions are required, more estimates are made, parameters interact with each other, etc. Total uncertainty in a model is much larger than the sum of its parts, as discussed later.

Selection in aggregate

https://noahhaber.shinyapps.io/GiveWells-Uncertainty-Problem/

Baseline scenario

The app above generates 2,000 programs from which GiveWell could hypothetically select. As a starting assumption, we assume that on average programs have a true relative cost-effectiveness of 1 (i.e. below our selection threshold), randomly drawn from a normal distribution with an SD of 1.5. This creates 2,000 programs, where we know their true cost-effectiveness. Then, we add measurement error to our estimate of how cost-effective these programs are, simulating uncertainty in measurement and assessment. Finally, we apply GiveWell’s threshold-based decision rule to those programs and observe which programs we select for inclusion into GiveWell’s portfolio and which ones we reject.

For the purposes of this demonstration, we are interested in the degree to which we get our decisions right, using whether the true cost-effectiveness of the program is better than or worse than cash transfers (i.e. a relative CE value of 1) as a proxy for whether we made the right decision.

True status | Rejected (negative) | Accepted (positive)
Negative (CE < 1) | True rejection | False acceptance
Positive (CE ≥ 1) | False rejection | True acceptance

In the first tab (“True vs false rejections”), we see the distribution of programs we accept and reject compared with whether or not they were truly better or worse than cash transfers. As expected, we generate many false rejections, due largely to the decision threshold of 3 being a “hedge” of sorts. More importantly, we observe false positives (i.e. programs that got “lucky”).

In the “Selection bias” tab, we see that there is a substantial bias in the programs that we select. Programs that we select have a large positive bias on average. If it were the case that being “risk-neutral” yields unbiased results, we would expect equal amounts of bias among the programs that we select and reject.

The fundamental reason for this is shown in the third tab (“Uncertainty vs bias”), where we see the relative amount of uncertainty on the x axis and the amount of bias on the y axis. Because we are selecting on a threshold and ignoring uncertainty, we are more likely to be selecting programs in the upper right quadrant (i.e. those that are truly better and/or more lucky and biased).
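
For readers who prefer code to an app, here is a rough sketch of the baseline scenario in Python (standard library only). The 2,000 programs, the Normal(1, 1.5) true values, and the threshold of 3 come from the text; the per-program measurement SD drawn uniformly between 0.5 and 2.0 is an assumption made for illustration and is not necessarily what the app uses.

```python
import random
import statistics

def simulate_selection(n_programs=2000, threshold=3.0, seed=1):
    rng = random.Random(seed)
    programs = []
    for _ in range(n_programs):
        true_ce = rng.gauss(1.0, 1.5)      # true relative cost-effectiveness
        error_sd = rng.uniform(0.5, 2.0)   # ASSUMPTION: per-program measurement SD
        measured = true_ce + rng.gauss(0.0, error_sd)
        programs.append({
            "true": true_ce,
            "error_sd": error_sd,
            "bias": measured - true_ce,        # how much luck inflated the estimate
            "selected": measured >= threshold  # point-estimate threshold rule
        })
    return programs

programs = simulate_selection()
selected = [p for p in programs if p["selected"]]
rejected = [p for p in programs if not p["selected"]]

mean_bias_selected = statistics.mean(p["bias"] for p in selected)
mean_bias_rejected = statistics.mean(p["bias"] for p in rejected)
false_acceptances = sum(p["true"] < 1.0 for p in selected)
false_rejections = sum(p["true"] >= 1.0 for p in rejected)

print(f"selected: {len(selected)}, rejected: {len(rejected)}")
print(f"mean bias among selected: {mean_bias_selected:+.2f}")
print(f"mean bias among rejected: {mean_bias_rejected:+.2f}")
print(f"false acceptances: {false_acceptances}, false rejections: {false_rejections}")
print(f"mean measurement SD, selected: {statistics.mean(p['error_sd'] for p in selected):.2f}")
print(f"mean measurement SD, rejected: {statistics.mean(p['error_sd'] for p in rejected):.2f}")
```

Every program's measurement error has mean zero, yet conditioning on "measured ≥ 3" leaves the selected group with a positive average bias. The exact magnitudes depend on the assumed error distribution.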

What happens if the amount of uncertainty and the true value of the program are related?

For the previous demonstration, we assumed that there is no relationship between the quality of the program and the amount of uncertainty. However, it is likely that programs that are truly better or more promising will also be better measured (i.e. less uncertain). There are many reasons to believe this is the case in general: the more likely a program is to be effective, the more likely it is that researchers and policy makers will invest in rolling it out, testing it, and retesting it; higher quality, stronger, more direct, and therefore more expensive tests will usually be reserved for programs more likely to yield strong results; regression to the mean-type effects will tend to temper expectations for good programs more efficiently than for poor programs; and publication biases and related adverse selection forces in academia will tend to overselect strongly biased results and more rarely measured interventions.

The “Program certainty reduction factor” in the app reduces more of the overall amount of uncertainty for programs that are truly more cost-effective. Crank it up to 1, and see what happens: we get even more false positives and false negatives and more overall bias, despite this factor reducing the overall uncertainty in our program measurement. While the specific numbers here do not represent reality, this is the most likely scenario we are in.

While there is no way to completely eliminate this issue, there are relatively low-lift ways of substantially mitigating it. One option, discussed in more detail in the section on “Alternative decision rules” below, uses a lower bound of the uncertainty distribution for each program measured, compared against a slightly more generous (i.e. lower) decision threshold. Using this alternative decision rule, we get fewer false rejections, fewer false acceptances, and reduced bias. This uncertainty-forward rule is particularly important and bias-reducing in the case where better programs have less uncertainty.

Figure: Selection bias with the default decision rule (left) and the CI-based decision rule (right) when more cost-effective programs are also better measured

In the app, I recommend flipping between the default and CI-based rules in different simulation scenarios to gain some intuition.
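
The same comparison can be sketched outside the app. The Python below (standard library only) compares the point-estimate rule (measured ≥ 3) with a lower-bound rule (20th percentile ≥ 2) when better programs are better measured. Two loud assumptions: the measurement SD is a deterministic, decreasing function of true cost-effectiveness (a stylized stand-in for the app's "Program certainty reduction factor", not its actual formula), and the lower bound uses a normal approximation with a known SD rather than a PSA percentile.

```python
import random
import statistics

Z_20TH = 0.8416  # about 80% of a normal distribution lies above mean - 0.8416 * SD

def simulate(n_programs=2000, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n_programs):
        true_ce = rng.gauss(1.0, 1.5)
        # ASSUMPTION: better programs are better measured (stylized, deterministic link)
        error_sd = max(0.25, 2.5 - 0.75 * true_ce)
        measured = true_ce + rng.gauss(0.0, error_sd)
        rows.append((true_ce, error_sd, measured))
    return rows

def evaluate(rows, accept):
    """Count decision errors and mean bias among accepted programs for one rule."""
    accepted = [r for r in rows if accept(r)]
    rejected = [r for r in rows if not accept(r)]
    return {
        "accepted": len(accepted),
        "false_acceptances": sum(true < 1.0 for true, _, _ in accepted),
        "false_rejections": sum(true >= 1.0 for true, _, _ in rejected),
        "mean_bias_accepted": statistics.mean(m - t for t, _, m in accepted),
    }

rows = simulate()
point_rule = evaluate(rows, lambda r: r[2] >= 3.0)                        # default rule
lower_bound_rule = evaluate(rows, lambda r: r[2] - Z_20TH * r[1] >= 2.0)  # LB rule, lower threshold

for name, result in (("point estimate >= 3", point_rule),
                     ("20th percentile >= 2", lower_bound_rule)):
    print(name, result)
```

Under these particular assumptions the lower-bound rule accepts more programs while producing fewer false acceptances and fewer false rejections. Other assumed relationships between quality and uncertainty can give different magnitudes, so treat this as intuition-building rather than an estimate.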

Summary

  • When a decision is made on whether a point estimate exceeds a threshold and ignores uncertainty, programs that are more uncertain will tend to be more likely to be selected, leading to biased selections.
  • When there is an inverse relationship between the value of the programs and the amount of uncertainty involved in predicting their true value, programs that are both more uncertain and overall worse will tend to be even more likely to be selected over better evaluated programs.
  • The more uncertainty and more pronounced the relationship between certainty and value, the worse the problem is, to the point where it can completely overwhelm the value of the selection exercise.

Is this a problem for GiveWell’s models? A review of uncertainty in GiveWell models and demonstration of probabilistic sensitivity analysis

In short: yes, this is a major issue for GiveWell’s models.

Ideally, I would estimate the extent of the uncertainty by replicating all of GiveWell’s models and reassessing the certainty involved in every relevant decision and parameter. Unfortunately, doing that for even one model would take more time than is available. Rather than try to give a reliable estimate of the extent of the true uncertainty, I instead show what happens with a small amount of prodding, as a preview of what might happen with a full accounting of uncertainty. Code, data, and modified workbook are available.

For the Deworm the World and Malaria Consortium models assessed, I:

  1. Looked for a handful of key parameters pertaining to the overall effectiveness of the program and prevalence of the issue being addressed
  2. Traced those parameter values back to the original data, and located the statistical sampling uncertainty provided
  3. Decided on a conservative, reasonable distribution of uncertainty that reflects the statistical uncertainty identified
  4. Using those data, ran a script that randomly varies those parameters in the model by the uncertainty distributions identified (i.e. ran a very limited probabilistic sensitivity analysis)
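
To make step 4 concrete, here is a minimal sketch of a probabilistic sensitivity analysis in Python (standard library only). Everything in it is hypothetical: `toy_model`, its four parameters, and the distributions are invented for illustration and are not GiveWell's model, its parameter values, or the R script in the repository.

```python
import random
import statistics

def toy_model(prevalence, effect, benefit_per_case_averted, cost_per_person):
    """HYPOTHETICAL model: benefit per dollar, expressed in multiples of cash transfers."""
    return prevalence * effect * benefit_per_case_averted / cost_per_person

def run_psa(n_draws=10_000, seed=1):
    """Re-run the model many times, drawing uncertain inputs from their distributions."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        prevalence = rng.betavariate(20, 80)  # hypothetical survey estimate, mean 0.20
        effect = rng.gauss(0.30, 0.15)        # hypothetical effect size; may fall below 0
        draws.append(toy_model(prevalence, effect, 50.0, 1.0))
    return draws

point_estimate = toy_model(0.20, 0.30, 50.0, 1.0)  # "best guess" inputs only
draws = sorted(run_psa())
share_below_cash = sum(d < 1.0 for d in draws) / len(draws)
share_harmful = sum(d < 0.0 for d in draws) / len(draws)

print(f"point estimate: {point_estimate:.2f}")
print(f"PSA mean: {statistics.mean(draws):.2f}")
print(f"10th to 90th percentile: {draws[len(draws) // 10]:.2f} to {draws[9 * len(draws) // 10]:.2f}")
print(f"share of draws below cash (CE < 1): {share_below_cash:.1%}")
print(f"share of draws that are harmful (CE < 0): {share_harmful:.1%}")
```

The list of draws is the object that every decision rule later in this essay operates on. Only two inputs vary here; adding more uncertain inputs would widen the distribution further.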

Important note: The results below are an *underestimate* of the amount of uncertainty, reflect an arbitrarily chosen set of parameters, and are not designed for any kind of comparison between models. The more parameters added to the PSA (as would be done in a real exercise), the more decisions modeled, the more thorough the investigation, and the more systematic the uncertainty estimation, the more uncertainty would be found. In real life, each of these models is MUCH more uncertain than what is shown here, but by an unknown amount.

The results for the two models above give a flavor for what might be found with a full accounting of uncertainty, noting that the total amount of uncertainty is likely to be much larger than the above in the real exercise. In the Deworm the World intervention, we see that allowing a very small number of variables to vary within their statistical sampling uncertainty yields a very large range of potential outcomes, including those that yield a worse return than cash transfers or are actively harmful. The Malaria Consortium intervention, on the other hand, is not quite as sensitive in this exercise (again noting that this is an underestimate of the amount of uncertainty, and that identifying which interventions are more uncertain or better is not feasible from this exercise). Should a full exercise yield similar certainty on the results, we might be more sure about this particular intervention, and weight it more highly.

Uncertainty compounds rapidly and complicatedly

One of the key reasons that these models are so sensitive is that they layer and multiply uncertainty onto uncertainty. A highly uncertain parameter (such as the prevalence of a disease) multiplied by another highly uncertain parameter (such as the effectiveness of the intervention) produces much more uncertainty than the sum of its parts. The more steps needed to get to effectiveness, the larger the assumptions, the less direct the evidence, and the more uncertainty there is in the parameters, the more rapidly that uncertainty grows. There is no escaping this problem, only addressing it head on.
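
One way to see the compounding is through the coefficient of variation (CV = SD / mean). For a product of independent parameters, the relative variance is the product of (1 + CV²) terms minus 1, which is always at least the sum of the individual relative variances. The sketch below (Python, standard library only) checks that identity by simulation. The independence assumption, the normal distributions, and the CV of 0.3 per parameter are illustrative choices, not values from GiveWell's models.

```python
import math
import random

def product_cv_exact(cvs):
    """CV of a product of independent parameters: sqrt(prod(1 + cv_i^2) - 1)."""
    return math.sqrt(math.prod(1.0 + cv * cv for cv in cvs) - 1.0)

def product_cv_simulated(cvs, n_draws=50_000, seed=1):
    """Monte Carlo check: multiply independent Normal(1, cv) draws and measure the CV."""
    rng = random.Random(seed)
    products = []
    for _ in range(n_draws):
        value = 1.0
        for cv in cvs:
            value *= rng.gauss(1.0, cv)  # each parameter has mean 1 and SD equal to its CV
        products.append(value)
    mean = sum(products) / n_draws
    variance = sum((p - mean) ** 2 for p in products) / n_draws
    return math.sqrt(variance) / mean

for n_params in (1, 2, 4, 8):
    cvs = [0.3] * n_params  # ASSUMPTION: every parameter has a CV of 30%
    print(f"{n_params} parameters: exact CV = {product_cv_exact(cvs):.2f}, "
          f"simulated CV = {product_cv_simulated(cvs):.2f}")
```

With eight multiplied parameters at 30% uncertainty each, the output's SD is roughly as large as its mean. Correlated parameters or skewed distributions can make this better or worse, which is part of why a full PSA is more informative than a formula.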

So, to answer the key question here: Is there enough uncertainty in these models that we should be seriously worried about making weak decisions due to ignored uncertainty? The answer is definitely “yes.”

By sidestepping the problem of uncertainty, GiveWell’s modeling imposes extremely strong assumptions that result in estimates that are biased and less reliable than they could be. It also misses that many of these interventions are measured with a strong level of reliability.

One-way sensitivity analyses should also be performed (i.e. varying only one parameter at a time) to best identify areas where it is more valuable to invest more time resolving uncertainties, and can easily be incorporated into a sensitivity analysis workflow.
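
A one-way sensitivity analysis can be sketched in a few lines of Python. The model, base values, and low/high ranges below are all hypothetical placeholders, not GiveWell parameters. Each parameter is moved to its low and high value while the others stay at their base values, and parameters are ranked by how much the output swings.

```python
def toy_model(prevalence, effect, benefit_per_case_averted, cost_per_person):
    """HYPOTHETICAL model: benefit per dollar, expressed in multiples of cash transfers."""
    return prevalence * effect * benefit_per_case_averted / cost_per_person

# ASSUMPTION: illustrative base values and plausible ranges
BASE = {"prevalence": 0.20, "effect": 0.30,
        "benefit_per_case_averted": 50.0, "cost_per_person": 1.0}
RANGES = {"prevalence": (0.12, 0.28), "effect": (0.0, 0.60),
          "cost_per_person": (0.8, 1.25)}

def one_way_sensitivity(model, base, ranges):
    """Vary one parameter at a time; return (name, low output, high output, swing)."""
    results = []
    for name, (low, high) in ranges.items():
        outputs = [model(**{**base, name: value}) for value in (low, high)]
        results.append((name, min(outputs), max(outputs), max(outputs) - min(outputs)))
    return sorted(results, key=lambda row: row[3], reverse=True)

results = one_way_sensitivity(toy_model, BASE, RANGES)
for name, low_out, high_out, swing in results:
    print(f"{name:>16}: {low_out:.2f} to {high_out:.2f} (swing {swing:.2f})")
```

In this toy example the effect size dominates, so that is where additional evidence would be most valuable. The ranking is the useful output, much like a tornado diagram.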

A brief infrastructure and workflow interlude

GiveWell uses Google Sheets as the primary way it designs, develops, and shares its models. While PSA is much easier and more efficient using models implemented in more fit-to-task software such as R, Google Sheets is sufficient and accessible for a wider range of people. Rather than completely upend GiveWell’s existing infrastructure, the PSA here is based on a nearly exact copy of GiveWell’s existing models, with the only modifications being some cell references moved for convenience and added sheets defining some inputs and outputs. In this case, the code for running the PSA interfaces directly with the Google Sheets from R. A comparable approach would require little to no general change in model design workflow for GiveWell’s researchers, and would maintain the accessible openness of public Google Sheets documents. All code and files are, of course, available on my GitHub with no restrictions.

In other words, the PSA implementation here is designed to drop neatly into GiveWell’s workflow and infrastructure. In theory, GiveWell researchers could remix the code and design from this essay to implement their own PSA procedures and PSA-based decision rules (implemented and discussed in the next section).

The very lightly modified data sheet for performing this PSA is here for perusing.

Potential methods and frameworks for addressing uncertainty

https://noahhaber.shinyapps.io/GW-ind-rules/

To review briefly, in the previous sections we have shown 1) the problem with using the point estimate alone in combination with a threshold, 2) that this is a serious issue for GiveWell’s models, and 3) how to generate a distribution of uncertainty around the CE estimates using PSA. In this section, we explore alternative decision rules and frameworks to mitigate these issues based on PSA.

Option 1) Just look at the uncertainty

The most important part of this exercise is to look at and understand the uncertainty involved in these models and escape from viewing only the point estimate. While informal, simply going through the exercise and being confronted with uncertainty may be enough on its own to impact decisions. It is likely to have knock-on effects on study design as well.

Option 2) Use a lower bound of the uncertainty interval

Rather than using the mean value of the uncertainty distribution (i.e. the point estimate), researchers can use a lower bound (e.g. the lower bound of a one-sided 80% interval). The lower bound is affected both by the expected value of the impact AND the distribution of uncertainty, where the LB will be lower the more uncertainty there is. In practice, this one-sided 80% lower bound is simply the 20th percentile value of the distribution of outcomes generated from the PSA.

While using the LB instead of the point estimate would inherently lead to more conservative decisions, we can change that by also changing the threshold. As discussed earlier, the existing threshold is doing too much work to try to make up for the fact that it does not include uncertainty, so it needs to be very large. When we include uncertainty directly through using a lower bound, we can relax that necessity and use a lower threshold (e.g. 2x GiveDirectly). By doing these simultaneously, GiveWell can be less conservative (i.e. accept more programs), have fewer false positives among those it selects, and have a portfolio with much less bias. It is truly a win-win.
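
As a sketch of how this rule operates on PSA output (Python, standard library only): the two sets of draws below are synthetic stand-ins for PSA results, with hypothetical means and SDs, and the function names are illustrative. Both programs have the same expected value, so the point-estimate rule cannot tell them apart, but the lower-bound rule can.

```python
import math
import random

def percentile(values, pct):
    """Nearest-rank percentile of a list of PSA draws (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def accept_by_lower_bound(draws, threshold=2.0, pct=20):
    """Accept when the 20th percentile of the draws clears the (lower) threshold."""
    return percentile(draws, pct) >= threshold

# ASSUMPTION: synthetic draws standing in for PSA output of two hypothetical programs
rng = random.Random(1)
well_measured = [rng.gauss(3.2, 0.5) for _ in range(10_000)]
poorly_measured = [rng.gauss(3.2, 2.0) for _ in range(10_000)]

for name, draws in (("well measured", well_measured), ("poorly measured", poorly_measured)):
    mean = sum(draws) / len(draws)
    print(f"{name}: mean {mean:.2f} (point rule accepts: {mean >= 3.0}), "
          f"20th percentile {percentile(draws, 20):.2f} "
          f"(lower-bound rule accepts: {accept_by_lower_bound(draws)})")
```

The threshold of 2 and the 20th percentile are the example values from the text. Both are tuning choices.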

Option 3) Use a probability-based threshold with a discrete comparator

There is an alternative framework to think in: instead of using a single value of the distribution of cost-effectiveness values, we can change the question to “What is the probability that this intervention is better than 1x (i.e. cash transfers)?” We can set a critical value for that threshold (e.g. we accept programs that we are 90% sure are better than cash transfers). As above, that value comes straight out of the distribution from our PSA: it’s simply the proportion of outcome results from our PSA-generated distribution which are ≥1.

If that sounds a lot like p-value-based null hypothesis significance testing, that’s because it is. Here our “null” is 1 (i.e. the ratio of the program of interest’s estimate to the cost-effectiveness of cash transfers), and our alpha is set by our choice of probability cutoff (one minus that cutoff). While null hypothesis significance testing is often problematic and maligned in many circles of science, this is a case where it is ideally suited for the problem at hand. The tradeoff is that we stop caring about the actual numerical value of how cost-effective the program is, and instead care about how sure we are that it’s better than cash transfers.
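
In code, this rule is a one-line proportion. The sketch below (Python, standard library only) uses synthetic draws with hypothetical means and SDs in place of real PSA output, and the function names are illustrative.

```python
import random

def prob_at_least(draws, comparator=1.0):
    """Share of PSA draws at or above a fixed comparator (1x cash transfers)."""
    return sum(d >= comparator for d in draws) / len(draws)

def accept_by_probability(draws, comparator=1.0, cutoff=0.90):
    """Accept when we are at least `cutoff` sure the program beats the comparator."""
    return prob_at_least(draws, comparator) >= cutoff

# ASSUMPTION: synthetic draws standing in for PSA output of two hypothetical programs
rng = random.Random(1)
well_measured = [rng.gauss(3.2, 0.5) for _ in range(10_000)]
poorly_measured = [rng.gauss(3.2, 2.0) for _ in range(10_000)]

for name, draws in (("well measured", well_measured), ("poorly measured", poorly_measured)):
    print(f"{name}: P(CE >= 1) = {prob_at_least(draws):.1%}, "
          f"accepted at 90% cutoff: {accept_by_probability(draws)}")
```

Both hypothetical programs have the same mean of 3.2, but only the well-measured one clears a 90% cutoff.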

But we can potentially do a bit better.

Option 4) Use a probability-based threshold with a distributional comparator

Up until this point, we have been considering the cost-effectiveness of cash transfers themselves as being known and fixed. In reality, our estimate of the cost-effectiveness of cash transfers ALSO has uncertainty. And we can do to the cash transfers model exactly what we did for all the other models to get a *distribution* of cost-effectiveness values to compare to. In practice, the question becomes “What is the probability that a random draw from the distribution of uncertainty for our program of interest is ≥ a random draw from the distribution of uncertainty for cash transfers?”
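
A sketch of that comparison in Python (standard library only) follows. The two distributions are synthetic and hypothetical, standing in for PSA output from a program model and from a cash transfer model; independence between the two sets of draws is also an assumption.

```python
import random

def prob_program_beats_comparator(program_draws, comparator_draws, n_pairs=20_000, seed=1):
    """Estimate P(random program draw >= random comparator draw) by sampling pairs."""
    rng = random.Random(seed)
    wins = sum(rng.choice(program_draws) >= rng.choice(comparator_draws)
               for _ in range(n_pairs))
    return wins / n_pairs

# ASSUMPTION: synthetic draws; a real analysis would use PSA output from both models
rng = random.Random(2)
program_draws = [rng.gauss(3.2, 2.0) for _ in range(10_000)]
cash_draws = [rng.gauss(1.0, 0.3) for _ in range(10_000)]  # cash transfers are uncertain too

p_beats_cash = prob_program_beats_comparator(program_draws, cash_draws)
print(f"P(program draw >= cash draw) = {p_beats_cash:.1%}")
```

If the two models share inputs (for example, the same moral weights), the draws should be paired within each PSA iteration rather than sampled independently, so that shared uncertainty cancels out.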

This version makes the idea that we are operating under uncertainty explicit. We are not 100% sure about the cost-effectiveness of these interventions, and that is ok!

Alternative options

There are a plethora of alternative frameworks, models, and decisions. Chief among them is adopting a Bayesian framework for decision-making under uncertainty. Mechanically, these require a somewhat larger departure from GiveWell’s existing framework and may impede public communications and accessibility, so they are not presented here. GiveWell should, however, strongly consider these options, and it has in the past. Notably, this document shows how the blog post’s assumption of unbiasedness isn’t particularly reliable either.

Which option is best depends on what question(s) you want to answer

While we can see that GiveWell’s existing methods yield unreliable and biased results, choosing among the above options comes down to deciding: “What question(s) am I, a donor, most interested in?” The two probability-based thresholds imply a decision tradeoff: faced with the choice of giving to Organization A or giving cash, how sure am I that I should give to Organization A instead of GiveDirectly? The downside to this question is that it makes the decision binary; while you can rank programs by that probability, all very good programs will have a probability close to 1. That means that after some point, we can’t tell which programs are the best.

If we want to know “what is the *best* charity I can give to?” we have to incorporate the level of cost-effectiveness as well. While I have shown how ignoring uncertainty altogether leads to biased and unreliable answers to this question, there is some tuning to do and there are tradeoffs to be made. The lower bound approach in Option 2 can be considered a balance of both questions.

While tuning decision thresholds and choosing which options to use for decisions takes a bit of work, the good news is that they all come from the same PSA exercise. In some sense, the choice of which option to choose matters much less than incorporating uncertainty using any of these options (or others).

Additional benefits of uncertainty-forward models

Relationship to other common EA critiques

One common and extremely important critique of the EA community is a general level of overconfidence, exacerbated by the social and physical distance between EA researchers and the communities they impact. For example, the “moral value weights” reflect the moral values of the GiveWell researchers, which are not necessarily those of the communities they impact. An uncertainty-forward approach lets researchers express that they are unsure of what those values should be.

While there is no way to fully fix this problem without directly incorporating impacted communities into the decision-making process, uncertainty-forward approaches are a halfway measure toward relieving this tension.

How an uncertainty-forward modeling framework changes model decision-making

Up until this point, we have been assuming that the models themselves would be relatively unaffected by this change in framing. However, an uncertainty-forward framing also changes how you construct and evaluate a model. For example, it may result in researchers shifting focus to the variables with the greatest range of plausible uncertainty that can impact results. In some cases, the reliability of a model may come down to a single parameter that might not have been noticed had researchers been looking only at the point estimate. With this information, researchers can add their own subjective assessments about model reliability beyond the rote 3x rule.

One of the most difficult tasks as a modeler is deciding which parameters to include, how to model specific issues, etc. This is a time-consuming process, and at present GiveWell’s workflow results in one parameterization per model. Rather than choosing just one, an uncertainty-forward modeling approach lets GiveWell researchers relax that choice; they can (and should!) include several options if and when they are unsure about which is the “right” answer. Since we no longer care so much about the single best guess value, we don’t have to make quite as many assumptions.

Error-checking opportunities

No model is without error. Sensitivity analyses incorporated into the modeling process can help identify these issues earlier and more efficiently. Sensitivity analysis provides an additional semi-automated sanity check of model construction. If varying a parameter yields little change in outcomes, a change in an unexpected direction, too much change, etc., that can indicate a model error. Relatedly, see note about replicability.

Clearer communication with donors about the reliability of estimates

As a donor, I don’t want to only know what my best guess is, but also how sure I can be of that guess and decide accordingly. If I can do so as part of the normal process of model development, all the better. GiveWell has a responsibility to provide this information to donors, especially as in many cases it could result in donors being more sure of the organizations to which their donations go.

Beyond responsibility, uncertainty estimates are beneficial for the long-term credibility of GiveWell. Failure to communicate the uncertainty in estimates is a perennial problem. COVID-19 prediction models, for example, broadly advertised their “best guess” estimates, but rarely communicated the extent and nature of the uncertainty involved. As a result, nearly all of them failed to provide any useful information. GiveWell may be facing a similar problem; if GiveWell’s models are found to be less reliable than donors expect, they may (reasonably) trust GiveWell’s models less. That problem is avoidable by communicating and incorporating uncertainty by default.

Discussion

In this essay, I have demonstrated that:

  1. Without an uncertainty framework, GW’s models are
    1. Overconfident
    2. Biased toward more uncertain interventions
    3. Biased toward weaker interventions
  2. The uncertainty problem is large, serious, and cannot be ignored
  3. A probabilistic sensitivity analysis-based decision process addresses these issues
  4. A probabilistic sensitivity analysis is very doable within GiveWell’s workflow without enormous additional burden or changes to model building infrastructure. As part of this essay, much of that infrastructure work has already been done.

I have two strong recommendations: First, GiveWell should develop a framework to systematically engage with the uncertainty in its models and to document it. Second, GiveWell should revise its decision-making rules to explicitly include that uncertainty, such as using an 80% lower bound in the distribution of cost-effectiveness or setting the threshold as being a certain probability of exceeding GiveDirectly’s cost-effectiveness.

After developing the initial draft of the outline and apps, a colleague pointed out another essay that is extraordinarily similar in structure and recommendations to this one. No doubt other entries are similar as well. The reason for this is relatively straightforward: these are standard problems in cost-effectiveness, with relatively standard approaches to them. In fields that rely heavily on cost-effectiveness, PSA-based cost-effectiveness analysis or similar is performed by default, and is often required by the organizations publishing and using cost-effectiveness. Fortunately, that ubiquity means that there is a wealth of experience GiveWell can draw from in modifying its own procedures and models.

Making good decisions under uncertainty requires understanding the nature of the uncertainty. Some important topic areas are inherently more expensive or difficult to generate evidence for, and understanding how much and why can help us make better decisions. Importantly, uncertainty does not inherently mean inaction. Decisions do not need to be strictly rule-based, but exploring the uncertainty and applying some rules give us a starting point. That first uncertainty accounting step is necessary in order to deal with the fallacy that avoiding dealing with uncertainty makes decisions risk neutral. Some topic areas are inherently more difficult to measure but could be enormously effective, and uncertainty-forward modeling will likely move decisions away from those programs compared to where GiveWell’s decisions are at this moment. However, at this moment GiveWell’s models are skewed toward more uncertain, less effective programs. There are balances and tradeoffs.

While there is unquestionably a cost in time and effort involved with changing to an uncertainty-forward framing, the benefits of doing so are likely to be extraordinarily high. With experience, good guidance, and tuning, the long run additional time it would take to bring uncertainty into GiveWell’s models is likely to be relatively modest, and the value gained in strength of inference, reliability, and credibility is large.

To improve the reliability of its modeling, GiveWell must embrace uncertainty.

Acknowledgements

The following people provided constructive comments, suggestions, criticism, etc: 

Josh Ehrenberg, Alyssa Bilinski, Joshua Blake, Ryan Briggs, Caitlin Tulloch, Theiss Bendixen, Jonas Kristoffer Lindeløv, Richard Nerland, Paul Whalen, Christopher Boyer, Tanae Rao, and Karthik Tadepalli. If any errors are present, they are my own.

Critique, comments, disagreements, and everything in between are very welcome; after all, even this essay about uncertainty contains uncertainty.
