
Health research publications often conclude that the association between some exposure and outcome is “hypothesis generating,” particularly with regard to causal effects. The popularity of this conclusion strongly suggests a need for more efficient hypothesis-generating technologies, so that we can move on from “generating” hypotheses to the rigorous and critical work of actually testing them (or, where applicable, to an explicit and upfront acknowledgement that a reliable test was not or could not be performed). Fortunately, a solution to this problem was proposed by Dr. Philip Cole in 1993: The Hypothesis Generating Machine (Cole 1993). The Hypothesaurus is a realization of that dream.
Hypothesaurus is a list of virtually every combination of exposure, outcome, and (as a novel advancement upon Dr. Cole’s work) a linking relationship concept between the two in human health research. Because this list is precompiled, these hypotheses are already generated, and there is no additional need to generate them further.
Methods
The hypothesis list simply takes three input lists (outcomes, exposures, and linking concepts) and generates every combination of one item from each, where each combination is a hypothesis. However, to ensure a reasonable set of human health-related exposures, outcomes, and linking concepts, it is necessary to select reasonably well-curated input lists.
At this time, the primary source of input lists is Causal and Associational Language in Observational Health Research: A Systematic Evaluation (Haber et al., 2022). In this large-scale project, review teams extracted the exposures, outcomes, and linking phrases from 1,275 randomly selected articles concerning human health from among the highest-profile journals in medicine, public health, and epidemiology. Details of the review methods and processes are available in the publication, or at least as much of them as the journal's word limits would allow us to fit.
Hypothesaurus takes the human-curated lists of exposures, outcomes, and linking phrases copied from these articles, treats them as independent lists, and generates every combination of them (ignoring duplicates).
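As a rough illustration of the procedure described above, the sketch below deduplicates three lists and forms their Cartesian product. It is not the code from the repository; the function names and the toy list entries are hypothetical placeholders, and the real input lists come from the reviewed articles.

```python
from itertools import product


def unique(items):
    """Drop duplicates while preserving first-seen order."""
    return list(dict.fromkeys(items))


def generate_hypotheses(exposures, linking_phrases, outcomes):
    """Yield every exposure / linking phrase / outcome combination as a string."""
    for exposure, link, outcome in product(
        unique(exposures), unique(linking_phrases), unique(outcomes)
    ):
        yield f"{exposure} {link} {outcome}"


# Toy placeholder lists (hypothetical, not taken from the real dataset).
exposures = ["coffee consumption", "screen time", "coffee consumption"]
linking_phrases = ["is associated with", "predicts"]
outcomes = ["sleep quality", "blood pressure"]

hypotheses = list(generate_hypotheses(exposures, linking_phrases, outcomes))

# The number of hypotheses is the product of the three deduplicated list sizes.
expected_count = (
    len(unique(exposures)) * len(unique(linking_phrases)) * len(unique(outcomes))
)
print(len(hypotheses), expected_count)
```

Because the lists are treated as independent, the number of hypotheses grows multiplicatively: adding a single new linking phrase adds one new hypothesis for every exposure-outcome pair.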
Code and datasets are available at https://github.com/noahhaber/Hypothesaurus/
Results and data
The current version of the Hypothesaurus contains 1,181 exposures, 1,012 outcomes, and 402 linking phrases, yielding 482,849,488 hypotheses. A random sample of 20 appears below:

The full Hypothesaurus is available in compressed format as a .rar file (the official compression format of Hypothesaurus), containing an .RData file with a data frame consisting of the full Hypothesaurus.
The dataset is available here.
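For readers who only want a random sample like the one mentioned above, there is no need to hold every combination in memory. The sketch below draws random positions in the full list and decodes each one back into a hypothesis. It is one possible approach under stated assumptions, not the method used to produce the published sample; the function names and toy lists are hypothetical.

```python
import random


def hypothesis_at(index, exposures, linking_phrases, outcomes):
    """Decode a flat index into one combination (mixed-radix decoding)."""
    index, o = divmod(index, len(outcomes))
    e, l = divmod(index, len(linking_phrases))
    return f"{exposures[e]} {linking_phrases[l]} {outcomes[o]}"


def sample_hypotheses(k, exposures, linking_phrases, outcomes, seed=None):
    """Draw k distinct hypotheses uniformly at random without enumerating all of them."""
    total = len(exposures) * len(linking_phrases) * len(outcomes)
    rng = random.Random(seed)
    # Sampling from a range object does not materialize the whole range.
    indices = rng.sample(range(total), k)
    return [hypothesis_at(i, exposures, linking_phrases, outcomes) for i in indices]


# Toy placeholder lists (hypothetical, already deduplicated).
exposures = ["coffee consumption", "screen time"]
linking_phrases = ["is associated with", "predicts"]
outcomes = ["sleep quality", "blood pressure"]

sample = sample_hypotheses(3, exposures, linking_phrases, outcomes, seed=1)
for hypothesis in sample:
    print(hypothesis)
```

Each index in the range maps to exactly one combination, so sampling indices without replacement yields distinct hypotheses.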
Discussion
This work generated and documented over 480 million hypotheses related to human health. Now that this work is done, there is no need for more hypothesis generation, except to add more exposures, outcomes, and linking concepts to this list. We can now do the real work of evaluating and testing these relationships, particularly where causal inference is required for the results to be useful.
Works cited
Philip Cole. The Hypothesis Generating Machine. Epidemiology, May 1993; 4(3): 271-273.
Noah A Haber, Sarah E Wieten, Julia M Rohrer, Onyebuchi A Arah, Peter W G Tennant, Elizabeth A Stuart, Eleanor J Murray, Sophie Pilleron, Sze Tung Lam, Emily Riederer, Sarah Jane Howcutt, Alison E Simmons, Clémence Leyrat, Philipp Schoenegger, Anna Booman, Mi-Suk Kang Dufour, Ashley L O’Donoghue, Rebekah Baglini, Stefanie Do, Mari De La Rosa Takashima, Thomas Rhys Evans, Daloha Rodriguez-Molina, Taym M Alsalti, Daniel J Dunleavy, Gideon Meyerowitz-Katz, Alberto Antonietti, Jose A Calvache, Mark J Kelson, Meg G Salvia, Camila Olarte Parra, Saman Khalatbari-Soltani, Taylor McLinden, Arthur Chatton, Jessie Seiler, Andreea Steriu, Talal S Alshihayb, Sarah E Twardowski, Julia Dabravolskaj, Eric Au, Rachel A Hoopsick, Shashank Suresh, Nicholas Judd, Sebastián Peña, Cathrine Axfors, Palwasha Khan, Ariadne E Rivera Aguirre, Nnaemeka U Odo, Ian Schmid, Matthew P Fox. Causal and Associational Language in Observational Health Research: A Systematic Evaluation. American Journal of Epidemiology, 2022; kwac137. https://doi.org/10.1093/aje/kwac137
Acknowledgements
This work uses the data from a very real and serious effort by nearly 50 coauthors involved in the causal language study linked above. In addition to the coauthors (Sarah E Wieten, Julia M Rohrer, Onyebuchi A Arah, Peter W G Tennant, Elizabeth A Stuart, Eleanor J Murray, Sophie Pilleron, Sze Tung Lam, Emily Riederer, Sarah Jane Howcutt, Alison E Simmons, Clémence Leyrat, Philipp Schoenegger, Anna Booman, Mi-Suk Kang Dufour, Ashley L O’Donoghue, Rebekah Baglini, Stefanie Do, Mari De La Rosa Takashima, Thomas Rhys Evans, Daloha Rodriguez-Molina, Taym M Alsalti, Daniel J Dunleavy, Gideon Meyerowitz-Katz, Alberto Antonietti, Jose A Calvache, Mark J Kelson, Meg G Salvia, Camila Olarte Parra, Saman Khalatbari-Soltani, Taylor McLinden, Arthur Chatton, Jessie Seiler, Andreea Steriu, Talal S Alshihayb, Sarah E Twardowski, Julia Dabravolskaj, Eric Au, Rachel A Hoopsick, Shashank Suresh, Nicholas Judd, Sebastián Peña, Cathrine Axfors, Palwasha Khan, Ariadne E Rivera Aguirre, Nnaemeka U Odo, Ian Schmid, Matthew P Fox) this effort involved many many contributions from others across the globe.
Blame
This is mostly Dr. Darren Dahly’s fault.