
Scientific Hypotheses: Writing, Promoting, and Predicting Implications

Armen Yuri Gasparyan

1 Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, West Midlands, UK.

Lilit Ayvazyan

2 Department of Medical Chemistry, Yerevan State Medical University, Yerevan, Armenia.

Ulzhan Mukanova

3 Department of Surgical Disciplines, South Kazakhstan Medical Academy, Shymkent, Kazakhstan.

Marlen Yessirkepov

4 Department of Biology and Biochemistry, South Kazakhstan Medical Academy, Shymkent, Kazakhstan.

George D. Kitas

5 Arthritis Research UK Epidemiology Unit, University of Manchester, Manchester, UK.

Scientific hypotheses are essential for progress in rapidly developing academic disciplines. Proposing new ideas and hypotheses requires thorough analysis of evidence-based data and prediction of the implications. One of the main concerns relates to the ethical implications of the generated hypotheses. The authors may need to outline potential benefits and limitations of their suggestions and target widely visible publication outlets to ignite discussion by experts and start testing the hypotheses. Few publication outlets currently welcome hypotheses and unconventional ideas, which may open the gates to criticism and conservative remarks. A few scholarly journals guide the authors on how to structure hypotheses. Reflecting on general and specific issues around the subject matter is often recommended for drafting a well-structured hypothesis article. An analysis of influential hypotheses presented in this article, particularly Strachan's hygiene hypothesis with its global implications in the field of immunology and allergy, points to the need for properly interpreting and testing new suggestions. The ethical implications of hypotheses should be considered by both authors and journal editors during the writing and publishing process.

INTRODUCTION

We live in times of digitization that radically changes scientific research, reporting, and publishing strategies. Researchers all over the world are overwhelmed with processing large volumes of information and searching through numerous online platforms, all of which make the whole process of scholarly analysis and synthesis complex and sophisticated.

Current research activities are diversifying to combine scientific observations with analysis of facts recorded by scholars from various professional backgrounds. 1 Citation analyses and networking on social media are also becoming essential for shaping research and publishing strategies globally. 2 Learning specifics of increasingly interdisciplinary research studies and acquiring information facilitation skills aid researchers in formulating innovative ideas and predicting developments in interrelated scientific fields.

Arguably, researchers are currently offered more opportunities than in the past for generating new ideas by performing their routine laboratory activities, observing individual cases and unusual developments, and critically analyzing published scientific facts. What they need at the start of their research is to formulate a scientific hypothesis that revisits conventional theories, real-world processes, and related evidence to propose new studies and test ideas in an ethical way. 3 Such a hypothesis can be of most benefit if published in an ethical journal with wide visibility and exposure to relevant online databases and promotion platforms.

Although hypotheses are crucially important for scientific progress, only a few highly skilled researchers formulate and eventually publish their innovative ideas per se. Understandably, in an increasingly competitive research environment, most authors prefer to prioritize their ideas by discussing and testing them in their own laboratories or clinical departments and publishing research reports afterwards. However, there are instances when simple observations and single-center research studies cannot explain and test new groundbreaking ideas. Formulating hypothesis articles first and calling for multicenter and interdisciplinary research can be a solution in such instances, potentially launching influential scientific directions, if not academic disciplines.

The aim of this article is to overview the importance and implications of infrequently published scientific hypotheses that may open new avenues of thinking and research.

Despite the seemingly established views on innovative ideas and hypotheses as essential research tools, no structured definition exists to tag the term and systematically track related articles. In 1973, the Medical Subject Heading (MeSH) of the U.S. National Library of Medicine introduced “Research Design” as a structured keyword that referred to the importance of collecting data and properly testing hypotheses, and indirectly linked the term to ethics, methods and standards, among many other subheadings.

One of the experts in the field defines “hypothesis” as a well-argued analysis of available evidence to provide a realistic (scientific) explanation of existing facts, fill gaps in public understanding of sophisticated processes, and propose a new theory or a test. 4 A hypothesis can be proven wrong partially or entirely. However, even such an erroneous hypothesis may influence progress in science by initiating professional debates that help generate more realistic ideas. The main ethical requirement for hypothesis authors is to be honest about the limitations of their suggestions. 5

EXAMPLES OF INFLUENTIAL SCIENTIFIC HYPOTHESES

Daily routine in a research laboratory may lead to groundbreaking discoveries, provided the daily accounts are comprehensively analyzed and reproduced by peers. The discovery of penicillin by Sir Alexander Fleming (1928) is a prime example of such discoveries, introducing therapies to treat staphylococcal and streptococcal infections and to modulate blood coagulation. 6 , 7 Penicillin gained worldwide recognition due to the inventor's seminal works published in highly prestigious and widely visible British journals, effective ‘real-world’ antibiotic therapy of pneumonia and wounds during World War II, and euphoric media coverage. 8 In 1945, Fleming, Florey and Chain received a much-deserved Nobel Prize in Physiology or Medicine for the discovery that led to the mass production of the wonder drug in the U.S. and to ‘real-world practice’ that tested the use of penicillin. What remained globally unnoticed is that Zinaida Yermolyeva, the outstanding Soviet microbiologist, created the Soviet penicillin, which turned out to be more effective than the Anglo-American penicillin and entered mass production in 1943; that year marked the turning of the tide of the Great Patriotic War. 9 One reason Yermolyeva's discovery went largely unnoticed is that her works were published exclusively in local Russian (Soviet) journals.

The past decades have been marked by an unprecedented growth of multicenter and global research studies involving hundreds and thousands of human subjects. This trend is shaped by an increasing number of reports on clinical trials and large cohort studies that create a strong evidence base for practice recommendations. Mega-studies may help generate and test large-scale hypotheses aiming to solve health issues globally. Properly designed epidemiological studies, for example, may introduce clarity to the hygiene hypothesis that was originally proposed by David Strachan in 1989. 10 David Strachan studied the epidemiology of hay fever in a cohort of 17,414 British children and concluded that declining family size and improved personal hygiene had reduced the chances of cross infections in families, resulting in epidemics of atopic disease in post-industrial Britain. Over the past three decades, several related hypotheses have been proposed to expand the potential role of symbiotic microorganisms and parasites in the development of human physiological immune responses early in life and protection from allergic and autoimmune diseases later on. 11 , 12 Given the popularity and the scientific importance of the hygiene hypothesis, it was introduced as a MeSH term in 2012. 13

Hypotheses can be proposed based on an analysis of recorded historic events that resulted in mass migrations and the spread of certain genetic diseases. As a prime example, familial Mediterranean fever (FMF), the prototype periodic fever syndrome, is believed to have spread from Mesopotamia to the Mediterranean region and all over Europe due to migrations and religious persecutions millennia ago. 14 Genetic mutations underlying mild clinical forms of FMF are hypothesized to have emerged and persisted in the Mediterranean region as protective factors against more serious infectious diseases, particularly tuberculosis, historically common in that part of the world. 15 The speculation over the advantages of carrying the MEditerranean FeVer (MEFV) gene mutations is further strengthened by recorded low mortality rates from tuberculosis among FMF patients of different nationalities living in Tunisia in the first half of the 20th century. 16

Diagnostic hypotheses shedding light on peculiarities of diseases throughout the history of mankind can be formulated using artefacts, particularly historic paintings. 17 Such paintings may reveal joint deformities and disfigurements due to rheumatic diseases in individual subjects. A series of paintings with similar signs of pathological conditions, interpreted in a historic context, may uncover the mysteries of epidemics of certain diseases, as is the case with Rubens' paintings depicting signs of rheumatic hands, which have led some doctors to believe that rheumatoid arthritis was common in Europe in the 16th and 17th centuries. 18

WRITING SCIENTIFIC HYPOTHESES

A few journals provide author instructions that specifically guide how to structure and format submissions categorized as hypotheses and how to make them attractive. One example is presented by Med Hypotheses , the flagship journal in its field with more than four decades of publishing and influencing hypothesis authors globally. However, such guidance is not based on widely discussed, implemented, and approved reporting standards, which are becoming mandatory for all scholarly journals.

Generating new ideas and scientific hypotheses is a sophisticated task, since not all researchers and authors are skilled in planning, conducting, and interpreting various research studies. Some experience with formulating focused research questions and strong working hypotheses for original research studies is definitely helpful for advancing critical appraisal skills. However, aspiring authors of scientific hypotheses may need something different, more related to discerning scientific facts, pooling homogeneous data from primary research works, and synthesizing new information in a systematic way by analyzing similar sets of articles. To some extent, this activity is reminiscent of writing narrative and systematic reviews. As in the case of reviews, scientific hypotheses need to be formulated on the basis of comprehensive search strategies that retrieve all available studies on the topics of interest, with new information then synthesized by selectively referring to the most relevant items. One of the main differences between scientific hypothesis and review articles relates to the volume of supportive literature sources ( Table 1 ). In fact, a hypothesis is usually formulated by referring to a few scientific facts or compelling evidence derived from a handful of literature sources. 19 By contrast, reviews require analyses of a large number of published documents retrieved from several well-organized and evidence-based databases in accordance with predefined search strategies. 20 , 21 , 22

The format of hypotheses, especially the implications part, may vary widely across disciplines. Clinicians may limit their suggestions to the clinical manifestations of diseases, outcomes, and management strategies. Basic and laboratory scientists analysing genetic, molecular, and biochemical mechanisms may need to view beyond the frames of their narrow fields and predict social and population-based implications of the proposed ideas. 23

Advanced writing skills are essential for presenting an interesting theoretical article which appeals to the global readership. Merely listing opposing facts and ideas, without proper interpretation and analysis, may distract the experienced readers. The essence of a great hypothesis is a story behind the scientific facts and evidence-based data.

ETHICAL IMPLICATIONS

The authors of hypotheses substantiate their arguments by referring to and discerning rational points from published articles that might be overlooked by others. Their arguments may contradict the established theories and practices, and pose global ethical issues, particularly when more or less efficient medical technologies and public health interventions are devalued. The ethical issues may arise primarily because of the careless references to articles with low priorities, inadequate and apparently unethical methodologies, and concealed reporting of negative results. 24 , 25

Misinterpretation and misunderstanding of published ideas and scientific hypotheses may complicate the issue further. For example, Alexander Fleming, whose innovative idea of using penicillin to kill susceptible bacteria saved millions of lives, warned of the consequences of uncontrolled prescription of the drug. The issue of antibiotic resistance emerged within the first ten years of penicillin use on a global scale, as overprescription reduced the efficacy of antibiotic therapies, with undesirable consequences for millions. 26

The misunderstanding of the hygiene hypothesis, which primarily aimed to shed light on the role of the microbiome in allergic and autoimmune diseases, resulted in a decline of public confidence in hygiene with dire societal implications, forcing some experts to abandon the original idea. 27 , 28 Although that hypothesis is unrelated to the issue of vaccination, public misunderstanding has contributed to a decline in vaccination at a time of an upsurge of old and new infections.

A number of ethical issues were posed by the denial of the viral (human immunodeficiency virus; HIV) hypothesis of acquired immunodeficiency syndrome (AIDS) by Peter Duesberg, who reviewed the links of illicit recreational drugs and antiretroviral therapies with AIDS and refuted the etiological role of HIV. 29 That controversial hypothesis was rejected by several journals but was eventually published without external peer review in Med Hypotheses in 2010. The publication itself raised concerns about the unconventional editorial policy of the journal, causing major perturbations and prompting more scrutinized publishing policies at journals processing hypotheses.

WHERE TO PUBLISH HYPOTHESES

Although scientific authors are currently well informed and equipped with search tools to draft evidence-based hypotheses, there are still few quality publication outlets calling for related articles. Journal editors may be hesitant to publish articles that do not adhere to any research reporting guidelines and that open the gates to harsh criticism of unconventional and untested ideas. Occasionally, editors opting for open-access publishing and upgrading their ethics regulations launch a section to selectively publish scientific hypotheses attractive to experienced readers. 30 However, the absence of approved standards for this article type, particularly the lack of any mandate to outline potential ethical implications, may lead to the publication of potentially harmful ideas in an attractive format.

A suggestion of simultaneously publishing multiple or alternative hypotheses to balance the reader views and feedback is a potential solution for the mainstream scholarly journals. 31 However, that option alone is hardly applicable to emerging journals with unconventional quality checks and peer review, accumulating papers with multiple rejections by established journals.

A large group of experts consider hypotheses with improbable and controversial ideas publishable after formal editorial (in-house) checks, to preserve the authors' genuine ideas and avoid conservative amendments imposed by external peer reviewers. 32 That approach may be acceptable for established publishers with large teams of experienced editors. However, the same approach can lead to dire consequences if employed by nonselective start-up, open-access journals that process all types of articles and primarily accept those with charged publication fees. 33 In fact, hardly testable pseudoscientific ideas disputing Newton's and Einstein's seminal works or denying climate change have already found their niche in substandard electronic journals with soft or nonexistent peer review. 34

CITATIONS AND SOCIAL MEDIA ATTENTION

The available preliminary evidence points to the attractiveness of hypothesis articles for readers, particularly those from research-intensive countries who actively download related documents. 35 However, citations of such articles are disproportionately low. Only a small proportion of top-downloaded hypotheses (13%) in the highly prestigious Med Hypotheses receive on average 5 citations per article within a two-year window. 36

With the exception of a few historic papers, the vast majority of hypotheses attract a relatively small number of citations in the long term. 36 Plausible explanations are that these articles often contain only one or a few citable points, and that the research studies suggested to test the hypotheses are rarely conducted and reported, limiting the chances of citing and crediting the authors of genuine research ideas.

A snapshot analysis of citation activity of hypothesis articles may reveal the interest of the global scientific community in their implications across various disciplines and countries. As a prime example, Strachan's hygiene hypothesis, published in 1989, 10 is still attracting numerous citations on Scopus, the largest bibliographic database. As of August 28, 2019, the number of linked citations in the database is 3,201. Of the citing articles, 160 are cited at least 160 times ( h -index of this research topic = 160). The first three citations were recorded in 1992, followed by a rapid annual increase in citation activity and a peak of 212 citations in 2015 ( Fig. 1 ). The top 5 sources of the citations are Clin Exp Allergy (n = 136), J Allergy Clin Immunol (n = 119), Allergy (n = 81), Pediatr Allergy Immunol (n = 69), and PLOS One (n = 44). The top 5 citing authors are leading experts in pediatrics and allergology: Erika von Mutius (Munich, Germany; number of publications with the index citation = 30), Erika Isolauri (Turku, Finland; n = 27), Patrick G. Holt (Subiaco, Australia; n = 25), David P. Strachan (London, UK; n = 23), and Bengt Björksten (Stockholm, Sweden; n = 22). The U.S. is the leading country in terms of citation activity with 809 related documents, followed by the UK (n = 494), Germany (n = 314), Australia (n = 211), and the Netherlands (n = 177). The largest proportion of citing documents are articles (n = 1,726, 54%), followed by reviews (n = 950, 29.7%) and book chapters (n = 213, 6.7%). The main subject areas of the citing items are medicine (n = 2,581, 51.7%), immunology and microbiology (n = 1,179, 23.6%), and biochemistry, genetics and molecular biology (n = 415, 8.3%).

[Fig. 1. Annual citation activity of Strachan's hygiene hypothesis on Scopus (image file: jkms-34-e300-g001.jpg).]
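The h-index quoted above (160 citing articles each cited at least 160 times) can be sketched in a few lines of code; the citation counts below are illustrative, not the Scopus data:

```python
def h_index(citations):
    """Return the largest h such that h items each have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the top `rank` items all have >= rank citations
        else:
            break
    return h

# Five illustrative papers with 10, 8, 5, 4, and 3 citations: h-index is 4,
# because the 4 most-cited papers each have at least 4 citations.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The same definition underlies the topic-level figure reported for the hygiene hypothesis: sorting the citing articles by their own citation counts and finding the crossover rank.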

Interestingly, a recent analysis of 111 publications related to Strachan's hygiene hypothesis, stating that the lack of exposure to infections in early life increases the risk of rhinitis, revealed a selection bias of 5,551 citations on Web of Science. 37 The articles supportive of the hypothesis were cited more than nonsupportive ones (odds ratio adjusted for study design, 2.2; 95% confidence interval, 1.6–3.1). A similar conclusion pointing to a citation bias distorting bibliometrics of hypotheses was reached by an earlier analysis of a citation network linked to the idea that β-amyloid, which is involved in the pathogenesis of Alzheimer disease, is produced by skeletal muscle of patients with inclusion body myositis. 38 The results of both studies are in line with the notion that ‘positive’ citations are more frequent in the field of biomedicine than ‘negative’ ones, and that citations to articles with proven hypotheses are too common. 39
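For readers less familiar with the statistic reported above, the following is a minimal sketch of an unadjusted odds ratio with a Wald 95% confidence interval; note that the cited study adjusted for study design, which this sketch does not, and the counts used here are hypothetical:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI for a 2x2 table:
    a = supportive articles cited,  b = supportive articles not cited,
    c = nonsupportive cited,        d = nonsupportive not cited."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

# Hypothetical counts: the odds of being cited are 4 times higher
# for supportive than for nonsupportive articles.
or_, lo, hi = odds_ratio_ci(40, 10, 20, 20)
print(f"OR = {or_:.1f}, 95% CI {lo:.1f}-{hi:.1f}")
```

A confidence interval whose lower bound exceeds 1 (as in the cited study, 1.6–3.1) indicates that supportive articles are cited significantly more often than nonsupportive ones.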

Social media channels are playing an increasingly active role in the generation and evaluation of scientific hypotheses. In fact, publicly discussing research questions on aggregation platforms, such as Reddit, may shape hypotheses on health-related issues of global importance, such as obesity. 40 Analyzing Twitter comments, researchers may reveal both potentially valuable ideas and unfounded claims that surround groundbreaking research ideas. 41 Social media activities, however, are unevenly distributed across different research topics, journals and countries, and they are not always objective professional reflections of the breakthroughs in science. 2 , 42

Scientific hypotheses are essential for progress in science and advances in healthcare. Innovative ideas should be based on a critical overview of related scientific facts and evidence-based data, often overlooked by others. To generate realistic hypothetical theories, authors should comprehensively analyze the literature and suggest relevant and ethically sound designs for future studies. They should also consider their hypotheses in the context of the research and publication ethics norms acceptable to their target journals. Journal editors aiming to diversify their portfolio by maintaining or introducing a hypotheses section are in a position to upgrade their guidelines for related articles by pointing to general and specific analyses of the subject, preferred study designs to test hypotheses, and ethical implications. The latter is closely related to the specifics of the hypotheses. For example, editorial recommendations to outline the benefits and risks of a new laboratory test or therapy may result in a more balanced article and minimize associated risks afterwards.

Not all scientific hypotheses have immediate positive effects. Some, if not most, are never tested in properly designed research studies and never cited in credible and indexed publication outlets. Hypotheses in specialized scientific fields, particularly those hardly understandable to nonexperts, lose their attractiveness for an increasingly interdisciplinary audience. The authors' honest analysis of the benefits and limitations of their hypotheses, and the concerted efforts of all stakeholders in science communication to initiate public discussion on widely visible platforms and social media, may reveal the rational points and caveats of new ideas.

Disclosure: The authors have no potential conflicts of interest to disclose.

Author Contributions:

  • Conceptualization: Gasparyan AY, Yessirkepov M, Kitas GD.
  • Methodology: Gasparyan AY, Mukanova U, Ayvazyan L.
  • Writing - original draft: Gasparyan AY, Ayvazyan L, Yessirkepov M.
  • Writing - review & editing: Gasparyan AY, Yessirkepov M, Mukanova U, Kitas GD.

PMID: 31760713; PMCID: PMC6875436; DOI: 10.3346/jkms.2019.34.e300

Keywords: Bibliographic Databases; Hypothesis; Impact; Peer Review; Research Ethics; Writing.

© 2019 The Korean Academy of Medical Sciences.
The potential of working hypotheses for deductive exploratory research

Open access | Published: 08 December 2020 | Volume 55, pages 1703–1725 (2021)

Mattia Casula (ORCID: 0000-0002-7081-8153), Nandhini Rangarajan, and Patricia Shields (ORCID: 0000-0002-0960-4869)

While hypotheses frame explanatory studies and provide guidance for measurement and statistical tests, deductive, exploratory research does not have a framing device like the hypothesis. To this end, this article examines the landscape of deductive, exploratory research and offers the working hypothesis as a flexible, useful framework that can guide and bring coherence across the steps in the research process. The working hypothesis conceptual framework is introduced, placed in a philosophical context, defined, and applied to public administration and comparative public policy. In doing so, this article explains: the philosophical underpinning of exploratory, deductive research; how the working hypothesis informs the methodologies and evidence collection of deductive, exploratory research; the nature of micro-conceptual frameworks for deductive exploratory research; and how the working hypothesis informs data analysis when exploratory research is deductive.


1 Introduction

Exploratory research is generally considered to be inductive and qualitative (Stebbins 2001 ). Exploratory qualitative studies adopting an inductive approach do not lend themselves to a priori theorizing and building upon prior bodies of knowledge (Reiter 2013 ; Bryman 2004 as cited in Pearse 2019 ). Juxtaposed against quantitative studies that employ deductive confirmatory approaches, exploratory qualitative research is often criticized for lack of methodological rigor and tentativeness in results (Thomas and Magilvy 2011 ). This paper focuses on the neglected topic of deductive, exploratory research and proposes working hypotheses as a useful framework for these studies.

To emphasize that certain types of applied research lend themselves more easily to deductive approaches, to address the downsides of exploratory qualitative research, and to ensure qualitative rigor in exploratory research, a significant body of work on deductive qualitative approaches has emerged (see for example, Gilgun 2005 , 2015 ; Hyde 2000 ; Pearse 2019 ). According to Gilgun ( 2015 , p. 3) the use of conceptual frameworks derived from comprehensive reviews of literature and a priori theorizing were common practices in qualitative research prior to the publication of Glaser and Strauss’s ( 1967 ) The Discovery of Grounded Theory . Gilgun ( 2015 ) coined the term Deductive Qualitative Analysis (DQA) to arrive at some sort of “middle-ground” such that the benefits of a priori theorizing (structure) and allowing room for new theory to emerge (flexibility) are reaped simultaneously. According to Gilgun ( 2015 , p. 14) “in DQA, the initial conceptual framework and hypotheses are preliminary. The purpose of DQA is to come up with a better theory than researchers had constructed at the outset (Gilgun 2005 , 2009 ). Indeed, the production of new, more useful hypotheses is the goal of DQA”.

DQA provides a greater level of structure for both the experienced and novice qualitative researcher (see for example Pearse 2019 ; Gilgun 2005 ). According to Gilgun ( 2015 , p. 4) “conceptual frameworks are the sources of hypotheses and sensitizing concepts”. Sensitizing concepts frame the exploratory research process and guide the researcher’s data collection and reporting efforts. Pearse ( 2019 ) discusses the usefulness of deductive thematic analysis and pattern matching for guiding DQA in business research. Gilgun ( 2005 ) discusses the usefulness of DQA for family research.

Given these rationales for DQA in exploratory research, the overarching purpose of this paper is to contribute to that growing corpus of work on deductive qualitative research. This paper is specifically aimed at guiding novice researchers and student scholars to the working hypothesis as a useful a priori framing tool. The applicability of the working hypothesis as a tool that provides more structure during the design and implementation phases of exploratory research is discussed in detail. Examples of research projects in public administration that use the working hypothesis as a framing tool for deductive exploratory research are provided.

In the next section, we introduce the three types of research purposes. Second, we examine the nature of the exploratory research purpose. Third, we provide a definition of the working hypothesis. Fourth, we explore the philosophical roots of methodology to see where exploratory research fits. Fifth, we connect the discussion to the dominant research approaches (quantitative, qualitative and mixed methods) to see where deductive exploratory research fits. Sixth, we examine the nature of theory and the role of the hypothesis in theory, contrasting formal hypotheses with working hypotheses. Seventh, we provide examples of student and scholarly work that illustrate how working hypotheses are developed and operationalized. Lastly, this paper synthesizes the previous discussion with concluding remarks.

2 Three types of research purposes

The literature identifies three basic types of research purposes—explanation, description and exploration (Babbie 2007 ; Adler and Clark 2008 ; Strydom 2013 ; Shields and Whetsell 2017 ). Research purposes are similar to research questions; however, they focus on project goals or aims instead of questions.

Explanatory research answers the “why” question (Babbie 2007, pp. 89–90), by explaining “why things are the way they are” and by looking “for causes and reasons” (Adler and Clark 2008, p. 14). Explanatory research is closely tied to hypothesis testing. Theory is tested using deductive reasoning, which goes from the general to the specific (Hyde 2000, p. 83). Hypotheses provide a frame for explanatory research, connecting the research purpose to other parts of the research process (variable construction, choice of data, statistical tests). They help provide alignment or coherence across stages in the research process and provide ways to critique the strengths and weaknesses of the study. For example, were the hypotheses grounded in the appropriate arguments and evidence in the literature? Are the concepts embedded in the hypotheses appropriately measured? Was the best statistical test used? When the analysis is complete (the hypothesis is tested), the results generally answer the research question (the evidence supported or failed to support the hypothesis) (Shields and Rangarajan 2013).

Descriptive research addresses the “what” question and is not primarily concerned with causes (Strydom 2013; Shields and Tajalli 2006). It lies at the “midpoint of the knowledge continuum” (Grinnell 2001, p. 248) between exploration and explanation. Descriptive research is used in both quantitative and qualitative research. A field researcher might want to “have a more highly developed idea of social phenomena” (Strydom 2013, p. 154) and develop thick descriptions using inductive logic. In science, categorization and classification systems such as the periodic table of chemistry or the taxonomies of biology inform descriptive research. These baseline classification systems are a type of theorizing and allow researchers to answer questions like “what kind” of plants and animals inhabit a forest. The answer to this question would usually be displayed in graphs and frequency distributions. This is also the data presentation system used in the social sciences (Ritchie and Lewis 2003; Strydom 2013). For example, suppose a scholar asked, what are the needs of homeless people? A quantitative approach would include a survey that incorporated a “needs” classification system (preferably based on a literature review). The data would be displayed as frequency distributions or charts. Description can also be guided by inductive reasoning, which draws “inferences from specific observable phenomena to general rules or knowledge expansion” (Worster 2013, p. 448). Theory and hypotheses are generated using inductive reasoning, which begins with data and the intention of making sense of it by theorizing. An inductive descriptive approach would use a qualitative, naturalistic design (open-ended interview questions with the homeless population). The data could provide a thick description of the homeless context. For deductive descriptive research, categories serve a purpose similar to that of hypotheses in explanatory research. If developed with thought and a connection to the literature, categories can serve as a framework that informs measurement and links to data collection mechanisms and data analysis. Like hypotheses, they can provide horizontal coherence across the steps in the research process.

Table 1 demonstrates these connections for deductive descriptive and explanatory research. The arrow at the top highlights the horizontal, across-the-research-process view we emphasize. This article makes the case that the working hypothesis can serve the same purpose for deductive exploratory research as the hypothesis serves for deductive explanatory research and categories serve for deductive descriptive research. The cells for exploratory research are filled in with question marks.

The remainder of this paper focuses on exploratory research and the answers to questions found in the table:

What is the philosophical underpinning of exploratory, deductive research?

What is the micro-conceptual framework for deductive exploratory research? [As is clear from the article title, we introduce the working hypothesis as the answer.]

How does the working hypothesis inform the methodologies and evidence collection of deductive exploratory research?

How does the working hypothesis inform data analysis of deductive exploratory research?

3 The nature of exploratory research purpose

Explorers enter the unknown to discover something new. The process can be fraught with struggle and surprises. Effective explorers creatively resolve unexpected problems. While we typically think of explorers as pioneers or mountain climbers, exploration is very much linked to the experience and intention of the explorer. Babies explore as they take their first steps. The exploratory purpose resonates with these insights. Exploratory research, like reconnaissance, is a type of inquiry that is in the preliminary or early stages (Babbie 2007). It is associated with discovery, creativity and serendipity (Stebbins 2001). But the person doing the discovering also defines the activity or claims the act of exploration. It “typically occurs when a researcher examines a new interest or when the subject of study itself is relatively new” (Babbie 2007, p. 88). Hence, exploration has an open character that emphasizes “flexibility, pragmatism, and the particular, biographically specific interests of an investigator” (Maanen et al. 2001, p. v). These three purposes form a type of hierarchy. An area of inquiry is initially explored. This early work lays the ground for description, which in turn becomes the basis for explanation. Quantitative, explanatory studies dominate contemporary high-impact journals (Twining et al. 2017).

Stebbins (2001) makes the point that exploration is often seen as something like a poor stepsister to confirmatory or hypothesis-testing research. He takes issue with this because we live in a changing world: what is settled today will very likely be unsettled in the near future and in need of exploration. Further, exploratory research “generates initial insights into the nature of an issue and develops questions to be investigated by more extensive studies” (Marlow 2005, p. 334). Exploration is widely applicable because all research topics were once “new”. Moreover, all research topics have the possibility of “innovation” or ongoing “newness”. Exploratory research may be appropriate to establish whether a phenomenon exists (Strydom 2013). The point here, of course, is that the exploratory purpose is far from trivial.

Stebbins’s Exploratory Research in the Social Sciences (2001) is the only book devoted to the nature of exploratory research as a form of social science inquiry. He views it as a “broad-ranging, purposive, systematic prearranged undertaking designed to maximize the discovery of generalizations leading to description and understanding of an area of social or psychological life” (p. 3). It is science conducted in a way distinct from confirmation. According to Stebbins (2001, p. 6), the goal is the discovery of potential generalizations, which can become future hypotheses and eventually theories that emerge from the data. He focuses on inductive logic (which stimulates creativity) and qualitative methods. He does not want exploratory research limited to the restrictive formulas and models he finds in confirmatory research. He links exploratory research to Glaser and Strauss’s (1967) flexible, immersive grounded theory. Strydom’s (2013) analysis of contemporary social work research methods books echoes Stebbins’s (2001) position. Stebbins’s book is an important contribution, but it limits the potential scope of this flexible and versatile research purpose. If we accepted his conclusion, we would delete the “Exploratory” row from Table 1.

Note that explanatory research can yield new questions, which lead to exploration. Inquiry is a process where inductive and deductive activities can occur simultaneously or in a back-and-forth manner, particularly as the literature is reviewed and the research design emerges. Footnote 1 Strict typologies such as explanation/description/exploration or inductive/deductive can obscure these larger connections and processes. We draw insight from Dewey’s (1896) vision of inquiry as depicted in his seminal “Reflex Arc” article. He notes that “stimulus” and “response”, like other dualities (inductive/deductive), exist within a larger unifying system. Yet the terms have value. “We need not abandon terms like stimulus and response, so long as we remember that they are attached to events based upon their function in a wider dynamic context, one that includes interests and aims” (Hildebrand 2008, p. 16). So too in methodology: typologies such as deductive/inductive capture useful distinctions with practical value and are widely used in the methodology literature.

We argue that there is a role for exploratory, deductive, and confirmatory research. We maintain that all types of research logics and methods should be in the toolbox of exploratory research. First, as stated above, it makes little sense on its face to identify an extremely flexible purpose that is idiosyncratic to the researcher and then restrict its use to qualitative, inductive, non-confirmatory methods. Second, Stebbins’s (2001) work focused on social science, ignoring the policy sciences. Exploratory research can be ideal for immediate practical problems faced by policy makers, who could find a framework of some kind useful. Third, deductive, exploratory research is more intentionally connected to previous research: some kind of initial framing device is located or designed using the literature. This may be very important for new scholars who are developing research skills and exploring their field and profession. Stebbins’s insights are most pertinent for experienced scholars. Fourth, frameworks and deductive logic are useful for comparative work because some degree of consistency across cases is built into the design.

As we have seen, the hypotheses of explanatory research and the categories of descriptive research are the dominant frames of social science and policy science. We certainly concur that neither of these frames makes much sense for exploratory research; they would tend to tie it down. We see the problem as a missing framework, or missing way to frame deductive, exploratory research, in the methodology literature. Inductive exploratory research would not work for many case studies that are trying to use evidence to make an argument. What exploratory deductive case studies need is a framework that incorporates flexibility. This is even more true for comparative case studies. A framework of this sort could be usefully applied to policy research (Casula 2020a), particularly evaluative policy research, and applied research generally. We propose the working hypothesis as a flexible conceptual framework and as a useful tool for doing exploratory studies. It can be used as an evaluative criterion, particularly for process evaluation, and is useful for student research because students can develop theorizing skills using the literature.

Table  1 included a column specifying the philosophical basis for each research purpose. Shifting gears to the philosophical underpinning of methodology provides useful additional context for examination of deductive, exploratory research.

4 What is a working hypothesis?

The working hypothesis is first and foremost a hypothesis, or a statement of expectation that is tested in action. The term “working” suggests that these hypotheses are subject to change, are provisional, and that the possibility of finding contradictory evidence is real. In addition, a “working” hypothesis is active; it is a tool in an ongoing process of inquiry. If one begins with a research question, the working hypothesis could be viewed as a statement or group of statements that answer the question. It “works” to move purposeful inquiry forward. “Working” also implies some sort of community; we generally work together, in relationship, to achieve some goal.

Working hypothesis is a term found in earlier literature. Indeed, the pioneering pragmatists John Dewey and George Herbert Mead both used the term in important nineteenth-century works. For both Dewey and Mead, the notion of a working hypothesis has a self-evident quality, and it is applied in a big-picture context. Footnote 2

Most notably, Dewey (1896), in one of his most pivotal early works (“Reflex Arc”), used “working hypothesis” to describe a key concept in psychology: “The idea of the reflex arc has upon the whole come nearer to meeting this demand for a general working hypothesis than any other single concept (italics added)” (p. 357). The notion of a working hypothesis was developed more fully 42 years later in Logic: The Theory of Inquiry, where Dewey developed a notion of the working hypothesis that operated on a smaller scale. He defines working hypotheses as a “provisional, working means of advancing investigation” (Dewey 1938, p. 142). Dewey’s definition suggests that working hypotheses would be useful toward the beginning of a research project (e.g., exploratory research).

Mead (1899) used working hypothesis in the title of an American Journal of Sociology article, “The Working Hypothesis and Social Reform” (italics added). He notes that a scientist’s foresight goes beyond testing a hypothesis.

Given its success, he may restate his world from this standpoint and get the basis for further investigation that again always takes the form of a problem. The solution of this problem is found over again in the possibility of fitting his hypothetical proposition into the whole within which it arises. And he must recognize that this statement is only a working hypothesis at the best, i.e., he knows that further investigation will show that the former statement of his world is only provisionally true, and must be false from the standpoint of a larger knowledge, as every partial truth is necessarily false over against the fuller knowledge which he will gain later (Mead 1899 , p. 370).

Cronbach (1975) developed a notion of the working hypothesis consistent with inductive reasoning, but for him the working hypothesis is a product or result of naturalistic inquiry. He makes the case that naturalistic inquiry is highly context dependent; therefore, results or seeming generalizations that come from a study should be viewed as “working hypotheses”, which “are tentative both for the situation in which they first uncovered and for other situations” (as cited in Gobo 2008, p. 196).

A quick Google Scholar search using the term “working hypothesis” shows that it is widely used in twentieth- and twenty-first-century science, particularly in titles. In these articles, the working hypothesis is treated as a conceptual tool that furthers investigation in its early or transitioning phases. We could find no explicit links to exploratory research; the exploratory nature of the problem is expressed implicitly. Terms such as “speculative” (Habib 2000, p. 2391) or “rapidly evolving field” (Prater et al. 2007, p. 1141) capture the exploratory nature of the study. The authors might describe how a topic is “new” or reference “change”. “As a working hypothesis, the picture is only new, however, in its interpretation” (Milnes 1974, p. 1731). In a study of soil genesis, Arnold (1965, p. 718) notes, “Sequential models, formulated as working hypotheses, are subject to further investigation and change”. Any 2020 article dealing with COVID-19 and respiratory distress would be preliminary almost by definition (Ciceri et al. 2020).

5 Philosophical roots of methodology

According to Kaplan (1964, p. 23), “the aim of methodology is to help us understand, in the broadest sense not the products of scientific inquiry but the process itself”. Methods contain philosophical principles that distinguish them from other “human enterprises and interests” (Kaplan 1964, p. 23). Contemporary research methodology is generally classified as quantitative, qualitative and mixed methods. Leading scholars of methodology have associated each with a philosophical underpinning—positivism (or post-positivism), interpretivism or constructivism, and pragmatism, respectively (Guba 1987; Guba and Lincoln 1981; Schrag 1992; Stebbins 2001; Mackenzie and Knipe 2006; Atieno 2009; Levers 2013; Morgan 2007; O’Connor et al. 2008; Johnson and Onwuegbuzie 2004; Twining et al. 2017). This section summarizes how the literature typically describes these philosophies and how they inform contemporary methodology and its literature.

Positivism and its more contemporary version, post-positivism, maintains an objectivist ontology, or assumes an objective reality that can be uncovered (Levers 2013; Twining et al. 2017). Footnote 3 Time- and context-free generalizations are possible, and “real causes of social scientific outcomes can be determined reliably and validly” (Johnson and Onwuegbuzie 2004, p. 14). Further, “explanation of the social world is possible through a logical reduction of social phenomena to physical terms”. It uses an empiricist epistemology, which “implies testability against observation, experimentation, or comparison” (Whetsell and Shields 2015, pp. 420–421). Correspondence theory, a tenet of positivism, asserts that “to each concept there corresponds a set of operations involved in its scientific use” (Kaplan 1964, p. 40).

The interpretivist, constructivist or post-modernist approach is a reaction to positivism. It uses a relativist ontology and a subjectivist epistemology (Levers 2013). In this world of multiple realities, context-free generalities are impossible, as is the separation of facts and values. Causality, explanation, prediction and experimentation depend on assumptions about the correspondence between concepts and reality, which in the absence of an objective reality is impossible. Empirical research can yield “contextualized emergent understanding rather than the creation of testable theoretical structures” (O’Connor et al. 2008, p. 30). The distinctively different world views of positivist/post-positivist and interpretivist philosophy are at the core of many controversies in the methodology, social science and policy science literature (Casula 2020b).

With its focus on dissolving dualisms, pragmatism steps outside the objective/subjective debate. Instead, it asks, “what difference would it make to us if the statement were true” (Kaplan 1964 , p. 42). Its epistemology is connected to purposeful inquiry. Pragmatism has a “transformative, experimental notion of inquiry” anchored in pluralism and a focus on constructing conceptual and practical tools to resolve “problematic situations” (Shields 1998 ; Shields and Rangarajan 2013 ). Exploration and working hypotheses are most comfortably situated within the pragmatic philosophical perspective.

6 Research approaches

Empirical investigation relies on three types of methodology—quantitative, qualitative and mixed methods.

6.1 Quantitative methods

Quantitative methods use deductive logic and formal hypotheses or models to explain, predict and eventually establish causation (Hyde 2000; Kaplan 1964; Johnson and Onwuegbuzie 2004; Morgan 2007). Footnote 4 The correspondence between the conceptual and empirical worlds makes measurement possible. Measurement assigns numbers to objects, events or situations and allows for standardization and subtle discrimination. It also allows researchers to draw on the power of mathematics and statistics (Kaplan 1964, pp. 172–174). Using the power of inferential statistics, quantitative research employs research designs that eliminate competing hypotheses. It is high in external validity, or the ability to generalize to the whole. The research results are relatively independent of the researcher (Johnson and Onwuegbuzie 2004).

Quantitative methods depend on the quality of measurement, a priori conceptualization and adherence to the underlying assumptions of inferential statistics. Critics charge that hypotheses and frameworks needlessly constrain inquiry (Johnson and Onwuegbuzie 2004, p. 19). Hypothesis-testing quantitative methods support the explanatory purpose.

6.2 Qualitative methods

Qualitative researchers who embrace the post-modern, interpretivist view Footnote 5 question everything about the nature of quantitative methods (Willis et al. 2007). Rejecting the possibility of objectivity, the correspondence between ideas and measures, and the constraints of a priori theorizing, they focus on “unique impressions and understandings of events rather than to generalize the findings” (Kolb 2012, p. 85). Characteristics of traditional qualitative research include “induction, discovery, exploration, theory/hypothesis generation and the researcher as the primary ‘instrument’ of data collection” (Johnson and Onwuegbuzie 2004, p. 18). The data of qualitative methods are generated via interviews, direct observation, focus groups and analysis of written records or artifacts.

Qualitative methods provide for understanding and “description of people’s personal experiences of phenomena”. They enable detailed descriptions of “phenomena as they are situated and embedded in local contexts”. Researchers use naturalistic settings to “study dynamic processes” and explore how participants interpret experiences. Qualitative methods have an inherent flexibility, allowing researchers to respond to changes in the research setting. They are particularly good at homing in on the particular and, on the flip side, have limited external validity (Johnson and Onwuegbuzie 2004, p. 20). Instead of specifying a suitable sample size to draw conclusions, qualitative research uses the notion of saturation (Morse 1995).

Saturation is used in grounded theory, a widely used and respected interpretivist qualitative research method. Introduced by Glaser and Strauss (1967), this “grounded on observation” (Patten and Newhart 2000, p. 27) methodology focuses on “the creation of emergent understanding” (O’Connor et al. 2008, p. 30). It uses the constant comparative method, whereby researchers develop theory from data as they code and analyze at the same time. Data collection, coding and analysis, along with theoretical sampling, are systematically combined to generate theory (Kolb 2012, p. 83). The qualitative methods discussed here support exploratory research.

A close look at the two philosophies and assumptions of quantitative and qualitative research suggests two contradictory world views. The literature has labeled these contradictory views the Incompatibility Theory, which sets up a quantitative-versus-qualitative tension similar to the seeming separation of art and science or facts and values (Smith 1983a, b; Guba 1987; Smith and Heshusius 1986; Howe 1988). The incompatibility theory does not make sense in practice. Yin (1981, 1992, 2011, 2017), a prominent case study scholar, showcases a deductive research methodology that crosses boundaries, using both quantitative and qualitative evidence when appropriate.

6.3 Mixed methods

Turning the “Incompatibility Theory” on its head, mixed methods research “combines elements of qualitative and quantitative research approaches … for the broad purposes of breadth and depth of understanding and corroboration” (Johnson et al. 2007, p. 123). It does this by partnering with philosophical pragmatism. Footnote 6 Pragmatism is productive because “it offers an immediate and useful middle position philosophically and methodologically; it offers a practical and outcome-oriented method of inquiry that is based on action and leads, iteratively, to further action and the elimination of doubt; it offers a method for selecting methodological mixes that can help researchers better answer many of their research questions” (Johnson and Onwuegbuzie 2004, p. 17). What is theory for the pragmatist? “Any theoretical model is for the pragmatist, nothing more than a framework through which problems are perceived and subsequently organized” (Hothersall 2019, p. 5).

Brendel (2009) constructed a simple framework to capture the core elements of pragmatism. Brendel’s four “p”s—practical, pluralism, participatory and provisional—help to show the relevance of pragmatism to mixed methods. Pragmatism is purposeful and concerned with practical consequences. The pluralism of pragmatism overcomes the quantitative/qualitative dualism. Instead, it allows for multiple perspectives (including positivism and interpretivism) and, thus, gets around the incompatibility problem. Inquiry should be participatory, or inclusive of the many views of participants; hence, it is consistent with multiple realities and is also tied to the common concern of a problematic situation. Finally, all inquiry is provisional. This is compatible with experimental methods and hypothesis testing, and consistent with the back and forth of inductive and deductive reasoning. Mixed methods support exploratory research.

Advocates of mixed methods research note that it overcomes the weaknesses and employs the strengths of quantitative and qualitative methods. Quantitative methods provide precision. The pictures and narrative of qualitative techniques add meaning to the numbers. Quantitative analysis can provide a big picture, establish relationships, and its results have great generalizability. On the other hand, the “why” behind the explanation is often missing and can be filled in through in-depth interviews. A deeper and more satisfying explanation is possible. Mixed methods research brings the benefits of triangulation, or multiple sources of evidence that converge to support a conclusion. It can entertain a “broader and more complete range of research questions” (Johnson and Onwuegbuzie 2004, p. 21) and can move between inductive and deductive methods. Case studies use multiple forms of evidence and are a natural context for mixed methods.

One thing that seems to be missing from the mixed methods literature and from explicit designs is a place for conceptual frameworks. For example, Heyvaert et al. (2013) examined nine mixed methods studies and found an explicit framework in only two (transformative and pragmatic) (p. 663).

7 Theory and hypotheses: where is and what is theory?

Theory is key to deductive research. In essence, empirical deductive methods test theory. Hence, we shift our attention to theory and the role and functions of the hypothesis in theory. Oppenheim and Putnam (1958) note that “by a ‘theory’ (in the widest sense) we mean any hypothesis, generalization or law (whether deterministic or statistical) or any conjunction of these” (p. 25). Van Evera (1997) uses a similar but more complex definition: “theories are general statements that describe and explain the causes of effects of classes of phenomena. They are composed of causal laws or hypotheses, explanations, and antecedent conditions” (p. 8). Sutton and Staw (1995, p. 376), in the highly cited article “What Theory is Not”, assert that hypotheses should contain logical arguments for “why” the hypothesis is expected. Hypotheses need an underlying causal argument before they can be considered theory. The point of this discussion is not to define theory but to establish the importance of hypotheses in theory.

Explanatory research is implicitly relational (A explains B). The hypotheses of explanatory research lay bare these relationships. Popular definitions of hypotheses capture this relational component. For example, the Cambridge Dictionary defines a hypothesis as “an idea or explanation for something that is based on known facts but has not yet been proven”. Vocabulary.com’s definition emphasizes explanation: a hypothesis is “an idea or explanation that you then test through study and experimentation”. According to Wikipedia, a hypothesis is “a proposed explanation for a phenomenon”. Other definitions remove the relational or explanatory reference. The Oxford English Dictionary defines a hypothesis as a “supposition or conjecture put forth to account for known facts.” Science Buddies defines a hypothesis as a “tentative, testable answer to a scientific question”. According to the Longman Dictionary, a hypothesis is “an idea that can be tested to see if it is true or not”. The Urban Dictionary states that a hypothesis is “a prediction or educated-guess based on current evidence that is yet be tested”. We argue that the hypotheses of exploratory research—working hypotheses—are not bound by relational expectations. It is this flexibility that distinguishes the working hypothesis.

Sutton and Staw (1995) maintain that hypotheses “serve as crucial bridges between theory and data, making explicit how the variables and relationships that follow from a logical argument will be operationalized” (p. 376, italics added). Writing in the highly rated journal Computers and Education, Twining et al. (2017) created guidelines for qualitative research as a way to improve its soundness and rigor. They identified a lack of alignment between theoretical stance and methodology as a common problem in qualitative research, along with a lack of alignment between methodology, design, instruments of data collection and analysis. The authors created a guidance summary, which emphasized the need to enhance coherence throughout the elements of research design (Twining et al. 2017, p. 12). Perhaps the bridging function of the hypothesis mentioned by Sutton and Staw (1995) is obscured and often missing in qualitative methods. Working hypotheses can be a tool to overcome this problem.

For reasons similar to those used by mixed methods scholars, we look to classical pragmatism and the ideas of John Dewey to inform our discussion of theory and working hypotheses. Dewey (1938) treats theory as a tool of empirical inquiry and uses a map metaphor (p. 136). Theory is like a map that helps a traveler navigate the terrain, and it should be judged by its usefulness. “There is no expectation that a map is a true representation of reality. Rather, it is a representation that allows a traveler to reach a destination (achieve a purpose). Hence, theories should be judged by how well they help resolve the problem or achieve a purpose” (Shields and Rangarajan 2013, p. 23). Note that we explicitly link theory to the research purpose. Theory is never treated as an unimpeachable Truth; rather, it is a helpful tool that organizes inquiry, connecting data and problem. Dewey’s approach also expands the definition of theory to include abstractions (categories) outside of causation and explanation. The micro-conceptual frameworks Footnote 7 introduced in Table 1 are a type of theory. We define conceptual frameworks as the “way the ideas are organized to achieve the project’s purpose” (Shields and Rangarajan 2013, p. 24). Micro-conceptual frameworks do this at a level of analysis very close to the data. They can direct operationalization and ways to assess measurement or evidence at the individual research study level. Again, the research purpose plays a pivotal role in the functioning of theory (Shields and Tajalli 2006).

8 Working hypothesis: methods and data analysis

We move on to answer the remaining questions in Table 1. We have established that exploratory research is extremely flexible and idiosyncratic. Given this, we proceed with a few examples and draw out lessons for developing an exploratory purpose, building a framework, and from there identifying data collection techniques and the logics of hypothesis testing and analysis. Early on, we noted the value of the working hypothesis framework for student empirical research and applied research. The next section uses a master’s-level student’s work to illustrate the usefulness of working hypotheses as a way to incorporate the literature and structure inquiry. This graduate student was also a mature professional with a research question that emerged from his job; his project is thus an example of applied research.

Master of Public Administration student Swift ( 2010 ) worked for a public agency and was responsible for that agency's sexual harassment training. The agency needed to evaluate its training but had never done so before. He also had never attempted a significant empirical research project. Both of these conditions suggest exploration as a possible approach. He was interested in evaluating the training program, and hence the project had a normative sense. Given his job, he already knew a lot about the problem of sexual harassment and sexual harassment training. What he did not know much about was doing empirical research, reviewing the literature, or building a framework to evaluate the training (working hypotheses). He wanted a framework that was flexible and comprehensive. In his research, he discovered Lundvall's ( 2006 ) knowledge taxonomy, summarized in four simple ways of knowing (know-what, know-how, know-why, know-who). He asked whether his agency's training provided the participants with these kinds of knowledge. Lundvall's categories of knowing became the basis of his working hypotheses. Lundvall's knowledge taxonomy is well suited for working hypotheses because it is simple and easy to understand intuitively. It can also be tailored to the unique problematic situation of the researcher. Swift ( 2010 , pp. 38–39) developed four basic working hypotheses:

WH1: Capital Metro provides adequate know-what knowledge in its sexual harassment training.

WH2: Capital Metro provides adequate know-how knowledge in its sexual harassment training.

WH3: Capital Metro provides adequate know-why knowledge in its sexual harassment training.

WH4: Capital Metro provides adequate know-who knowledge in its sexual harassment training.

From here he needed to determine what would constitute the different kinds of knowledge. For example, what constitutes "know-what" knowledge for sexual harassment training? This is where his knowledge and experience working in the field, as well as the literature, come into play. According to Lundvall et al. ( 1988 , p. 12), "know-what" knowledge is about facts and raw information. Swift ( 2010 ) learned through the literature that laws and rules were the basis for the mandated sexual harassment training. He read about specific anti-discrimination laws and the subsequent rules and regulations derived from them. These laws and rules used specific definitions and were enacted within a historical context. Laws, rules, definitions, and history became the "facts" of know-what knowledge for his working hypothesis. To make this clear, he created sub-hypotheses that explicitly took these into account. See how Swift ( 2010 , p. 38) constructed the sub-hypotheses below. Each sub-hypothesis was defended using material from the literature (Swift 2010 , pp. 22–26). The sub-hypotheses can also be easily tied to evidence. For example, he could document that the training covered anti-discrimination laws.

WH1: Capital Metro provides adequate know-what knowledge in its sexual harassment training.

WH1a: The sexual harassment training includes information on anti-discrimination laws (Title VII).

WH1b: The sexual harassment training includes information on key definitions.

WH1c: The sexual harassment training includes information on Capital Metro’s Equal Employment Opportunity and Harassment policy.

WH1d: Capital Metro provides training on sexual harassment history.

Know-how knowledge refers to the ability to do something and involves skills (Lundvall and Johnson 1994 , p. 12). It is a kind of expertise in action. The literature and his experience allowed Swift to identify skills such as how to file a claim or how to document incidents of sexual harassment as important "know-how" knowledge that should be included in sexual harassment training. Again, these were depicted as sub-hypotheses.

WH2: Capital Metro provides adequate know-how knowledge in its sexual harassment training.

WH2a: Training is provided on how to file and report a claim of harassment.

WH2b: Training is provided on how to document sexual harassment situations.

WH2c: Training is provided on how to investigate sexual harassment complaints.

WH2d: Training is provided on how to follow additional harassment policy procedures and protocols.

Note that the working hypotheses do not specify a relationship but rather are simple declarative sentences. If "know-how" knowledge was found in the sexual harassment training, he would be able to find evidence that participants learned how to file a claim (WH2a). The working hypothesis provides the bridge between theory and data that Sutton and Staw (1995) found missing in exploratory work. The sub-hypotheses are designed to be refined enough that researchers know what to look for and can tailor their hunt for evidence. Figure  1 captures the generic sub-hypothesis design.

Figure 1: A common structure used in the development of working hypotheses
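The generic design in Figure 1, a working hypothesis refined into evidence-ready sub-hypotheses, can also be expressed as a simple nested data structure. The sketch below is illustrative only: the statement wording is paraphrased, and the `evidence_targets` helper is our own device, not part of Swift's ( 2010 ) study.

```python
# Working hypotheses as non-relational, declarative statements, each refined
# into sub-hypotheses concrete enough to be tied directly to evidence.
working_hypotheses = {
    "WH1": {
        "claim": ("Capital Metro provides adequate know-what knowledge "
                  "in its sexual harassment training"),
        "sub": {
            "WH1a": "training covers anti-discrimination laws (Title VII)",
            "WH1b": "training covers key definitions",
            "WH1c": "training covers the agency's EEO and harassment policy",
            "WH1d": "training covers sexual harassment history",
        },
    },
    "WH2": {
        "claim": ("Capital Metro provides adequate know-how knowledge "
                  "in its sexual harassment training"),
        "sub": {
            "WH2a": "training covers how to file and report a claim",
            "WH2b": "training covers how to document incidents",
            "WH2c": "training covers how to investigate complaints",
            "WH2d": "training covers additional policy procedures",
        },
    },
}

def evidence_targets(framework):
    """Flatten the framework into the statements an investigator must
    seek evidence for: the bridge between theory and data."""
    return [(wh_id, sub_id, text)
            for wh_id, wh in framework.items()
            for sub_id, text in wh["sub"].items()]

targets = evidence_targets(working_hypotheses)  # eight evidence-ready statements
```

Because each sub-hypothesis is a simple declarative statement rather than a relational claim, "testing" it amounts to checking whether the corresponding evidence can be found.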

When expected evidence is linked to the sub-hypotheses, the data, framework, and research purpose are aligned. This can be laid out in a planning document that operationalizes the data collection, something akin to an architect's blueprint. This is where the scholar explicitly develops the alignment between purpose, framework, and method (Shields and Rangarajan 2013 ; Shields et al. 2019b ).

Table  2 operationalizes Swift's working hypotheses (and sub-hypotheses). The table provides clues as to what kind of evidence is needed to determine whether the hypotheses are supported. In this case, Swift used interviews with participants and trainers as well as a review of program documents. Column one repeats the sub-hypothesis, column two specifies the data collection method (here, interviews with participants/managers and review of program documents), and column three specifies the unique questions that focus the investigation. For example, the interview questions are provided. In the less precise world of qualitative data, evidence supporting a hypothesis can have varying degrees of strength. This, too, can be specified.
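The alignment that Table 2 provides can be mimicked in a small planning structure that ties each sub-hypothesis to its data collection method, its focusing questions, and the eventual strength of support. The rows and questions below are invented placeholders, not Swift's actual protocol.

```python
# An operationalization "blueprint": one row per sub-hypothesis, aligning it
# with a data collection method and the questions that focus the investigation.
blueprint = [
    {"sub_hypothesis": "WH2a: training covers how to file and report a claim",
     "method": "interviews with participants/managers",
     "questions": ["Did the training explain how to file a claim?"],
     "support": None},  # recorded after fieldwork
    {"sub_hypothesis": "WH2b: training covers how to document incidents",
     "method": "review of program documents",
     "questions": ["Do the materials describe documentation steps?"],
     "support": None},
]

# Qualitative evidence can support a hypothesis with varying strength,
# so support is recorded on an ordinal scale rather than as a yes/no.
LEVELS = ("none", "weak", "moderate", "strong")

def record_support(row, level):
    if level not in LEVELS:
        raise ValueError(f"unknown support level: {level}")
    row["support"] = level

record_support(blueprint[0], "strong")
record_support(blueprint[1], "weak")
supported = [r["sub_hypothesis"] for r in blueprint
             if r["support"] in ("moderate", "strong")]
```

Laying the protocol out this way makes the purpose-framework-method alignment explicit before any data are collected.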

For Swift's example, neither the statistics of explanatory research nor the open-ended questions of interpretivist, inductive exploratory research are used. The deductive logic of inquiry here is somewhat intuitive and similar to that of a detective (Ulriksen and Dadalauri 2016 ). It is also a logic used in international law (Worster 2013 ). It should be noted that the working hypothesis and the corresponding data collection protocol do not stop inquiry and fieldwork outside the framework. The interviews could reveal an unexpected problem with Swift's training program. The framework provides a loose and perhaps useful way to identify and make sense of data that do not fit expectations. Researchers using working hypotheses should be sensitive to interesting findings that fall outside their framework. These could be used in future studies, to refine theory, or, in this case, to provide suggestions for improving sexual harassment training. The sensitizing concepts mentioned by Gilgun ( 2015 ) are free to emerge and should be encouraged.

Something akin to working hypotheses hides in plain sight in the professional literature. Take, for example, Kerry Crawford's ( 2017 ) book Wartime Sexual Violence. Here she explores basic changes in the way "advocates and decision makers think about and discuss conflict-related sexual violence" (p. 2). She focuses on a subsequent shift from silence to action. The shift occurred as wartime sexual violence was reframed as a "weapon of war". The new frame captured the attention of powerful members of the security community, who demanded, initiated, and paid for institutional and policy change. Crawford ( 2017 ) examines the legacy of this key reframing. She develops a six-stage model of potential international responses to incidents of wartime violence. This model is fairly easily converted to working hypotheses and sub-hypotheses. Table  3 shows her model as a set of (non-relational) working hypotheses. She applied this model as a way to gather evidence among cases (e.g., the US response to sexual violence in the Democratic Republic of the Congo) to show the official level of response to sexual violence. Each case-study chapter examined evidence to establish whether the case fit the pattern formalized in the working hypotheses. The framework was very useful in her comparative context; it allowed for consistent comparative analysis across cases. Her analysis of the three cases went well beyond the material covered in the framework. She freely incorporated useful inductively informed data in her analysis and discussion. The framework, however, allowed for alignment within and across cases.
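Crawford's use of a fixed framework across cases can be sketched in the same spirit. The stage labels (WH1 through WH6) and the case findings below are invented placeholders, since the point is only the comparative logic: every case is scored against the same working hypotheses, so results stay aligned across cases.

```python
# A shared working-hypothesis framework applied to several cases: each case
# is examined against the same (non-relational) hypotheses, which keeps the
# comparative analysis consistent. Stage labels and findings are placeholders.
hypotheses = [f"WH{i}" for i in range(1, 7)]  # a six-stage model as six WHs

findings = {
    "Case A": {"WH1": True, "WH2": True, "WH3": True,
               "WH4": False, "WH5": False, "WH6": False},
    "Case B": {"WH1": True, "WH2": True, "WH3": False,
               "WH4": False, "WH5": False, "WH6": False},
}

def response_level(case):
    """Highest consecutive stage the evidence supports for a case."""
    level = 0
    for wh in hypotheses:
        if not findings[case].get(wh, False):
            break
        level += 1
    return level

levels = {case: response_level(case) for case in findings}
```

Evidence that falls outside the framework is not discarded; as in Crawford's chapters, it can feed the discussion inductively while the framework keeps the cross-case comparison systematic.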

9 Conclusion

In this article we argued that exploratory research is also well suited to deductive approaches. By examining the landscape of deductive, exploratory research, we proposed the working hypothesis as a flexible conceptual framework and a useful tool for doing exploratory studies. It has the potential to guide and bring coherence across the steps in the research process. We first presented the nature of the exploratory research purpose and how it differs from the two other types of research purposes identified in the literature, explanation and description. We then focused on answering four different questions in order to show the link between micro-conceptual frameworks and research purposes in a deductive setting. The answers to the four questions are summarized in Table  4 .

Firstly, we argued that the working hypothesis and exploration are situated within the pragmatic philosophical perspective. Pragmatism allows for pluralism in theory and data collection techniques, which is compatible with the flexible exploratory purpose. Secondly, after introducing and discussing the four core elements of pragmatism (practical, pluralism, participatory, and provisional), we explained how the working hypothesis informs the methodologies and evidence collection of deductive exploratory research through a presentation of the benefits of triangulation provided by mixed methods research. Thirdly, as is clear from the article title, we introduced the working hypothesis as the micro-conceptual framework for deductive exploratory research. We argued that the hypotheses of exploratory research, which we call working hypotheses, are distinguished from those of explanatory research, since they do not require a relational component and are not bound by relational expectations. A working hypothesis is extremely flexible and idiosyncratic; depending on the research question, it can be viewed as a statement or group of statements of expectations tested in action. Using examples, we concluded by explaining how working hypotheses inform data collection and analysis in deductive exploratory research.

Crawford's ( 2017 ) example showed how the structure of working hypotheses provides a framework for comparative case studies. Her criteria for analysis were specified ahead of time and used to frame each case. Thus, her comparisons were systematized across cases. Further, the framework ensured a connection between the data analysis and the literature review. Yet the flexible, working nature of the hypotheses allowed unexpected findings to be discovered.

The evidence required to test working hypotheses is directed by the research purpose and potentially includes both quantitative and qualitative sources. Thus, all types of evidence, including quantitative methods, should be part of the toolbox of deductive, exploratory research. We have shown how the working hypothesis, as a flexible exploratory framework, resolves many seeming dualisms pervasive in the research methods literature.

To conclude, this article has provided an in-depth examination of working hypotheses, taking into account philosophical questions and the larger formal research methods literature. By discussing working hypotheses as applied theoretical tools, we demonstrated that they fill a unique niche in the methods literature, since they provide a way to enhance alignment in deductive, exploratory studies.

Notes

In practice, quantitative scholars often run multivariate analyses on databases to find out whether there are correlations. Hypotheses are tested because the statistical software does the math, not because the scholar has an a priori relational expectation (hypothesis) well grounded in the literature and supported by cogent arguments. Hunches are just fine. This is clearly an inductive approach to research and part of the larger process of inquiry.

In 1958, philosophers of science Oppenheim and Putnam used the notion of the working hypothesis in the title of their paper "Unity of Science as a Working Hypothesis." They, too, used it as a big-picture concept: "unity of science in this sense, can be fully realized constitutes an over-arching meta-scientific hypothesis, which enables one to see a unity in scientific activities that might otherwise appear disconnected or unrelated" (p. 4).

It should be noted that the positivism described in the research methods literature does not resemble philosophical positivism as developed by philosophers like Comte (Whetsell and Shields 2015 ). In the research methods literature, "positivism means different things to different people….The term has long been emptied of any precise denotation …and is sometimes affixed to positions actually opposed to those espoused by the philosophers from whom the name derives" (Schrag 1992 , p. 5). For the purposes of this paper, we capture a few essential ways positivism is presented in the research methods literature. This helps us position the "working hypothesis" and "exploratory" research within the larger context of contemporary research methods. We are not arguing that the positivism presented here is anything more. The incompatibility theory discussed later is an outgrowth of this research methods literature…

It should be noted that quantitative researchers often use inductive reasoning. They do this with existing data sets when they run correlations or regression analyses as a way to find relationships. They ask: what do the data tell us?

Qualitative researchers are also associated with phenomenology, hermeneutics, naturalistic inquiry and constructivism.

See Feilzer ( 2010 ), Howe ( 1988 ), Johnson and Onwuegbuzie ( 2004 ), Morgan ( 2007 ), Onwuegbuzie and Leech ( 2005 ), Biddle and Schafft ( 2015 ).

The term conceptual framework is applicable in a broad context (see Ravitch and Riggan 2012 ). The micro-conceptual framework narrows to the specific study and informs data collection (Shields and Rangarajan 2013 ; Shields et al. 2019a ).

Adler, E., Clark, R.: How It’s Done: An Invitation to Social Research, 3rd edn. Thompson-Wadsworth, Belmont (2008)


Arnold, R.W.: Multiple working hypothesis in soil genesis. Soil Sci. Soc. Am. J. 29 (6), 717–724 (1965)


Atieno, O.: An analysis of the strengths and limitation of qualitative and quantitative research paradigms. Probl. Educ. 21st Century 13 , 13–18 (2009)

Babbie, E.: The Practice of Social Research, 11th edn. Thompson-Wadsworth, Belmont (2007)

Biddle, C., Schafft, K.A.: Axiology and anomaly in the practice of mixed methods work: pragmatism, valuation, and the transformative paradigm. J. Mixed Methods Res. 9 (4), 320–334 (2015)

Brendel, D.H.: Healing Psychiatry: Bridging the Science/Humanism Divide. MIT Press, Cambridge (2009)

Bryman, A.: Qualitative research on leadership: a critical but appreciative review. Leadersh. Q. 15 (6), 729–769 (2004)

Casula, M.: Under which conditions is cohesion policy effective: proposing an Hirschmanian approach to EU structural funds, Regional & Federal Studies, https://doi.org/10.1080/13597566.2020.1713110 (2020a)

Casula, M.: Economic growth and cohesion policy implementation in Italy and Spain. Palgrave Macmillan, Cham (2020b)

Ciceri, F., et al.: Microvascular COVID-19 lung vessels obstructive thromboinflammatory syndrome (MicroCLOTS): an atypical acute respiratory distress syndrome working hypothesis. Crit. Care Resusc. 15 , 1–3 (2020)

Crawford, K.F.: Wartime sexual violence: From silence to condemnation of a weapon of war. Georgetown University Press (2017)

Cronbach, L.: Beyond the two disciplines of scientific psychology. Am. Psychol. 30, 116–127 (1975)

Dewey, J.: The reflex arc concept in psychology. Psychol. Rev. 3 (4), 357 (1896)

Dewey, J.: Logic: The Theory of Inquiry. Henry Holt & Co, New York (1938)

Feilzer, Y.: Doing mixed methods research pragmatically: implications for the rediscovery of pragmatism as a research paradigm. J. Mixed Methods Res. 4 (1), 6–16 (2010)

Gilgun, J.F.: Qualitative research and family psychology. J. Fam. Psychol. 19 (1), 40–50 (2005)

Gilgun, J.F.: Methods for enhancing theory and knowledge about problems, policies, and practice. In: Briar, K., Orme, J., Ruckdeschel, R., Shaw, I. (eds.) The Sage Handbook of Social Work Research, pp. 281–297. Sage, Thousand Oaks (2009)

Gilgun, J.F.: Deductive Qualitative Analysis as Middle Ground: Theory-Guided Qualitative Research. Amazon Digital Services LLC, Seattle (2015)

Glaser, B.G., Strauss, A.L.: The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine, Chicago (1967)

Gobo, G.: Re-Conceptualizing Generalization: Old Issues in a New Frame. In: Alasuutari, P., Bickman, L., Brannen, J. (eds.) The Sage Handbook of Social Research Methods, pp. 193–213. Sage, Los Angeles (2008)


Grinnell, R.M.: Social work research and evaluation: quantitative and qualitative approaches. New York: F.E. Peacock Publishers (2001)

Guba, E.G.: What have we learned about naturalistic evaluation? Eval. Pract. 8 (1), 23–43 (1987)

Guba, E., Lincoln, Y.: Effective Evaluation: Improving the Usefulness of Evaluation Results Through Responsive and Naturalistic Approaches. Jossey-Bass Publishers, San Francisco (1981)

Habib, M.: The neurological basis of developmental dyslexia: an overview and working hypothesis. Brain 123 (12), 2373–2399 (2000)

Heyvaert, M., Maes, B., Onghena, P.: Mixed methods research synthesis: definition, framework, and potential. Qual. Quant. 47 (2), 659–676 (2013)

Hildebrand, D.: Dewey: A Beginners Guide. Oneworld Oxford, Oxford (2008)

Howe, K.R.: Against the quantitative-qualitative incompatibility thesis or dogmas die hard. Edu. Res. 17 (8), 10–16 (1988)

Hothersall, S.J.: Epistemology and social work: enhancing the integration of theory, practice and research through philosophical pragmatism. Eur. J. Social Work 22 (5), 860–870 (2019)

Hyde, K.F.: Recognising deductive processes in qualitative research. Qual. Market Res. Int. J. 3 (2), 82–90 (2000)

Johnson, R.B., Onwuegbuzie, A.J.: Mixed methods research: a research paradigm whose time has come. Educ. Res. 33 (7), 14–26 (2004)

Johnson, R.B., Onwuegbuzie, A.J., Turner, L.A.: Toward a definition of mixed methods research. J. Mixed Methods Res. 1 (2), 112–133 (2007)

Kaplan, A.: The Conduct of Inquiry. Chandler, Scranton (1964)

Kolb, S.M.: Grounded theory and the constant comparative method: valid research strategies for educators. J. Emerg. Trends Educ. Res. Policy Stud. 3 (1), 83–86 (2012)

Levers, M.J.D.: Philosophical paradigms, grounded theory, and perspectives on emergence. Sage Open 3 (4), 2158244013517243 (2013)

Lundvall, B.-Å.: Knowledge management in the learning economy. In: Danish Research Unit for Industrial Dynamics Working Paper, vol. 6, pp. 3–5 (2006)

Lundvall, B.-Å., Johnson, B.: Knowledge management in the learning economy. J. Ind. Stud. 1 (2), 23–42 (1994)

Lundvall, B.-Å., Jenson, M.B., Johnson, B., Lorenz, E.: Forms of Knowledge and Modes of Innovation—From User-Producer Interaction to the National System of Innovation. In: Dosi, G., et al. (eds.) Technical Change and Economic Theory. Pinter Publishers, London (1988)

Maanen, J., Manning, P., Miller, M.: Series editors' introduction. In: Stebbins, R. (ed.) Exploratory Research in the Social Sciences, pp. v–vi. Sage, Thousand Oaks (2001)

Mackenzie, N., Knipe, S.: Research dilemmas: paradigms, methods and methodology. Issues Educ. Res. 16 (2), 193–205 (2006)

Marlow, C.R.: Research Methods for Generalist Social Work. Thomson Brooks/Cole, New York (2005)

Mead, G.H.: The working hypothesis in social reform. Am. J. Sociol. 5 (3), 367–371 (1899)

Milnes, A.G.: Structure of the Pennine Zone (Central Alps): a new working hypothesis. Geol. Soc. Am. Bull. 85 (11), 1727–1732 (1974)

Morgan, D.L.: Paradigms lost and pragmatism regained: methodological implications of combining qualitative and quantitative methods. J. Mixed Methods Res. 1 (1), 48–76 (2007)

Morse, J.: The significance of saturation. Qual. Health Res. 5 (2), 147–149 (1995)

O’Connor, M.K., Netting, F.E., Thomas, M.L.: Grounded theory: managing the challenge for those facing institutional review board oversight. Qual. Inq. 14 (1), 28–45 (2008)

Onwuegbuzie, A.J., Leech, N.L.: On becoming a pragmatic researcher: The importance of combining quantitative and qualitative research methodologies. Int. J. Soc. Res. Methodol. 8 (5), 375–387 (2005)

Oppenheim, P., Putnam, H.: Unity of science as a working hypothesis. In: Minnesota Studies in the Philosophy of Science, vol. II, pp. 3–36 (1958)

Patten, M.L., Newhart, M.: Understanding Research Methods: An Overview of the Essentials, 2nd edn. Routledge, New York (2000)

Pearse, N.: An illustration of deductive analysis in qualitative research. In: European Conference on Research Methodology for Business and Management Studies, pp. 264–VII. Academic Conferences International Limited (2019)

Prater, D.N., Case, J., Ingram, D.A., Yoder, M.C.: Working hypothesis to redefine endothelial progenitor cells. Leukemia 21 (6), 1141–1149 (2007)

Ravitch, B., Riggan, M.: Reason and Rigor: How Conceptual Frameworks Guide Research. Sage, Beverley Hills (2012)

Reiter, B.: The epistemology and methodology of exploratory social science research: Crossing Popper with Marcuse. In: Government and International Affairs Faculty Publications. Paper 99. http://scholarcommons.usf.edu/gia_facpub/99 (2013)

Ritchie, J., Lewis, J.: Qualitative Research Practice: A Guide for Social Science Students and Researchers. Sage, London (2003)

Schrag, F.: In defense of positivist research paradigms. Educ. Res. 21 (5), 5–8 (1992)

Shields, P.M.: Pragmatism as a philosophy of science: a tool for public administration. Res. Public Adm. 4, 195–225 (1998)

Shields, P.M., Rangarajan, N.: A Playbook for Research Methods: Integrating Conceptual Frameworks and Project Management. New Forums Press (2013)

Shields, P.M., Tajalli, H.: Intermediate theory: the missing link in successful student scholarship. J. Public Aff. Educ. 12 (3), 313–334 (2006)

Shields, P., Whetsell, T.: Public administration methodology: a pragmatic perspective. In: Raadschelders, J., Stillman, R. (eds.) Foundations of Public Administration, pp. 75–92. Melvin and Leigh, New York (2017)

Shields, P., Rangarajan, N., Casula, M.: It is a Working Hypothesis: Searching for Truth in a Post-Truth World (part I). Sotsiologicheskie issledovaniya 10 , 39–47 (2019a)

Shields, P., Rangarajan, N., Casula, M.: It is a Working Hypothesis: Searching for Truth in a Post-Truth World (part 2). Sotsiologicheskie issledovaniya 11 , 40–51 (2019b)

Smith, J.K.: Quantitative versus qualitative research: an attempt to clarify the issue. Educ. Res. 12 (3), 6–13 (1983a)

Smith, J.K.: Quantitative versus interpretive: the problem of conducting social inquiry. In: House, E. (ed.) Philosophy of Evaluation, pp. 27–52. Jossey-Bass, San Francisco (1983b)

Smith, J.K., Heshusius, L.: Closing down the conversation: the end of the quantitative-qualitative debate among educational inquirers. Educ. Res. 15 (1), 4–12 (1986)

Stebbins, R.A.: Exploratory Research in the Social Sciences. Sage, Thousand Oaks (2001)


Strydom, H.: An evaluation of the purposes of research in social work. Soc. Work/Maatskaplike Werk 49 (2), 149–164 (2013)

Sutton, R.I., Staw, B.M.: What theory is not. Adm. Sci. Q. 40 (3), 371–384 (1995)

Swift, J., III: Exploring Capital Metro's sexual harassment training using Dr. Bengt-Ake Lundvall's taxonomy of knowledge principles. Applied Research Project, Texas State University. https://digital.library.txstate.edu/handle/10877/3671 (2010)

Thomas, E., Magilvy, J.K.: Qualitative rigor or research validity in qualitative research. J. Spec. Pediatric Nurs. 16 (2), 151–155 (2011)

Twining, P., Heller, R.S., Nussbaum, M., Tsai, C.C.: Some guidance on conducting and reporting qualitative studies. Comput. Educ. 107 , A1–A9 (2017)

Ulriksen, M., Dadalauri, N.: Single case studies and theory-testing: the knots and dots of the process-tracing method. Int. J. Soc. Res. Methodol. 19 (2), 223–239 (2016)

Van Evera, S.: Guide to Methods for Students of Political Science. Cornell University Press, Ithaca (1997)

Whetsell, T.A., Shields, P.M.: The dynamics of positivism in the study of public administration: a brief intellectual history and reappraisal. Adm. Soc. 47 (4), 416–446 (2015)

Willis, J.W., Jost, M., Nilakanta, R.: Foundations of Qualitative Research: Interpretive and Critical Approaches. Sage, Beverley Hills (2007)

Worster, W.T.: The inductive and deductive methods in customary international law analysis: traditional and modern approaches. Georget. J. Int. Law 45 , 445 (2013)

Yin, R.K.: The case study as a serious research strategy. Knowledge 3 (1), 97–114 (1981)

Yin, R.K.: The case study method as a tool for doing evaluation. Curr. Sociol. 40 (1), 121–137 (1992)

Yin, R.K.: Applications of Case Study Research. Sage, Beverley Hills (2011)

Yin, R.K.: Case Study Research and Applications: Design and Methods. Sage Publications, Beverley Hills (2017)


Acknowledgements

The authors contributed equally to this work. The authors would like to thank Quality & Quantity's editors and the anonymous reviewers for their valuable advice and comments on previous versions of this paper.

Open access funding provided by Alma Mater Studiorum - Università di Bologna within the CRUI-CARE Agreement. There are no funders to report for this submission.

Author information

Authors and Affiliations

Department of Political and Social Sciences, University of Bologna, Strada Maggiore 45, 40125, Bologna, Italy

Mattia Casula

Texas State University, San Marcos, TX, USA

Nandhini Rangarajan & Patricia Shields


Corresponding author

Correspondence to Mattia Casula .

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the author.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Casula, M., Rangarajan, N. & Shields, P. The potential of working hypotheses for deductive exploratory research. Qual Quant 55 , 1703–1725 (2021). https://doi.org/10.1007/s11135-020-01072-9


Accepted : 05 November 2020

Published : 08 December 2020

Issue Date : October 2021

DOI : https://doi.org/10.1007/s11135-020-01072-9


Keywords

  • Exploratory research
  • Working hypothesis
  • Deductive qualitative research
  • D86 - Economics of Contract: Theory
  • D89 - Other
  • Browse content in D9 - Micro-Based Behavioral Economics
  • D90 - General
  • D91 - Role and Effects of Psychological, Emotional, Social, and Cognitive Factors on Decision Making
  • D92 - Intertemporal Firm Choice, Investment, Capacity, and Financing
  • Browse content in E - Macroeconomics and Monetary Economics
  • Browse content in E0 - General
  • E00 - General
  • E01 - Measurement and Data on National Income and Product Accounts and Wealth; Environmental Accounts
  • E02 - Institutions and the Macroeconomy
  • E03 - Behavioral Macroeconomics
  • Browse content in E1 - General Aggregative Models
  • E10 - General
  • E12 - Keynes; Keynesian; Post-Keynesian
  • E13 - Neoclassical
  • Browse content in E2 - Consumption, Saving, Production, Investment, Labor Markets, and Informal Economy
  • E20 - General
  • E21 - Consumption; Saving; Wealth
  • E22 - Investment; Capital; Intangible Capital; Capacity
  • E23 - Production
  • E24 - Employment; Unemployment; Wages; Intergenerational Income Distribution; Aggregate Human Capital; Aggregate Labor Productivity
  • E25 - Aggregate Factor Income Distribution
  • Browse content in E3 - Prices, Business Fluctuations, and Cycles
  • E30 - General
  • E31 - Price Level; Inflation; Deflation
  • E32 - Business Fluctuations; Cycles
  • E37 - Forecasting and Simulation: Models and Applications
  • Browse content in E4 - Money and Interest Rates
  • E40 - General
  • E41 - Demand for Money
  • E42 - Monetary Systems; Standards; Regimes; Government and the Monetary System; Payment Systems
  • E43 - Interest Rates: Determination, Term Structure, and Effects
  • E44 - Financial Markets and the Macroeconomy
  • Browse content in E5 - Monetary Policy, Central Banking, and the Supply of Money and Credit
  • E50 - General
  • E51 - Money Supply; Credit; Money Multipliers
  • E52 - Monetary Policy
  • E58 - Central Banks and Their Policies
  • Browse content in E6 - Macroeconomic Policy, Macroeconomic Aspects of Public Finance, and General Outlook
  • E60 - General
  • E62 - Fiscal Policy
  • E66 - General Outlook and Conditions
  • Browse content in E7 - Macro-Based Behavioral Economics
  • E71 - Role and Effects of Psychological, Emotional, Social, and Cognitive Factors on the Macro Economy
  • Browse content in F - International Economics
  • Browse content in F0 - General
  • F00 - General
  • Browse content in F1 - Trade
  • F10 - General
  • F11 - Neoclassical Models of Trade
  • F12 - Models of Trade with Imperfect Competition and Scale Economies; Fragmentation
  • F13 - Trade Policy; International Trade Organizations
  • F14 - Empirical Studies of Trade
  • F15 - Economic Integration
  • F16 - Trade and Labor Market Interactions
  • F18 - Trade and Environment
  • Browse content in F2 - International Factor Movements and International Business
  • F20 - General
  • F21 - International Investment; Long-Term Capital Movements
  • F22 - International Migration
  • F23 - Multinational Firms; International Business
  • Browse content in F3 - International Finance
  • F30 - General
  • F31 - Foreign Exchange
  • F32 - Current Account Adjustment; Short-Term Capital Movements
  • F34 - International Lending and Debt Problems
  • F35 - Foreign Aid
  • F36 - Financial Aspects of Economic Integration
  • Browse content in F4 - Macroeconomic Aspects of International Trade and Finance
  • F40 - General
  • F41 - Open Economy Macroeconomics
  • F42 - International Policy Coordination and Transmission
  • F43 - Economic Growth of Open Economies
  • F44 - International Business Cycles
  • Browse content in F5 - International Relations, National Security, and International Political Economy
  • F50 - General
  • F51 - International Conflicts; Negotiations; Sanctions
  • F52 - National Security; Economic Nationalism
  • F55 - International Institutional Arrangements
  • Browse content in F6 - Economic Impacts of Globalization
  • F60 - General
  • F61 - Microeconomic Impacts
  • F63 - Economic Development
  • Browse content in G - Financial Economics
  • Browse content in G0 - General
  • G00 - General
  • G01 - Financial Crises
  • G02 - Behavioral Finance: Underlying Principles
  • Browse content in G1 - General Financial Markets
  • G10 - General
  • G11 - Portfolio Choice; Investment Decisions
  • G12 - Asset Pricing; Trading volume; Bond Interest Rates
  • G14 - Information and Market Efficiency; Event Studies; Insider Trading
  • G15 - International Financial Markets
  • G18 - Government Policy and Regulation
  • Browse content in G2 - Financial Institutions and Services
  • G20 - General
  • G21 - Banks; Depository Institutions; Micro Finance Institutions; Mortgages
  • G22 - Insurance; Insurance Companies; Actuarial Studies
  • G23 - Non-bank Financial Institutions; Financial Instruments; Institutional Investors
  • G24 - Investment Banking; Venture Capital; Brokerage; Ratings and Ratings Agencies
  • G28 - Government Policy and Regulation
  • Browse content in G3 - Corporate Finance and Governance
  • G30 - General
  • G31 - Capital Budgeting; Fixed Investment and Inventory Studies; Capacity
  • G32 - Financing Policy; Financial Risk and Risk Management; Capital and Ownership Structure; Value of Firms; Goodwill
  • G33 - Bankruptcy; Liquidation
  • G34 - Mergers; Acquisitions; Restructuring; Corporate Governance
  • G38 - Government Policy and Regulation
  • Browse content in G4 - Behavioral Finance
  • G40 - General
  • G41 - Role and Effects of Psychological, Emotional, Social, and Cognitive Factors on Decision Making in Financial Markets
  • Browse content in G5 - Household Finance
  • G50 - General
  • G51 - Household Saving, Borrowing, Debt, and Wealth
  • Browse content in H - Public Economics
  • Browse content in H0 - General
  • H00 - General
  • Browse content in H1 - Structure and Scope of Government
  • H10 - General
  • H11 - Structure, Scope, and Performance of Government
  • Browse content in H2 - Taxation, Subsidies, and Revenue
  • H20 - General
  • H21 - Efficiency; Optimal Taxation
  • H22 - Incidence
  • H23 - Externalities; Redistributive Effects; Environmental Taxes and Subsidies
  • H24 - Personal Income and Other Nonbusiness Taxes and Subsidies; includes inheritance and gift taxes
  • H25 - Business Taxes and Subsidies
  • H26 - Tax Evasion and Avoidance
  • Browse content in H3 - Fiscal Policies and Behavior of Economic Agents
  • H31 - Household
  • Browse content in H4 - Publicly Provided Goods
  • H40 - General
  • H41 - Public Goods
  • H42 - Publicly Provided Private Goods
  • H44 - Publicly Provided Goods: Mixed Markets
  • Browse content in H5 - National Government Expenditures and Related Policies
  • H50 - General
  • H51 - Government Expenditures and Health
  • H52 - Government Expenditures and Education
  • H53 - Government Expenditures and Welfare Programs
  • H54 - Infrastructures; Other Public Investment and Capital Stock
  • H55 - Social Security and Public Pensions
  • H56 - National Security and War
  • H57 - Procurement
  • Browse content in H6 - National Budget, Deficit, and Debt
  • H63 - Debt; Debt Management; Sovereign Debt
  • Browse content in H7 - State and Local Government; Intergovernmental Relations
  • H70 - General
  • H71 - State and Local Taxation, Subsidies, and Revenue
  • H73 - Interjurisdictional Differentials and Their Effects
  • H75 - State and Local Government: Health; Education; Welfare; Public Pensions
  • H76 - State and Local Government: Other Expenditure Categories
  • H77 - Intergovernmental Relations; Federalism; Secession
  • Browse content in H8 - Miscellaneous Issues
  • H81 - Governmental Loans; Loan Guarantees; Credits; Grants; Bailouts
  • H83 - Public Administration; Public Sector Accounting and Audits
  • H87 - International Fiscal Issues; International Public Goods
  • Browse content in I - Health, Education, and Welfare
  • Browse content in I0 - General
  • I00 - General
  • Browse content in I1 - Health
  • I10 - General
  • I11 - Analysis of Health Care Markets
  • I12 - Health Behavior
  • I13 - Health Insurance, Public and Private
  • I14 - Health and Inequality
  • I15 - Health and Economic Development
  • I18 - Government Policy; Regulation; Public Health
  • Browse content in I2 - Education and Research Institutions
  • I20 - General
  • I21 - Analysis of Education
  • I22 - Educational Finance; Financial Aid
  • I23 - Higher Education; Research Institutions
  • I24 - Education and Inequality
  • I25 - Education and Economic Development
  • I26 - Returns to Education
  • I28 - Government Policy
  • Browse content in I3 - Welfare, Well-Being, and Poverty
  • I30 - General
  • I31 - General Welfare
  • I32 - Measurement and Analysis of Poverty
  • I38 - Government Policy; Provision and Effects of Welfare Programs
  • Browse content in J - Labor and Demographic Economics
  • Browse content in J0 - General
  • J00 - General
  • J01 - Labor Economics: General
  • J08 - Labor Economics Policies
  • Browse content in J1 - Demographic Economics
  • J10 - General
  • J12 - Marriage; Marital Dissolution; Family Structure; Domestic Abuse
  • J13 - Fertility; Family Planning; Child Care; Children; Youth
  • J14 - Economics of the Elderly; Economics of the Handicapped; Non-Labor Market Discrimination
  • J15 - Economics of Minorities, Races, Indigenous Peoples, and Immigrants; Non-labor Discrimination
  • J16 - Economics of Gender; Non-labor Discrimination
  • J18 - Public Policy
  • Browse content in J2 - Demand and Supply of Labor
  • J20 - General
  • J21 - Labor Force and Employment, Size, and Structure
  • J22 - Time Allocation and Labor Supply
  • J23 - Labor Demand
  • J24 - Human Capital; Skills; Occupational Choice; Labor Productivity
  • Browse content in J3 - Wages, Compensation, and Labor Costs
  • J30 - General
  • J31 - Wage Level and Structure; Wage Differentials
  • J33 - Compensation Packages; Payment Methods
  • J38 - Public Policy
  • Browse content in J4 - Particular Labor Markets
  • J40 - General
  • J42 - Monopsony; Segmented Labor Markets
  • J44 - Professional Labor Markets; Occupational Licensing
  • J45 - Public Sector Labor Markets
  • J48 - Public Policy
  • J49 - Other
  • Browse content in J5 - Labor-Management Relations, Trade Unions, and Collective Bargaining
  • J50 - General
  • J51 - Trade Unions: Objectives, Structure, and Effects
  • J53 - Labor-Management Relations; Industrial Jurisprudence
  • Browse content in J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers
  • J60 - General
  • J61 - Geographic Labor Mobility; Immigrant Workers
  • J62 - Job, Occupational, and Intergenerational Mobility
  • J63 - Turnover; Vacancies; Layoffs
  • J64 - Unemployment: Models, Duration, Incidence, and Job Search
  • J65 - Unemployment Insurance; Severance Pay; Plant Closings
  • J68 - Public Policy
  • Browse content in J7 - Labor Discrimination
  • J71 - Discrimination
  • J78 - Public Policy
  • Browse content in J8 - Labor Standards: National and International
  • J81 - Working Conditions
  • J88 - Public Policy
  • Browse content in K - Law and Economics
  • Browse content in K0 - General
  • K00 - General
  • Browse content in K1 - Basic Areas of Law
  • K14 - Criminal Law
  • K2 - Regulation and Business Law
  • Browse content in K3 - Other Substantive Areas of Law
  • K31 - Labor Law
  • Browse content in K4 - Legal Procedure, the Legal System, and Illegal Behavior
  • K40 - General
  • K41 - Litigation Process
  • K42 - Illegal Behavior and the Enforcement of Law
  • Browse content in L - Industrial Organization
  • Browse content in L0 - General
  • L00 - General
  • Browse content in L1 - Market Structure, Firm Strategy, and Market Performance
  • L10 - General
  • L11 - Production, Pricing, and Market Structure; Size Distribution of Firms
  • L13 - Oligopoly and Other Imperfect Markets
  • L14 - Transactional Relationships; Contracts and Reputation; Networks
  • L15 - Information and Product Quality; Standardization and Compatibility
  • L16 - Industrial Organization and Macroeconomics: Industrial Structure and Structural Change; Industrial Price Indices
  • L19 - Other
  • Browse content in L2 - Firm Objectives, Organization, and Behavior
  • L21 - Business Objectives of the Firm
  • L22 - Firm Organization and Market Structure
  • L23 - Organization of Production
  • L24 - Contracting Out; Joint Ventures; Technology Licensing
  • L25 - Firm Performance: Size, Diversification, and Scope
  • L26 - Entrepreneurship
  • Browse content in L3 - Nonprofit Organizations and Public Enterprise
  • L33 - Comparison of Public and Private Enterprises and Nonprofit Institutions; Privatization; Contracting Out
  • Browse content in L4 - Antitrust Issues and Policies
  • L40 - General
  • L41 - Monopolization; Horizontal Anticompetitive Practices
  • L42 - Vertical Restraints; Resale Price Maintenance; Quantity Discounts
  • Browse content in L5 - Regulation and Industrial Policy
  • L50 - General
  • L51 - Economics of Regulation
  • Browse content in L6 - Industry Studies: Manufacturing
  • L60 - General
  • L62 - Automobiles; Other Transportation Equipment; Related Parts and Equipment
  • L63 - Microelectronics; Computers; Communications Equipment
  • L66 - Food; Beverages; Cosmetics; Tobacco; Wine and Spirits
  • Browse content in L7 - Industry Studies: Primary Products and Construction
  • L71 - Mining, Extraction, and Refining: Hydrocarbon Fuels
  • L73 - Forest Products
  • Browse content in L8 - Industry Studies: Services
  • L81 - Retail and Wholesale Trade; e-Commerce
  • L83 - Sports; Gambling; Recreation; Tourism
  • L84 - Personal, Professional, and Business Services
  • L86 - Information and Internet Services; Computer Software
  • Browse content in L9 - Industry Studies: Transportation and Utilities
  • L91 - Transportation: General
  • L93 - Air Transportation
  • L94 - Electric Utilities
  • Browse content in M - Business Administration and Business Economics; Marketing; Accounting; Personnel Economics
  • Browse content in M1 - Business Administration
  • M11 - Production Management
  • M12 - Personnel Management; Executives; Executive Compensation
  • M14 - Corporate Culture; Social Responsibility
  • Browse content in M2 - Business Economics
  • M21 - Business Economics
  • Browse content in M3 - Marketing and Advertising
  • M31 - Marketing
  • M37 - Advertising
  • Browse content in M4 - Accounting and Auditing
  • M42 - Auditing
  • M48 - Government Policy and Regulation
  • Browse content in M5 - Personnel Economics
  • M50 - General
  • M51 - Firm Employment Decisions; Promotions
  • M52 - Compensation and Compensation Methods and Their Effects
  • M53 - Training
  • M54 - Labor Management
  • Browse content in N - Economic History
  • Browse content in N0 - General
  • N00 - General
  • N01 - Development of the Discipline: Historiographical; Sources and Methods
  • Browse content in N1 - Macroeconomics and Monetary Economics; Industrial Structure; Growth; Fluctuations
  • N10 - General, International, or Comparative
  • N11 - U.S.; Canada: Pre-1913
  • N12 - U.S.; Canada: 1913-
  • N13 - Europe: Pre-1913
  • N17 - Africa; Oceania
  • Browse content in N2 - Financial Markets and Institutions
  • N20 - General, International, or Comparative
  • N22 - U.S.; Canada: 1913-
  • N23 - Europe: Pre-1913
  • Browse content in N3 - Labor and Consumers, Demography, Education, Health, Welfare, Income, Wealth, Religion, and Philanthropy
  • N30 - General, International, or Comparative
  • N31 - U.S.; Canada: Pre-1913
  • N32 - U.S.; Canada: 1913-
  • N33 - Europe: Pre-1913
  • N34 - Europe: 1913-
  • N36 - Latin America; Caribbean
  • N37 - Africa; Oceania
  • Browse content in N4 - Government, War, Law, International Relations, and Regulation
  • N40 - General, International, or Comparative
  • N41 - U.S.; Canada: Pre-1913
  • N42 - U.S.; Canada: 1913-
  • N43 - Europe: Pre-1913
  • N44 - Europe: 1913-
  • N45 - Asia including Middle East
  • N47 - Africa; Oceania
  • Browse content in N5 - Agriculture, Natural Resources, Environment, and Extractive Industries
  • N50 - General, International, or Comparative
  • N51 - U.S.; Canada: Pre-1913
  • Browse content in N6 - Manufacturing and Construction
  • N63 - Europe: Pre-1913
  • Browse content in N7 - Transport, Trade, Energy, Technology, and Other Services
  • N71 - U.S.; Canada: Pre-1913
  • Browse content in N8 - Micro-Business History
  • N82 - U.S.; Canada: 1913-
  • Browse content in N9 - Regional and Urban History
  • N91 - U.S.; Canada: Pre-1913
  • N92 - U.S.; Canada: 1913-
  • N93 - Europe: Pre-1913
  • N94 - Europe: 1913-
  • Browse content in O - Economic Development, Innovation, Technological Change, and Growth
  • Browse content in O1 - Economic Development
  • O10 - General
  • O11 - Macroeconomic Analyses of Economic Development
  • O12 - Microeconomic Analyses of Economic Development
  • O13 - Agriculture; Natural Resources; Energy; Environment; Other Primary Products
  • O14 - Industrialization; Manufacturing and Service Industries; Choice of Technology
  • O15 - Human Resources; Human Development; Income Distribution; Migration
  • O16 - Financial Markets; Saving and Capital Investment; Corporate Finance and Governance
  • O17 - Formal and Informal Sectors; Shadow Economy; Institutional Arrangements
  • O18 - Urban, Rural, Regional, and Transportation Analysis; Housing; Infrastructure
  • O19 - International Linkages to Development; Role of International Organizations
  • Browse content in O2 - Development Planning and Policy
  • O23 - Fiscal and Monetary Policy in Development
  • O25 - Industrial Policy
  • Browse content in O3 - Innovation; Research and Development; Technological Change; Intellectual Property Rights
  • O30 - General
  • O31 - Innovation and Invention: Processes and Incentives
  • O32 - Management of Technological Innovation and R&D
  • O33 - Technological Change: Choices and Consequences; Diffusion Processes
  • O34 - Intellectual Property and Intellectual Capital
  • O38 - Government Policy
  • Browse content in O4 - Economic Growth and Aggregate Productivity
  • O40 - General
  • O41 - One, Two, and Multisector Growth Models
  • O43 - Institutions and Growth
  • O44 - Environment and Growth
  • O47 - Empirical Studies of Economic Growth; Aggregate Productivity; Cross-Country Output Convergence
  • Browse content in O5 - Economywide Country Studies
  • O52 - Europe
  • O53 - Asia including Middle East
  • O55 - Africa
  • Browse content in P - Economic Systems
  • Browse content in P0 - General
  • P00 - General
  • Browse content in P1 - Capitalist Systems
  • P10 - General
  • P16 - Political Economy
  • P17 - Performance and Prospects
  • P18 - Energy: Environment
  • Browse content in P2 - Socialist Systems and Transitional Economies
  • P26 - Political Economy; Property Rights
  • Browse content in P3 - Socialist Institutions and Their Transitions
  • P37 - Legal Institutions; Illegal Behavior
  • Browse content in P4 - Other Economic Systems
  • P48 - Political Economy; Legal Institutions; Property Rights; Natural Resources; Energy; Environment; Regional Studies
  • Browse content in P5 - Comparative Economic Systems
  • P51 - Comparative Analysis of Economic Systems
  • Browse content in Q - Agricultural and Natural Resource Economics; Environmental and Ecological Economics
  • Browse content in Q1 - Agriculture
  • Q10 - General
  • Q12 - Micro Analysis of Farm Firms, Farm Households, and Farm Input Markets
  • Q13 - Agricultural Markets and Marketing; Cooperatives; Agribusiness
  • Q14 - Agricultural Finance
  • Q15 - Land Ownership and Tenure; Land Reform; Land Use; Irrigation; Agriculture and Environment
  • Q16 - R&D; Agricultural Technology; Biofuels; Agricultural Extension Services
  • Browse content in Q2 - Renewable Resources and Conservation
  • Q25 - Water
  • Browse content in Q3 - Nonrenewable Resources and Conservation
  • Q32 - Exhaustible Resources and Economic Development
  • Q34 - Natural Resources and Domestic and International Conflicts
  • Browse content in Q4 - Energy
  • Q41 - Demand and Supply; Prices
  • Q48 - Government Policy
  • Browse content in Q5 - Environmental Economics
  • Q50 - General
  • Q51 - Valuation of Environmental Effects
  • Q53 - Air Pollution; Water Pollution; Noise; Hazardous Waste; Solid Waste; Recycling
  • Q54 - Climate; Natural Disasters; Global Warming
  • Q56 - Environment and Development; Environment and Trade; Sustainability; Environmental Accounts and Accounting; Environmental Equity; Population Growth
  • Q58 - Government Policy
  • Browse content in R - Urban, Rural, Regional, Real Estate, and Transportation Economics
  • Browse content in R0 - General
  • R00 - General
  • Browse content in R1 - General Regional Economics
  • R11 - Regional Economic Activity: Growth, Development, Environmental Issues, and Changes
  • R12 - Size and Spatial Distributions of Regional Economic Activity
  • R13 - General Equilibrium and Welfare Economic Analysis of Regional Economies
  • Browse content in R2 - Household Analysis
  • R20 - General
  • R23 - Regional Migration; Regional Labor Markets; Population; Neighborhood Characteristics
  • R28 - Government Policy
  • Browse content in R3 - Real Estate Markets, Spatial Production Analysis, and Firm Location
  • R30 - General
  • R31 - Housing Supply and Markets
  • R38 - Government Policy
  • Browse content in R4 - Transportation Economics
  • R40 - General
  • R41 - Transportation: Demand, Supply, and Congestion; Travel Time; Safety and Accidents; Transportation Noise
  • R48 - Government Pricing and Policy
  • Browse content in Z - Other Special Topics
  • Browse content in Z1 - Cultural Economics; Economic Sociology; Economic Anthropology
  • Z10 - General
  • Z12 - Religion
  • Z13 - Economic Sociology; Economic Anthropology; Social and Economic Stratification
  • Advance Articles
  • Editor's Choice
  • Author Guidelines
  • Submission Site
  • Open Access Options
  • Self-Archiving Policy
  • Why Submit?
  • About The Quarterly Journal of Economics
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Dispatch Dates
  • Journals on Oxford Academic
  • Books on Oxford Academic

Issue Cover

Article Contents

  • I. Introduction
  • II. A Simple Framework for Discovery
  • III. Application and Data
  • IV. The Surprising Importance of the Face
  • V. Algorithm-Human Communication
  • VI. Evaluating These New Hypotheses
  • VII. Conclusion
  • Data Availability

Machine Learning as a Tool for Hypothesis Generation *

Jens Ludwig, Sendhil Mullainathan, Machine Learning as a Tool for Hypothesis Generation, The Quarterly Journal of Economics, Volume 139, Issue 2, May 2024, Pages 751–827, https://doi.org/10.1093/qje/qjad055

While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about whom to jail. We begin with a striking fact: the defendant’s face alone matters greatly for the judge’s jailing decision. In fact, an algorithm given only the pixels in the defendant’s mug shot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: they are not explained by demographics (e.g., race) or existing psychology research, nor are they already known (even if tacitly) to people or experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional data set (e.g., cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our article is that hypothesis generation is a valuable activity, and we hope this encourages future work in this largely “prescientific” stage of science.

Science is curiously asymmetric. New ideas are meticulously tested using data, statistics, and formal models. Yet those ideas originate in a notably less meticulous process involving intuition, inspiration, and creativity. The asymmetry between how ideas are generated versus tested is noteworthy because idea generation is also, at its core, an empirical activity. Creativity begins with “data” (albeit data stored in the mind), which are then “analyzed” (through a purely psychological process of pattern recognition). What feels like inspiration is actually the output of a data analysis run by the human brain. Despite this, idea generation largely happens off stage, before “actual science” begins. 1 Things are likely this way because there is no obvious alternative. The creative process is so human and idiosyncratic that it would seem to resist formalism.

That may be about to change because of two developments. First, human cognition is no longer the only way to notice patterns in the world. Machine learning algorithms can also find patterns, including patterns people might not notice themselves. These algorithms can work not just with structured, tabular data but also with the kinds of inputs that traditionally could only be processed by the mind, like images or text. Second, data on human behavior is exploding: second-by-second price and volume data in asset markets, high-frequency cellphone data on location and usage, CCTV camera and police bodycam footage, news stories, children’s books, the entire text of corporate filings, and so on. The kind of information researchers once relied on for inspiration is now machine readable: what was once solely mental data is increasingly becoming actual data. 2

We suggest that these changes can be leveraged to expand how hypotheses are generated. Currently, researchers do of course look at data to generate hypotheses, as in exploratory data analysis, but this depends on the idiosyncratic creativity of investigators who must decide what statistics to calculate. In contrast, we suggest capitalizing on the capacity of machine learning algorithms to automatically detect patterns, especially ones people might never have considered. A key challenge is that we require hypotheses that are interpretable to people. One important goal of science is to generalize knowledge to new contexts. Predictive patterns in a single data set alone are rarely useful; they become insightful when they can be generalized. Currently, that generalization is done by people, and people can only generalize things they understand. The predictors produced by machine learning algorithms are, however, notoriously opaque—hard-to-decipher “black boxes.” We propose a procedure that integrates these algorithms into a pipeline that results in human-interpretable hypotheses that are both novel and testable.

While our procedure is broadly applicable, we illustrate it in a concrete application: judicial decision making. Specifically we study pretrial decisions about which defendants are jailed versus set free awaiting trial, a decision that by law is supposed to hinge on a prediction of the defendant’s risk (Dobbie and Yang 2021). 3 This is also a substantively interesting application in its own right because of the high stakes involved and mounting evidence that judges make these decisions less than perfectly (Kleinberg et al. 2018; Rambachan et al. 2021; Angelova, Dobbie, and Yang 2023).

We begin with a striking fact. When we build a deep learning model of the judge—one that predicts whether the judge will detain a given defendant—a single factor emerges as having large explanatory power: the defendant’s face. A predictor that uses only the pixels in the defendant’s mug shot explains from one-quarter to nearly one-half of the predictable variation in detention. 4 Defendants whose mug shots fall in the bottom quartile of predicted detention are 20.4 percentage points more likely to be jailed than those in the top quartile. By comparison, the difference in detention rates between those arrested for violent versus nonviolent crimes is 4.8 percentage points. Notice what this finding is and is not. We are not claiming the mug shot predicts defendant behavior; that would be the long-discredited field of phrenology (Schlag 1997). We instead claim the mug shot predicts judge behavior: how the defendant looks correlates strongly with whether the judge chooses to jail them. 5
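The arithmetic behind "explains from one-quarter to nearly one-half of the predictable variation" is a ratio of two R² values: the R² of a mug-shot-only predictor divided by the R² of a predictor using all available inputs. A minimal sketch of that calculation, using entirely synthetic stand-ins (the data-generating process and names like `mug_pred` and `full_pred` are invented for illustration, not taken from the paper's replication code):

```python
import numpy as np

rng = np.random.default_rng(0)


def r_squared(y, yhat):
    """Fraction of the variance in y explained by predictions yhat."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy stand-ins: a binary detention outcome and two predictors of it.
n = 10_000
latent = rng.normal(size=n)                           # "true" detention propensity
y = (latent + rng.normal(size=n) > 0).astype(float)   # judge's jail decision
full_pred = 0.5 + 0.2 * latent                        # predictor using all inputs
mug_pred = 0.5 + 0.1 * latent + 0.1 * rng.normal(size=n)  # noisier, pixels only

# Mug-shot share of the *predictable* variation: ratio of the two R^2 values.
share = r_squared(y, mug_pred) / r_squared(y, full_pred)
print(f"mug-shot share of predictable variation: {share:.2f}")
```

The point of the ratio is that detention is only partly predictable at all; the mug shot's share is measured against that predictable component, not against total variance.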

Has the algorithm found something new in the pixels of the mug shot or simply rediscovered something long known or intuitively understood? After all, psychologists have been studying people’s reactions to faces for at least 100 years (Todorov et al. 2015; Todorov and Oh 2021), while economists have shown that judges are influenced by factors (like race) that can be seen from someone’s face (Arnold, Dobbie, and Yang 2018; Arnold, Dobbie, and Hull 2020). When we control for age, gender, race, skin color, and even the facial features suggested by previous psychology research (dominance, trustworthiness, attractiveness, and competence), none of these factors (individually or jointly) meaningfully diminishes the algorithm’s predictive power (see Figure I, Panel A). It is perhaps worth noting that the algorithm on its own does rediscover some of the signal from these features: in fact, collectively these known features explain 22.3% of the variation in predicted detention (see Figure I, Panel B). The key point is that the algorithm has discovered a great deal more as well.
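This kind of control exercise can be mimicked with nested linear regressions: if adding the algorithm's prediction to a regression of detention on known features raises R² substantially, those features cannot account for the algorithm's signal. A hedged sketch on synthetic data (the features, coefficients, and variable names are illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Toy data: known features (stand-ins for age, race, etc.) plus a component
# of the algorithmic score that the known features do not capture.
demog = rng.normal(size=(n, 4))
extra = rng.normal(size=n)                     # signal only the algorithm sees
algo_score = demog @ np.array([0.3, 0.2, 0.0, 0.1]) + extra
detain = (algo_score + rng.normal(size=n) > 0).astype(float)


def rsq(X, y):
    """In-sample R^2 from an OLS regression of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()


r2_demog = rsq(demog, detain)
r2_both = rsq(np.column_stack([demog, algo_score]), detain)
print(f"R^2, known features only:      {r2_demog:.3f}")
print(f"R^2, adding algorithm's score: {r2_both:.3f}")
```

A large gap between the two R² values is the regression analogue of Figure I, Panel A: the algorithm's prediction retains explanatory power after the known features are controlled for.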

Figure I: Correlates of Judge Detention Decision and Algorithmic Prediction of Judge Decision

Panel A summarizes the explanatory power of a regression model in explaining judge detention decisions, controlling for the different explanatory variables indicated at left (shaded tiles), either on their own (dark circles) or together with the algorithmic prediction of the judge decisions (triangles). Each row represents a different regression specification. By “other facial features,” we mean variables that previous psychology research suggests matter for how faces influence people’s reactions to others (dominance, trustworthiness, competence, and attractiveness). Ninety-five percent confidence intervals around our R² estimates come from drawing 10,000 bootstrap samples from the validation data set. Panel B shows the relationship between the different explanatory variables as indicated at left by the shaded tiles with the algorithmic prediction itself as the outcome variable in the regressions. Panel C examines the correlation with judge decisions of the two novel hypotheses generated by our procedure about what facial features affect judge detention decisions: well-groomed and heavy-faced.

Perhaps we should control for something else? Figuring out that “something else” is itself a form of hypothesis generation. To avoid a possibly endless—and misleading—process of generating other controls, we take a different approach. We show mug shots to subjects and ask them to guess whom the judge will detain and incentivize them for accuracy. These guesses summarize the facial features people readily (if implicitly) believe influence jailing. Although subjects are modestly good at this task, the algorithm is much better. It remains highly predictive even after controlling for these guesses. The algorithm seems to have found something novel beyond what scientists have previously hypothesized and beyond whatever patterns people can even recognize in data (whether or not they can articulate them).

What, then, are the novel facial features the algorithm has discovered? If we are unable to answer that question, we will have simply replaced one black box (the judge’s mind) with another (an algorithmic model of the judge’s mind). We propose a solution whereby the algorithm can communicate what it “sees.” Specifically, our procedure begins with a mug shot and “morphs” it to create a mug shot that maximally increases (or decreases) the algorithm’s predicted detention probability. The result is pairs of synthetic mug shots that can be examined to understand and articulate what differs within the pairs. The algorithm discovers, and people name that discovery. In principle we could have just shown subjects actual mug shots with higher versus lower predicted detention odds. But faces are so rich that between any pair of actual mug shots, many things will happen to be different and most will be unrelated to detention (akin to the curse of dimensionality). Simply looking at pairs of actual faces can, as a result, lead to many spurious observations. Morphing creates counterfactual synthetic images that are as similar as possible except with respect to detention odds, to minimize extraneous differences and help focus on what truly matters for judge detention decisions.
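The morphing step can be sketched in miniature. The sketch below is illustrative only and makes strong simplifying assumptions: the detention predictor is a toy logistic model over an 8-dimensional latent code, and the latent code itself stands in for the image. In the actual procedure, a generative model decodes each latent code into a synthetic mug shot and the predictor is a deep network; all names and dimensions here are hypothetical. The mechanics are the same, though: take gradient steps on the latent code to push the predicted detention probability up or down.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def morph(z0, w, direction=+1, steps=50, lr=0.1):
    """Move a latent code z0 along the gradient of the predicted
    detention probability m(z) = sigmoid(w.z), up (+1) or down (-1).
    In the real pipeline z would be decoded into a synthetic mug shot
    by a generative model; here the latent code stands in for the image."""
    z = z0.copy()
    for _ in range(steps):
        p = sigmoid(w @ z)
        grad = p * (1 - p) * w          # d m / d z for the logistic model
        z += direction * lr * grad
    return z

rng = np.random.default_rng(0)
w = rng.normal(size=8)                  # toy predictor weights
z0 = rng.normal(size=8)                 # seed latent code ("actual mug shot")

z_hi = morph(z0, w, +1)                 # morph toward higher detention risk
z_lo = morph(z0, w, -1)                 # morph toward lower detention risk

p0, p_hi, p_lo = (sigmoid(w @ z) for z in (z0, z_hi, z_lo))
print(round(float(p_lo), 3), round(float(p0), 3), round(float(p_hi), 3))
```

Pairing the decoded versions of `z_hi` and `z_lo` would yield counterfactual image pairs that differ mainly in predicted detention risk, which is what study subjects are then asked to inspect.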

Importantly, we do not generate hypotheses by looking at the morphs ourselves; instead, they are shown to independent study subjects (MTurk or Prolific workers) in an experimental design. Specifically, we showed pairs of morphed images and asked participants to guess which image the algorithm predicts to have higher detention risk. Subjects were given both incentives and feedback, so they had motivation and opportunity to learn the underlying patterns. While subjects initially guess the judge’s decision correctly from these morphed mug shots at about the same rate as they do when looking at “raw data,” that is, actual mug shots (modestly above the 50% random guessing mark), they quickly learn from these morphed images what the algorithm is seeing and reach an accuracy of nearly 70%. At the end, participants are asked to put words to the differences they see across images in each pair, that is, to name what they think are the key facial features the algorithm is relying on to predict judge decisions. Comfortingly, there is substantial agreement on what subjects see: a sizable share of subjects all name the same feature. To verify whether the feature they identify is used by the algorithm, a separate sample of subjects independently coded mug shots for this new feature. We show that the new feature is indeed correlated with the algorithm’s predictions. What subjects think they’re seeing is indeed what the algorithm is also “seeing.”

Having discovered a single feature, we can iterate the procedure—the first feature explains only a fraction of what the algorithm has captured, suggesting there are many other factors to be discovered. We again produce morphs, but this time hold the first feature constant: that is, we orthogonalize so that the pairs of morphs do not differ on the first feature. When these new morphs are shown to subjects, they consistently name a second feature, which again correlates with the algorithm’s prediction. Both features are quite important. They explain a far larger share of what the algorithm sees than all the other variables (including race and skin color) besides gender. These results establish our main goals: showing that the procedure produces meaningful communication, and that it can be iterated.
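Holding the first feature constant amounts to projecting it out of the morph direction. Below is a minimal numpy sketch under the simplifying assumption that a discovered feature corresponds to a single direction in latent space; the vectors and dimensions are hypothetical stand-ins for the paper’s actual latent representations.

```python
import numpy as np

def project_out(direction, feature_dir):
    """Remove from a morph direction the component lying along a
    previously discovered feature, so new morphs hold that feature
    (approximately) constant."""
    f = feature_dir / np.linalg.norm(feature_dir)
    return direction - (direction @ f) * f

rng = np.random.default_rng(1)
w = rng.normal(size=8)    # gradient direction of the detention predictor
f1 = rng.normal(size=8)   # hypothetical latent direction for feature 1

w_orth = project_out(w, f1)
print(float(w_orth @ f1))  # ~0: feature 1 no longer changes along the morph
print(float(w_orth @ w))   # positive: predictor signal is retained
```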

What are the two discovered features? The first can be called “well-groomed” (e.g., tidy, clean, groomed, versus unkempt, disheveled, sloppy look), and the second can be called “heavy-faced” (e.g., wide facial shape, puffier face, wider face, rounder face, heavier). These features are not just predictive of what the algorithm sees, but also of what judges actually do ( Figure I , Panel C). We find that both well-groomed and heavy-faced defendants are more likely to be released, even controlling for demographic features and known facial features from psychology. Detention rates of defendants in the top and bottom quartile of well-groomedness differ by 5.5 percentage points (24% of the base rate), while the top versus bottom quartile difference in heavy-facedness is 7 percentage points (about 30% of the base rate). Both differences are larger than the 4.8 percentage point detention rate difference between those arrested for violent versus nonviolent crimes. Not only are these magnitudes substantial, these hypotheses are novel even to practitioners who work in the criminal justice system (in a public defender’s office and a legal aid society).
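The quartile comparisons above can be computed directly. The following sketch uses simulated ratings and outcomes (the real data are not reproduced here), but the computation is the generic one: compare detention rates in the bottom versus top quartile of a feature score.

```python
import numpy as np

def quartile_gap(score, detained):
    """Detention-rate difference between defendants in the bottom vs. top
    quartile of a feature score (e.g., a well-groomed rating)."""
    lo, hi = np.quantile(score, [0.25, 0.75])
    return float(detained[score <= lo].mean() - detained[score >= hi].mean())

rng = np.random.default_rng(2)
n = 20_000
groomed = rng.uniform(size=n)            # hypothetical 0-1 grooming rating
p_detain = 0.30 - 0.10 * groomed         # better groomed -> detained less often
detained = rng.binomial(1, p_detain)     # simulated judge decisions

gap = quartile_gap(groomed, detained)
print(round(gap, 3))                     # bottom-quartile minus top-quartile rate
```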

Establishing whether these hypotheses are truly causally related to judge decisions is obviously beyond the scope of the present article. But we nonetheless present a few additional findings that are at least suggestive. These novel features do not appear to be simply proxies for factors like substance abuse, mental health, or socioeconomic status. Moreover, we carried out a lab experiment in which subjects are asked to make hypothetical pretrial release decisions as if they were a judge. They are shown information about criminal records (current charge, prior arrests) along with mug shots that are randomly morphed in the direction of higher or lower values of well-groomed (or heavy-faced). Subjects tend to detain those with higher-risk structured variables (criminal records), all else equal, suggesting they are taking the task seriously. These same subjects, though, are also more likely to detain defendants who are less heavy-faced or well-groomed, even though these were randomly assigned.

Ultimately, though, this is not a study about well-groomed or heavy-faced defendants, nor are its implications limited to faces or judges. It develops a general procedure that can be applied wherever behavior can be predicted using rich (especially high-dimensional) data. Development of such a procedure has required overcoming two key challenges.

First, to generate interpretable hypotheses, we must overcome the notorious black box nature of most machine learning algorithms. Unlike with a regression, one cannot simply inspect the coefficients. A modern deep-learning algorithm, for example, can have tens of millions of parameters. Noninspectability is especially problematic when the data are rich and high dimensional since the parameters are associated with primitives such as pixels. This problem of interpretation is fundamental and remains an active area of research. 6 Part of our procedure here draws on the recent literature in computer science that uses generative models to create counterfactual explanations. Most of those methods are designed for AI applications that seek to automate tasks humans do nearly perfectly, like image classification, where predictability of the outcome (is this image of a dog or a cat?) is typically quite high. 7 Interpretability techniques are used to ensure the algorithm is not picking up on spurious signal. 8 We developed our method, which has similar conceptual underpinnings to this existing literature, for social science applications where the outcome (human behavior) is typically more challenging to predict. 9 To what degree existing methods (as they currently stand or with some modification) could perform as well or better in social science applications like ours is a question we leave to future work.

Second, we must overcome what we might call the Rorschach test problem. Suppose we, the authors, were to look at these morphs and generate a hypothesis. We would not know if the procedure played any meaningful role. Perhaps the morphs, like ink blots, are merely canvases onto which we project our creativity. 10 Put differently, a single research team’s idiosyncratic judgments lack the kind of replicability we desire of a scientific procedure. To overcome this problem, it is key that we use independent (nonresearcher) subjects to inspect the morphs. The fact that a sizable share of subjects all name the same discovery suggests that human-algorithm communication has occurred and the procedure is replicable, rather than reflecting some unique spark of creativity.

At the same time, the fact that our procedure is not fully automatic implies that it will be shaped and constrained by people. Human participants are needed to name the discoveries. So whole new concepts that humans do not yet understand cannot be produced. Such breakthroughs clearly happen (e.g., gravity or probability) but are beyond the scope of procedures like ours. People also play a crucial role in curating the data the algorithm sees. Here, for example, we chose to include mug shots. The creative acquisition of rich data is an important human input into this hypothesis generation procedure. 11

Our procedure can be applied to a broad range of settings and will be particularly useful for data that are not already intrinsically interpretable. Many data sets contain a few variables that already have clear, fixed meanings and are unlikely to lead to novel discoveries. In contrast, images, text, and time series are rich high-dimensional data with many possible interpretations. Just as there is an ocean of plausible facial features, these sorts of data contain a large set of potential hypotheses that an algorithm can search through. Such data are increasingly available and used by economists, including news headlines, legislative deliberations, annual corporate reports, Federal Open Market Committee statements, Google searches, student essays, résumés, court transcripts, doctors’ notes, satellite images, housing photos, and medical images. Our procedure could, for example, raise hypotheses about what kinds of news lead to over- or underreaction of stock prices, which features of a job interview increase racial disparities, or what features of an X-ray drive misdiagnosis.

Central to this work is the belief that hypothesis generation is a valuable activity in and of itself. Beyond whatever the value might be of our specific procedure and empirical application, we hope these results also inspire greater attention to this traditionally “prescientific” stage of science.

We develop a simple framework to clarify the goals of hypothesis generation and how it differs from testing, how algorithms might help, and how our specific approach to algorithmic hypothesis generation differs from existing methods. 12

II.A. The Goals of Hypothesis Generation

What criteria should we use for assessing hypothesis generation procedures? Two common goals for hypothesis generation are ones that we ensure ex post. The first is novelty. In our application, we aim to orthogonalize against known factors, recognizing that it may be hard to orthogonalize against all known hypotheses. The second is that hypotheses be testable ( Popper 2002 ). But what can be tested is hard to define ex ante, in part because it depends on the specific hypothesis and the potential experimental setups. Creative empiricists over time often find ways to test hypotheses that previously seemed untestable. 13 To these, we add two more: interpretability and empirical plausibility.

What do we mean by empirically plausible? Let y be some outcome of interest, which for simplicity we assume is binary, and let h(x) be some hypothesis that maps the features of each instance, x, to [0, 1]. By empirical plausibility we mean some correlation between y and h(x). Our ultimate aim is to uncover causal relationships. But causality can only be known after causal testing. That raises the question of how to come up with ideas worth causally testing, and how we would recognize them when we see them. Many true hypotheses need not be visible in raw correlations. Those can only be identified with background knowledge (e.g., theory). Other procedures would be required to surface those. Our focus here is on searching for true hypotheses that are visible in raw correlations. Of course not every correlation will turn out to be a true hypothesis, but even in those cases, generating such hypotheses and then invalidating them can be a valuable activity. Debunking spurious correlations has long been one of the most useful roles of empirical work. Understanding what confounders produce those correlations can also be useful.
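In code, this notion of empirical plausibility is just an in-sample correlation check between a candidate hypothesis h(x) and the outcome y. The sketch below is a stylized illustration on simulated data; the helper name and the 0.1 correlation cutoff are arbitrary choices for the example, not anything the article proposes.

```python
import numpy as np

def empirically_plausible(y, h_x, threshold=0.1):
    """Stylized check: call a hypothesis h(x) 'empirically plausible'
    if it shows a nontrivial raw correlation with the outcome y.
    The 0.1 cutoff is an arbitrary choice for this example."""
    r = float(np.corrcoef(y, h_x)[0, 1])
    return r, abs(r) > threshold

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)                           # hypothetical feature
y = (x + rng.normal(size=5_000) > 0).astype(float)   # outcome driven by x

h_related = 1.0 / (1.0 + np.exp(-x))     # hypothesis that tracks x
h_unrelated = rng.uniform(size=5_000)    # hypothesis with no link to y

r_good, ok_good = empirically_plausible(y, h_related)
r_null, ok_null = empirically_plausible(y, h_unrelated)
print(round(r_good, 2), ok_good, round(r_null, 2), ok_null)
```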

We care about our final goal for hypothesis generation, interpretability, because science is largely about helping people make forecasts into new contexts, and people can only do that with hypotheses they meaningfully understand. Consider an uninterpretable hypothesis like “this set of defendants is more likely to be jailed than that set,” where we cannot articulate a reason why. From that hypothesis, nothing could be said about a new set of courtroom defendants. In contrast, an interpretable hypothesis like “skin color affects detention” has implications for other samples of defendants and for entirely different settings. We could ask whether skin color also affects, say, police enforcement choices or whether these effects differ by time of day. By virtue of being interpretable, these hypotheses let us use a wider set of knowledge (police may share racial biases; skin color is not as easily detected at night). 14 Interpretable descriptions let us generalize to novel situations, in addition to being easier to communicate to key stakeholders and lending themselves to interpretable solutions.

II.B. Human versus Algorithmic Hypothesis Generation

Human hypothesis generation has the advantage of generating hypotheses that are interpretable. By construction, the ideas that humans come up with are understandable by humans. But as a procedure for generating new ideas, human creativity has the drawback of often being idiosyncratic and not necessarily replicable. A novel hypothesis is novel exactly because one person noticed it when many others did not. A large body of evidence shows that human judgments have a great deal of “noise.” It is not just that different people draw different conclusions from the same observations, but the same person may notice different things at different times ( Kahneman, Sibony, and Sunstein 2022 ). A large body of psychology research shows that people typically are not able to introspect and understand why they notice specific things at the times they do notice them. 15

There is also no guarantee that human-generated hypotheses need be empirically plausible. The intuition is related to “overfitting.” Suppose that people look at a subset of all data and look for something that differentiates positive (y = 1) from negative (y = 0) cases. Even with no noise in y, there is randomness in which observations are in the data. That can lead to idiosyncratic differences between y = 0 and y = 1 cases. As the number of comprehensible hypotheses gets large, there is a “curse of dimensionality”: many plausible hypotheses for these idiosyncratic differences. That is, many different hypotheses can look good in sample but need not work out of sample. 16
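This overfitting intuition is easy to demonstrate by simulation: with many candidate hypotheses and a modest sample, the best-looking hypothesis in sample can appear far better than chance even when no hypothesis has any true relationship to the outcome. The numbers below are illustrative choices, not drawn from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 1_000                    # small sample, many candidate hypotheses
y = rng.binomial(1, 0.5, size=n)     # pure-noise outcome: nothing to find

# Each candidate "hypothesis" is a random binary feature with no true
# relationship to y; its in-sample accuracy is how often it matches y.
H = rng.binomial(1, 0.5, size=(k, n))
in_sample_acc = (H == y).mean(axis=1)
best = int(np.argmax(in_sample_acc))
print(round(float(in_sample_acc[best]), 2))   # looks far better than chance

# The winning feature has no real link to y, so out of sample its values
# on fresh cases are just new coin flips: accuracy falls back toward 0.5.
y_new = rng.binomial(1, 0.5, size=10_000)
h_new = rng.binomial(1, 0.5, size=10_000)
oos_acc = float((h_new == y_new).mean())
print(round(oos_acc, 2))
```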

In contrast, supervised learning tools in machine learning are designed to generate predictions in new (out-of-sample) data. 17 That is, algorithms generate hypotheses that are empirically plausible by construction. 18 Moreover, machine learning can detect patterns in data that humans cannot. Algorithms can notice, for example, that livestock all tend to be oriented north ( Begall et al. 2008 ), whether someone is about to have a heart attack based on subtle indications in an electrocardiogram ( Mullainathan and Obermeyer 2022 ), or that a piece of machinery is about to break ( Mobley 2002 ). We call these machine learning prediction functions m(x), which for a binary outcome y map to [0, 1].

The challenge is that most m(x) are not interpretable. For this type of statistical model to yield an interpretable hypothesis, its parameters must be interpretable. That can happen in some simple cases. For example, if we had a data set where each dimension of x was interpretable (such as individual structured variables in a tabular data set) and we used a predictor such as OLS (or LASSO), we could just read the hypotheses from the nonzero coefficients: which variables are significant? Even in that case, interpretation is challenging because machine learning tools, built to generate accurate predictions rather than apportion explanatory power across explanatory variables, yield coefficients that can be unstable across realizations of the data ( Mullainathan and Spiess 2017 ). 19 Often interpretation is much less straightforward than that. If x is an image, text, or time series, the estimated models (such as convolutional neural networks) can have literally millions of parameters. The models are defined on granular inputs with no particular meaning: if we knew m(x) weighted a particular pixel, what have we learned? In these cases, the estimated model m(x) is not interpretable. Our focus is on these contexts where algorithms, as black-box models, are not readily interpreted.
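The coefficient-instability point can be illustrated with plain OLS rather than a deep network: with nearly collinear predictors, individual coefficients swing wildly across bootstrap resamples even though the fitted predictions barely move. This is a sketch of the general phenomenon on simulated data, not the paper’s estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
y = x1 + x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

x_test = np.array([1.0, 1.0])         # a point at which we predict
coefs, preds = [], []
for _ in range(200):                  # bootstrap resamples of the data
    idx = rng.integers(0, n, size=n)
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    coefs.append(b)
    preds.append(float(x_test @ b))

coefs = np.array(coefs)
preds = np.array(preds)
print(coefs.std(axis=0))   # individual coefficients swing widely...
print(preds.std())         # ...while the fitted prediction barely moves
```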

Ideally one might marry people’s unique knowledge of what is comprehensible with an algorithm’s superior capacity to find meaningful correlations in data: to have the algorithm discover new signal and then have humans name that discovery. How to do so is not straightforward. We might imagine formalizing the set of interpretable prediction functions, and then focus on creating machine learning techniques that search over functions in that set. But mathematically characterizing those functions is typically not possible. Or we might consider seeking insight from a low-dimensional representation of face space, or “eigenfaces,” which are a common teaching tool for principal components analysis ( Sirovich and Kirby 1987 ). But those turn out not to provide much useful insight for our purposes. 20 In some sense it is obvious why: the subset of actual faces is unlikely to be a linear subspace of the space of pixels. If we took two faces and linearly interpolated them, the resulting image would not look like a face. Some other approach is needed. We build on methods in computer science that use generative models to generate counterfactual explanations.

II.C. Related Methods

Our hypothesis generation procedure is part of a growing literature that aims to integrate machine learning into the way science is conducted. A common use (outside of economics) is in what could be called “closed world problems”: situations where the fundamental laws are known, but drawing out predictions is computationally hard. For example, the biochemical rules of how proteins fold are known, but it is hard to predict the final shape of a protein. Machine learning has provided fundamental breakthroughs, in effect by making very hard-to-compute outcomes computable in a feasible timeframe. 21

Progress has been far more limited with applications where the relationship between x and y is unknown (“open world” problems), like human behavior. First, machine learning here has been useful for generating unexpected findings, although these are not hypotheses themselves. Pierson et al. (2021) show that a deep-learning algorithm is better able to predict patient pain from an X-ray than clinicians can: there are physical knee defects that medicine currently does not understand. But that study is not able to isolate what those defects are. 22 Second, machine learning has also been used to explore investigator-generated hypotheses, such as Mullainathan and Obermeyer (2022) , who examine whether physicians suffer from limited attention when diagnosing patients. 23

Finally, a few papers take on the same problem that we do. Fudenberg and Liang (2019) and Peterson et al. (2021) have used algorithms to predict play in games and choices between lotteries. They inspected those algorithms to produce their insights. Similarly, Kleinberg et al. (2018) and Sunstein (2021) use algorithmic models of judges and inspect those models to generate hypotheses. 24 Our proposal builds on these papers. Rather than focusing on generating an insight for a specific application, we suggest a procedure that can be broadly used for many applications. Importantly, our procedure does not rely on researcher inspection of algorithmic output. When an expert researcher with a track record of generating scientific ideas uses some procedure to generate an idea, how do we know whether the result is due to the procedure or the researcher? By relying on a fixed algorithmic procedure that human subjects can interface with, hypothesis generation goes from being an idiosyncratic act of individuals to a replicable process.

III.A. Judicial Decision Making

Although our procedure is broadly applicable, we illustrate it through a specific application to the U.S. criminal justice system. We choose this application partly because of its social relevance. It is also an exemplar of the type of application where our hypothesis generation procedure can be helpful. Its key ingredients—a clear decision maker, a large number of choices (over 10 million people are arrested each year in the United States) that are recorded in data, and, increasingly, high-dimensional data that can also be used to model those choices, such as mug shot images, police body cameras, and text from arrest reports or court transcripts—are shared with a variety of other applications.

Our specific focus is on pretrial hearings. Within 24–48 hours after arrest, a judge must decide where the defendant will await trial, in jail or at home. This is a consequential decision. Cases typically take 2–4 months to resolve, sometimes up to 9–12 months. Jail affects people’s families, their livelihoods, and the chances of a guilty plea ( Dobbie, Goldin, and Yang 2018 ). On the other hand, someone who is released could potentially reoffend. 25

While pretrial decisions are by law supposed to hinge on the defendant’s risk of flight or rearrest if released ( Dobbie and Yang 2021 ), studies show that judges’ decisions deviate from those guidelines in a number of ways. For starters, judges seem to systematically mispredict defendant risk ( Jung et al. 2017 ; Kleinberg et al. 2018 ; Rambachan 2021 ; Angelova, Dobbie, and Yang 2023 ), partly because judges overweight the charge for which people are arrested ( Sunstein 2021 ). Judge decisions can also depend on extralegal factors like race ( Arnold, Dobbie, and Yang 2018 ; Arnold, Dobbie, and Hull 2020 ), whether the judge’s favorite football team lost ( Eren and Mocan 2018 ), weather ( Heyes and Saberian 2019 ), the cases the judge just heard ( Chen, Moskowitz, and Shue 2016 ), and if the hearing is on the defendant’s birthday ( Chen and Philippe 2023 ). These studies test hypotheses that some human being was clever enough to think up. But there remains a great deal of unexplained variation in judges’ decisions. The challenge of expanding the set of hypotheses for understanding this variation without losing the benefit of interpretability is the motivation for our own analysis here.

III.B. Administrative Data

We obtained data from Mecklenburg County, North Carolina, the second most populated county in the state (over 1 million residents) and home to North Carolina’s largest city (Charlotte). The county is similar to the rest of the United States in terms of economic conditions (2021 poverty rates were 11.0% versus 11.4%, respectively), although the share of Mecklenburg County’s population that is non-Hispanic white is lower than in the United States as a whole (56.6% versus 75.8%). 26 We rely on three sources of administrative data: 27

The Mecklenburg County Sheriff’s Office (MCSO) publicly posts arrest data for the past three years, which provides information on defendant demographics like age, gender, and race, as well as the charge for which someone was arrested.

The North Carolina Administrative Office of the Courts (NCAOC) maintains records on the judge’s pretrial decisions (detain, release, etc.).

Data from the North Carolina Department of Public Safety includes information about the defendant’s prior convictions and incarceration spells, if any.

We also downloaded photos of the defendants from the MCSO public website (so-called mug shots), 28 which capture a frontal view of each person from the shoulders up in front of a gray background. These images are 400 pixels wide by 480 pixels high, but we pad them with a black boundary to be square 512 × 512 images to conform with the requirements of some of the machine learning tools. In Figure II , we give readers a sense of what these mug shots look like, with two important caveats. First, given concerns about how the overrepresentation of disadvantaged groups in discussions of crime can contribute to stereotyping ( Bjornstrom et al. 2010 ), we illustrate the key ideas of the paper using images for non-Hispanic white males. Second, out of sensitivity to actual arrestees, we do not wish to display actual mug shots (which are available at the MCSO website). 29 Instead, the article only shows mug shots that are synthetic, generated using generative adversarial networks as described in Section V.B .
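The padding step described above is straightforward; here is a sketch in numpy. Whether the original pipeline centers the image within the black border is our assumption for illustration.

```python
import numpy as np

def pad_to_square(img, size=512):
    """Pad an H x W x C image with a black border to size x size,
    as done to make the 400 x 480 mug shots square for the models.
    Centering within the border is an illustrative choice."""
    h, w = img.shape[:2]
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.zeros((size, size, img.shape[2]), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

# Stand-in for a 400-wide by 480-high mug shot (uniform gray pixels).
mug = np.full((480, 400, 3), 128, dtype=np.uint8)
square = pad_to_square(mug)
print(square.shape, int(square[0, 0, 0]), int(square[256, 256, 0]))
```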

Figure II: Illustrative Facial Images

This figure shows facial images that illustrate the format of the mug shots posted publicly on the Mecklenburg County, North Carolina, sheriff’s office website. These are not real mug shots of actual people who have been arrested, but are synthetic. Moreover, given concerns about how the overrepresentation of disadvantaged groups in discussions of crime can exacerbate stereotyping, we illustrate our key ideas using images for non-Hispanic white men. However, in our human intelligence tasks that ask participants to provide labels (ratings for different image features), we show images that are representative of the Mecklenburg County defendant population as a whole.

These data capture much of the information the judge has available at the time of the pretrial hearing, but not all of it. Both the judge and the algorithm see structured variables about each defendant like defendant demographics, current charge, and prior record. Because the mug shot (which the algorithm uses) is taken not long before the pretrial hearing, it should be a reasonable proxy for what the judge sees in court. The additional information the judge has but the algorithm does not includes the narrative arrest report from the police and what happens in court. While pretrial hearings can be quite brief in many jurisdictions (often not more than just a few minutes), the judge may nonetheless hear statements from police, prosecutors, defense lawyers, and sometimes family members. Defendants usually have their lawyers speak for them and do not say much at these hearings.

We downloaded 81,166 arrests made between January 18, 2017, and January 17, 2020, involving 42,353 unique defendants. We apply several data filters, like dropping cases without mug shots ( Online Appendix Table A.I ), leaving 51,751 observations. Because our goal is inference about new out-of-sample (OOS) observations, we partition our data as follows:

A train set of N = 22,696 cases, constructed by taking arrests through July 17, 2019, grouping arrests by arrestee, 30 randomly assigning 70% of arrestees to the training-plus-validation data set, and then randomly selecting 70% of those arrestees for the training data specifically.

A validation set of N = 9,604 cases used to report OOS performance in the article’s main exhibits, consisting of the remaining 30% of arrestees in the combined training-plus-validation data set.

A lock-box hold-out set of N = 19,009 cases that we did not touch until the article was accepted for final publication, to avoid what one might call researcher overfitting: we run lots of models over the course of writing the article, and the results on the validation data set may overstate our findings. This data set consists of the N = 4,759 valid cases from the last six months of our data period (July 17, 2019, to January 17, 2020) plus a random sample of 30% of those arrested before July 17, 2019, so that we can present results that are OOS with respect to both individuals and time. Once this article was officially accepted, we replicated the findings presented in our main exhibits (see Online Appendix D and Online Appendix Tables A.XVIII–A.XXXII ). We see that our core findings are qualitatively similar. 31
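A stylized version of this partitioning logic, on hypothetical data, is sketched below. The essential points it preserves are that the early-period split is done at the arrestee level (so the same person never appears in both training and validation) and that the lock-box combines the final period with a 30% random sample of earlier arrestees; the counts and cutoff here are made up for the example.

```python
import numpy as np

def partition(arrests, cutoff, rng):
    """Stylized data partition: cases at or after the time cutoff go to
    the lock-box; earlier cases are split by arrestee (not by case),
    with 30% of early arrestees reserved for the lock-box and the rest
    divided 70/30 into training vs. validation."""
    early = [a for a in arrests if a["t"] < cutoff]
    late = [a for a in arrests if a["t"] >= cutoff]

    ids = rng.permutation(sorted({a["person"] for a in early}))
    n_lock = int(0.30 * len(ids))
    lock_ids = set(ids[:n_lock])
    rest = ids[n_lock:]
    train_ids = set(rest[: int(0.70 * len(rest))])

    train = [a for a in early if a["person"] in train_ids]
    valid = [a for a in early
             if a["person"] not in train_ids and a["person"] not in lock_ids]
    lockbox = late + [a for a in early if a["person"] in lock_ids]
    return train, valid, lockbox

rng = np.random.default_rng(6)
arrests = [{"person": int(p), "t": int(t)}            # hypothetical cases
           for p, t in zip(rng.integers(0, 400, size=1_000),
                           rng.integers(0, 100, size=1_000))]
train, valid, lockbox = partition(arrests, cutoff=90, rng=rng)

train_people = {a["person"] for a in train}
valid_people = {a["person"] for a in valid}
print(len(train), len(valid), len(lockbox), len(train_people & valid_people))
```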

Descriptive statistics are shown in Table I . Relative to the county as a whole, the arrested population substantially overrepresents men (78.7%) and Black residents (69.4%). The average age of arrestees is 31.8 years. Judges detain 23.3% of cases, and in 25.1% of arrests the person is rearrested before their case is resolved (about one-third of those released). Randomization of arrestees to the training versus validation data sets seems to have been successful, as shown in Table I . None of the pairwise comparisons has a p-value below .05 (see Online Appendix Table A.II ). A permutation multivariate analysis of variance test of the joint null hypothesis that the training-validation differences for all variables are all zero yields p = .963. 32 A test of the same joint null hypothesis for the differences between the training sample and the lock-box hold-out data set (out of sample by individual) yields p = .537.
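The logic of such a permutation test can be sketched as follows. Note this uses a simple summed-squared-mean-difference statistic on simulated covariates rather than the full permutation MANOVA used in the article; it is meant only to show the mechanics of shuffling group labels to build a null distribution.

```python
import numpy as np

def permutation_pvalue(A, B, n_perm=2_000, rng=None):
    """Permutation test of the joint null that the group means of A and B
    are equal across all columns. The statistic is the sum of squared
    mean differences; the null distribution comes from shuffling which
    rows belong to which group."""
    if rng is None:
        rng = np.random.default_rng(7)
    stat = lambda a, b: float(((a.mean(axis=0) - b.mean(axis=0)) ** 2).sum())
    observed = stat(A, B)
    pooled = np.vstack([A, B])
    n = len(A)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if stat(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            hits += 1
    return hits / n_perm

rng = np.random.default_rng(7)
train = rng.normal(size=(500, 5))   # five simulated covariates per case
valid = rng.normal(size=(300, 5))   # drawn from the same distribution
p_null = permutation_pvalue(train, valid, rng=rng)
p_shift = permutation_pvalue(train, valid + 0.5, rng=rng)  # real mean shift
print(round(p_null, 2), round(p_shift, 2))
```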

Summary Statistics for Mecklenburg County NC Data, 2017–2020

Notes. This table reports descriptive statistics for our full data set and analysis subsets, which cover the period January 18, 2017, through January 17, 2020, from Mecklenburg County, NC. The lock-box hold-out data set consists of data from the last six months of our study period (July 17, 2019–January 17, 2020) plus a subset of cases through July 16, 2019, selected by randomly selecting arrestees. The remainder of the data set is then randomly assigned by arrestee to our training data set (used to build our algorithms) or to our validation set (which we use to report results in the article’s main exhibits). For additional details of our data filters and partitioning procedures, see Online Appendix Table A.I . We define pretrial release as being released on the defendant’s own recognizance or having been assigned and then posting cash bail requirements within three days of arrest. We define rearrest as experiencing a new arrest before adjudication of the focal arrest, with detained defendants being assigned zero values for the purposes of this table. Arrest charge categories reflect the most serious criminal charge for which a person was arrested, using the FBI Uniform Crime Reporting hierarchy rule in cases where someone is arrested and charged with multiple offenses. For analyses of variance for the test of the joint null hypothesis that the difference in means across each variable is zero, see Online Appendix Table A.II .

III.C. Human Labels

The administrative data capture many key features of each case but omit some other important ones. We fill these gaps through a series of human intelligence tasks (HITs), in which study subjects on one of two possible platforms (Amazon’s Mechanical Turk or Prolific) assign labels to each case by looking at the mug shots. More details are in Online Appendix Table A.III. We use data from these HITs mostly to understand how the algorithm’s predictions relate to already-known determinants of human decision making, and hence the degree to which the algorithm is discovering something novel.

One set of HITs filled in demographic-related data: ethnicity; skin tone (since people are often stereotyped on skin color, or “colorism”; Hunter 2007 ), reported on an 18-point scale; the degree to which defendants appear more stereotypically Black on a 9-point scale ( Eberhardt et al. 2006 show this affects criminal justice decisions); and age, to compare to administrative data for label quality checks. 33 Because demographics tend to be easy for people to see in images, we collect just one label per image for each of these variables. To confirm one label is enough, we repeated the labeling task for 100 images but collected 10 labels for each image; we see that additional labels add little information. 34 Another data quality check comes from the fact that the distributions of skin color ratings do systematically differ by defendant race ( Online Appendix Figure A.III ).

A second type of HIT measured facial features that previous psychology research has shown affect human judgments. The specific set of facial features we focus on comes from the influential study by Oosterhof and Todorov (2008) of people’s perceptions of the facial features of others. When subjects are asked to provide descriptions of different faces, principal components analysis suggests just two dimensions account for about 80% of the variation: (i) trustworthiness and (ii) dominance. We also collected data on two other facial features shown to be associated with real-world decisions like hiring or whom to vote for: (iii) attractiveness and (iv) competence (Frieze, Olson, and Russell 1991; Little, Jones, and DeBruine 2011; Todorov and Oh 2021). 35

We asked subjects to rate images for each of these psychological features on a nine-point scale. Because psychological features may be less obvious than demographic features, we collected three labels per training–data set image and five per validation–data set image. 36 There is substantial variation in the ratings that subjects assign to different images for each feature (see Online Appendix Figure A.VI ). The ratings from different subjects for the same feature and image are highly correlated: interrater reliability measures (Cronbach’s α) range from 0.87 to 0.98 ( Online Appendix Figure A.VII ), similar to those reported in studies like Oosterhof and Todorov (2008) . 37 The information gain from collecting more than a few labels per image is modest. 38 For summary statistics, see Online Appendix Table A.IV .
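As a sketch of the interrater reliability calculation, the following computes Cronbach's α for a toy ratings matrix; the data are simulated with a shared latent signal, not the study's actual labels.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_images, n_raters) matrix of ratings."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                       # number of raters
    rater_var = ratings.var(axis=0, ddof=1)    # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_var.sum() / total_var)

# Toy data: 5 raters score 200 faces; ratings share a common latent signal,
# so agreement (and hence alpha) should be high.
rng = np.random.default_rng(0)
latent = rng.normal(5, 2, size=200)
ratings = latent[:, None] + rng.normal(0, 1, size=(200, 5))
alpha = cronbach_alpha(ratings)
```

With a strong shared signal, α lands in the same high range (0.87–0.98) the text reports; independent random ratings would push it toward zero.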

Finally, we also tried to capture people’s implicit or tacit understanding of the determinants of judges’ decisions by asking subjects to predict which mug shot out of a pair would be detained, with images in each pair matched on gender, race, and five-year age brackets. 39 We incentivized study subjects for correct predictions and gave them feedback over the course of the 50 image pairs to facilitate learning. We treat the first 10 responses per subject as a “learning set” that we exclude from our analysis.

The first step of our hypothesis generation procedure is to build an algorithmic model of some behavior, which in our case is the judge’s detention decision. A sizable share of the predictable variation in judge decisions comes from a surprising source: the defendant’s face. Facial features implicated by past research explain just a modest share of this predictable variation. The algorithm, in other words, appears to have made a novel discovery.

IV.A. What Drives Judge Decisions?

We begin by predicting judge pretrial detention decisions (y = 1 if detain, y = 0 if release) using all the inputs available (x). We use the training data set to construct two separate models for the two types of data available. We apply gradient-boosted decision trees to predict judge decisions using the structured administrative data (current charge, prior record, age, gender), m_s(x); for the unstructured data (raw pixel values from the mug shots), we train a convolutional neural network, m_u(x). Each model returns an estimate of y (a predicted detention probability) for a given x. Because these initial steps of our procedure use standard machine learning methods, we relegate their discussion to the Online Appendix.

We pool the signal from both models to form a single weighted-average model m_p(x) = β̂_s·m_s(x) + β̂_u·m_u(x), using a so-called stacking procedure in which the data are used to estimate the relevant weights. 40 Combining structured and unstructured data is an active area of deep-learning research, often called fusion modeling (Yuhas, Goldstein, and Sejnowski 1989; Lahat, Adali, and Jutten 2015; Ramachandram and Taylor 2017; Baltrušaitis, Ahuja, and Morency 2019). We have tried several of the latest fusion architectures; none improve on our ensemble approach.
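One simple way to implement such a stacking step, sketched here with synthetic predictions and ordinary least squares (the paper's exact procedure is described in its Online Appendix and may differ), is:

```python
import numpy as np

def fit_stacking_weights(y, m_s, m_u):
    """Estimate weights for m_p(x) = b_s*m_s(x) + b_u*m_u(x) by least squares."""
    M = np.column_stack([m_s, m_u])
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta

# Toy example: two noisy model predictions of a binary detain outcome.
rng = np.random.default_rng(0)
risk = rng.uniform(0, 1, size=2000)                   # latent detention propensity
y = (rng.uniform(size=2000) < risk).astype(float)     # detain indicator
m_s = np.clip(risk + rng.normal(0, 0.1, 2000), 0, 1)  # structured-data model
m_u = np.clip(risk + rng.normal(0, 0.2, 2000), 0, 1)  # mug-shot model
b_s, b_u = fit_stacking_weights(y, m_s, m_u)
m_p = b_s * m_s + b_u * m_u                           # pooled prediction
```

As expected, the less noisy predictor receives the larger weight; in practice stacking weights are usually fit on held-out predictions to avoid overfitting.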

Judge decisions do have some predictable structure. We report predictive performance as the area under the receiver operating characteristic curve, or AUC, a measure of how well the algorithm rank-orders cases, with values ranging from 0.5 (random guessing) to 1.0 (perfect prediction). Intuitively, AUC can be thought of as the chance that a detained defendant selected uniformly at random has a higher predicted detention likelihood than a released defendant selected uniformly at random. The algorithm built using all candidate features, m_p(x), has an AUC of 0.780 (see Online Appendix Figure A.X).

What is the algorithm using to make its predictions? A single type of input captures a sizable share of the total signal: the defendant’s face. The algorithm built using only the mug shot image, m_u(x), has an AUC of 0.625 (see Online Appendix Figure A.X). Since an AUC of 0.5 represents random prediction, in AUC terms the mug shot accounts for (0.625 − 0.5)/(0.780 − 0.5) = 44.6% of the predictive signal about judicial decisions.
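The rank interpretation of AUC, and the share-of-signal arithmetic above, can be checked directly; the scores below are simulated purely for illustration.

```python
import numpy as np

def auc(y, score):
    """AUC via its rank interpretation: the probability that a randomly chosen
    positive case outranks a randomly chosen negative case (ties count 1/2)."""
    diffs = score[y == 1][:, None] - score[y == 0][None, :]
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
informative = y + rng.normal(0, 1, size=1000)   # score correlated with outcome
uninformative = rng.normal(size=1000)           # score unrelated to outcome

# Share-of-signal arithmetic from the text: AUC 0.780 for the full model,
# 0.625 for the mug shot alone, 0.5 for random guessing.
share = (0.625 - 0.5) / (0.780 - 0.5)           # about 0.446
```

An informative score yields an AUC well above 0.5, an unrelated score sits near 0.5, and the excess over 0.5 is what the share calculation compares.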

Another common way to think about predictive accuracy is in R² terms. While our data are high dimensional (because the facial image is a high-dimensional object), the algorithm’s prediction of the judge’s decision based on the facial image, m_u(x), is a scalar and can be easily included in a familiar regression framework. Like AUC, measures such as R² and mean squared error capture how well a model rank-orders observations by predicted probabilities, but R², unlike AUC, also captures how close predictions are to observed outcomes (calibration). 41 The R² from regressing y against m_s(x) and m_u(x) in the validation data is 0.11. Regressing y against m_u(x) alone yields an R² of 0.03. So depending on how we measure predictive accuracy, around a quarter (0.03/0.11 = 27.3%) to a half (44.6%) of the predictive signal about judges’ decisions is captured by the face.

Average differences are another way to see what drives judges’ decisions. For any given feature x_k, we can calculate the average detention rate for different values of the feature. For example, for the variable measuring whether the defendant is male (x_k = 1) versus female (x_k = 0), we can calculate and plot E[y | x_k = 1] versus E[y | x_k = 0]. As shown in Online Appendix Figure A.XI, the difference in detention rates equals 4.8 percentage points for those arrested for violent versus nonviolent crimes, 10.2 percentage points for men versus women, and 4.3 percentage points for bottom versus top quartile of skin tone, which are all sizable relative to the baseline detention rate of 23.3% in our validation data set. By way of comparison, average detention rates for the bottom versus top quartile of the mug shot algorithm’s predictions, m_u(x), differ by 20.4 percentage points.
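A minimal sketch of this average-difference diagnostic, on simulated data with an assumed 10-point gender gap (the rates below are invented, not the paper's):

```python
import numpy as np

def detention_gap(y, x):
    """E[y | x = 1] - E[y | x = 0]: gap in detention rates across a binary feature."""
    return y[x == 1].mean() - y[x == 0].mean()

# Simulated data: "men" detained at 30%, "women" at 20%.
rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=20000)
y = (rng.uniform(size=20000) < np.where(male == 1, 0.30, 0.20)).astype(float)
gap = detention_gap(y, male)   # close to 0.10 by construction
```

The same one-liner, applied to quartile indicators of a model's predictions, produces the 20.4-point comparison described above.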

In what follows, we seek to understand more about the mug shot–based prediction of the judge’s decision, which we refer to simply as m(x) in the remainder of the article.

IV.B. Judicial Error?

So far we have shown that the face predicts judges’ behavior. Are judges right to use face information? To be precise, by “right” we do not mean a broader ethical judgment; for many reasons, one could argue it is never ethical to use the face. But suppose we take a rather narrow (exceedingly narrow) formulation of “right.” Recall the judge is meant to make jailing decisions based on the defendant’s risk. Is the use of these facial characteristics consistent with that objective? Put differently, if we account for defendant risk differences, do these facial characteristics still predict judge decisions? The fact that judges rely on the face in making detention decisions is in itself a striking insight regardless of whether the judges use appearance as a proxy for risk or are committing a cognitive error.

At first glance, the most straightforward way to answer this question would be to regress rearrest against the algorithm’s mug shot–based detention prediction. That yields a statistically significant relationship: the coefficient (and standard error) for the mug shot equals 0.6127 (0.0460) with no other explanatory variables in the regression, versus 0.5735 (0.0521) with all the explanatory variables included (as in the final column of Table III). But the interpretation here is not so straightforward.

The challenge of interpretation comes from the fact that we observe measured crime, not actual crime, and only for the released defendants. The first problem is that whether someone is charged with a crime is itself a human choice, made by police. If the choices police make about when to make an arrest are affected by the same biases that might afflict judges, then measured rearrest rates may correlate with facial characteristics simply due to measurement bias. The second problem is that if judges have access to private information (defendant characteristics not captured by our data set) and use that information to inform detention decisions, then the released and detained defendants may differ in unobservable ways that are relevant for rearrest risk (Kleinberg et al. 2018).

With these caveats in mind, we can at least perform a bounding exercise. We created a predictor of rearrest risk (see Online Appendix B) and then regress judges’ decisions on predicted rearrest risk. We find that a one-unit change in predicted rearrest risk changes judge detention rates by 0.6103 (standard error 0.0213). By comparison, we found that a one-unit change in the mug shot (by which we mean the algorithm’s mug shot–based prediction of the judge detention decision) changes judge detention rates by 0.6963 (standard error 0.0383; see Table III, column (1)). That means if the judges were reacting to the defendant’s face only because the face is a proxy for rearrest risk, the difference in rearrest risk for those with a one-unit difference in the mug shot would need to be 0.6963/0.6103 = 1.141. But when we directly regress rearrest against the algorithm’s mug shot–based detention prediction, we get a coefficient of 0.6127 (standard error 0.0460). Clearly 0.6127 < 1.141; that is, the mug shot does not seem to be strongly enough related to rearrest risk to explain the judge’s use of it in making detention decisions. 42

Of course this leaves us with the second problem with our data: we only have crime data on the released. It is possible the relationship between the mug shot and risk could be very different among the 23.3% of defendants who are detained (which we cannot observe). Put differently, the mug shot–risk relationship among the 76.7% of defendants who are released is 0.6127; let A be the (unknown) mug shot–risk relationship among the jailed. What we really want to know is the mug shot–risk relationship among all defendants, which equals (0.767 · 0.6127) + (0.233 · A). For this to equal 1.141, A would need to be 2.880, nearly five times as great among the detained defendants as among the released. This would imply an implausibly large effect of the mug shot on rearrest risk relative to the effects of other defendant characteristics. 43
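The bounding arithmetic can be verified in a few lines, using only the coefficients reported above:

```python
# Coefficients reported in the text.
slope_risk = 0.6103        # judge detention on predicted rearrest risk
slope_mugshot = 0.6963     # judge detention on the mug shot prediction
slope_released = 0.6127    # rearrest on mug shot, released defendants only
released, detained = 0.767, 0.233

# If the face were purely a risk proxy, the overall mug shot-risk slope
# would need to equal:
required = slope_mugshot / slope_risk                      # = 1.141
# Solve (released * slope_released) + (detained * A) = required for A:
A = (required - released * slope_released) / detained      # = 2.880
ratio = A / slope_released                                 # about 4.7x
```

The implied A of 2.880, nearly five times the slope among the released, is the implausibly large value the text describes.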

In addition, the results from Section VI.B call into question whether these characteristics are well-understood proxies for risk. As we show there, experts who know the pretrial system well (public defenders and legal aid society staff) do not recognize the signal about judge decision making that the algorithm has discovered in the mug shot. Taken together, these considerations (that measured rearrest is itself biased, the bounding exercise, and the failure of experts to recreate this signal) lead us to tentatively conclude that what the algorithm is finding in the face is unlikely to be merely a well-understood proxy for risk; rather, it reflects errors in the judicial decision-making process. Of course, that presumption is not essential for the rest of the article, which asks: what exactly has the algorithm discovered in the face?

IV.C. Is the Algorithm Discovering Something New?

Previous studies already tell us a number of things about what shapes the decisions of judges and other people. For example, we know people stereotype by gender ( Avitzour et al. 2020 ), age ( Neumark, Burn, and Button 2016 ; Dahl and Knepper 2020 ), and race or ethnicity ( Bertrand and Mullainathan 2004 ; Arnold, Dobbie, and Yang 2018 ; Arnold, Dobbie, and Hull 2020 ; Fryer 2020 ; Hoekstra and Sloan 2022 ; Goncalves and Mello 2021 ). Is the algorithm just rediscovering known determinants of people’s decisions, or discovering something new? We address this in two ways. We first ask how much of the algorithm’s predictions can be explained by already-known features ( Table II ). We then ask how much of the algorithm’s predictive power in explaining actual judges’ decisions is diminished when we control for known factors ( Table III ). We carry out both analyses for three sets of known facial features: (i) demographic characteristics, (ii) psychological features, and (iii) incentivized human guesses. 44

Table II. Is the Algorithm Rediscovering Known Facial Features?

Notes. The table presents the results of regressing an algorithmic prediction of judge detention decisions against the different explanatory variables listed in the rows, where each column represents a different regression specification (the specific explanatory variables in each regression are indicated by the filled-in coefficients and standard errors in the table). The algorithm was trained using mug shots from the training data set; the regressions reported here are carried out using data from the validation data set. Data on skin tone, attractiveness, competence, dominance, and trustworthiness come from asking subjects to assign feature ratings to mug shot images from the Mecklenburg County, NC, Sheriff’s Office public website (see the text). The human guess about the judges’ decision comes from showing workers on the Prolific platform pairs of mug shot images and asking them to report which defendant they believe the judge would be more likely to detain. Regressions follow a linear probability model and also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Table III. Does the Algorithm Predict Judge Behavior after Controlling for Known Factors?

Notes. This table reports the results of estimating a linear probability specification of judges’ detain decisions against different explanatory variables in the validation set described in Table I. Each row represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). The algorithmic predictions of the judges’ detain decision come from our convolutional neural network algorithm built using the defendants’ face images as the only feature, using data from the training data set. Measures of defendant demographics and current arrest charge come from government administrative data obtained from a combination of Mecklenburg County, NC, and state agencies. Measures of skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). The human guess variable comes from showing subjects pairs of mug shot images and asking them to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Table II , columns (1)–(3) show the relationship of the algorithm’s predictions to demographics. The predictions vary enormously by gender (men have predicted detention likelihoods 11.9 percentage points higher than women), less so by age, 45 and by different indicators of race or ethnicity. With skin tone scored on a 0−1 continuum, defendants whom independent raters judge to be at the lightest end of the continuum are 4.4 percentage points less likely to be detained than those rated to have the darkest skin tone (column (3)). Conditional on skin tone, Black defendants have a 1.9 percentage point lower predicted likelihood of detention compared with whites. 46

Table II, column (4) shows how the algorithm’s predictions relate to facial features implicated by past psychological studies as shaping people’s judgments of one another. These features also help explain the algorithm’s predictions of judges’ detention decisions: people judged by independent raters to be one standard deviation more attractive, competent, or trustworthy have lower predicted likelihoods of detention equal to 0.55, 0.91, and 0.48 percentage points, respectively, or 2.2%, 3.6%, and 1.8% of the base rate. 47 Those whom subjects judge to be one standard deviation more dominant-looking have a higher predicted likelihood of detention of 0.37 percentage points (or 1.5%).

How do we know we have controlled for everything relevant from past research? The literature on what shapes human judgments in general is vast; perhaps there are things that are relevant for judges’ decisions specifically that we have inadvertently excluded? One way to solve this problem would be to do a comprehensive scan of past studies of human judgment and decision making, and then decide which results from different non–criminal justice contexts might be relevant for criminal justice. But that itself is a form of human-driven hypothesis generation, bringing us right back to where we started.

To get out of this box, we take a different approach. Instead of enumerating individual characteristics, we ask people to embody their beliefs in a guess, which ought to be the compound of all these characteristics. Then we can ask whether the algorithm has rediscovered this human guess (and later whether it has discovered more). We ask independent subjects to look at pairs of mug shots matched by gender, race, and five-year age bins and forecast which defendant is more likely to be detained by a judge. We provide a financial incentive for accurate guesses to increase the chances that subjects take the exercise seriously. 48 We also provide subjects with an opportunity to learn by showing subjects 50 image pairs with feedback after each pair about which defendant the judge detained. We treat the first 10 image pairs from each subject as learning trials and only use data from the last 40 image pairs. This approach is intended to capture anything that influences judges’ decisions that subjects could recognize, from subtle signs of things like socioeconomic status or drug use or mood, to things people can recognize but not articulate.

It turns out subjects are modestly good at this task (Table II). Participants guess which mug shot is more likely to be detained at a rate of 51.4%, statistically significantly different from the 50% random-guessing threshold. When we regress the algorithm’s predicted detention rate against these subject guesses, the coefficient is 3.99 percentage points, equal to 17.1% of the base rate.

The findings in Table II are somewhat remarkable. The only input the algorithm had access to was the raw pixel values of each mug shot, yet it has rediscovered findings from decades of previous research and human intuition.

Interestingly, these features collectively explain only a fraction of the variation in the algorithm’s predictions: the R² is only 0.2228. That by itself does not necessarily mean the algorithm has discovered additional useful signal. It is possible that the remaining variation is prediction error—components of the prediction that do not explain actual judges’ decisions.

In Table III, we test whether the algorithm uncovers any additional signal for actual judge decisions, above and beyond the influence of these known factors. The algorithm by itself produces an R² of 0.0331 (column (1)), substantially higher than all previously known features taken together, which produce an R² of 0.0162 (column (5)), or the human guesses alone, which produce an R² of 0.0025 (so the algorithm is much better at predicting detention from faces than people are). Another way to see that the algorithm has detected signal above and beyond these known features is that the coefficient on the algorithm prediction when included alone in the regression, 0.6963 (column (1)), changes only modestly when we condition on everything else, now equal to 0.6171 (column (7)). The algorithm seems to have discovered some novel source of signal that better predicts judge detention decisions. 49
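The logic of this incremental-R² comparison can be sketched on synthetic data, where a "novel" signal raises R² above what the known features achieve (all numbers and variable names here are illustrative, not the paper's):

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# Synthetic data: the outcome loads mostly on a signal the "known" features
# miss, mimicking the algorithm's novel component.
rng = np.random.default_rng(0)
known = rng.normal(size=(5000, 3))     # stand-ins for known facial features
novel = rng.normal(size=5000)          # stand-in for the algorithm's signal
y = 0.2 * known[:, 0] + 0.5 * novel + rng.normal(size=5000)
r2_known = r_squared(y, known)
r2_full = r_squared(y, np.column_stack([known, novel]))
```

The jump from r2_known to r2_full mirrors the move from column (5) to column (7) of Table III: if the new predictor were redundant with the known features, R² would barely move.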

The algorithm has made a discovery: something about the defendant’s face explains judge decisions, above and beyond the facial features implicated by existing research. But what is it about the face that matters? Without an answer, we are left with a discovery of an unsatisfying sort. We have simply replaced one black box hypothesis generation procedure (human creativity) with another (the algorithm). In what follows we demonstrate how existing methods like saliency maps cannot solve this challenge in our application and then discuss our solution to that problem.

V.A. The Challenge of Explanation

The problem of algorithm-human communication stems from the fact that we cannot simply look inside the algorithm’s “black box” and see what it is doing, because m(x), the algorithmic predictor, is so complicated. A common solution in computer science is to forget about looking inside the algorithmic black box and focus instead on drawing inferences from curated outputs of that box. Many of these methods involve gradients: given a prediction function m(x), we can calculate the gradient ∇m(x) = dm/dx. This lets us determine, at any input value, what change in the input vector maximally changes the prediction. 50 The idea of gradients is useful for image classification tasks because it allows us to tell which pixel values are most important for changing the predicted outcome.
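A minimal numerical illustration of this gradient idea, using a toy logistic predictor whose gradient is known in closed form (the predictor and weights here are hypothetical):

```python
import numpy as np

def numerical_gradient(m, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of scalar m."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (m(x + step) - m(x - step)) / (2 * eps)
    return grad

# Toy predictor m(x) = sigmoid(w.x); its true gradient is m(x)(1 - m(x)) * w.
w = np.array([0.5, -1.0, 2.0])
m = lambda x: 1.0 / (1.0 + np.exp(-w @ x))
x0 = np.array([0.1, 0.2, 0.3])
g = numerical_gradient(m, x0)
```

For deep networks the same quantity is obtained by backpropagation rather than finite differences, but the interpretation is identical: g points in the input direction that most increases the prediction.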

For example, a widely used method known as saliency maps uses gradient information to highlight the specific pixels that are most important for predicting the outcome of interest ( Baehrens et al. 2010 ; Simonyan, Vedaldi, and Zisserman 2014 ). This approach works well for many applications like determining whether a given picture contains a given type of animal, a common task in ecology ( Norouzzadeh et al. 2018 ). What distinguishes a cat from a dog? A saliency map for a cat detector might highlight pixels around, say, the cat’s head: what is most cat-like is not the tail, paws, or torso, but the eyes, ears, and whiskers. But more complicated outcomes of the sort social scientists study may depend on complicated functions of the entire image.
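A gradient-magnitude saliency map can be sketched as follows; the toy "detector" below attends only to one image patch, so the map highlights exactly that patch. This is a schematic of the idea, not the specific method of the cited papers.

```python
import numpy as np

def saliency(m, image, eps=1e-4):
    """Gradient-magnitude saliency map: |dm/dpixel| for each pixel."""
    flat = image.ravel()
    sal = np.empty_like(flat)
    for i in range(flat.size):
        up, down = flat.copy(), flat.copy()
        up[i] += eps
        down[i] -= eps
        sal[i] = abs(m(up.reshape(image.shape)) -
                     m(down.reshape(image.shape))) / (2 * eps)
    return sal.reshape(image.shape)

# Toy "detector" that only uses the top-left 2x2 patch of an 8x8 image.
W = np.zeros((8, 8))
W[:2, :2] = 1.0
m = lambda img: float(np.tanh((W * img).sum()))
img = np.random.default_rng(0).normal(size=(8, 8))
sal = saliency(m, img)
# The map is positive on the patch the detector uses and zero elsewhere.
```

When the predictor instead depends on nearly every pixel, as with age or detention risk, the map lights up almost everywhere, which is exactly the limitation discussed next.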

Even if saliency maps were more selective in highlighting pixels in applications like ours, for hypothesis generation they also suffer from a second limitation: they do not convey enough information to enable people to articulate interpretable hypotheses. In the cat detector example, a saliency map can tell us that something about the cat’s (say) whiskers is key for distinguishing cats from dogs. But what about that feature matters? Would a cat look more like a dog if its whiskers were longer? Or shorter? More (or less?) even in length? People need to know not just what features matter but how they must change to change the prediction. For hypothesis generation, the saliency map undercommunicates with humans.

To test the ability of saliency maps to help with our application, we focused on a facial feature that people already understand and can easily recognize from a photo: age. We first build an algorithm that predicts each defendant’s age from their mug shot. For a representative image, as in the top left of Figure III , we can highlight which pixels are most important for predicting age, shown in the top right. 51 A key limitation of saliency maps is easy to see: because age (like many human facial features) is a function of almost every part of a person’s face, the saliency map highlights almost everything.

Candidate Algorithm-Human Communication Vehicles for a Known Facial Feature: Age

Panel A shows a randomly selected point in the GAN latent space for a non-Hispanic white male defendant. Panel B shows a saliency map that highlights the pixels that are most important for an algorithmic model that predicts the defendant’s age from the mug shot image. Panel C shows an image changed or “morphed” in the direction of older age, based on the gradient of the image-based age prediction, using the “naive” morphing procedure that does not constrain the new image to lie on the face manifold (see the text). Panel D shows the image morphed to the maximum age using our preferred morphing procedure.

An alternative to simply highlighting high-leverage pixels is to change them in the direction of the gradient of the predicted outcome, to—ideally—create a new face that now has a different predicted outcome, what we call “morphing.” This new image answers the counterfactual question: “How would this person’s face change to increase their predicted outcome?” Our approach builds on the ability of people to comprehend ideas through comparisons, so we can show morphed image pairs to subjects to have them name the differences that they see. Figure IV summarizes our semiautomated hypothesis generation pipeline. (For more details see Online Appendix B .) The benefit of morphed images over actual mug shot images is to isolate the differences across faces that matter for the outcome of interest. By reducing noise, morphing also reduces the risk of spurious discoveries.

Hypothesis Generation Pipeline

The diagram illustrates all the algorithmic components in our procedure by presenting a full pipeline for algorithmic interpretation.

Figure V illustrates how this morphing procedure works in practice and highlights some of the technical challenges that arise. Let the box in the top panel represent the space of all possible images—all possible combinations of pixel values for, say, a 512 × 512 image. Within this space, we can apply our mug shot–based predictor of the known facial feature, age, to identify all images with the same predicted age, as shown by the contour map of the prediction function. Imagine picking some random initial mug shot image. We could follow the gradient to find an image with a higher predicted value of the outcome y .

Morphing Images for Detention Risk On and Off the Face Manifold

The figure shows the difference between an unconstrained (naive) morphing procedure and our preferred new morphing approach. In both panels, the background represents the image space (set of all possible pixel values) and the blue line (color version available online) represents the set of all pixel values that correspond to any face image (the face manifold). The orange lines show all images that have the same predicted outcome (isoquants in predicted outcome). The initial face (point on the outermost contour line) is a randomly selected face in GAN face space. From there we can naively follow the gradients of an algorithm that predicts some outcome of interest from face images. As shown in Panel A, this takes us off the face manifold and yields a nonface image. Alternatively, with a model of the face manifold, we can follow the gradient for the predicted outcome while ensuring that the new image is again a realistic instance as shown in Panel B.

The challenge is that most points in this image space are not actually face images. Simply following the gradient will usually take us off the data distribution of face images, as illustrated abstractly in the top panel of Figure V. What this means in practice is shown in the bottom left panel of Figure III: the result is an image that has a different predicted outcome (in the figure, illustrated for age) but no longer looks like a real instance—that is, no longer looks like a realistic face image. This “naive” morphing procedure will not work without some way to ensure that the point we end up at in image space corresponds to a realistic face image.
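The naive procedure can be sketched as plain gradient ascent in input space; nothing in it constrains the result to stay on the data distribution (the predictor below is a toy stand-in, not the mug-shot model):

```python
import numpy as np

def naive_morph(grad_m, x, step=0.1, n_steps=50):
    """'Naive' morphing: gradient ascent on the predictor in raw input space,
    with nothing keeping the result on the data manifold of faces."""
    for _ in range(n_steps):
        x = x + step * grad_m(x)
    return x

# Toy stand-in for the predictor: a logistic score with a known gradient.
w = np.array([1.0, -2.0, 0.5])
m = lambda x: 1.0 / (1.0 + np.exp(-w @ x))
grad_m = lambda x: m(x) * (1 - m(x)) * w
x0 = np.zeros(3)
x1 = naive_morph(grad_m, x0)
# The predicted outcome rises, but x1 need not resemble a realistic input.
```

Applied to pixels, this is exactly the procedure that produces the non-face image in the bottom left panel of Figure III.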

V.B. Building a Model of the Data Distribution

To ensure morphing leads to realistic face images, we need a model of the data distribution p(x)—in our specific application, the set of images that are faces. We rely on an unsupervised learning approach to this problem. 52 Specifically, we use generative adversarial networks (GANs), originally introduced to generate realistic new images for a variety of tasks (see Goodfellow et al. 2014). 53

A GAN is built by training two algorithms that “compete” with each other, the generator G and the classifier C: the generator creates synthetic images, and the classifier (or “discriminator”), presented with synthetic or real images, tries to distinguish which is which. A good discriminator pressures the generator to produce images that are harder to distinguish from real ones; in turn, a good generator pressures the classifier to get better at discriminating real from synthetic images. Data on actual faces are used to train the discriminator, and the generator is trained as it seeks to fool the discriminator. The performance of C and G improves with successive iterations of training. A perfect G would output images on which the classifier C does no better than random guessing. Such a generator would by definition limit itself to the same input space that defines real images, that is, the data distribution of faces. (Additional discussion of GANs in general and of how we construct our GAN specifically is in Online Appendix B.)
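The adversarial objective can be illustrated in one dimension. The sketch below takes a single hand-derived gradient step for a logistic discriminator against a linear generator; it is a didactic toy under assumed parameter values, not the image GAN the authors train.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator G(z) = a*z + b tries to mimic real data ~ N(2, 0.5);
# discriminator D(x) = sigmoid(w*x + c) tries to tell real from fake.
a, b = 1.0, 0.0
w, c = 0.1, 0.0

def d_loss(w, c, real, fake):
    """Discriminator objective: -E[log D(real)] - E[log(1 - D(fake))]."""
    return -(np.log(sigmoid(w * real + c)).mean()
             + np.log(1 - sigmoid(w * fake + c)).mean())

real = rng.normal(2.0, 0.5, size=256)
fake = a * rng.normal(size=256) + b

# One hand-derived gradient step for the discriminator.
p_real = sigmoid(w * real + c)
p_fake = sigmoid(w * fake + c)
grad_w = -((1 - p_real) * real).mean() + (p_fake * fake).mean()
grad_c = -(1 - p_real).mean() + p_fake.mean()
lr = 0.05
w2, c2 = w - lr * grad_w, c - lr * grad_c
```

In full training, generator updates alternate with discriminator updates like this one, each pushing the other toward the equilibrium where synthetic and real data are indistinguishable.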

To build our GAN and evaluate its expressiveness we use standard training metrics, which turn out to compare favorably to what we see with other widely used GAN models on other data sets (see Online Appendix B.C for details). A more qualitative way to judge our GAN comes from visual inspection; some examples of synthetic face images are in Figure II . Most importantly, the GAN we build (as is true of GANs in general) is not generic. GANs are specific. They do not generate “faces” but instead seek to match the distribution of pixel combinations in the training data. For example, our GAN trained using mug shots would never generate generic Facebook profile photos or celebrity headshots.

Figure V illustrates how having a model such as the GAN lets morphing stay on the data distribution of faces and produce realistic images. We pick a random point in the space of faces (mug shots) and then use the algorithmic predictor of the outcome of interest m ( x ) to identify nearby faces that are similar in all respects except those relevant for the outcome. Notice this procedure requires that faces closer together in GAN latent space also look more similar to a human in pixel space. Otherwise we might make a small movement along the gradient and wind up with a face that looks different in all sorts of other ways that are irrelevant to the outcome. That is, we need the GAN not just to model the support of the data but also to provide a meaningful distance metric.
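
A toy version of this morphing step makes the mechanics concrete. Here a stand-in "generator" G maps a 2-D latent z onto a 5-pixel image manifold (the column space of a matrix A), and m(x) = v·x stands in for the outcome predictor; all quantities are illustrative assumptions, not the paper's trained models.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))        # fixed "pretrained" generator
v = rng.normal(size=5)             # outcome predictor weights

def G(z):
    return A @ z

def m(x):
    return float(v @ x)

grad_z = A.T @ v                   # chain rule: gradient of m(G(z)) in z

z = rng.normal(size=2)             # random starting point on the manifold
path = [z]
for _ in range(10):                # small normalized steps up the gradient
    z = z + 0.1 * grad_z / np.linalg.norm(grad_z)
    path.append(z)

scores = [m(G(p)) for p in path]
# Because we only ever move z and then re-decode through G, every morphed
# image stays on G's manifold while the predicted outcome rises.
```

The key design point is that the gradient is taken in latent space, not pixel space: every intermediate image is something the generator can produce, which is what keeps the morphs realistic.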

When we produce these morphs, what can possibly change as we morph? In principle there is no limit. The changes need not be local: features such as skin color, which involves many pixels, could change. So could features such as attractiveness, where the pixels that need to change to make a face more attractive vary from face to face: the “same” change may make one face more attractive and another less so. Anything represented in the face could change, as could anything else in the image beyond the face that matters for the outcome (if, for example, localities varied in both detention rates and the type of background they have someone stand in front of for mug shots).

In practice, though, there is a limit. What can change depends on how rich and expressive the estimated GAN is. If the GAN fails to capture a certain kind of face or a dimension of the face, then we are unlikely to be able to morph on that dimension. The morphing procedure is only as complete as the GAN is expressive. If the GAN expresses a feature and m ( x ) truly depends on that feature, morphing will likely display it. There is also no guarantee that in any given application the classifier m ( x ) will find novel signal for the outcome y , or that the GAN successfully learns the data distribution ( Nalisnick et al. 2018 ), or that subjects can detect and articulate whatever signal the classifier algorithm has discovered. Determining the general conditions under which our procedure will work is something we leave to future research. Whether our procedure can work for the specific application of judge decisions is the question to which we turn next. 54

V.C. Validating the Morphing Procedure

We return to our algorithmic prediction of a known facial feature—age—and see what morphing by age produces as a way to validate or test our procedure. When we follow the gradient of the predicted outcome (age), by constraining ourselves to stay on the GAN’s latent space of faces we wind up with a new age-morphed face that does indeed look like a realistic face image, as shown in the bottom right of Figure III . We seem to have successfully developed a model of the data distribution and a way to move around on that surface to create realistic new instances.

To figure out if algorithm-human communication occurs, we run these age-morphed image pairs through our experimental pipeline ( Figure IV ). Our procedure is only useful if it is replicable—that is, if it does not depend on the idiosyncratic insights of any particular person. For that reason, the people looking at these images and articulating what they see should not be us (the investigators) but a sample of external, independent study subjects. In our application, we use Prolific workers (see Online Appendix Table A.III ). Reliability or replicability is indicated by the agreement in the subject responses: lots of subjects see and articulate the same thing in the morphed images.

We asked subjects to look at 50 age-morphed image pairs selected at random from a population of 100 pairs, and told them the images in each pair differ on some hidden dimension but did not tell them what that was. 55 We asked subjects to guess which image expresses that hidden feature more, gave them feedback about the right answer, treated the first 10 image pairs as learning examples, and calculated accuracy on the remaining 40 images. Subjects correctly selected the older image 97.8% of the time.
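
The scoring rule is simple but worth pinning down: practice trials receive feedback but are excluded from the accuracy calculation. A small sketch, with simulated responses standing in for real subject data:

```python
import random

random.seed(0)
n_pairs, n_practice = 50, 10       # numbers from the text

# 1 = subject correctly picked the image expressing the hidden feature more
# (responses are simulated here for illustration).
responses = [1 if random.random() < 0.9 else 0 for _ in range(n_pairs)]

scored = responses[n_practice:]    # drop the 10 learning examples
accuracy = sum(scored) / len(scored)
```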

The final step was to ask subjects to name what differs in image pairs. Making sense of these responses requires some way to group them into semantic categories. Each subject comment could include several concepts (e.g., “wrinkles, gray hair, tired”). We standardized these verbal descriptions by removing punctuation, using only lowercase characters, and removing stop words. We gave three research assistants not otherwise involved in the project these responses and asked them to create their own categories that would capture all the responses (see Online Appendix Figure A.XIII ). We also gave them an illustrative subject comment and highlighted the different “types” of categories (a descriptive physical feature, e.g., “thick eyebrows”; a descriptive impression category, e.g., “energetic”; but also an illustration of a comment too vague to lend itself to useful measurement, e.g., “ears”). In our validation exercise 81.5% of subject reports fall into the semantic categories of either age or the closely related feature of hair color. 56
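
The standardization steps described above (lowercasing, stripping punctuation, dropping stop words, splitting multi-concept comments into tokens) can be sketched as follows; the stop-word list here is a toy stand-in for whatever list was actually used.

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "and", "more", "less", "of"}  # toy list

def normalize(comment):
    """Lowercase, strip punctuation, drop stop words, return tokens."""
    comment = comment.lower()
    comment = comment.translate(str.maketrans("", "", string.punctuation))
    return [t for t in re.split(r"\s+", comment) if t and t not in STOP_WORDS]

tokens = normalize("Wrinkles, Gray hair, and a tired look")
# tokens == ['wrinkles', 'gray', 'hair', 'tired', 'look']
```

Raters then group such tokens into semantic categories (e.g., "age," "hair color") by hand, which is the step that turns free-text reports into countable hypotheses.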

V.D. Understanding the Judge Detention Predictor

Having validated our algorithm-human communication procedure for the known facial feature of age, we are ready to apply it to generate a new hypothesis about what drives judge detention decisions. To do this we combine the mug shot algorithm predictor of judges’ detention decisions, m ( x ), with our GAN of the data distribution of mug shot images, then create new synthetic image pairs morphed with respect to the likelihood the judge would detain the defendant (see Figure IV ).

The top panel of Figure VI shows a pair of such images. Underneath we show an “image strip” of intermediate steps, along with each image’s predicted detention rate. With an overall detention rate of 23.3% in our validation data set, morphing takes us from about one-half the base rate (13%) up to nearly twice the base rate (41%). Additional examples of morphed image pairs are shown in Figure VII .

Illustration of Morphed Faces along the Detention Gradient


Panel A shows the result of selecting a random point on the GAN latent face space for a white non-Hispanic male defendant, then using our new morphing procedure to increase the predicted detention risk of the image to 0.41 (left) or reduce the predicted detention risk down to 0.13 (right). The overall average detention rate in the validation data set of actual mug shot images is 0.23 by comparison. Panel B shows the different intermediate images between these two end points, while Panel C shows the predicted detention risk for each of the images in the middle panel.

Examples of Morphing along the Gradients of the Face-Based Detention Predictor


We showed 54 subjects 50 detention-risk-morphed image pairs each, asked them to predict which defendant would be detained, offered them financial incentives for correct answers, 57 and gave them feedback on the right answer. Online Appendix Figure A.XV shows how accurate subjects are as they get more practice across successive morphed image pairs. With the initial image-pair trials, subjects are not much better than random guessing, in the range of what we see when subjects look at pairs of actual mug shots (where accuracy is 51.4% across the final 40 mug shot pairs people see). But unlike what happens when subjects look at actual images, when looking at morphed image pairs subjects seem to quickly learn what the algorithm is trying to communicate to them. Accuracy increased by over 10 percentage points after 20 morphed image pairs and reached 67% after 30 image pairs. Compared to looking at actual mug shots, the morphing procedure accomplished its goal of making it easier for subjects to see what in the face matters most for detention risk.

We asked subjects to articulate the key differences they saw across morphed image pairs. The result seems to be a reliable hypothesis—a facial feature that a sizable share of subjects name. In the top panel of Figure VIII , we present a histogram of individual tokens (cleaned words from worker comments) in “word cloud” form, where word size is approximately proportional to frequency. 58 Some of the most common words are “shaved,” “cleaner,” “length,” “shorter,” “moustache,” and “scruffy.” To form semantic categories, we use a procedure similar to what we describe for our validation exercise for the known feature of age. 59 Grouping tokens into semantic categories, we see that nearly 40% of the subjects see and name a similar feature that they think helps explain judge detention decisions: how well-groomed the defendant is (see the bottom panel of Figure VIII ). 60

Subject Reports of What They See between Detention-Risk-Morphed Image Pairs


Panel A shows a word cloud of subject reports about what they see as the key difference between image pairs where one is a randomly selected point in the GAN latent space and the other is morphed in the direction of a higher predicted detention risk. Words are approximately proportionately sized to the frequency of subject mentions. Panel B shows the frequency of semantic groupings of those open-ended subject reports (see the text for additional details).

Can we confirm that what the subjects think the algorithm is seeing is what the algorithm actually sees? We asked a separate set of 343 independent subjects (MTurk workers) to label the 32,881 mug shots in our combined training and validation data sets for how well-groomed each image was perceived to be on a nine-point scale. 61 For data sets of our size, these labeling costs are fairly modest, but in principle those costs could be much more substantial (or even prohibitive) in some applications.

Table IV suggests algorithm-human communication has successfully occurred: our new hypothesis, call it h 1 ( x ), is correlated with the algorithm’s prediction of the judge, m ( x ). If subjects were mistaken in thinking they saw well-groomed differences across images, there would be no relationship between well-groomed and the detention predictions. Yet what we actually see is the R 2 from regressing the algorithm’s predictions against well-groomed equals 0.0247, or 11% of the R 2 we get from a model with all the explanatory variables (0.2361). In a bivariate regression the coefficient (−0.0172) implies that a one standard deviation increase in well-groomed (1.0118 points on our 9-point scale) is associated with a decline in predicted detention risk of 1.74 percentage points, or 7.5% of the base rate. Another way to see the explanatory power of this hypothesis is to note that this coefficient hardly changes when we add all the other explanatory variables to the regression (equal to −0.0153 in the final column) despite the substantial increase in the model’s R 2 .
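
The back-of-the-envelope effect-size calculation in this paragraph is easy to reproduce from the reported numbers (coefficient, standard deviation on the 9-point scale, and base detention rate, all taken from the text):

```python
coef = -0.0172          # bivariate coefficient on well-groomed
sd = 1.0118             # SD of well-groomed on the 9-point scale
base_rate = 0.233       # overall detention rate in the validation set

effect_pp = abs(coef * sd) * 100              # ≈ 1.74 percentage points per 1 SD
share_of_base = abs(coef * sd) / base_rate * 100   # ≈ 7.5% of the base rate
```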

Correlation between Well-Groomed and the Algorithm’s Prediction

Notes. This table shows the results of estimating a linear probability specification regressing algorithmic predictions of judges’ detain decision against different explanatory variables, using data from the validation set of cases from Mecklenburg County, NC. Each row of the table represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). Algorithmic predictions of judges’ decisions come from applying an algorithm built with face images in the training data set to validation set observations. Data on well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

V.E. Iteration

Our procedure is iterable. The first novel feature we discovered, well-groomed, explains some—but only some—of the variation in the algorithm’s predictions of the judge. We can iterate our procedure to generate hypotheses about the remaining residual variation as well. Note that the order in which features are discovered will depend on how important each feature is in explaining the judge’s detention decision and on how salient each feature is to the subjects who are viewing the morphed image pairs. So explanatory power for the judge’s decisions need not monotonically decline as we iterate and discover new features.

To isolate the algorithm’s signal above and beyond what is explained by well-groomed, we wish to generate a new set of morphed image pairs that differ in predicted detention but hold well-groomed constant. That would help subjects see other novel features that might differ across the detention-risk-morphed images, without subjects getting distracted by differences in well-groomed. 62 But iterating the procedure raises several technical challenges. To see these challenges, consider what would in principle seem to be the most straightforward way to orthogonalize, in the GAN’s latent face space:

use training data to build predictors of detention risk, m ( x ), and the facial features to orthogonalize against, h 1 ( x );

pick a point on the GAN latent space of faces;

collect the gradients with respect to m ( x ) and h 1 ( x );

use the Gram-Schmidt process to move within the latent space toward higher predicted detention risk m ( x ), but orthogonal to h 1 ( x ); and

show new morphed image pairs to subjects, have them name a new feature.
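
Step (4) of this playbook can be shown in isolation: a single Gram-Schmidt projection removes from the detention-risk gradient its component along the well-groomed gradient. The gradients below are random stand-ins for the true latent-space gradients of m ( x ) and h 1 ( x ).

```python
import numpy as np

rng = np.random.default_rng(2)
g_m = rng.normal(size=8)     # stand-in gradient of the detention predictor m(x)
g_h = rng.normal(size=8)     # stand-in gradient of the feature predictor h1(x)

# Project out the h1 direction: what remains is orthogonal to g_h.
g_perp = g_m - (g_m @ g_h) / (g_h @ g_h) * g_h

# A small morphing step along g_perp raises predicted detention risk
# while leaving h1 unchanged to first order.
```

As the next paragraph explains, the difficulty in practice is not this projection itself but that h 1 ( x ) is only a prediction of the feature, so errors accumulate over many morphing steps.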

The challenge with implementing this playbook in practice is that we do not have labels for well-groomed for the GAN-generated synthetic faces. Moreover, it would be infeasible to collect this feature for use in this type of orthogonalization procedure. 63 That means we cannot orthogonalize against well-groomed, only against predictions of well-groomed. And orthogonalizing with respect to a prediction is an error-prone process whenever the predictor is imperfect (as it is here). 64 The errors in the process accumulate as we take many morphing steps. Worse, that accumulated error is not expected to be zero on average. Because we are morphing in the direction of predicted detention and we know predicted detention is correlated with well-groomed, the prediction error will itself be correlated with well-groomed.

Instead we use a different approach. We build a new detention-risk predictor with a curated training data set, limited to pairs of images matched on the features to be orthogonalized against. For each detained observation i (such that y i  = 1), we find a released observation j (such that y j  = 0) where h 1 ( x i ) =  h 1 ( x j ). In that training data set y is now orthogonal to h 1 ( x ), so we can use the gradient of the orthogonalized detention risk predictor to move in GAN latent space to create new morphed images with different detention odds but are similar with respect to well-groomed. 65 We call these “orthogonalized morphs,” which we then feed into the experimental pipeline shown in Figure IV . 66 An open question for future work is how many iterations are possible before the dimensionality of the matching problem required for this procedure would create problems.
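
The matched-pairs construction can be sketched with simulated data: for each detained observation, find a released observation with the same (here, integer-valued) well-groomed level, so that in the curated sample the outcome is orthogonal to the feature by construction. All data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
h1 = rng.integers(1, 10, size=n)                  # 9-point well-groomed score
y = rng.random(n) < 0.2 + 0.02 * (5 - h1) / 4     # detention, correlated with h1

detained = np.flatnonzero(y)
released_by_level = {lvl: list(np.flatnonzero((~y) & (h1 == lvl)))
                     for lvl in range(1, 10)}

pairs = []
for i in detained:
    pool = released_by_level[int(h1[i])]
    if pool:                                      # exact match on h1 level
        pairs.append((int(i), pool.pop()))

# Within the matched pairs, detained and released observations have
# identical h1 distributions, so y carries no signal about h1.
```

A predictor trained on this curated sample therefore cannot lean on well-groomed, and its gradient points in directions orthogonal to that feature, which is what the orthogonalized morphs require.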

Examples from this orthogonalized image-morphing procedure are in Figure IX . Changes in facial features across morphed images are notably different from those in the first iteration of morphs in Figure VI . From these examples, it appears the orthogonalization may be slightly imperfect; some morphed pairs show subtle differences in well-groomedness and perhaps age. As with the first iteration of the morphing procedure, the second (orthogonalized) iteration again generates images that vary substantially in their predicted risk, from 0.07 up to 0.27 (see Online Appendix Figure A.XVIII ).

Examples of Morphing along the Orthogonal Gradients of the Face-Based Detention Predictor


Still, there is a salient new signal: when the orthogonalized morphs are presented to subjects, they name a second facial feature, as shown in Figure X . We showed 52 subjects (Prolific workers) 50 orthogonalized morphed image pairs and asked them to name the differences they see. The word cloud shown in the top panel of Figure X shows that some of the most common terms reported by subjects include “big,” “wider,” “presence,” “rounded,” “body,” “jaw,” and “head.” When we ask independent research assistants to group the subject tokens into semantic groups, we can see as in the bottom of the figure that a sizable share of subject comments (around 22%) refer to a similar facial feature, h 2 ( x ): how “heavy-faced” or “full-faced” the defendant is.

Subject Reports of What They See between Detention-Risk-Morphed Image Pairs, Orthogonalized to the First Novel Feature Discovered (Well-Groomed)


Panel A shows a word cloud of subject reports about what they see as the key difference between image pairs, where one is a randomly selected point in the GAN latent space and the other is morphed in the direction of a higher predicted detention risk, where we are moving along the detention gradient orthogonal to well-groomed and skin tone (see the text). Panel B shows the frequency of semantic groupings of these open-ended subject reports (see the text for additional details).

This second facial feature (like the first) is again related to the algorithm’s prediction of the judge. When we ask a separate sample of subjects (343 MTurk workers, see Online Appendix Table A.III ) to independently label our validation images for heavy-facedness, we see that regressing the algorithm’s predictions against heavy-faced yields an R 2 of 0.0384 ( Table V , column (1)). With a coefficient of −0.0182 (0.0009), the results imply that a one standard deviation change in heavy-facedness (1.1946 points on our 9-point scale) is associated with a reduced predicted detention risk of 2.17 percentage points, or 9.3% of the base rate. Adding in other facial features implicated by past research substantially boosts the adjusted R 2 of the regression but barely changes the coefficient on heavy-facedness.

Correlation between Heavy-Faced and the Algorithm’s Prediction

Notes. This table shows the results of estimating a linear probability specification regressing algorithmic predictions of judges’ detain decision against different explanatory variables, using data from the validation set of cases from Mecklenburg County, NC. Each row of the table represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). Algorithmic predictions of judges’ decisions come from applying the algorithm built with face images in the training data set to validation set observations. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

In principle, the procedure could be iterated further. After all, well-groomed, heavy-faced, and the previously known facial features all together still explain only 27% of the variation in the algorithm’s predictions of the judges’ decisions. As long as there is residual variation, the hypothesis generation crank could be turned again and again. Because our goal is not to fully explain judges’ decisions but to illustrate that the procedure works and is iterable, we leave this for future work (ideally done on data from other jurisdictions as well).

Here we consider whether the new hypotheses our procedure has generated meet our final criterion: empirical plausibility. We show that these facial features are new not just to the scientific literature but also apparently to criminal justice practitioners, before turning to whether these correlations might reflect some underlying causal relationship.

VI.A. Do These Hypotheses Predict What Judges Actually Do?

Empirical plausibility need not be implied by the fact that our new facial features are correlated with the algorithm’s predictions of judges’ decisions. The algorithm, after all, is not a perfect predictor. In principle, well-groomed and heavy-faced might be correlated with the part of the algorithm’s prediction that is unrelated to judge behavior, or m ( x ) − y .

In Table VI , we show that our two new hypotheses are indeed empirically plausible. The adjusted R 2 from regressing judges’ decisions against heavy-faced equals 0.0042 (column (1)), while for well-groomed the figure is 0.0021 (column (2)) and for both together the figure equals 0.0061 (column (3)). As a benchmark, the adjusted R 2 from all variables (other than the algorithm’s overall mug shot–based prediction) in explaining judges’ decisions equals 0.0218 (column (6)). So the explanatory power of our two novel hypotheses alone equals about 28% of what we get from all the variables together.
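
The "about 28%" benchmark follows directly from the adjusted R-squared values reported from Table VI:

```python
r2_novel = 0.0061    # heavy-faced + well-groomed together, column (3)
r2_all = 0.0218      # all variables except the overall prediction, column (6)

share = r2_novel / r2_all * 100   # explanatory power of the novel features
                                  # as a share of all variables together
```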

Do Well-Groomed and Heavy-Faced Correlate with Judge Decisions?

Notes. This table reports the results of estimating a linear probability specification of judges’ detain decisions against different explanatory variables in the validation set described in Table I . The algorithmic predictions of the judges’ detain decision come from our convolutional neural network algorithm built using the defendants’ face image as the only feature, using data from the training data set. Measures of defendant demographics and current arrest charge come from Mecklenburg County, NC, administrative data. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

For a sense of the magnitude of these correlations, the coefficient on heavy-faced of −0.0234 (0.0036) in column (1) and on well-groomed of −0.0198 (0.0043) in column (2) imply that one standard deviation changes in each variable are associated with reduced detention rates equal to 2.8 and 2.0 percentage points, respectively, or 12.0% and 8.9% of the base rate. Interestingly, column (7) shows that heavy-faced remains statistically significant even when we control for the algorithm’s prediction. The discovery procedure led us to a facial feature that, when measured independently, captures signal above and beyond what the algorithm found. 67

VI.B. Do Practitioners Already Know This?

Our procedure has identified two hypotheses that are new to the existing research literature and to our study subjects. Yet the study subjects we have collected data from so far likely have relatively little experience with the criminal justice system. A reader might wonder: do experienced criminal justice practitioners already know that these “new” hypotheses affect judge decisions? The practitioners might have learned the influence of these facial features from day-to-day experience.

To answer this question, we carried out two smaller-scale data collections with a sample of N  = 15 staff at a public defender’s office and a legal aid society. We first asked an open-ended question: on what basis do judges decide to detain versus release defendants pretrial? Practitioners talked about judge misunderstandings of the law, people’s prior criminal records, and judge underappreciation for the social contexts in which criminal records arise. Aside from the defendant’s race, nothing about the appearance of defendants was mentioned.

We showed practitioners pairs of actual mug shots and asked them to guess which person is more likely to be detained by a judge (as we had done with MTurk and Prolific workers). This yields a sample of 360 detention forecasts. After seeing these mug shots practitioners were asked an open-ended question about what they think matters about the defendant’s appearance for judge detention decisions. There were a few mentions of well-groomed and one mention of something related to heavy-faced, but these were far from the most frequently mentioned features, as seen in Online Appendix Figure A.XX .

The practitioner forecasts do indeed seem to be more accurate than those of “regular” study subjects. Table VII , column (5) shows that defendants whom the practitioners predict will be detained are 29.2 percentage points more likely to actually be detained, even after controlling for the other known determinants of detention from past research. This is nearly four times the effect of forecasts made by Prolific workers, as shown in the last column of Table VI . The practitioner guesses (unlike the regular study subjects) are even about as accurate as the algorithm; the R 2 from the practitioner guess (0.0165 in column (1)) is similar to the R 2 from the algorithm’s predictions (0.0166 in column (6)).

Results from the Criminal Justice Practitioner Sample

Notes. This table shows the results of estimating judges’ detain decisions using a linear probability specification of different explanatory variables on a subset of the validation set. The criminal justice practitioner’s guess about the judge’s decision comes from showing 15 different public defenders and legal aid society members actual mug shot images of defendants and asking them to report which defendant they believe the judge would be more likely to detain. The pairs are selected to be congruent in gender and race but discordant in detention outcome. The algorithmic predictions of judges’ detain decisions come from applying the algorithm, which is built with face images in the training data set, to validation set observations. Measures of defendant demographics and current arrest charge come from Mecklenburg County, NC, administrative data. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Yet practitioners do not seem to already know what the algorithm has discovered. We can see this in several ways in Table VII . First, the sum of the adjusted R 2 values from the bivariate regressions of judge decisions against practitioner guesses and judge decisions against the algorithm mug shot–based prediction is not so different from the adjusted R 2 from including both variables in the same regression (0.0165 + 0.0166 = 0.0331 from columns (1) plus (6), versus 0.0338 in column (7)). We see something similar for the novel features of well-groomed and heavy-faced specifically as well. 68 The practitioners and the algorithm seem to be tapping into largely unrelated signal.

VI.C. Exploring Causality

Are these novel features actually causally related to judge decisions? Fully answering that question is clearly beyond the scope of the present article. But we can present some additional evidence that is at least suggestive.

For starters we can rule out some obvious potential confounders. With the specific hypotheses in hand, identifying the most important concerns with confounding becomes much easier. In our application, well-groomed and heavy-faced could in principle be related to things like (say) the defendant’s substance-abuse problems, mental health struggles, or socioeconomic status. But as shown in a series of Online Appendix  tables, we find that when we have study subjects independently label the mug shots in our validation data set for these features and then control for them, our novel hypotheses remain correlated with the algorithmic predictions of the judge and actual judge decisions. 69 We might wonder whether heavy-faced is simply a proxy for something that previous mock-trial-type studies suggest might matter for criminal justice decisions, “baby-faced” ( Berry and Zebrowitz-McArthur 1988 ). 70 But when we have subjects rate mug shots for baby-facedness, our heavy-faced measure remains strongly predictive of the algorithm’s predictions and actual judge decisions; see Online Appendix Tables A.XII and A.XVI .

In addition, we carried out a laboratory-style experiment with Prolific workers. We randomly morphed synthetic mug shot images in the direction of either higher or lower well-groomed (or heavy-faced), randomly assigned structured variables (current charge and prior record) to each image, explained to subjects the detention decision judges are asked to make, and then asked them which defendant from each pair they would be more likely to detain if they were the judge. The framework from Mobius and Rosenblat (2006) helps clarify what this lab experiment gets us: appearance might affect how others treat us because others are reacting to something about our own appearance directly, because our appearance affects our own confidence, or because our appearance affects our effectiveness in oral communication. The experiment’s results shut down the latter two mechanisms and isolate the effects of something about appearance per se, recognizing it remains possible that well-groomed and heavy-faced are correlated with some other aspect of appearance. 71

The study subjects recommend for detention those defendants with higher-risk structured variables (like current charge and prior record), which at the very least suggests they are taking the task seriously. Holding these other case characteristics constant, we find that the subjects are more likely to recommend for detention those defendants who are less well-groomed or less heavy-faced (see Online Appendix Table A.XVII ). Qualitatively, these results support the idea that well-groomed and heavy-faced could have a causal effect. It is not clear that the magnitudes in these experiments necessarily have much meaning: the subjects are not actual judges, and the context and structure of choice is very different from real detention decisions. Still, it is worth noting that the magnitudes implied by our results are nontrivial. Changing well-groomed or heavy-faced has the same effect on subject decisions as a movement within the predicted rearrest risk distribution of 4 and 6 percentile points, respectively (see Online Appendix C for details). Of course only an actual field experiment could conclusively determine causality here, but carrying out that type of field experiment might seem more worthwhile to an investigator in light of the lab experiment’s results.

Is this enough empirical support for these hypotheses to justify incurring the costs of causal testing? The empirical basis for these hypotheses would seem to be at least as strong as (or perhaps stronger than) the informal standard currently used to decide whether an idea is promising enough to test, which in our experience comes from some combination of observing the world, brainstorming, and perhaps some exploratory investigator-driven correlational analysis.

What might such causal testing look like? One possibility would follow in the spirit of Goldin and Rouse (2000) and compare detention decisions in settings where the defendant is more versus less visible to the judge to alter the salience of appearance. For example, many jurisdictions have continued to use some version of virtual hearings even after the pandemic. 72 In Chicago the court system has the defendant appear virtually but everyone else is in person, and the court system of its own volition has changed the size of the monitors used to display the defendant to court participants. One could imagine adding some planned variation to screen size or distance or angle to the judge. These video feeds could in principle be randomly selected for AI adjustment to the defendant’s level of well-groomedness or heavy-facedness (this would probably fall into a legal gray area). In the case of well-groomed, one could imagine a field experiment that changed this aspect of the defendant’s actual appearance prior to the court hearing. We are not claiming these are the right designs but intend only to illustrate that with new hypotheses in hand, economists are positioned to deploy the sort of creativity and rigorous testing that have become the hallmark of the field’s efforts at causal inference.

We have presented a new semi-automated procedure for hypothesis generation. We applied this new procedure to a concrete, socially important application: why judges jail some defendants and not others. Our procedure suggests two novel hypotheses: judges are more likely to detain defendants who appear less well-groomed or less heavy-faced.

Beyond the specific findings from our illustrative application, our empirical analysis also illustrates a playbook for other applications. Start with a high-dimensional predictor m(x) of some behavior of interest. Build an unsupervised model of the data distribution, p(x). Then combine the models for m(x) and p(x) in a morphing procedure to generate new instances that answer the counterfactual question: what would a given instance look like with higher or lower likelihood of the outcome? Show morphed pairs of instances to participants and get them to name what they see as the differences between morphed instances. Get others to independently rate instances for whatever the new hypothesis is; do these labels correlate with both m(x) and the behavior of interest, y? If so, we have a new hypothesis worth causal testing. This playbook is broadly applicable whenever three conditions are met.
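As a concrete, if stylized, illustration, the playbook can be sketched in a few lines of code. Everything below is a toy stand-in rather than the paper's implementation: a linear model plays the role of m(x), a fitted Gaussian plays the role of p(x) (in place of the GAN used for mug shots), and the morphing step is plain gradient ascent with a density penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the playbook's two models (all names and numbers
# here are illustrative, not the paper's implementation):
#   m(x): a predictor of the behavior of interest (a linear model)
#   p(x): an unsupervised model of the data distribution (a Gaussian)
X = rng.normal(size=(500, 5))
beta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = (X @ beta_true + rng.normal(size=500) > 0).astype(float)

w = np.linalg.lstsq(X, y, rcond=None)[0]   # fitted m(x) = x @ w
mu = X.mean(axis=0)                        # fitted p(x): mean ...
cov_inv = np.linalg.inv(np.cov(X.T))       # ... and inverse covariance

def morph(x, direction, step=0.05, n_steps=20, penalty=0.01):
    """Nudge x to raise (direction=+1) or lower (direction=-1) m(x),
    while a density penalty keeps the morph near the data distribution:
    the counterfactual of what x would look like with a higher or lower
    predicted outcome."""
    x = x.copy()
    for _ in range(n_steps):
        grad_m = w                          # gradient of the linear m(x)
        grad_logp = -cov_inv @ (x - mu)     # gradient of log p(x)
        x += step * (direction * grad_m + penalty * grad_logp)
    return x

x0 = X[0]
x_hi, x_lo = morph(x0, +1), morph(x0, -1)
# The pair (x_lo, x_hi) is what subjects would be shown and asked to
# "name"; the contrast concentrates on the dimensions m(x) relies on.
print(x_hi @ w - x0 @ w, x0 @ w - x_lo @ w)
```

In the paper's actual implementation the morphing happens in a GAN's latent space; the Gaussian penalty here is only the simplest possible way to keep morphs near the data distribution.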

The first condition is that we have a behavior we can statistically predict. The application we examine here fits because the behavior is clearly defined and measured for many cases. A study of, say, human creativity would be more challenging because it is not clear that it can be measured ( Said-Metwaly, Van den Noortgate, and Kyndt 2017 ). A study of why U.S. presidents use nuclear weapons during wartime would be challenging because there have been so few cases.

The second condition relates to what input data are available to predict behavior. Our procedure is likely to add only modest value in applications where we only have traditional structured variables, because those structured variables already make sense to people. Moreover the structured variables are usually already hypothesized to affect different behaviors, which is why economists ask about them on surveys. Our procedure will be more helpful with unstructured, high-dimensional data like images, language, and time series. The deeper point is that the collection of such high-dimensional data is often incidental to the scientific enterprise. We have images because the justice system photographs defendants during booking. Schools collect text from students as part of required assignments. Cellphones create location data as part of cell tower “pings.” These high-dimensional data implicitly contain an endless number of “features.”

Such high-dimensional data have already been found to predict outcomes in many economically relevant applications. Student essays predict graduation. Newspaper text predicts political slant of writers and editors. Federal Open Market Committee notes predict asset returns or volatility. X-ray images or EKG results predict doctor diagnoses (or misdiagnoses). Satellite images predict the income or health of a place. Many more relationships like these remain to be explored. From such prediction models, one could readily imagine human inspection of morphs leading to novel features. For example, suppose high-frequency data on volume and stock prices are used to predict future excess returns, for example, to understand when the market over- or undervalues a stock. Morphs of these time series might lead us to discover the kinds of price paths that produce overreaction. After all, some investors have even named such patterns (e.g., “head and shoulders,” “double bottom”) and trade on them.

The final condition is to be able to morph the input data to create new cases that differ in the predicted outcome. This requires some unsupervised learning technique to model the data distribution. The good news is that a number of such techniques are now available that work well with different types of high-dimensional data. We happen to use GANs here because they work well with images, but our procedure can accommodate a variety of unsupervised models. For example, for text we could use methods like Bidirectional Encoder Representations from Transformers ( Devlin et al. 2018 ), and for time series we could use variational auto-encoders ( Kingma and Welling 2013 ).

An open question is the degree to which our experimental pipeline could be changed by new technologies, in particular by recent innovations in generative modeling. For example, several recent models allow people to create new synthetic images from text descriptions, and so could perhaps (eventually) provide alternative approaches to the creation of counterfactual instances. 73 Similarly, recent generative language models such as GPT-4 appear to be able to process images, although these capabilities have only recently become publicly available. While there is inevitably some uncertainty in forecasting what such tools will be able to do, they seem unlikely to help with the first stage of our pipeline, building a predictive model of some behavior of interest, because methods like GPT-4 are unlikely to have access to data on judge decisions linked to mug shots. The stage of our pipeline where GPT-4 could potentially help is in substituting for humans in "naming" the contrasts between the morphed pairs of counterfactual instances. Though speculative, such innovations potentially allow more of the hypothesis generation procedure to be automated. We leave the exploration of these possibilities to future work.

Finally, it is worth emphasizing that hypothesis generation is not hypothesis testing. Each follows its own logic, and one procedure should not be expected to do both. Each requires different methods and approaches. What is needed to creatively produce new hypotheses is different from what is needed to carefully test a given hypothesis. Testing is about the curation of data, an effort to compare comparable subsets from the universe of all observations. But the carefully controlled experiment’s focus on isolating the role of a single prespecified factor limits the ability to generate new hypotheses. Generation is instead about bringing as much data to bear as possible, since the algorithm can only consider signal within the data available to it. The more diverse the data sources, the more scope for discovery. An algorithm could have discovered judge decisions are influenced by football losses, as in Eren and Mocan (2018) , but only if we thought to merge court records with massive archives of news stories as for example assembled by Leskovec, Backstrom, and Kleinberg (2009) . For generating ideas, creativity in experimental design useful for testing is replaced with creativity in data assembly and merging.

More generally, we hope to raise interest in the curious asymmetry we began with. Idea generation need not remain such an idiosyncratic or nebulous process. Our framework hopefully illustrates that this process can also be modeled. Our results illustrate that such activity could bear actual empirical fruit. At a minimum, these results will hopefully spur more theoretical and empirical work on hypothesis generation rather than leave this as a largely “prescientific” activity.

This is a revised version of Chicago Booth working paper 22-15 “Algorithmic Behavioral Science: Machine Learning as a Tool for Scientific Discovery.” We gratefully acknowledge support from the Alfred P. Sloan Foundation, Emmanuel Roman, and the Center for Applied Artificial Intelligence at the University of Chicago, and we thank Stephen Billings for generously sharing data. For valuable comments we thank Andrei Shleifer, Larry Katz, and five anonymous referees, as well as Marianne Bertrand, Jesse Bruhn, Steven Durlauf, Joel Ferguson, Emma Harrington, Supreet Kaur, Matteo Magnaricotte, Dev Patel, Betsy Levy Paluck, Roberto Rocha, Evan Rose, Suproteem Sarkar, Josh Schwartzstein, Nick Swanson, Nadav Tadelis, Richard Thaler, Alex Todorov, Jenny Wang, and Heather Yang, plus seminar participants at Bocconi, Brown, Columbia, ETH Zurich, Harvard, the London School of Economics, MIT, Stanford, the University of California Berkeley, the University of Chicago, the University of Pennsylvania, the University of Toronto, the 2022 Behavioral Economics Annual Meetings, and the 2022 NBER Summer Institute. For invaluable assistance with the data and analysis we thank Celia Cook, Logan Crowl, Arshia Elyaderani, and especially Jonas Knecht and James Ross. This research was reviewed by the University of Chicago Social and Behavioral Sciences Institutional Review Board (IRB20-0917) and deemed exempt because the project relies on secondary analysis of public data sources. All opinions and any errors are our own.

The question of hypothesis generation has been a vexing one in philosophy, as it appears to follow a process distinct from deduction and has been sometimes called “abduction” (see Schickore 2018 for an overview). A fascinating economic exploration of this topic can be found in Heckman and Singer (2017) , which outlines a strategy for how economists should proceed in the face of surprising empirical results. Finally, there is a small but growing literature that uses machine learning in science. In the next section we discuss how our approach is similar in some ways and different in others.

See Einav and Levin (2014) , Varian (2014) , Athey (2017) , Mullainathan and Spiess (2017) , Gentzkow, Kelly, and Taddy (2019) , and Adukia et al. (2023) on how these changes can affect economics.

In practice, there are a number of additional nuances, as discussed in Section III.A and Online Appendix A.A .

This is calculated for some of the most commonly used measures of predictive accuracy, area under the curve (AUC) and R², recognizing that different measures could yield somewhat different shares of variation explained. We emphasize the word predictable here: past work has shown that judges are “noisy” and decisions are hard to predict ( Kahneman, Sibony, and Sunstein 2022 ). As a consequence, a predictive model of the judge can do better than the judge themselves ( Kleinberg et al. 2018 ).

In Section IV.B , we examine whether the mug shot’s predictive power can be explained by underlying risk differences. There, we tentatively conclude that the predictive power of the face likely reflects judicial error, but that working assumption is not essential to either our results or the ultimate goal of the article: uncovering hypotheses for later careful testing.

For reviews of the interpretability literature, see Doshi-Velez and Kim (2017) and Marcinkevičs and Vogt (2020) .

See Liu et al. (2019) , Narayanaswamy et al. (2020) , Lang et al. (2021) , and Ghandeharioun et al. (2022) .

For example, if every dog photo in a given training data set had been taken outdoors and every cat photo was taken indoors, the algorithm might learn what animal is in the image based in part on features of the background, which would lead the algorithm to perform poorly in a new data set of more representative images.

For example, for canonical computer science applications like image classification (does this photo contain an image of a dog or of a cat?), predictive accuracy (AUC) can be on the order of 0.99. In contrast, our model of judge decisions using the face only achieves an AUC of 0.625.

Of course even if the hypotheses that are generated are the result of idiosyncratic creativity, this can still be useful. For example, Swanson (1986 , 1988) generated two novel medical hypotheses: the possibility that magnesium affects migraines and that fish oil may alleviate Raynaud’s syndrome.

Conversely, given a data set, our procedure has a built-in advantage: one could imagine a huge number of hypotheses that, while possible, are not especially useful because they are not measurable. Our procedure is by construction guaranteed to generate hypotheses that are measurable in a data set.

For additional discussion, see Ludwig and Mullainathan (2023a) .

For example, isolating the causal effects of gender on labor market outcomes is a daunting task, but the clever test in Goldin and Rouse (2000) overcomes the identification challenges by using variation in screening of orchestra applicants.

See the clever paper by Grogger and Ridgeway (2006) that uses this source of variation to examine this question.

This is related to what Autor (2014) called “Polanyi’s paradox,” the idea that people’s understanding of how the world works is beyond our capacity to explicitly describe it. For discussions in psychology about the difficulty for people to access their own cognition, see Wilson (2004) and Pronin (2009) .

Consider a simple example. Suppose x = (x1, …, xk) is a k-dimensional binary vector, all possible values of x are equally likely, and the true function in nature relating x to y only depends on the first dimension of x, so the function h1 is the only true hypothesis and the only empirically plausible hypothesis. Even with such a simple true hypothesis, people can generate nonplausible hypotheses. Imagine a pair of data points (x0, 0) and (x1, 1). Since the data distribution is uniform, x0 and x1 will differ on k/2 dimensions in expectation. A person looking at only one pair of observations would have a high chance of generating an empirically implausible hypothesis. Looking at more data, the probability of discovering an implausible hypothesis declines. But the problem remains.
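A quick simulation (ours, purely illustrative) makes the footnote's point concrete for k = 20: a person who hypothesizes from a single (0, 1) pair by picking one differing dimension at random will almost always pick an implausible one.

```python
import random

random.seed(0)
k = 20
n_sims = 10_000

diff_counts = []
implausible_picks = 0
for _ in range(n_sims):
    # Draw pairs until the first dimension (the only true cause of y)
    # differs, so the pair has outcomes (0, 1) as in the footnote.
    while True:
        x0 = [random.randint(0, 1) for _ in range(k)]
        x1 = [random.randint(0, 1) for _ in range(k)]
        if x0[0] != x1[0]:
            break
    differing = [j for j in range(k) if x0[j] != x1[j]]
    diff_counts.append(len(differing))
    # A person hypothesizing from this single pair picks one differing
    # dimension at random; any pick other than dimension 0 is an
    # empirically implausible hypothesis.
    if random.choice(differing) != 0:
        implausible_picks += 1

print(sum(diff_counts) / n_sims)    # close to k/2, as in the footnote
print(implausible_picks / n_sims)   # most single-pair guesses implausible
```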

Some canonical references include Breiman et al. (1984) , Breiman (2001) , Hastie et al. (2009) , and Jordan and Mitchell (2015) . For discussions about how machine learning connects to economics, see Belloni, Chernozhukov, and Hansen (2014) , Varian (2014) , Mullainathan and Spiess (2017) , Athey (2018) , and Athey and Imbens (2019) .

Of course there is not always a predictive signal in any given data application. But that is equally an issue for human hypothesis generation. At least with machine learning, we have formal procedures for determining whether there is any signal that holds out of sample.

The intuition here is quite straightforward. If two predictor variables are highly correlated, the weight that the algorithm puts on one versus the other can change from one draw of the data to the next depending on the idiosyncratic noise in the training data set, but since the variables are highly correlated, the predicted outcome values themselves (hence predictive accuracy) can be quite stable.
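This intuition is easy to verify in a toy simulation (illustrative, not from the paper): with two nearly collinear predictors, the estimated weights swing wildly across training draws while predictions at a fixed point stay stable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x_test = np.array([1.0, 1.0])

coefs_x1, preds = [], []
for _ in range(200):
    # Two highly correlated predictors: x2 is x1 plus a little noise.
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)
    y = x1 + x2 + rng.normal(size=n)   # true signal loads on both
    X = np.column_stack([x1, x2])
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    coefs_x1.append(w[0])
    preds.append(float(x_test @ w))

# The weight on x1 bounces around from one training draw to the next...
print(float(np.std(coefs_x1)))
# ...while the predicted value at a fixed point barely moves.
print(float(np.std(preds)))
```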

See Online Appendix Figure A.I , which shows the top nine eigenfaces for the data set we describe below, which together explain 62% of the variation.

Examples of applications of this type include Carleo et al. (2019) , He et al. (2019) , Davies et al. (2021) , Jumper et al. (2021) , and Pion-Tonachini et al. (2021) .

As other examples, researchers have found that retinal images alone can unexpectedly predict gender of patient or macular edema ( Narayanaswamy et al. 2020 ; Korot et al. 2021 ).

Sheetal, Feng, and Savani (2020) use machine learning to determine which of the long list of other survey variables collected as part of the World Values Survey best predict people’s support for unethical behavior. This application sits somewhat in between an investigator-generated hypothesis and the development of an entirely new hypothesis, in the sense that the procedure can only choose candidate hypotheses for unethical behavior from the set of variables the World Values Survey investigators thought to include on their questionnaire.

Closest is Miller et al. (2019) , which morphs EKG output but stops at the point of generating realistic morphs and does not carry this through to generating interpretable hypotheses.

Additional details about how the system works are found in Online Appendix A .

For Black non-Hispanics, the figures for Mecklenburg County versus the United States were 33.3% versus 13.6%. See https://www.census.gov/programs-surveys/sis/resources/data-tools/quickfacts.html .

Details on how we operationalize these variables are found in Online Appendix A .

The mug shot seems to have originated in Paris in the 1800s ( https://law.marquette.edu/facultyblog/2013/10/a-history-of-the-mug-shot/ ). The etymology of the term is unclear, possibly based on “mug” as slang for either the face or an “incompetent person” or “sucker” since only those who get caught are photographed by police ( https://www.etymonline.com/word/mug-shot ).

See https://mecksheriffweb.mecklenburgcountync.gov/ .

We partition the data by arrestee, not arrest, to ensure people show up in only one of the partitions to avoid inadvertent information “leakage” across data partitions.

As the Online Appendix  tables show, while there are some changes to a few of the coefficients that relate the algorithm’s predictions to factors known from past research to shape human decisions, the core findings and conclusions about the importance of the defendant’s appearance and the two specific novel facial features we identify are similar.

Using the data on arrests up to July 17, 2019, we randomly reassign arrestees to three groups of similar size to our training, validation, and lock-box hold-out data sets, convert the data to long format (with one row for each arrest-and-variable) and calculate an F -test statistic for the joint null hypothesis that the difference in baseline characteristics are all zero, clustering standard errors by arrestee. We store that F -test statistic, rerun this procedure 1,000 times, and then report the share of splits with an F -statistic larger than the one observed for the original data partition.
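This randomization-inference procedure can be sketched as follows. The toy version below uses a single baseline characteristic and an unclustered one-way ANOVA F-statistic, whereas the paper tests all characteristics jointly and clusters standard errors by arrestee.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: one baseline characteristic per arrestee (illustrative).
n = 300
x = rng.normal(size=n)

def f_stat(values, groups, k=3):
    """One-way ANOVA F: between-group over within-group variance."""
    overall = values.mean()
    between = sum(
        (groups == g).sum() * (values[groups == g].mean() - overall) ** 2
        for g in range(k)
    ) / (k - 1)
    within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in range(k)
    ) / (len(values) - k)
    return between / within

observed_groups = rng.integers(0, 3, size=n)  # the original data partition
observed_f = f_stat(x, observed_groups)

# Re-randomize the three-way split 1,000 times and report the share of
# splits whose F-statistic exceeds the one from the original partition.
exceed = sum(
    f_stat(x, rng.integers(0, 3, size=n)) > observed_f for _ in range(1000)
)
print(exceed / 1000)
```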

For an example HIT task, see Online Appendix Figure A.II .

For age and skin tone, we calculated the average pairwise correlation between two labels sampled (without replacement) from the 10 possibilities, repeated across different random pairs. The Pearson correlation was 0.765 for skin tone, 0.741 for age, and 0.789 between assigned age labels and the administrative data. The maximum correlation between the average of the first k labels collected and the (k + 1)th label is not all that much higher for k = 9 than for k = 1 (0.837 versus 0.733).
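The pairwise-correlation calculation can be mimicked on synthetic ratings (all numbers below are made up for illustration, not the paper's label data).

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic ratings: 100 images, 10 raters each; each label is the
# image's "true" value plus rater noise.
n_images, n_raters = 100, 10
truth = rng.normal(size=n_images)
labels = truth[:, None] + 0.5 * rng.normal(size=(n_images, n_raters))

# Average pairwise correlation between two labels sampled without
# replacement from the raters, repeated across random rater pairs.
corrs = []
for _ in range(500):
    i, j = rng.choice(n_raters, size=2, replace=False)
    corrs.append(np.corrcoef(labels[:, i], labels[:, j])[0, 1])
print(float(np.mean(corrs)))
```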

For an example of the consent form and instructions given to labelers, see Online Appendix Figures A.IV and A.V .

We actually collected at least three and at least five, but the averages turned out to be very close to the minimums, equal to 3.17 and 5.07, respectively.

For example, in Oosterhof and Todorov (2008) , Supplemental Materials Table S2, they report Cronbach’s α values of 0.95 for attractiveness, and 0.93 for both trustworthy and dominant.

See Online Appendix Figure A.VIII , which shows that the change in the correlation between the ( k + 1)th label with the mean of the first k labels declines after three labels.

For an example, see Online Appendix Figure A.IX .

We use the validation data set to estimate the weight β̂ and then evaluate the accuracy of m_p(x). Although this could lead to overfitting in principle, since we are only estimating a single parameter, this does not matter much in practice; we get very similar results if we randomly partition the validation data set by arrestee, use a random 30% of the validation data set to estimate the weights, then measure predictive performance in the other random 70% of the validation data set.

The mean squared error for a linear probability model's predictions is related to the Brier score ( Brier 1950 ). For a discussion of how this relates to AUC and calibration, see Murphy (1973) .

Note how this comparison helps mitigate the problem that police arrest decisions could depend on a person’s face. When we regress rearrest against the mug shot, that estimated coefficient may be heavily influenced by how police arrest decisions respond to the defendant’s appearance. In contrast when we regress judge detention decisions against predicted rearrest risk, some of the variation across defendants in rearrest risk might come from the effect of the defendant’s appearance on the probability a police officer makes an arrest, but a great deal of the variation in predicted risk presumably comes from people’s behavior.

The average mug shot–predicted detention risk for the bottom and top quartiles equal 0.127 and 0.332; that difference times 2.880 implies a rearrest risk difference of 59.0 percentage points. By way of comparison, the difference in rearrest risk between those who are arrested for a felony crime rather than a less serious misdemeanor crime is equal to just 7.8 percentage points.
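The footnote's arithmetic can be checked directly from the stated figures:

```python
# Check the footnote's arithmetic: the top-bottom quartile gap in
# mug-shot-predicted detention risk, scaled by the 2.880 coefficient.
top, bottom, coef = 0.332, 0.127, 2.880
implied_pp = (top - bottom) * coef * 100   # in percentage points
print(round(implied_pp, 1))
```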

In our main exhibits, we impose a simple linear relationship between the algorithm’s predicted detention risk and known facial features like age or psychological variables, for ease of presentation. We show our results are qualitatively similar with less parametric specifications in Online Appendix Tables A.VI, A.VII, and A.VIII .

With a coefficient value of 0.0006 on age (measured in years), the algorithm tells us that even a full decade's difference in age has only 5% of the impact on detention likelihood that gender has (10 × 0.0006 = 0.6 percentage points higher likelihood of detention, versus 11.9 percentage points).

Online Appendix Table A.V shows that Hispanic ethnicity, which we measure from subject ratings from looking at mug shots, is not statistically significantly related to the algorithm’s predictions. Table II , column (2) showed that conditional on gender, Black defendants have slightly higher predicted detention odds than white defendants (0.3 percentage points), but this is not quite significant ( t  = 1.3). Online Appendix Table A.V , column (1) shows that conditioning on Hispanic ethnicity and having stereotypically Black facial features—as measured in Eberhardt et al. (2006) —increases the size of the Black-white difference in predicted detention odds (now equal to 0.8 percentage points) as well as the difference’s statistical significance ( t  = 2.2).

This comes from multiplying the effect associated with each 1-unit change on our 9-point scale, equal to 0.55, 0.91, and 0.48 percentage points, respectively, by the standard deviation of the average label for each psychological feature for each image, equal to 0.923, 0.911, and 0.844, respectively.

As discussed in Online Appendix Table A.III , we offer subjects a $3.00 base rate for participation plus an incentive of 5 cents per correct guess. With 50 image pairs shown to each participant, they could increase their earnings by another $2.50, or up to 83% above the base compensation.

Table III gives us another way to see how much of the signal in previously known features is rediscovered by the algorithm. That the algorithm's prediction plus all previously known features yields an R² of just 0.0380 (column (7)), not much larger than with the algorithm alone, suggests the algorithm has discovered most of the signal in these known features. But not necessarily all: these other known features often do remain statistically significant predictors of judges' decisions even after controlling for the algorithm's predictions (last column). One possible reason is that, given finite samples, the algorithm has only imperfectly reconstructed factors such as “age” or “human guess.” Controlling for these factors directly adds additional signal.

Imagine a linear prediction function like m(x1, x2) = β̂1·x1 + β̂2·x2. If our best estimates suggested β̂2 = 0, the maximum change to the prediction comes from incrementally changing x1.

As noted already, to avoid contributing to the stereotyping of minorities in discussions of crime, in our exhibits we show images for non-Hispanic white men, although in our HITs we use images representative of the larger defendant population.

Modeling p ( x ) through a supervised learning task would involve assembling a large set of images, having subjects label each image for whether they contain a realistic face, and then predicting those labels using the image pixels as inputs. But this supervised learning approach is costly because it requires extensive annotation of a large training data set.

Kaji, Manresa, and Pouliot (2020) and Athey et al. (2021 , 2022) are recent uses of GANs in economics.

Some ethical issues are worth considering. One is bias. With human hypothesis generation there is the risk people “see” an association that impugns some group yet has no basis in fact. In contrast our procedure by construction only produces empirically plausible hypotheses. A different concern is the vulnerability of deep learning to adversarial examples: tiny, almost imperceptible changes in an image changing its classification for the outcome y , so that mug shots that look almost identical (that is, are very “similar” in some visual image metric) have dramatically different m ( x ). This is a problem because tiny changes to an image don’t change the nature of the object; see Szegedy et al. (2013) and Goodfellow, Shlens, and Szegedy (2014) . In practice such instances are quite rare in nature, indeed, so rare they usually occur only if intentionally (maliciously) generated.

Online Appendix Figure A.XII gives an example of this task and the instructions given to participating subjects to complete it. Each subject was tested on 50 image pairs selected at random from a population of 100 images. Subjects were told that for every pair, one image was higher in some unknown feature, but not given details as to what the feature might be. As in the exercise for predicting detention, feedback was given immediately after selecting an image, and a 5 cent bonus was paid for every correct answer.

In principle this semantic grouping could be carried out in other ways, for example, with automated procedures involving natural-language processing.

See Online Appendix Table A.III for a high-level description of this human intelligence task, and Online Appendix Figure A.XIV for a sample of the task and the subject instructions.

We drop every token of just one or two characters in length, as well as connector words without real meaning for this purpose, like “had,” “the,” and “and,” as well as words that are relevant to our exercise but generic, like “jailed,” “judge,” and “image.”

We enlisted three research assistants blinded to the findings of this study and asked them to come up with semantic categories that captured all subject comments. Since each assistant mapped each subject comment to 5% of semantic categories on average, if the assistant mappings were totally uncorrelated we would expect to see agreement of at least two assistant categorizations about 5% of the time. What we actually see is that if one research assistant made an association, 60% of the time another assistant would make the same association. We assign a comment to a semantic category when at least two of the assistants agree on the categorization.

Moreover, what subjects see does not seem to be particularly sensitive to which images they see. (As a reminder, each subject sees 50 morphed image pairs randomly selected from a larger bank of 100 morphed image pairs.) If we start with a subject who reports seeing “well-groomed” in the morphed image pairs they saw, other subjects who saw 21 or fewer images in common (so saw mostly different images) also report seeing well-groomed 31% of the time, versus 35% among the population. We select the threshold of 21 images because this is the smallest threshold at which at least 50 pairs of raters are considered.

See Online Appendix Table A.III and Online Appendix Figure A.XVI . This comes to a total of 192,280 individual labels, an average of 3.2 labels per image in the training set and an average of 10.8 labels per image in the validation set. Sampling labels from different workers on the same image, these ratings have a correlation of 0.14.

It turns out that skin tone is another feature that is correlated with well-groomed, so we orthogonalize on that as well as well-groomed. To simplify the discussion, we use “well-groomed” as a stand-in for both features we orthogonalize against, well-groomed plus skin tone.

To see why, consider the mechanics of the procedure. Since we orthogonalize as we create morphs, we would need labels at each morphing step. This would entail us producing candidate steps (new morphs), collecting data on each of the candidates, picking one that has the same well-groomed value, and then repeating. Moreover, until the labels are collected at a given step, the next step could not be taken. Since producing a final morph requires hundreds of such intermediate morphing steps, the whole process would be so time- and resource-consuming as to be infeasible.

While we can predict demographic features like race and age (above/below median age) nearly perfectly, with AUC values close to 1, for predicting well-groomed the mean absolute error of our OOS prediction is 0.63, more than half a slider step on this 9-point scale. One reason well-groomed is harder to predict is that the labels, which come from human subjects looking at and labeling mug shots, are themselves noisy, which introduces irreducible error.

For additional details see Online Appendix Figure A.XVII and Online Appendix B .

There are a few additional technical steps required, discussed in Online Appendix B . For details on the HIT we use to get subjects to name the new hypothesis from looking at orthogonalized morphs, and the follow-up HIT to generate independent labels for that new hypothesis or facial feature, see Online Appendix Table A.III .

See Online Appendix Figure A.XIX .

The adjusted R² from including the practitioner forecasts plus well-groomed and heavy-faced together (column (3), equal to 0.0246) is not that different from the sum of the R² values from including just the practitioner forecasts (0.0165 in column (1)) and from including just well-groomed and heavy-faced (0.0131 in Table VII , column (2)).

In Online Appendix Table A.IX we show that controlling for one obvious indicator of a substance abuse issue—arrest for drugs—does not seem to substantially change the relationship between full-faced or well-groomed and the predicted detention decision. Online Appendix Tables A.X and A.XI show a qualitatively similar pattern of results for the defendant’s mental health and socioeconomic status, which we measure by having a separate sample of subjects independently rate validation data set mug shots. We see qualitatively similar results when the dependent variable is the actual rather than predicted judge decision; see Online Appendix Tables A.XIII, A.XIV, and A.XV.

Characteristics of a baby face include large eyes, a narrow chin, a small nose, and high, raised eyebrows. For a discussion of the larger literature on how this feature shapes other people's reactions more generally, see Zebrowitz et al. (2009).

For additional details, see Online Appendix C .

See https://www.nolo.com/covid-19/virtual-criminal-court-appearances-in-the-time-of-the-covid-19.html .

See https://stablediffusionweb.com/ and https://openai.com/product/dall-e-2 .

The data underlying this article are available in the Harvard Dataverse, https://doi.org/10.7910/DVN/ILO46V ( Ludwig and Mullainathan 2023b ).

Adukia   Anjali , Eble   Alex , Harrison   Emileigh , Birali Runesha   Hakizumwami , Szasz   Teodora , “ What We Teach about Race and Gender: Representation in Images and Text of Children’s Books ,” Quarterly Journal of Economics , 138 ( 2023 ), 2225 – 2285 . https://doi.org/10.1093/qje/qjad028


Angelova   Victoria , Dobbie   Will S. , Yang   Crystal S. , “ Algorithmic Recommendations and Human Discretion ,” NBER Working Paper no. 31747, 2023 . https://doi.org/10.3386/w31747

Arnold   David , Dobbie   Will S. , Hull   Peter , “ Measuring Racial Discrimination in Bail Decisions ,” NBER Working Paper no. 26999, 2020.   https://doi.org/10.3386/w26999

Arnold   David , Dobbie   Will , Yang   Crystal S. , “ Racial Bias in Bail Decisions ,” Quarterly Journal of Economics , 133 ( 2018 ), 1885 – 1932 . https://doi.org/10.1093/qje/qjy012

Athey   Susan , “ Beyond Prediction: Using Big Data for Policy Problems ,” Science , 355 ( 2017 ), 483 – 485 . https://doi.org/10.1126/science.aal4321

Athey   Susan , “ The Impact of Machine Learning on Economics ,” in The Economics of Artificial Intelligence: An Agenda , Ajay Agrawal, Joshua Gans, and Avi Goldfarb, eds. (Chicago: University of Chicago Press , 2018 ), 507 – 547 .

Athey   Susan , Imbens   Guido W. , “ Machine Learning Methods That Economists Should Know About ,” Annual Review of Economics , 11 ( 2019 ), 685 – 725 . https://doi.org/10.1146/annurev-economics-080217-053433

Athey   Susan , Imbens   Guido W. , Metzger   Jonas , Munro   Evan , “ Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations ,” Journal of Econometrics , ( 2021 ), 105076. https://doi.org/10.1016/j.jeconom.2020.09.013

Athey   Susan , Karlan   Dean , Palikot   Emil , Yuan   Yuan , “ Smiles in Profiles: Improving Fairness and Efficiency Using Estimates of User Preferences in Online Marketplaces ,” NBER Working Paper no. 30633 , 2022 . https://doi.org/10.3386/w30633

Autor   David , “ Polanyi’s Paradox and the Shape of Employment Growth ,” NBER Working Paper no. 20485 , 2014 . https://doi.org/10.3386/w20485

Avitzour   Eliana , Choen   Adi , Joel   Daphna , Lavy   Victor , “ On the Origins of Gender-Biased Behavior: The Role of Explicit and Implicit Stereotypes ,” NBER Working Paper no. 27818 , 2020 . https://doi.org/10.3386/w27818

Baehrens   David , Schroeter   Timon , Harmeling   Stefan , Kawanabe   Motoaki , Hansen   Katja , Müller   Klaus-Robert , “ How to Explain Individual Classification Decisions ,” Journal of Machine Learning Research , 11 ( 2010 ), 1803 – 1831 .

Baltrušaitis   Tadas , Ahuja   Chaitanya , Morency   Louis-Philippe , “ Multimodal Machine Learning: A Survey and Taxonomy ,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 41 ( 2019 ), 423 – 443 . https://doi.org/10.1109/TPAMI.2018.2798607

Begall   Sabine , Červený   Jaroslav , Neef   Julia , Vojtěch   Oldřich , Burda   Hynek , “ Magnetic Alignment in Grazing and Resting Cattle and Deer ,” Proceedings of the National Academy of Sciences , 105 ( 2008 ), 13451 – 13455 . https://doi.org/10.1073/pnas.0803650105

Belloni   Alexandre , Chernozhukov   Victor , Hansen   Christian , “ High-Dimensional Methods and Inference on Structural and Treatment Effects ,” Journal of Economic Perspectives , 28 ( 2014 ), 29 – 50 . https://doi.org/10.1257/jep.28.2.29

Berry   Diane S. , Zebrowitz-McArthur   Leslie , “ What’s in a Face? Facial Maturity and the Attribution of Legal Responsibility ,” Personality and Social Psychology Bulletin , 14 ( 1988 ), 23 – 33 . https://doi.org/10.1177/0146167288141003

Bertrand   Marianne , Mullainathan   Sendhil , “ Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination ,” American Economic Review , 94 ( 2004 ), 991 – 1013 . https://doi.org/10.1257/0002828042002561

Bjornstrom   Eileen E. S. , Kaufman   Robert L. , Peterson   Ruth D. , Slater   Michael D. , “ Race and Ethnic Representations of Lawbreakers and Victims in Crime News: A National Study of Television Coverage ,” Social Problems , 57 ( 2010 ), 269 – 293 . https://doi.org/10.1525/sp.2010.57.2.269

Breiman   Leo , “ Random Forests ,” Machine Learning , 45 ( 2001 ), 5 – 32 . https://doi.org/10.1023/A:1010933404324

Breiman   Leo , Friedman   Jerome H. , Olshen   Richard A. , Stone   Charles J. , Classification and Regression Trees (London: Routledge , 1984 ). https://doi.org/10.1201/9781315139470


Brier   Glenn W. , “ Verification of Forecasts Expressed in Terms of Probability ,” Monthly Weather Review , 78 ( 1950 ), 1 – 3 . https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Carleo   Giuseppe , Cirac   Ignacio , Cranmer   Kyle , Daudet   Laurent , Schuld   Maria , Tishby   Naftali , Vogt-Maranto   Leslie , Zdeborová   Lenka , “ Machine Learning and the Physical Sciences ,” Reviews of Modern Physics , 91 ( 2019 ), 045002 . https://doi.org/10.1103/RevModPhys.91.045002

Chen   Daniel L. , Moskowitz   Tobias J. , Shue   Kelly , “ Decision Making under the Gambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires ,” Quarterly Journal of Economics , 131 ( 2016 ), 1181 – 1242 . https://doi.org/10.1093/qje/qjw017

Chen   Daniel L. , Philippe   Arnaud , “ Clash of Norms: Judicial Leniency on Defendant Birthdays ,” Journal of Economic Behavior & Organization , 211 ( 2023 ), 324 – 344 . https://doi.org/10.1016/j.jebo.2023.05.002

Dahl   Gordon B. , Knepper   Matthew M. , “ Age Discrimination across the Business Cycle ,” NBER Working Paper no. 27581 , 2020 . https://doi.org/10.3386/w27581

Davies   Alex , Veličković   Petar , Buesing   Lars , Blackwell   Sam , Zheng   Daniel , Tomašev   Nenad , Tanburn   Richard , Battaglia   Peter , Blundell   Charles , Juhász   András , Lackenby   Marc , Williamson   Geordie , Hassabis   Demis , Kohli   Pushmeet , “ Advancing Mathematics by Guiding Human Intuition with AI ,” Nature , 600 ( 2021 ), 70 – 74 . https://doi.org/10.1038/s41586-021-04086-x

Devlin   Jacob , Chang   Ming-Wei , Lee   Kenton , Toutanova   Kristina , “ BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding ,” arXiv preprint arXiv:1810.04805 , 2018 . https://doi.org/10.48550/arXiv.1810.04805

Dobbie   Will , Goldin   Jacob , Yang   Crystal S. , “ The Effects of Pretrial Detention on Conviction, Future Crime, and Employment: Evidence from Randomly Assigned Judges ,” American Economic Review , 108 ( 2018 ), 201 – 240 . https://doi.org/10.1257/aer.20161503

Dobbie   Will , Yang   Crystal S. , “ The US Pretrial System: Balancing Individual Rights and Public Interests ,” Journal of Economic Perspectives , 35 ( 2021 ), 49 – 70 . https://doi.org/10.1257/jep.35.4.49

Doshi-Velez   Finale , Kim   Been , “ Towards a Rigorous Science of Interpretable Machine Learning ,” arXiv preprint arXiv:1702.08608 , 2017 . https://doi.org/10.48550/arXiv.1702.08608

Eberhardt   Jennifer L. , Davies   Paul G. , Purdie-Vaughns   Valerie J. , Lynn Johnson   Sheri , “ Looking Deathworthy: Perceived Stereotypicality of Black Defendants Predicts Capital-Sentencing Outcomes ,” Psychological Science , 17 ( 2006 ), 383 – 386 . https://doi.org/10.1111/j.1467-9280.2006.01716.x

Einav   Liran , Levin   Jonathan , “ The Data Revolution and Economic Analysis ,” Innovation Policy and the Economy , 14 ( 2014 ), 1 – 24 . https://doi.org/10.1086/674019

Eren   Ozkan , Mocan   Naci , “ Emotional Judges and Unlucky Juveniles ,” American Economic Journal: Applied Economics , 10 ( 2018 ), 171 – 205 . https://doi.org/10.1257/app.20160390

Frieze   Irene Hanson , Olson   Josephine E. , Russell   June , “ Attractiveness and Income for Men and Women in Management ,” Journal of Applied Social Psychology , 21 ( 1991 ), 1039 – 1057 . https://doi.org/10.1111/j.1559-1816.1991.tb00458.x

Fryer   Roland G., Jr , “ An Empirical Analysis of Racial Differences in Police Use of Force: A Response ,” Journal of Political Economy , 128 ( 2020 ), 4003 – 4008 . https://doi.org/10.1086/710977

Fudenberg   Drew , Liang   Annie , “ Predicting and Understanding Initial Play ,” American Economic Review , 109 ( 2019 ), 4112 – 4141 . https://doi.org/10.1257/aer.20180654

Gentzkow   Matthew , Kelly   Bryan , Taddy   Matt , “ Text as Data ,” Journal of Economic Literature , 57 ( 2019 ), 535 – 574 . https://doi.org/10.1257/jel.20181020

Ghandeharioun   Asma , Kim   Been , Li   Chun-Liang , Jou   Brendan , Eoff   Brian , Picard   Rosalind W. , “ DISSECT: Disentangled Simultaneous Explanations via Concept Traversals ,” arXiv preprint arXiv:2105.15164   2022 . https://doi.org/10.48550/arXiv.2105.15164

Goldin   Claudia , Rouse   Cecilia , “ Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians ,” American Economic Review , 90 ( 2000 ), 715 – 741 . https://doi.org/10.1257/aer.90.4.715

Goncalves   Felipe , Mello   Steven , “ A Few Bad Apples? Racial Bias in Policing ,” American Economic Review , 111 ( 2021 ), 1406 – 1441 . https://doi.org/10.1257/aer.20181607

Goodfellow   Ian , Pouget-Abadie   Jean , Mirza   Mehdi , Xu   Bing , Warde-Farley   David , Ozair   Sherjil , Courville   Aaron , Bengio   Yoshua , “ Generative Adversarial Nets ,” Advances in Neural Information Processing Systems , 27 ( 2014 ), 2672 – 2680 .

Goodfellow   Ian J. , Shlens   Jonathon , Szegedy   Christian , “ Explaining and Harnessing Adversarial Examples ,” arXiv preprint arXiv:1412.6572 , 2014 . https://doi.org/10.48550/arXiv.1412.6572

Grogger   Jeffrey , Ridgeway   Greg , “ Testing for Racial Profiling in Traffic Stops from Behind a Veil of Darkness ,” Journal of the American Statistical Association , 101 ( 2006 ), 878 – 887 . https://doi.org/10.1198/016214506000000168

Hastie   Trevor , Tibshirani   Robert , Friedman   Jerome H. , Friedman   Jerome H. , The Elements of Statistical Learning: Data Mining, Inference, and Prediction , vol. 2 (Berlin: Springer , 2009 ).

He   Siyu , Li   Yin , Feng   Yu , Ho   Shirley , Ravanbakhsh   Siamak , Chen   Wei , Póczos   Barnabás , “ Learning to Predict the Cosmological Structure Formation ,” Proceedings of the National Academy of Sciences , 116 ( 2019 ), 13825 – 13832 . https://doi.org/10.1073/pnas.1821458116

Heckman   James J. , Singer   Burton , “ Abducting Economics ,” American Economic Review , 107 ( 2017 ), 298 – 302 . https://doi.org/10.1257/aer.p20171118

Heyes   Anthony , Saberian   Soodeh , “ Temperature and Decisions: Evidence from 207,000 Court Cases ,” American Economic Journal: Applied Economics , 11 ( 2019 ), 238 – 265 . https://doi.org/10.1257/app.20170223

Hoekstra   Mark , Sloan   CarlyWill , “ Does Race Matter for Police Use of Force? Evidence from 911 Calls ,” American Economic Review , 112 ( 2022 ), 827 – 860 . https://doi.org/10.1257/aer.20201292

Hunter   Margaret , “ The Persistent Problem of Colorism: Skin Tone, Status, and Inequality ,” Sociology Compass , 1 ( 2007 ), 237 – 254 . https://doi.org/10.1111/j.1751-9020.2007.00006.x

Jordan   Michael I. , Mitchell   Tom M. , “ Machine Learning: Trends, Perspectives, and Prospects ,” Science , 349 ( 2015 ), 255 – 260 . https://doi.org/10.1126/science.aaa8415

Jumper   John , Evans   Richard , Pritzel   Alexander , Green   Tim , Figurnov   Michael , Ronneberger   Olaf , Tunyasuvunakool   Kathryn , Bates   Russ , Žídek   Augustin , Potapenko   Anna  et al.  , “ Highly Accurate Protein Structure Prediction with AlphaFold ,” Nature , 596 ( 2021 ), 583 – 589 . https://doi.org/10.1038/s41586-021-03819-2

Jung   Jongbin , Concannon   Connor , Shroff   Ravi , Goel   Sharad , Goldstein   Daniel G. , “ Simple Rules for Complex Decisions ,” SSRN working paper , 2017 . https://doi.org/10.2139/ssrn.2919024

Kahneman   Daniel , Sibony   Olivier , Sunstein   Cass R. , Noise (London: HarperCollins , 2022 ).

Kaji   Tetsuya , Manresa   Elena , Pouliot   Guillaume , “ An Adversarial Approach to Structural Estimation ,” University of Chicago, Becker Friedman Institute for Economics Working Paper No. 2020-144 , 2020 . https://doi.org/10.2139/ssrn.3706365

Kingma   Diederik P. , Welling   Max , “ Auto-Encoding Variational Bayes ,” arXiv preprint arXiv:1312.6114 , 2013 . https://doi.org/10.48550/arXiv.1312.6114

Kleinberg   Jon , Lakkaraju   Himabindu , Leskovec   Jure , Ludwig   Jens , Mullainathan   Sendhil , “ Human Decisions and Machine Predictions ,” Quarterly Journal of Economics , 133 ( 2018 ), 237 – 293 . https://doi.org/10.1093/qje/qjx032

Korot   Edward , Pontikos   Nikolas , Liu   Xiaoxuan , Wagner   Siegfried K. , Faes   Livia , Huemer   Josef , Balaskas   Konstantinos , Denniston   Alastair K. , Khawaja   Anthony , Keane   Pearse A. , “ Predicting Sex from Retinal Fundus Photographs Using Automated Deep Learning ,” Scientific Reports , 11 ( 2021 ), 10286 . https://doi.org/10.1038/s41598-021-89743-x

Lahat   Dana , Adali   Tülay , Jutten   Christian , “ Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects ,” Proceedings of the IEEE , 103 ( 2015 ), 1449 – 1477 . https://doi.org/10.1109/JPROC.2015.2460697

Lang   Oran , Gandelsman   Yossi , Yarom   Michal , Wald   Yoav , Elidan   Gal , Hassidim   Avinatan , Freeman   William T , Isola   Phillip , Globerson   Amir , Irani   Michal , et al.  , “ Explaining in Style: Training a GAN to Explain a Classifier in StyleSpace ,” paper presented at the IEEE/CVF International Conference on Computer Vision , 2021. https://doi.org/10.1109/ICCV48922.2021.00073

Leskovec   Jure , Backstrom   Lars , Kleinberg   Jon , “ Meme-Tracking and the Dynamics of the News Cycle ,” paper presented at the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. https://doi.org/10.1145/1557019.1557077

Little   Anthony C. , Jones   Benedict C. , DeBruine   Lisa M. , “ Facial Attractiveness: Evolutionary Based Research ,” Philosophical Transactions of the Royal Society B: Biological Sciences , 366 ( 2011 ), 1638 – 1659 . https://doi.org/10.1098/rstb.2010.0404

Liu   Shusen , Kailkhura   Bhavya , Loveland   Donald , Han   Yong , “ Generative Counterfactual Introspection for Explainable Deep Learning ,” paper presented at the IEEE Global Conference on Signal and Information Processing (GlobalSIP) , 2019. https://doi.org/10.1109/GlobalSIP45357.2019.8969491

Ludwig   Jens , Mullainathan   Sendhil , “ Machine Learning as a Tool for Hypothesis Generation ,” NBER Working Paper no. 31017 , 2023a . https://doi.org/10.3386/w31017

Ludwig   Jens , Mullainathan   Sendhil , “ Replication Data for: ‘Machine Learning as a Tool for Hypothesis Generation’ ,” ( 2023b ), Harvard Dataverse. https://doi.org/10.7910/DVN/ILO46V .

Marcinkevičs   Ričards , Vogt   Julia E. , “ Interpretability and Explainability: A Machine Learning Zoo Mini-Tour ,” arXiv preprint arXiv:2012.01805 , 2020 . https://doi.org/10.48550/arXiv.2012.01805

Miller   Andrew , Obermeyer   Ziad , Cunningham   John , Mullainathan   Sendhil , “ Discriminative Regularization for Latent Variable Models with Applications to Electrocardiography ,” paper presented at the International Conference on Machine Learning , 2019.

Mobius   Markus M. , Rosenblat   Tanya S. , “ Why Beauty Matters ,” American Economic Review , 96 ( 2006 ), 222 – 235 . https://doi.org/10.1257/000282806776157515

Mobley   R. Keith , An Introduction to Predictive Maintenance (Amsterdam: Elsevier , 2002 ).

Mullainathan   Sendhil , Obermeyer   Ziad , “ Diagnosing Physician Error: A Machine Learning Approach to Low-Value Health Care ,” Quarterly Journal of Economics , 137 ( 2022 ), 679 – 727 . https://doi.org/10.1093/qje/qjab046

Mullainathan   Sendhil , Spiess   Jann , “ Machine Learning: an Applied Econometric Approach ,” Journal of Economic Perspectives , 31 ( 2017 ), 87 – 106 . https://doi.org/10.1257/jep.31.2.87

Murphy   Allan H. , “ A New Vector Partition of the Probability Score ,” Journal of Applied Meteorology and Climatology , 12 ( 1973 ), 595 – 600 . https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2

Nalisnick   Eric , Matsukawa   Akihiro , Whye Teh   Yee , Gorur   Dilan , Lakshminarayanan   Balaji , “ Do Deep Generative Models Know What They Don’t Know? ,” arXiv preprint arXiv:1810.09136 , 2018 . https://doi.org/10.48550/arXiv.1810.09136

Narayanaswamy   Arunachalam , Venugopalan   Subhashini , Webster   Dale R. , Peng   Lily , Corrado   Greg S. , Ruamviboonsuk   Paisan , Bavishi   Pinal , Brenner   Michael , Nelson   Philip C. , Varadarajan   Avinash V. , “ Scientific Discovery by Generating Counterfactuals Using Image Translation ,” in International Conference on Medical Image Computing and Computer-Assisted Intervention , (Berlin: Springer , 2020), 273 – 283 . https://doi.org/10.1007/978-3-030-59710-8_27

Neumark   David , Burn   Ian , Button   Patrick , “ Experimental Age Discrimination Evidence and the Heckman Critique ,” American Economic Review , 106 ( 2016 ), 303 – 308 . https://doi.org/10.1257/aer.p20161008

Norouzzadeh   Mohammad Sadegh , Nguyen   Anh , Kosmala   Margaret , Swanson   Alexandra , S. Palmer   Meredith , Packer   Craig , Clune   Jeff , “ Automatically Identifying, Counting, and Describing Wild Animals in Camera-Trap Images with Deep Learning ,” Proceedings of the National Academy of Sciences , 115 ( 2018 ), E5716 – E5725 . https://doi.org/10.1073/pnas.1719367115

Oosterhof   Nikolaas N. , Todorov   Alexander , “ The Functional Basis of Face Evaluation ,” Proceedings of the National Academy of Sciences , 105 ( 2008 ), 11087 – 11092 . https://doi.org/10.1073/pnas.0805664105

Peterson   Joshua C. , Bourgin   David D. , Agrawal   Mayank , Reichman   Daniel , Griffiths   Thomas L. , “ Using Large-Scale Experiments and Machine Learning to Discover Theories of Human Decision-Making ,” Science , 372 ( 2021 ), 1209 – 1214 . https://doi.org/10.1126/science.abe2629

Pierson   Emma , Cutler   David M. , Leskovec   Jure , Mullainathan   Sendhil , Obermeyer   Ziad , “ An Algorithmic Approach to Reducing Unexplained Pain Disparities in Underserved Populations ,” Nature Medicine , 27 ( 2021 ), 136 – 140 . https://doi.org/10.1038/s41591-020-01192-7

Pion-Tonachini   Luca , Bouchard   Kristofer , Garcia Martin   Hector , Peisert   Sean , Bradley Holtz   W. , Aswani   Anil , Dwivedi   Dipankar , Wainwright   Haruko , Pilania   Ghanshyam , Nachman   Benjamin  et al.  “ Learning from Learning Machines: A New Generation of AI Technology to Meet the Needs of Science ,” arXiv preprint arXiv:2111.13786 , 2021 . https://doi.org/10.48550/arXiv.2111.13786

Popper   Karl , The Logic of Scientific Discovery (London: Routledge , 2nd ed. 2002 ). https://doi.org/10.4324/9780203994627

Pronin   Emily , “ The Introspection Illusion ,” Advances in Experimental Social Psychology , 41 ( 2009 ), 1 – 67 . https://doi.org/10.1016/S0065-2601(08)00401-2

Ramachandram   Dhanesh , Taylor   Graham W. , “ Deep Multimodal Learning: A Survey on Recent Advances and Trends ,” IEEE Signal Processing Magazine , 34 ( 2017 ), 96 – 108 . https://doi.org/10.1109/MSP.2017.2738401

Rambachan   Ashesh , “ Identifying Prediction Mistakes in Observational Data ,” Harvard University Working Paper, 2021 . www.nber.org/system/files/chapters/c14777/c14777.pdf

Said-Metwaly   Sameh , Van den Noortgate   Wim , Kyndt   Eva , “ Approaches to Measuring Creativity: A Systematic Literature Review ,” Creativity: Theories–Research-Applications , 4 ( 2017 ), 238 – 275 . https://doi.org/10.1515/ctra-2017-0013

Schickore   Jutta , “ Scientific Discovery ,” in The Stanford Encyclopedia of Philosophy , Edward N. Zalta, ed. (Stanford, CA: Stanford University , 2018).

Schlag   Pierre , “ Law and Phrenology ,” Harvard Law Review , 110 ( 1997 ), 877 – 921 . https://doi.org/10.2307/1342231

Sheetal   Abhishek , Feng   Zhiyu , Savani   Krishna , “ Using Machine Learning to Generate Novel Hypotheses: Increasing Optimism about COVID-19 Makes People Less Willing to Justify Unethical Behaviors ,” Psychological Science , 31 ( 2020 ), 1222 – 1235 . https://doi.org/10.1177/0956797620959594

Simonyan   Karen , Vedaldi   Andrea , Zisserman   Andrew , “ Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps ,” paper presented at the Workshop at International Conference on Learning Representations , 2014.

Sirovich   Lawrence , Kirby   Michael , “ Low-Dimensional Procedure for the Characterization of Human Faces ,” Journal of the Optical Society of America A , 4 ( 1987 ), 519 – 524 . https://doi.org/10.1364/JOSAA.4.000519

Sunstein   Cass R. , “ Governing by Algorithm? No Noise and (Potentially) Less Bias ,” Duke Law Journal , 71 ( 2021 ), 1175 – 1205 . https://doi.org/10.2139/ssrn.3925240

Swanson   Don R. , “ Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge ,” Perspectives in Biology and Medicine , 30 ( 1986 ), 7 – 18 . https://doi.org/10.1353/pbm.1986.0087

Swanson   Don R. , “ Migraine and Magnesium: Eleven Neglected Connections ,” Perspectives in Biology and Medicine , 31 ( 1988 ), 526 – 557 . https://doi.org/10.1353/pbm.1988.0009

Szegedy   Christian , Zaremba   Wojciech , Sutskever   Ilya , Bruna   Joan , Erhan   Dumitru , Goodfellow   Ian , Fergus   Rob , “ Intriguing Properties of Neural Networks ,” arXiv preprint arXiv:1312.6199 , 2013 . https://doi.org/10.48550/arXiv.1312.6199

Todorov   Alexander , Oh   DongWon , “ The Structure and Perceptual Basis of Social Judgments from Faces ,” in Advances in Experimental Social Psychology , B. Gawronski, ed. (Amsterdam: Elsevier , 2021 ), 189–245.

Todorov   Alexander , Olivola   Christopher Y. , Dotsch   Ron , Mende-Siedlecki   Peter , “ Social Attributions from Faces: Determinants, Consequences, Accuracy, and Functional Significance ,” Annual Review of Psychology , 66 ( 2015 ), 519 – 545 . https://doi.org/10.1146/annurev-psych-113011-143831

Varian   Hal R. , “ Big Data: New Tricks for Econometrics ,” Journal of Economic Perspectives , 28 ( 2014 ), 3 – 28 . https://doi.org/10.1257/jep.28.2.3

Wilson   Timothy D. , Strangers to Ourselves (Cambridge, MA: Harvard University Press , 2004 ).

Yuhas   Ben P. , Goldstein   Moise H. , Sejnowski   Terrence J. , “ Integration of Acoustic and Visual Speech Signals Using Neural Networks ,” IEEE Communications Magazine , 27 ( 1989 ), 65 – 71 . https://doi.org/10.1109/35.41402

Zebrowitz   Leslie A. , Luevano   Victor X. , Bronstad   Philip M. , Aharon   Itzhak , “ Neural Activation to Babyfaced Men Matches Activation to Babies ,” Social Neuroscience , 4 ( 2009 ), 1 – 10 . https://doi.org/10.1080/17470910701676236


  • Online ISSN 1531-4650
  • Print ISSN 0033-5533
  • Copyright © 2024 President and Fellows of Harvard College


Research Guide: Scholarly Journals


Hypothesis or Thesis

Looking for the author's thesis or hypothesis?

Figure: the opening paragraphs of a journal article, with the authors' thesis or hypothesis highlighted (the main point they will discuss).

  • The first few paragraphs of a journal article serve to introduce the topic, to provide the author's hypothesis or thesis, and to indicate why the research was done.  
  • A thesis or hypothesis is not always clearly labeled; you may need to read through the introductory paragraphs to determine what the authors are proposing.
  • Last Updated: Mar 15, 2024 1:18 PM
  • URL: https://libguides.greenriver.edu/scholarlyjournals


16 April 2024

US COVID-origins hearing puts scientific journals in the hot seat



Brad Wenstrup (right), a Republican from Ohio who chairs the Select Subcommittee on the Coronavirus Pandemic, speaks with Raul Ruiz (left), a Democrat from California who is ranking member of the subcommittee. Credit: Al Drago/Bloomberg/Getty

During a public hearing in Washington DC today, Republicans in the US House of Representatives alleged that government scientists unduly influenced the editors of scientific journals and that, in turn, those publications stifled discourse about the origins of the COVID-19 pandemic. Democrats clapped back, lambasting their Republican colleagues for making such accusations without adequate evidence and for sowing distrust of science.


The session is the latest in a series of hearings held by the Select Subcommittee on the Coronavirus Pandemic to explore where the SARS-CoV-2 coronavirus came from, despite a lack of any new scientific evidence. Scientists have for some time been arguing over whether the virus spread naturally, from animals to people, or whether it leaked from a laboratory in Wuhan, China. Some have alleged that in the early days of the pandemic, government scientists Anthony Fauci, former director of the US National Institute of Allergy and Infectious Diseases, and Francis Collins, former director of the US National Institutes of Health (NIH), steered the scientific community, including journals, to dismiss the lab-leak hypothesis.

During the pandemic, “rather than journals being a wealth of information”, they instead “put a chilling effect on scientific research regarding the origins of COVID-19”, Brad Wenstrup, a Republican representative from Ohio who is chair of the subcommittee, said at the hearing. Raul Ruiz, a Democratic representative from California who is the ranking member of the subcommittee, shot back: “Congress should not be meddling in the peer-review process, and it should not be holding hearings to throw around baseless accusations.”

Holden Thorp, editor-in-chief of the Science family of journals in Washington DC, appeared before the committee to deny the suggestion that he had been coerced or censored by government scientists.

The subcommittee also invited Magdalena Skipper, Nature ’s editor-in-chief, and Richard Horton, editor-in-chief of the medical journal The Lancet , to appear, but neither was present. Skipper was absent owing to scheduling conflicts, but a spokesperson for Springer Nature says the company is “committed to remaining engaged with the Subcommittee and to assisting in its inquiry”. ( Nature ’s news team is editorially independent of its journals team and of its publisher, Springer Nature.) The Lancet did not respond to requests for comment.

Academic influence?

This is not the first time that Republicans have accused members of the scientific community of colluding with Fauci and Collins. Evolutionary biologist Kristian Andersen and virologist Robert Garry appeared before the same subcommittee on 11 July last year to deny allegations that the officials prompted them to publish a commentary in Nature Medicine (ref. 1) in March 2020 concluding that SARS-CoV-2 showed no signs of genetic engineering. They wrote in the journal that they did not “believe that any type of laboratory-based scenario is plausible” for the virus’s origins.


Holden Thorp became editor-in-chief of the Science family of journals in 2019. Credit: Steve Exum

Some lab-leak proponents have suggested, without evidence, that the pandemic began because the NIH funded risky coronavirus research at a lab in Wuhan, offering a motive for Collins and Fauci to promote a natural origin for COVID-19.

During the latest hearing, Republicans went a step further, suggesting that not only did Collins and Fauci influence prominent biologists, but that they also encouraged journals to publish research supporting the natural-origin hypothesis. This accusation is based on e-mails that Wenstrup says the subcommittee obtained showing communication between top journal editors and government scientists. Thorp forcefully denied this line of questioning. “No government officials prompted or participated in the review or editing” of two key papers (refs 2, 3) on COVID-19’s origins published in Science, he testified. “Any papers supporting the lab-origin theory would go through the very same processes” of peer review as any other paper, he said.

Thorp otherwise spent much of the 80-minute hearing answering questions about how a scientific manuscript is prepared for publication, what a preprint is and how peer review works. In a tense moment, Wenstrup questioned a social-media post on Thorp’s personal X (formerly Twitter) page, in which he downplayed the lab-leak hypothesis. Thorp called the post “flippant” and apologised.

Communication queries

Correspondence between journal editors and government scientists is to be expected, Deborah Ross, a Democratic representative from North Carolina, said at the hearing. “Government actors querying academia on issues that are academic in nature isn’t malpractice or unlawful — it’s just doing their jobs.”

Anita Desikan, a senior analyst at the Union of Concerned Scientists who is based in Washington DC and focuses on scientific integrity, tells Nature’s news team that it is customary for government agencies to reach out to stakeholders to inform policy decisions. Even if a government scientist suggests an idea for a journal paper, “that doesn’t mean it will be published or receive praise from the scientific community”.

Roger Pielke Jr, a science-policy researcher at the University of Colorado Boulder, who was originally slated to testify before the subcommittee before his invitation was rescinded for logistical reasons, disagrees. He thinks that Fauci and Collins still shaped the Nature Medicine COVID-19 origins paper by recommending that specific scientists investigate and by offering advice along the way. Nevertheless, the hearing was a “dud”, Pielke Jr says, because Thorp was the wrong witness. A more relevant witness, he adds, would have been a government scientific-integrity officer, who would be more knowledgeable about what constitutes an ethical breach.

doi: https://doi.org/10.1038/d41586-024-01129-x

Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. Nature Med. 26, 450–452 (2020).

Worobey, M. et al. Science 377 , 951–959 (2022).

Pekar, J. E. et al. Science 377 , 960–966 (2022).
