
Data-driven hypothesis generation in clinical research: what we learned from a human subject study


Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it often goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first review the literature on the following topics: scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study exploring scientific thinking and discovery. Over the years, research on scientific thinking has made excellent progress in cognitive science and its applied areas: education, medicine, and biomedical research. However, a review of the literature reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the time participants need, on average, to generate a hypothesis and requires fewer cognitive events per hypothesis. As a counterpoint, the exploration also indicates that the hypotheses generated with VIADS received significantly lower feasibility ratings. Despite its small scale, the study confirmed the feasibility of conducting a human participant study to directly explore the hypothesis generation process in clinical research. It provides supporting evidence for a larger-scale study with a specifically designed tool to facilitate the hypothesis-generation process among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn could improve clinical research productivity and the overall clinical research enterprise.



Helping students understand and generate appropriate hypotheses and test their subsequent predictions — in science in general and biology in particular — should be at the core of teaching the nature of science. However, there is much confusion among students and teachers about the difference between hypotheses and predictions. Here, I present evidence of the problem and describe steps that scientists actually follow when employing scientific reasoning strategies. This is followed by a proposed solution for helping students effectively explore this important aspect of the nature of science.




Machine learning and data mining: strategies for hypothesis generation

  • M A Oquendo 1 ,
  • E Baca-Garcia 1 , 2 ,
  • A Artés-Rodríguez 3 ,
  • F Perez-Cruz 3 , 4 ,
  • H C Galfalvy 1 ,
  • H Blasco-Fontecilla 2 ,
  • D Madigan 5 &
  • N Duan 1 , 6  

Molecular Psychiatry, volume 17, pages 956–959 (2012)


  • Data mining
  • Machine learning
  • Neurological models

Strategies for generating knowledge in medicine have included observation of associations in clinical or research settings and, more recently, development of pathophysiological models based on molecular biology. Although critically important, these strategies limit hypothesis generation to an incremental pace. Machine learning and data mining are alternative approaches to identifying new vistas to pursue, as is already evident in the literature. In concert with these analytic strategies, novel approaches to data collection can enhance the hypothesis pipeline as well. In data farming, data are obtained in an ‘organic’ way, in the sense that they are entered by patients themselves and are available for harvesting. In contrast, in evidence farming (EF), it is the provider who enters medical data about individual patients. EF differs from regular electronic medical record systems because frontline providers can use it to learn from their own past experience. In addition to the possibility of generating large databases with farming approaches, it is likely that we can further harness the power of large data sets collected using either farming or more standard techniques through implementation of data-mining and machine-learning strategies. Exploiting large databases to develop new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development.
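
To make the data-mining idea concrete, here is a minimal, purely illustrative sketch in Python: it screens pairwise associations between binary clinical variables (all names and data below are invented) and ranks them as candidate hypotheses for later confirmatory study. It is not the authors' method, only one simple instance of a hypothesis-generating screen.

```python
# Illustrative only: a hypothesis-generating association screen on synthetic data.
# Variable names and values are invented; any surfaced association is a candidate
# hypothesis to be tested prospectively, not a finding.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "ssri_exposure": rng.integers(0, 2, n),
    "insomnia": rng.integers(0, 2, n),
    "rehospitalized": rng.integers(0, 2, n),
    "family_history": rng.integers(0, 2, n),
})

candidates = []
for a, b in combinations(df.columns, 2):
    table = pd.crosstab(df[a], df[b])          # 2x2 contingency table
    chi2, p, _, _ = chi2_contingency(table)
    candidates.append((a, b, p))

# Rank the smallest-p associations as candidate hypotheses for follow-up.
for a, b, p in sorted(candidates, key=lambda t: t[2])[:3]:
    print(f"candidate hypothesis: {a} is associated with {b} (screening p = {p:.3f})")
```

In practice any such screen would need correction for multiple comparisons, and the surfaced associations are starting points for hypothesis-driven studies rather than results.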




Acknowledgements

Dr Blasco-Fontecilla acknowledges the Spanish Ministry of Health (Rio Hortega CM08/00170), Alicia Koplowitz Foundation, and Conchita Rabago Foundation for funding his post-doctoral rotation at CHRU, Montpellier, France. SAF2010-21849.

Author information

Authors and Affiliations

Department of Psychiatry, New York State Psychiatric Institute and Columbia University, New York, NY, USA

M A Oquendo, E Baca-Garcia, H C Galfalvy & N Duan

Fundacion Jimenez Diaz and Universidad Autonoma, CIBERSAM, Madrid, Spain

E Baca-Garcia & H Blasco-Fontecilla

Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid, Spain

A Artés-Rodríguez & F Perez-Cruz

Princeton University, Princeton, NJ, USA

F Perez-Cruz

Department of Statistics, Columbia University, New York, NY, USA

Department of Biostatistics, Columbia University, New York, NY, USA


Corresponding author

Correspondence to M A Oquendo .

Ethics declarations

Competing interests

Dr Oquendo has received unrestricted educational grants and/or lecture fees from AstraZeneca, Bristol Myers Squibb, Eli Lilly, Janssen, Otsuka, Pfizer, Sanofi-Aventis and Shire. Her family owns stock in Bristol Myers Squibb. The remaining authors declare no conflict of interest.


About this article

Cite this article

Oquendo, M., Baca-Garcia, E., Artés-Rodríguez, A. et al. Machine learning and data mining: strategies for hypothesis generation. Mol Psychiatry 17 , 956–959 (2012). https://doi.org/10.1038/mp.2011.173


Received : 15 July 2011

Revised : 20 October 2011

Accepted : 21 November 2011

Published : 10 January 2012

Issue Date : October 2012

DOI : https://doi.org/10.1038/mp.2011.173


  • data farming
  • inductive reasoning



This is a preprint.

Data-driven hypothesis generation among inexperienced clinical researchers: a comparison of secondary data analyses with visualization (VIADS) and other tools

1 Department of Public Health Sciences, Clemson University, Clemson, SC

James J. Cimino

2 Informatics Institute, School of Medicine, University of Alabama, Birmingham, Birmingham, AL

Vimla L. Patel

3 Cognitive Studies in Medicine and Public Health, The New York Academy of Medicine, New York City, NY

Yuchun Zhou

4 Patton College of Education, Ohio University, Athens, OH

Jay H. Shubrook

5 College of Osteopathic Medicine, Touro University, Vallejo, CA

Sonsoles De Lacalle

6 Department of Health Science, California State University Channel Islands, Camarillo, CA

Brooke N. Draghi

Mytchell A. Ernst, Aneesa Weaver, Shriram Sekar

7 School of Computing, Clemson University, Clemson, SC

8 Russ College of Engineering and Technology, Ohio University, Athens, OH


Objectives:

To compare how clinical researchers generate data-driven hypotheses with a visual interactive analytic tool (VIADS, a visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies) or other tools.

Methods:

We recruited clinical researchers and separated them into “experienced” and “inexperienced” groups. Within each group, participants were randomly assigned to a VIADS or control group. Each participant conducted a remote 2-hour study session for hypothesis generation with the same study facilitator, using the same datasets and following a think-aloud protocol. Screen activities and audio were recorded, transcribed, coded, and analyzed. Hypotheses were evaluated by seven experts on their validity, significance, and feasibility. We conducted multilevel random effect modeling for the statistical tests.

Results:

Eighteen participants generated 227 hypotheses, of which 147 (65%) were valid. The VIADS and control groups generated similar numbers of hypotheses. The VIADS group took a significantly shorter time to generate each hypothesis (e.g., among inexperienced clinical researchers, 258 seconds versus 379 seconds, p = 0.046, power = 0.437, ICC = 0.15). The VIADS group also received significantly lower ratings than the control group on feasibility and on the combined rating of validity, significance, and feasibility.

Conclusion:

The role of VIADS in hypothesis generation seems inconclusive. The VIADS group took a significantly shorter time to generate each hypothesis. However, the combined validity, significance, and feasibility ratings of their hypotheses were significantly lower. Further characterization of hypotheses, including specifics on how they might be improved, could guide future tool development.

Introduction

A scientific hypothesis is an educated guess regarding the relationships among several variables [ 1 , 2 ]. A hypothesis is a fundamental component of a research question [ 3 ], which typically can be answered by testing one or several hypotheses [ 4 ]. A hypothesis is critical for any research project; it determines the project’s direction and impact. Many studies of scientific research have made significant progress on scientific [ 5 , 6 ] and medical reasoning [ 7 – 11 ], problem-solving, analogy, working memory, and learning and thinking in educational contexts [ 12 ]. However, most of these studies begin with a given question and focus on scientific reasoning [ 14 ], medical diagnosis, or differential diagnosis [ 10 , 15 , 16 ]; Henry and colleagues referred to these as open or closed discoveries in the literature-mining context [ 18 ]. The reasoning mechanisms and processes used in solving an existing puzzle are critical; however, the current literature provides limited information about the scientific hypothesis generation process [ 4 – 6 ], i.e., identifying the focus area to start with, rather than generating hypotheses to solve an existing problem.

There have been attempts to generate hypotheses automatically using text mining, literature mining, knowledge discovery, natural language processing, Semantic Web technology, or machine learning methods to reveal new relationships among diseases, genes, proteins, and conditions [ 19 – 23 ]. Many of these efforts were based on Swanson’s ABC model [ 24 – 26 ]. Several research teams explored automatic literature systems for generating [ 28 , 29 ], validating [ 30 ], or enriching hypotheses [ 31 ]. However, these studies recognized the complexity of the hypothesis generation process and concluded that it does not seem feasible to generate hypotheses completely automatically [ 19 – 21 , 24 , 32 ]. In addition, hypothesis generation is not just the identification of new relationships, although a new connection is a critical component of the process. Other literature-related efforts include adding temporal dimensions to machine learning models to predict connections between terms [ 33 , 34 ] or evaluating hypotheses using knowledge bases and Semantic Web technology [ 32 , 35 ]. Understanding how humans use such systems to generate hypotheses in practice may provide unique insights into scientific hypothesis generation, which can help system developers better automate systems to facilitate the process.

Many researchers believe that their secondary data analytical tools (such as VIADS, a visual interactive analytic tool for filtering and summarizing large health data sets coded with hierarchical terminologies [ 36 – 39 ]) can facilitate hypothesis generation [ 40 , 41 ]. Whether these tools work as expected, and how, has not been systematically investigated. Data-driven hypothesis generation is a critical first step in clinical and translational research projects [ 42 ]. Therefore, we conducted a study to investigate if and how VIADS can facilitate generating data-driven hypotheses among clinical researchers. We recorded the hypothesis generation process and compared the results of those who used VIADS with those who did not. Hypothesis quality evaluation is usually conducted as part of a larger work, e.g., the evaluation of a scientific paper or a research grant proposal; therefore, there are no existing metrics for evaluating hypothesis quality on its own. We developed our own quality metrics to evaluate hypotheses [ 1 , 3 , 4 , 43 – 49 ] through iterative internal and external validation [ 43 , 44 ]. This paper reports a study of scientific hypothesis generation by clinical researchers with or without VIADS, including the quality evaluation and quantitative measurement of the hypotheses and an analysis of the responses to follow-up questions. This is a randomized human participant study; however, per the definition of the National Institutes of Health, it is not a clinical trial. Therefore, we did not register it on ClinicalTrials.gov .

Research question and hypothesis

  • Can secondary data analytic tools, e.g., VIADS, facilitate the hypothesis generation process?

We hypothesize that there will be differences in hypothesis generation between clinical researchers who use VIADS and those who do not.

Rationale of the research question

Many researchers believe that new analytical tools offer opportunities to reveal further insights and new patterns in existing data and thereby facilitate hypothesis generation [ 1 , 28 , 41 , 42 , 50 ]. We developed the underlying algorithms (which determine what VIADS can do) [ 38 , 39 ] and the publicly accessible online tool VIADS [ 37 , 51 , 52 ] to provide new ways of summarizing, comparing, and visualizing datasets. In this study, we explored the utility of VIADS.

Study design

We conducted a 2 × 2 study with four groups: inexperienced clinical researchers without VIADS (group 1; these participants were free to use any other analytical tools) and with VIADS (group 2), and experienced clinical researchers without VIADS (group 3) and with VIADS (group 4). The main differences between experienced and inexperienced clinical researchers were years of experience in conducting clinical research and the number of publications as significant contributors [ 53 ].

A pilot study, involving two participants and four study sessions, was conducted before we finalized the study datasets ( Supplemental Material 1 ), training material ( Supplemental Material 2 ), study scripts ( Supplemental Material 3 ), follow-up surveys ( Supplemental Material 4 & 5 ), and study session flow. Afterward, we recruited clinical researchers for the study sessions.

Recruitment

We recruited study participants through local, national, and international platforms, including American Medical Informatics Association (AMIA) mailing lists for working groups (e.g., clinical research informatics, clinical information system, implementation, clinical decision support, and Women in AMIA), N3C [ 54 ] study network Slack channels, South Carolina Clinical and Translational Research Institute newsletter, guest lectures and invited presentations in peer-reviewed conferences (e.g., MIE 2022), and other internal research related newsletters. All collaborators of the investigation team shared the recruitment invitations with their colleagues. Based on the experience level and our block randomization list, the participants were assigned to the VIADS or non-VIADS groups. After scheduling, the study script and IRB-approved consent forms were shared with participants. The datasets were shared on the study date. All participants received compensation based on the time they spent.
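
The following sketch (Python) illustrates what block randomization within one experience stratum might look like; the block size of four, the group labels, and the seed are assumptions for illustration only, not the study's actual randomization procedure.

```python
# Illustrative block randomization within an experience stratum.
# Block size and labels are assumed; the study used its own pre-generated list.
import random

def block_randomize(n_participants: int, block_size: int = 4, seed: int = 42):
    """Return a VIADS/control assignment list balanced within each block."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = ["VIADS"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)               # shuffle within the block
        assignments.extend(block)
    return assignments[:n_participants]

print(block_randomize(8))                # e.g., ['control', 'VIADS', ...]
```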

Every study participant used the same datasets and followed the same study scripts. The same study facilitator conducted all study sessions. For the VIADS groups, we scheduled a one-hour training session. All groups had a study session lasting a maximum of 2 hours. During the training session, the study facilitator demonstrated how to use VIADS, and then the participants demonstrated their use of VIADS. During the study session, participants analyzed datasets with VIADS or other tools to develop hypotheses by following the think-aloud protocol. During the study sessions, the study facilitator asked questions, provided reminders, and acted as a colleague to the participants. All training and study sessions were conducted remotely via WebEx meetings. Figure 1 shows the study flow.

Figure 1. Study flow for the data-driven hypothesis generation

During the study session, all the screen activities and conversations were recorded via FlashBB and converted to audio files for professional transcription. At the end of each study session, the study facilitator asked follow-up questions about the participants’ experiences creating and capturing new research ideas. The participants in the VIADS groups also completed two follow-up Qualtrics surveys; one was about the participant and questions on how to facilitate the hypothesis generation process better ( Supplemental Material 4 ), and the other evaluated VIADS usability with a modified version of the System Usability Scale ( Supplemental Material 5 ). The participants in the non-VIADS groups received one follow-up Qualtrics survey ( Supplemental Material 4 ).

Hypothesis evaluation

We developed a complete version and a brief version of the hypothesis quality evaluation instrument ( Supplemental Material 6 ) based on hypothesis quality evaluation metrics. We recruited a clinical research expert panel comprising four external members and three senior project advisors from our investigation team with clinical research backgrounds to validate the instruments. Their detailed eligibility criteria have been published [ 53 ]. The expert panel evaluated the quality of all hypotheses generated by participants. In Phase 1, the full version of the instrument was used to evaluate 30 randomly selected hypotheses, and the evaluation results enabled us to develop a brief version of the instrument [ 43 ] ( Supplemental Material 7 , including three dimensions: validity, significance, and feasibility); Phase 2 used the brief instrument to evaluate the remaining hypotheses. Each dimension used a 5-point scale, from 1 (the lowest) to 5 (the highest). Therefore, for each hypothesis, the total raw score could range between 3 and 15. In our descriptive results and statistical tests, we used the averages of these scores (i.e., total raw score/3). Like the scores for each dimension, these averages ranged from 1 to 5.
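
As a minimal illustration of the scoring arithmetic described above, the sketch below (Python; the expert labels and ratings are hypothetical) derives the 3-15 raw total and the 1-5 averaged score for a single hypothesis.

```python
# Illustrative scoring of one hypothesis by several experts (hypothetical values).
import pandas as pd

ratings = pd.DataFrame(
    {"validity": [4, 3, 5], "significance": [3, 3, 4], "feasibility": [2, 3, 3]},
    index=["expert_1", "expert_2", "expert_3"],
)

ratings["raw_total"] = ratings[["validity", "significance", "feasibility"]].sum(axis=1)  # range 3-15
ratings["averaged"] = ratings["raw_total"] / 3                                           # range 1-5

hypothesis_score = ratings["averaged"].mean()      # mean across experts
print(ratings)
print(f"averaged quality score for this hypothesis: {hypothesis_score:.2f}")
```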

We generated a random list for all hypotheses. Then, based on the random list, we put ten randomly selected hypotheses into one Qualtrics survey for quality evaluation. We initiated the quality evaluation process after the completion of all the study sessions, allowing all hypotheses to be included in the generation of the random list.

Data analysis plan on hypothesis quality evaluation

Our data analysis focuses on the quality and quantity of the hypotheses generated by participants. We conducted multilevel random intercept effect modeling in MPlus 7 to compare the VIADS group and the control group on the following items: the quality of the hypotheses, the number of hypotheses, and the time each participant needed to generate each hypothesis. We also examined the correlations between the hypothesis quality ratings and the participants’ self-perceived creativity.
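
The study fit its models in MPlus 7. Purely as an illustration of the same kind of random intercept model, the sketch below uses Python's statsmodels with synthetic data and hypothetical column names (one row per hypothesis, nested within participants, group coded 0 = control and 1 = VIADS) and recovers the ICC from the estimated variance components, as defined in the notes to Tables 3 and 4.

```python
# Illustrative random intercept model: hypothesis-level ratings nested within
# participants, fixed effect for group. Data and column names are hypothetical;
# the study fit its models in MPlus 7.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for pid in range(15):                              # 15 hypothetical participants
    group = pid % 2                                # 0 = control, 1 = VIADS
    participant_effect = rng.normal(0, 0.4)        # simulated random intercept
    for _ in range(int(rng.integers(3, 10))):      # several hypotheses each
        rating = 3.0 - 0.4 * group + participant_effect + rng.normal(0, 0.6)
        rows.append({"participant_id": pid, "group": group, "feasibility": rating})
df = pd.DataFrame(rows)

model = smf.mixedlm("feasibility ~ group", data=df, groups=df["participant_id"])
fit = model.fit()
print(fit.summary())                               # 'group' coefficient plays the role of beta

between_var = float(fit.cov_re.iloc[0, 0])         # between-participant variance
within_var = fit.scale                             # residual (within-participant) variance
print(f"ICC = {between_var / (between_var + within_var):.2f}")
```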

We first analyzed all hypotheses to explore the aggregated results. A second analysis was conducted by using only valid hypotheses after removing any hypothesis that was scored at “1” (the lowest rating) for validity by three or more experts. However, we include both sets of results in this paper. The usability results of VIADS were published separately [ 36 ].
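
A small sketch of the validity filter described above (Python; the ratings and the use of three experts instead of seven are hypothetical simplifications): a hypothesis is excluded from the valid subset when three or more experts rated its validity at 1.

```python
# Illustrative validity filter: drop hypotheses rated "1" for validity by
# three or more experts. Ratings below are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "hypothesis_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "expert_id":     ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "validity":      [1, 1, 1, 4, 3, 5, 1, 2, 4],
})

ones_per_hypothesis = ratings["validity"].eq(1).groupby(ratings["hypothesis_id"]).sum()
valid_ids = ones_per_hypothesis[ones_per_hypothesis < 3].index
valid_ratings = ratings[ratings["hypothesis_id"].isin(valid_ids)]
print(f"valid hypotheses: {sorted(valid_ids)}")    # hypothesis 1 is excluded
```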

All hypotheses were coded by two research assistants who worked separately and independently. They coded the time needed for each hypothesis and cognitive events during hypothesis generation. The coding principles ( Supplemental Material 8 ) were developed as the two research assistants worked. Whenever there was a discrepancy, a third member of the investigation team joined the discussion to reach a consensus by refining the coding principles.

Ethical statement

Our study was approved by the Institutional Review Boards (IRB) of Clemson University, South Carolina (IRB2020-056) and Ohio University (18-X-192).

Participant demographics

We screened 39 researchers, of whom 20 participated; two of these were in the pilot study. Participants were from different locations and institutions in the United States. Among the 18 study participants, 15 were inexperienced clinical researchers and three were experienced. The experienced clinical researchers were underrepresented, and their results were mainly for informational purposes. Table 1 presents the background information of the participants.

Table 1. Eighteen participants’ (clinical researchers’) profiles

Note: -, no value.

Expert panel composition and intraclass correlation coefficient (ICC)

Seven experts validated the metrics and instruments [ 43 , 44 ] and evaluated the hypotheses using the instruments. Each expert was from a different institution in the United States. Five had medical backgrounds, three of whom were in clinical practice, and two had research methodology expertise. All had 10 or more years of clinical research experience. For the hypothesis quality evaluation, the ICC of the seven experts was moderate, at 0.49.
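
The paper does not specify which ICC form was computed; as one illustration, the sketch below (Python) calculates a one-way random-effects ICC(1) from a hypotheses-by-experts rating matrix using the standard mean-square formula, with a randomly generated matrix standing in for the real ratings.

```python
# Illustrative one-way random-effects ICC(1) for a ratings matrix
# (rows = hypotheses, columns = experts). The paper does not state its ICC form.
import numpy as np

def icc1(ratings: np.ndarray) -> float:
    n, k = ratings.shape                           # n hypotheses, k experts
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    ss_between = k * ((row_means - grand_mean) ** 2).sum()
    ss_within = ((ratings - row_means[:, None]) ** 2).sum()
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(1)
demo = rng.integers(1, 6, size=(20, 7)).astype(float)   # fake 20 hypotheses x 7 experts
print(f"ICC(1) = {icc1(demo):.2f}")
```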

Hypothesis quality and quantity evaluation results

The 18 participants generated 227 hypotheses during the study sessions: 80 invalid and 147 (65%) valid hypotheses. Both the full set and the valid subset were used separately for further analysis and comparison. Of the 147 valid hypotheses, 121 were generated by inexperienced clinical researchers (n = 15) in the VIADS (n = 8) and control (n = 7) groups.

Table 2 shows the descriptive results of the hypothesis quality evaluation for the VIADS and control groups. We used four analytic strategies: valid hypotheses by inexperienced clinical researchers (n = 121), valid hypotheses by inexperienced and experienced clinical researchers (n = 147), all hypotheses by inexperienced clinical researchers (n = 192), and all hypotheses by inexperienced and experienced clinical researchers (n = 227).

Table 2. Expert panel quality rating results for hypotheses generated by the VIADS and control groups

Note : valid. → valid hypotheses; all. → all hypotheses; SD, standard deviation. Each hypothesis was rated on a 5-point scale, from 1 (the lowest) to 5 (the highest) in each dimension.

Table 3 shows the hypotheses’ quality evaluation results for random intercept effect modeling. The four analytic strategies generated similar results. The VIADS group received slightly lower validity, significance, and feasibility scores, but the differences in validity and significance were statistically insignificant regardless of the analytic strategy. However, the feasibility scores of the VIADS group were statistically significantly lower ( p < 0.01), regardless of the analytic strategy. Using the random intercept effect model, the combined validity, significance, and feasibility ratings of the VIADS group were statistically significantly lower than those of the control group for three of four analytic strategies ( p < 0.05). This was most likely due to the differences in feasibility ratings between the VIADS and control groups. Across all four analytic strategies, there was a statistically significant random intercept effect on validity between participants. When all hypotheses and all participants were considered, there was also a statistically significant random intercept effect on significance between participants.

Table 3. Multilevel random intercept modeling results on hypothesis quality ratings for different strategies

Note : valid. → valid hypotheses; all. → all hypotheses; Parti. → participants; ICC: intraclass correlation coefficient. β , group mean difference between VIADS and control groups in their quality ratings of hypotheses; ICC (for random effect modeling) = between-participants variance/(between-participants variance + within-participants variance).

In addition to quality ratings, we also compared the number of hypotheses and the time needed for each participant to generate a hypothesis between groups. The inexperienced clinical researchers in the VIADS group and control group generated a similar number of valid hypotheses. Those in the VIADS group generated between 1 and 19 valid hypotheses in 2 hours (mean 8.43) and those in the control group generated 2 to 13 (mean 7.63). The inexperienced clinical researchers in the VIADS group took 102–610 seconds to generate a valid hypothesis (mean, 278.6 s), and those in the control group took 250–566 seconds (mean, 358.2 s).

Table 4 shows the random intercept modeling results of our comparison of the time needed to generate hypotheses using the four different strategies. On average, the VIADS group required significantly less time ( p < 0.05) to generate a hypothesis, regardless of the analytic strategy used. The results were consistent with those obtained from the analysis of all hypotheses by independent t-test [ 55 ]. There were no statistically significant random intercept effects between participants, regardless of the analytic strategy used ( Table 4 ).

Table 4. Multilevel random intercept modeling results on the time used to generate hypotheses for different strategies

Note : valid. → valid hypotheses; all. → all hypotheses; Parti. → participants; ICC: intraclass correlation coefficient; β , group mean difference between VIADS and control groups in the time used to generate each hypothesis on average; ICC (for random effect modeling) = between-participants variance/(between-participants variance + within-participants variance)

Experienced clinical researchers

There were three experienced clinical researchers among the participants, two in the VIADS group and one in the control group. The two experienced clinical researchers in the VIADS group generated 12 (average time 215 s/hypothesis) and 3 (average time 407 s/hypothesis) valid hypotheses in 2 hours, respectively. The experienced clinical researcher in the control group generated 12 valid hypotheses (average time, 413 s/hypothesis). The experienced clinical researchers in the VIADS group received average quality scores of 8.51 (SD, 1.18) and 7.04 (SD, 0.3), while the experienced clinical researcher in the control group received a quality score of 9.84 (SD, 0.81) out of 15 per hypothesis. These results were used for informational purposes only.

Follow-up questions

The follow-up questions comprised three parts: verbal questions asked by the study facilitator, a follow-up survey for all participants, and a System Usability Scale (SUS) survey for the VIADS group participants [ 36 ]. The results from the first two parts are summarized below.

The verbal questions and the summarized answers are presented in Table 5 . Reading and interacting with others were the most commonly used activities for generating new research ideas. Attending conferences, seminars, and educational events, as well as clinical practice, were also important in generating hypotheses. No specific tools were used to capture initial hypotheses or research ideas; most participants used text documents in Microsoft Word, text messages, emails, or sticky notes to record their initial ideas.

Table 5. Follow-up questions (verbal) and summary answers after each study session (all study participants)

Figure 2 is a scientific hypothesis generation framework we developed based on the literature [ 1 , 3 , 4 , 45 , 56 , 57 ], the follow-up questions and answers after study sessions, and self-reflection on our own research project trajectories. The external environment, cognitive capacity, and interactions between the individual and the external world, especially the tools used, are the categories of critical factors that contribute significantly to hypothesis generation. Cognitive capacity takes a long time to change, and the external environment can be unpredictable. The tools that interact with existing datasets are one of the modifiable factors in the hypothesis generation framework, and this is what we aimed to test in this study.

Figure 2. Scientific hypothesis generation framework: contributing factors

One follow-up question was about the self-perceived creativity of the study participants. The average hypothesis quality rating score per participant did not correlate with self-perceived creativity ( p = 0.616, two-tailed Pearson correlation test) or with the number of valid hypotheses generated ( p = 0.683, two-tailed Pearson correlation test) among inexperienced clinical researchers. There was also no correlation between the highest and lowest 10 ratings and the individual’s self-perceived creativity among inexperienced clinical researchers, regardless of whether they used VIADS.
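
The correlation checks reported above are two-tailed Pearson tests; a minimal sketch with hypothetical per-participant values is shown below (Python).

```python
# Illustrative two-tailed Pearson correlation between self-perceived creativity
# and average hypothesis quality rating per participant. Values are hypothetical.
from scipy.stats import pearsonr

creativity = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5, 4, 3, 2, 4, 3]        # self-rated, 15 participants
avg_quality = [8.1, 7.9, 8.4, 7.5, 8.0, 7.7, 8.2, 8.6, 7.8,
               7.6, 8.3, 7.9, 8.5, 7.4, 8.0]                      # mean rating (3-15 scale)

r, p = pearsonr(creativity, avg_quality)           # two-tailed by default
print(f"r = {r:.2f}, p = {p:.3f}")
```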

In our follow-up survey, the questions were mainly about participants’ current roles and affiliations, their experience in clinical research, their preferences for analytic tools, and their ratings of the importance of different factors (e.g., prospective study, longitudinal study) routinely considered in clinical research study design; the factor ratings will be published separately. Most of the other results are included in Table 1 .

In our follow-up survey, one question was, “If you were provided with more detailed information about research design (e.g., focused population) during your hypothesis generation process, do you think the information would help formulate your hypothesis overall?” All 20 participants (including the two in the pilot study) selected Yes. This demonstrates a recognized need for assistance during hypothesis generation. In the follow-up surveys, VIADS users provided overwhelmingly positive feedback on VIADS, and they all agreed (100%) that VIADS offered new perspectives on the datasets compared with the tools they currently use for the same types of datasets [ 36 ].

Interpretation of the results

We aimed to determine the role of secondary data analytic tools, e.g., VIADS, in generating scientific hypotheses in clinical research and to evaluate the utility and usability of VIADS. The usability results of VIADS have been published separately [ 36 ]. Regarding the role and utility of VIADS, we measured the number of hypotheses generated, the average time needed to generate each hypothesis, the quality of the hypotheses, and the user feedback on VIADS. Among inexperienced clinical researchers, participants in the VIADS and control groups generated similar numbers of valid and total hypotheses. The VIADS group a) needed a significantly shorter time to generate each hypothesis on average; b) received significantly lower feasibility ratings than the control group; c) received significantly lower quality ratings (in three of four analytic strategies) in the combined ratings of validity, significance, and feasibility, most likely due to the differences in feasibility ratings; d) provided very positive feedback on VIADS [ 36 ], with 75% agreeing that VIADS facilitates understanding, presentation, and interpretation of the underlying datasets; and e) unanimously agreed (100%) that VIADS provided new perspectives on the datasets compared with other tools.

However, the current results were inconclusive in answering the research question. The direct measurements of differences between the VIADS and control groups were mixed ( Tables 3 and 4 ). The VIADS group took significantly less time than the control group to generate each hypothesis, regardless of the analytic strategy. Considering the sample size and power (0.31 to 0.44) of the study ( Table 4 ), and the absence of significant random intercept effects between participants, regardless of analytic strategy, this result is noteworthy. The shorter hypothesis generation time in the VIADS group indicates that VIADS may facilitate participants’ thinking about or understanding of the underlying datasets. While timing is not as critical in clinical research as it is in clinical care, this result is still very encouraging.

On the other hand, the quality ratings of the generated hypotheses showed mixed and somewhat unfavorable results. The VIADS group received nonsignificantly lower ratings for validity and significance, and significantly lower ratings for feasibility, than the control group, regardless of analytic strategy. In addition, the combined validity, significance, and feasibility ratings of the VIADS group were significantly lower than those of the control group for three analytic strategies (power ranged from 0.37 to 0.79, Table 3 ). There were significant random intercept effects on validity between participants, regardless of analytic strategy. When we considered all hypotheses among all participants, there were also significant random intercept effects on significance between participants. These results indicate that the significantly lower feasibility ratings in the VIADS group may not be caused by random effects. There are various possible explanations for the lower feasibility ratings of the VIADS group, e.g., VIADS may facilitate the generation of less feasible hypotheses. While unfeasible ideas are not identical to creative ideas, they may be a deviation on the path to creative ideas.

Although the VIADS group received lower quality ratings than the control group, it would be an overstatement to claim that VIADS reduces the quality of generated hypotheses. We posit that the 1-hour training session received by the VIADS group likely played a critical role in the quality rating differences. Among the inexperienced clinical researchers, six underwent 3-hour sessions, i.e., 1-hour training followed by a 2-hour study session, with a 5-minute break in between. Two participants had the training and the study session on separate days. The cognitive load of the training session was not considered during the study design; therefore, the training sessions were not required to be conducted on a different day from the study sessions. However, in retrospect, we should have mandated that the training and study sessions take place on separate days. In addition, the VIADS group had a much higher percentage of participants with less than 2 years of research experience (75%) than the control group (43%, Table 1 ). Although we adhered strictly to the randomization list when assigning participants to the two groups, the relatively small sample sizes are likely to have amplified the effects of the research experience imbalance between the two groups.

The literature [ 58 ] suggests that learning a complex tool and performing tasks simultaneously places extra cognitive load on participants. This was likely the case in this study. VIADS group participants needed to learn how to use the tool and then analyze the datasets with it to come up with hypotheses. The cognitive overload may not have been conscious; therefore, the participants perceived VIADS as helpful in understanding the datasets. However, the quality evaluation results did not support the participants’ perceptions of VIADS, although the timing differences did support their feedback.

The role of VIADS in the hypothesis generation process may not be linear. The 2-hour use of VIADS did not generate statistically higher average quality ratings on hypotheses; however, all participants (100%) agreed that VIADS provided new perspectives on the datasets. The true role of VIADS in hypothesis generation might be more complicated than we initially thought. Either two hours were inadequate to generate higher average quality ratings, or our evaluation was not adequately granular to capture the differences. A more natural use environment might be necessary instead of a simulated environment to demonstrate detectable differences.

Researchers have long been keen to understand where good research ideas come from [ 59 , 60 ]. Participants’ answers to our follow-up questions have provided anecdotal information about possible activities contributing to the hypothesis generation process. From these insights, and a literature review [ 1 , 3 , 4 , 45 , 56 , 57 ], we formulated a scientific hypothesis generation framework ( Figure 2 ). All the following activities were identified as associated with hypothesis generation: reading, interactions with others, observations during clinical practice, teaching, learning, and listening to presentations. Individuals connected ideas, facts, and phenomena, and formulated them into research questions and hypotheses to test. Although these activities did not answer the question directly, they were identified by the participants as associated with hypothesis generation.

Before the expert panel members evaluated the hypotheses, it was necessary to establish some form of threshold based on the following considerations: first, most of the participants were inexperienced clinical researchers; second, the participants were unfamiliar with the data-driven hypothesis generation process; and third, each participant generated many hypotheses within a 2-hour window. It was unlikely that every hypothesis generated would be valid or valuable, or that each should be given equal attention by the expert panel. Therefore, if three or more experts rated the validity of a hypothesis as 1 (the lowest), the hypothesis was considered invalid. However, these hypotheses were still included in two of the analytic strategies. Meanwhile, we recognize that identifying an impactful research question is necessary for, but does not guarantee, a successful research project; it is only an excellent start.

The three participants with the highest average quality ratings (i.e., 10.55, 10.25, and 9.84 out of 15) were all in the non-VIADS group; the top two were inexperienced and the third was an experienced clinical researcher. They all practice medicine. Based on their conversations with the study facilitator, they all put much thought into research and connect observations in medical practice with clinical research; their clinical practice experience, education, observation, thinking, and ability to make connections contributed to the higher quality ratings of their hypotheses. This observation supports the belief that good research ideas require at least three pillars: 1) a deep understanding of domain knowledge and the underlying mechanisms, 2) the ability to connect knowledge with practical observations (problems or phenomena), and 3) the ability to put these observations into appropriate research contexts [ 3 , 4 , 59 , 61 – 63 ].

The three participants with the highest average quality scores were in the non-VIADS group, and the participant with the lowest average quality score was in the VIADS group. Despite our randomization, individual variation may play an amplified role when the sample size is relatively small. Although the random effect models allowed more precise statistical analyses, larger sample sizes would have provided higher power and more reliable results.

Significance of the work

Using the same datasets, we conducted the first human participant investigation to compare data-driven hypothesis generation using VIADS and other analytic tools. The significance of the study can be demonstrated in the following aspects. First, this study demonstrated the feasibility of remotely conducting a data-driven hypothesis generation study via the think-aloud protocol. Second, we established hypothesis quality evaluation metrics and instruments, which may be useful for clinical researchers during peer review or for prioritizing their research ideas before investing substantial resources. Third, this study established baseline data for the number, quality, and time needed for hypotheses generated by inexperienced clinical researchers. Fourth, hypothesis generation is complicated; our results showed that the VIADS group generated hypotheses significantly faster than the control group, while the overall quality of the hypotheses favored the control group. Fifth, we identified that inexperienced clinical researchers need more assistance in the hypothesis generation process. This work has laid a foundation for more structured and organized clinical research projects, starting from a more explicit hypothesis generation process. However, we believe that this is only the tip of the hypothesis generation iceberg.

Strengths and limitations of the work

The study participants were from all over the United States, not a single health system or institution. Although the sample of participants may be more representative, individual variations may have played a more significant role than estimated in the hypothesis quality measurements.

We implemented several strategies to create comparable groups. For example, we used the same datasets, study scripts, and platform (WebEx) for all study sessions conducted by the same study facilitator. Two research assistants examined the time measurements and events of the hypothesis generation process independently and compared their results later, which made the coding more consistent and robust.

The study had a robust design and a thorough analysis. Random intercept effect modeling was a more precise statistical test. We implemented randomization at multiple levels to reduce potential bias. The study participants were separated into experienced and inexperienced groups during screening based on predetermined criteria [ 53 ] and then randomly assigned to the VIADS or non-VIADS groups. For the hypothesis quality evaluation, every ten randomly selected hypotheses were organized into one Qualtrics survey. This provided a fair opportunity for each hypothesis and reduced bias related to the order of hypotheses. The hypothesis quality measurement metrics and instruments were validated iteratively, internally, and externally [ 43 ]. We implemented multiple analytical strategies to provide a comprehensive picture of the data and enhance the robustness of the study. We examined (a) valid hypotheses only; (b) inexperienced clinical researchers only; (c) all hypotheses; and (d) all clinical researchers.

One major limitation was the sample size, which was based on our ability to recruit over the study period and was insufficient to detect an effect of the size we originally anticipated. The power calculation used during the study design (i.e., a required sample size of 32 across four groups, based on a 95% confidence level, α = 0.05, and a power level of 0.8, β = 0.20) was optimistic, mainly due to our belief that VIADS provides new ways of data visualization and analysis compared with other tools. The VIADS group participants verified our confidence in VIADS via the follow-up surveys. However, the quantitative measures showed mixed results that only partially support our opinion, and the actual power ranged from 0.31 to 0.79. One possible explanation is that hypothesis generation is highly complicated; a tool like VIADS may be helpful, but its effects might not be easily measured or detected after 2–3 hours of use. Better control of participant characteristics would have reduced the possibility of bias due to participant variations between the two groups. In practice, however, there is no universally agreed-upon measure of research expertise that could have been adopted in our study.
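
For readers who want to retrace the sample-size reasoning, the sketch below (Python, statsmodels) treats the four-group design as a one-way ANOVA power problem and the two-group timing comparison as a t-test power problem; the effect sizes are placeholder assumptions, since the paper does not report the anticipated effect size used in its original calculation.

```python
# Illustrative a priori sample size and post hoc power calculations.
# The effect sizes below are placeholder assumptions, not values from the study.
from statsmodels.stats.power import FTestAnovaPower, TTestIndPower

# A priori: total N for a four-group comparison at alpha = 0.05, power = 0.80.
anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(effect_size=0.65, k_groups=4, alpha=0.05, power=0.80)
print(f"required total N (assumed Cohen's f = 0.65): {n_total:.0f}")

# Post hoc: achieved power for a two-group comparison with ~8 participants per group.
t_power = TTestIndPower()
achieved = t_power.solve_power(effect_size=0.9, nobs1=8, ratio=1.0, alpha=0.05)
print(f"achieved power (assumed d = 0.9, n = 8 per group): {achieved:.2f}")
```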

Another limitation is that we used a simulated environment. There was a time constraint during the study sessions, and time pressure can reduce hypothesis generation [ 64 ]; however, this influence would be similar in both groups. The hypothesis generation process in real life is usually lengthy and non-linear, with discussions, revisions, feedback, and refinement. We do not know whether a simulated environment can reflect the true natural process of hypothesis generation. The study facilitator who conducted the study sessions could also have been a stressor for some participants.

Our current measurements may be inadequate to detect such complexity. All the VIADS group participants agreed that VIADS is helpful in hypothesis generation; however, the quantitative and qualitative measures presented mixed results, which may indicate that more granular measurements are needed. We did notice that hypotheses generated by the VIADS groups seemed more complex than those of the control groups; however, we did not implement a measure of hypothesis complexity to provide concrete evidence.

The ICC for the quality ratings of the hypotheses by the seven experts was moderate, at 0.49. This is understandable, considering that a hypothesis provides limited information in two to three sentences; reviewers often disagree even on a full paper or a grant proposal, which includes much more detail. Despite these points, we would like to emphasize the complex nature of this study, and we should keep an open mind while interpreting the results.

Challenges and future directions

The cognitive events and processes during hypothesis generation (i.e., the think-aloud recordings of study sessions) are beyond the scope of this paper but are currently being analyzed, to be published separately. The results of this analysis may shed more light on our outcomes. In this study, we faced several challenges beyond our control that may have affected the study results. The current process can only capture the conscious and verbalized processes and may, for example, have failed to capture unconscious cognitive processes. Therefore, further analysis of the recorded think-aloud sessions might help us to better understand the hypothesis generation process and the differences between groups. A large-scale replication of the present study with more participants is needed. The factors that influence the validity, significance, and feasibility of hypotheses should also be explored further.

The VIADS group participants provided very positive feedback on VIADS and its role in hypothesis generation; however, the limitations of VIADS are apparent. By nature, VIADS is a data analysis and visualization tool. It accepts only specific dataset formats and supports only certain types of hypotheses. Therefore, more powerful and comprehensive tools designed specifically to assist hypothesis generation are needed [ 60 ]. Furthermore, a longer duration of use, in a more natural environment, might be necessary to demonstrate such tools’ effectiveness.

Recruitment is always challenging in human participant studies [ 47 – 49 ]. It is particularly challenging to recruit experienced clinical researchers, even though we made similar efforts and used similar platforms to recruit inexperienced and experienced clinical researchers. The different recruitment outcomes may be because hypothesis generation is not a high priority for experienced clinical researchers. It could also be because they were overwhelmed by existing responsibilities and did not have time for the additional tasks needed by our study.

Conclusion

The role of VIADS in hypothesis generation for clinical research seems uncertain. The VIADS group took significantly less time to generate hypotheses than the control group. However, the overall quality (as measured by the combination of validity, significance, and feasibility scores) of the hypotheses generated by the VIADS group was significantly lower than that of the control group, although there were statistically significant random intercept effects on validity between participants, regardless of the analytic strategy used. The lower combined quality ratings were likely due to the significantly lower feasibility ratings of the VIADS group. Further characterization of hypotheses, including specific ways in which they might be improved, could guide future tool development. Larger-scale studies may help to reveal more conclusive hypothesis generation mechanisms.

Highlights of the paper:

  • Distinguished the scientific hypothesis generation process from other aspects of scientific and medical thinking and reasoning.
  • Conducted a human participant study of data-driven hypothesis generation among clinical researchers.
  • Established baseline data for inexperienced clinical researchers: the number, quality, validity rate, and time needed to generate data-driven hypotheses within 2 hours.
  • VIADS appears to help users generate hypotheses more efficiently.
  • Cognitive overload may be a critical factor that negatively influences the quality of the hypotheses generated.

Supplementary Material

Supplement 1

Supplemental Material 1: The data sets used during the hypothesis generation study

Supplement 2

Supplemental Material 2: Training materials used for participants in VIADS groups

Supplement 3

Supplemental Material 3: Study scripts for participants in VIADS and non-VIADS groups

Supplement 4

Supplemental Material 4: Follow-up survey after the hypothesis generation study

Supplement 5

Supplemental Material 5: Modified version of System Usability Scale

Supplement 6

Supplemental Material 6: Hypothesis quality evaluation instrument for clinical research—full version

Supplement 7

Supplemental Material 7: Hypothesis quality evaluation instrument for clinical research—brief version

Supplement 8

Supplemental Material 8: Coding principles for timing the hypothesis generation process

Funding statement

The project was supported by a grant from the National Library of Medicine (R15LM012941) and was partially supported by the National Institute of General Medical Sciences of the National Institutes of Health (P20 GM121342). This work has also benefited from research training resources and the intellectual environment enabled by the NIH/NLM T15 SC BIDS4Health research training program (T15LM013977).

Competing interests disclosure statement

Assessing epistemological beliefs of experts and novices via practices in authentic science inquiry

  • Melanie E. Peffer (ORCID: orcid.org/0000-0002-6145-5186) &
  • Niloofar Ramezani

International Journal of STEM Education, volume 6, Article number: 3 (2019)

Achieving science literacy requires learning disciplinary knowledge and science practices and developing sophisticated epistemological beliefs about the nature of science and science knowledge. Although sophisticated epistemological beliefs about science are important for attaining science literacy, students' beliefs are difficult to assess. Previous work suggested that students' epistemological beliefs about science are best assessed in the context of engagement in science practices, such as argumentation or inquiry.

In this paper, we propose a novel method for examining students' epistemological beliefs about science situated in authentic science inquiry, or their Epistemology in Authentic Science Inquiry (EASI). As a first step towards developing this assessment, we performed a novice/expert study to characterize practices within a simulated authentic science inquiry experience provided by Science Classroom Inquiry (SCI) simulations. Our analyses indicated that experts and novices, as defined by their experience with authentic science practices, had distinct practices in SCI simulations. For example, experts, as compared to novices, spent much of their investigations seeking outside information, which is consistent with novice/expert studies in engineering. We also observed that novice practices existed on a continuum, with some appearing more or less expert-like. Furthermore, pre-test performance on established metrics of nature of science was predictive of practices within the simulation.

Conclusions

Since performance on pre-test metrics of nature of science was predictive of practices, and since there were distinct expert-like and novice-like practices, it may be possible to use practices in simulated authentic science inquiry as a proxy for students' epistemological beliefs. Given that novices existed on a continuum, this could facilitate the development of targeted science curricula tailored to the needs of a particular group of students. This study indicates how educational technologies, such as simulated authentic science inquiry, can be harnessed to examine difficult-to-assess but important constructs such as epistemology.

Introduction

Science literacy is multidimensional and includes general knowledge, understanding of practices such as argumentation or inquiry, positive attitudes towards and experiences in science, development of appropriate mental models about complex relationships, and epistemological beliefs about the nature of science and generation of science knowledge (Renken et al. 2016; Elby et al. 2016; Schwan et al. 2014). To properly educate a scientifically literate populace, it is necessary for science education to include all of these facets; however, many facets overlap, and some, such as authentic inquiry and epistemological beliefs about science, are pedagogically challenging. For example, although inquiry is an essential science practice for generating new science knowledge, students are overwhelmingly exposed to simple, rather than authentic, science inquiry in K-16 classrooms (Chinn and Malhotra 2002). Furthermore, some have suggested that each discipline of science has its own distinct nature of science (NOS) principles (Schizas et al. 2016). If there are different ways of conceptualizing science inquiry, a key question for science educators is as follows: to attain a scientifically literate populace, which type of inquiry, NOS principles, and epistemological beliefs about science do we teach in a classroom? If consensus regarding the nature of inquiry is difficult to attain, how can science educators effectively evaluate and assess epistemological understanding? This paper proposes a novel method for assessing epistemology situated in authentic science inquiry. As a first step towards developing a formal assessment, we attempt to define practices that align with what experts do in authentic science inquiry and with established metrics of NOS understanding and epistemological beliefs about science.

Overlap and distinction between NOS, epistemology, and authentic science inquiry

  • Nature of science

Lederman et al. ( 2002 ) stated that “NOS refers to the epistemology and sociology of science, science as a way of knowing, or the values and beliefs inherent to scientific knowledge and its development” (p. 498). The authors go on to identify aspects of NOS that deserve pedagogical focus: science knowledge is tentative, empirically generated, influenced by the social and cultural environment in which it is produced, and involves human inference and creativity. The authors operationalized NOS as the epistemological aspects that underlie scientific processes such as inquiry, but that are not the same as the processes themselves. In a later paper from the same research group, the nature of scientific inquiry is redefined: inquiry begins with a question, follows a non-linear path that can vary extensively, and may or may not include hypothesis testing. Moreover, data are not equivalent to evidence, and explanations for a phenomenon reconcile the data collected by the investigator with what is already known (Lederman et al. 2014).

Although Lederman’s definition of NOS is commonly used throughout the literature, it is subject to debate and critique. For example, an individual’s understanding and perception of NOS may change over time, either in response to changes in the field or personal experiences (Deng et al. 2011 ). There is also a question of whether or not there is a universal, domain-general NOS (Abd-El-Khalick 2012 ) or if NOS is better conceptualized within the context of a domain (Schizas et al. 2016 ). In addition to potential disciplinary differences, there is a wide variety of interpretations of NOS among practicing scientists, both within and between various disciplines (Sandoval and Redman 2015 ; Schwartz and Lederman 2008 ).

  • Epistemology

Epistemology is the study of what knowledge is and the exploration of what it means to know something. Questions of epistemology include exploring the nature of truth, justification, and how knowledge manifests as skills versus facts (Knight et al. 2014). In the context of science education, most prior work has focused on NOS understanding and personal epistemology (Elby et al. 2016). Deng et al. (2011) contended that epistemology cannot be separated from NOS, as it is an essential component of inquiry practice. Alternatively, NOS as a term could also be considered interchangeable with personal epistemology or epistemic cognition (Greene et al. 2016). Personal epistemology is the set of views or beliefs (known as epistemic beliefs) a person has about the nature of knowledge and knowing (Elby et al. 2016; Schraw 2013). Personal epistemology is thought to contain cognitive structures, one of which is epistemological commitments. Zeineddin and Abd-El-Khalick (2010) suggest that epistemological commitments can influence how students reason about a science problem and may explain the disconnection between how one thinks in a formal science context versus in day-to-day life.

Hofer and Pintrich (1997) characterized four dimensions of scientific epistemic beliefs: certainty, development, source, and justification. Source and justification both deal with the nature of knowledge and knowing. In the case of science, a less sophisticated belief about the source of science knowledge is that it is derived from an outside authority, rather than resulting from one's own inquiry strategies (Conley et al. 2004). Justification in science is how an individual uses data or evidence, particularly generated through experiments, to support their claims. Certainty refers to the nature of knowledge as concrete versus tenuous. Certainty is also found in NOS theory; for example, someone with a less sophisticated understanding of certainty (and/or a poor understanding of NOS) would claim that scientific knowledge is certain and unchanging and that it is possible to obtain a single “correct” answer. This belief also relates to the development domain, in that a more sophisticated understanding would be that scientific information can change in light of new developments (Conley et al. 2004). In the context of the authentic science inquiry environment provided by Science Classroom Inquiry (SCI) simulations (Peffer et al. 2015), we can observe and analyze how a student engages with source (where do participants look for information, and why?), justification (how are data used to support claims?), certainty (how do students discuss their results?; already examined by Peffer and Kyle 2017), and development (how does the student's interpretation of the problem change in light of new information?).

  • Authentic science inquiry

The Next Generation Science Standards (NGSS; NGSS Lead States 2013) replaced the teaching of inquiry as a standalone phenomenon with scientific practices. Practices are defined as “a set of regularities of behaviors and social interactions that, although it cannot be accounted for by any set of rules, can be accounted for by an accepted stabilized coherence of reasoning and activities that make sense in light of each other” (Ford 2015, p. 1045). Practices include competencies such as engaging in evidence-based argument, modeling, and inquiry. The pedagogical emphasis shifts away from presenting inquiry as a set of rules, such as a prescribed scientific method, and instead focuses on understanding inquiry in its social and cultural context as well as its relationship to other practices. Engaging students in the contextual practices of science is thought to lead to improved science literacy by promoting the development of an understanding of the epistemic process of science, namely the generation and nature of science knowledge (Osborne 2014b). Since these practices are intertwined, some specifically study the intersection between practices, such as model-based inquiry. Model-based inquiry is defined as “an instructional strategy whereby learners are engaged in inquiry in an effort to explore phenomena and construct and reconstruct models in light of the results of scientific investigations” (Campbell et al. 2012, p. 2394). Windschitl et al. (2008) suggested that model-based inquiry provides a more epistemologically authentic view of inquiry, as it involves five epistemic features related to science knowledge (testable, revisable, explanatory, conjectural, and generative) that are often missed with a focus on the type of inquiry promoted by the scientific method. Whether or not an underlying model is required for inquiry is best determined within the context of the inquiry experience and the overall pedagogical or research goals.

Inquiry is the predominant means used by scientists to generate science knowledge; however, there is no single definition of inquiry. Hanauer et al. (2009) proposed that scientific inquiry exists on a continuum and that inquiry is best operationalized through the context and overall pedagogical goals. For example, the pedagogical goal of the inquiry experience could be to develop new knowledge, or the goal could be for the student to gain personal and cultural knowledge about the process of inquiry. Similarly, Chinn and Malhotra (2002) claimed that science inquiry exists on a continuum based on scientific authenticity. They define authentic inquiry as the complex activity that scientists perform to generate new scientific knowledge. Chinn and Malhotra (2002) listed cognitive processes involved in inquiry, such as the generation of research questions, engaging in planning procedures, or finding flaws within experimental strategies, and compared these processes between authentic and simple inquiry. For example, in the case of generating research questions, in authentic inquiry scientists generate their own questions, whereas in simple inquiry a research question is provided to the student.

A meta-analysis of research on authentic science inquiry determined that the most common way of defining authenticity in science is providing students the experience “of what scientists ‘do’ (practices), how science is done, and what science ‘is’” (Rowland et al. 2016, p. 5). However, authentic inquiry does not require hands-on experiences. For example, if a hands-on lab experience is used to replicate a known phenomenon for a student, it may be hands-on, but it is clearly what Chinn and Malhotra (2002) refer to as simple inquiry, since a research question is provided to the student, the student follows simple directions, and there is no planning involved. The question of hands-on experiences and authenticity is also reflected in differences in inquiry practices both between domains of science and between practicing scientists within those disciplines (Schwartz and Lederman 2008; Sandoval and Redman 2015). For example, in the context of biology, an individual who uses bioinformatics may study gene expression but would not work in a “wet” laboratory space, instead working solely on a computer. This is in contrast to a biologist who works in a “wet” laboratory studying gene expression and engages in hands-on activities on a daily basis. As an example of the variety of practices observed between domains, a chemist may be engaged in stereotypical bench research with reagents and flasks, but an astronomer may be engaged in observational studies. All are examples of authentic science, but the authenticity is not derived from exactly what each is doing on a day-to-day basis. Instead, authenticity is derived from cognitive processes such as the generation of new research questions, the use of complicated procedures, and the lack of a single correct answer. In the context of NOS theory, Abd-El-Khalick (2012) suggested that domain-general versus domain-specific questions are best addressed within specific areas of research, such as authentic inquiry practices. Like NOS, characterization of authentic inquiry should reflect both accepted disciplinary practices and the overall pedagogical goals in which the experience is situated.

Examples of authentic inquiry experiences for students include exposure to course-based undergraduate research experiences, or CUREs (Auchincloss et al. 2014; Corwin et al. 2015), and simulated authentic science inquiry such as the SCI simulations (Peffer et al. 2015). Although “simulated” and “authentic” may seem contrary to one another, we would argue that in the case of SCI simulations the simulated nature of the experience does not detract from its authentic features, because it models the thought process used by scientists when engaged in an unstructured, real-world problem. Simulated experiences are beneficial when considering how computer-based experiences can be leveraged for high-throughput assessment. Furthermore, as discussed above, ‘hands-on’ is not synonymous with ‘authentic’ when categorizing inquiry experiences. There is no single definition of a simulation, and in fact simulations exist on a continuum from rigid algorithmic models of some aspect of reality to models of a real phenomenon (Renken et al. 2016). The simulated authentic inquiry experience provided by SCI is best characterized as a conceptual simulation (Renken et al. 2016) because it models an abstract process, namely authentic science inquiry, in a scaffolded, autonomous manner using real-world problems and data. SCI is an authentic experience because it models the thought process and decision-making used by scientists engaged in unstructured real-world problems. Although the inquiry experience exists on a computer, it still maintains many facets of authentic inquiry as defined above, allowing students to generate new evidence-based ideas and knowledge and requiring students to engage in a non-linear process that is not directed at a single correct answer.

Connecting NOS and epistemology to authentic science inquiry

In the context of authentic science inquiry, we argue that understanding of NOS or NOS-inquiry and epistemological beliefs about science are intertwined and influence both each other and science practices (Fig. 1). As pointed out by Elby et al. (2016), divisions between NOS and personal epistemology research may stem from the separate nature of the two literatures; both are related to understanding how individuals conceptualize the nature of science knowledge, yet are published in journals targeting different readerships, namely science educators versus psychologists. What a student knows about science and authentic science inquiry will influence what they believe about science. For example, the tentative nature of science knowledge is a NOS principle identified by Lederman et al. (2002), and certainty of knowledge is one of the dimensions of science epistemology identified by Conley et al. (2004). If a student understands that science knowledge can change in light of new evidence, they likely also believe that science knowledge is not certain. Conversely, if a student has an epistemological belief that knowledge changes over time, then it would be easier to learn the principle that science knowledge is tentative. Furthermore, encouraging students to memorize tenets of NOS understanding out of context does not necessarily enhance a student's science literacy (Deng et al. 2011). Said otherwise, a student can memorize NOS tenets, but these tenets may not translate into sophisticated epistemological beliefs or practices, or into transferable skills that will be useful to students in the real world. How epistemological beliefs about science relate to inquiry practices is not well understood (Sandoval 2005). This leads to a central question of this work: how are understanding of NOS and epistemological beliefs enacted as student practices in authentic inquiry? For example, our previous work suggested that experts use more tentative language when stating their conclusions (Peffer and Kyle 2017). This may indicate that experts use more tentative language because they understand that scientific knowledge is not concrete and is subject to revision in light of new evidence. Therefore, analysis of inquiry practices, particularly practices in authentic inquiry, may provide a new assessment strategy for NOS/epistemology. We now turn to a discussion of current assessments of NOS/epistemology and the potential of using practices instead of conventional assessments.

Figure 1

Our conceptualization of the relationship between NOS understanding, epistemological beliefs, and outcomes. We propose that what individuals know and believe influence one another and that these can then influence practices. An individual's experience in a classroom or with a practice can, in turn, also influence what they know and believe

Current assessments of NOS/epistemology

NOS understanding and sophisticated science epistemic beliefs are an essential part of both science education and science literacy (Renken et al. 2016), and many assessments of NOS or epistemology have been developed (Akerson et al. 2010; Conley et al. 2004; Koerber et al. 2015; Lederman et al. 2014; Lederman et al. 2002; Stathopoulou and Vosniadou 2007). However, definitions of both NOS and epistemological beliefs about science are difficult to concretize and operationalize, leading to assessment challenges. For example, which or whose NOS understanding do we want students to adopt? Which aspects are most crucial for science literacy? What defines a “sophisticated” epistemological belief about science, and how does this vary both within and outside of various science disciplines? By trying to fit participants neatly into categories such as “sophisticated,” we lose the breadth of information associated with the wide variety of ways that NOS can be operationalized (Sandoval and Redman 2015). Another limitation is that current assessments of NOS and epistemology are taken at one fixed point in time and do not reflect changes in understanding of these principles over time. Forced-choice assessments assume that we can neatly fit participants into categories that match the philosophical beliefs of the survey authors, giving a limited view of how the student conceptualizes their understanding of NOS or their epistemological beliefs about science (Sandoval 2005; Sandoval and Redman 2015). Furthermore, Likert scale metrics, often used in these assessments, are criticized for their lack of reliability and validity, and some have called for a cessation of the use of these metrics (Sandoval et al. 2016).

Examining practices as assessment of epistemology

One possible solution to these assessment challenges is to examine student science practices in real time. For example, Deng et al. (2011) suggested that sophisticated NOS understanding should be interpreted in the context of how well students argue scientific claims. Sandoval (2005) observed that how specific epistemological beliefs relate to inquiry practices is largely unknown and suggested that, to understand how students make sense of science, an essential research focus is to examine their practices in authentic science inquiry. In the context of evidence evaluation, part of science inquiry and important for overall literacy, Chinn et al. (2014) proposed the Aims and Values, Epistemic Ideals, and Reliable Processes (AIR) model for examining epistemic cognition in practice; the model was designed to reflect three different aspects of epistemic cognition. Epistemology has also been examined in practice in the context of how sixth-grade students evaluate and integrate online resources (Barzilai and Zohar 2012).

Expert/novice studies to define assessment criteria in SCI

One way to connect practices to a sophisticated NOS/epistemology is to examine what experts and novices do in authentic situations. Defining what experts do (as compared to novices) in the context of authentic science inquiry could lead to the development of criteria for detecting differences among novices (i.e., who has more or less expert-like practices in a group, and by proxy differences in epistemology) and potential areas for pedagogical intervention. Expert/novice differences in practices have been examined in the context of engineering education. Atman et al. (2007) examined professional engineers working in their field and undergraduates majoring in various engineering sub-disciplines. Participants were tasked with designing a hypothetical playground in a laboratory setting while verbally describing their process. The authors observed that experts engaged in the task for longer, particularly when researching the problem at hand, and gathered both more information and a greater variety of information during their activity. Worsley and Blikstein (2014) compared students with either bachelor's or graduate degrees in engineering against students without formal training in engineering on a task to design a stable tower that could hold a 1 kg weight. They observed that expert students tended to engage in iterative strategies, repeatedly testing and refining their designs and returning to planning throughout the process, rather than the single planning stage at the beginning observed in novice students. Although expertise was defined differently in each study, both observed a similar trend: experts tended to follow iterative processes that involve a mix of doing and refining or seeking additional information. This result may parallel practices in authentic science inquiry, where it is necessary to perform both investigative and information seeking actions as part of a larger investigation. The variability in practices between experts and novices suggests there is an underlying explanation, such as different epistemological beliefs about the practice of engineering design, that may explain these behaviors. Within the domain of physics, Hu and Rebello (2014) found that students' epistemological framing, particularly around the use of math in solving physics problems, influenced whether they approached physics problems in a more expert-like manner. The authors found that students presented with hypothetical debate problems instead of conventional physics problems were more expert-like in their solutions, focusing on qualitative and quantitative sensemaking instead of on how to plug numbers into a memorized equation (Hu and Rebello 2014).

Current study

If understanding the connection between epistemological beliefs and inquiry is key to understanding how students make sense of science (Sandoval 2005), it is important to examine student practices in authentic science inquiry. A decade after first calling for examining student practices as a proxy for epistemological beliefs, Sandoval and colleagues reiterated this point and called for further research on how epistemological beliefs can influence certain behaviors, such as how they underlie the interpretation and analysis of scientific information found on the internet. Since NOS and epistemological beliefs about science are intertwined in inquiry (Fig. 1), and, at least in engineering, expert and novice practices varied along predictable lines, we suggest that expert and novice practices in authentic science inquiry may also be indicative of underlying NOS understanding and epistemological beliefs. Furthermore, defining what experts do in simulated authentic inquiry as compared to novices may lead to some consensus on what constitutes sophistication. Using practices in authentic science inquiry as a proxy for students' underlying NOS understanding and epistemological beliefs about science, the researcher or instructor can view what students do in an autonomous, open-ended environment, rather than retroactively assessing students at a single time point.

To develop a practices-based assessment of epistemology/NOS understanding, which identifies expert criteria as the benchmark for future assessment, we examined the practices of experts and novices in the simulated authentic science inquiry environment provided by SCI simulations and connected inquiry practices to existing metrics of NOS and epistemological beliefs about science. Since we argue that NOS understanding in the case of inquiry cannot be separated from epistemology, we refer to the participant's putative epistemology/NOS understanding, as seen through their inquiry practices, as their Epistemology in Authentic Science Inquiry, or EASI. Given the limitations of existing metrics of NOS and science epistemology, this study lays a foundation for the novel use of simulations to assess constructs such as NOS and epistemology. Using student practices embedded in an authentic activity as a proxy for their underlying beliefs mitigates concerns about constraining what students know and believe to a static metric, which is a limitation of existing pen-and-paper surveys (Sandoval and Redman 2015). The SCI simulation engine also permits participants the autonomy to engage with the simulation in a variety of ways, demonstrating the wide range of practices that exist between novices and experts. By assessing EASI via practices, the diversity of approaches utilized by students could inform the development of systems or curricula that are personalized to individual students. To examine differences in EASI between experts and novices, and how practices in authentic inquiry relate to previously established metrics of NOS/epistemology, we used a mixed-methods approach (Creswell 2014) to address the following research questions:

What distinguishes experts (trained biologists) from novices (undergraduate students) on established metrics of NOS and epistemology?

What distinguishes the authentic science inquiry practices of an expert versus a novice?

How does expert/novice performance on established metrics relate to authentic science inquiry practices?

Participants

There were 28 total participants in this study: 20 novices and 8 experts. All participants were associated with a university in a large southeastern city in the USA. The 20 novices were all undergraduate students, mainly in their third or fourth year of college (70% seniors, 20% juniors, 10% sophomores, no freshmen), with little, if any, experience in authentic science practices. A single novice participant stated that they had been working in a psychology lab for the last 2 years and had presented a research poster showcasing their work. However, this participant was not listed as an author on any primary research manuscript. For reproducibility purposes, we have marked this novice in Table 2 with a double asterisk. No other novices indicated any bona fide experience with authentic science practices. Expertise was defined based on experience with authentic science practices, particularly in biology or related fields (neuroscience, public health), although none were experts in ecology or conservation biology, which was the topic of the SCI simulation completed. All experts had engaged in biological sciences research for at least 2 years and were listed as authors on primary research manuscripts either submitted or published at the time of the study. The majority of experts were advanced doctoral students, defined by their completion of comprehensive examinations and achievement of candidacy. A single expert had an earned PhD (marked in Table 2 with an asterisk) and was working as a postdoctoral associate at the time of the study. Novice participants had diverse ethnic backgrounds: 15% were white/European-American, 45% black/African-American, 30% Asian-American, 5% multi-racial, and 5% other. The expert population comprised 62% white/European-American and 38% Asian-American participants. The novice and expert groups were both predominantly female: in the novice group, 69% of participants were female and 31% were male; in the expert group, 88% were female and 12% were male.

Data collection and analysis

Pre-test metrics

All data were collected in a private laboratory setting during a single meeting that lasted approximately 1 h. All participants first completed a pre-test assessment of their NOS understanding and epistemological beliefs about science (descriptions of these metrics and the analysis are given below). Participants spent approximately 20 min completing the pre-test assessment. The full pre-test, comprising the scientific epistemic beliefs (SEB) survey in tandem with the NOS items, was found to be highly reliable (28 items; α = .85). Due to our small sample size and the total number of items on the pre-test, we were unable to perform factor analysis as a test of validity. However, the validity of the initial assessment tools, slightly modified for this study, had been verified (Lederman et al. 2002; Lederman et al. 2014), which was critical to drawing meaningful and useful statistical inferences (Creswell 2014). The NOS assessment included items relevant to inquiry that were originally published as part of the Views of Nature of Science (VNOS) (Lederman et al. 2002) and Views About Science Inquiry (VASI) (Lederman et al. 2014) assessments, and one item unique to this study to assess what students think about inquiry versus science in general (Table 1).
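
As an editorial illustration of how such a reliability coefficient is obtained, the sketch below computes Cronbach's alpha from a participants-by-items score matrix. The five-item matrix is hypothetical (the actual pre-test had 28 items), so the printed value will not match the reported α = .85.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_participants, k_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses from 6 participants on 5 items (1-5 scale).
responses = np.array([
    [4, 5, 4, 4, 5],
    [3, 3, 2, 3, 3],
    [5, 5, 5, 4, 5],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 5, 4],
    [3, 2, 3, 3, 3],
])
print(round(cronbach_alpha(responses), 2))
```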

We opted not to include the full VNOS or VASI because some aspects (e.g., the difference between a theory and a law in the VNOS and the difference between scientific data and evidence in the VASI) were not relevant to our study. Furthermore, given the open-ended nature of the questions, survey fatigue was a concern. We chose VASI questions 1A and 1B because they were designed to assess the user's understanding that there is no single scientific method and that research studies begin with a question but do not necessarily test a hypothesis (Lederman et al. 2014). Questions 2, 3, and 4, from the VNOS, assessed what the participants knew about the generation of scientific knowledge. Understanding the variety of ways one can generate scientific knowledge may reflect an understanding that there are many ways to justify scientific information and no single universal way to generate science knowledge. The final question, from the VNOS, which dealt with whether theories change, was chosen to assess what participants understood about the certainty of scientific information and whether they understood that scientific information can change in light of new evidence. Based on the questions chosen and their relationship to epistemological beliefs about science, we coded all open-ended responses according to two NOS principles: the understanding of the lack of a universal scientific method (principle 1) and the understanding of the tenuous nature of science (principle 2). Data were coded blind by two independent coders; overall agreement was 60% for principle 1 and 68% for principle 2. Disagreements were settled through mutual discussion.
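
The agreement figures above are simple percent agreement between the two coders. A minimal sketch of that calculation is shown below with hypothetical codes (0 = naïve, 1 = mixed, 2 = sophisticated); a chance-corrected statistic (Cohen's kappa) is illustrated later in the Methods.

```python
import numpy as np

# Hypothetical blind codes from two independent coders for one NOS principle.
coder_1 = np.array([0, 0, 1, 2, 0, 1, 0, 2, 1, 0, 0, 1, 2, 0])
coder_2 = np.array([0, 1, 1, 2, 0, 0, 0, 2, 2, 0, 0, 1, 1, 0])

percent_agreement = (coder_1 == coder_2).mean() * 100
print(f"{percent_agreement:.0f}% agreement")  # remaining disagreements settled by discussion
```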

We used the SEB survey (Conley et al. 2004) to assess participants' epistemology. Although this metric was initially designed for use with elementary school students, it has been used with older age groups, including high school (Tsai et al. 2011) and undergraduate (Yang et al. 2016) students. The SEB survey included 26 separate items divided into four dimensions: source, justification, development, and certainty. Each dimension was counterbalanced. Participants were asked to rate each item on a 5-point Likert scale from strongly disagree to strongly agree. The source and certainty dimensions were reverse coded. Higher scores indicated more sophisticated scientific epistemic beliefs. For each domain of the SEB, individual items were averaged together to create a single score for each of the four domains. The total score on the scientific epistemic beliefs survey was calculated as the average across all items.
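
The scoring procedure just described (reverse coding the source and certainty dimensions, averaging items within each domain, and averaging across all items for the total) can be sketched as follows. The item names, domain assignments, and responses are illustrative placeholders, not the actual 26 items from Conley et al. (2004).

```python
import pandas as pd

# Illustrative item-to-domain mapping; "source" and "certainty" are reverse coded.
domains = {
    "source": ["s1", "s2"], "certainty": ["c1", "c2"],
    "development": ["d1", "d2"], "justification": ["j1", "j2"],
}
reverse_coded = {"source", "certainty"}

# Hypothetical responses: rows = participants, 1-5 Likert scale.
responses = pd.DataFrame({
    "s1": [2, 1], "s2": [1, 2], "c1": [2, 2], "c2": [1, 3],
    "d1": [4, 5], "d2": [5, 4], "j1": [4, 4], "j2": [5, 5],
})

scored = responses.copy()
for domain, items in domains.items():
    if domain in reverse_coded:
        scored[items] = 6 - scored[items]  # flip the 1-5 scale so higher = more sophisticated

all_items = [item for items in domains.values() for item in items]
domain_scores = pd.DataFrame(
    {domain: scored[items].mean(axis=1) for domain, items in domains.items()}
)
domain_scores["total"] = scored[all_items].mean(axis=1)
print(domain_scores.round(2))
```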

SCI simulation module

After completing the pre-test, all participants were logged into the Unusual Mortality Events SCI simulation. Novices completed the simulation in an average of 29.9 min (SD = 15.5), whereas experts completed it in an average of 48.5 min (SD = 17.02). SCI simulations provide a simulated, authentic science inquiry experience within the confines of a typical classroom (Peffer et al. 2015). Although SCI is only a simulated version of authentic science inquiry, it maintains many of the features of authentic inquiry. For example, SCI simulations use real-world data, allow users to generate independent research questions, facilitate non-linear investigations with multiple opportunities to revise hypotheses, include dead-end or anomalous data, and engage participants in the process of doing science (Chinn and Malhotra 2002; Rowland et al. 2016).

The version of the Unusual Mortality Events simulation used here was a modified version of the module used in Peffer et al. (2015). Changes included additional freedom to perform actions and updated information in the in-simulation library. These changes were made to allow for autonomy while participants engaged with the simulation and to update content with current scientific knowledge about Unusual Mortality Events in the Indian River Lagoon. The SCI simulation web application captured data as the participant completed the simulation, including the order in which actions (such as tests or generation of new hypotheses) were performed, as well as the participant's responses to open-ended questions embedded throughout the module to interrogate why participants performed certain actions. Additional information about each participant's practices was captured through screen-capture recordings made during the module. As participants completed the simulation, they verbalized their thought processes using a think-aloud procedure (Someren et al. 1994), and these recorded thought processes were later transcribed. Upon completion of the simulation, participants completed a demographics survey and were immediately interviewed about their strategy and rationale for certain decisions.

Mixed-methods analysis

We used a convergent parallel mixed methods design (Creswell 2014 ) to assess differences in expert and novice practices in authentic science inquiry, and their relationship to epistemological beliefs. In this type of mixed-methods research, quantitative and qualitative data are collected simultaneously and analyses are combined to yield a comprehensive analysis of the research question(s) at hand (Creswell 2014 ). We felt that mixed-methods research was best suited to answering our research questions because fully distinguishing the practices of experts and novices required a qualitative approach, but comparing scores on pre-test metrics and relating these scores to practices required a quantitative approach. Obtaining different but complementary qualitative and quantitative data not only allows for a greater understanding of the research problem, but it also enables researchers to use the qualitative data to explore the quantitative findings (Creswell and Plano 2011 ).

Quantitative analysis

We first counted the total number, and type, of actions performed by each user. To determine the number and type of actions participants made, the lead author reviewed each of the screen-capture videos and the logs created by the SCI simulation engine. Actions were categorized as either investigative or information seeking. Investigative actions included the generation of a new hypothesis, performing a test, or making conclusions. These investigative actions are aligned with models of scientific activity that include experimentation, hypothesis generation, and evidence evaluation (Osborne 2014a). How evidence is evaluated may provide insight into epistemological beliefs such as certainty. For example, a student who believes scientific information to be unchanging may include very little evidence or few tests, since there would be nothing else to examine once a final answer is reached. Information seeking actions occurred any time the user sought additional information as part of his or her inquiry process, including use of internet search engines or various features built into the simulation such as the library, lab notebook, and external links. Information seeking actions could be considered part of the process of experimenting, since users were seeking information about the outside world. However, we chose to distinguish using outside information as a different action within the simulation, since it is a known engineering practice chosen by experts (and not novices) in authentic situations (Atman et al. 2007). Even though engineering and science practices are not identical, we felt that the pursuit of information as part of a project was likely analogous between the two disciplines. Whether or not a student chooses to seek outside information, and what information is sought, could provide insight into a participant's epistemological beliefs about source. For example, is the simulation the ultimate authority, or are there other valid sources of information? What kind of information is sought: peer-reviewed literature or news articles for the general public?
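
A minimal sketch of this two-way categorization is given below. The action labels and the example log are hypothetical; they are not the actual event names recorded by the SCI engine.

```python
from collections import Counter

# Hypothetical action categories (illustrative labels only).
INVESTIGATIVE = {"new_hypothesis", "run_test", "make_conclusion"}
INFO_SEEKING = {"library", "external_link", "web_search", "lab_notebook"}

# Hypothetical event log for one participant, in the order actions were performed.
log = ["new_hypothesis", "web_search", "run_test", "library",
       "run_test", "web_search", "new_hypothesis", "run_test", "make_conclusion"]

counts = Counter(
    "investigative" if action in INVESTIGATIVE else "information_seeking"
    for action in log
)
print(counts)  # Counter({'investigative': 6, 'information_seeking': 3})
```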

Each investigation was coded blind by two people as simple or complex in nature. The primary author served as the tie-breaker for disagreements. Based on the type of data and number of raters, Cohen’s kappa was used to assess inter-rater reliability. A Cohen’s kappa of 0.533 ( p =  .003) indicated a moderate level of agreement among raters. A simple investigation is reminiscent of the simple inquiry described by Chinn and Malhotra ( 2002 ), where the user performs a few tests until a basic cause and effect relationship is uncovered, at which point they make their conclusions. A complex investigation is one in which the user performs a multi-pronged investigation with multiple cause and effect relationships. The user may perform many related tests with the goal of developing a model that describes some kind of underlying mechanism. The user seeks to connect different sources of evidence in a manner to explain how they are both related to each other and the problem at hand. For example, a participant may choose to relate algal blooms to explain the lack of seagrass, which would explain why a foreign substance was found in the stomach of manatees instead of the sea grass, which is their normal diet. In contrast, a simple investigation would note that sea grass was dead and conclude that as the main cause of the unusual mortality events without any attempt at explaining why sea grass was the cause, or how it related to other lines of evidence. A more complex investigation with multiple lines of evidence may indicate underlying epistemological beliefs about certainty, since there are multiple causes and not a single correct answer. A subset of the data was analyzed for linguistic features to determine if novices or experts differed in the type of language used during the concluding phase of their investigations. These results are reported elsewhere (Peffer and Kyle 2017 ), but we include the findings here as a variable—expert verb score—and therefore as part of our model to describe expert versus novice practices.
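
For illustration, the chance-corrected agreement between two raters' simple/complex codes can be computed as below. The codes are hypothetical (1 = complex, 0 = simple) and will not reproduce the reported kappa of 0.533.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical blind codes from the two raters for a set of investigations.
rater_a = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

print(round(cohen_kappa_score(rater_a, rater_b), 3))  # chance-corrected agreement
```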

Statistical analysis

Quantitative analysis of the data was performed using SAS 9.4 (SAS Institute Inc 2014). To determine differences between expert and novice performance on the pre-test, we used Fisher's exact test instead of the chi-square test due to small counts in some categories of certain variables. To predict the total number of actions, as well as the total numbers of investigative and information seeking actions, we fitted Poisson (count) regression models with a logarithmic link function. Poisson regression is a generalized linear model used for count data that follow a Poisson distribution. PROC GENMOD in SAS 9.4 was used to fit the Poisson regression models, using maximum likelihood to estimate the parameters. The predictors in the Poisson regression models were the total average score on the SEB survey, the expert verb score, and both NOS principle scores. The significance level was set at .05 for all tests.
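
The analysis was run in SAS, but for readers who prefer an open-source analogue, a rough Python sketch of the same kind of Poisson count regression (log link, maximum likelihood) is shown below using statsmodels. The data frame and variable names are hypothetical, not the authors' data.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-participant data; column names are illustrative only.
df = pd.DataFrame({
    "total_actions": [12, 25, 9, 40, 17, 31, 8, 22, 15, 28, 11, 35],
    "seb_total":     [4.1, 4.5, 3.9, 4.7, 4.2, 4.4, 3.8, 4.3, 4.0, 4.6, 3.9, 4.5],
    "expert_verbs":  [1, 3, 0, 5, 2, 4, 0, 2, 1, 3, 0, 4],
    "nos1":          [0, 1, 0, 2, 1, 2, 0, 1, 0, 2, 0, 2],  # 0 = naive, 1 = mixed, 2 = sophisticated
    "nos2":          [1, 2, 0, 2, 1, 2, 0, 1, 0, 2, 1, 2],
})

# Poisson regression with the (default) log link, analogous to the PROC GENMOD fit.
fit = smf.glm(
    "total_actions ~ seb_total + expert_verbs + nos1 + nos2",
    data=df,
    family=sm.families.Poisson(),
).fit()
print(fit.summary())
```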

Qualitative analysis

Within each category (complex or simple), we chose the average participant and made comparisons with others in the group as appropriate to form an expert dyad and a novice dyad. The average participant was determined based on the temporal pattern of the types of actions performed (Figs. 3 and 4). Since the order of actions was an important distinguishing factor between participants, we opted to use this as our primary criterion for determining the average participant rather than the user's number of actions (Table 2). We do note that for both dyads analyzed here, the number of information seeking actions was higher in the expert group, which was a trend across our full data set. For consistency, all case study examples are female and assigned a pseudonym. Since the majority of novice participants were African-American, and it was not possible to pick a dyad that corresponded in race to the expert population, we chose two African-American students in their senior year of college. The simple novice, Sally, majored in human learning and development, and the complex novice, Beth, majored in psychology. The simple expert, Lisa, was European-American and in her fourth year of doctoral studies in neurobiology. The complex expert, Janet, was Asian-American and in her fifth year of doctoral studies in biology.

The qualitative analysis focused on two sources of information: the logs created as each participant engaged in the simulation (the participant's lab notebook) and the think-aloud transcripts. These two sources of data were analyzed for features indicative of authentic science inquiry and/or expertise, such as searching for outside information, which is an expert practice in engineering and may be similar in authentic science inquiry (Atman et al. 2007), or the use of tentative language (Peffer and Kyle 2017). To ensure trustworthiness of our qualitative study (equivalent to reliability and validity in quantitative studies), we used triangulation between the qualitative information and quantitative metrics, such as how many actions of each type were performed. We also included rich descriptions in this manuscript to allow readers to form their own conclusions, and we involved an external auditor who reviewed this project and indicated her agreement with the conclusions presented here.

Experts perform better on established metrics of nature of science, but not epistemology

First, differences between the expert and novice populations were assessed using previously established metrics of NOS and science epistemic beliefs. Experts scored higher than novices on both nature of science principles assessed (Fig. 2). For principle 1, 18 novices had a naïve score, while seven experts had a mixed or sophisticated score; only one expert scored as naïve and two novices scored as sophisticated. For principle 2, 12 novices scored as naïve and eight scored as mixed or sophisticated, while all eight experts scored as mixed or sophisticated. The small counts in some combinations of factor categories prevented us from performing chi-square tests of association; instead, Fisher's exact test, which is equally powerful, was conducted to check whether the observed associations were statistically significant. A two-by-two contingency table was created for each nature of science principle by aggregating the mixed and sophisticated scores to test the association between expertise and NOS principle scores. The associations between expertise and scores on principles 1 and 2 were both statistically significant by Fisher's exact test (p = .0002 and p = .0084, respectively).
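
As a quick editorial check, the principle 1 counts reported above can be arranged in a two-by-two table and tested with SciPy; the resulting two-sided p-value is consistent with the reported .0002. This is an illustration, not the authors' code.

```python
from scipy.stats import fisher_exact

# Rows: novice, expert; columns: naive, mixed-or-sophisticated (principle 1 counts above).
table = [[18, 2],
         [1, 7]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.0f}, p = {p_value:.4f}")  # p ≈ .0002
```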

Figure 2

Experts performed better than novices on a pre-test assessment of their NOS knowledge on both principle 1, lack of a universal scientific method (a), and principle 2, tenuous nature of science knowledge (b)

No associations were observed between expertise and scores on the science epistemic beliefs metric for any of the four domains assessed (justification, certainty, development, and source) or for the total aggregated scores (Table 3). Experts scored marginally higher on the source and certainty domains and on the overall total score. However, scores among all participants were very high, mostly above four on a five-point scale.

Experts and novices have distinct practices in authentic science inquiry

Investigative style and general patterns of actions (Figs. 3 and 4) performed during authentic science inquiry were assessed to examine differences between experts and novices. Investigative style was separated into two categories: simple and complex. Experts performed more complex investigations than novices (62.5% versus 35%, respectively), and novices performed more simple investigations than experts (65% versus 37.5%, respectively).

Figure 3

Inquiry trajectories of novice participants

Figure 4

Inquiry trajectories of expert participants

We also examined the general pattern of actions of novices and experts, specifically the numbers of investigative actions, information seeking actions, and total actions. Complex novices performed more investigative actions (M = 12.86, SD = 4.85) than simple novices (M = 7.46, SD = 2.70). In contrast, complex experts performed fewer investigative actions than simple experts (M = 13.60, SD = 7.20 and M = 16.00, SD = 8.19, respectively). Overall, we observed that experts performed more investigative actions than novices (M = 14.50, SD = 7.09 and M = 9.35, SD = 4.36, respectively). For information seeking actions, complex novices performed slightly fewer actions than simple novices (M = 7.29, SD = 4.42 and M = 8.31, SD = 9.60, respectively). Complex experts on average performed over twice as many information seeking actions as simple experts (M = 22.20, SD = 16.72 and M = 9.00, SD = 9.64, respectively), and the simple experts performed slightly more information seeking actions than either of the novice groups. Experts overall sought more information than novices (M = 17.25, SD = 15.27 and M = 7.95, SD = 8.04, respectively).

The quantitative analysis complemented our observations regarding the relationship between participants' expertise category (expert or novice) and the number of actions they performed. Fitting a logistic regression model using PROC LOGISTIC in SAS 9.4 showed that the total number of actions performed significantly predicted whether participants belonged to the expert or novice category (likelihood ratio χ2 = 6.22, p = .013; Wald χ2 = 3.95, p = .047). The Hosmer and Lemeshow goodness-of-fit test detected no evidence of lack of fit for this logistic regression model (p = .19), implying that the model fit the data adequately.
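
A minimal Python sketch of the same kind of logistic regression (expert/novice status predicted by the total number of actions) is shown below with statsmodels. The data are hypothetical, and the Hosmer and Lemeshow test, which is not built into statsmodels, is omitted.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-participant data: 1 = expert, 0 = novice; values are illustrative only.
df = pd.DataFrame({
    "expert":        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "total_actions": [10, 14, 9, 20, 16, 12, 8, 18, 27, 15,
                      25, 40, 19, 28, 45, 33],
})

# Does the total number of actions predict membership in the expert group?
fit = smf.logit("expert ~ total_actions", data=df).fit()
print(fit.summary())
```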

Actions in context were examined by using randomly selected videos to generate profiles describing the sequence of inquiry events (e.g., information seeking, hypothesis generation) for five simple and five complex novices (Fig. 3) and three simple and three complex experts (Fig. 4). Among both experts and novices, as more actions were performed, the frequency of information seeking actions increased. All experts analyzed, the complex novices, and simple novices N4 and N5 demonstrated an iterative process of moving back and forth between investigative and information seeking actions. These information seeking phases were remarkably long, particularly for complex experts E5 and E6 (Fig. 4). Both E5 and E6, in addition to N10, began their investigations by seeking information and continued to look for outside information regularly throughout their investigations. Among both simple novices and simple experts, there was a trend towards short periods of information seeking, in which one external resource was utilized, rather than an in-depth review of the literature. We now turn to an in-depth discussion of the practices of four representative participants.

Case studies

The simple novice: Sally

Sally (N3, Fig. 3) generated two hypotheses, performed three tests, and did one information seeking action. Sally began her investigation by generating two hypotheses, “Could the cause of the deaths be due to contamination that has taken place in the lagoon? Are the animals who live in the lagoon killing each other off due to lack of food,” with the rationale, “…because pollution leads to contamination and it usually has a lot to do with spikes in the death of animals…because I learned that there are many different animals that live in the lagoon that may be lacking food to eat and are not surviving by eating other animals.” Sally then stated in response to the query “How would you like to test your hypothesis?” that she “would like to go into the lagoon and test the waters for contamination, and watch the animals’ behaviors.” In her think-aloud transcript, Sally mentioned that she wanted “to come up with a hypothesis, but [she] was not sure of the background information [and wondered if] she should guess at one?” This is a behavior identified as characteristic of novices: rather than pursuing additional information in light of what they do not know, they move forward with their investigation. In contrast, N6 and N10 (Fig. 3) and E5 and E6 (Fig. 4), all coded as complex, began their investigations with more effort put forth in gathering information before executing their first test.

After completing her hypothesis generation phase, Sally then performed a test to check water salinity in the lagoon to see if it “is at a level for animals to survive” and noted that she did not think it was a contributing factor. Sally then chose to examine invasive species in the Indian River Lagoon to see if “invasive species could be the cause.” After reviewing the test results (that the Indian River Lagoon contains over 240 non-native species), she wrote in her lab notebook that yes, they were having an impact, “because they can affect how the ecosystem works in the lagoon.” In her think-aloud transcript, Sally said:

“So, maybe the exotic species could be bringing in maybe some type of disease or something that the native species cannot fight off. Okay, I think I’m ready to make a conclusion, although I don’t know if I should -- I’m trying to think, should I change my hypothesis? Because now I’m interested to know that if the species that arrived there from all these different places, or if they could have maybe caused the threat to the ecosystem which now has spiked the mortality rate, or the mortality rate has gone up in the native species that were there.”

Sally decided to generate a new hypothesis, rather than make a conclusion. Her new hypothesis was “Are the species that are brought into the lagoon causing the ecosystem to change which is causing the mortality rate to increase?” with the rationale, “The fact that I found new information about other species migrating to the lagoon and this could affect the mortality rate.” Sally stated that she wanted to test her hypothesis “By seeing how the ecosystem has changed since the migration of the new species, and how the changes could affect the native species.” What is unique to Sally’s investigation is that after generating this hypothesis, she then repeated the same invasive species test, because she had “learned that the invasive species can cause change in the ecosystem.” Given the design of the simulation, she received the same information again, but rather than focus on the fact that invasive species were present in the Indian River Lagoon, she instead focused on the last sentence relating to the impact of invasive species on ecosystems. She then stated that “Yes, I believe they do contribute because they can change the ecosystem which could cause the native species to not adjust to the change but die.”

Sally then concluded “…invasive species…caused change to the ecosystem and now the native species cannot adapt…so they are now dying” – however, she never collected any information directly tying the presence of exotic species to the high mortality of dolphins, manatees, and pelicans in 2013. When asked during her interview how she knew she was ready to conclude, Sally said “because she had enough information.” Sally’s only information seeking action occurred during her conclusion phase when she checked her original hypotheses in her notebook; this lack of information seeking actions was characteristic of novices.

The complex novice: Beth

Beth (Fig. 3, N8) generated two hypotheses, performed three tests, and did eleven information seeking actions. Beth also had some experience working as a laboratory research assistant, but had never presented a poster describing her own independent research nor had she been listed as an author on a primary research manuscript. Prior to generating her first hypothesis, she focused on the preliminary data provided by the simulation in the introduction section. In particular, she focused on the temporal distribution of dolphin deaths to devise her first hypothesis. In her think-aloud transcript she stated:

“[Dolphin stranding] spikes in March…it’s like spring, summer. Maybe people are on the boats…I’m thinking maybe tourist/people attracted to the area, and possibly boating accidents…Okay, my hypothesis: the unusual increase of deaths of dolphins in this area could be attributed to tourist/boating population in the area...looking at spring/summer months in which deaths increase, and in which are known to be times of the year where tourists/boating is more popular. How would I like to test it? I would like to examine tourist/boating rates during the same time period in which these deaths are occurring.”

Beth’s first test was to examine dolphin necropsy results (Footnote 1) because she wanted “to see how they died...possibly trauma from boating accidents?” In her think-aloud transcript, she noted that the necropsy report indicated that dolphins were emaciated. She said “what does that mean?” and proceeded to use Google to look up the definition of emaciated. After looking up the definition, she then stated out loud, “Thin or weak. Lack of food, so that has nothing to do with that,” likely referring to her original hypothesis. She then went on to say “No evidence of entanglement. Skin, eyes, mouth generally normal. Some animals have skin lesions. Oh, brain infections, that’s bad…No presence of any dangerous toxic poisoning…15 out of 144 was positive [for morbillivirus]. What is morbillivirus?” Beth then followed a simulation link that provided additional information about morbillivirus, then used Google to search the phrase “How do you get morbillivirus,” and from there followed a link to a NOAA.gov website. After pursuing this information, Beth said, “My hypothesis is definitely refuted. No evidence of head trauma as would be seen in boating accidents. I want to generate a new hypothesis.” What is notable about this statement is that Beth said her hypothesis was refuted, not wrong or incorrect. Our previous work (Peffer and Kyle 2017) examining the language used by experts versus novices suggested that experts use more hedging language, such as refuted or supported. Notably, here we observe a novice with a more complex style of investigation using more expert-like language.

Beth then generated a second hypothesis, “The spread of morbillivirus is causing the unusual high rates of death among dolphins.” Her next test was to examine manatee necropsies because they are “closer in relation/environment to the dolphin.” After reviewing the results and determining that the manatees had inflamed gastrointestinal tracts and red algae in their stomachs, she read more about the algae and stated that her hypothesis was “Supported. The discovery of [red algae] could be a possible method of transmission amongst and between dolphins/manatees/other sea life in procuring morbillivirus.” At the transition page where the participants can choose to do another test, generate a new hypothesis, or make their conclusions, Beth said “All right. I want to do one more test just to make sure. Okay, they said it was algae, right? It’s algae? Okay, so I’m going to [do the] algal blooms [test].” After reviewing the test results and seeking additional information on algal blooms, she stated “Yes. [algal blooms are contributing to the unusual mortality events] Algal blooms are contributing to toxins which in turn can possibly manifest into either this virus or the death of surrounding sea life.”

During the conclusion phase, Beth returned to the morbillivirus resource to determine what other animals are at risk for morbillivirus. She then went on to say “But how do you get [morbillivirus] in the first place? Must be like a reaction to something, which may be that algae. Okay, my final conclusion is the high rates of deaths amongst dolphins can be attributed to morbillivirus, and which may be surmounting due to a reaction to algal blooms.” When asked to provide evidence that supported her conclusion, she stated, “Autopsies, Algal bloom rates, Virus information concerning effects on the body in which seen in the autopsies,” but gave no evidence for how morbillivirus could be related to increased numbers of algal blooms.

Summary: Comparison of simple and complex novices

Both novice participants made the same number of investigative decisions and were slightly below the average number of investigative actions performed by novices. The complex novice (Beth) performed significantly more information seeking actions than the simple novice (Sally). We also noted that complex novices left the simulation to pursue other information, whereas simple novices, if they performed information seeking actions at all, typically only used sources linked from the simulation itself. Both Sally and Beth demonstrated a logical structure to their investigations, but the complex novice, Beth, was more detail oriented. Rather than stating “I don’t know” like Sally, Beth reviewed the information until she felt that she had a good starting point for her investigation. Beth used more expert-like language as she thought aloud, using terms such as “supported” and “refuted,” whereas novices generally used less expert language (e.g., “my hypothesis is wrong”). All participants were proficient English speakers, and previous work (Peffer and Kyle 2017) indicated that written use of tentative language was a hallmark of expert practices. For both novice participants, the conclusions were ill-supported by the evidence collected. In Sally’s investigation, she concluded that the invasive species were contributing to the deaths of dolphins, manatees, and pelicans because they are generally known to disrupt ecosystems. In Beth’s investigation, she proposed a link between morbillivirus and algal blooms, but gave no evidence to support this idea other than that both were occurring at the same time.

The simple expert: Lisa

Lisa (E1, Fig. 4) generated two hypotheses, performed three tests, and did two information seeking actions. Lisa had the simplest investigation of all the simple experts; the expert closest to the simple-expert average was male, so for consistency purposes we chose Lisa. Like Sally, who spent seven minutes reviewing the introductory material (but unlike Beth, who spent three minutes), Lisa spent ten minutes, the longest of all four participants highlighted in these case studies, reading and thinking aloud about the introductory material, including taking notes by hand. When initially asked what she thought was causing the animal deaths, Lisa entered into her notebook, “I don’t think I have enough information at this point to hypothesize the cause of the UMEs,” and stated out loud, “I’m not sure I can really postulate with this much information. Perhaps I’m missing something.” When prompted to make a hypothesis, Lisa stated that her hypothesis was “In 2013, human interactions increased the number of unusual mortality events in manatees, dolphins, and pelicans in the Indian River Lagoon.” In her think-aloud transcript Lisa stated that she “was always taught to phrase [her] hypothesis very, very carefully.” When asked how she would like to test her hypothesis, Lisa stated, “I would need access to data reporting human interactions in the Indian River Lagoon in 2013. I would also need to identify a similar ecosystem (or several) in another part of the world with differing levels of human interaction but housing the same species. I would compare UMEs for these species in the Indian River Lagoon to the comparable estuaries elsewhere. This would not allow me to infer causation but could yield correlational information.” There are two notable observations from this statement. First, Lisa is discussing the need for a control group. Although the control-of-variables strategy is identified as an important developmental milestone as children learn to think scientifically (Schwichow et al. 2016), it can also be indicative of a less sophisticated epistemology because it may reflect an understanding that science inquiry can only proceed through controlled experiments, whereas some disciplines of inquiry rely instead on observational data. The second observation is Lisa’s concern with causation and correlation. She is acknowledging the importance of identifying a mechanistic link between two observed phenomena, which may point towards a more sophisticated epistemological stance. We also note that at this point in her investigation Lisa states in her think-aloud transcript, “I’m not sure with causation how you would do that. You can’t put it in a lab,” which may indicate some acknowledgement that there are a variety of scientific practices. However, the focus on lab work may indicate the presence of a less sophisticated belief that all science occurs in labs.

After generating her hypothesis, Lisa then examined dead affected dolphins, because this “could provide information about whether or not an infectious disease or biotoxin is implicated in the UMEs.” While reviewing the findings, Lisa followed a link embedded in the simulation to read more about morbillivirus. She also noted aloud that “15 of the 144 [examined were positive for morbillivirus], which is not conclusive.” When reflecting upon these results, Lisa noted in her lab notebook that the “Most striking…findings [were] that animals were emaciated, and also that the conditions they present with [were] consistent with a morbillivirus infection, yet the majority of the dolphins did not test positive for this type of infection. My hypothesis is neither supported nor refuted by this information.” Of importance here is her use of tentative language, and that she notes that only some dolphins are sick; interestingly, her reflection does not fully explain the wide scale of the 2013–2014 unusual mortality events.

Lisa next tested whether there were invasive species present in the Indian River Lagoon. After learning that there are many different invasive species in the Indian River Lagoon and following an embedded link within the simulation to read more about invasive species, she stated in her notebook that she “[does not] think [that invasive species are contributing to the unusual mortality events]. The identified invasive species are things such as sea squirts and mussels and I would not expect these species to greatly impact the health of dolphins and manatees...On the other hand, the presence of invasive species could limit the food supply for the larger species being impacted by UMEs so perhaps it is a contributing factor.” Lisa then decided to generate a new hypothesis, stating aloud:

“we had global warming (Footnote 2) with the invasive species. And then we have a virus and dead dolphins. I’m getting to a point where I feel like it may not be just one of these divisions. It may not be just infectious diseases or just ecological factors. It may be that global warming has brought in these invasive species, and also maybe some biotoxins or infectious diseases have been allowed to propagate.”

Lisa entered her new hypothesis into her lab notebook as “In 2013, climate change contributed to an increase in UMEs in dolphins, manatees, and pelicans in the Indian River Lagoon,” giving the rationale that “…the information that I recently accessed implicating that climate change encouraged the intrusion of invasive species into this particular ecosystem. These invasive species could limit the food supply for the larger mammals and warmer temperatures could also support growth of biotoxins that might cause UMEs,” and stating she would like to test her hypothesis by “Compar[ing] the UMEs in this estuary with the UMEs in an estuary that has colder temperature water. If they differ, look specifically at the level and types of invasive species and biotoxins.” Lisa’s desire to use a comparison group when testing her hypothesis, and her suggestion that there may be two possible causative factors rather than a simple cause-and-effect relationship, were notable, since use of a control group can indicate either a sophisticated understanding of scientific practices, namely the importance of a control, or a less sophisticated understanding that all scientific investigations require controls.

Lisa performed one additional test, examining the water temperature, because her “hypothesis features warmer water as a possible cause of UMEs.” After determining that the median water temperature was at the high end of normal in 2013, she stated in her lab notebook, “Yes, I think the water temperature is contributing to the unusual mortality events because when a median of 25.09 degrees C is reported, this means there were much higher temperatures recorded throughout the year as well. The average temperature range is up to 25 degrees C, which does not diverge from the 2013 reported median temp, but average and median are different types of statistical data.” Lisa’s final conclusion was “Global warming played a role in the increase in UMEs in 2013,” and she cited “increased water temperature and the enhanced presence of an invasive species (sea squirts) that has been shown to migrate to warmer temperature waters” as evidence for her conclusion.

The complex expert: Janet

Janet generated two hypotheses, performed three tests, and did 16 information seeking actions. Similar to Lisa and Beth, Janet also spent time gathering information at the start of her investigation. Like Lisa, Janet also requested paper to help organize her thoughts. Janet was unique in that she began to look for additional information at the very beginning of the simulation, even before generating a hypothesis. This was a behavior generally characteristic of the complex experts (Fig. 4) and consistent with novice/expert studies in the engineering education literature (Atman et al. 2007). After reviewing the background material in the simulation, Janet used Google to search “estuaries and undetermined deaths of animals.” After the results appeared, Janet said aloud “Okay, so a lot of stuff popped up. Which one will I go to first? Maybe going to the New Yorker because it seems like it’s a reliable, quick read source.” This comment by Janet is notable as this was one of the few instances where a participant discussed the reliability of the source, which is another facet of epistemology (Barzilai and Zohar 2012; Mason et al. 2011). As Janet was developing her hypothesis, she continued to seek information, and in her think-aloud transcript she discussed the different sources that she would use to find information. Janet said:

“I do think it’s pollution, but I don’t know where -- actually, I should be going onto PubMed (Footnote 3) instead of Google now…so typed in estuary and animal deaths into PubMed… Nothing too relevant…now that I typed in lagoon instead of estuary, one specific article came up…before I want to write a hypothesis, I want to make sure I have at least a decent hypothesis, so I’m going to look at some more relevant articles hopefully on PubMed, and see what else they say. The first one said that it was due to cyanobacteria changing the metabolism of the animals in the Brazil Lagoon…Okay, so now I’ve gone back to Google because I want to see…if there’s, like, a general broad view, like different type of perspectives on what could be the cause. Then that will help me try to consider all the possibilities that was involved”

While generating her hypothesis, Janet switched between primary literature found in PubMed and general information articles found in Google. From her transcript, there appeared to be some rationale behind her decisions to search for information in each location. In general, experts would seek out information in the primary literature, but novices never did. This may provide some insight into Janet’s epistemological beliefs as they related to the source of scientific information, namely the use of primary literature. When the investigator queried Janet’s decision to refer to the same article repeatedly while generating her hypothesis, Janet stated aloud “[the article] at least cited some science or something talking about this. And it looks like it has some scientific backing. And what the point is seems to be decent in that it’s not too [irrational].”

Janet’s hypothesis, “There is high death rates in the lagoon, because the environment of the lagoon is not stable, that is temperature/pH/salinity/02 levels have changed,” and her rationale for this hypothesis, “From observing the pie chart and it showing what factors accounted for the deaths,” referred to information collected at the beginning of her investigation. When asked how she would like to test her hypothesis, Janet stated in her lab notebook, “By creating a pseudo lagoon in the lab then altering each of those factors and seeing how it will affect the fish/animals that will be put in there.” Again, like Lisa, we see the effort to create an experimental setting that will allow for control of variables.

Janet spent the bulk of her investigation in this preparatory, information seeking mode (eight minutes) and spent the most time of all four participants examined in these case studies generating her first hypothesis (fourteen minutes). After generating her first hypothesis, Janet tested the average dissolved oxygen in the Indian River Lagoon “because it is one of the independent variables that I think is contributing to throwing the lagoon off. Maybe O2 levels are lower, causing sickness.” Notably, the term “independent variable” was not utilized by novice participants. After reviewing the results, Janet stated that dissolved oxygen did not contribute to the unusual mortality events. Janet then tested water salinity, stating, “Salinity is an independent variable that is potentially contributing to the deaths, and lagoons are kind of where fresh and salt water meet, so maybe an imbalance will cause problems.” In response to the data, Janet stated that there was not enough information to decide whether or not water salinity was contributing to the unusual mortality events.

Janet then decided to generate a new hypothesis, “Cyanobacteria is changing the metabolism of animals in the lagoon to make them sick,” with the rationale that “There were many PubMed articles about this.” Again, Janet explicitly connected her hypotheses to her collected data. Janet then stated that she would like to test her hypothesis by “Test[ing] the levels of cyanobacteria in different lagoons that had deaths and no deaths. Field experiment.” We see here the emphasis on having a control group, but also on conducting a field experiment rather than trying to recreate the field setting in a lab, which is what Lisa suggested. Janet also stated aloud “...different from my original hypothesis, I had suggested to stimulate or create a fake lagoon in the lab. I thought that would be better to control. But since this is more like measurements wise because your interest is just looking at the cyanobacteria, you can actually maybe test the tissues from the dead animals. This could be a field experiment. You don’t have to do much in the lab maybe.” When the investigator queried Janet as to why she decided to generate a new hypothesis, Janet said that her previous tests had “Failed… so, I’m going back to the cyanobacteria. I originally didn’t go there because I had grouped cyanobacteria with biotoxin, so I thought that was already accounted for. But there was a lot of PubMed articles about this, so I’m sure it’s maybe more reliable than my hypothesis [based on an NPR article].”

Janet then used Google to determine if cyanobacteria are the same thing as an algal bloom. She described the results from Google aloud, stating “I typed in cyanobacteria and algal bloom, and a lot of things popped up. For example, the first thing is from the EPA. And it says that cyanobacteria are harmful algal blooms, so they’re the same thing. So, I’m going to choose look for algal bloom levels instead.” From here, Janet chose to examine the presence of algal blooms in the Indian River Lagoon. She stated she wanted to do this test “Because articles from PubMed suggest that cyanobacteria is the culprit for deaths in the lagoon and algal bloom is cyanobacteria.” Janet then followed links embedded in the simulation, including an article on Brown Tides and a recording of historical harmful algal blooms in the Indian River Lagoon. She concluded aloud that algal blooms were a contributing factor “because they can block sunlight and deplete oxygen.” From here, Janet said, “I wanted to do one more test because looking at algal blooms is not enough, but that option isn’t available. So, I’m going to click on I’m ready to make my conclusion.” In her notebook, Janet listed her final conclusion as, “Cyanobacteria like algal blooms are contributing to deaths perhaps by decreasing O2 levels and sunlight,” and the evidence she collected to support this conclusion as, “The damaging history of algal blooms and PubMed articles.”

In Janet’s post-simulation interview, she made several epistemologically relevant comments. For example, when asked if this simulation changed her perceptions of science, she stated, “No, but it did bring to my attention that I probably am not too confident about the different types of experiments. Because what we do in the lab is usually hypothesis…driven. And so, like, this is kind of hypothesis, but it’s more about observation. There’s this issue with the lagoon. Like, what do you think is happening? And then [a question on the pre-test queried if a scenario was] an experiment? So, I realized I was kind of fuzzy on that, and I wasn’t sure if there’s a different type of experiment that is not like one where you have to test, where you can just observe.” In response to why she decided to conclude, Janet stated that she “really wanted to get it right. Not that you can get it right.” Although there may be some ambiguity about what the participant meant by “right” (a correct answer versus a proper experimental design), it is notable that she then clarified her comment by saying that it is not necessarily possible to get something “right,” which could indicate a more sophisticated epistemological stance.

Summary: Comparison of simple and complex experts

Lisa and Janet were similar in that they took considerable care and effort in taking notes and exploring the background information before fully engaging in the simulation. However, Janet pursued extensive amounts of additional information prior to generating her hypothesis. Both experts proposed experiments to test their hypotheses involving a control group. Experts may have emphasized the importance of a control group as part of experimental design because of training that required more experimental, rather than observational, work, or because of a belief that there is only a “single” correct way of doing science. Lisa and Janet both commented that there may be a “single” correct way of doing science, which may also result from their scientific training. Focusing on a single correct method of doing science, one that requires control groups, would indicate a less sophisticated understanding of science practices, whereas acknowledging multiple ways of doing science would be a more sophisticated practice. Janet pursued more information, to a deeper level, specifically commenting on why she was pursuing different sources, whereas Lisa tended to use only information provided by the simulation, more similar to novice behavior. Both experts followed a logical pattern, but Lisa’s investigation was more cursory and ended when she obtained a test result with a plausible explanation, rather than pursuing a more complex relationship as Janet did.

Pre-test performance predicts authentic inquiry practices

Given the wide variety of practices observed in both experts and novices, putative models predicting practices from pre-test performance were generated. First, performance on the pre-test metrics was predicted by expertise (Fig. 2) and in turn predicted the total number of actions. The first Poisson count regression model was built to predict the total number of actions using different predictors, including demographic information of the participants. The model takes the standard Poisson log-link form

log(μ) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ,

where μ is the average number of actions and x₁, …, xₖ are the predictors. Asterisks specify significant coefficients in the model, that is, the predictors that contributed significantly to predicting the total number of actions performed by the participants. Table 4 shows the parameter estimates of the first Poisson count regression model for predicting the number of actions. Additional file 1: Table S1 shows the test of main effects for this model.
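
For readers who want to see how such a model is fit, the sketch below shows the general approach in Python with statsmodels; the original models were run in SAS (the paper cites SAS Institute Inc. 2014), and the file name and column names here (total_actions, nos_p1, nos_p2, seb_total, expert_verb, gender) are hypothetical placeholders rather than the study's actual variable names.

```python
# Illustrative sketch only: a Poisson count regression with a log link,
# with the NOS categories coded 2/3/4 and 4 (sophisticated) as the reference level.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("participant_actions.csv")  # one row per participant (hypothetical file)

model = smf.glm(
    "total_actions ~ C(nos_p2, Treatment(reference=4))"
    " + C(nos_p1, Treatment(reference=4)) + seb_total + expert_verb + C(gender)",
    data=df,
    family=sm.families.Poisson(),
)
result = model.fit()
print(result.summary())        # coefficients on the log-count scale
print(np.exp(result.params))   # exponentiated coefficients, i.e., rate ratios
```

Exponentiating a coefficient converts a difference in log counts into a multiplicative effect on the expected number of actions, which is the interpretation used in the following paragraph.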

The results of the first Poisson regression model demonstrated that NOS principle 2 (tenuous nature of science), but not NOS principle 1 (lack of a universal scientific method), significantly predicted the total number of actions. This variable had three categories, 2 (naïve), 3 (mixed), and 4 (sophisticated), with 4 (sophisticated) used as the baseline category. The coefficient of 0.23 for the naïve category (NOS principle 2 = 2) means that, holding the other variables in the model constant, the log of the expected number of actions is 0.23 units higher for participants with a naïve understanding of the tenuous nature of science than for those whose understanding was sophisticated. For example, an average of ten actions for the population of students with a sophisticated score on the tenuous nature of science corresponds to about 12 actions for a similar population of students with naïve scores. The coefficient of 0.41 for the mixed category (NOS principle 2 = 3) means that, under the same circumstances, the log of the expected number of actions is 0.41 units higher for participants with a mixed understanding of the tenuous nature of science than for those with a sophisticated understanding. Here, an average of ten actions for the population of students with a sophisticated score corresponds to an expectation of about 15 actions for a similar population of students with mixed scores.
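
As a worked check of the two examples above (added here for clarity; the ten-action baseline is the same illustrative figure used in the text), exponentiating each coefficient gives the multiplicative effect on the expected count:

\[
e^{0.23} \approx 1.26, \quad 10 \times 1.26 \approx 12.6 \ (\text{about 12 actions}); \qquad
e^{0.41} \approx 1.51, \quad 10 \times 1.51 \approx 15.1 \ (\text{about 15 actions}).
\]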

When building the second Poisson count regression model to predict the number of information seeking actions, using the same predictors used for the first model, NOS principle 1 (lack of a universal scientific method) (χ2 = 14.65, p = .0007), but not NOS principle 2 (tenuous nature of science), was a significant predictor. For the third Poisson count model, which was built to predict the number of investigative actions using the same predictors as the first and second models, neither NOS principle 1 nor NOS principle 2 was a significant predictor.

Within the first count regression model, performance on the SEB survey was examined to determine whether it significantly predicted the total number of actions. SEB was entered into the model in two different ways to see which was more appropriate for the final model: first, each of the four SEB domains was entered as a separate variable; second, the SEB total average score was entered as one variable. Keeping everything else the same, both models were fitted, and the count regression model with the total average SEB (model deviance: 98.5481, AICC: 268.6583) fit better than the same model with four separate variables for SEB survey performance (model deviance: 97.5033, AICC: 289.4212). The same relationship held between the models when predicting the number of investigative actions and the number of information seeking actions. Thus, for all three of the Poisson count regression models, the total average SEB was used as a predictor. The total number of actions and the number of information seeking actions were significantly predicted by SEB survey performance (χ2 = 27.87, p < .0001 and χ2 = 42.66, p < .0001, respectively). However, the total number of investigative actions was not significantly predicted by SEB performance. Finally, the relationship between expert verb score and actions performed in SCI was examined within the three Poisson count regression models discussed above. Expert verb score significantly predicted the total number of actions (χ2 = 12.36, p = .0004) and the number of investigative actions (χ2 = 14.03, p = .0002), but it did not contribute significantly to predicting the number of information seeking actions.
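
The model-selection step between the two SEB specifications can be sketched as follows, reusing the hypothetical df from the previous example and assuming, following Conley et al. (2004), that the four SEB domains are source, certainty, development, and justification (the other predictors of the full model are omitted here for brevity). statsmodels reports AIC and deviance directly, and the small-sample AICC correction is computed by hand.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def aicc(result):
    # Small-sample corrected AIC: AICc = AIC + 2k(k + 1) / (n - k - 1)
    k = result.df_model + 1   # number of estimated parameters, including the intercept
    n = result.nobs
    return result.aic + 2 * k * (k + 1) / (n - k - 1)

fam = sm.families.Poisson()

# Specification 1: the single total average SEB score (hypothetical column names)
m_total = smf.glm("total_actions ~ seb_total + expert_verb", data=df, family=fam).fit()

# Specification 2: the four SEB domains entered separately
m_domains = smf.glm(
    "total_actions ~ seb_source + seb_certainty + seb_development + seb_justification"
    " + expert_verb",
    data=df, family=fam).fit()

for name, res in [("SEB total average", m_total), ("SEB four domains", m_domains)]:
    print(f"{name}: deviance = {res.deviance:.2f}, AICc = {aicc(res):.2f}")
```

The specification with the lower AICC is retained, which is the logic by which the total average SEB score was kept in the models described above.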

In this study, a baseline for expert practices in a simulated authentic science inquiry environment was established. These practices may reflect knowledge and beliefs about the nature of science inquiry and how inquiry generates new knowledge, or EASI. Participant practices in the simulated authentic inquiry experience offered by SCI simulations, including actions performed, type of investigation, and language used during the conclusion phases, were tied to existing metrics of nature of science understanding and scientific epistemic beliefs to generate preliminary models of what student practices in authentic inquiry can reveal about underlying epistemological beliefs. Given concerns with existing metrics of NOS or science epistemology (Sandoval 2005; Sandoval et al. 2016; Sandoval and Redman 2015), the importance of this foundational knowledge for attaining science literacy (Renken et al. 2016; Schwan et al. 2014), and the potential power of using simulations for high-throughput, real-time assessment of difficult-to-measure constructs, this study raised several interesting questions and prompts future research on using simulation-based assessment to understand such constructs.

To define expert-like EASI, both experts and novices completed pre-test metrics to assess their baseline NOS understanding and science epistemic beliefs, followed by a SCI simulation. Experts had a more sophisticated understanding of the two NOS principles assessed: the tenuous nature of science knowledge and the lack of a universal scientific method. Experts also scored marginally higher on the SEB survey, particularly in the source and certainty domains, although this difference was not statistically significant. These results, in addition to the recruitment criteria for the experts and novices, underscore that experts not only have more experience with authentic science inquiry, but that their underlying understanding of science was also more sophisticated than that of novices. Since there could be some concern that it is possible to “do” science without understanding the whole process, this result suggests that experts had both a better understanding of the process of science inquiry and experience performing inquiry. This study provides preliminary evidence of a relationship between what participants know and believe about science and their inquiry practices. Although experts were defined by their measurable experience with authentic science practices, namely their years of experience, publishing novel primary peer-reviewed research, and passing their comprehensive exams, it is possible that other aspects of expertise could have influenced these results. For example, none of the experts had expertise in the specific content area of the simulation. How would practices change if an individual had content expertise, but may or may not have expertise in authentic science practices? Could an expert be more likely to use certain heuristics if they are a content expert, and therefore have their investigations appear simpler? Might an expert express more naïve views within the context of the simulation even though they would express complex views in another situation (Sandoval and Redman 2015)? Perhaps one interpretation of the expert trajectories observed is that, when unfamiliar with the content area, experts chose a complicated and sophisticated trajectory during authentic inquiry because they could not rely on previous experience. Future work will examine expert practices both in and outside their content area of expertise.

Experts also performed more complex investigations aimed at uncovering mechanistic cause-and-effect relationships, performed more actions in total, particularly information seeking actions, and overall had inquiry profiles that alternated between periods of testing and information seeking. Notably, we observed a linear relationship between overall complexity and total actions; the increase in actions was not due to random activity during the simulation, but was likely intentional in nature. Future work with a larger sample size, and thus increased power, will allow fitting more advanced predictive logistic models to probe for other predictors that may contribute to predicting the level of expertise of the participants.

The case study analysis further demonstrated that more expert-like investigations were correlated with planning time and information seeking decisions, consistent with novice/expert studies in the engineering education literature (Atman et al. 2007). The difference in information seeking decisions, both between experts and novices and between the complex and simple investigations, raises some interesting questions about EASI. Why did some novices, such as the simple novice showcased here, choose to ignore the availability of information and instead say, “I do not know”? This could be explained by self-efficacy, which among high school students is related to students’ scientific epistemic beliefs (Tsai et al. 2011). Students who do not have high self-efficacy toward their ability to do science may be unlikely to fully engage with what they do not understand, and may therefore perform simple investigations. Preliminary work indicated that practices within SCI correlated with affective factors such as self-efficacy, metacognition, and a sense of having a science identity (Peffer et al. 2018). This result could also indicate that the novices in question have only been exposed to the canned, recipe-like simple inquiry that tends to predominate in K-16 classrooms (Chinn and Malhotra 2002), and that without an obvious answer the default is to say “I don’t know.” Since simple inquiry environments are built on following a set of instructions, not engaging in independent exploration, it may be that novices have not learned about the importance of synthesizing outside information as part of an autonomous investigation, which is considered part of understanding the nature of science inquiry (Lederman et al. 2014).

The complex expert highlighted here spent considerable time during her think-aloud discussing the source of the articles that she used during her investigation. She also reviewed the primary literature as part of her investigation, which was observed among some of the experts, but not among any novices. Source of knowledge is a dimension of scientific epistemic beliefs (Conley et al. 2004), and how online sources are evaluated is thought to represent students’ epistemic thinking (Barzilai and Zohar 2012). Therefore, how students evaluate and choose to incorporate evidence into their investigations may be an important epistemologically salient episode to target for both assessment purposes and classroom instruction.

Both the score on the second NOS principle (tenuous nature of science knowledge) and the SEB total score predicted total actions. Since the two domains of the SEB survey that differed most between experts and novices were source and certainty, this suggested a connection between how students conceptualize the nature of science knowledge and the depth of their investigations. Although neither alone predicted investigative or information seeking actions, the presence of more actions and the correlation with overall complexity of investigation may indicate that students who have a more sophisticated understanding of the nature of science knowledge engage in more sophisticated investigations aimed at revealing cause-and-effect relationships, not simply arriving at a single answer. However, these data are limited by a small sample size, and although these prospective correlations are interesting, additional work is necessary to fully understand the connections between baseline understanding of NOS and epistemological beliefs about science and how such knowledge and beliefs play out during authentic science inquiry.

Limitations

A possible confound in this study may be participants’ interest in the topic at hand. A more interested student may have been more willing to seek information and pursue many tests to reach a conclusion. For example, one research participant stated in his post-simulation interview that he “didn’t want to commit the time to really figure it out and would stay and work more if he was earning more class credit.” He also commented that he would “get an A” for doing it, regardless of how well he tried. In contrast to this participant, Sally stated multiple times in her think-aloud transcript how “interested” she was in the subject. It is also worth noting that both of the participants described in these instances were coded as simple novices. Related to interest, another potential confound is the perception of time. Beth specifically stated in her post-simulation interview that “she wasn’t in a hurry, had nowhere else to be.”

Another limitation concerned the metrics used. In addition to reliability and validity concerns (Deng et al. 2011; Sandoval and Redman 2015), we also observed very high scores on the SEB and had low interrater reliability on the NOS metric. However, despite these concerns, the hypothesis that experts would perform better than novices was statistically supported. Furthermore, in the modeling analysis, both metrics were consistent with one another and predicted the total number of actions performed. Investigative and information seeking actions performed were not related to complexity of investigation, likely due to small sample size and limited power. Although there was high reliability for our pre-test metric, the small sample size prevented an accurate calculation of its validity. However, the preliminary quantitative analyses, combined with the qualitative analysis, yielded provocative results about differences in authentic science practices between experts and novices. Further studies with additional participants are warranted.

Implications and future directions

The development of sophisticated epistemological beliefs about science is an essential goal of science education and overall science literacy. However, epistemological beliefs are difficult to assess and, consequently, to measure. Although existing metrics provide a snapshot of students’ epistemological beliefs and/or NOS understanding, there are many concerns about their reliability and validity (Deng et al. 2011; Sandoval and Redman 2015).

In addition, the lengthy nature of the VNOS or VASI metrics can preclude their use in a classroom setting or with large numbers of participants. Assessing what students do in authentic science inquiry as a proxy for what students know and believe about science may provide a potential solution. In the present study, experts and novices exhibited distinct inquiry trajectories that were correlated with their scores on extant metrics of NOS and epistemological beliefs about science. Although our work had a small overall sample size, the data supported the hypothesis that practices reflect epistemological beliefs. Using computer-based assessment and learning analytics techniques, such as automated language analysis (Peffer and Kyle 2017), could allow for high-throughput measurement and analysis of student practices.
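
As a minimal illustration of what such automated language analysis might look like (this is not the Peffer and Kyle 2017 pipeline, and the word lists below are hypothetical stand-ins for a validated lexicon), one could count hedged versus unhedged terms in participants’ conclusion statements:

```python
# Toy sketch: flag hedged vs. unhedged language in transcript or notebook text.
# The term lists are illustrative only, not a validated expert-verb lexicon.
import re

HEDGED = {"support", "supported", "refute", "refuted", "suggest", "suggests", "may", "might", "possibly"}
UNHEDGED = {"prove", "proved", "proven", "right", "wrong", "correct", "incorrect"}

def hedging_profile(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "hedged": sum(t in HEDGED for t in tokens),
        "unhedged": sum(t in UNHEDGED for t in tokens),
    }

print(hedging_profile("My hypothesis is definitely refuted."))  # {'hedged': 1, 'unhedged': 0}
print(hedging_profile("So my hypothesis is wrong."))            # {'hedged': 0, 'unhedged': 1}
```

Aggregated per participant, counts like these are the kind of feature a high-throughput, simulation-based assessment could compute in real time.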

Improved methods of assessing students’ epistemological beliefs about science may provide new pedagogical avenues to both understand and address the epistemic processes of inquiry. Certain practices in authentic inquiry, such as use of tentative or hedging language (Peffer and Kyle 2017), persistent seeking of information outside the simulation, and use of complex inquiry strategies, were reflective of expertise and correlated with performance on pre-test assessments of NOS understanding. Future studies expanding this work, using larger sample sizes and consequently more powerful statistical models, will provide additional details about epistemologically salient episodes in authentic science inquiry that could be pedagogically targeted. For example, the effectiveness of teaching interventions designed to increase not only student understanding of NOS, but also its application, could be tested. This may be particularly useful with preservice teachers who can accurately describe NOS items, but fail to transfer that knowledge into their classrooms (Bartos and Lederman 2014).

Novice practices existed on a continuum from less to more sophisticated (Fig. 3). This diversity highlights the variety of epistemic perspectives in a given population and offers the possibility of personalizing learning in the classroom. For example, are better pedagogical outcomes observed if less expert-like novices are paired with more expert-like novices? Do less expert-like novices respond to pedagogical interventions differently than more expert-like novices? If students who are intermediate between the novices and experts participated in this study, such as advanced undergraduate biology majors, would we observe a hybrid profile between experts and novices? Tracing the development of sophisticated epistemological beliefs over time could help indicate which existing pedagogical interventions are most effective at promoting the development of not only content and practice knowledge, but also their epistemic underpinnings. Also, the identification of what “expert” means in the context of simulated authentic inquiry may reveal new pedagogical targets for promoting the development of EASI. New pedagogical interventions could also be developed by understanding what happens over the course of a student’s development from novice to expert. Furthermore, since not all novices will in fact become experts, yet all still require an expert-like understanding of how science works to be productive members of society, these prospective pedagogical interventions could lead to improved overall science literacy.

Despite its small sample size, this study represents the first iteration of a larger study that may change the way researchers and instructors assess the underlying philosophical foundations students have about the relationship between science inquiry and generation of new science knowledge. The qualitative results described here provide important grounding for future work developing a practices-based assessment of EASI. Performance on a pre-test metric and overall expertise predicted actions performed during inquiry and were able to identify some potentially epistemologically salient episodes for future examination with a larger sample size. This work highlights the potential of using high-throughput, real-time assessment of simulated authentic science practices as a less-constrained way of examining constructs that are traditionally difficult to assess. Examining what students do during inquiry, rather than what they say about inquiry on a standard measure, removes the philosophical assumptions that come with traditional assessments. Consequently, examining EASI in SCI may lead to new areas of pedagogical focus and techniques that improve student EASI and overall science literacy. For example, how can the inquiry profiles generated by students be used to personalize instruction? The potential power of using simulations as an assessment technique to examine multiple simultaneous users in real time is exciting, as are the implications for how simulations can be leveraged to improve science literacy.

Footnote 1: A necropsy is an autopsy performed on an animal.

Footnote 2: We note here that climate change is the more accurate description of what Lisa means by global warming. At another point in her think-aloud transcript, Lisa acknowledges her error.

Footnote 3: PubMed is a database of biology and medical primary literature.

Abbreviations

AIR: Aims and Values, Epistemic Ideals, Reliable Processes

CURE: Course-Based Undergraduate Research Experience

EASI: Epistemology in Authentic Science Inquiry

NGSS: Next Generation Science Standards

NOS: Nature of Science

SCI: Science Classroom Inquiry

SEB: Scientific epistemic belief

VASI: Views About Science Inquiry

VNOS: Views of Nature of Science

Abd-El-Khalick, F. (2012). Examining the sources for our understandings about science: Enduring conflations and critical issues in research on nature of science in science education. International Journal of Science Education, 34 (3), 353–374.


Akerson, V. L., Cullen, T. A., & Hanson, D. L. (2010). Experienced teachers’ strategies for assessing nature of science conceptions in the elementary classroom. Journal of Science Teacher Education, 21 (6), 723–745. https://doi.org/10.1007/s10972-010-9208-x .

Atman, C., Adams, R., Cardella, M., Turns, J., Mosborg, S., & Saleem, J. (2007). Engineering design processes: a comparison of students and expert practitioners. Journal of Engineering Education, 96 (4), 359–379. https://doi.org/10.1002/j.2168-9830.2007.tb00945.x .

Auchincloss, L. C., Laursen, S. L., Branchaw, J. L., Eagan, K., Graham, M., Hanauer, D. I., et al. (2014). Assessment of course-based undergraduate research experiences: a meeting report. CBE Life Sciences Education, 13 (1), 29–40. https://doi.org/10.1187/cbe.14-01-0004 .

Bartos, S. A., & Lederman, N. G. (2014). Teachers’ knowledge structures for nature of science and scientific inquiry: Conceptions and classroom practice. Journal of Research in Science Teaching, 51 (9), 1150–1184. https://doi.org/10.1002/tea.21168 .

Barzilai, S., & Zohar, A. (2012). Epistemic thinking in action: evaluating and integrating online sources. Cognition and Instruction, 30 (1), 39–85. https://doi.org/10.1080/07370008.2011.636495 .

Campbell, T., Oh, P. S., & Neilson, D. (2012). Discursive modes and their pedagogical functions in model-based inquiry (MBI) classrooms. International Journal of Science Education, 34 (15), 2393–2419. https://doi.org/10.1080/09500693.2012.704552 .

Chinn, C. A., & Malhotra, B. A. (2002). Epistemologically authentic inquiry in schools: a theoretical framework for evaluating inquiry tasks. Science Education, 86 (2), 175–218. https://doi.org/10.1002/sce.10001 .

Chinn, C. A., Rinehart, R. W., & Buckland, L. A. (2014). Epistemic cognition and evaluating information: Applying the AIR model of epistemic cognition. In D. Rapp & J. Braasch (Eds.), Processing inaccurate information . Cambridge, MA: MIT Press.


Conley, A. M., Pintrich, P. R., Vekiri, I., & Harrison, D. (2004). Changes in epistemological beliefs in elementary science students. Contemporary Educational Psychology, 29 (2), 186–204. https://doi.org/10.1016/j.cedpsych.2004.01.004 .

Corwin, L., Graham, M., & Dolan, E. (2015). Modeling course-based undergraduate research experiences: an agenda for future research and evaluation. CBE Life Sciences Education, 14 (1), es1. https://doi.org/10.1187/cbe.14-10-0167 .

Creswell, J. W. (2014). Research design: qualitative, quantitative, and mixed methods approaches (4th ed.). Thousand Oaks: SAGE Publications.

Creswell, J. W., & Plano, C. V. L. (2011). Designing and conducting mixed methods research . Los Angeles: SAGE Publications.

Deng, F., Chen, D., Tsai, C., & Chai, C. S. (2011). Students’ views of the nature of science: a critical review of research. Science Education, 95 (6), 961–999. https://doi.org/10.1002/sce.20460 .

Elby, A., Macrander, C., & Hammer, D. (2016). Epistemic cognition in science. In I. Bråten, J. Greene, & W. Sandoval (Eds.), Handbook of Epistemic Cognition (pp. 113–127). New York: Routledge.

Ford, M. (2015). Educational implications of choosing “Practice” to describe science in the Next Generation Science Standards. Science Education, 99 (6), 1041–1048.

Greene, J. A., Sandoval, W. A., & Bråten, I. (2016). Handbook of epistemic cognition . London: Routledge Ltd. - M.U.A. https://doi.org/10.4324/9781315795225 .


Hanauer, D. I., Hatfull, G. F., & Jacobs-Sera, D. (2009). Active assessment: assessing scientific inquiry (1st ed.). New York; London: Springer. https://doi.org/10.1007/978-0-387-89649-6.

Hofer, B. K., & Pintrich, P. R. (1997). The development of epistemological theories: beliefs about knowledge and knowing and their relation to learning. Review of Educational Research, 67 (1), 88–140. https://doi.org/10.2307/1170620 .

Hu, D., & Rebello, N. (2014). Shifting college students’ epistemological framing using hypothetical debate problems. Physical Review Special Topics - Physics Education Research, 10 (1), 010117. https://doi.org/10.1103/PhysRevSTPER.10.010117 .

Knight, S., Buckingham Shum, S., & Littleton, K. (2014). Epistemology, assessment, pedagogy: where learning meets analytics in the middle space. Journal of Learning Analytics, 1 (2), 23–47. https://doi.org/10.18608/jla.2014.12.3 .

Koerber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33 (1), 57–72. https://doi.org/10.1111/bjdp.12067 .

Lederman, J. S., Lederman, N. G., Bartos, S. A., Bartels, S. L., Meyer, A. A., & Schwartz, R. S. (2014). Meaningful assessment of learners’ understandings about scientific inquiry—The views about scientific inquiry (VASI) questionnaire. Journal of Research in Science Teaching, 51 (1), 65–83. https://doi.org/10.1002/tea.21125 .

Lederman, N. G., Abd-El-Khalick, F., Bell, R. L., & Schwartz, R. E. S. (2002). Views of nature of science questionnaire: toward valid and meaningful assessment of learners’ conceptions of nature of science. Journal of Research in Science Teaching, 39 (6), 497–521. https://doi.org/10.1002/tea.10034 .

Mason, L., Ariasi, N., & Boldrin, A. (2011). Epistemic beliefs in action: spontaneous reflections about knowledge and knowing during online information searching and their influence on learning. Learning and Instruction, 21 (1), 137–151. https://doi.org/10.1016/j.learninstruc.2010.01.001 .

NGSS Lead States. (2013). Next Generation Science Standards: For States, By States . Washington, D.C.: The National Academies Press.

Osborne, J. (2014a). Scientific practices and inquiry in the science classroom. In N. Lederman & S. K. Abell (Eds.), Handbook of Research on Science Education . Abingdon: Routledge.

Osborne, J. (2014b). Teaching scientific practices: meeting the challenge of change. Journal of Science Teacher Education, 25 (2), 177–196. https://doi.org/10.1007/s10972-014-9384-1 .

Peffer, M., Royse, E., & Abelein, H. (2018). Influence of affective factors on practices in simulated authentic science inquiry. In Rethinking Learning in Digital Age: Making the Learning Sciences Count. Conference Proceedings of the International Conference of the Learning Sciences (Vol. 3).

Peffer, M. E., Beckler, M. L., Schunn, C., Renken, M., & Revak, A. (2015). Science classroom inquiry (SCI) simulations: a novel method to scaffold science learning. PLoS One, 10 (3), e0120638.

Peffer, M. E., & Kyle, K. (2017). Assessment of language in authentic science inquiry reveals putative differences in epistemology. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference (pp. 138–142). New York: ACM.

Renken, M., Peffer, M., Otrel-Cass, K., Girault, I., & Chiocarriello, A. (2016). Simulations as scaffolds in science education . Springer International Publishing: Springer.

Rowland, S., Pedwell, R., Lawrie, G., Lovie-Toon, J., & Hung, Y. (2016). Do we need to design course-based undergraduate research experiences for authenticity? Cell Biology Education, 15 (4), ar79. https://doi.org/10.1187/cbe.16-02-0102 .

Sandoval, W. A. (2005). Understanding students’ practical epistemologies and their influence on learning through inquiry. Science Education, 89 (4), 634–656. https://doi.org/10.1002/sce.20065 .

Sandoval, W. A., Greene, J. A., & Bråten, I. (2016). Understanding and promoting thinking about knowledge: origins, issues, and future directions of research on epistemic cognition. Review of Research in Education, 40 (1), 457–496. https://doi.org/10.3102/0091732X16669319 .

Sandoval, W. A., & Redman, E. H. (2015). The contextual nature of scientists’ views of theories, experimentation, and their coordination. Science & Education, 24 (9), 1079–1102. https://doi.org/10.1007/s11191-015-9787-1 .

SAS Institute Inc. (2014). SAS/ETS®13.2 User’s Guide . Cary: SAS Institute Inc.

Schizas, D., Psillos, D., & Stamou, G. (2016). Nature of science or nature of the sciences? Science Education, 100 (4), 706–733. https://doi.org/10.1002/sce.21216 .

Schraw, G. (2013). Conceptual integration and measurement of epistemological and ontological beliefs in educational research. ISRN Education, 2013 , 1–19.

Schwan, S., Grajal, A., & Lewalter, D. (2014). Understanding and engagement in places of science experience: Science museums, science centers, zoos, and aquariums. Educational Psychologist, 49 (2), 70–85. https://doi.org/10.1080/00461520.2014.917588 .

Schwartz, R., & Lederman, N. (2008). What scientists say: scientists’ views of nature of science and relation to science context. International Journal of Science Education, 30 (6), 727–771. https://doi.org/10.1080/09500690701225801 .

Schwichow, M., Croker, S., Zimmerman, C., Höffler, T., & Härtig, H. (2016). Teaching the control-of-variables strategy: a meta-analysis. Developmental Review, 39 , 37–63.

Someren, M. V., Barnard, Y. F., & Sandberg, J. A. (1994). The think aloud method: a practical approach to modelling cognitive processes . Academic Press.

Stathopoulou, C., & Vosniadou, S. (2007). Exploring the relationship between physics-related epistemological beliefs and physics understanding. Contemporary Educational Psychology, 32 , 255–281.

Tsai, C., Jessie Ho, H. N., Liang, J., & Lin, H. (2011). Scientific epistemic beliefs, conceptions of learning science and self-efficacy of learning science among high school students. Learning and Instruction, 21 (6), 757–769. https://doi.org/10.1016/j.learninstruc.2011.05.002 .

Windschitl, M., Thompson, J., & Braaten, M. (2008). Beyond the scientific method: model-based inquiry as a new paradigm of preference for school science investigations. Science Education, 92 (5), 941–967. https://doi.org/10.1002/sce.20259 .

Worsley, M., & Blikstein, P. (2014). Analyzing engineering design through the lens of computation. Journal of Learning Analytics, 1 (2), 151–186. https://doi.org/10.18608/jla.2014.12.8 .

Yang, F., Huang, R., & Tsai, I. (2016). The effects of epistemic beliefs in science and gender difference on university students’ science-text reading: An eye-tracking study. International Journal of Science and Mathematics Education, 14 (3), 473–498. https://doi.org/10.1007/s10763-014-9578-1 .

Zeineddin, A., & Abd-El-Khalick, F. (2010). Scientific reasoning and epistemological commitments: Coordination of theory and evidence among college science students. Journal of Research in Science Teaching, 47 (9), 1064–1093.


Acknowledgements

We would like to thank Maggie Renken for advice on study design and to Chris Schunn and Norm Lederman for helpful feedback on the earlier versions of the manuscript. We would also like to thank Josephine Lindsley, Arianna Garcia and Hannah Abelein for assistance with coding and Renee Schwarz for guidance on administering the nature of science items.

Not applicable.

Availability of data and materials

All de-identified data sets used in this study are available from the corresponding author upon reasonable request.

Author information

Authors and Affiliations

College of Natural and Health Sciences, University of Northern Colorado, Ross Hall 1556, Campus Box 92, Greeley, CO, 80639, USA

Melanie E. Peffer

Volgenau School of Engineering, George Mason University, 4400 University Drive, MS 4A7, Fairfax, VA, 22030, USA

Niloofar Ramezani


Contributions

MP designed the study, collected data, and performed qualitative analysis. NR performed the quantitative analysis. Both authors contributed to writing of the manuscript and approved the final version.

Corresponding author

Correspondence to Melanie E. Peffer.

Ethics declarations

Ethics approval and consent to participate

All research procedures were approved by Georgia State University’s Institutional Review Board (IRB # H16103), and all research participants gave informed consent prior to participating in research. All data presented here are in de-identified form.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. Parameter Estimates of Poisson Count Regression for Number of Investigative Actions. (PDF 275 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Peffer, M.E., Ramezani, N. Assessing epistemological beliefs of experts and novices via practices in authentic science inquiry. IJ STEM Ed 6, 3 (2019). https://doi.org/10.1186/s40594-018-0157-9


Received : 21 June 2018

Accepted : 21 December 2018

Published : 22 January 2019

DOI : https://doi.org/10.1186/s40594-018-0157-9


  • Educational technology


Hypothesis Generation and Interpretation

Design Principles and Patterns for Big Data Applications

  • © 2024
  • Hiroshi Ishikawa

Department of Systems Design, Tokyo Metropolitan University, Hino, Japan


  • Provides an integrated perspective on why decisions are made and how the process is modeled
  • Presentation of design patterns enables use in a wide variety of big-data applications
  • Multiple practical use cases indicate the broad real-world significance of the methods presented

Part of the book series: Studies in Big Data (SBD, volume 139)


Table of contents (8 chapters)

  • Front Matter
  • Basic Concept
  • Science and Hypothesis
  • Machine Learning and Integrated Approach
  • Hypothesis Generation by Difference
  • Methods for Integrated Hypothesis Generation
  • Interpretation
  • Back Matter

Keywords

  • Hypothesis Generation
  • Hypothesis Interpretation
  • Data Engineering
  • Data Science
  • Data Management
  • Machine Learning
  • Data Mining
  • Design Patterns
  • Design Principles

About this book

The novel methods and technologies proposed in Hypothesis Generation and Interpretation are supported by the incorporation of historical perspectives on science and an emphasis on the origin and development of the ideas behind their design principles and patterns.

Authors and Affiliations

About the author

He has published actively in international, refereed journals and conferences, such as ACM Transactions on Database Systems, IEEE Transactions on Knowledge and Data Engineering, The VLDB Journal, IEEE International Conference on Data Engineering, and ACM SIGSPATIAL and Management of Emergent Digital EcoSystems (MEDES). He has authored and co-authored a dozen books, including Social Big Data Mining (CRC, 2015) and Object-Oriented Database System (Springer-Verlag, 1993).

Bibliographic Information

Book Title : Hypothesis Generation and Interpretation

Book Subtitle : Design Principles and Patterns for Big Data Applications

Authors : Hiroshi Ishikawa

Series Title : Studies in Big Data

DOI : https://doi.org/10.1007/978-3-031-43540-9

Publisher : Springer Cham

eBook Packages : Computer Science , Computer Science (R0)

Copyright Information : The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

Hardcover ISBN : 978-3-031-43539-3 Published: 02 February 2024

Softcover ISBN : 978-3-031-43542-3 Due: 31 January 2024

eBook ISBN : 978-3-031-43540-9 Published: 01 January 2024

Series ISSN : 2197-6503

Series E-ISSN : 2197-6511

Edition Number : 1

Number of Pages : XII, 372

Number of Illustrations : 52 b/w illustrations, 125 illustrations in colour

Topics : Theory of Computation , Database Management , Data Mining and Knowledge Discovery , Machine Learning , Big Data , Complex Systems


LSE Impact Blog

February 3rd, 2016

Putting hypotheses to the test: We must hold ourselves accountable to decisions made before we see the data.


David Mellor

We are giving $1,000 prizes to 1,000 scholars simply for making clear when data were used to generate or test a hypothesis. Science is the best tool we have for understanding the way the natural world works. Unfortunately, it is in our imperfect hands. Though scientists are curious and can be quite clever, we also fall victim to biases that can cloud our vision. We seek rewards from our community, we ignore information that contradicts what we believe, and we are capable of elaborate rationalizations for our decisions. We are masters of self-deception.

Yet we don’t want to be. Many scientists choose their career because they are curious and want to find real answers to meaningful questions. In its idealized form, science is a process of proposing explanations and then using data to expose their weaknesses and improve them. This process is both elegant and brutal. It is elegant when we find a new way to explain the world, a way that no one has thought of before. It is brutal in a way that is familiar to any graduate student who has proposed an experiment to a committee or to any researcher who has submitted a paper for peer review. Logical errors, alternative explanations, and falsification are not just common – they are baked into the process.

Image credit: Winnowing Grain by Eastman Johnson, Museum of Fine Arts, Boston.

Using data to generate potential discoveries and using data to subject those discoveries to tests are distinct processes. This distinction is known as exploratory (or hypothesis-generating) research and confirmatory (or hypothesis-testing) research. In the daily practice of doing research, it is easy to confuse which one is being done. But there is a way – preregistration. Preregistration defines how a hypothesis or research question will be tested – the methodology and analysis plan. It is written down in advance of looking at the data, and it maximizes the diagnosticity of the statistical inferences used to test the hypothesis. After the confirmatory test, the data can then be subjected to any exploratory analyses to identify new hypotheses that can be the focus of a new study. In this way, preregistration provides an unambiguous distinction between exploratory and confirmatory research.

The two actions, building and tearing down, are both crucial to advancing our knowledge. Building pushes our potential knowledge a bit further than it was before. Tearing down separates the wheat from the chaff. It exposes that new potential explanation to every conceivable test to see if it survives.

To illustrate how confirmatory and exploratory approaches can be easily confused, picture a path through a garden, forking at regular intervals, as it spreads out into a wide tree. Each split in this garden of forking paths is a decision that can be made when analysing a data set. Do you exclude these samples because they are too extreme? Do you control for income/age/height/wealth? Do you use the mean or median of the measurements? Each decision can be perfectly justifiable and seem insignificant in the moment. After a few of these decisions, there is a surprisingly large number of reasonable analyses. One quickly reaches the point where there are so many of these reasonable analyses that the traditional threshold of statistical significance, p < .05, or 1 in 20, can be obtained by chance alone.
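To make the arithmetic concrete, the short simulation below (not from the original post; the two-group t-test setup and the numbers are illustrative assumptions) treats each "reasonable analysis" as one more look at data in which no true effect exists, and counts how often at least one look clears p < .05.

```python
# Minimal simulation (illustrative, not from the original post): how a modest
# number of "reasonable" analysis choices on pure-noise data inflates the
# chance of finding at least one p < .05 result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_study(n=100, n_forks=20):
    """Simulate one study with no true effect and try n_forks analysis paths.

    Each fork is modeled crudely as an independent look at noise, standing in
    for a different outlier rule, covariate set, or outcome transformation.
    """
    p_values = []
    for _ in range(n_forks):
        group_a = rng.normal(size=n)
        group_b = rng.normal(size=n)  # same distribution: the null is true
        p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
    return min(p_values) < 0.05

n_studies = 2000
false_positive_rate = np.mean([one_study() for _ in range(n_studies)])
print(f"Share of null studies with at least one p < .05: {false_positive_rate:.2f}")
# With 20 independent looks this is roughly 1 - 0.95**20, about 0.64, far above 0.05.
```

In real forking paths the analyses share one dataset, so the looks are correlated rather than independent, but the qualitative point stands: enough flexibility makes "significance" cheap.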


If we don’t have strong reasons to make these decisions ahead of time, we are simply exploring the dataset for the path that tells the most interesting story. Once we find that interesting story, bolstered by the weight of statistical significance, every decision on that path becomes even more justified, and all of the reasonable, alternative paths are forgotten. Without us realizing what we have done, the diagnosticity of our statistical inferences is gone. We have no idea if our significant result is a product of accumulated luck with random error in the data, or if it is revealing a truly unusual result worthy of interpretation.

This is why we must hold ourselves accountable to decisions made before seeing the data. Without putting those reasons into a time-stamped, uneditable plan, it becomes nearly impossible to avoid making decisions that lead to the most interesting story. This is what preregistration does. Without preregistration, we effectively change our hypothesis as we make those decisions along the forking path. The work that we thought was confirmatory becomes exploratory without us even realizing it.

I am advocating for a way to make sure the data we use to create our explanations is separated from the data that we use to test those explanations. Preregistration does not put science in chains. Scientists should be free to explore the garden and to advance knowledge. Novelty, happenstance, and unexpected findings are core elements of discovery. However, when it comes time to put our new explanations to the test, we will make progress more efficiently and effectively by being as rigorous and as free from bias as possible.

Preregistration is effective. After the United States required that all clinical trials of new treatments on human subjects be preregistered, the rate of finding a significant effect on the primary outcome variable fell from 57% to just 8% within a group of 55 cardiovascular studies. This suggests that flexibility in analytical decisions had an enormous effect on the analysis and publication of these large studies. Preregistration is supported by journals and research funders. Taking this step will show that you are taking every reasonable precaution to reach the most robust conclusions possible, and will improve the weight of your assertions.

Most scientists, when testing a hypothesis, do not specify key analytical decisions prior to looking through a dataset. It’s not what we’re trained to do. We at the Center for Open Science want to change that. We will be giving 1,000 researchers $1,000 prizes for publishing the results of preregistered work. You can be one of them. Begin your preregistration by going to https://cos.io/prereg.


Note: This article gives the views of the author(s), and not the position of the LSE Impact blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.

About the Author:

David Mellor is a Project Manager at the Center for Open Science and works to encourage preregistration. He received his PhD from Rutgers University in Ecology and Evolution and has been an active researcher in the behavioral ecology and citizen science communities.

Comments

I strongly agree with almost all of this. One question, though. I sometimes take part in studies that use path models. It can happen that a referee suggests an additional pathway that makes sense to us. But this would not have been in the original specification of the model. Come to think of it this kind of thing must happen pretty often. How would you view that?

That is a great point and is a very frequent occurrence. I think that the vast majority of papers come out of peer review with one or more changes in how the data are analyzed. The best way to handle that is with transparency: “The following additional paths (or tests, interactions, correlations, etc.) were conducted after data collection was complete…” The important distinction is to not present those new pathways as simply part of the a priori tests or to lump them with the same analyses presented initially and planned ahead of time. This way, the reader will be able to properly situate those new tests in the complete body of evidence presented in the paper. After data collection and initial analysis, any new tests are influenced by the data and are, in essence, a new hypothesis that is now being tested with the same data that was used to create it. That new test can be confirmed with a later follow-up study using newly collected data.

Doesn’t this just say – we can only be honest by being rigid? It carries hypothetico-deductive ‘logic’ to a silly extreme, ignoring the inherently iterative process of theorization, recognition of interesting phenomena, and data analysis. But, creative research is not like this. How can you formulate meaningful hypotheses without thinking about and recognizing patterning in the data – the two go hand in hand, and are not the same as simply ‘milking’ data for significant results.


Hi Patrick, Thank you for commenting. I very much agree that meaningful hypotheses cannot be made without recognizing patterns in the data. That may be the best way to make a reasonable hypothesis. However, the same data that are used to create the hypothesis cannot be used to test that same hypothesis, and this is what preregistration makes explicit. It makes it clear to ourselves exactly what the hypothesis is before seeing the data, so that the data aren’t then used to subtly change/create a new hypothesis. If it does, fine, great! But that is hypothesis building, not hypothesis testing. That is exploratory work, not confirmatory work.



Generating real-world evidence at scale using advanced analytics

Advanced techniques for generating real-world evidence (RWE) help pharmaceutical companies deliver insights that transform outcomes for patients—and create significant value. McKinsey estimates that over the next three to five years, an average top-20 pharma company could unlock more than $300 million a year by adopting advanced RWE analytics across its value chain.


Opportunities to improve outcomes for patients are proliferating thanks to advances in analytical methods, coupled with growing access to rich RWE data sources. Pharma companies can collect hundreds of millions of records with near–real-time data and deploy them in use- and outcomes-based research and in risk-sharing agreements with stakeholders. Yet although the outlook is promising, many pharma companies still struggle to deploy advanced analytics (AA) effectively to generate RWE. In this article, we explore how to set up analytics teams for success, select use cases that add value, and build a technical platform to support AA applications—from early efforts to deployment at scale.

Create interdisciplinary RWE teams with AA expertise

Historically, RWE departments have employed professionals with expertise in biostatistics, health policy, clinical medicine, and epidemiology (including pharmaco-epidemiology). These experts have extracted value from real-world data (RWD) by using classical descriptive and statistical analytical methods. To add advanced analytics to the methodology toolbox, pharma companies must make two key organizational and cultural changes: creating teams with expertise in four disciplines and working to combine the strengths of their different analytical cultures.

The composition of teams

To deliver RWE analytics at scale, teams need experts from four distinct disciplines.

Clinical experts provide medical input on the design of use cases and the types of patients or treatments to be considered. By taking into account clinical-practice and patient access issues, they can also help interpret the patterns in data. When a project ends, they help the organization to understand the medical plausibility of the results.

Methods experts with a background in epidemiology and health outcomes research or in biostatistics ensure that the bar is sufficiently high for the analytical approach and the statistical veracity of individual causal findings. People with this profile have often played, and can continue to play, the translator role (see below), which is responsible for securing intercompany alignment on, for example, study protocols and the validation of the main outcomes of interest.

Translators act as intermediaries between business stakeholders and clinical and technical teams. They are responsible for converting business and medical requirements into directives for the technical team and for interpreting its output in a format that can be integrated into business strategies or processes. Translators may have clinical, epidemiological, or analytics backgrounds, but this is not a prerequisite. Communication and leadership skills, an agile mindset, and the ability to translate goals into analytics and insights into actions and impact are the prerequisites.

AA specialists assemble the technical data and build the algorithms for each use case. Statisticians design robust studies and help minimize biases. Data engineers create reusable data pipelines that combine multiple data sources for modeling. Machine learning (ML) specialists use the data to build high-performance predictive algorithms.

These four groups of experts work iteratively together to deliver use cases, much as software teams deliver projects via agile sprints. This approach enables teams to test early outputs with clinical and business stakeholders and to adjust course as needed.

Bridge different cultures in the analytics team

Pharma companies tend to draw analytics professionals from two historically separate cultures with different priorities and methodological groundings.

The culture of biostatisticians is based on explanatory modeling (terms printed in italics are defined in the glossary below): simulating RWD conditions to mimic randomized clinical-trial (RCT) settings, so that the team can extract the right causal relationships from the data. Explanatory modeling relies on techniques such as propensity scores to mimic RCT conditions and on confidence intervals and p-values to assess the robustness of findings and to avoid false discoveries.
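As a rough illustration of this culture, the sketch below uses scikit-learn on simulated data to estimate propensity scores and reweight an observational comparison. The column names, data-generating process, and model choices are assumptions made for the example, not a validated study design.

```python
# Minimal sketch of propensity-score (inverse-probability-of-treatment)
# weighting to mimic RCT-like balance in observational data. Data, column
# names, and model choices are illustrative, not a production recipe.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "comorbidity_score": rng.poisson(2, n),
})
# Treatment assignment depends on covariates (confounding by indication).
logit = -3 + 0.04 * df["age"] + 0.3 * df["comorbidity_score"]
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))
# Outcome with a true treatment effect of +2 plus covariate effects.
df["outcome"] = 2 * df["treated"] + 0.1 * df["age"] + rng.normal(0, 1, n)

# 1) Estimate propensity scores.
X = df[["age", "comorbidity_score"]]
ps = LogisticRegression(max_iter=1000).fit(X, df["treated"]).predict_proba(X)[:, 1]

# 2) Inverse-probability-of-treatment weights (stabilization, trimming, and
#    balance diagnostics such as standardized mean differences are omitted).
w = np.where(df["treated"] == 1, 1 / ps, 1 / (1 - ps))

# 3) Weighted difference in mean outcomes approximates the average treatment effect.
treated, control = df["treated"] == 1, df["treated"] == 0
ate = (np.average(df.loc[treated, "outcome"], weights=w[treated])
       - np.average(df.loc[control, "outcome"], weights=w[control]))
print(f"IPTW estimate of the treatment effect: {ate:.2f} (true effect: 2.0)")
```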

Glossary

Hypothesis generation: Applies analytical approaches to real-world evidence (RWE) to identify potential new avenues for research and follow-up confirmatory studies. Hypothesis generation often uses associative and correlational approaches that do not necessarily satisfy standard requirements of causal inference.

Hypothesis validation: Uses a set of analytical approaches to establish the causal confirmation (or rejection) of the proposed hypothesis. These approaches include experimental validation by means of, for instance, randomized clinical trials (RCTs) or formal causal inference frameworks. Hypothesis validation requires stringency in experimental design.

Explainable artificial intelligence (XAI): Aims to ensure that users can understand and interpret the output of an AI analysis. It uses techniques such as SHAP (a game-theoretic approach explaining the output of any machine-learning model) or statistical models (such as Bayesian networks) that can provide a detailed view of the drivers of treatment decisions. Such techniques can also expose differences between groups in the effect of a patient characteristic: for example, advanced age may reduce the risk of treatment switching among patients seen by generalist physicians but increase it for those seen by specialists. Most XAI techniques attempt to combine the predictive power of machine learning with the interpretability of explanatory models.

Explanatory modeling: Focuses on extracting specific, replicable, and easy-to-understand insights about heterogeneity—for example, by attributing a difference in response to a treatment effect or a subpopulation. If these insights are reproducible, they become integrated into scientific and clinical knowledge. Explanatory modeling relies on stringent methods of data collection, such as randomization of treatments, to permit causal interpretations of the results or, alternatively, it involves using bespoke statistical methods to stress-test each finding. Because of these constraints (and historical reasons), explanatory modeling traditionally focuses on the smaller data sets in most RCTs and relies on measures of confidence in individual findings (such as p-values or confidence/credible intervals) to avoid false discoveries. Modern advances in this field (for instance, in the flourishing area of causal inference) leverage the flexibility of machine learning while at the same time relaxing strict requirements to ascertain what companies can safely conclude from data sets collected even in less-than-ideal circumstances.

Predictive modeling: Uses flexible models that can capture heterogeneity in data and offer good predictions on novel examples. Prediction workflows protect against overfitting and chance findings by validating models on novel data sets (referred to as either test or validation data sets). That approach tends to work more satisfactorily when data are abundant, as they are in RWE. The reliance on larger sample sizes explains much of why advanced analytics/machine learning tools are increasingly the standard in real-world-data analysis. Models rely on highly nonlinear interactions between different features in data and offer predictions by averaging large numbers of variations of these interactions.

By contrast, the culture of data scientists and AA/ML practitioners focuses on predictive modeling: the use of flexible models that can capture heterogeneity in data and offer good predictions on novel examples. Predictive modeling relies on methods such as boosting and ensembling, which tend to work better in big data environments. These experts seek to maximize predictive performance, even (sometimes) at the expense of explanatory insights.
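A minimal sketch of this workflow, on synthetic data standing in for a large real-world cohort, is shown below: a boosted model is fitted and then judged only on records it has never seen, which is the predictive culture's main protection against chance findings.

```python
# Minimal sketch of the predictive-modeling workflow described above: a
# flexible boosted model validated on held-out data to guard against
# overfitting. The synthetic data stands in for a large RWD cohort.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.05,
                                       random_state=0)
model.fit(X_train, y_train)

# Report performance only on data the model never saw during training.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.3f}")
```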

This dichotomy, of course, is an artificial one, and many roles straddle these two cultures: for example, many data scientists have a background in biostatistics, and many epidemiologists and biostatisticians understand data engineering and machine learning workflows on large data sets. Combining these analytical cultures into one multidisciplinary team is critical to make the best use of large RWE data sets. In our experience, teams achieve the biggest impact when a biostatistician or epidemiologist is responsible for the statistical veracity of individual causal findings and an ML specialist for predictive accuracy and generalization. Both kinds of specialists must also work closely with clinical experts and data engineers to understand issues involving the quality of data or selection mechanisms. These issues include systematically underreported lab data or other sources of bias.

Practitioners from the two cultures sometimes have conflicting priorities, which RWE leaders must resolve. The best results are achieved by reasoning back from impact, not forward from methods. For a given use case, that means identifying what will generate the impact—for instance, supporting regulatory approval, improving the likelihood of publication, maximizing predictive performance, or optimizing interpretability. Then the team must decide which approach or combination of approaches best suits that goal.

In this context, it is important to distinguish between hypothesis generation and hypothesis validation. Generation focuses on identifying correlations or patterns that represent new avenues of research and can subsequently be confirmed by follow-up clinical trials or other means. Validation uses more conservative methods designed to avoid false discoveries and create new insights. The use of RWE in studies to generate hypotheses can greatly expand the research horizons of pharma companies by including larger and more richly described cohorts. RWE or other observational data can also be used to validate hypotheses, but under stricter conditions and sometimes in combination with RCT data. Both generation and validation can use a combination of AI/ML and biostatistics methods, but generation puts more emphasis on AI/ML, validation on biostatistics methods.

One pharma company built a tool that used insurance claims data to predict risk as accurately as possible at the level of individual patients. Clarity over this shared goal allowed the team to choose a suitable approach, work in an agile way, and avoid arguments about methodology. For example, when a biostatistician expressed a preference for a more explainable but lower-performance model, the team resolved the debate simply by referring to the jointly aligned goal.

Focus teams on scenarios in which AA adds value

Starting with the cases most likely to demonstrate the value of the analytical approach helps to ensure senior management’s support, which is critical to success in platform and process transformations. By focusing on the right cases, pharma companies can deepen their understanding of specific outcomes for patients, help physicians make better decisions, expand a drug’s use across approved indications, and advance scientific knowledge more broadly.

AA methods can help generate evidence in many areas of the therapeutic value chain (Exhibit 1), such as combining RCTs and RWD to optimize the design of trials in R&D. Leading pharma companies already see an impact in four significant areas: head-to-head drug comparisons, benefit and risk assessments for pharmacovigilance, drivers of treatment decisions, and support systems for clinical decisions.

Head-to-head drug comparisons

In these studies, companies analyze evidence to establish which patients respond best to a given therapy or which patient subgroups benefit more from a therapy than from its comparators. Head-to-head comparisons can yield valuable insights: for instance, older patients with a specific comorbidity may stand to benefit most from the treatment of interest. Such insights can help healthcare stakeholders make better decisions, inform physicians about treatments likely to deliver the best outcomes, and feed into the design of future clinical trials. Evidence from comparative analyses can also be submitted to regulators to support a treatment’s label expansion, without the need for new clinical trials. (See, for instance, “US FDA approves Ibrance (palbociclib) for the treatment of men with HR+, HER2- metastatic breast cancer,” Pfizer, April 4, 2019; and “MCC Rx marks first use of real-world data for FDA approval,” Clinical Oncology News, September 13, 2018.)

Advances in explanatory modeling allow companies to apply machine-learning techniques within an appropriate statistical framework for inferring causal effects from RWD. By combining predictive approaches with causal inference in head-to-head observational studies, companies can generate evidence on a wider range of patients than they could with classical methods and explore a fuller set of subpopulations that may experience better outcomes. The use of ML techniques can also improve confounder control and yield more accurate estimations of the effectiveness of treatments than classical methods can.
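One simple way to see how predictive models can be pointed at subpopulation questions is a "T-learner": fit separate outcome models for treated and comparator patients and contrast their predictions. The sketch below does this on simulated, randomized data; it illustrates the idea only, and omits the confounder adjustment and uncertainty quantification a real head-to-head observational study would need.

```python
# Minimal sketch of a T-learner, one common way to combine ML prediction with
# causal questions: fit separate outcome models for treated and untreated
# patients, then contrast their predictions to screen subgroups for larger
# apparent benefit. Synthetic, randomized data; confounding adjustment and
# uncertainty quantification are deliberately omitted.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 10_000
X = pd.DataFrame({"age": rng.normal(60, 12, n),
                  "comorbidity": rng.poisson(2, n)})
treated = rng.binomial(1, 0.5, n)  # randomized assignment keeps the example simple
# True benefit grows with age: older patients gain more from the therapy.
outcome = 0.05 * X["age"] + treated * (0.1 * (X["age"] - 60)) + rng.normal(0, 1, n)

m_treated = GradientBoostingRegressor().fit(X[treated == 1], outcome[treated == 1])
m_control = GradientBoostingRegressor().fit(X[treated == 0], outcome[treated == 0])

# Estimated individual-level effect: difference of the two models' predictions.
cate = m_treated.predict(X) - m_control.predict(X)
summary = X.assign(estimated_effect=cate)
print(summary.groupby(pd.cut(summary["age"], [0, 55, 65, 120]))
      ["estimated_effect"].mean())
```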

With advanced analytics, studies can also automatically scale head-to-head comparisons and repeat them across many clinical endpoints, therapies, and patient groups. These analyses can then be presented to the wider business via interactive dashboards or other digital assets not only to provide robust insights into outcomes, treatment differentiation, and unmet needs but also to help teams generate evidence for further validation at scale (Exhibit 2).

Benefit and risk assessments for pharmacovigilance

The use of machine-learning algorithms enables companies to screen real-world data continuously for a broad array of potential adverse events and to flag new safety risks (or signals) for a treatment. Once risks are detected, researchers can assess and weigh them against the benefits. Real-world data were used in this way to detect blood clots in a very small percentage of patients receiving the Oxford/AstraZeneca COVID-19 vaccine and in the subsequent benefit/risk assessment of the vaccine by the safety committee of the European Medicines Agency. (See “COVID-19 vaccine AstraZeneca: Benefits still outweigh the risks despite possible link to rare blood clots with low blood platelets,” European Medicines Agency, March 18, 2021.) The use of large real-time sources of data on patients enables researchers to capture otherwise unattainable signals for rare conditions and to gain novel insights into the risks and benefits of treatments and thus helps to improve pharmacovigilance processes.

Machine learning also enables companies to screen continuously for a broad array of potential adverse events and to integrate large unstructured data sets that would otherwise be hard to process automatically. For example, algorithms can not only annotate reports of clinical cases with information (such as symptoms, disease status, and severity) that specialists can interpret but also prioritize these reports by relevance. This makes the review of medical cases for potential signals more efficient and detection more rigorous. Finally, as with head-to-head drug comparisons, ML algorithms can generate evidence of causality that can improve the assessment of known safety risks across a greater range of patient profiles.
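The sketch below shows the shape of such an automated scan in its simplest form. It uses a classical disproportionality statistic (the reporting odds ratio) over invented drug-event report counts rather than the machine-learning pipelines described above; the drug names, event names, and counts are fabricated for illustration.

```python
# Simplified sketch of automated signal screening over drug-event report counts.
# A classical disproportionality statistic (the reporting odds ratio, ROR) is
# used here as a stand-in for the richer ML screening described in the text.
import numpy as np
import pandas as pd

# Toy spontaneous-report counts: rows are (drug, event, number of reports).
reports = pd.DataFrame([
    ("drug_A", "headache", 120), ("drug_A", "blood_clot", 30),
    ("drug_B", "headache", 200), ("drug_B", "blood_clot", 5),
    ("drug_C", "headache", 90),  ("drug_C", "blood_clot", 4),
], columns=["drug", "event", "n"])

def reporting_odds_ratio(df, drug, event):
    """ROR = (a/b) / (c/d) from the 2x2 table of drug-by-event report counts."""
    a = df.query("drug == @drug and event == @event")["n"].sum()
    b = df.query("drug == @drug and event != @event")["n"].sum()
    c = df.query("drug != @drug and event == @event")["n"].sum()
    d = df.query("drug != @drug and event != @event")["n"].sum()
    return (a / b) / (c / d) if min(a, b, c, d) > 0 else np.nan

# Scan every drug-event pair and flag elevated RORs for human review.
for drug in reports["drug"].unique():
    for event in reports["event"].unique():
        ror = reporting_odds_ratio(reports, drug, event)
        flag = "  <-- review" if ror and ror > 2 else ""
        print(f"{drug:7s} {event:11s} ROR={ror:5.2f}{flag}")
```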

To create an industrialized benefit/risk platform, algorithms can be embedded into robust data-engineering workflows, scaled up across a broad range of data sets, and processed on regular automated schedules (Exhibit 3). Such platforms give pharmacovigilance teams ready access to automated, standardized, and robust insights into a vast set of potential signals across the asset portfolio.

Drivers of treatment decisions

An analysis of RWD can shed light on what motivates patients to switch therapies, adjust the dose, or discontinue treatment and what makes physicians prescribe different therapies or doses. Patients receiving a certain therapy, for example, were found to be more likely to discontinue treatment if they had visited a specialist physician in the three months after starting it. Such valuable insights can be further validated by subjecting potential selection mechanisms to sensitivity analyses. The use of techniques such as inverse-probability-of-censoring weighting helps companies to understand whether patients who drop out of the observation window differ systematically from those who don’t. Physicians can then use these insights to help patients stay on a therapy and obtain the maximum benefit from it.
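A stripped-down, single-time-point version of that weighting idea looks like the following: model who remains under observation as a function of covariates, then up-weight the observed patients who resemble those lost to follow-up. Real analyses model censoring over time; the data and column names here are invented.

```python
# Minimal, single-time-point sketch of inverse-probability-of-censoring
# weighting (IPCW): (1) model who stays observable given covariates,
# (2) up-weight observed patients who resemble those lost to follow-up.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 8_000
df = pd.DataFrame({"age": rng.normal(55, 10, n),
                   "specialist_visit": rng.binomial(1, 0.4, n)})
# Patients who saw a specialist are more likely to drop out of observation
# and also less likely to stay on therapy in this toy setup.
p_observed = 1 / (1 + np.exp(-(1.5 - 1.2 * df["specialist_visit"])))
df["observed"] = rng.binomial(1, p_observed)
df["adherent"] = rng.binomial(1, 0.8 - 0.3 * df["specialist_visit"])

# Model the probability of remaining observed and form IPC weights.
X = df[["age", "specialist_visit"]]
p_hat = LogisticRegression(max_iter=1000).fit(X, df["observed"]).predict_proba(X)[:, 1]
df["ipcw"] = 1 / p_hat

obs = df[df["observed"] == 1]
naive = obs["adherent"].mean()
weighted = np.average(obs["adherent"], weights=obs["ipcw"])
print(f"Adherence, observed patients only: {naive:.3f}")
print(f"Adherence, IPCW-adjusted:          {weighted:.3f}")
```

Because the dropouts are systematically less adherent, the naive estimate is too optimistic; the weighted estimate moves back toward the full-population value.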

Predictive modeling can yield deep insights into the complex drivers of treatment decisions and can do so at a larger scale than classical RWE analytics: for instance, the combinations of patient characteristics that predict a treatment decision or the (possibly nonlinear) way an individual patient characteristic affects the likelihood of a decision. With classical techniques, the researcher must often manually encode such relationships into the algorithm. Explainable artificial intelligence (XAI) techniques, such as SHAP, or causal-inference methods, such as Bayesian networks, can provide a detailed view of the drivers of treatment decisions. They can also expose the way differences between groups affect a patient characteristic: for example, advanced age may reduce the risk of treatment switching among patients seen by generalist physicians but increase it for those seen by specialists.
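As a small illustration, the sketch below trains a model of an invented "treatment switch" decision and uses the open-source shap package (assumed to be installed) to rank the drivers. The features, the label, and the prescriber-by-age interaction are fabricated for the example.

```python
# Minimal sketch of using SHAP values to surface drivers of a modeled treatment
# decision. Assumes the open-source `shap` package is installed; the features
# and the switching label are invented for illustration.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 4_000
X = pd.DataFrame({
    "age": rng.normal(60, 12, n),
    "sees_specialist": rng.binomial(1, 0.4, n),
    "n_prior_therapies": rng.poisson(1.5, n),
})
# Invented decision rule: switching is driven by prior therapies, and age acts
# differently depending on prescriber type (the kind of interaction XAI can expose).
logit = (-1.5 + 0.6 * X["n_prior_therapies"]
         + 0.03 * (X["age"] - 60) * np.where(X["sees_specialist"] == 1, 1, -1))
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: mean absolute SHAP value per feature approximates its importance.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```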

Since companies can use the same predictive-modeling approach for many aspects of the treatment journey, models can be scaled up to scan for drivers across decision points and across the entire patient population for a particular indication. This approach helps generate detailed evidence on the potential drivers of decisions for various treatments, not just a company’s own therapies.

Support systems for clinical decisions

Decision support systems, which are increasingly common in clinical settings, seek to improve medical decisions by providing information on patients and relevant clinical data. These systems are often developed with large real-world data sets and are subject to close regulatory scrutiny. Typically, they rely on algorithms that can, for example, predict (as a function of thousands of patient-level characteristics) the risk that a patient will suffer heart failure during the week after visiting a physician. Such systems help to raise the standard of care, reduce its cost, and improve safety for patients.

Researchers have shown that the accuracy of the predictions clinical decision support systems generate can be improved dramatically by replacing classical statistical models with machine-learning techniques such as bespoke neural-network architectures and tree-based methods. In addition, XAI techniques can be employed (much as treatment decision drivers are) to help physicians interpret and use these tools effectively. So can algorithms designed to be just as interpretable as statistical models, but without sacrificing predictive performance. These algorithms include generalized additive models (GAMs) and their machine-learning counterpart (GA2M), which are gaining a foothold in healthcare.
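The sketch below shows the underlying idea with scikit-learn pieces only: each feature is expanded into a spline basis and a logistic model is fitted on top, so every feature's contribution to risk remains an additive curve that can be plotted and inspected. It illustrates the GAM idea on synthetic data, not the regulated decision-support tools described above; dedicated packages (for example, pyGAM or interpretml's EBM implementation of GA2M) go further.

```python
# Minimal GAM-style risk model built from scikit-learn pieces: a per-feature
# spline basis plus logistic regression keeps each feature's learned risk
# contribution an additive, plottable curve. Synthetic data for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

X, y = make_classification(n_samples=5_000, n_features=6, n_informative=4,
                           random_state=0)

gam_like = make_pipeline(
    SplineTransformer(n_knots=8, degree=3),   # per-feature spline basis
    LogisticRegression(max_iter=5_000, C=1.0),
)
gam_like.fit(X, y)

# Recover the additive shape function for one feature: its spline basis columns
# times their coefficients give that feature's contribution to the log-odds.
spline, logreg = gam_like.named_steps.values()
n_basis = spline.transform(X).shape[1] // X.shape[1]
j = 0
grid = np.linspace(X[:, j].min(), X[:, j].max(), 50)
X_grid = np.tile(X.mean(axis=0), (50, 1))
X_grid[:, j] = grid
basis = spline.transform(X_grid)[:, j * n_basis:(j + 1) * n_basis]
shape_j = basis @ logreg.coef_[0][j * n_basis:(j + 1) * n_basis]
print("Log-odds contribution of feature 0 across its range:")
print(np.round(shape_j[::10], 2))
```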

The predictions and explanations machine-learning algorithms generate can be presented to physicians in an application that supports their clinical decision making (Exhibit 4). To ensure ease of use, companies developing these applications involve user experience designers from an early stage to work closely and iteratively with physicians.

Build a technical platform to scale up safely

To progress from experimenting with AA to deploying scaled-up use cases across indications, therapies, and locations, RWE leaders must build a technical platform with five key components. Given the right data environment, tooling, processes, teams, and ethical principles, companies can develop analytical use cases at pace and extend them across their portfolios.

  • An integrated data environment should replace ad hoc data sets with an industrialized, connected data environment giving users access to a comprehensive collection of data sets through a central data lake. In such an environment, data catalogues are implemented in line with business priorities and a clear map of RWE data sources. Data sets are ingested and then enriched by repeatable data-engineering workflows: for example, regular workflows can be created to build feature representations of patients based on their medical histories, and multiple analytical applications can then use these representations. (A minimal sketch of such a repeatable feature-building step appears after this list.) Linked data sets become proprietary knowledge that anyone in the organization can use. By creating industrialized processes to manage and track frameworks for data access, metadata, data lineage, and compliance, companies help ensure that workflows are robust and repeatable and that all RWE analyses maintain the integrity and transparency of data.
  • Modern tooling replaces disparate and fragmented computing resources with enterprise-grade distributed scientific and computing tools. To avoid interference with everyday business operations, analytics are typically developed in a lab-like environment separate from other systems. This environment must be flexible and scalable to handle a number of different analytical approaches: for instance, the use of notebook workflows. Data engineers must be able to access scalable resources as and when needed to support data-processing workflows. Similarly, data scientists must have the flexibility to switch between common programming languages (such as Python and R) to use the modeling-software packages they need. Depending on their analytical approach, they may also need access to hardware designed to accelerate deep-learning workflows.
  • Machine-learning operations (MLOps) replace manual processes for deploying models with automated processes in a factory-style environment where analytics models can be run continuously and at scale. (This helps to reduce development times and to make models more stable.) Automated integration and deployment give users reliable access to analytics solutions. Meanwhile, to manage the performance of models, control risks, and prevent unforeseen detrimental impacts on operations, monitoring processes should regularly check the predictions that algorithms generate. A central platform team with skills in MLOps, software operations (DevOps), and platform design usually builds the factory environment and develops the processes.

  • Analytics product teams, which replace project teams, focus on mutually reinforcing use cases. They build reusable modular software (or data products) and assemble and configure these elements in line with business needs as they scale up a use case to a new therapeutic area or business unit: for example, a general data model and algorithm built with rich US data sets could be ported to another region with more limited data by selecting only the relevant components. This approach enables companies to scale up the generation of evidence across indications, therapies, and use cases and allows researchers to run hundreds of analyses across multiple patient outcomes and subpopulations.

Such a team is led by a product owner who defines the product’s success criteria and is accountable for realizing short- and midterm objectives. Its other members have profiles resembling those of RWE teams. Each product team develops and advances the methodology for its own use case while working with other teams to capture synergies between data products.

  • Ethical principles for data science and machine learning ensure that insights are extracted from RWD and other complex data sources safely. Care should be taken to avoid any risk of causing disproportionate harm to underprivileged or historically underrepresented groups. Risk management in AI is a complex and rapidly evolving field; regulators, professional bodies, and thought leaders have developed increasingly sophisticated positions on it.

AI risk management involves more than just the standard data ethics commitments, such as respect for privacy and the secure handling of sensitive data. The issues here include the explainability of models (so that nonspecialists can understand them) and fairness (so that they have comparable error rates or other criteria across, for example, gender or race). An ethically aware mindset recognizes that bias can be present in RWD because of historical disparities. It takes steps to detect and mitigate that bias by using clinical input and care when constructing cohorts and endpoints in RWE, as well as AA techniques that automatically profile the fairness of a predictive algorithm. These measures should be complemented by the use of privacy-respecting technologies, modern data governance  with clear accountability and ownership, and explainable methods (rather than black-box algorithms).
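As a concrete, if toy-sized, example of the repeatable feature-building workflows mentioned in the first bullet above, the sketch below turns event-level claims rows into one reusable feature row per patient. The column names and the in-memory table are illustrative; in production, such a step would run inside a scheduled, versioned pipeline against the central data lake.

```python
# Tiny sketch of a repeatable feature-building step: turn raw, event-level
# claims records into one reusable row of features per patient. Column names
# and the in-memory DataFrame are illustrative assumptions.
import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "service_date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-06-01",
         "2023-03-12", "2023-03-20", "2023-07-01"]),
    "diagnosis_code": ["E11", "I10", "E11", "J45", "J45", "I10"],
    "cost": [120.0, 80.0, 200.0, 60.0, 90.0, 150.0],
})

def build_patient_features(claims: pd.DataFrame) -> pd.DataFrame:
    """One patient per row: utilization, spend, and diagnosis-history features."""
    base = claims.groupby("patient_id").agg(
        n_claims=("diagnosis_code", "size"),
        total_cost=("cost", "sum"),
        first_seen=("service_date", "min"),
        last_seen=("service_date", "max"),
    )
    base["observation_days"] = (base["last_seen"] - base["first_seen"]).dt.days
    # One indicator column per diagnosis code seen in the patient's history.
    dx = pd.crosstab(claims["patient_id"], claims["diagnosis_code"]).add_prefix("dx_")
    return base.join(dx).drop(columns=["first_seen", "last_seen"])

print(build_patient_features(claims))
```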

When successful companies apply AA to real-world data, they create interdisciplinary teams, focus on use cases that demonstrate value, and build platforms to provide an integrated data environment. Scaled up in this way, advanced RWE analytics helps organizations to make decisions more objective and to shift the focus from products to patients. In this way, it supports the broader goal of delivering the right drug to the right patient at the right time.

Chris Anagnostopoulos is a senior expert and associate partner in McKinsey’s Athens office; David Champagne and Alex Devereson are partners in McKinsey’s London office, of which Thomas Huijskens is an alumnus and where Matej Macak is an associate partner.

The authors wish to thank Anas El Turabi and Lucy Pérez for their contributions to this article.

