What Is Generalizability In Research?

Moradeke Owa


Generalizability means ensuring that the conclusions and recommendations from your research apply beyond the specific population you studied. Think of it as a way to determine whether your findings hold for a larger group, not just the sample you examined.

In this guide, we explore research generalizability, factors that influence it, how to assess it, and the challenges that come with it.

So, let’s dive into the world of generalizability in research!

Defining Generalizability

Generalizability refers to the extent to which a study’s findings can be extrapolated to a larger population. It’s about making sure that your findings apply to a large number of people, rather than just a small group.

Generalizability makes research findings credible and reliable. If your results hold only for a small group, they may not be valid beyond it.

Generalizability also keeps your work relevant to as many people as possible. For example, if you tested a drug on only a small number of patients, prescribing it to all patients before you are confident it is safe for everyone could put people at risk.

Factors Influencing Generalizability

Here are some of the factors that determine whether your research can be generalized to a larger population or to different contexts:

1. Sample Selection and Size

The size of the group you study and how you choose its members affect how well your results apply to others. Asking one person out of a friendship group of 16 whether a game is fun doesn’t accurately represent the opinion of the whole group.

2. Research Methods and Design 

Different methods have different levels of generalizability. For example, if you only observe people in a particular city, your findings may not apply to other locations. But if you use multiple methods, you get a better idea of the big picture.

3. Population Characteristics

Not everyone is the same. People from different countries, different age groups, or different cultures may respond differently.  That’s why the characteristics of the people you’re looking at have a significant impact on the generalizability of the results.

4. Context and Environment 

Think of your research as a weather forecast. A forecast of sunny weather in one location may not be accurate in another. Context and environment play a role in how well your results translate to other environments or contexts.

Internal vs. External Validity

You can only generalize a study when it has high validity, but there are two types of validity: internal and external. Let’s look at the role each plays in generalizability:

1. Understanding Internal Validity

Internal validity is a measure of how well a study has ruled out alternative explanations for its findings. For example, if a study investigates the effects of a new drug on blood pressure, internal validity would be high if the study was designed to rule out other factors that could affect blood pressure, such as exercise, diet, and other medications.

2. Understanding External Validity

External validity is the extent to which a study’s findings can be generalized to other populations, settings, and times. It focuses on how well your study’s results apply to the real world.

For example, if a new blood pressure-lowering drug were studied in a laboratory with a sample of young, healthy adults, the study’s external validity would be limited. This is because the study doesn’t account for people outside that sample, such as older adults, patients with other medical conditions, and more.

3. The Relationship Between Internal and External Validity

Internal validity and external validity are often inversely related. This means that studies with high internal validity may have lower external validity, and vice versa.

For example, a study that randomly assigns participants to different treatment groups may have high internal validity, but it may have lower external validity if the participants are not representative of the population of interest.

Strategies for Enhancing Generalizability

Several strategies enable you to enhance the generalizability of your findings; here are some of them:

1. Random Sampling Techniques

This involves selecting participants from a population in a way that gives everyone an equal chance of being selected. This helps to ensure that the sample is representative of the population.

Let’s say you want to find out how people feel about a new policy. Randomly pick people from the list of people who registered to vote to ensure your sample is representative of the population.
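To make this concrete, here is a minimal sketch of simple random sampling in Python. The voter list and sample size are invented for illustration:

```python
import random

# Hypothetical sampling frame: a list of registered voters.
registered_voters = [f"voter_{i}" for i in range(1, 10_001)]

random.seed(7)  # fixed seed so the illustration is reproducible

# Simple random sampling without replacement: every voter has an
# equal chance of being selected.
sample = random.sample(registered_voters, k=500)

print(len(sample), sample[:3])
```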

2. Diverse Sample Selection

Choose samples that are representative of different age groups, genders, races, ethnicities, and economic backgrounds. This helps to ensure that the findings are generalizable to a wider range of people.

3. Careful Research Design

Meticulously design your studies to minimize the risk of bias and confounding variables. A confounding variable is a factor that makes it hard to tell the real cause of your results.

For example, suppose you are studying the effect of a new drug on cholesterol levels. Even if you take a random sample of participants and randomly assign them to receive either the new drug or a placebo, your results could be misleading if you don’t control for the participants’ diet: you could attribute changes in cholesterol to the drug when they are actually due to diet.

4. Robust Data Collection Methods

Use robust data collection methods to minimize the risk of errors and biases. This includes using well-validated measures and carefully training data collectors.

For instance, an online survey tool could be used to run repeated polls on how voters change their minds during an election cycle, rather than relying on phone interviews, which make it harder to get repeat respondents to participate and revisit their views over time.

Challenges to Generalizability

1. Sample Bias

Sample bias happens when the group you study doesn’t represent everyone you want to draw conclusions about. For example, if you’re researching ice cream preferences and only ask your friends, your results might not apply to everyone, because your friends are not the only people who eat ice cream.

2. Ethical Considerations

Ethical considerations can limit your research’s generalizability because some studies simply wouldn’t be right or fair to run. For example, it’s not ethical to test a new medicine on people without their consent.

3. Resource Constraints

A limited budget also restricts your research’s generalizability. For example, if you want to conduct a large-scale study but lack the resources, time, or personnel, you may opt for a small-scale study, which makes your findings less likely to apply to a larger population.

4. Limitations of Research Methods

Tools are just as much a part of your research as the research itself. If you use an ineffective tool, you might not be able to apply what you’ve learned to other situations.

Assessing Generalizability

Evaluating generalizability allows you to understand the implications of your findings and make realistic recommendations. Here are some of the most effective ways to assess generalizability:

Statistical Measures and Techniques

Several statistical tools and methods allow you to assess the generalizability of your study. Here are the top two:

  • Confidence Interval

A confidence interval is a range of values that is likely to contain the true population value. So if a researcher looks at a test and sees that the mean score is 78 with a 95% confidence interval of 70–80, they can be 95% confident that the true population mean lies between 70 and 80.
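As a quick illustration, the following Python sketch computes a 95% confidence interval for a sample mean with SciPy; the scores are simulated and purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulated test scores standing in for a real sample.
rng = np.random.default_rng(0)
scores = rng.normal(loc=78, scale=10, size=100)

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean

# 95% CI for the population mean, using the t distribution.
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```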

  • P-Value

The p-value indicates the likelihood of obtaining the study’s results, or more extreme ones, if the null hypothesis holds. A null hypothesis is the supposition that there is no association between the variables being analyzed.

A good example is a researcher surveying 1,000 college students to study the relationship between study habits and GPA. The researcher finds that students who study for more hours per week have higher GPAs. 

A p-value below 0.05 indicates a statistically significant association between study habits and GPA, meaning the study’s findings are unlikely to be due to chance.
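A hypothetical version of that analysis might look like the sketch below; the data are simulated with a built-in association between study hours and GPA, so the test should return a very small p-value:

```python
import numpy as np
from scipy import stats

# Simulate weekly study hours and GPA for 1,000 students,
# with a genuine underlying association plus noise.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 30, size=1000)
gpa = np.clip(2.0 + 0.05 * hours + rng.normal(0, 0.5, size=1000), 0, 4.0)

# Pearson correlation test of H0: no linear association.
r, p_value = stats.pearsonr(hours, gpa)
print(f"r = {r:.2f}, p = {p_value:.2g}")
if p_value < 0.05:
    print("statistically significant association at the 0.05 level")
```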

Peer Review and Expert Evaluation

Reviewers and experts can look at sample selection, study design, data collection, and analysis methods to spot areas for improvement. They can also look at the survey’s results to see if they’re reliable and if they match up with other studies.

Transparency in Reporting

Clearly and concisely report the survey design, sample selection, data collection methods, data analysis methods, and findings of the survey. This allows other researchers to assess the quality of the survey and to determine whether the results are generalizable.

The Balance Between Generalizability and Specificity

Generalizability refers to the degree to which the findings of a study can be applied to a larger population or context. Specificity, on the other hand, refers to the focus of a study on a particular population or context.

a. When Generalizability Matters Most

Generalizability comes into play when you want to make predictions about the world outside of your sample. For example, you want to look at the impact of a new viewing restrictions policy on the population as a whole.

b. Situations Where Specificity is Preferred

Specificity is important when researchers want to gain a deep understanding of a specific group or phenomenon in detail. For example, if a researcher wants to study the experiences of people with a rare disease.

Finding the Right Balance Between Generalizability and Specificity

The right balance between generalizability and specificity depends on the research question. 

Case 1: Specificity over Generalizability

Sometimes, you have to give up some generalizability to get more specific results. For example, if you are studying a rare genetic condition, you might not be able to get a sample that’s representative of the general population.

Case 2: Generalizability over Specificity

In other cases, you may need to sacrifice some specificity to achieve greater generalizability. For example, when studying the effects of a new drug, you need a sample that includes a wide range of people with different characteristics.

Keep in mind that generalizability and specificity are not mutually exclusive. You can design studies that are both generalizable and specific.

Real-World Examples

Here are a few real-world examples of studies that turned out to be generalizable, as well as some that are not:

1. Case Studies of Research with High Generalizability

We’ve been talking about how important a generalizable study is and how to tell if your research is generalizable. Let’s take a look at some studies that have achieved this:

a. The Framingham Heart Study  

This is a long-running study that has been tracking the health of over 15,000 participants since 1948. The study has provided valuable insights into the risk factors for heart disease, stroke, and other chronic diseases.

The findings of the Framingham Heart Study are highly generalizable because the study participants were recruited from a representative sample of the general population.

b. The Cochrane Database of Systematic Reviews  

This is a collection of systematic reviews that evaluate the evidence for the effectiveness of different healthcare interventions. The Cochrane Database of Systematic Reviews is a highly respected source of information for healthcare professionals and policymakers. 

The findings of Cochrane reviews are highly generalizable because they are based on a comprehensive review of all available evidence.

2. Case Studies of Research with Limited Generalizability

Let’s look at some studies whose findings would not generalize to the broader population:

  • A study that examines the effects of a new drug on a small sample of participants with a rare medical condition. The findings of this study would not be generalizable to the general population because the study participants were not representative of the general population.
  • A study that investigates the relationship between culture and values using a sample of participants from a single country. The findings of this study would not be generalizable to other countries because the study participants were not representative of people from other cultures.

Implications of Generalizability in Different Fields

Research generalizability has significant real-world effects. Here are some ways to leverage it across different fields:

1. Medicine and Healthcare

Generalizability is a key concept in medicine and healthcare. For example, a single study that found a new drug to be effective in treating a specific condition in a limited number of patients might not apply to all patients.

Healthcare professionals also leverage generalizability to create guidelines for clinical practice. For example, a guideline for the treatment of diabetes may not be generalizable to all patients with diabetes if it is based on research studies that only included patients with a particular type of diabetes or a particular level of severity.

2. Social Sciences

Generalizability allows you to make accurate inferences about the behavior and attitudes of large populations. People are influenced by multiple factors, including their culture, personality, and social environment.

For example, a study that finds that a particular educational intervention is effective in improving student achievement in one school may not be generalizable to all schools.

3. Business and Economics

Generalizability also allows companies to draw conclusions about how customers and competitors behave. Factors like economic conditions, consumer tastes, and tech trends can change quickly, so it’s hard to generalize results from one study to the next.

For example, a study that finds that a new marketing campaign is effective in increasing sales of a product in one region may not be generalizable to other regions. 

The Future of Generalizability in Research

Let’s take a look at new and future developments geared at improving the generalizability of research:

1. Evolving Research Methods and Technologies

The evolution of research methods and technologies is changing the way that we think about generalizability. In the past, researchers were often limited to studying small samples of people in specific settings. This made it difficult to generalize the findings to the larger population.

Today, you can use various new techniques and technologies to gather data from a larger and more varied sample size. For example, online surveys provide you with a large sample size in a very short period.

2. The Growing Emphasis on Reproducibility

The growing emphasis on reproducibility is also changing the way that we think about generalizability. Reproducibility is the ability to reproduce the results of a study by following the same methods and using a similar sample.

For example, suppose you publish a study claiming that a new drug is effective in treating a certain disease, and two other researchers then replicate the study and confirm the findings. This replication builds confidence in the original study’s findings and makes it more likely that the drug will be approved for use.

3. The Ongoing Debate on Generalizability vs. Precision

Generalizability refers to the ability to apply the findings of a study to a wider population. Precision refers to the ability to accurately measure a particular phenomenon.

For some researchers, generalizability matters more than precision because it means their findings apply to a larger number of people and have a real-world impact. For others, precision matters more than generalizability because it enables them to understand the underlying mechanisms of a phenomenon.

The debate over generalizability versus precision is likely to continue because both concepts are very important. However, it is important to note that the two concepts are not mutually exclusive. It is possible to achieve both generalizability and precision in research by using carefully designed methods and technologies.

Generalizability allows you to apply the findings of a study to a larger population. This is important for making informed decisions about policy and practice, identifying and addressing important social problems, and advancing scientific knowledge.

With more advanced tools such as online surveys, generalizable research is here to stay. Sign up with Formplus to seamlessly collect data from a global audience.


Extrapolation (Causal Inference)

Extrapolation is the process of estimating unknown values by extending or projecting from known data points. This technique is crucial in understanding how results observed in a specific sample or experimental setting might apply to a broader population or different contexts, which relates closely to issues of external validity and the generalizability of findings.


5 Must Know Facts For Your Next Test

  • Extrapolation can introduce errors if the relationship between variables changes outside the observed range of data, potentially leading to misleading conclusions (see the sketch after this list).
  • In machine learning for causal inference, extrapolation is often necessary when applying learned models to new datasets, but caution must be exercised to avoid overestimating the model's applicability.
  • External validity is fundamentally linked to extrapolation, as it assesses whether study results are applicable to settings or populations beyond those studied.
  • Understanding the limits of extrapolation is critical; for instance, applying results from a controlled environment directly to real-world situations can yield inaccurate predictions.
  • The validity of extrapolated conclusions heavily depends on the robustness of the underlying causal assumptions made during analysis.
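The first fact above is easy to demonstrate. In this sketch, a linear model fits a gently saturating (entirely hypothetical) dose-response curve well inside the observed range, but its extrapolations drift further from the truth the further they go beyond that range:

```python
import numpy as np

def true_response(x):
    # Hypothetical saturating relationship: near-linear for small x.
    return 10 * (1 - np.exp(-0.2 * x))

x_obs = np.linspace(0, 5, 50)          # observed range only
slope, intercept = np.polyfit(x_obs, true_response(x_obs), deg=1)

for x_new in (5, 10, 20):              # at and beyond the observed range
    linear = intercept + slope * x_new
    print(f"x = {x_new:>2}: linear prediction = {linear:5.2f}, "
          f"true value = {true_response(x_new):5.2f}")
# The error grows as x_new moves further outside the fitted range.
```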

Review Questions

  • Extrapolation can significantly impact the reliability of findings because it involves making predictions about unobserved data based on known values. If the underlying relationships remain stable across contexts, then extrapolated conclusions may hold true. However, if those relationships change or do not apply outside the studied sample, it can lead to erroneous interpretations and flawed decision-making. Thus, careful consideration of the context and assumptions is vital when relying on extrapolated results.
  • Extrapolating machine learning models poses several challenges, including overfitting and potential changes in underlying data distributions. When a model is overfit to training data, it may not perform well when applied to new datasets due to its lack of generalization. Strategies such as cross-validation, regularization techniques, and ensuring diverse training datasets can help improve model robustness and accuracy. Additionally, conducting sensitivity analyses can assess how variations in input affect output predictions, helping validate extrapolations.
  • External validity is inherently linked to extrapolation as it assesses whether research findings can be applied beyond the specific conditions of a study. If researchers fail to establish strong external validity, their ability to extrapolate results confidently to broader populations or different contexts becomes compromised. This impacts generalizability since findings that cannot be reliably extrapolated may misrepresent real-world scenarios or lead to ineffective interventions. Therefore, establishing external validity through careful study design and consideration of contextual factors is crucial for valid extrapolation.

Related terms

Generalization: The process of applying findings from a study sample to a larger population, which relies on the assumption that the sample accurately represents the population.

Overfitting: A modeling error that occurs when a machine learning model learns the details and noise in the training data to the extent that it negatively impacts its performance on new data.

Transferability: The extent to which findings from one context can be applied to another, often assessed in qualitative research settings.

" Extrapolation " also found in:

Subjects ( 34 ).

  • AP Statistics
  • Advanced quantitative methods
  • Algebra and Trigonometry
  • Approximation Theory
  • Blockchain and Cryptocurrency
  • Business Analytics
  • Business Valuation
  • College Algebra
  • College Introductory Statistics
  • Computational Mathematics
  • Contemporary Mathematics for Non-Math Majors
  • Forecasting
  • Honors Pre-Calculus
  • Honors Statistics
  • Intermediate Financial Accounting 2
  • Intro to Business Statistics
  • Introduction to Demographic Methods
  • Introduction to Econometrics
  • Introduction to Film Theory
  • Mathematical Biology
  • Mathematical Fluid Dynamics
  • Numerical Analysis I
  • Numerical Analysis for Data Science and Statistics
  • Numerical Solution of Differential Equations
  • Population and Society
  • Preparatory Statistics
  • Principles of Finance
  • Programming for Mathematical Applications
  • Screenwriting II
  • Thermodynamics I
  • Variational Analysis

© 2024 Fiveable Inc. All rights reserved.

Ap® and sat® are trademarks registered by the college board, which is not affiliated with, and does not endorse this website..

Study Design 101: Systematic Review


A document often written by a panel that provides a comprehensive review of all relevant studies on a particular clinical or health-related topic/question. The systematic review is created after reviewing and combining all the information from both published and unpublished studies (focusing on clinical trials of similar treatments) and then summarizing the findings.

Advantages

  • Exhaustive review of the current literature and other sources (unpublished studies, ongoing research)
  • Less costly to review prior studies than to create a new study
  • Less time required than conducting a new study
  • Results can be generalized and extrapolated into the general population more broadly than individual studies
  • More reliable and accurate than individual studies
  • Considered an evidence-based resource

Disadvantages

  • Very time-consuming
  • May not be easy to combine studies

Design pitfalls to look out for

Studies included in systematic reviews may be of varying study designs, but should collectively be studying the same outcome.

Is each study included in the review studying the same variables?

Some reviews may group and analyze studies by variables such as age and gender, factors that were not allocated to participants.

Do the analyses in the systematic review fit the variables being studied in the original studies?

Fictitious Example

Does the regular wearing of ultraviolet-blocking sunscreen prevent melanoma? An exhaustive literature search was conducted, resulting in 54 studies on sunscreen and melanoma. Each study was then evaluated to determine whether the study focused specifically on ultraviolet-blocking sunscreen and melanoma prevention; 30 of the 54 studies were retained. The thirty studies were reviewed and showed a strong positive relationship between daily wearing of sunscreen and a reduced diagnosis of melanoma.

Real-life Examples

Yang, J., Chen, J., Yang, M., Yu, S., Ying, L., Liu, G., ... Liang, F. (2018). Acupuncture for hypertension. The Cochrane Database of Systematic Reviews, 11 (11), CD008821. https://doi.org/10.1002/14651858.CD008821.pub2

This systematic review analyzed twenty-two randomized controlled trials to determine whether acupuncture is a safe and effective way to lower blood pressure in adults with primary hypertension. Due to the low quality of evidence in these studies and lack of blinding, it is not possible to link any short-term decrease in blood pressure to the use of acupuncture. Additional research is needed to determine if there is an effect due to acupuncture that lasts at least seven days.

Parker, H.W. and Vadiveloo, M.K. (2019). Diet quality of vegetarian diets compared with nonvegetarian diets: a systematic review. Nutrition Reviews, https://doi.org/10.1093/nutrit/nuy067

This systematic review was interested in comparing the diet quality of vegetarian and non-vegetarian diets. Twelve studies were included. Vegetarians more closely met recommendations for total fruit, whole grains, seafood and plant protein, and sodium intake. In nine of the twelve studies, vegetarians had higher overall diet quality compared to non-vegetarians. These findings may explain better health outcomes in vegetarians, but additional research is needed to remove any possible confounding variables.

Related Terms

Cochrane Database of Systematic Reviews

A highly-regarded database of systematic reviews prepared by The Cochrane Collaboration, an international group of individuals and institutions who review and analyze the published literature.

Exclusion Criteria

The set of conditions that characterize some individuals and result in their exclusion from the study (i.e., other health conditions, taking specific medications, etc.). Since systematic reviews seek to include all relevant studies, exclusion criteria are not generally utilized in this situation.

Inclusion Criteria

The set of conditions that studies must meet to be included in the review (or for individual studies - the set of conditions that participants must meet to be included in the study; often comprises age, gender, disease type and status, etc.).

Now test yourself!

1. Systematic Reviews are similar to Meta-Analyses, except they do not include a statistical analysis quantitatively combining all the studies.

a) True b) False

2. The panels writing Systematic Reviews may include which of the following publication types in their review?

a) Published studies b) Unpublished studies c) Cohort studies d) Randomized Controlled Trials e) All of the above


Extrapolate findings

An evaluation usually involves some level of generalising of the findings to other times, places or groups of people. 

For many evaluations, this simply involves generalising from data about the current situation or the recent past to the future.

For example, an evaluation might report that a practice or program has been working well (finding), therefore it is likely to work well in the future (generalisation), and therefore we should continue to do it (recommendation). In this case, it is important to understand whether or not future times are likely to be similar to the time period of the evaluation.  If the program had been successful because of support from another organisation, and this support was not going to continue, then it would not be correct to assume that the program would continue to succeed in the future.

For some evaluations, there are other types of generalising needed.  Impact evaluations which aim to learn from the evaluation of a pilot to make recommendations about scaling up must be clear about the situations and people to whom results can be generalised. 

There are often two levels of generalisation.  For example, an evaluation of a new nutrition program in Ghana collected data from a random sample of villages. This allowed statistical generalisation to the larger population of villages in Ghana.  In addition, because there was international interest in the nutrition program, many organisations, including governments in other countries, were interested to learn from the evaluation for possible implementation elsewhere.

Analytical generalisation involves making projections about the likely transferability of findings from an evaluation, based on a theoretical analysis of the factors producing outcomes and the effect of context.

Statistical generalisation involves statistically calculating the likely parameters of a population using data from a random sample of that population.
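As a rough sketch of statistical generalisation in the spirit of the village example, the snippet below estimates a population proportion from a simple random sample, with a normal-approximation 95% confidence interval. All counts are hypothetical:

```python
import numpy as np

n_sampled = 120   # villages in the random sample (invented)
n_success = 78    # sampled villages meeting a programme target (invented)

p_hat = n_success / n_sampled
se = np.sqrt(p_hat * (1 - p_hat) / n_sampled)  # normal-approximation SE
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimated proportion = {p_hat:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```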

Horizontal evaluation is an approach that combines self-assessment by local participants and external review by peers.

Positive deviance (PD), a behavioural and social change approach, involves learning from those who find unique and successful solutions to problems despite facing the same challenges, constraints and resource deprivation as others.

Realist evaluation aims to identify the underlying generative causal mechanisms that explain how outcomes were caused and how context influences these.

This blog post and its associated replies, written by Jed Friedman for the World Bank, describes a process of using analytic methods to overcome some of the assumptions that must be made when extrapolating results from evaluations to other settings.


Extrapolation beyond the end of trials to estimate long term survival and cost effectiveness


Key messages

Extrapolation beyond time periods studied in clinical trials is usually necessary to estimate long term effects of treatments

Many statistical survival models can be used to extrapolate data, but these can have widely varying results, which affects estimated clinical effectiveness and cost effectiveness

The choice of survival model and credibility of the extrapolations should be inspected carefully when making policy decisions that inform the allocation of healthcare resources

This paper explains the importance of extrapolating beyond the end of trials to estimate the long term benefits associated with new treatments, why this is done, and the limitations of various approaches.
Introduction

Policy makers worldwide use economic evaluation to inform decisions when allocating limited healthcare resources. A critical part of this evaluation involves accurately estimating long term effects of treatments. Yet, evidence is usually from clinical trials of short duration. Rarely do all participants encounter the clinical event of interest by the trial’s end. When people might benefit from a long term treatment, health technology assessment agencies recommend that economic evaluations extrapolate beyond the trial period to estimate lifetime benefits. 1 2 This kind of evaluation is common for people with cancer, when effective treatments delay disease progression and improve survival.

Use of survival modelling: rationale

To make funding decisions, health technology assessment agencies rely on accurate estimates of the benefits and costs of new treatments compared with existing treatments. For treatments that improve survival, accurate estimates of survival benefits are crucial. Policy makers use estimates of mean (average) survival rather than median survival, taking into account the probability of death over a lifetime across all patients with the disease. This mean is represented by the area under survival curves that plot the proportion of patients alive over time by treatment.

In figure 1 , the purple area represents a mean survival benefit associated with an experimental compared with a control treatment, but this benefit is a restricted mean, limited to the trial period. The curves separate early, and remain separated at the end of the trial, so it is reasonable to expect that benefits would continue to accrue beyond the trial’s end. The orange smooth curves represent survival models fitted to the trial data and extrapolated beyond the trial. The area between the orange curves estimates the mean lifetime survival benefit associated with the experimental treatment. This area is much larger than the purple area, and is relevant for economic evaluation.

Figure 1: Survival modelling to extrapolate beyond the trial—mean survival restricted to the trial period, and extrapolated

Description of survival models

Survival models extrapolate beyond the trial. They typically have a parametric specification, which means that they rely on an assumed distribution of probabilities of, for example, death over time, which is defined by a set of parameters such as shape and scale. The chosen parametric model is fitted to the observed trial survival data, and values estimated for each parameter. The model is then used to generate survival probabilities beyond the trial period to predict what would have happened had the trial continued until everyone died.
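As a simplified illustration of this idea (a sketch, not the procedure used in any particular appraisal), the code below fits the simplest parametric model, an exponential, to simulated right-censored trial data, then contrasts the restricted mean survival within the trial window with the extrapolated lifetime mean:

```python
import numpy as np

# Simulated trial: exponential survival times, administratively censored
# at the end of a 3-year trial. All numbers are invented.
rng = np.random.default_rng(42)
t = rng.exponential(scale=1 / 0.25, size=200)   # true hazard 0.25/year
trial_end = 3.0
events = t <= trial_end                          # True = death observed
durations = np.minimum(t, trial_end)             # follow-up time

# Exponential MLE with right-censoring has a closed form:
# rate = number of events / total follow-up time.
rate_hat = events.sum() / durations.sum()

# Area under S(t) = exp(-rate * t): restricted to the trial window
# versus extrapolated over a lifetime (to infinity).
restricted_mean = (1 - np.exp(-rate_hat * trial_end)) / rate_hat
lifetime_mean = 1 / rate_hat

print(f"hazard rate: {rate_hat:.3f}/year")
print(f"restricted mean survival (0-{trial_end:g} y): {restricted_mean:.2f} y")
print(f"extrapolated lifetime mean survival: {lifetime_mean:.2f} y")
```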

In health technology assessments, the set of standard models typically includes: exponential, Weibull, Gompertz, log-logistic, log-normal, and generalised gamma models. 3 Each survival model involves different assumptions about the shape of the hazard function—that is, the risk over time of the event of interest, which is usually death. Figure 2 shows the hazard function shapes assumed when using standard parametric models; over time these can stay the same, increase, decrease, or have one turning point (that is, the hazard increases then decreases, or decreases then increases).

Figure 2: Survival modelling to extrapolate beyond the trial—hazard shapes associated with standard parametric survival models
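For intuition about these hazard shapes, the sketch below evaluates the Weibull hazard function, whose shape parameter alone switches the hazard between decreasing, constant, and increasing; a turning point requires other families, such as the log-logistic or log-normal:

```python
import numpy as np

def weibull_hazard(t, lam, rho):
    # Weibull hazard: h(t) = (rho / lam) * (t / lam)**(rho - 1)
    return (rho / lam) * (t / lam) ** (rho - 1)

t = np.linspace(0.1, 5, 5)
for rho, label in [(0.5, "decreasing"),
                   (1.0, "constant (exponential)"),
                   (2.0, "increasing")]:
    print(f"{label:<24}", np.round(weibull_hazard(t, lam=2.0, rho=rho), 3))
```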

Selecting a model

Extrapolating survival curves predicts the unknown. No one can know which models most accurately predict survival—although it might be possible to determine which models produce extrapolations that are plausible. Different models often result in substantially different estimates of survival and cost effectiveness. 4 Figure 3 shows a range of survival models fitted to the same data. While all the parametric models seem to fit the observed trial data well, they predict large differences in longer term and mean survival. The more immature the trial data, the more likely the long term predictions will differ. Model choice affects estimated treatment benefits and, consequently, cost effectiveness.

Figure 3: Survival modelling to extrapolate beyond the trial—a variety of standard parametric models fitted to the same data

To choose clinically plausible survival models, modellers must assess fit to the trial data, but also, crucially, assess the credibility of the extrapolations. 4 5 This approach involves considering external data sources with longer term data such as other trials, disease registries, and general population mortality rates. Biological plausibility, pharmacological mechanisms, and clinical opinion should also be considered. Although identifying a single best model might not be possible, this approach ensures that policy makers use credible models.

Limitations of standard survival models

Standard parametric survival models have limitations. They might rely on hazard functions with implausible shapes ( figure 2 ), and might neither fit the data well nor provide credible extrapolations. As illustrated in figure 3 , the implications of choosing the wrong survival model are serious, because the choice of model affects survival predictions. Figure 4 illustrates a hypothetical hazard function of death from a cancer. No standard parametric models could capture the shape of this function, although more complex survival models can, such as flexible parametric models, fractional polynomials, piecewise models, or mixture cure models.

Figure 4: Survival modelling to extrapolate beyond the trial—a hypothesised, realistic hazard function

Flexible parametric models (such as restricted cubic spline models) segment the survival curve into portions, using knots to model hazard functions that have many turning points. 6 However, flexible parametric models will not generate turning points beyond the period of observed trial data unless modellers use external information, which they rarely do, such as longer term hazard rates from registry data. Indeed, while flexible parametric models are likely to fit the data well, beyond the data they reduce to standard Weibull, log-normal, or log-logistic models (therefore assuming that a transformation of the survival function is a linear function of log-time), and might generate implausible extrapolations. In figure 4 , if the trial were short and ended in the period where the hazard function is rising, a flexible parametric model would extrapolate that rising hazard, based on the observed segment of data.

An alternative option is to use fractional polynomials to model a hazard function with a complex shape, placing no restrictions on the hazard and survival functions beyond the period of observed data. However, while these models might fit the observed data well, the lack of restrictions on the extrapolation can lead to implausible predictions. 7 Other options include piecewise models, where separate survival models are fitted to defined portions of the observed survival data using cut-off points. The extrapolation is based on the model fitted to the final observed period. Piecewise models can be sensitive to the choice of cut-off points, and lead to extrapolations based on the last portion of data where numbers of trial participants and numbers of deaths among these participants are often low. 8 Generalised additive models and dynamic survival models have recently been suggested as potentially valuable novel approaches for modelling and extrapolating survival data. 7

Mixture cure models can capture complex hazard functions because they predict survival separately for cured and uncured patients, 9 and estimate a cure fraction—that is, the proportion of patients who would be cured. Predicting survival for cured and uncured patients separately could result in a model that generates credible extrapolations. However, a key issue that is difficult—or perhaps impossible—is to estimate a cure fraction reliably based on short term data. When the cure fraction is estimated inaccurately, cure models can result in poor survival predictions.

Extrapolation in practice

Decision makers, such as those on committees of the National Institute for Health and Care Excellence (NICE), discuss, document, and assess the approaches that pharmaceutical companies use to predict long term survival. Often the approach has a large impact on cost effectiveness estimates ( box 1 ). Typically, NICE reviews appraisals three years after the initial recommendation, and some drugs are placed in the Cancer Drugs Fund, providing an opportunity for checking extrapolations once longer term data are available, often from the key trial. However, while drugs in the Cancer Drugs Fund undergo rigorous reappraisal, other reviews are rarely done comprehensively, leaving extrapolations unchecked.

Impact of survival modelling in technology appraisals by the National Institute for Health and Care Excellence (NICE)

When NICE appraised pembrolizumab for untreated, advanced oesophageal and gastro-oesophageal junction cancer, the appraisal committee identified four approaches to survival modelling that it considered to be credible. 10 These approaches were a log-logistic piecewise model, a log-logistic piecewise model incorporating an assumed waning of the treatment effect over time, a log-logistic model not fitted using a piecewise approach, and a generalised gamma piecewise model. The incremental gains in quality adjusted life years (QALYs) associated with pembrolizumab ranged from 0.50 to 1.07 QALYs per person over a lifetime, with the estimated cost per incremental QALY doubling between the most and least optimistic analysis. 11

When NICE appraised tisagenlecleucel (a chimeric antigen receptor T cell treatment) for relapsed or refractory, diffuse, large B cell, acute lymphoblastic leukaemia, the committee acknowledged that survival was a key uncertainty, considered cure possible, and discussed several mixture cure models. Cure fractions varied by 35 percentage points depending on the model, with cost effectiveness estimates that varied from potentially acceptable to unacceptable. 12 The committee accepted using a mixture cure model based on clinical experts suggesting that some patients could be cured. However, the committee preferred a model that estimated a lower cure fraction than that estimated by the manufacturer’s preferred model, because the manufacturer’s model predicted a cure fraction that was higher than the proportion of patients who remained event-free in the tisagenlecleucel trials. Tisagenlecleucel was recommended for use in the Cancer Drugs Fund to allow the trial to accrue more data on overall survival before making a final decision on its routine use in the NHS. 12

Conclusions

When treatments make people live longer, it is important to extrapolate beyond the end of clinical trials to estimate mean survival gains and cost effectiveness over a period longer than the trial. Several survival models are available, and these result in widely varying estimates. To choose a model, researchers should consider a model’s fit to the observed trial survival data, and the credibility of predictions beyond the trial. More complex models could, but do not necessarily, result in better extrapolations. To inform decision making, survival models must be scrutinised while considering a range of plausible models and their impact on cost effectiveness. Analysts should follow recommended processes, report analyses clearly, justify chosen models by describing why and how the models have been selected, detail how well models fit the observed data, and describe what the models predict about hazards and survival. 4 8 This approach provides decision makers with the reassurance needed to make decisions when allocating healthcare resources.


Participants in research: Routine extrapolation of randomised controlled trials is absurd

  • Bruce G Charlton, reader in evolutionary psychiatry ([email protected])
  • School of Biology, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU

EDITOR—For more than a decade it has been an article of faith in evidence based medicine that randomised controlled trials are “best evidence” and their findings can routinely be extrapolated to clinical situations. 1 In his editorial Sackett, the founder of evidence based medicine, seeks retrospectively to reassure clinicians that this practice was justifiable, but the accompanying study by Vist et …


Extrapolating baseline trend in single-case data: Problems and tentative solutions

  • Published: 27 November 2018
  • Volume 51, pages 2847–2869 (2019)


  • Rumen Manolov (ORCID: 0000-0002-9387-1926), Antonio Solanas & Vicenta Sierra


Single-case data often contain trends. Accordingly, to account for baseline trend, several data-analytical techniques extrapolate it into the subsequent intervention phase. Such extrapolation led to forecasts that were smaller than the minimal possible value in 40% of the studies published in 2015 that we reviewed. To avoid impossible predicted values, we propose extrapolating a damping trend, when necessary. Furthermore, we propose a criterion for determining whether extrapolation is warranted and, if so, how far out it is justified to extrapolate a baseline trend. This criterion is based on the baseline phase length and the goodness of fit of the trend line to the data. These proposals were implemented in a modified version of an analytical technique called Mean phase difference. We used both real and generated data to illustrate how unjustified extrapolations may lead to inappropriate quantifications of effect, whereas our proposals help avoid these issues. The new techniques are implemented in a user-friendly website via the Shiny application, offering both graphical and numerical information. Finally, we point to an alternative not requiring either trend line fitting or extrapolation.
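The out-of-bounds problem described above is easy to reproduce. In this sketch (with invented count data and a plain OLS trend line, not the authors’ MPD implementation), a decreasing baseline trend extrapolated into the intervention phase soon predicts negative counts; the final line applies a crude clip-at-the-floor guard that merely stands in for the damping approach proposed in the article:

```python
import numpy as np

# Hypothetical baseline (phase A): counts of a problem behaviour per
# session, which cannot go below 0.
baseline = np.array([9, 8, 8, 6, 5, 4])
sessions_a = np.arange(1, len(baseline) + 1)

# OLS trend line fitted to the baseline phase.
slope, intercept = np.polyfit(sessions_a, baseline, deg=1)

# Naive extrapolation into a 6-session intervention phase (phase B).
sessions_b = np.arange(len(baseline) + 1, len(baseline) + 7)
predicted = intercept + slope * sessions_b
print(predicted)              # the later forecasts fall below 0

# Crude guard: clip at the known floor (a stand-in, not the authors' damping).
print(np.clip(predicted, 0, None))
```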


Several features of single-case experimental design (SCED) data have been mentioned as potential reasons for the difficulty of analyzing such data quantitatively, for the lack of consensus regarding the most appropriate statistical analyses, and for the continued use of visual analysis (Campbell & Herzinger, 2010 ; Kratochwill, Levin, Horner, & Swoboda, 2014 ; Parker, Cryer, & Byrns, 2006 ; Smith, 2012 ). Some of the data features that have received the most attention are serial dependence (Matyas & Greenwood, 1997 ; Shadish, Rindskopf, Hedges, & Sullivan, 2013 ), the common use of counts or other outcome measures that are not continuous or normally distributed (Pustejovsky, 2015 ; Sullivan, Shadish, & Steiner, 2015 ), the shortness of the data series (Arnau & Bono, 1998 ; Huitema, McKean, & McKnight, 1999 ), and the presence of trends (Mercer & Sterling, 2012 ; Parker et al., 2006 ; Solomon, 2014 ). In the present article we focus on trends. The reason for this focus is that trend is a data feature whose presence, if not taken into account, can invalidate conclusions regarding an intervention’s effectiveness (Parker et al., 2006 ). Even when there is an intention to take the trend into account, several challenges arise. First, linear trend has been defined in several ways in the context of SCED data (Manolov, 2018 ). Second, there has been recent emphasis on the need to consider nonlinear trends (Shadish, Rindskopf, & Boyajian, 2016 ; Swan & Pustejovsky, 2018 ; Verboon & Peters, 2018 ). Third, some techniques for controlling trend may provide insufficient control (see Tarlow, 2017 , regarding Tau-U by Parker, Vannest, Davis, & Sauber, 2011 ), leading applied researchers to think that their results represent an intervention effect beyond baseline trend, which may not be justified. Fourth, other techniques may extrapolate baseline trend regardless of the degree to which the trend line is a good representation of the baseline data, and despite the possibility of impossible values being predicted (see Parker et al.’s, 2011 , comments on the regression model by Allison & Gorman, 1993 ). The latter two challenges compromise the interpretation of results.

Aim, focus, and organization of the article

The aim of the present article is to provide further discussion on four issues related to baseline trend extrapolation, based on the comments by Parker et al. ( 2011 ). As part of this discussion, we propose tentative solutions to the issues identified. Moreover, we specifically aim to improve one analytical procedure, which extrapolates baseline trend and compares this extrapolation to the actual intervention-phase data: the mean phase difference (MPD; Manolov & Solanas, 2013 ; see also the modification and extension in Manolov & Rochat, 2015 ).

Most single-case data-analytical techniques focus on linear trend, although there are certain exceptions. One exception is a regression-based analysis (Swaminathan, Rogers, Horner, Sugai, & Smolkowski, 2014 ), for which the possibility of modeling quadratic trend has been discussed explicitly. Another is Tau-U, developed by Parker et al. ( 2011 ), which deals more broadly with monotonic (not necessarily linear) trends. We stick here to linear trends and their extrapolation, a decision that reflects Chatfield’s ( 2000 ) statement that relatively simple forecasting methods are preferred, because they are potentially more easily understood. Moreover, this focus is well aligned with our willingness to improve the MPD, a procedure for fitting a linear trend line to baseline data. Despite this focus, three of the four issues identified by Parker et al. ( 2011 ), and the corresponding solutions we propose, are also applicable to nonlinear trends.

Organization

In the following sections, first we mention procedures that include extrapolating the trend line fitted in the baseline, and distinguish them from procedures that account for baseline trend but do not extrapolate it. Second, we perform a review of published research in order to explore how frequently trend extrapolation leads to out-of-bounds predicted values for the outcome variable. Third, we deal separately with the four main issues of extrapolating a baseline trend, as identified by Parker et al. ( 2011 ), and we offer tentative solutions to these issues. Fourth, on the basis of the proposals from the previous two points, we propose a modification of the MPD. In the same section, we also provide examples, based on previously published data, of the extent to which our modification helps avoid misleading results. Fifth, we include a small proof-of-concept simulation study.

Analytical techniques that entail extrapolating baseline trend

Visual analysis

When discussing how visual analysis should be carried out, Kratochwill et al. ( 2010 ) stated that “[t]he six visual analysis features are used collectively to compare the observed and projected patterns for each phase with the actual pattern observed after manipulation of the independent variable” (p. 18). Moreover, the conservative dual criteria for carrying out structured visual analysis (Fisher, Kelley, & Lomas, 2003 ) entail extrapolating split-middle trend in addition to extrapolating mean level. This procedure has received considerable attention recently as a means of improving decision accuracy (Stewart, Carr, Brandt, & McHenry, 2007 ; Wolfe & Slocum, 2015 ; Young & Daly, 2016 ).

Regression-based analyses

Among the procedures based on regression analysis, the last treatment day procedure (White, Rusch, Kazdin, & Hartmann, 1989 ) entails fitting ordinary least squares (OLS) trend lines to the baseline and intervention phases separately, and comparison between the two is performed for the last intervention phase measurement occasion. In the Allison and Gorman ( 1993 ) regression model, baseline trend is extrapolated before it is removed from both the A and B phases’ data. Apart from OLS regression, the generalized least squares proposal by Swaminathan et al. ( 2014 ) fits trend lines separately to the A and B phases, but baseline trend is still extrapolated for carrying out the comparisons. The overall effect size described by the authors entails comparing the treatment data as estimated from the treatment-phase trend line to the treatment data as estimated from the baseline-phase trend line.

Apart from the procedures based on the general linear model (assuming normal errors), generalized linear models (Fox, 2016 ) need to be mentioned as well in the present subsection. Such models can deal with count data, which are ubiquitous in single-case research (Pustejovsky, 2018a ), specifying a Poisson model (rather than a normal one) for the conditional distribution of the response variable (Shadish, Kyse, & Rindskopf, 2013 ). Other useful models are based on the binomial distribution, specifying a logistic model (Shadish et al., 2016 ), when the data are proportions that have a natural floor (0) and ceiling (100). Despite dealing with certain issues arising from single-case data, these models are not flawless. Note that a Poisson model may present limitations when the data are more variable than expected (i.e., alternative models have been proposed for overdispersed count data; Fox, 2016 ), whereas a logistic model may present the difficulty of not knowing the floor or ceiling (i.e., the upper asymptote) or of forcing artificial limits. Finally, what is most relevant to the topic of the present text is that none of these generalized linear models necessarily includes an extrapolation of baseline trend. Actually, some of them (Rindskopf & Ferron, 2014 ; Verboon & Peters, 2018 ) consider the baseline data together with the intervention-phase data in order to detect when the greatest change is produced. Other models (Shadish, Kyse, & Rindskopf, 2013 ) include an interaction term between the dummy phase variable and the time variable, making possible the estimation of change in slope.

Nonregression procedures

MPD involves estimating baseline trend and extrapolating it into the intervention phase in order to compare the predictions with the actual intervention-phase data. Another nonregression procedure, Slope and level change (SLC; Solanas, Manolov, & Onghena, 2010 ), involves estimating baseline trend and removing it from the whole series before quantifying the change in slope and the net change in level (hence, SLC). In one of the steps of the SLC, baseline trend is removed from the n_A baseline measurements and the n_B intervention-phase measurements by subtracting from each value (y_i) the slope estimate (b_1), multiplied by the measurement occasion (i). Formally, ỹ_i = y_i − i × b_1, for i = 1, 2, …, (n_A + n_B). This step does resemble extrapolating baseline trend, but there is no estimation of the intercept of the baseline trend line, and thus a trend line is not fitted to the baseline data and then extrapolated, which would lead to obtaining residuals as in Allison and Gorman’s ( 1993 ) model. Therefore, we consider that it is more accurate to conceptualize this step as removing baseline trend from the intervention-phase trend for the purpose of comparison.
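A minimal sketch of this detrending step, with invented data and an assumed baseline slope estimate b1 (not the estimator defined in the article):

```python
import numpy as np

# Phase A (baseline, n_A = 5) followed by phase B (intervention, n_B = 5).
y = np.array([3, 4, 5, 6, 7,
              9, 10, 12, 13, 14])
i = np.arange(1, len(y) + 1)   # measurement occasions 1..(n_A + n_B)
b1 = 1.0                       # baseline slope estimate, assumed given

# y~_i = y_i - i * b1, applied across both phases.
y_detrended = y - i * b1
print(y_detrended)   # the remaining phase difference reflects level change
```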

Nonoverlap indices

Among the nonoverlap indices, the percentage of data points exceeding the median trend (Wolery, Busick, Reichow, & Barton, 2010) involves fitting a split-middle (i.e., bi-split) trend line and extrapolating it into the subsequent phase. In contrast, Tau-U (Parker et al., 2011) only takes into account the number of baseline measurements that improve on previous baseline measurements, and this number is subtracted from the number of intervention-phase values that improve on the baseline-phase values. Therefore, no intercept or slope is estimated, and no trend line is fitted or extrapolated, either. The way in which trend is controlled for in Tau-U cannot be described as trend extrapolation in a strict sense.

Two other nonoverlap indices also entail baseline trend control. According to the "additional output" calculated at http://ktarlow.com/stats/tau/ , the baseline-corrected Tau (Tarlow, 2017) removes baseline trend from the data using the expression \( \tilde{y}_i = y_i - i \times b_{1(TS)};\; i = 1, 2, \dots, (n_A + n_B) \), where \(b_{1(TS)}\) is the Theil–Sen estimate of slope. In the percentage of nonoverlapping corrected data (Manolov & Solanas, 2009), baseline trend is eliminated from the \(n\) values via the same expression as for baseline-corrected Tau, \( \tilde{y}_i = y_i - i \times b_{1(D)};\; i = 1, 2, \dots, (n_A + n_B) \), but slope is estimated via \(b_{1(D)}\) (see Appendix B) instead of via \(b_{1(TS)}\). Therefore, as we discussed above for SLC, there is actually no trend extrapolation in the baseline-corrected Tau or the percentage of nonoverlapping corrected data.
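As an illustration, the following R sketch (with made-up data) computes the Theil–Sen baseline slope as the median of all pairwise slopes and applies the correction expression above; it is not Tarlow's (2017) implementation.

```r
# Minimal sketch: Theil-Sen slope of the baseline and the baseline-trend
# correction applied to the whole series (data are illustrative).
y  <- c(3, 5, 4, 7, 8, 6, 5, 4, 3, 2)
nA <- 5
yA <- y[1:nA]
p  <- combn(nA, 2)                         # all pairs of baseline occasions
b1_ts <- median((yA[p[2, ]] - yA[p[1, ]]) / (p[2, ] - p[1, ]))  # Theil-Sen
y_corrected <- y - seq_along(y) * b1_ts    # same expression as for SLC
y_corrected
```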

Procedures not extrapolating trend

The analytical procedures included in the present subsection do not extrapolate baseline trend, but they do take baseline trend into account. We decided to mention these techniques for three reasons. First, we wanted to provide a broader overview of analytical techniques applicable to single-case data. Second, we wanted to make explicit that not all analytical procedures entail baseline trend extrapolation; such extrapolation is therefore not an indispensable step in single-case data analysis. In other words, it is possible to deal with baseline trend without extrapolating it. Third, the procedures mentioned here are the ones most recently developed or suggested for single-case data analysis, and so they may be less widely known. Moreover, they can be deemed more sophisticated and more strongly grounded in statistical theory than MPD, which is the focus of the present article.

The between-case standardized mean difference, also known as the d statistic (Shadish, Hedges, & Pustejovsky, 2014), assumes stable data, but the possibility of detrending has been mentioned (Marso & Shadish, 2015) if baseline trend is present. It is not clear that a regression model using time and its interaction with a dummy variable representing phase entails baseline trend extrapolation. Moreover, a different approach was suggested by Pustejovsky, Hedges, and Shadish (2014) for obtaining a d statistic, namely in relation to multilevel analysis. In multilevel analysis, also referred to as hierarchical linear models, the trend in each phase can be modeled separately, and the slopes can be compared (Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009). Another statistical option is to use generalized additive models (GAMs; Sullivan et al., 2015), in which there is greater flexibility for modeling the exact shape of the trend in each phase, without the need to specify a particular model a priori. The GAMs that have been specifically suggested use cubic polynomial curves fitted to different portions of the data and joined at the specific places (called knots) that divide the data into portions. Just as with multilevel models, trend lines are fitted separately to each phase, without the need to extrapolate baseline trend.
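As a rough illustration of the GAM option, the sketch below fits a separate smooth trend per phase using the mgcv package; the data, and the small basis dimension chosen for a short series, are our own assumptions rather than the cited authors' specification.

```r
# Minimal sketch: a GAM with one smooth trend per phase, in the spirit of
# Sullivan et al. (2015); assumes the mgcv package is installed, and the
# data and basis dimension (k) are illustrative.
library(mgcv)

y     <- c(4, 5, 5, 6, 7, 6, 12, 14, 15, 15, 16, 17)
time  <- seq_along(y)
phase <- factor(rep(c("A", "B"), each = 6))

# s(time, by = phase) fits a separate smooth per phase; k is kept small
# because the series is short.
fit <- gam(y ~ phase + s(time, by = phase, k = 4), method = "REML")
summary(fit)
```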

A review of research published in 2015

Aim of the review.

It has already been stated (Parker et al., 2011 ) and illustrated (Tarlow, 2017 ) that baseline trend extrapolation can lead to impossible forecasts for the subsequent intervention-phase data. Accordingly, the research question we chose was the percentage of studies in which extrapolating the baseline trend of the data set (across several different techniques for fitting the trend line) leads to values that are below the lower bound or above the upper bound of the outcome variable.

Search strategy

We focused on the four journals that have published most SCED research, according to the review by Shadish and Sullivan ( 2011 ). These journals are Journal of Applied Behavior Analysis , Behavior Modification , Research in Autism Spectrum Disorders , and Focus on Autism and Other Developmental Disabilities . Each of these four journals published more than ten SCED studies in 2008, and the 76 studies they published represent 67% of all studies included in the Shadish and Sullivan review. Given that the bibliographic search was performed in September 2016, we focused on the year 2015 and looked for any articles using phase designs (AB designs, variations, or extensions) or alternation designs with a baseline phase and providing a graphical representation of the data, with at least three measurements in the initial baseline condition.

Techniques for finding a best-fitting straight line

For the present review, we selected five techniques for finding a best-fitting straight line: OLS, split-middle, tri-split, Theil–Sen, and differencing. The motivation for this choice was that these five techniques are included in single-case data-analytical procedures (Manolov, 2018 ), and therefore, applied researchers can potentially use them. The R code used for checking whether out-of-bounds forecasts are obtained is available at https://osf.io/js3hk/ .
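For concreteness, the following R sketch (not the code used for the review itself) implements simple versions of the five estimators and a check for out-of-bounds forecasts; for simplicity, the projection is anchored at the last baseline value, whereas the actual procedures define the intercept in different ways.

```r
# Minimal sketch of the five slope estimators and an out-of-bounds check;
# this is an illustration, not the review code at https://osf.io/js3hk/ .
slope_ols <- function(y) { x <- seq_along(y); unname(coef(lm(y ~ x))[2]) }
slope_dif <- function(y) mean(diff(y))                      # differencing
slope_ts  <- function(y) {                                  # Theil-Sen
  p <- combn(length(y), 2)
  median((y[p[2, ]] - y[p[1, ]]) / (p[2, ] - p[1, ]))
}
slope_split <- function(y, parts = 2) { # parts = 2: bi-split; 3: tri-split
  idx  <- split(seq_along(y), cut(seq_along(y), parts))
  medx <- sapply(idx, median)
  medy <- sapply(idx, function(i) median(y[i]))
  (medy[parts] - medy[1]) / (medx[parts] - medx[1])
}

# Project from the last baseline value (a simplification: the actual
# procedures define the intercept differently) and flag impossible values.
out_of_bounds <- function(yA, nB, slope, lower = 0, upper = 100) {
  forecast <- yA[length(yA)] + slope * seq_len(nB)
  any(forecast < lower | forecast > upper)
}

yA <- c(80, 70, 65, 55, 50)               # a shrinking percentage
sapply(c(slope_ols(yA), slope_dif(yA), slope_ts(yA),
         slope_split(yA), slope_split(yA, 3)),
       function(b) out_of_bounds(yA, nB = 10, slope = b))
```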

Upper and lower bounds

The data were retrieved using Plot Digitizer for Windows ( https://plotdigitizer.sourceforge.net ). We counted the number and percentage of studies in which values out of logical bounds were obtained after extrapolating the baseline trend, estimated either from an initial baseline phase or from a subsequent withdrawal phase (e.g., in ABAB designs), for at least one of the data sets reported graphically in the article. The "logical bounds" were defined as 0 as a minimum and 1 or 100 as a maximum, when the measurement provided was a proportion or a percentage, respectively. Additional upper bounds included the maximal scores obtainable for an exam (e.g., Cheng, Huang, & Yang, 2015; Knight, Wood, Spooner, Browder, & O'Brien, 2015), for the number of steps in a task (e.g., S. J. Gardner & Wolfe, 2015), for the number of trials in the session (Brandt, Dozier, Juanico, Laudont, & Mick, 2015; Cannella-Malone, Sabielny, & Tullis, 2015), for the duration of transition between a stimulus and reaching a location (Siegel & Lien, 2015), or for the total duration of a session, when quantifying latency (Hine, Ardoin, & Foster, 2015). We chose a conservative approach and did not speculate about upper bounds for behaviors that were expressed as either a frequency (e.g., Fiske et al., 2015; Ledbetter-Cho et al., 2015) or a rate (e.g., Austin & Tiger, 2015; Fahmie, Iwata, & Jann, 2015; Rispoli et al., 2015; Saini, Greer, & Fisher, 2015).

Results of the review

The numbers of articles included per journal are as follows. From the Journal of Applied Behavior Analysis, 27 SCED studies were included from the 46 "research articles" published (excluding three alternating-treatments designs without a baseline), and 20 more SCED studies were included from the 30 "reports" published (excluding two alternating-treatments designs without a baseline and one changing-criterion design). From Behavior Modification, eight SCED studies were included from the 39 "articles" published (excluding two alternating-treatments design studies without a baseline, two studies with other designs without phases, one study with phases but only two measurements in the baseline phase, meta-analyses of single cases, and articles on data analysis for single cases). From Research in Autism Spectrum Disorders, seven SCED studies were included from the 67 "original research articles" published (excluding one SCED study that did not have a minimum of three measurements per phase, as per Kratochwill et al., 2010). From Focus on Autism and Other Developmental Disabilities, six SCED studies were included from the 21 "articles" published. The references to all 68 articles reviewed are available in Appendix A at https://osf.io/js3hk/ .

The results of this review are as follows. Extrapolation led to impossibly small values for all five trend estimators in 27 studies (39.71%), in contrast to 34 studies (50.00%) in which that did not happen for any of the trend estimators. Complementarily, extrapolation led to impossibly large values for all five trend estimators in eight studies (11.76%), in contrast to 56 studies (82.35%) in which that did not happen for any of the trend estimators. In terms of when the extrapolation led to an impossible value, a summary is provided in Table 1. Note that, for each article, this table refers to the data set including the earliest out-of-bounds forecast. Thus, it can be seen that for all trend-line-fitting techniques, it was most common to have out-of-bounds forecasts already before the third intervention-phase measurement occasion. This is relevant, considering that an immediate effect can be understood to refer to the first three intervention data points (Kratochwill et al., 2010).

These results suggest that researchers using techniques to extrapolate baseline trend should be cautious about downward trends that would apparently lead to negative values, if continued. We do not claim that the four journals and the year 2015 are representative of all published SCED research, but the evidence obtained suggests that trend extrapolation may affect the meaningfulness of the quantitative operations performed with the predicted data frequently enough for it to be considered an issue worth investigating.

Main issues when extrapolating baseline trend, and tentative solutions

The main issues when extrapolating baseline trend that were identified by Parker et al. ( 2011 ) include (a) unreliable trend lines being fitted; (b) the assumption that trends will continue unabated; (c) no consideration of the baseline phase length; and (d) the possibility of out-of-bounds forecasts. In this section, we comment on each of these four issues identified by Parker et al. ( 2011 ) separately (although they are related), and we propose tentative solutions, based on the existing literature. However, we begin by discussing in brief how these issues could be avoided rather than simply addressed.

Avoiding the issues

Three decisions can be made in relation to trend extrapolation. First, the researcher may wonder whether there is any clear trend at all. For that purpose, a tool such as a trend stability envelope (Lane & Gast, 2014) can be used. According to Lane and Gast, a within-phase trend would be considered stable (or clear) when at least 80% of the data points fell within the envelope defined by the split-middle trend line plus/minus 25% of the baseline median. Similarly, Mendenhall and Sincich (2012) suggested, although not in the context of single-case data, that a good fit of an OLS trend line would be represented by a coefficient of variation of 10% or smaller. We consider that either of these descriptive approaches is likely to be more reasonable than testing the statistical significance of the baseline trend before deciding whether or not to take it into account, because such a statistical test might lack power for short baselines (Tarlow, 2017). Using Kendall's tau as a measure of the percentage of improving data points (Vannest, Parker, Davis, Soares, & Smith, 2012) would not inform one about whether a clear linear trend was present, because it refers more generally to a monotonic trend.
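The envelope criterion can be sketched as follows in R; the data are made up, an even-length baseline is assumed for the simple split, and the helper is ours rather than Lane and Gast's (2014) exact procedure.

```r
# Minimal sketch of a trend stability envelope: the baseline trend is
# judged stable when at least 80% of the points fall within the
# split-middle line +/- 25% of the baseline median.
yA <- c(10, 12, 11, 13, 12, 14)
x  <- seq_along(yA)

half  <- split(x, rep(1:2, each = length(yA) / 2))   # two halves
medx  <- sapply(half, median)
medy  <- sapply(half, function(i) median(yA[i]))
slope <- (medy[2] - medy[1]) / (medx[2] - medx[1])
line  <- medy[1] + slope * (x - medx[1])             # split-middle line

envelope <- 0.25 * median(yA)
stable   <- mean(abs(yA - line) <= envelope) >= 0.80
stable
```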

Second, if the data show considerable variability and no clear trend, it is possible to use a quantification that does not rely on (a) linear trend, (b) any specific nonlinear trend, or (c) any average level whatsoever, by using a nonoverlap index. Specifically, the nonoverlap of all pairs (NAP; Parker & Vannest, 2009) can be used when the baseline data do not show a natural improvement, whereas Tau-U (Parker et al., 2011) can be used when such an improvement is apparent but not necessarily linear. A different approach could be to quantify the difference in level (e.g., using a d statistic) after showing that the assumption of no trend is plausible via a GAM (Sullivan et al., 2015). Thus, there would be no trend line fitting and no trend extrapolation.

Third, if the trend looks clear (visually or according to a formal rule) and the researcher decides to take it into account, it is also possible not to extrapolate trend lines. For instance, it is possible to fit separate trend lines to the different phases and compare the slopes and intercepts of these trend lines, as in piecewise regression (Center, Skiba, & Casey, 1985–1986 ).
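A minimal sketch of this piecewise parameterization in R, with hypothetical data:

```r
# Minimal sketch of piecewise regression: a phase dummy and the time
# elapsed since the intervention started capture changes in level and
# slope, so no baseline trend is extrapolated (data are illustrative).
y     <- c(3, 4, 5, 6, 7, 12, 14, 15, 17, 18)
time  <- seq_along(y)
nA    <- 5
phase <- as.numeric(time > nA)
time_in_B <- pmax(0, time - nA)        # time since the intervention started

fit <- lm(y ~ time + phase + time_in_B)
coef(fit)                              # phase = change in level;
                                       # time_in_B = change in slope
```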

Although these potential solutions seem reasonable, here we deal with another option: namely, the case in which baseline extrapolation is desired (because it is part of the analytical procedure chosen prior to data collection), but the researcher is willing to improve the way in which such extrapolation is performed.

First issue: Unreliable trend lines fitted

If an unreliable linear trend is fitted (e.g., the relation between the time variable and the measurements is described by a small \(R^2\) value), the degree of confidence we can have in the representation of the baseline data is reduced. If the fit of the baseline trend line to the data is poor, its extrapolation will also be problematic. For the same amount of variability, shorter baselines are expected to result in more uncertain estimates. In that sense, this issue is related to the next one.

Focusing specifically on reliability, we advocate quantifying the fit of the trend line and using this information when deciding on baseline trend extrapolation. Regarding the comparison between actual and fitted values, Hyndman and Koehler (2006) reviewed the drawbacks of several measures of forecast accuracy, including widely known options such as the mean squared error ( \( \sum_{i=1}^{n}{\left({y}_i-{\widehat{y}}_i\right)}^2/n \), based on a quadratic loss function and inversely related to \(R^2\)) or the mean absolute error ( \( \sum_{i=1}^{n}\left|{y}_i-{\widehat{y}}_i\right|/n \), based on a linear loss function). Hyndman and Koehler proposed the mean absolute scaled error (MASE). For a trend line fitted to the \(n_A\) baseline measurements, MASE can be written as follows:

\( MASE=\frac{\frac{1}{n_A}\sum_{i=1}^{n_A}\left|{y}_i-{\widehat{y}}_i\right|}{\frac{1}{n_A-1}\sum_{i=2}^{n_A}\left|{y}_i-{y}_{i-1}\right|} \)

Hyndman and Koehler (2006, p. 687) stated that MASE is "easily interpretable, because values of MASE greater than one indicate that the forecasts are worse, on average, than in-sample one-step forecasts from the naïve method." (The naïve method entails predicting each value from the previous one: the random-walk model that has frequently been used to assess the degree to which more sophisticated methods provide more accurate forecasts than this simple procedure; Chatfield, 2000.) Thus, values of MASE greater than one could be indicative that a general trend (e.g., a linear one, as in MPD) does not provide a good enough fit to the data from which it was estimated, because it does not improve on the fit of the naïve method.
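A minimal R sketch of this computation, assuming an OLS trend line has been fitted to hypothetical baseline data (any of the other fitting techniques could be substituted):

```r
# Minimal sketch: MASE for a baseline trend line, scaling the mean
# absolute error of the fitted line by the in-sample mean absolute error
# of the naive (random-walk) one-step forecasts.
mase <- function(y, fitted) {
  mae_line  <- mean(abs(y - fitted))
  mae_naive <- mean(abs(diff(y)))      # one-step naive forecasts
  mae_line / mae_naive
}

yA  <- c(8, 7, 9, 6, 8)
x   <- seq_along(yA)
fit <- lm(yA ~ x)
mase(yA, fitted(fit))                  # > 1 suggests a poor linear fit
```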

Second issue: Assuming that trend continues unabated

This issue refers to treating baseline trend as if it were always the same for the whole period of extrapolation. By default, all the analytical techniques mentioned in the "Analytical Techniques That Entail Extrapolating Baseline Trend" section extrapolate baseline trend until the end of the intervention phase. Thus, one way of dealing with this issue would be to limit the extrapolation, following Rindskopf and Ferron (2014), who stated that "for a short period, behavior may show a linear trend, but we cannot project that linear behavior very far into the future" (p. 229). Similarly, when discussing the gradual-effects model, Swan and Pustejovsky (2018) also cautioned against long extrapolations, although their focus was on the intervention phase and not on the baseline phase.

An initial approach could be to select how far out to extrapolate baseline trend prior to gathering and plotting the data, by selecting a number that would be the same across studies. When discussing an approach for comparing levels when trend lines are fitted separately to each phase, it has been suggested that a comparison can be performed at the fifth intervention-phase measurement occasion (Rindskopf & Ferron, 2014; Swaminathan et al., 2014). It is possible to extend this recommendation to the present situation and state that the baseline trend should be extrapolated until the fifth intervention-phase measurement occasion. The choice of five measurements is arbitrary, but it is well-aligned with the minimal phase length required in the What Works Clearinghouse Standards (Kratochwill et al., 2010). Nonetheless, our review (Table 1) suggests that impossible extrapolations are common even before the fifth intervention-phase measurement occasion, and thus a comparison at that point might not avoid comparison with an impossible projection from the baseline. Similarly, when presenting the gradual-effects model, Swan and Pustejovsky (2018) defined the calculation of the effect size for an a priori set number of intervention-phase measurement occasions. In their study, this number depends on the actually observed intervention-phase lengths. Moreover, Swan and Pustejovsky suggested a sensitivity analysis, comparing the results of several possible a-priori-set numbers. It could be argued that a fixed choice would avoid making data-driven decisions that could favor finding results in line with the expectations of the researchers (Wicherts et al., 2016). A second approach would be to choose how far away to extrapolate on the basis of both a design feature (baseline phase length; see the next section) and a data feature (the amount of fit of the trend line to the data, expressed as the MASE). In the following discussion, we present a tentative solution including both these aspects.

Third issue: No consideration of baseline-phase length

Parker et al. ( 2011 ) expressed a concern that baseline trend correction procedures do not take into consideration the length of the baseline phase. The problem is that a short baseline is potentially related to unreliable trend, and it could also entail predicting many values (i.e., a longer intervention phase) from few values, which is not justified.

To take baseline length (\(n_A\)) into account, one approach would be to limit the extrapolation of baseline trend to the first \(n_A\) treatment-phase measurement occasions. This approach introduces an objective criterion based on a characteristic of the design. A conservative version of this alternative would be to estimate how far out to extrapolate using the following expression: \( {\widehat{n}}_B=\left\lfloor {n}_A\times \left(1- MASE\right)\right\rfloor \), applying the restriction that \( 0\le {\widehat{n}}_B\le {n}_B \). Thus, the extrapolation is determined by both the number of baseline measurements (\(n_A\)) and the goodness of fit of the trend line to the data. When \(MASE > 1\), the expression for \( {\widehat{n}}_B \) would give a negative value, precluding extrapolation. For data in which \(MASE < 1\), the better the fit of the trend line to the data, the farther out extrapolation could be considered justified. From the expression presented for \( {\widehat{n}}_B \), it can be seen that if the result of the multiplication is not an integer, the value representing the number of intervention-phase measurement occasions to which to extend the baseline trend (\( {\widehat{n}}_B \)) is truncated. Finally, note the restriction that \( {\widehat{n}}_B \) should be equal to or smaller than \(n_B\), because it is possible that the baseline is longer than the intervention phase (\(n_A > n_B\)) and that, even after applying the correction factor representing the fit of the trend line, \( {\widehat{n}}_B>{n}_B \). Thus, whenever \( {\widehat{n}}_B>{n}_B \), it is reset to \( {\widehat{n}}_B={n}_B \).
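In R, this rule amounts to a short computation; the helper name is ours:

```r
# Minimal sketch of the proposed limit on extrapolation:
# n_hat_B = floor(n_A * (1 - MASE)), restricted to the range [0, n_B].
n_hat_B <- function(nA, nB, mase) {
  nb <- floor(nA * (1 - mase))
  max(0, min(nb, nB))
}

n_hat_B(nA = 5, nB = 15, mase = 0.4)   # 3: extrapolate three occasions
n_hat_B(nA = 5, nB = 15, mase = 1.2)   # 0: no extrapolation justified
```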

Fourth issue: Out-of-bounds forecasts

Extrapolating baseline trend for five, n A , or \( {\widehat{n}}_B \) measurement occasions may make trend extrapolation more reasonable (or, at least, less unreasonable), but none of these options precludes out-of-bounds forecasts. When Parker et al. ( 2011 ) discussed the issue that certain procedures to control for baseline trend could lead to projecting trend beyond rational limits, they proposed the conservative trend correction procedure implemented in Tau-U. This procedure could be useful for statistically controlling baseline trend, although the evidence provided by Tarlow ( 2017 ) suggests that the trend control incorporated in Tau-U is insufficient (i.e., leads to false positive results), especially as compared to other procedures, including MPD. An additional limitation of this trend correction procedure is that it cannot be used when extrapolating baseline trend. Therefore, we consider other options in the following text.

Nonlinear models

One option, suggested by Rindskopf and Ferron ( 2014 ), is to use nonlinear models for representing situations in which a stable and low initial level during the baseline phase experiences a change due to the intervention (e.g., an upward trend) before settling at a stable high level. Rindskopf and Ferron suggested using logistic regression with an additional term for identifying the moment at which the response has gone halfway between the floor and the ceiling. Similarly, Shadish et al. ( 2016 ) and Verboon and Peters ( 2018 ) used a logistic model for representing data with clear floor and ceiling effects. The information that can be obtained by fitting a generalized logistic model is in terms of the floor and ceiling levels, the rate of change, and the moments at which the change from the floor to the ceiling plateau starts and stops (Verboon & Peters, 2018 ). Shadish et al. ( 2016 ) acknowledged that not all analysts are expected to be able to fit intrinsically nonlinear models and that choosing one model over another is always partly arbitrary, suggesting nonparametric smoothing as an alternative.

Focusing on the need to improve MPD, the proposals by Rindskopf and Ferron (2014) and Verboon and Peters (2018) are not applicable, since the logistic model they present considers the data of the baseline and intervention phases jointly, whereas in MPD baseline trend is estimated and extrapolated in order to allow for a comparison between the projected and observed patterns of the outcome variable (as suggested by Kratochwill et al., 2010, and Horner, Swaminathan, Sugai, & Smolkowski, 2012, for visual analysis). In contrast, Shadish et al. (2016) used the logistic model to represent the data within one of the phases in order to explore whether any within-phase change took place; they were not aiming to use the within-phase model for extrapolating to the subsequent phase.

Although not all systematic changes in the behavior of interest are necessarily linear, there are three drawbacks to applying nonlinear models to single-case data, or even to the usually longer time-series data (Chatfield, 2000). First, there has not been extensive research applying any of the possible nonlinear models for growth curves (e.g., logistic, Gompertz, or polynomial) to short time-series data in order to ensure that the known minimal and maximal values of the measurements are not exceeded. Second, it may be difficult to distinguish between a linear model with disturbance and an inherently nonlinear model. Third, a substantive justification is necessary, based either on theory or on previously fitted nonlinear models, for preferring one nonlinear model over another or for preferring a nonlinear model over the more parsimonious linear model. However, the latter two challenges are circumvented by GAMs, because they avoid the need to explicitly posit a specific model for the data (Sullivan et al., 2015).

Winsorizing

Faith, Allison, and Gorman (1997) suggested manually rescaling out-of-bounds predicted scores to within limits, a manipulation similar to winsorization. Thus, a trend is extrapolated until the values predicted are no longer possible, and then a flat line is set at the minimum/maximum possible value (e.g., 0 when the aim is to eliminate a behavior, or 100% when the aim is to improve in the completion of a certain task). The "manual" rescaling of out-of-bounds forecasts could be supported by Chatfield's (2000, pp. 175–179) claim that it is possible to make judgmental adjustments to forecasts and also to use the "eyeball test" for checking whether forecasts are intuitively reasonable, given that background knowledge (albeit background as simple as knowing the bounds of the outcome variable) is part of nonautomatic univariate methods for forecasting in time-series analysis. In summary, just as in the logistic model, winsorizing the trend line depends on the data at hand. As a limitation, Parker et al. (2011) claimed that such a correction would impose an artificial ceiling on the effect size. However, it could also be argued that computing an effect size on the basis of impossible values is equally (or more) artificial, since it involves only crunching numbers, some of which (e.g., negative frequencies) are meaningless.
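A minimal R sketch of this correction, with illustrative numbers; the helper name is ours:

```r
# Minimal sketch of winsorizing extrapolated forecasts: the linear
# projection is clipped at the bounds of the outcome variable.
winsorize_forecast <- function(last_fitted, slope, nB,
                               lower = 0, upper = 100) {
  forecast <- last_fitted + slope * seq_len(nB)
  pmin(pmax(forecast, lower), upper)   # flat line once a bound is reached
}

winsorize_forecast(last_fitted = 20, slope = -6, nB = 6)
# 14  8  2  0  0  0
```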

Damping trend

A third option arises from time-series forecasting, in which exponential smoothing is one of the methods commonly used (Billah, King, Snyder, & Koehler, 2006 ). Specifically, in double exponential smoothing, which can be seen as a special case of Holt’s ( 2004 ) linear trend procedure, it is possible to include a damping parameter (E. S. Gardner & McKenzie, 1985 ) that indicates how much the slope of the trend is reduced in subsequent time periods. According to the review performed by E. S. Gardner ( 2006 ), the damped additive trend is the model of choice when using exponential smoothing. A damped trend can be interpreted as an attenuation reflecting the gradual reduction of the trend until the behavior eventually settles at an upper or a lower asymptote. This would address Parker et al.’s ( 2011 ) concern that it may not be reasonable to consider that the baseline trend will continue unabated until the end of the intervention phase in the absence of an effect. Moreover, the behavioral progression is more gradual than the one implied when winsorizing. Furthermore, a gradual change is also the basis of recent proposals for modeling longitudinal data using generalized additive models (Bringmann et al., 2017 ).

Aiming for a tentative solution for out-of-bounds forecasts for techniques such as MPD, we consider it reasonable to borrow the idea of damping the trend from the linear trend model by Holt ( 2004 ). In contrast, the application of that model in its entirety to short SCED baselines (Shadish & Sullivan, 2011 ; Smith, 2012 ; Solomon, 2014 ) is limited by the need to estimate several parameters (a smoothing parameter for level, a smoothing parameter for trend, a damping parameter, the initial level, and the initial trend).

We consider that a gradually reduced trend conceptualization seems more substantively defensible than abruptly winsorizing the trend line. In that sense, instead of extrapolating the linear trend until the lower or upper bound is reached and then flattening the trend line, it is possible to estimate the damping coefficient in such a way as to ensure that impossible forecasts are not obtained during the period of extrapolation (i.e., in the \( {\widehat{n}}_B \) or n B measurement occasions after the last baseline data point, according to whether extrapolation is limited, as we propose here, or not). The damping parameter is usually represented by the Greek letter phi ( φ ), so that the trend line extrapolated into the intervention phase would be based on the baseline trend ( b 1 ) as follows: \( {b}_1\times {\varphi}^i;i=1,2,\dots, {\widehat{n}}_B \) , so that the first predicted intervention-phase measurement is \( {\widehat{y}}_1={\widehat{y}}_{n_A}+{b}_1\times \varphi \) , and the subsequent forecasts (for \( i=2,3,\dots, {\widehat{n}}_B \) ) are obtained via \( {\widehat{y}}_i={\widehat{y}}_{i-1}+{b}_1\times {\varphi}^i \) . The previous expressions are presented using \( {\widehat{n}}_B \) , but they can be rewritten using n B in the case that extrapolation is not limited in time. For avoiding extrapolation to impossible values, the damping parameter would be estimated from the data in such a way that the final predicted value \( {\widehat{y}}_{{\widehat{n}}_B} \) would still be within the bounds of the outcome variable. We propose an iterative process checking the values of φ from 0.05 to 1.00 in steps of 0.001, in order to identify the largest φ value k for which there are no out-of-bounds values, whereas for ( k  + 0.001) there is one or more such values. The closer φ is to 1, the farther away in the intervention phase is the first out-of-bounds forecast produced. Estimating φ from the data and not setting it to an a-priori-chosen value is in accordance with the usually recommended practice in exponential smoothing (Billah et al., 2006 ).
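The grid search can be sketched as follows in R (helper names and data are ours); the loop returns the largest φ whose damped projection stays within bounds, matching the logic described above.

```r
# Minimal sketch of estimating the damping parameter phi: search the grid
# [0.05, 1] (step .001) for the largest phi whose damped projection stays
# within the outcome bounds.
damped_forecast <- function(last_fitted, b1, phi, nB) {
  last_fitted + cumsum(b1 * phi^seq_len(nB))   # y_i = y_{i-1} + b1 * phi^i
}

estimate_phi <- function(last_fitted, b1, nB, lower = 0, upper = 100) {
  for (phi in rev(seq(0.05, 1, by = 0.001))) { # from 1 downward
    f <- damped_forecast(last_fitted, b1, phi, nB)
    if (all(f >= lower & f <= upper)) return(phi)
  }
  0   # even the strongest damping checked leaves forecasts out of bounds
}

estimate_phi(last_fitted = 10, b1 = -4, nB = 8)  # phi < 1 is needed here
```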

Justification of the tentative solutions

Our main proposal is to combine the quantitative criterion for how far out to extrapolate baseline trend (\( {\widehat{n}}_B \)) with damping, in case the latter is necessary within the \( {\widehat{n}}_B \) limit. The fact that both \( {\widehat{n}}_B \) and the damping parameter φ are estimated from the data rather than being predetermined implies that this proposal is data-driven. We consider that the data-driven quantification of \( {\widehat{n}}_B \) is not necessarily a drawback, for three reasons: (a) an objective formula was proposed for estimating how far out it is reasonable to extrapolate the baseline trend, according to the data at hand; that is, the choice is not made subjectively by the researcher in order to favor his/her hypotheses; (b) this formula is based on both a design feature (i.e., the baseline phase length) and a data feature (i.e., the MASE as a measure of the accuracy of the trend line fitted); and (c) no substantive reason may be available a priori regarding when extrapolation becomes unjustified.

We also consider that estimating the damping parameter from the data is not a drawback, either, given that (a) φ is estimated from the data in Holt’s linear trend model for which it was proposed; (b) damping trend can be considered conceptually similar to choosing a function, in a growth curve model, that makes possible incorporating an asymptote (Chatfield, 2000 ), because both methods model decisions made by the researcher on the basis of knowing the characteristics of the data and, in both cases, the moment at which the asymptote is reached depends on the data at hand and not on a predefined criterion; and (c) the use of regression splines (Bringmann et al., 2017 ; Sullivan et al., 2015 ) for modeling a nonlinear relation is also data-driven, despite the fact that a predefined number of knots may be used.

The combined use of \( {\widehat{n}}_B \) plus the estimation of φ can be applied to the OLS baseline trend (as used in the Allison & Gorman, 1993 , model), to the split-middle trend (as used in the conservative dual criterion, Fisher et al., 2003 ; or in the percentage of data points exceeding the median trend, Wolery et al., 2010 ), or to the trend extrapolation that is part of MPD (Manolov & Solanas, 2013 ). In the following section, we focus on MPD.

The present proposal is also well-aligned with Bringmann et al.’s ( 2017 ) recommendation for models that do not require existing theories about the expected nature of the change in the behavior, excessively high computational demands, or long series of measurements. Additionally, as these authors suggested, the methods need to be readily usable by applied researchers, which is achieved by the software implementations we have created.

Limitations of the tentative solutions

As we mentioned previously, it could be argued that the tentative solutions are not necessary if the researcher simply avoids extrapolation. Moreover, we do not argue that the expressions presented for deciding whether and how far to extrapolate are the only possible, or necessarily the optimal, ones; we rather aimed at defining an objective rule on a solid, albeit arbitrary, basis. An additional limitation, as was suggested by a reviewer, is that for a baseline with no variability, MASE would not be defined. In such a case, when the same value is repeated n A times (e.g., when the value is 0 because the individual is unable to perform the action required), we do consider that an unlimited extrapolation would be warranted, because the reference to which the intervention-phase data would be compared would be clear and unambiguous.

Incorporating the tentative solutions in a data-analytical procedure

Modifying the MPD.

The revised version of the MPD includes the following steps:

1. Estimate the slope of the baseline trend as the average of the differenced data (\(b_{1(D)}\)).

2. Fit the trend line, choosing one of the three definitions of the intercept (see Appendix B at https://osf.io/js3hk/ ) according to the value of the MASE.

3. Extrapolate the baseline trend, if justified (i.e., if \(MASE < 1\)), for as many intervention-phase measurement occasions as is justified (i.e., for the first \( {\widehat{n}}_B \) measurement occasions of the intervention phase), considering the need to damp the trend to avoid out-of-bounds forecasts. The damping parameter φ is equal to 1 when all \( {\widehat{n}}_B \) forecasts are within bounds, and φ < 1 otherwise.

4. Compute MPD as the difference between the actually obtained and the forecast first \( {\widehat{n}}_B \) intervention-phase values.
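The four steps can be combined into a compact R sketch; this helper is our illustration under the stated definitions (differencing slope, a Theil–Sen-style intercept), not the authors' implementation, which is available at https://osf.io/js3hk/ .

```r
# Minimal sketch of the modified MPD, combining the pieces sketched above.
mpd_modified <- function(y, nA, lower = 0, upper = 100) {
  nB <- length(y) - nA
  i  <- seq_len(nA)
  b1 <- mean(diff(y[1:nA]))                 # step 1: slope via differencing
  b0 <- median(y[1:nA] - b1 * i)            # step 2: one intercept option
  fitted_A <- b0 + b1 * i
  mase <- mean(abs(y[1:nA] - fitted_A)) / mean(abs(diff(y[1:nA])))
  if (!is.finite(mase)) mase <- 0           # flat baseline: perfect fit
                                            # (see the limitations above)
  nb_hat <- max(0, min(floor(nA * (1 - mase)), nB))  # step 3: limit
  if (nb_hat < 1) return(NA)                # extrapolation not justified

  phi <- 1                                  # damp only if needed
  repeat {
    fc <- fitted_A[nA] + cumsum(b1 * phi^seq_len(nb_hat))
    if (all(fc >= lower & fc <= upper) || phi <= 0.05) break
    phi <- phi - 0.001
  }
  mean(y[nA + seq_len(nb_hat)] - fc)        # step 4: actual minus forecast
}

mpd_modified(c(10, 9, 9, 8, 7, 15, 16, 17, 18, 18), nA = 5)
```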

Illustration of the proposal for modifying MPD

In the present section, we chose three of the studies included in the review that we performed (all three data sets are available at https://osf.io/js3hk/ , in the format required by the Shiny application, http://manolov.shinyapps.io/MPDExtrapolation , implementing the modified version of MPD). From the illustrations it is clear that, although the focus of the present text is comparison between a pair of phases, such a comparison can be conceptualized to be part of a more appropriate design structure, such as ABAB or multiple-baseline designs (Kratochwill et al., 2010 ; Tate et al., 2013 ), by replicating the same procedure for each AB comparison. Such a means of analyzing data corresponds to the suggestion by Scruggs and Mastropieri ( 1998 ) to perform comparisons only for data that maintain the AB sequence.

The Ciullo, Falcomata, Pfannenstiel, and Billingsley (2015) data were chosen because their multiple-baseline design includes short baselines and extrapolation to out-of-bounds forecasts (impossibly low values) for both the first tier (Fig. 1) and the third tier. In Fig. 1, trend extrapolation was not limited (i.e., the baseline trend was extrapolated for all \(n_B = 7\) values), to allow for comparing winsorizing and damping the trend. Limiting the extrapolation to \( {\widehat{n}}_B \) = 2 would have made either winsorizing or damping the trend unnecessary, because no out-of-bounds forecasts would have been obtained; MPD would have been equal to 40.26.

Fig. 1. Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. The results in the left panel are based on winsorizing the trend line when the lower bound is reached; the results in the right panel are based on damping the trend. Trend extrapolation is not limited. The data correspond to the first tier (a participant called Salvador) in the Ciullo et al. (2015) multiple-baseline design study.

Limiting the amount of extrapolation seems reasonable, because for both of these tiers the intervention phase is almost three times as long as the baseline phase; using \( {\widehat{n}}_B \) leads to avoiding impossibly low forecasts for these data and to more conservative estimates of the magnitude of the effect. Damping the trend line was necessary for three of the four tiers, where it also led to more conservative estimates, given that the out-of-bounds forecasts were in a direction opposite from the one desired with the intervention. The numerical results are available in Table 2.

The data from Allen, Vatland, Bowen, and Burke ( 2015 ) were chosen, because this study represents a different data pattern: Longer baselines are available, which could allow for better estimation of the trend, but the baseline data are apparently very variable. Intervention phases were also longer, which required extrapolations farther out in time. Thus, we wanted to illustrate how limiting extrapolations affects the quantification of an effect.

For Tier 1, out-of-bounds forecasts (impossibly high values, in the same direction as desired for the intervention) are obtained. However, damping the trend led to avoiding such forecasts and also to greater estimates of the effect. For Tiers 2 and 3 (the latter is represented in Fig. 2), limiting the amount of extrapolation had a very strong effect, due to the high MASE values, and only a very short extrapolation was justified. The limited extrapolation is also related to greater estimates of the magnitude of the effect for Tiers 2 and 3.

Fig. 2. Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. Trend extrapolation was not limited (left) versus limited (right). Damping the trend was not necessary in either case (φ = 1). The data correspond to the third tier of the Allen et al. (2015) multiple-baseline design study.

Therefore, using only the first \( {\widehat{n}}_B \) intervention-phase data points for the comparison reflects a reasonable doubt regarding whether the (not sufficiently clear) improving baseline trend would have continued unchanged throughout the whole intervention phase (i.e., for 23 or 16 measurement occasions, for Tiers 2 and 3, respectively). The numerical results are available in Table 3 .

The data from Eilers and Hayes ( 2015 ) were chosen because they include baselines of varying lengths, out-of-bounds forecasts for Tiers 1 and 2, and a nonlinear pattern in Tier 3 (to which a linear trend line is expected to show poor fit). For these data, damping and limiting the extrapolation, when applied separately, both correct overestimation of the effect that would arise from out-of-bounds (high) forecasts in a direction opposite from the one desired in the intervention. Such an overestimation, in the absence of damping, would lead to MPD values implying more than a 100% reduction, which is meaningless (see Fig. 3 ).

Fig. 3. Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. Trend was damped completely (right; φ = 0) versus not damped (left; φ = 1). Trend extrapolation is not limited in this figure. The data correspond to the second tier of the Eilers and Hayes (2015) multiple-baseline design study.

Specifically, damping the trend is necessary in Tiers 1 and 2 to avoid such forecasts. Note that for Tier 3, the fact that a straight line does not represent the baseline data well is reflected by MASE  > 1 and \( {\widehat{n}}_B<1 \) , leading to a recommendation not to extrapolate the baseline trend. The numerical results are available in Table 4 .

General comments

In general, the modifications introduced in MPD achieve the aims of (a) avoiding extrapolation from a short baseline to a much longer intervention phase (Example 1); (b) avoiding the assumption that the trend will continue exactly the same for many measurement occasions beyond the baseline phase (Example 2); (c) following an objective criterion for deciding that a baseline trend line is not justified in being extrapolated at all (Example 3); and (d) avoiding excessively large quantifications of effect when comparing to impossibly bad (countertherapeutic) forecasts in the absence of an effect (Examples 1 and 3). Furthermore, note that for all the data sets included in this illustration, the smallest MASE values were obtained using the Theil–Sen definition of the intercept.

Small-scale simulation study

To obtain additional evidence regarding the performance of the proposals, an application to generated data was a necessary complement to the application of our proposals to previously published real behavioral data. The simulation presented in this section should be understood as a proof of concept, rather than as a comprehensive source of evidence. We consider that further thought and research should be dedicated to simulating discrete bounded data (e.g., counts, percentages) and to studying the present proposals for deciding how far to extrapolate baseline trend and how to deal with impossible extrapolations.

Data generation

We simulated independent and autocorrelated count data using a Poisson model, following the article by Swan and Pustejovsky (2018) and adapting the R code available in the supplementary material to their article ( https://osf.io/gaxrv and https://www.tandfonline.com/doi/suppl/10.1080/00273171.2018.1466681 ). The adaptation consisted in adding a general trend for certain conditions (denoted here by \(\beta_1\), whereas \(\beta_2\) denotes the change-in-level parameter, unlike in Swan & Pustejovsky, 2018, who denoted the change in level by \(\beta_1\)) and simulating immediate instead of delayed effects (i.e., we set ω = 0). Given that ω = 0, the simulation model, as described by Swan and Pustejovsky, is as follows. The expected value for each measurement occasion is \( \mu_t = \exp(\beta_0 + \beta_1 t + \beta_2 D) \), where \(t\) is the time variable taking values 1, 2, . . . , \(n_A + n_B\), and \(D\) is a dummy variable for change in level, taking \(n_A\) values of 0 followed by \(n_B\) values of 1. The first value, \(Y_1\), is simulated from a Poisson distribution with mean \( \lambda_1 = \mu_1 \). Subsequent values (\(j = 2, 3, \dots, n_A + n_B\)) are simulated taking autocorrelation into account (\( \varphi_j = \min\{\varphi, \mu_j/\mu_{j-1}\} \)), leading to the following mean for the Poisson distribution: \( \lambda_j = \mu_j - \varphi_j \mu_{j-1} \). Finally, the second through the last values were simulated as \( Y_j = X_j + Z_j \), where \(Z_j\) follows a Poisson distribution with mean \(\lambda_j\), and \(X_j\) follows a binomial distribution with \(Y_{j-1}\) trials and success probability \(\varphi_j\).

The specific simulation parameters for defining \(\mu_t\) were \( \exp(\beta_0) = 50 \) (representing the baseline frequency), \(\beta_1 = 0, -0.1, -0.2\), \(\beta_2 = -0.4\) (representing the intervention effect as an immediate change in level), and autocorrelation φ = 0 or 0.4. Regarding the intervention effect, according to the formula \( \%\,\mathrm{change} = 100\% \times \left[\exp(\beta_2) - 1\right] \) (Pustejovsky, 2018b), the effect was a reduction of approximately 33%, or 16.5 points, from the baseline level (\( \exp(\beta_0) \)), set to 50. The phase lengths (\(n_A = n_B\)) were 5, 7, and 10.

The specific values of the β parameters, as well as the decision to simulate the intervention effect as a reduction, were chosen so as to produce a floor effect for certain simulation conditions. That is, for some of the conditions, the values of the dependent variable were equal or close to zero before the end of the intervention phase, and thus could not improve any more. For these conditions, extrapolating the baseline trend would lead to impossible negative forecasts. Such a data pattern represents well the findings from our review, according to which, in almost 40% of the articles, at least one AB comparison led to impossible negative predictions if the baseline trend were continued. Example data sets for the simulation conditions are presented as figures at https://osf.io/js3hk/ . A total of 10,000 iterations were performed for each condition using R code ( https://cran.r-project.org ).
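A minimal R re-implementation of the data-generating model as described above (our sketch, not Swan and Pustejovsky's own code) could look as follows:

```r
# Minimal sketch of the Poisson data-generating model described above:
# a log-linear mean with trend and change in level, plus binomial-thinning
# autocorrelation.
simulate_series <- function(nA, nB, b0 = log(50), b1 = 0,
                            b2 = -0.4, phi = 0.4) {
  t  <- 1:(nA + nB)
  D  <- rep(c(0, 1), c(nA, nB))
  mu <- exp(b0 + b1 * t + b2 * D)

  y <- numeric(nA + nB)
  y[1] <- rpois(1, mu[1])
  for (j in 2:(nA + nB)) {
    phi_j  <- min(phi, mu[j] / mu[j - 1])
    lambda <- mu[j] - phi_j * mu[j - 1]
    y[j]   <- rbinom(1, y[j - 1], phi_j) + rpois(1, lambda)
  }
  y
}

set.seed(1)
simulate_series(nA = 5, nB = 5, b1 = -0.2)  # downward trend, floor near 0
```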

Data analysis

Six different quantifications of the intervention effect were computed. First, an immediate effect was computed, as defined in piecewise regression (Center et al., 1985–1986) and, by extension, in multilevel models (Van den Noortgate & Onghena, 2008). This immediate effect represents a comparison, for the first intervention-phase measurement occasion, between the extrapolated baseline trend and the fitted intervention-phase trend. Second, an average effect was computed, as defined in the generalized least squares proposal by Swaminathan et al. (2014). This average effect (\(\delta_{AB}\)) is based on the expression by Rogosa (1980), initially proposed for computing an overall effect in the context of the analysis of covariance when the regression slopes were not parallel. The specific expressions are (1) for the baseline data, \( y_t^A = \beta_0^A + \beta_1^A t + e_t \), where \(t = 1, 2, \dots, n_A\); (2) for the intervention-phase data, \( y_t^B = \beta_0^B + \beta_1^B t + e_t \), where \(t = n_A + 1, n_A + 2, \dots, n_A + n_B\); and (3) \( \delta_{AB} = \left(\beta_0^A - \beta_0^B\right) + \left(\beta_1^A - \beta_1^B\right)\frac{2 n_A + n_B + 1}{2} \). Additionally, four versions of the MPD were computed: (a) one estimating the baseline trend line using the Theil–Sen estimator, with no limitation of the extrapolation and no correction for impossible forecasts; (b) MPD incorporating \( {\widehat{n}}_B \) to limit the extrapolation [MPD Limited]; (c) MPD incorporating \( {\widehat{n}}_B \) and using flattening to correct impossible forecasts [MPD Limited Flat]; and (d) MPD incorporating \( {\widehat{n}}_B \) and using damping to correct impossible forecasts [MPD Limited Damping]. Finally, we obtained two additional pieces of information: the percentage of iterations in which \( {\widehat{n}}_B < 1 \) (due to MASE being greater than 1) and the quartiles (plus minimum and maximum) of \( {\widehat{n}}_B \) for each experimental condition.
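For illustration, the Rogosa-type average effect can be computed in R as follows (our sketch of the expression above, with made-up data); note that \((2 n_A + n_B + 1)/2\) is the midpoint of the intervention-phase measurement occasions.

```r
# Minimal sketch of the Rogosa-type average effect: OLS lines are fitted
# to each phase and compared at the midpoint of the intervention phase.
delta_AB <- function(y, nA) {
  nB <- length(y) - nA
  tA <- 1:nA;               cA <- coef(lm(y[tA] ~ tA))
  tB <- (nA + 1):(nA + nB); cB <- coef(lm(y[tB] ~ tB))
  t_mid <- (2 * nA + nB + 1) / 2
  unname((cA[1] - cB[1]) + (cA[2] - cB[2]) * t_mid)
}

y <- c(20, 19, 21, 20, 22, 10, 9, 8, 8, 7)
delta_AB(y, nA = 5)   # positive value: reduction after the intervention
```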

The results of the simulation are presented in Tables 5, 6, and 7, for phase lengths of five, seven, and ten measurements, respectively. When there is an intervention effect (\(\beta_2 = -0.4\)) but no general trend (\(\beta_1 = 0\)), all quantifications lead to very similar results, which are also very similar to the expected overall difference of 16.5. The most noteworthy result for these conditions is that, when there is autocorrelation, for phase lengths of seven and ten data points, the naïve method is more frequently a better model for the baseline data than the Theil–Sen trend (e.g., 17.51% for autocorrelated data vs. 6.61% for independent data when \(n_A = n_B = 10\)). This is logical because, according to the naïve method, each data point is predicted from the previous one, and positive first-order autocorrelation entails that adjacent values are more similar to each other than would be expected by chance.

When there is a general trend and \(n_A = n_B = 5\) (Table 5), the floor effect means that only the immediate effect remains favorable for the intervention (i.e., lower values for the dependent variable in the intervention phase). In contrast, a comparison between the baseline extrapolation and the treatment data leads to overall quantifications (\(\delta_{AB}\) and MPD) suggesting deterioration. This is because of the impossible (negative) predicted values. The other versions of MPD entail quantifications that are less overall (i.e., based on fewer measurement occasions, \( {\widehat{n}}_B < n_B \)), and the MPD version that both limits extrapolation and uses damping to avoid impossible projections is the one that leads to values most similar to the immediate effect.

For the conditions with \(n_A = n_B = 7\) (Table 6), the results and comments are equivalent. The only difference is that, for a general trend expressed as \(\beta_1 = -0.2\), the baseline "spontaneous" reduction is already large enough to reach the floor values, and thus even the immediate effect is unfavorable for the intervention. The results for \(n_A = n_B = 10\) (Table 7) are similar. For \(n_A = n_B = 10\), we added another condition in which the general trend was not pronounced enough (i.e., \(\beta_1 = -0.1\)) to lead to a floor effect already during the baseline. For these conditions, the results are similar to the ones for \(n_A = n_B = 5\) and \(\beta_1 = -0.2\).

In summary, when there is a change in level in the absence of a general trend, the proposals for limiting the extrapolation and avoiding impossible forecasts do not affect the quantification of an overall effect. Additionally, in situations in which impossible forecasts would be obtained, these proposals lead to quantifications that better represent the data pattern. We consider that for data patterns in which the floor is reached soon after introducing the intervention, an immediate effect and subsequent values at the floor level (e.g., as quantified by the percentage of zero data; Scotti, Evans, Meyer, & Walker, 1991) should be considered sufficient evidence (if they are replicated) for an intervention effect. That is, we consider that such quantifications would be a more appropriate evaluation of the data pattern than an overall quantification, such as \(\delta_{AB}\) and MPD in the absence of the proposals. Thus, we consider the proposals to be useful. Still, the specific quantifications obtained when the proposals are applied to MPD should not be considered perfect, because they will depend on the extent to which the observed data pattern matches the expected data pattern (e.g., whether a spontaneous improvement is expected, whether an immediate effect is expected) and on the type of quantification preferred (e.g., a raw difference as in MPD, a percentage change such as the one that could be obtained from the log response ratio [Pustejovsky, 2018b], or a difference in standard deviations, such as the BC-SMD [Shadish et al., 2014]).

In terms of the \( {\widehat{n}}_B \) values obtained, Tables 5, 6, and 7 show that most typically (i.e., for the central 50% of iterations), extrapolation was considered justified for between two and four measurement occasions into the intervention phase. This is well-aligned with the idea of an immediate effect consisting of the first three intervention-phase measurement occasions (Kratochwill et al., 2010) and is broader than the immediate effect defined in piecewise regressions and multilevel models (which focus only on the first measurement occasion). Such a short extrapolation avoids the untenable assumption that the baseline trend would continue unabated for too long. Moreover, damping the baseline trend helps identify a more appropriate reference for comparing the actual intervention data points.

General discussion

Extrapolating baseline trend: issues, breadth of these issues, and tentative solutions.

Several single-case analytical techniques entail extrapolating baseline trend—for instance, the Allison and Gorman ( 1993 ) regression model, the nonregression technique called mean phase difference (Manolov & Solanas, 2013 ), and the nonoverlap index called the percentage of data points exceeding the median trend (Wolery et al., 2010 ). An initial aspect to take into account is that these three techniques estimate the intercept and slope of the trend line in three different ways. When a trend line is fitted to the baseline data, the amount of fit of the trend line to the data has to be considered, plus whether it is reasonable to consider that the trend will continue unchanged and whether extrapolating the trend would lead to predicted values that are impossible in real data. The latter issue appeared to be present in SCED data published in 2015, given that in approximately 10% of the studies reviewed, forecasts above the maximal possible value were obtained, and in 40% the forecasts were below the minimal possible value, for all five trend line fitting procedures investigated. The proposals we make here take into account the length of the baseline phase, the amount of fit of the trend line to the data, and the need to avoid meaningless comparisons between actual values and impossible predicted values. Moreover, limiting the extrapolation emphasizes the idea that a linear trend is only a model that serves as an approximation of how the data would behave if the baseline continued for a limited amount of time, rather than assuming that a linear trend is necessarily the correct model for the progression of the measurements in the absence of an intervention.

The examples provided with real data and the simulation results from applying the proposals to the MPD illustrate how the present proposal for correcting out-of-bounds forecasts avoids both excessively low and excessively high effect estimates when the bounds of the measurement units are considered. Moreover, the quantitative criterion for deciding how far out to extrapolate baseline trend serves as an objective rule for not extrapolating a trend line into the intervention phase when the baseline data are not represented well by such a line.

Recommendations for applied researchers

In relation to our proposals, we recommend both limiting the extrapolation and allowing for damping the trend. Limiting the extrapolation leads to a quantification that combines two criteria mentioned in the What Works Clearinghouse Standards (Kratochwill et al., 2010), immediate change and comparison of the projected versus the observed data pattern, whereas damping the trend avoids completely meaningless comparisons. Moreover, in relation to the MPD, we advocate defining its intercept according to the smallest MASE value. In relation to statistical analysis in general, we do not recommend that applied researchers necessarily always use analytical techniques that extrapolate a baseline trend (e.g., MPD, the generalized least squares analysis by Swaminathan et al., 2014, or the Allison & Gorman, 1993, OLS model). Rather, we caution regarding the use of such techniques for certain data sets and propose a modification of MPD that avoids quantifications of effects based on unreasonable comparisons. Additionally, we caution researchers that, when a trend line is fitted to the data, it is important for transparency to report the technique used for estimating the intercept and slope of this trend line, given that several such techniques are available (Manolov, 2018). Finally, for cases in which the data show substantial variability and are not represented well by a straight line, or even by a curved line, we recommend applying the nonoverlap of all pairs, which makes use of all the data and not only of the first \( {\widehat{n}}_B \) intervention-phase measurements.

Beyond the present focus on trend, some desirable features of analytical techniques have been suggested by Wolery et al. ( 2010 ) and expanded on by Manolov, Gast, Perdices, and Evans ( 2014 ). Readers interested in broader reviews of analytical techniques can also consult Gage and Lewis ( 2013 ) and Manolov and Moeyaert ( 2017 ). In general, we echo the recommendation to use quantitative analysis together with visual analysis (e.g., Campbell & Herzinger, 2010 ; Harrington & Velicer, 2015 ; Houle, 2009 ), and we further reflect on this point in the following section.

Validating the quantifications and enhancing their interpretation: Software developments

Visual analysis is regarded as a tool for verifying the meaningfulness of the quantitative results yielded by statistical techniques (Parker et al., 2006). In that sense, it is crucial to represent visually the trend line fitted and extrapolated, or the transformed data after baseline trend has been removed. Accordingly, recent efforts have focused on using visual analysis to help choose the appropriate multilevel model (Baek, Petit-Bois, Van Den Noortgate, Beretvas, & Ferron, 2016). To make more transparent what exactly is being done with the data to obtain the quantifications, the output of the modified MPD is both graphical and numerical (see http://manolov.shinyapps.io/MPDExtrapolation , which allows for choosing whether to limit the extrapolation of the baseline trend and whether to use damping or winsorizing in the case of out-of-bounds forecasts). For MPD, in which the quantification is the average difference between the extrapolated baseline trend and the actual intervention-phase measurements, the graphical output clearly indicates which values are forecasts (plus whether the trend is maintained or damped) and how far away the baseline trend is extrapolated. Moreover, the color of the arrows from predicted to actual intervention-phase values used in the figures of this article indicates, for each comparison, whether the difference was in the desired direction (green) or not (red). In summary, the graphical representation of the comparisons performed in MPD makes it easier to use visual analysis to validate, and help interpret, the information obtained.

Limitations in relation to the alternatives for extrapolating linear baseline trend for forecasting

In the present study, we discussed extrapolating linear trend because the MPD, our focal analytical technique, fits a straight line to the baseline data before extrapolating it. Nevertheless, it would be possible to fit a nonlinear (e.g., logistic) model to the baseline data (Shadish et al., 2016). Furthermore, there are many other alternative procedures for estimating and extrapolating trend, especially in the context of time-series analysis.

Among univariate time-series procedures for forecasting, Chatfield (2000) distinguished between formal statistical models, that is, mathematical representations of reality (e.g., ARIMA; state space; growth curve models, such as the logistic and Gompertz; nonlinear models, including artificial neural networks), and ad hoc methods, that is, formulas for computing forecasts. Among the ad hoc methods, the most well-known and frequently used options are exponential smoothing (which can be expressed within the framework of state space models; De Gooijer & Hyndman, 2006) and the related Holt linear-trend procedure, or the Holt–Winters procedure when a seasonal component is included. As we mentioned previously, the idea of damping a trend is borrowed from the Holt linear-trend procedure, on the basis of the work of E. S. Gardner and McKenzie (1985).

Regarding ARIMA, according to the Box–Jenkins approach already introduced in the single-case designs context, the aim is to identify the most parsimonious adequate model by means of three steps: model identification, parameter estimation, and diagnostic checking. An appropriate model would then be used for forecasting. The difficulties of correctly identifying the ARIMA model for single-case data, via the analysis of autocorrelations and partial autocorrelations, have been documented (Velicer & Harrop, 1983), leading to the proposal of a reduced set of plausible models that avoids this initial step (Velicer & McDonald, 1984). The simulation evidence available for these models (Harrop & Velicer, 1985) refers to data series of 40 measurements (i.e., 20 per phase), which is longer than typical single-case baselines (almost half of the initial baselines contained four or fewer data points) or series (median length of 20, according to the review by Shadish & Sullivan, 2011, with most series containing fewer than 40 measurements). Moreover, to the best of our knowledge, the possibility of obtaining out-of-bounds predicted values has not been discussed for these models, nor have tentative solutions been proposed for this issue.
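
For readers who want to see what the three Box–Jenkins steps look like in practice, a hedged sketch using the forecast package is given below (our illustration; the series is hypothetical). With a series this short, automated identification typically collapses to a very simple model, which is precisely the difficulty documented for single-case data:

# Sketch of the three Box-Jenkins steps with the forecast package.
library(forecast)

y <- ts(c(5, 6, 4, 7, 6, 8, 7, 9))   # hypothetical short baseline series
fit <- auto.arima(y)                 # model identification + parameter estimation
checkresiduals(fit)                  # diagnostic checking (residual plots, Ljung-Box test)
forecast(fit, h = 5)                 # forecasts for five further measurement occasions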

Holt’s (2004) linear-trend procedure is another option for forecasting that is covered in textbooks (e.g., Mendenhall & Sincich, 2012), and therefore is potentially accessible to applied researchers. Holt’s model is an extension of simple exponential smoothing that includes a linear trend. The procedure can be extended further by including a damping parameter (E. S. Gardner & McKenzie, 1985) that indicates how much the slope of the trend is reduced in subsequent time periods. The latter model is called the additive damped trend model, and according to the review by E. S. Gardner (2006), it is the model of choice when using exponential smoothing. The main issue with the additive damped trend model is that it requires estimating three parameters (one smoothing parameter for the level, one smoothing parameter for the trend, and the damping parameter), and it is also recommended to estimate the initial level and trend via optimization. It is unclear whether reliable estimates can be obtained with the usually short baseline phases in single-case data. We performed a small-scale check using the R code by Hyndman and Athanasopoulos (2013, chap. 7.4). For instance, for the Ciullo et al. (2015) data with \( n_A \le 4 \) and the multiple-baseline data by Eilers and Hayes (2015) with \( n_A \) equal to 3, 5, and 8, the number of measurements was not sufficient to estimate the damping parameter, and thus only a linear trend was extrapolated. The same was the case for the Allen et al. (2015) data for \( n_A \) = 5 and 9, whereas for \( n_A \) = 16 it was possible to use the additive damped trend model. Our check suggested that the minimum baseline length required for applying the additive damped trend model is 10, which is greater than (a) the baseline length found in at least 50% of the data sets reviewed by Shadish and Sullivan (2011); (b) the modal value of six baseline data points reported in Smith’s (2012) review; and (c) the average baseline length in the Solomon (2014) review.
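
A sketch of the kind of check just described, using the current forecast package (the successor to the online textbook code cited above); the baseline values are hypothetical:

library(forecast)

y <- ts(c(12, 11, 13, 10, 9, 10, 8, 7, 8, 6))  # hypothetical 10-point baseline
fit <- holt(y, damped = TRUE, h = 8)           # additive damped trend model
fit$model                                      # estimated smoothing and damping parameters
plot(fit)                                      # forecasts with the damped trend

# With much shorter baselines (e.g., 3-5 points), the damping parameter cannot
# be estimated reliably, consistent with the check reported above.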

Therefore, the reader should be aware that there are alternatives for estimating and extrapolating trend for forecasting. However, to the best of our knowledge, none of these alternatives is directly applicable to single-case data without issues, or without the need to explore which model or method is more appropriate and in which circumstances, questions that do not have clear answers even for the usually longer data series of time-series analysis (Chatfield, 2000).

Future research

One line of future research could focus on testing the proposals via a broader simulation, such as one applying them to different analytical techniques: for instance, the MPD; the extrapolation performed before computing \( \delta_{AB} \) in the context of regression analysis; and the percentage of data points exceeding the median trend. Another line of research could focus on comparing the version of the MPD incorporating the proposals with the recently developed generalized logistic model of Verboon and Peters (2018). Such a comparison could entail a field test and a survey among applied researchers on the perceived ease of use and the utility of the information provided.

Author note

The authors thank Patrick Onghena for his feedback on previous versions of this article.

In contrast, in the meta-analysis by Chiu and Roberts (2018), for outcomes for which there was no true maximum, the largest value actually obtained was treated as a maximum before converting the values into percentages. If we had followed the same procedure, we would have found a greater frequency of impossibly high forecasts.

The references in this paragraph correspond to the studies included in the review and are available in Appendix A at our Open Science Framework site: https://osf.io/js3hk/ .

Note that Tarlow (2017) identified several issues with Tau-U and proposed the “baseline-corrected Tau,” which, however, corrects the data using the linear trend as estimated with the Theil–Sen estimator, and thus implicitly assumes that a straight line is a good representation of the baseline data.

It could be argued that having three different ways of defining the intercept available (i.e., in the Shiny application) may prompt applied researchers to choose the definition that favors their hypotheses or expectations. Nevertheless, we advocate using the definition of the intercept that provides a better fit to the data, both visually and quantitatively, as assessed via the MASE.

Following Tate and Perdices (2018), we use the term “tier” to refer to each AB comparison within a multiple-baseline design. Therefore, “tiers” could refer to different individuals, if the multiple-baseline design entails a staggered replication across participants, or to different behaviors or settings, if there is replication across behaviors or settings. Additionally, the term “tier” enables us to avoid confusion with the term “baseline,” which denotes only the A phase of the AB comparison.

Allen, K. D., Vatland, C., Bowen, S. L., & Burke, R. V. (2015). Parent-produced video self-modeling to improve independence in an adolescent with intellectual developmental disorder and an autism spectrum disorder: A controlled case study. Behavior Modification , 39 , 542–556.


Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for meta-analysis: The case of the single case. Behaviour Research and Therapy , 31 , 621−631.

Arnau, J., & Bono, R. (1998). Short time series analysis: C statistic vs. Edgington model. Quality & Quantity , 32 , 63–75.


Austin, J. E., & Tiger, J. H. (2015). Providing alternative reinforcers to facilitate tolerance to delayed reinforcement following functional communication training. Journal of Applied Behavior Analysis , 48 , 663−668.

Baek, E. K., Petit-Bois, M., Van Den Noortgate, W., Beretvas, S. N., & Ferron, J. M. (2016). Using visual analysis to evaluate and refine multilevel models of single-case studies. Journal of Special Education , 50 , 18–26.

Billah, B., King, M. L., Snyder, R. D., & Koehler, A. B. (2006). Exponential smoothing model selection for forecasting. International Journal of Forecasting , 22 , 239–247.

Brandt, J. A. A., Dozier, C. L., Juanico, J. F., Laudont, C. L., & Mick, B. R. (2015). The value of choice as a reinforcer for typically developing children. Journal of Applied Behavior Analysis , 48 , 344−362.

Bringmann, L. F., Hamaker, E. L., Vigo, D. E., Aubert, A., Borsboom, D., & Tuerlinckx, F. (2017). Changing dynamics: Time-varying autoregressive models using generalized additive modeling. Psychological Methods , 22 , 409–425. https://doi.org/10.1037/met0000085


Campbell, J. M., & Herzinger, C. V. (2010). Statistics and single subject research methodology. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 417–453). London: Routledge.

Cannella-Malone, H. I., Sabielny, L. M., & Tullis, C. A. (2015). Using eye gaze to identify reinforcers for individuals with severe multiple disabilities. Journal of Applied Behavior Analysis , 48 , 680–684. https://doi.org/10.1002/jaba.231

Center, B. A., Skiba, R. J., & Casey, A. (1985–1986). A methodology for the quantitative synthesis of intra-subject design research. Journal of Special Education , 19 , 387–400.

Chatfield, C. (2000). Time-series forecasting. London: Chapman & Hall/CRC.

Cheng, Y., Huang, C. L., & Yang, C. S. (2015). Using a 3D immersive virtual environment system to enhance social understanding and social skills for children with autism spectrum disorders. Focus on Autism and Other Developmental Disabilities , 30 , 222−236.

Chiu, M. M., & Roberts, C. A. (2018). Improved analyses of single cases: Dynamic multilevel analysis. Developmental Neurorehabilitation , 21 , 253–265.

Ciullo, S., Falcomata, T. S., Pfannenstiel, K., & Billingsley, G. (2015). Improving learning with science and social studies text using computer-based concept maps for students with disabilities. Behavior Modification , 39 , 117–135.

De Gooijer, J. G., & Hyndman, R. J. (2006). 25 years of time series forecasting. International Journal of Forecasting , 22 , 443–473.

Eilers, H. J., & Hayes, S. C. (2015). Exposure and response prevention therapy with cognitive defusion exercises to reduce repetitive and restrictive behaviors displayed by children with autism spectrum disorder. Research in Autism Spectrum Disorders , 19 , 18–31.

Fahmie, T. A., Iwata, B. A., & Jann, K. E. (2015). Comparison of edible and leisure reinforcers. Journal of Applied Behavior Analysis , 48 , 331−343.

Faith, M. S., Allison, D. B., & Gorman, D. B. (1997). Meta-analysis of single-case research. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 245–277). Mahwah: Erlbaum.

Ferron, J. M., Bell, B. A., Hess, M. R., Rendina-Gobioff, G., & Hibbard, S. T. (2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods , 41 , 372–384. https://doi.org/10.3758/BRM.41.2.372

Fisher, W. W., Kelley, M. E., & Lomas, J. E. (2003). Visual aids and structured criteria for improving visual inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis , 36 , 387–406.


Fiske, K. E., Isenhower, R. W., Bamond, M. J., Delmolino, L., Sloman, K. N., & LaRue, R. H. (2015). Assessing the value of token reinforcement for individuals with autism. Journal of Applied Behavior Analysis , 48 , 448−453.

Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). London: Sage.

Gage, N. A., & Lewis, T. J. (2013). Analysis of effect for single-case design research. Journal of Applied Sport Psychology , 25 , 46–60.

Gardner, E. S., Jr. (2006). Exponential smoothing: The state of the art—Part II. International Journal of Forecasting , 22 , 637–666.

Gardner, E. S., Jr., & McKenzie, E. (1985). Forecasting trends in time series. Management Science , 31 , 1237–1246.

Gardner, S. J., & Wolfe, P. S. (2015). Teaching students with developmental disabilities daily living skills using point-of-view modeling plus video prompting with error correction. Focus on Autism and Other Developmental Disabilities , 30 , 195−207.

Harrington, M., & Velicer, W. F. (2015). Comparing visual and statistical analysis in single-case studies using published studies. Multivariate Behavioral Research , 50 , 162–183.

Harrop, J. W., & Velicer, W. F. (1985). A comparison of alternative approaches to the analysis of interrupted time-series. Multivariate Behavioral Research , 20 , 27–44.

Hine, J. F., Ardoin, S. P., & Foster, T. E. (2015). Decreasing transition times in elementary school classrooms: Using computer-assisted instruction to automate intervention components. Journal of Applied Behavior Analysis , 48 , 495–510. https://doi.org/10.1002/jaba.233

Holt, C. C. (2004). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting , 20 , 5–10.

Horner, R. H., Swaminathan, H., Sugai, G., & Smolkowski, K. (2012). Considerations for the systematic analysis and use of single-case research. Education and Treatment of Children , 35 , 269–290.

Houle, T. T. (2009). Statistical analyses for single-case experimental designs. In D. H. Barlow, M. K. Nock, & M. Hersen (Eds.), Single case experimental designs: Strategies for studying behavior change (3rd ed., pp. 271–305). Boston: Pearson.

Huitema, B. E., McKean, J. W., & McKnight, S. (1999). Autocorrelation effects on least-squares intervention analysis of short time series. Educational and Psychological Measurement , 59 , 767–786.

Hyndman, R. J., & Athanasopoulos, G. (2013). Forecasting: Principles and practice. Retrieved March 24, 2018, from https://www.otexts.org/fpp/7/4

Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting , 22 , 679–688.

Knight, V. F., Wood, C. L., Spooner, F., Browder, D. M., & O’Brien, C. P. (2015). An exploratory study using science eTexts with students with Autism Spectrum Disorder. Focus on Autism and Other Developmental Disabilities , 30 , 86−99.

Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single case designs technical documentation . In the What Works Clearinghouse: Procedures and standards handbook (Version 1.0). Available at http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_scd.pdf

Kratochwill, T. R., Levin, J. R., Horner, R. H., & Swoboda, C. M. (2014). Visual analysis of single-case intervention research: Conceptual and methodological issues. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 91–125). Washington, DC: American Psychological Association.

Lane, J. D., & Gast, D. L. (2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation , 24 , 445–463.

Ledbetter-Cho, K., Lang, R., Davenport, K., Moore, M., Lee, A., Howell, A., . . . O’Reilly, M. (2015). Effects of script training on the peer-to-peer communication of children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 785−799.

Manolov, R. (2018). Linear trend in single-case visual and quantitative analyses. Behavior Modification , 42 , 684–706.

Manolov, R., Gast, D. L., Perdices, M., & Evans, J. J. (2014). Single-case experimental designs: Reflections on conduct and analysis. Neuropsychological Rehabilitation , 24 , 634−660. https://doi.org/10.1080/09602011.2014.903199

Manolov, R., & Moeyaert, M. (2017). Recommendations for choosing single-case data analytical techniques. Behavior Therapy , 48 , 97−114.

Manolov, R., & Rochat, L. (2015). Further developments in summarising and meta-analysing single-case data: An illustration with neurobehavioural interventions in acquired brain injury. Neuropsychological Rehabilitation , 25 , 637−662.

Manolov, R., & Solanas, A. (2009). Percentage of nonoverlapping corrected data. Behavior Research Methods , 41 , 1262–1271. https://doi.org/10.3758/BRM.41.4.1262

Manolov, R., & Solanas, A. (2013). A comparison of mean phase difference and generalized least squares for analyzing single-case data. Journal of School Psychology , 51 , 201−215.

Marso, D., & Shadish, W. R. (2015). Software for meta-analysis of single-case design: DHPS macro . Retrieved January 22, 2017, from http://faculty.ucmerced.edu/wshadish/software/software-meta-analysis-single-case-design

Matyas, T. A., & Greenwood, K. M. (1997). Serial dependency in single-case time series. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 215–243). Mahwah: Erlbaum.

Mendenhall, W., & Sincich, T. (2012). A second course in statistics: Regression analysis (7th ed.). Boston: Prentice Hall.

Mercer, S. H., & Sterling, H. E. (2012). The impact of baseline trend control on visual analysis of single-case data. Journal of School Psychology , 50 , 403–419.

Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling baseline trend in single-case research. School Psychology Quarterly , 21 , 418−443.

Parker, R. I., & Vannest, K. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy , 40 , 357–367. https://doi.org/10.1016/j.beth.2008.10.006

Parker, R. I., Vannest, K. J., Davis, J. L., & Sauber, S. B. (2011). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy , 42 , 284−299. https://doi.org/10.1016/j.beth.2010.08.006

Pustejovsky, J. E. (2015). Measurement-comparable effect sizes for single-case studies of free-operant behavior. Psychological Methods , 20 , 342−359.

Pustejovsky, J. E. (2018a). Procedural sensitivities of effect sizes for single-case designs with directly observed behavioral outcome measures. Psychological Methods . Advance online publication. https://doi.org/10.1037/met0000179

Pustejovsky, J. E. (2018b). Using response ratios for meta-analyzing single-case designs with behavioral outcomes. Journal of School Psychology , 68 , 99–112.

Pustejovsky, J. E., Hedges, L. V., & Shadish, W. R. (2014). Design-comparable effect sizes in multiple baseline designs: A general modeling framework. Journal of Educational and Behavioral Statistics , 39 , 368–393.

Rindskopf, D. M., & Ferron, J. M. (2014). Using multilevel models to analyze single-case design data. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 221−246). Washington, DC: American Psychological Association.

Rispoli, M., Ninci, J., Burke, M. D., Zaini, S., Hatton, H., & Sanchez, L. (2015). Evaluating the accuracy of results for teacher implemented trial-based functional analyses. Behavior Modification , 39 , 627−653.

Rogosa, D. (1980). Comparing nonparallel regression lines. Psychological Bulletin , 88 , 307–321. https://doi.org/10.1037/0033-2909.88.2.307


Saini, V., Greer, B. D., & Fisher, W. W. (2015). Clarifying inconclusive functional analysis results: Assessment and treatment of automatically reinforced aggression. Journal of Applied Behavior Analysis , 48 , 315–330. https://doi.org/10.1002/jaba.203


Scotti, J. R., Evans, I. M., Meyer, L. H., & Walker, P. (1991). A meta-analysis of intervention research with problem behavior: Treatment validity and standards of practice. American Journal on Mental Retardation , 96 , 233–256.

Scruggs, T. E., & Mastropieri, M. A. (1998). Summarizing single-subject research: Issues and applications. Behavior Modification , 22 , 221–242.

Shadish, W. R., Hedges, L. V., & Pustejovsky, J. E. (2014). Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications. Journal of School Psychology , 52 , 123–147.

Shadish, W. R., Kyse, E. N., & Rindskopf, D. M. (2013). Analyzing data from single-case designs using multilevel models: New applications and some agenda items for future research. Psychological Methods , 18 , 385–405. https://doi.org/10.1037/a0032964

Shadish, W. R., Rindskopf, D. M., & Boyajian, J. G. (2016). Single-case experimental design yielded an effect estimate corresponding to a randomized controlled trial. Journal of Clinical Epidemiology , 76 , 82–88.

Shadish, W. R., Rindskopf, D. M., Hedges, L. V., & Sullivan, K. J. (2013). Bayesian estimates of autocorrelations in single-case designs. Behavior Research Methods , 45 , 813–821.

Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods , 43 , 971−980. https://doi.org/10.3758/s13428-011-0111-y

Siegel, E. B., & Lien, S. E. (2015). Using photographs of contrasting contextual complexity to support classroom transitions for children with Autism Spectrum Disorders. Focus on Autism and Other Developmental Disabilities , 30 , 100−114.

Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods , 17 , 510–550. https://doi.org/10.1037/a0029312

Solanas, A., Manolov, R., & Onghena, P. (2010). Estimating slope and level change in N = 1 designs. Behavior Modification , 34 , 195−218.

Solomon, B. G. (2014). Violations of assumptions in school-based single-case data: Implications for the selection and interpretation of effect sizes. Behavior Modification , 38 , 477−496.

Stewart, K. K., Carr, J. E., Brandt, C. W., & McHenry, M. M. (2007). An evaluation of the conservative dual-criterion method for teaching university students to visually inspect AB-design graphs. Journal of Applied Behavior Analysis , 40 , 713−718.

Sullivan, K. J., Shadish, W. R., & Steiner, P. M. (2015). An introduction to modeling longitudinal data with generalized additive models: Applications to single-case designs. Psychological Methods , 20 , 26−42. https://doi.org/10.1037/met0000020

Swaminathan, H., Rogers, H. J., Horner, R., Sugai, G., & Smolkowski, K. (2014). Regression models for the analysis of single case designs. Neuropsychological Rehabilitation , 24 , 554−571.

Swan, D. M., & Pustejovsky, J. E. (2018). A gradual effects model for single-case designs. Multivariate Behavioral Research , 53 , 574–593. https://doi.org/10.1080/00273171.2018.1466681

Tarlow, K. (2017). An improved rank correlation effect size statistic for single-case designs: Baseline corrected Tau. Behavior Modification , 41 , 427–467.

Tate, R. L., & Perdices, M. (2018). Single-case experimental designs for clinical research and neurorehabilitation settings: Planning, conduct, analysis and reporting. London: Routledge.

Tate, R. L., Perdices, M., Rosenkoetter, U., Wakima, D., Godbee, K., Togher, L., & McDonald, S. (2013). Revision of a method quality rating scale for single-case experimental designs and n -of-1 trials: The 15-item Risk of Bias in N -of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation , 23 , 619–638. https://doi.org/10.1080/09602011.2013.824383

Van den Noortgate, W., & Onghena, P. (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention , 2 , 142–151.

Vannest, K. J., Parker, R. I., Davis, J. L., Soares, D. A., & Smith, S. L. (2012). The Theil–Sen slope for high-stakes decisions from progress monitoring. Behavioral Disorders , 37 , 271–280.

Velicer, W. F., & Harrop, J. (1983). The reliability and accuracy of time series model identification. Evaluation Review , 7 , 551–560.

Velicer, W. F., & McDonald, R. P. (1984). Time series analysis without model identification. Multivariate Behavioral Research , 19 , 33–47.

Verboon, P., & Peters, G. J. (2018). Applying the generalized logistic model in single case designs: Modeling treatment-induced shifts. Behavior Modification . Advance online publication. https://doi.org/10.1177/0145445518791255

White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual subject research. Behavioral Assessment , 11 , 281–296.

Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., van Aert, R. C., & Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p -hacking. Frontiers in Psychology , 7 , 1832. https://doi.org/10.3389/fpsyg.2016.01832

Wolery, M., Busick, M., Reichow, B., & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing single-subject data. Journal of Special Education , 44 , 18–29.

Wolfe, K., & Slocum, T. A. (2015). A comparison of two approaches to training visual analysis of AB graphs. Journal of Applied Behavior Analysis , 48 , 472–477. https://doi.org/10.1002/jaba.212

Young, N. D., & Daly, E. J., III. (2016). An evaluation of prompting and reinforcement for training visual analysis skills. Journal of Behavioral Education , 25 , 95–119.


Author information

Authors and affiliations.

Department of Social Psychology and Quantitative Psychology, Faculty of Psychology, University of Barcelona, Barcelona, Spain

Rumen Manolov & Antonio Solanas

Department of Operations, Innovation and Data Sciences, ESADE Business School, Ramon Llull University, Barcelona, Spain

Rumen Manolov & Vicenta Sierra


Corresponding author

Correspondence to Rumen Manolov.


Appendix A: References to the studies included in the present review of single-case research published in 2015 in four journals: Journal of Applied Behavior Analysis , Behavior Modification , Research in Autism Spectrum Disorders , and Focus on Autism and Other Developmental Disabilities .

Allen, K. D., Vatland, C., Bowen, S. L., & Burke, R. V. (2015). An evaluation of parent-produced video self-modeling to improve independence in an adolescent with intellectual developmental disorder and an autism spectrum disorder: A controlled case study. Behavior Modification , 39 , 542–556.

Austin, J. E., & Tiger, J. H. (2015). Providing alternative reinforcers to facilitate tolerance to delayed reinforcement following functional communication training. Journal of Applied Behavior Analysis , 48 , 663–668.

Austin, J. L., Groves, E. A., Reynish, L. C., & Francis, L. L. (2015). Validating trial-based functional analyses in mainstream primary school classrooms. Journal of Applied Behavior Analysis , 48 , 274–288.

Boudreau, B. A., Vladescu, J. C., Kodak, T. M., Argott, P. J., & Kisamore, A. N. (2015). A comparison of differential reinforcement procedures with children with autism. Journal of Applied Behavior Analysis , 48 , 918–923.

Brandt, J. A. A., Dozier, C. L., Juanico, J. F., Laudont, C. L., & Mick, B. R. (2015). The value of choice as a reinforcer for typically developing children. Journal of Applied Behavior Analysis , 48 , 344–362.

Cannella-Malone, H. I., Sabielny, L. M., & Tullis, C. A. (2015). Using eye gaze to identify reinforcers for individuals with severe multiple disabilities. Journal of Applied Behavior Analysis , 48 , 680–684.

Carroll, R. A., Joachim, B. T., St Peter, C. C., & Robinson, N. (2015). A comparison of error-correction procedures on skill acquisition during discrete-trial instruction. Journal of Applied Behavior Analysis , 48 , 257–273.

Cheng, Y., Huang, C. L., & Yang, C. S. (2015). Using a 3D immersive virtual environment system to enhance social understanding and social skills for children with autism spectrum disorders. Focus on Autism and Other Developmental Disabilities , 30 , 222–236.

Ciccone, F. J., Graff, R. B., & Ahearn, W. H. (2015). Increasing the efficiency of paired-stimulus preference assessments by identifying categories of preference. Journal of Applied Behavior Analysis , 48 , 221–226.

Ciullo, S., Falcomata, T. S., Pfannenstiel, K., & Billingsley, G. (2015). Improving learning with science and social studies text using computer-based concept maps for students with disabilities. Behavior Modification , 39 , 117–135.

Daar, J. H., Negrelli, S., & Dixon, M. R. (2015). Derived emergence of WH question–answers in children with autism. Research in Autism Spectrum Disorders , 19 , 59–71.

DeQuinzio, J. A., & Taylor, B. A. (2015). Teaching children with autism to discriminate the reinforced and nonreinforced responses of others: Implications for observational learning. Journal of Applied Behavior Analysis , 48 , 38–51.

Derosa, N. M., Fisher, W. W., & Steege, M. W. (2015). An evaluation of time in establishing operation on the effectiveness of functional communication training. Journal of Applied Behavior Analysis , 48 , 115–130.

Ditzian, K., Wilder, D. A., King, A., & Tanz, J. (2015). An evaluation of the performance diagnostic checklist–human services to assess an employee performance problem in a center-based autism treatment facility. Journal of Applied Behavior Analysis , 48 , 199–203.

Donaldson, J. M., Wiskow, K. M., & Soto, P. L. (2015). Immediate and distal effects of the good behavior game. Journal of Applied Behavior Analysis , 48 , 685–689.

Downs, H. E., Miltenberger, R., Biedronski, J., & Witherspoon, L. (2015). The effects of video self-evaluation on skill acquisition with yoga postures. Journal of Applied Behavior Analysis , 48 , 930–935.

Dupuis, D. L., Lerman, D. C., Tsami, L., & Shireman, M. L. (2015). Reduction of aggression evoked by sounds using noncontingent reinforcement and time-out. Journal of Applied Behavior Analysis , 48 , 669–674.

Engstrom, E., Mudford, O. C., & Brand, D. (2015). Replication and extension of a check-in procedure to increase activity engagement among people with severe dementia. Journal of Applied Behavior Analysis , 48 , 460–465.

Fahmie, T. A., Iwata, B. A., & Jann, K. E. (2015). Comparison of edible and leisure reinforcers. Journal of Applied Behavior Analysis , 48 , 331–343.

Fichtner, C. S., & Tiger, J. H. (2015). Teaching discriminated social approaches to individuals with Angelman syndrome. Journal of Applied Behavior Analysis , 48 , 734–748.

Fisher, W. W., Greer, B. D., Fuhrman, A. M., & Querim, A. C. (2015). Using multiple schedules during functional communication training to promote rapid transfer of treatment effects. Journal of Applied Behavior Analysis , 48 , 713–733.

Fiske, K. E., Isenhower, R. W., Bamond, M. J., Delmolino, L., Sloman, K. N., & LaRue, R. H. (2015). Assessing the value of token reinforcement for individuals with autism. Journal of Applied Behavior Analysis , 48 , 448–453.

Fox, A. E., & Belding, D. L. (2015). Reducing pawing in horses using positive reinforcement. Journal of Applied Behavior Analysis , 48 , 936–940.

Frewing, T. M., Rapp, J. T., & Pastrana, S. J. (2015). Using conditional percentages during free-operant stimulus preference assessments to predict the effects of preferred items on stereotypy preliminary findings. Behavior Modification , 39 , 740–765.

Fu, S. B., Penrod, B., Fernand, J. K., Whelan, C. M., Griffith, K., & Medved, S. (2015). The effects of modeling contingencies in the treatment of food selectivity in children with autism. Behavior Modification , 39 , 771–784.

Gardner, S. J., & Wolfe, P. S. (2015). Teaching students with developmental disabilities daily living skills using point-of-view modeling plus video prompting with error correction. Focus on Autism and Other Developmental Disabilities , 30 , 195–207.

Gilroy, S. P., Lorah, E. R., Dodge, J., & Fiorello, C. (2015). Establishing deictic repertoires in autism. Research in Autism Spectrum Disorders , 19 , 82–92.

Groskreutz, M. P., Peters, A., Groskreutz, N. C., & Higbee, T. S. (2015). Increasing play-based commenting in children with autism spectrum disorder using a novel script-frame procedure. Journal of Applied Behavior Analysis , 48 , 442–447.

Haq, S. S., & Kodak, T. (2015). Evaluating the effects of massed and distributed practice on acquisition and maintenance of tacts and textual behavior with typically developing children. Journal of Applied Behavior Analysis , 48 , 85–95.

Hayes, L. B., & Van Camp, C. M. (2015). Increasing physical activity of children during school recess. Journal of Applied Behavior Analysis , 48 , 690–695.

Hine, J. F., Ardoin, S. P., & Foster, T. E. (2015). Decreasing transition times in elementary school classrooms: Using computer-assisted instruction to automate intervention components. Journal of Applied Behavior Analysis , 48 , 495–510.

Kelley, M. E., Liddon, C. J., Ribeiro, A., Greif, A. E., & Podlesnik, C. A. (2015). Basic and translational evaluation of renewal of operant responding. Journal of Applied Behavior Analysis , 48 , 390–401.

Kodak, T., Clements, A., Paden, A. R., LeBlanc, B., Mintz, J., & Toussaint, K. A. (2015). Examination of the relation between an assessment of skills and performance on auditory–visual conditional discriminations for children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 52–70.

Knight, V. F., Wood, C. L., Spooner, F., Browder, D. M., & O’Brien, C. P. (2015). An exploratory study using science eTexts with students with Autism Spectrum Disorder. Focus on Autism and Other Developmental Disabilities , 30 , 86–99.

Kuhl, S., Rudrud, E. H., Witts, B. N., & Schulze, K. A. (2015). Classroom-based interdependent group contingencies increase children’s physical activity. Journal of Applied Behavior Analysis , 48 , 602–612.

Lambert, A. M., Tingstrom, D. H., Sterling, H. E., Dufrene, B. A., & Lynne, S. (2015). Effects of tootling on classwide disruptive and appropriate behavior of upper-elementary students. Behavior Modification , 39 , 413–430.

Lambert, J. M., Bloom, S. E., Samaha, A. L., Dayton, E., & Rodewald, A. M. (2015). Serial alternative response training as intervention for target response resurgence. Journal of Applied Behavior Analysis , 48 , 765–780.

Ledbetter-Cho, K., Lang, R., Davenport, K., Moore, M., Lee, A., Howell, A., . . . O’Reilly, M. (2015). Effects of script training on the peer-to-peer communication of children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 785–799.

Lee, G. P., Miguel, C. F., Darcey, E. K., & Jennings, A. M. (2015). A further evaluation of the effects of listener training on derived categorization and speaker behavior in children with autism. Research in Autism Spectrum Disorders , 19 , 72–81.

Lerman, D. C., Hawkins, L., Hillman, C., Shireman, M., & Nissen, M. A. (2015). Adults with autism spectrum disorder as behavior technicians for young children with autism: Outcomes of a behavioral skills training program. Journal of Applied Behavior Analysis , 48 , 233–256.

Mechling, L. C., Ayres, K. M., Foster, A. L., & Bryant, K. J. (2015). Evaluation of generalized performance across materials when using video technology by students with autism spectrum disorder and moderate intellectual disability. Focus on Autism and Other Developmental Disabilities , 30 , 208–221.

Miller, S. A., Rodriguez, N. M., & Rourke, A. J. (2015). Do mirrors facilitate acquisition of motor imitation in children diagnosed with autism? Journal of Applied Behavior Analysis , 48 , 194–198.

Mitteer, D. R., Romani, P. W., Greer, B. D., & Fisher, W. W. (2015). Assessment and treatment of pica and destruction of holiday decorations. Journal of Applied Behavior Analysis , 48 , 912–917.

Neely, L., Rispoli, M., Gerow, S., & Ninci, J. (2015). Effects of antecedent exercise on academic engagement and stereotypy during instruction. Behavior Modification , 39 , 98–116.

O’Handley, R. D., Radley, K. C., & Whipple, H. M. (2015). The relative effects of social stories and video modeling toward increasing eye contact of adolescents with autism spectrum disorder. Research in Autism Spectrum Disorders , 11 , 101–111.

Paden, A. R., & Kodak, T. (2015). The effects of reinforcement magnitude on skill acquisition for children with autism. Journal of Applied Behavior Analysis , 48 , 924–929.

Pence, S. T., & St Peter, C. C. (2015). Evaluation of treatment integrity errors on mand acquisition. Journal of Applied Behavior Analysis , 48 , 575–589. https://doi.org/10.1002/jaba.238

Peters, L. C., & Thompson, R. H. (2015). Teaching children with autism to respond to conversation partners’ interest. Journal of Applied Behavior Analysis , 48 , 544–562.

Peterson, K. M., Volkert, V. M., & Zeleny, J. R. (2015). Increasing self-drinking for children with feeding disorders. Journal of Applied Behavior Analysis , 48 , 436–441.

Protopopova, A., & Wynne, C. D. (2015). Improving in-kennel presentation of shelter dogs through response-dependent and response-independent treat delivery. Journal of Applied Behavior Analysis , 48 , 590–601.

Putnam, B. C., & Tiger, J. H. (2015). Teaching braille letters, numerals, punctuation, and contractions to sighted individuals. Journal of Applied Behavior Analysis , 48 , 466–471.

Quinn, M. J., Miltenberger, R. G., & Fogel, V. A. (2015). Using TAGteach to improve the proficiency of dance movements. Journal of Applied Behavior Analysis , 48 , 11–24.

Rispoli, M., Ninci, J., Burke, M. D., Zaini, S., Hatton, H., & Sanchez, L. (2015). Evaluating the accuracy of results for teacher implemented trial-based functional analyses. Behavior Modification , 39 , 627–653.

Rosales, R., Gongola, L., & Homlitas, C. (2015). An evaluation of video modeling with embedded instructions to teach implementation of stimulus preference assessments. Journal of Applied Behavior Analysis , 48 , 209–214.

Saini, V., Greer, B. D., & Fisher, W. W. (2015). Clarifying inconclusive functional analysis results: Assessment and treatment of automatically reinforced aggression. Journal of Applied Behavior Analysis , 48 , 315–330.

Saini, V., Gregory, M. K., Uran, K. J., & Fantetti, M. A. (2015). Parametric analysis of response interruption and redirection as treatment for stereotypy. Journal of Applied Behavior Analysis , 48 , 96–106.

Scalzo, R., Henry, K., Davis, T. N., Amos, K., Zoch, T., Turchan, S., & Wagner, T. (2015). Evaluation of interventions to reduce multiply controlled vocal stereotypy. Behavior Modification , 39 , 496–509.

Siegel, E. B., & Lien, S. E. (2015). Using photographs of contrasting contextual complexity to support classroom transitions for children with Autism Spectrum Disorders. Focus on Autism and Other Developmental Disabilities , 30 , 100–114.

Slocum, S. K., & Vollmer, T. R. (2015). A comparison of positive and negative reinforcement for compliance to treat problem behavior maintained by escape. Journal of Applied Behavior Analysis , 48 , 563–574.

Smith, K. A., Shepley, S. B., Alexander, J. L., Davis, A., & Ayres, K. M. (2015). Self-instruction using mobile technology to learn functional skills. Research in Autism Spectrum Disorders , 11 , 93–100.

Sniezyk, C. J., & Zane, T. L. (2015). Investigating the effects of sensory integration therapy in decreasing stereotypy. Focus on Autism and Other Developmental Disabilities , 30 , 13–22.

Speelman, R. C., Whiting, S. W., & Dixon, M. R. (2015). Using behavioral skills training and video rehearsal to teach blackjack skills. Journal of Applied Behavior Analysis , 48 , 632–642.

Still, K., May, R. J., Rehfeldt, R. A., Whelan, R., & Dymond, S. (2015). Facilitating derived requesting skills with a touchscreen tablet computer for children with autism spectrum disorder. Research in Autism Spectrum Disorders , 19 , 44–58.

Vargo, K. K., & Ringdahl, J. E. (2015). An evaluation of resistance to change with unconditioned and conditioned reinforcers. Journal of Applied Behavior Analysis , 48 , 643–662.

Vedora, J., & Grandelski, K. (2015). A comparison of methods for teaching receptive language to toddlers with autism. Journal of Applied Behavior Analysis , 48 , 188–193.

Wilder, D. A., Majdalany, L., Sturkie, L., & Smeltz, L. (2015). Further evaluation of the high-probability instructional sequence with and without programmed reinforcement. Journal of Applied Behavior Analysis , 48 , 511–522.

Wunderlich, K. L., & Vollmer, T. R. (2015). Data analysis of response interruption and redirection as a treatment for vocal stereotypy. Journal of Applied Behavior Analysis , 48 , 749–764.

Appendix B: Versions of the mean phase difference

In the initial proposal (Manolov & Solanas, 2013), MPD.2013 entails the following steps:

Estimate baseline trend as the average of the differenced baseline phase data: \( {b}_{1(D)}=\frac{\sum_{i=2}^{n_A}\left({y}_i-{y}_{i-1}\right)}{n_A-1} \).

Extrapolate baseline trend, adding the trend estimate \( {b}_{1(D)} \) to the last baseline phase data point \( {y}_{n_A} \) to predict the first intervention-phase data point \( {\widehat{y}}_{n_A+1} \). Formally, \( {\widehat{y}}_{n_A+1}={y}_{n_A}+{b}_{1(D)} \). This entails that the intercept of the baseline trend line is \( {b}_{0(MPD.2013)}={y}_{n_A}-{n}_A\times {b}_{1(D)} \).

Continue extrapolating, adding the trend estimate to the previously obtained forecast. Formally, \( {\widehat{y}}_{n_A+j}={\widehat{y}}_{n_A+j-1}+{b}_{1(D)};\ j=2,3,\dots, {n}_B \).

Obtain the MPD as the average difference between the actually obtained treatment data ( \( {y}_j \) ) and the treatment measurements as predicted from baseline trend ( \( {\widehat{y}}_j \) ): \( {MPD}_{2013}=\frac{\sum_{j=1}^{n_B}\left({y}_j-{\widehat{y}}_j\right)}{n_B} \).
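
The four steps translate directly into a few lines of R (a sketch; the function and variable names are ours):

# MPD.2013, transcribed from the steps above.
mpd_2013 <- function(y_A, y_B) {
  b1 <- mean(diff(y_A))                      # Step 1: slope from differenced baseline data
  n_B <- length(y_B)
  y_hat <- tail(y_A, 1) + b1 * seq_len(n_B)  # Steps 2-3: extrapolate from the last baseline point
  mean(y_B - y_hat)                          # Step 4: mean of actual minus predicted values
}

mpd_2013(y_A = c(3, 4, 4, 5), y_B = c(7, 8, 8, 9, 10))  # hypothetical baseline and intervention phases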

In its modified version (Manolov & Rochat, 2015), MPD.2015 entails the following steps:

Estimate baseline trend as the average of the differenced baseline phase data: the same \( {b}_{1(D)} \) defined previously.

Establish the pivotal point in the baseline at the crossing of \( Md(x)= Md\left(1,2,\dots, {n}_A\right) \) on the abscissa and \( Md(y)= Md\left({y}_1,{y}_2,\dots, {y}_{n_A}\right) \) on the ordinate.

Establish a fitted value at an existing baseline measurement occasion around \( Md(y) \). Formally, \( {\widehat{y}}_{\left\lfloor Md(x)\right\rfloor }= Md(y)-\left( Md(x)-\left\lfloor Md(x)\right\rfloor \right)\times {b}_{1(D)} \).

Fit the baseline trend to the whole baseline by subtracting the estimated baseline slope from, or adding it to, the fitted value obtained in the previous step, according to the measurement occasion.

Therefore, combining the two previous steps, the intercept of the baseline trend line is defined as \( {b}_{0(MPD.2015)}= Md(y)- Md(x)\times {b}_{1(D)} \).

Extrapolate the baseline trend into the treatment phase, starting from the last fitted baseline value: \( {\widehat{y}}_{n_A+1}={\widehat{y}}_{n_A}+{b}_{1(D)} \) .

Continue extrapolating, adding the trend estimate to the previously obtained forecast: \( {\widehat{y}}_{n_A+j}={\widehat{y}}_{n_A+j-1}+{b}_{1(D)};\ j=2,3,\dots, {n}_B \).

Obtain MPD as the difference between the actually obtained treatment data and the treatment measurements as predicted from baseline trend: \( {MPD}_{2015}=\frac{\sum_{j=1}^{n_B}\left({y}_j-{\widehat{y}}_j\right)}{n_B} \) .

We propose a third way of defining the intercept, namely in the same way as in the Theil–Sen estimator, that is, as the median of the differences between the actual data points and the slope multiplied by the measurement occasion: \( {b}_{0(TS)}= Md\left({y}_i-{b}_{1(D)}\times i\right),\ i=1,2,\dots, {n}_A \). Note that the slope is still estimated as in the original proposal (Manolov & Solanas, 2013).
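
For comparison, the three intercept definitions can be computed side by side for the common slope \( {b}_{1(D)} \) (a sketch; the function and names are ours):

# Three intercept definitions for the same differenced-data slope b1.
mpd_intercepts <- function(y_A) {
  n_A <- length(y_A)
  i   <- seq_len(n_A)
  b1  <- mean(diff(y_A))
  c(MPD.2013 = y_A[n_A] - n_A * b1,            # trend passes through the last baseline point
    MPD.2015 = median(y_A) - median(i) * b1,   # trend passes through the pivotal (median) point
    TheilSen = median(y_A - b1 * i))           # median residual, as in the Theil-Sen estimator
}

mpd_intercepts(c(3, 5, 4, 6, 8))  # hypothetical baseline phase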


About this article

Manolov, R., Solanas, A., & Sierra, V. Extrapolating baseline trend in single-case data: Problems and tentative solutions. Behavior Research Methods, 51, 2847–2869 (2019). https://doi.org/10.3758/s13428-018-1165-x


Published: 27 November 2018

Issue Date: December 2019

DOI: https://doi.org/10.3758/s13428-018-1165-x


  • Single-case designs
  • Extrapolation
  • Forecasting
  • Open access
  • Published: 29 May 2023

Extrapolating empirical long-term survival data: the impact of updated follow-up data and parametric extrapolation methods on survival estimates in multiple myeloma

  • LJ Bakker 1 , 2 ,
  • FW Thielen 1 , 2 ,
  • WK Redekop 1 , 2 ,
  • CA Uyl-de Groot 1 , 2 &
  • HM Blommestein 1 , 2  

BMC Medical Research Methodology, volume 23, Article number: 132 (2023)


Background

In economic evaluations, survival is often extrapolated to smooth out the Kaplan-Meier estimate and because the available data (e.g., from randomized controlled trials) are often right censored. The accuracy of extrapolated results can depend on the length of follow-up and the assumptions made about the survival hazard. Here, we analyze the accuracy of different extrapolation techniques while varying the data cut-off to estimate long-term survival in newly diagnosed multiple myeloma (MM) patients.

Methods

Empirical data were available from a randomized controlled trial and a registry for MM patients treated with melphalan + prednisone, thalidomide-, and bortezomib-based regimens. Standard parametric and spline models were fitted while follow-up was artificially reduced by introducing database locks. The maximum follow-up for these locks varied from 3 to 13 years. The extrapolated (conditional) restricted mean survival time (RMST) was compared to the Kaplan-Meier RMST, and models were selected according to statistical tests and visual fit.

Results

For all treatments, the RMST error decreased as follow-up and the absolute number of events increased and censoring decreased. The decline in RMST error was greatest when maximum follow-up exceeded six years. However, even when censoring was low, the extrapolated RMST conditional on surviving until the start of extrapolation could still deviate considerably from the Kaplan-Meier estimate.

Conclusions

We demonstrate that both standard parametric and spline models can be worthy candidates when extrapolating survival for the populations examined. Nevertheless, researchers and decision makers should be wary of uncertainty in the results even when censoring has decreased and the number of events has increased.


Introduction

The data available for assessing the efficacy of novel healthcare technologies in oncology often come from randomized controlled trials (RCTs). However, RCTs do not provide all of the information necessary for assessing the cost-effectiveness of these technologies. RCTs often have limited follow-up times, and thus increased censoring at market approval, whereas a lifetime horizon is usually recommended in best-practice guidelines for economic evaluations [1, 2]. This lifetime horizon ensures that all differences between the technologies compared (i.e., short- and long-term) are accounted for. Since a lifetime horizon almost always exceeds the follow-up duration of RCTs or of other data sources used in economic evaluations (e.g., registries), empirical survival data are typically right censored [3]. For the novel treatment assessed, this can result in considerable uncertainty regarding the parametric survival function. For the comparator, the uncertainty depends on whether the treatment administered in the trial is representative of current care and whether alternative sources of data are available to inform long-term survival.

The percentage of patients that are right censored varies substantially with the type of disease [4]. For hematological malignancies, for instance, the average percentage censored was 84% in initial publications and 54% in updated publications, whereas for other malignancies it varied from 28 to 73% in initial publications and from 13 to 47% in updated results [4]. With novel immunotherapies such as daratumumab and lenalidomide prolonging survival for multiple myeloma patients [5, 6], this issue has become even more prominent in recent years.

To address the issue of right censoring, parametric survival functions and other extrapolation methods are used to estimate long-term survival, making assumptions about the underlying hazard function for the extrapolated period based on the data observed [7]. Many types of models can be used to extrapolate survival from empirical evidence. Standard parametric models (e.g., Weibull, lognormal) are generally included, but it is recommended to also consider more flexible models (e.g., spline and parametric mixture models) that allow for multiple turning points in the hazard function [8]. Flexible parametric spline models, for instance, were found in a previous study by Gray et al. to predict 10-year survival quite accurately for large cohorts of registry patients for which there was little uncertainty in the data [9].

Assessing the suitability of models and selecting the best-fitting model for extrapolation can be done through inspection of log cumulative hazard plots, inspection of visual fit, and statistical tests (e.g., the Akaike information criterion (AIC) or Bayesian information criterion (BIC)) [10]. Real-world data may also guide model selection by assessing whether extrapolated results are plausible compared with patient survival outside the context of a clinical trial [11]. Prior research has suggested that model selection should consider the length of follow-up of the available data [12]. In a case study, Bullement et al. assessed the accuracy of extrapolations for four data-cuts of the JAVELIN Merkel 200 trial, which studied the treatment effect of avelumab in patients with Merkel cell carcinoma. The authors found that extrapolations using longer follow-up (e.g., 36 months) favored more flexible spline-based models [12].

Despite this guidance, selecting a well-fitting model and analyzing the uncertainty surrounding model choice remain challenging, and several publications have already assessed the accuracy of extrapolations (e.g., [4, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]). These studies vary in the type of disease treated (e.g., melanoma, lung cancer), the type of treatment evaluated (e.g., immunotherapy, surgery), the types of models compared, the duration of empirical follow-up, the availability of individual patient data (IPD) or of data recreated from published RCTs, the sample size, and the inclusion of external data sources. The overall accuracy of extrapolations has been found to be correlated with the percentage censored [4]. Everest et al. conducted a systematic review to find published RCTs with initial and updated results. For the 32 eligible RCTs, the accuracy of extrapolations based on the initial publication was assessed after reconstructing individual patient data and fitting standard parametric models. The authors found that the difference between the extrapolated survival and the empirical survival increased as the percentage of patients censored increased [4].

In this study, we aim to compare extrapolation methods to assess the relationship between data maturity and the accuracy of survival projections when several data sources are available. Both standard parametric models and spline models were fitted to RCT and patient-registry data from patients with multiple myeloma while varying the maximum data cut-off (DCO) times. These extrapolations were not informed by alternative sources of information, the assumption being that the dataset at hand, with its particular DCO, was the best source available for extrapolation. The resulting extrapolations were compared to long-term empirical survival to determine the best candidate models. The results of our study may assist researchers in assessing whether IPD are sufficiently mature for cost-effectiveness analysis and guide their decision-making concerning the sensitivity analyses that should be conducted.

Patient population & data

All details on the data sources, treatment arms, inclusion dates, and data cuts can be found in Table 1. IPD from an RCT performed by the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON) and data from the Dutch National Cancer Registry (NKR) were used to assess the accuracy of extrapolations against long-term empirical survival. The HOVON49 study compared melphalan + prednisone (HOVON - MP) with melphalan + prednisone + thalidomide (HOVON - Thal) in newly diagnosed multiple myeloma patients > 65 years of age [19]. Patients were included between September 2002 and July 2007, and long-term follow-up was available until December 2015.

Data from the NKR registry, including the Dutch Population based HAematological Registry for Observational Studies (PHAROS), were also used [20, 21]. From the PHAROS database, newly diagnosed patients who received first-line treatment with MP (PHAROS - MP), thalidomide (PHAROS - Thal), or bortezomib (PHAROS - Bort) based regimens were included. Patients receiving melphalan + prednisone + bortezomib (NKR+ - MPV) in the NKR+ data were also included as a separate cohort. The mean age of the PHAROS - Bort cohort was slightly lower (Table 1) because, at the time, bortezomib was not the recommended first-line treatment for all multiple myeloma patients: in 2006–2011, bortezomib-based regimens were mainly prescribed to younger patients, followed by patients with kidney failure [22]. All dates for inclusion and exclusion can be found in Table 1 and Figure S1. For all patients in the PHAROS and NKR+ databases, follow-up was included up until January 2022.

Overall survival was extrapolated using data sets that varied in the maximum follow-up time after the start of patient inclusion. For the MP arm of the HOVON49 study, for instance, four datasets were created. In the first, patients were included from September 2002 until September 2005; thus, the maximum follow-up was three years, and only patients enrolled before September 2005 were included. For the second HOVON - MP dataset, all enrolled patients were included (since enrollment ended in July 2007), but the final follow-up date was six years after the start of enrollment (i.e., September 2008), and so forth.

The DCOs (i.e., < 3, 6, 8, 10, and 13 years) were chosen based on previously reported results from Everest et al. and Bullement et al. [4, 12] and according to the maximum potential follow-up in the dataset. For instance, if inclusion started in 2002 and the DCO was 2005, the longest a patient could have been followed was 3 years (Table 1). The minimum potential follow-up time for all patients included in the data is also reported in Table 1. For instance, when the maximum follow-up was < 6 years for the HOVON - MP arm, the minimum potential follow-up for all patients included was 1 year, and when the maximum follow-up was < 8 years, the minimum potential follow-up was 3 years.
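
A minimal sketch of such a database lock (ours; the column names are hypothetical): patients enrolled after the cut-off are excluded, and follow-up extending beyond it is administratively censored.

# Apply an artificial data cut-off (DCO) to individual patient data.
# Assumed columns: enrollment (Date), time (days of follow-up), status (1 = event).
apply_dco <- function(df, dco) {
  df <- df[df$enrollment <= dco, ]                    # drop patients enrolled after the lock
  beyond <- df$enrollment + df$time > dco             # follow-up extending past the lock
  df$time[beyond] <- as.numeric(dco - df$enrollment[beyond])  # truncate follow-up at the lock
  df$status[beyond] <- 0                              # administratively censor those patients
  df
}

# Example: a three-year lock after the start of inclusion in September 2002.
# locked <- apply_dco(ipd, as.Date("2005-09-01"))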

Fitted models

The models used to extrapolate results included all commonly used standard parametric models recommended in Technical Support Documents 14 and 21 from the National Institute for Health and Care Excellence Decision Support Unit (i.e., exponential, Weibull, Gompertz, gamma, log-logistic, lognormal, and generalized gamma) as well as spline models. Spline-based models are flexible models in which the survival function is transformed by a link function using natural cubic splines [23]. Natural cubic splines impose monotonicity in the tails, where the number at risk is low, whereas at earlier time points monotonicity is guaranteed by the data density if the sample size is reasonable [23]. The transformed survival function is then smoothed, reducing the risk of sudden deviations, especially in the tail. Knots are placed at the extreme values of the survival times and internally [23]. Here, the number of internal knots was varied from one to three, and hazard, odds, and normal scales were used.
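
A sketch of fitting this model set with the flexsurv package (the package the authors report using); the lung data from the survival package stands in for the IPD, which is not public:

library(flexsurv)   # loads survival as a dependency

# Standard parametric models.
dists <- c("exp", "weibull", "gompertz", "gamma", "llogis", "lnorm", "gengamma")
parametric_fits <- lapply(dists, function(d)
  flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = d))
names(parametric_fits) <- dists

# Spline models: 1-3 internal knots on the hazard, odds, and normal scales.
grid <- expand.grid(k = 1:3, scale = c("hazard", "odds", "normal"),
                    stringsAsFactors = FALSE)
spline_fits <- lapply(seq_len(nrow(grid)), function(i)
  flexsurvspline(Surv(time, status) ~ 1, data = lung,
                 k = grid$k[i], scale = grid$scale[i]))
names(spline_fits) <- paste0("spline_", grid$scale, "_k", grid$k)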

Model selection & accuracy of predictions

In the results, we present the models that had the lowest AIC, the lowest BIC, and the best visual fit based on survival and hazard plots. To select the models with the best visual fit, two authors (LB, HB) reviewed all curves independently. Curves were selected on four criteria: fit to the Kaplan-Meier survival curve, feasibility of the extrapolated survival, fit to the smoothed hazard, and feasibility of the extrapolated hazard. If, based on these four criteria, multiple models were still eligible as 'best' fit, the model with the smallest number of parameters to be estimated was selected; for instance, an exponential distribution (one parameter) would be preferred over a generalized gamma distribution (three parameters). After individual selection, any remaining discrepancies were resolved by discussion to reach consensus. A third author (FT) participated in these discussions to resolve any ties in model selection; in preparation for the discussion, the third author randomly assessed one-third of all curves according to the criteria noted above.
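
Ranking the candidates by AIC and BIC is then straightforward (a sketch; flexsurvreg reports AIC directly, and BIC can be derived from the stored log-likelihood, number of parameters, and sample size):

# Information criteria for a named list of flexsurvreg/flexsurvspline fits.
model_ic <- function(fits) {
  data.frame(
    model = names(fits),
    AIC   = sapply(fits, function(f) f$AIC),
    BIC   = sapply(fits, function(f) -2 * f$loglik + f$npars * log(f$N)),
    row.names = NULL
  )
}

# Using the named lists from the previous sketch:
# ic <- model_ic(c(parametric_fits, spline_fits))
# ic[order(ic$AIC), ]   # best-fitting models by AIC first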

The accuracy of predictions was estimated using the restricted mean survival time (RMST). The RMST equals the mean survival restricted to a maximum time t rather than over a lifetime; it can be calculated as the area under the survival curve (AUC) up until time t, obtained by integration [24]. All models were fitted and RMSTs estimated using the flexsurv package (version 2.1) in R [25]. First, a lifetime RMST was estimated for all cohorts, with the AUC of the extrapolated survival curves evaluated over a 35-year time horizon. The extrapolated survival was then compared to the empirical survival, with the RMST horizon depending on the length of follow-up in the empirical data (Table 1). The RMST error was defined as the difference between the RMST of the extrapolated curves and the RMST of the Kaplan-Meier estimate. In a second set of analyses, the RMST was limited to the extrapolated portion of the survival curve: RMST was estimated conditional on surviving up to the point from which extrapolation was required. Thus, for the data set with a maximum of three years of follow-up, RMST was estimated conditional on having survived 3 years. Variations in RMST error were also plotted according to the percentage censored, the absolute number of events, and the type of model (i.e., standard parametric or spline). For the spline models, knots were automatically placed at the centiles, following the recommendations by Royston & Parmar, when using the flexsurv package [23]. R version 4.0.3 was used for all analyses, together with the packages flexsurv, muhaz, survRM2, and lme4.
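
A sketch of how these quantities can be computed; flexsurv supports type = "rmst" in summary(), the 35-year and 8-year horizons follow the text, and fits and cohort_full are illustrative names rather than the authors' objects.

# Lifetime RMST: area under the extrapolated survival curve to 35 years
rmst_lifetime <- summary(fits[["weibull"]], type = "rmst", t = 35)[[1]]$est

# Empirical RMST: area under the Kaplan-Meier curve of the full-follow-up
# data, restricted to a horizon supported by the data (here 8 years)
km <- survfit(Surv(time, status) ~ 1, data = cohort_full)
rmst_km <- summary(km, rmean = 8)$table["rmean"]

# RMST error: extrapolated minus empirical RMST over the same horizon
rmst_err <- summary(fits[["weibull"]], type = "rmst", t = 8)[[1]]$est - rmst_km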

Ethical approval

Approval for use of the PHAROS and NKR+ data was granted through the supervisory committee of the Dutch Integral Cancer Registry. Approval for secondary use of the data from the HOVON49 study was provided by HOVON.

Results

Overall, 1853 patients were included, who received a variety of treatment regimens either in a regular clinical care setting (PHAROS & NKR+) or in an RCT (HOVON) (Table S1). For all patient cohorts, the percentage censored was initially high but decreased quickly with longer follow-up (Table S1). Kaplan-Meier estimates and the numbers at risk at the respective time points were plotted grouped according to the treatment received (i.e., MP, thalidomide, or bortezomib-based), the data source (i.e., HOVON, PHAROS, NKR+), and the maximum follow-up (i.e., 3, 6, 8, 10, and 13 years) (Fig. 1, S2-S7).

Figure 1: Long-term overall survival of patients treated with bortezomib-based regimens (PHAROS and NKR+ registry data), MP-based regimens (HOVON RCT and PHAROS registry data), and thalidomide-based regimens (HOVON RCT and PHAROS registry data)

Lifetime RMST

The extrapolated lifetime RMST varied considerably according to the data source and the types of models fitted (Figure S8). Overall, the variation in the extrapolated lifetime RMST was high for models estimated with limited follow-up. For example, for HOVON-Thal with a maximum follow-up of 3 years, the RMST varied from 5 years to 22.5 years. The variation for HOVON-MP, PHAROS-MP, and NKR+-MPV was considerably smaller than for all other arms (Figures S8, S9), ranging from 2.5 years to less than 10 years. The survival estimates declined considerably as the percentage censored decreased (Fig. 2), but also as the absolute number of events increased, for almost all models (Figure S10).

Figure 2: Lifetime RMST according to the percentage censored and the type of model

Observed and estimated RMST from the RCT

In Table S2 we present a comparison between the observed long-term survival (i.e., 11 years) and the estimated RMST for four different data cuts using data from the HOVON RCT, with RMST estimates restricted to the maximum follow-up. These mean survival estimates were considerably smaller than those for the 35-year time horizon, but the uncertainty was also large when follow-up was short. Standard parametric models were most often selected based on AIC, BIC, and visual fit, whereas no clear preference for either standard parametric or spline models emerged for the model with the lowest RMST error. Curves often overlapped, and the differences between curves were often negligible, making selection based on model fit difficult. We also observed that the RMST error of the model selected using BIC was almost always lower than that of the models selected based on AIC and visual fit. However, the BIC-selected model was usually the exponential distribution, which tended to under- or overestimate the hazard in the earlier months and do the opposite in later months.

The RMST error was higher for the short-term follow-up (< 3 years), for which the censoring percentages were also higher (HOVON-MP: 73%, HOVON-Thal: 77%) relative to the number of events (HOVON-MP: 29, HOVON-Thal: 25) (Table S1). As the length of follow-up increased, the error decreased, with the largest absolute difference in RMST occurring between < 3 years and < 6 years of follow-up, which coincided with a large reduction in censoring (HOVON-MP: from 73% to 38%, HOVON-Thal: from 77% to 48%). Confidence intervals of the selected models almost always overlapped.

Observed and estimated RMST from registries

For the registries, the maximum follow-up was slightly longer, and the RMST was therefore estimated for 14 years (Table S3). Here, the model selected for best visual fit changed less often as follow-up increased, and standard parametric models were almost always selected based on AIC, BIC, and best visual fit. For the NKR+ data, the absolute RMST error was much smaller due to the shorter time horizon for which the RMST was estimated (i.e., 8 years).

Overall, standard parametric models most often had the smallest absolute RMST error (i.e., in 67% of cases), but as censoring decreased, the model with the lowest absolute RMST error was more often a spline model (i.e., for PHAROS-MP and NKR+-MPV). The error of extrapolations based on the datasets with short follow-up (< 3 years) was large, irrespective of the sample size of the dataset and the percentage censored. The error decreased as follow-up increased and censoring consequently decreased.

The RMST error for all models decreased as follow-up increased (Figures S11-S14). The RMST errors for all treatments (regardless of sample size, censoring, events, and the RMST time horizon) were low when 8 or more years of follow-up were available (Figures S11-S14). Decreased censoring and more events coincided with smaller RMST errors (Fig. 3, S15-S17).

Figure 3: RMST error according to the percentage censored and the type of model. RMST is estimated for a time horizon of 8 years and a maximum follow-up of 3 and 6 years

RMST error conditional on survival

For the RMST error conditional on having survived until extrapolation, the decline was less pronounced as censoring decreased and the number of events increased (Fig. 4, S18). Moreover, the spread in error was much wider for standard parametric models than for spline models (Fig. 4, S18, S19). In Fig. 4, the spread in the conditional RMST error between different models narrows for the data cuts with lower percentages censored. However, even at the lowest percentages of censoring (e.g., 30–40%), there were some considerable deviations of the extrapolated RMST from the KM estimate; this was also observed when the number of events was higher (e.g., > 100 events) (Figure S18).

Figure 4: The RMST error conditional on surviving until extrapolation, plotted according to the percentage censored and the type of model. RMST is estimated for a time horizon of 8 years and a maximum follow-up of 3 and 6 years
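
The conditional quantity reported here (RMST restricted to the extrapolated part of the curve) can be sketched as the integral of S(t)/S(t0) from the start of extrapolation t0 to the horizon tau, approximated by the trapezoidal rule; the function below is an illustration, not the authors' implementation.

cond_rmst <- function(fit, t0, tau, n = 1000) {
  tt <- seq(t0, tau, length.out = n)
  s  <- summary(fit, type = "survival", t = tt, ci = FALSE)[[1]]$est
  sum(diff(tt) * (head(s, -1) + tail(s, -1)) / 2) / s[1]  # trapezoid, scaled by S(t0)
}

# e.g., for the 3-year data cut with an 8-year horizon:
# cond_rmst(fits[["weibull"]], t0 = 3, tau = 8)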

Discussion

In this study, we analyzed the accuracy of extrapolations for a non-solid tumor while varying the percentage censored, using trial and registry data representative in sample size of those generally available to health economic researchers. We compared the RMST estimated from extrapolated survival with the long-term Kaplan-Meier estimate in patients with multiple myeloma for a variety of treatments, data sources, and maximum follow-up times. When reimbursement dossiers are drafted, the follow-up of patients included in the pivotal trial is often limited. Insight into the consequences of the uncertainty of extrapolations, and of the different models fitted, is therefore essential, since extrapolations are used to inform (conditional) reimbursement decisions of policy makers. This is an even bigger issue for clinical trials of novel immunotherapies such as daratumumab, where the percentage censored for overall survival is high [5, 26].

Our results align with those of Everest et al., in that the RMST error increases as the percentage censored increases. For trials of hematologic diseases, the average percentage censored was 84% for initial publications and 54% for final publications [4]. Although Fig. 3 and S15 show that the RMST error has declined substantially once the percentage censored reaches 54% or lower, considerable uncertainty can remain in these extrapolations. This was more pronounced when the RMST error was estimated conditional on having survived until extrapolation (Fig. 4). Decision makers should critically review whether reimbursement decisions can be made when the extrapolated survival is based on high percentages censored. In the economic evaluations that support these decisions, the models fitted should be those relevant to the disease at hand, informed by clinical expertise. Sensitivity analyses adopted by health economic researchers can then demonstrate the potential impact of this uncertainty on cost-effectiveness, both when the percentage censored is high and when it is low.

In this study, we found no conclusive evidence that standard parametric models are better than spline models or vice versa. The highest absolute RMST error was regularly seen with a standard parametric model. This suggests that uncertainty analyses for health economic evaluations that include all standard parametric models could adequately capture the extent (i.e., upper and lower limits) of the uncertainty in the incremental cost-effectiveness ratio. The relationship between the percentage censored and the RMST error further underscores the need to identify methods that lead to the lowest RMST error even when the percentage censored is high. Further research should assess whether spline models perform better or worse when the percentage censored is large and the absolute number of events is small.

Limitations

This study focused on the RMST error as an outcome measure, which enables a comparison between extrapolated and observed survival. There are, however, some drawbacks to this outcome measure. First, underestimation and overestimation over time can offset each other and ultimately result in a relatively small RMST error. This aligns with results from a prior study in which large cohorts of registry data were used to extrapolate 10-year survival [9]: Gray et al. observed that the exponential distribution both under- and overestimated the hazard, resulting in a low RMST error [9]. Second, obtaining the RMST requires a maximum time. While we could implement a lifetime horizon for estimating the RMST, we were bound by the observation time for calculating the RMST error, which differed across data sources.

Another limitation of the outcome used is that the Kaplan-Meier estimate is itself an estimate of the true survival function for a given cohort of patients. Although inherent to this kind of research, the RMST error could be influenced by the decreasing number at risk as time progresses. This is, for instance, reflected in the conditional survival estimated for HOVON-MP with > 10 years of follow-up, where none of the few patients remaining in the sample died between 10 and 11 years of follow-up.

Overall, the cohort sizes in our study were relatively small (for comparison, the smallest cohort in Gray et al. was N = 5407 [9]), which increases the uncertainty in extrapolated survival. This may also (partially) explain why our findings differ from those of Gray et al., who found spline models to perform well even for short follow-up times. While larger cohorts are preferred and might be available for some treatments, our sample sizes are representative of the clinical trials in hematology generally used as input for economic evaluations [5, 27, 28]. This makes our research applicable to current practice, where health economic modelling is often performed using data from RCTs of similar sample size. Another limitation was the heterogeneity in the PHAROS-Bort cohort: the considerable uncertainty in the extrapolations for this cohort might be (partially) explained by its small sample size, but perhaps also by its heterogeneity. Due to the small sample size, further stratification by age was not feasible, but it would be recommended when such variation is present in an economic evaluation.

We employed commonly used parametric and spline models and did not consider more recent and complex models such as cure, parametric mixture, and landmark models [8, 15]. In Technical Support Document 21 from the National Institute for Health and Care Excellence Decision Support Unit, Rutherford et al. provide recommendations for their appropriate use and, although we did not include them in this analysis, they could be a relevant addition, for instance when modelling survival for potentially curative treatments (e.g., CAR-T) [8]. Another topic on which an increasing amount of research is available is the inclusion of external data (e.g., registry data, national statistics). Including such external data to correct for excessive predicted survival has been recommended when extrapolating survival from RCTs [7, 11]. Although this can sometimes reduce the overestimation of survival, it was beyond the scope of this study.

The generalizability of our findings to other disease areas, particularly other hematological malignancies for which little evidence on the accuracy of extrapolations is available, will strongly depend on the similarities between the populations studied. The six datasets used in this study differ in the types of patients included and the treatments administered, and hence in their hazard functions; the generalizability of these findings to other hematological malignancies will depend strongly on these features.

Conclusions

In this study, we compared the extrapolated survival of multiple myeloma patients to prolonged empirical survival for a wide variety of DCOs, using data from an RCT and from registries. Uncertainty in extrapolations can have a large impact on the use of healthcare services when the error in long-term survival is large and leads to incorrect conclusions for decision makers.

We found that the RMST error can become quite small for both standard parametric and spline models, but also that the RMST error increases for all models as censoring increases. The error in RMST for the extrapolated period only also decreased as the percentage censored decreased and the number of events increased; however, this reduction was much less pronounced.

Health economic researchers should consider a variety of models in their (uncertainty) analyses when extrapolating survival in economic evaluations. Although the RMST error is highest when the percentage censored is high, careful uncertainty analysis also seems warranted when longer follow-up is available.

Data Availability

The data that support the findings of this study are available from the Dutch National Cancer Registry (IKNL) and the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON). Data are available upon reasonable request through the corresponding author (LB), on the condition that permission for access is granted by IKNL and HOVON.

Abbreviations

AIC: Akaike Information Criterion

AUC: Area under the curve

BIC: Bayesian Information Criterion

DCO: Data cut-off

IPD: Individual patient data

HOVON: Dutch Haemato-Oncology Foundation for Adults in the Netherlands

LCI: Lower confidence interval

MP: Melphalan + prednisone

NKR: Dutch National Cancer Registry

PHAROS: Dutch Population based HAematological Registry for Observational Studies

RCT: Randomized controlled trial

RMST: Restricted mean survival time

UCI: Upper confidence interval

References

Sharma D, Aggarwal AK, Downey LE, Prinja S. National healthcare economic evaluation guidelines: a cross-country comparison. PharmacoEconomics-Open. 2021 Sep;5(3):349–64.

Dutch Pharmacoeconomic Guidelines [Internet]. Diemen: National Health Care Institute, the Netherlands. Available from: https://www.zorginstituutnederland.nl/publicaties/publicatie/2016/02/29/richtlijn-voor-het-uitvoeren-van-economische-evaluaties-in-de-gezondheidszorg.

Latimer NR. Survival analysis for economic evaluations alongside clinical trials—extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Med Decis Making. 2013 Aug;33(6):743–54.

Everest L, Blommaert S, Chu RW, Chan KK, Parmar A. Parametric survival extrapolation of early survival data in economic analyses: a comparison of projected versus observed updated survival. Value in Health. 2021 Nov 24.

Mateos MV, Cavo M, Blade J, Dimopoulos MA, Suzuki K, Jakubowiak A, Knop S, Doyen C, Lucio P, Nagy Z, Pour L. Overall survival with daratumumab, bortezomib, melphalan, and prednisone in newly diagnosed multiple myeloma (ALCYONE): a randomised, open-label, phase 3 trial. The Lancet. 2020 Jan 11;395(10218):132–41.

Jackson GH, Davies FE, Pawlyn C, Cairns DA, Striha A, Collett C, Hockaday A, Jones JR, Kishore B, Garg M, Williams CD. Lenalidomide maintenance versus observation for patients with newly diagnosed multiple myeloma (Myeloma XI): a multicentre, open-label, randomised, phase 3 trial. The Lancet Oncology. 2019 Jan 1;20(1):57–73.

Jackson C, Stevens J, Ren S, Latimer N, Bojke L, Manca A, Sharples L. Extrapolating survival from randomized trials using external data: a review of methods. Med Decis Making. 2017 May;37(4):377–90.

Rutherford MJ, Lambert PC, Sweeting MJ, Pennington R, Crowther MJ, Abrams KR, Latimer NR. NICE DSU Technical Support Document 21. Flexible Methods for Survival Analysis. Department of Health Sciences, University of Leicester, Leicester, UK. 2020 Jan 23:1–97.

Gray J, Sullivan T, Latimer NR, Salter A, Sorich MJ, Ward RL, Karnon J. Extrapolation of survival curves using standard parametric models and flexible parametric spline models: comparisons in large registry cohorts with advanced cancer. Med Decis Making. 2021 Feb;41(2):179–93.

Latimer N. NICE DSU technical support document 14: survival analysis for economic evaluations alongside clinical trials-extrapolation with patient-level data. Rep Decis Support Unit. 2011 Jun.

Vickers A. An evaluation of survival curve extrapolation techniques using long-term observational cancer data. Med Decis Making. 2019 Nov;39(8):926–38.

Bullement A, Willis A, Amin A, Schlichting M, Hatswell AJ, Bharmal M. Evaluation of survival extrapolation in immuno-oncology using multiple pre-planned data cuts: learnings to aid in model selection. BMC Med Res Methodol. 2020 Dec;20(1):1–4.

Davies C, Briggs A, Lorgelly P, Garellick G, Malchau H. The “hazards” of extrapolating survival curves. Med Decis Making. 2013 Apr;33(3):369–80.

Kearns B, Stevenson MD, Triantafyllopoulos K, Manca A. Comparing current and emerging practice models for the extrapolation of survival data: a simulation study and case-study. BMC Med Res Methodol. 2021 Dec;21(1):1–1.

Bullement A, Latimer NR, Gorrod HB. Survival extrapolation in cancer immunotherapy: a validation-based case study. Value in Health. 2019 Mar 1;22(3):276–83.

Ouwens MJ, Mukhopadhyay P, Zhang Y, Huang M, Latimer N, Briggs A. Estimating lifetime benefits associated with immuno-oncology therapies: challenges and approaches for overall survival extrapolations. PharmacoEconomics. 2019 Sep;37(9):1129–38.

Gibson E, Koblbauer I, Begum N, Dranitsaris G, Liew D, McEwan P, Monfared AA, Yuan Y, Juarez-Garcia A, Tyas D, Lees M. Modelling the survival outcomes of immuno-oncology drugs in economic evaluations: a systematic approach to data analysis and extrapolation. PharmacoEconomics. 2017 Dec;35(12):1257–70.

Lanitis T, Proskorovsky I, Ambavane A, Hunger M, Zheng Y, Bharmal M, Phatak H. Survival analysis in patients with metastatic merkel cell carcinoma treated with Avelumab. Advances in therapy. 2019 Sep;36(9):2327–41.

Wijermans P, Schaafsma M, Termorshuizen F, Ammerlaan R, Wittebol S, Sinnige H, Zweegman S, van Marwijk Kooy M, Van Der Griend R, Lokhorst H, Sonneveld P. Phase III study of the value of thalidomide added to melphalan plus prednisone in elderly patients with newly diagnosed multiple myeloma: the HOVON 49 Study. Journal of Clinical Oncology. 2010 Jul 1;28(19):3160-6.

Blommestein HM, Franken MG, Uyl-de Groot CA. A practical guide for using registry data to inform decisions about the cost effectiveness of new cancer drugs: lessons learned from the PHAROS registry. PharmacoEconomics. 2015 Jun;33(6):551–60.

Verelst SGR, Blommestein HM, De Groot S, Gonzalez-McQuire S, DeCosta L, de Raad JB, Uyl-de Groot CA, Sonneveld P. Long-term outcomes in patients with multiple myeloma: a retrospective analysis of the Dutch Population-based HAematological Registry for Observational Studies (PHAROS). Hemasphere. 2018 May 4;2(4):e45. doi: https://doi.org/10.1097/HS9.0000000000000045. PMID: 31723779; PMCID: PMC6746001.

Blommestein H, Uyl-de Groot C, Visser O, Oerlemans S, Verelst S, van den Broek E, Issa D, Aarts M, Louwman M, Sonneveld P, Postuma W, Coebergh JW, van de Poll L, Huijgens P. Impact of new systemic treatments of patients with hematological malignancies in the Netherlands: population-based cohort studies of process and outcome as a basis for assessments of cost-effectiveness. Report, PHAROS, Netherlands; 2014.

Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional‐odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in medicine. 2002 Aug 15;21(15):2175–97.

Royston P, Parmar MK. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in medicine. 2011 Aug 30;30(19):2409-21.

Jackson CH. Flexsurv: a platform for parametric survival modeling in R. Journal of statistical software. 2016 May 12;70.

Palumbo A, Chanan-Khan A, Weisel K, Nooka AK, Masszi T, Beksac M, Spicka I, Hungria V, Munder M, Mateos MV, Mark TM. Daratumumab, bortezomib, and dexamethasone for multiple myeloma. New Engl J Med 2016 Aug 25;375(8):754–66.

Facon T, Kumar SK, Plesner T, Orlowski RZ, Moreau P, Bahlis N, Basu S, Nahi H, Hulin C, Quach H, Goldschmidt H. Daratumumab, lenalidomide, and dexamethasone versus lenalidomide and dexamethasone alone in newly diagnosed multiple myeloma (MAIA): overall survival results from a randomised, open-label, phase 3 trial. The Lancet Oncology. 2021 Nov 1;22(11):1582-96.

Zweegman S, van der Holt B, Mellqvist UH, Salomo M, Bos GM, Levin MD, Visser-Wisselaar H, Hansson M, van der Velden AW, Deenik W, Gruber A. Melphalan, prednisone, and lenalidomide versus melphalan, prednisone, and thalidomide in untreated multiple myeloma. Blood, The Journal of the American Society of Hematology. 2016 Mar 3;127(9):1109-16.


Acknowledgements

This study used data from the Dutch Cancer Registry (IKNL), PHAROS, and the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON). The authors are grateful to the various registration teams of the Netherlands Cancer Registry and HOVON for the data collection and delivery.

The authors received no financial support for this research.

Author information

Authors and affiliations

Erasmus School of Health Policy and Management, Erasmus University Rotterdam, P.O. Box 1738, Rotterdam, 3000 DR, The Netherlands

LJ Bakker, FW Thielen, WK Redekop, CA Uyl-de Groot & HM Blommestein

Erasmus Centre for Health Economics Rotterdam, Erasmus University, Rotterdam, The Netherlands


Contributions

Concept and design: LB, HB, FT, CUG, WR. Acquisition of data: LB, HB. Analysis and interpretation of data: HB, LB, WR, FT, CUG. Drafting of the manuscript: HB, LB, WR, FT, CUG. Critical revision of the paper for important intellectual content: HB, LB, WR, FT, CUG. Statistical analysis: LB. Supervision: HB, CUG.

Corresponding author

Correspondence to LJ Bakker .

Ethics declarations

Ethical approval and consent to participate

Neither obtaining informed consent from patients nor approval by a medical ethics committee is obligatory for this type of observational study containing no directly identifiable data (art. 9.2 sub j General Data Protection Regulation jo. art. 24 Dutch GDPR Implementation Act). Administrative permission for use of the anonymized data from the Netherlands Cancer Registry (NCR) was granted through the supervisory committee of the NCR. Administrative permission for use of the anonymized data from the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON) was granted by the HOVON executive board and the HOVON multiple myeloma working group. All data provided to the researchers by the NCR and HOVON were anonymized. This study was conducted according to the principles of the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

FT reports previous consultation for AstraZeneca, Optimax Access, and Dark Peak Analytics, and grants from Celgene outside the submitted work; previous and ongoing research was or is partly funded by CADTH (the Canadian Agency for Drugs and Technologies in Health), the Dutch Ministry of Health, Welfare and Sport, and the European Haematology Association. HB reports previous research grants from BMS (Celgene BV) and an advisory board fee from Pfizer, outside the submitted work, paid to the institute; previous and ongoing research was or is partly funded by CADTH, the Dutch Healthcare Institute, and Medical Delta. LB reports previous and ongoing research grants from the European H2020 Research Programme and the Convergence Program outside the submitted work. WR reports previous and ongoing research grants from the European H2020 Research Programme and the Convergence Program outside the submitted work. CUG reports unrestricted grants from Boehringer Ingelheim, Astellas, Sanofi, Janssen-Cilag, Bayer, Amgen, Merck, Gilead, Novartis, AstraZeneca, and Roche, and grants from European Research Programmes, CADTH, the Dutch Healthcare Institute, the European Haematology Association, and the Dutch Ministry of Health. All grants were outside the submitted work.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Bakker, L., Thielen, F., Redekop, W. et al. Extrapolating empirical long-term survival data: the impact of updated follow-up data and parametric extrapolation methods on survival estimates in multiple myeloma. BMC Med Res Methodol 23 , 132 (2023). https://doi.org/10.1186/s12874-023-01952-2


Received: 09 January 2023

Accepted: 16 May 2023

Published: 29 May 2023

DOI: https://doi.org/10.1186/s12874-023-01952-2


Keywords: Parametric extrapolation · Multiple myeloma · Kaplan-Meier


Chapter 15 Extrapolation of Animal Research Data to Humans: An Analysis of the Evidence


The ethical arguments against animal experimentation remain ever-strong. In addition, the scientific case against the use of animals in research grows more compelling, with exponential progress in the development of alternative methods and new research technologies. The Dutch authorities recently announced an ambitious, but welcome, proposal to phase out "the use of laboratory animals in regulatory safety testing of chemicals, food ingredients, pesticides and (veterinary) medicines" by 2025, as well as "the use of laboratory animals for the release of biological products, such as vaccines" (Netherlands National Committee for the protection of animals used for scientific purposes, NCad, 2016, p. 3). National government departments (e.g., the United Kingdom (UK) Home Office) have stated that alternatives to animals are now considered necessary for scientific as much as ethical reasons, also conceding that pressure exists within the research community to use animals in order to get published. Furthermore, only 20% of animal tests across the European Union (EU) each year are conducted to meet regulatory requirements, with the vast majority carried out as basic research (including basic medical research) or breeding of genetically modified (GM) animals at academic institutions (European Commission, 2013b).

Despite the strength of both scientific and moral arguments, animal research continues to increase worldwide, especially given the rising trend in the use of GM animals. A Catch-22 situation also exists: regulators largely refuse to break with tradition and continue to accept only animal data, even when robust human-based data exist. Additionally, when new animal-free, human-relevant methods are developed, regulators often insist that research still be performed on animals; this is considered one of the major barriers to achieving change and, in turn, results in an industry reluctant to invest in non-animal research if its results are unlikely to be accepted (Schiffelers et al., 2012).

Whilst public engagement, via campaigns to highlight animal suffering, remains vital, a renewed focus on scientific, political, and financial interests is needed to emphasize the fundamental message that animal research simply does not deliver what is needed, in order to influence those who regulate, finance, or approve animal experiments and to have a meaningful impact on their ongoing reduction but, primarily, their replacement. Ongoing scientific evidence of the inadequacy of animal experiments in predicting human outcomes, combined with a focus on the modern, non-animal techniques that have the potential to replace them, is needed to drive recognition of the need for genuine, significant investment in human-relevant research. Additionally, not all animal tests need replacing; many can simply end, so providing appropriate evidence of these types of tests is also essential.

In striving to achieve a paradigm shift to end animal experimentation, for scientific as much as ethical reasons, an evidence-based approach is required. There remains a vital need for a combination of drivers: innovative, animal-free scientific research, training, and education, as well as continued lobbying and campaigning directed at key stakeholders (i.e., scientists, regulators, and political audiences).

Animal experimentation falls into two broad categories: basic research (including basic medical research) and a relatively smaller category, toxicity (or safety) testing of new substances, which includes chemicals for use in personal care, household products, industrial substances, foodstuffs, and pharmaceuticals (the latter are also tested for efficacy). These categories overlap to some extent, with some animal procedures categorized as "fundamental toxicology", for example. A two-fold strategy is suggested to end the use of animals in all experimental research. The first strand should focus on how a large number of procedures, both in basic research and product-safety testing, can simply end today; in other words, they do not need non-animal replacements. The second should focus on procedures that are considered to require replacement, whether through intelligent and strategic combinations of existing non-animal tests (integrated testing strategies) and/or further development of new human-relevant models. Examples of these, and their success in replacing animals to date, are discussed later in this chapter.

A popular argument in support of continuing animal research is that animals have been used for decades in the research and development of new medicines. The fact that millions of animals have been used over the years, often in the same repeated experiments, is not in dispute. However, their continued use does not prove necessity. It is also relevant to note that, from early on in a scientific career, one is discouraged from saying that experiments "didn't work" and encouraged instead to conclude that further research or new approaches must be tried next, in light of unsuccessful or unexpected results. The use of animals has been grandfathered through convention, anecdotal evidence, or belief, rather than robust scientific validity. "We must use a living system" ... but it is the wrong living system, and no matter how many animals are used, they will never provide an appropriate model for humans. This needs to change, particularly when considering the growing industry of breeding and supplying millions of GM animals worldwide each year in repeated attempts to mimic the human condition.

The vast majority of animals are used either for basic research or for breeding of GM strains. This is clear when reviewing recent official statistics for the three highest animal-using countries in the EU: the UK, Germany, and France. For example, more than 3.9 million procedures on animals (mice, rats, rabbits, guinea pigs, dogs, horses, cats, non-human primates, pigs, sheep, cattle, birds, xenopus, and fish, among other species) were carried out in the UK in 2016. Of these, 729,390 involved genetically modified animals, including more than 149,000 animals deliberately bred to suffer a harmful phenotype (a deliberately induced condition, such as cancer, a failed immune system, or organ failure, to try to simulate disease in humans). There were also increases in the number of experiments across several species, and a significant number of experiments for ingredients in household products (1700 procedures) to meet industrial chemicals legislation requirements, despite a policy ban on testing for such purposes (Home Office, 2017). In fact, of the 3.9 million procedures conducted in the UK in 2016, only 13% were carried out for regulatory purposes. Germany bred 1.2 million GM animals in 2015 (with similar numbers of harmful-phenotype animals to the UK), representing 42% of the 2.8 million animals used annually (Federal Ministry of Food and Agriculture, 2016). Figures reported for France in 2014 show that 1.8 million animals were used; however, the proportion of GM animals was not reported (Ministry of Higher Education & Research, 2016).

Several thousand diseases affect humans. Of these, only 500 currently have FDA-approved treatments available (National Center for Advancing Translational Sciences, 2017). In every discipline of disease research, animals are used on an ongoing basis, yet it is continually reported that the mechanisms of the human conditions investigated in these animals are still not understood. This is because basic research in animals is a demand-driven and self-perpetuating system, with much research being proposed and licensed on the basis of being repetitively performed on animals (often termed "well established" or "well documented" models). Such research is neither legally required, nor does it have to be relevant or applicable to human disease to be licensed. Another key barrier to replacing animals, even when scientifically valid alternatives are available, is awareness and acceptance of their use, both by researchers and regulators (Ramirez et al., 2015).

The first part of this chapter provides an analysis of the extrapolation of animal studies to humans, sampling systematic reviews carried out to assess evidence of clinical translation and incorporating a review of the literature on animal toxicity studies for some well-known, established case-study drugs (e.g., paracetamol, aspirin, penicillin) and their animal versus human findings. The second part addresses drivers for change and the development of animal-free (or rather, human-relevant) research methods, as well as some examples of procedures that do not need replacing, as they can simply stop when one considers that they can logically be avoided or rejected on the basis of a correctly performed (and legally required) harm-benefit assessment. The chapter aims to provide an overview of these topics and suggestions for the way forward as part of a new paradigm for a global, animal-research-free future.

1 Part 1: Analysis of Abstracts from Systematic Reviews of Animal Studies

To carry out an analysis of systematic reviews on animal experiments, a review of a sample of the available literature was performed, with the intention of providing a generally qualitative review of the literature. Two separate sources were used. First, a search in PubMed (National Centre for Biotechnology Information, 2016) was made using the keyword search "systematic review animal studies". This resulted in a total of 163,585 publications. PubMed allows searching by Article Type, and selecting "systematic review" further filtered the results to 8,291 listings, sorted by relevance. Second, the Google Scholar database, using the same search terms for consistency, yielded 2,530,000 results (Google Scholar, 2016). Publication dates ranged from 1999 to the present. Generally, PubMed provided more recent listings, whereas Google Scholar returned older publications; this was useful in providing greater scope for review over the past two decades, as well as avoiding duplication.

To account for time constraints while still providing a reasonable sample size, the first 50 abstract listings within each source were reviewed, giving a sample total of 100 (see Table 15.1). If a publication appeared in both sources, this was accounted for, although duplicates were relatively few. Where publications were found not to be relevant, further listings were reviewed to compensate and to maintain a total of 100.

Relevant abstracts were assessed overall according to the following classifications:

  • Clear concordance between human and animal studies.
  • Limited concordance between human and animal studies.
  • Lack of concordance between human and animal studies, due to one of the following factors: unclear reporting, bias, inconsistency, species differences, heterogeneity, or lack of clinical translation.

It should be noted that the term review in this context refers to a qualitative, rather than quantitative, analysis of the literature within the available timeframe. It is also important to note that the systematic reviews analyzed, and the studies included within those publications (based on eligibility criteria assigned by their authors), are just a fraction of the thousands of papers reviewed but rejected, some spanning four or five decades and using hundreds of thousands of animals.

Of the 100 abstracts reviewed (50 from PubMed; 50 from Google Scholar), none stated unequivocal and conclusive concordance between animals and humans. A low proportion of abstracts (20%) described limited concordance in specific procedures, but this was generally qualified with an advisory to interpret the findings with caution and a call for more clinical studies to provide better evidence in humans.

Species used in studies included rats, mice, rabbits, cats, dogs, sheep, pigs, and non-human primates, among others. A wide range of disease areas were covered, including several types of cancer, heart disease, stroke, neurological disorders (e.g., Alzheimer’s and Parkinson’s disease), diabetes, bone defects and facial disorders, dental research, gene therapy and stem cell research, to provide some examples. Several publications were general reviews of how animal data translates to humans, as well as reviews of animal studies in specific disease areas.

The large majority of reviews (75%) found that the assessment of human response from animal data is significantly limited by one or more of the following factors: species differences, lack of clinical translation, poor-quality methodology, inconsistency, and publication bias, resulting in overstatement of the benefits of animal use in predicting human disease outcomes or safety. There was a distinct lack of clinical evidence, despite many therapies in use being based on animal studies. Numerous reviews highlighted successes claimed for basic research outcomes or new therapies in animal "models" which have, however, failed to translate to the clinic to help patients. Concerns over the paucity of evidence, publication bias, and the consequent overstatement of benefit in translating animal data to humans have prompted many systematic reviews.

A key finding from this review is that not only is publication bias very common in animal research, but many additional results considered unsuccessful remain unpublished. This issue was raised in a number of the systematic reviews analyzed, for example in animal models of stroke. Several reviews also raised concerns over animal studies and human trials being carried out simultaneously. Moreover, this analysis found several studies (5% of the sample reviewed) highlighting animal experiments that could have been entirely omitted and carried out directly, and far more effectively and ethically, as clinical or observational studies in humans; for example, studies on dietary intake and cardiovascular health, or trials of substances already in human use. In many cases, human trials were carried out in parallel with animal experiments, representing examples of animal use that could simply have ended. This is discussed in more detail later in this chapter.

The findings from the majority of publications reviewed are consistent with other evidence on the problems of translating animal data to humans; for example, the Weatherall report (jointly commissioned in 2006 by a number of major UK research councils and chaired by Sir David Weatherall). A subsequent review in 2011 addressed one of the recommendations in the Weatherall report: to review ten years of brain research in monkeys retrospectively. Not only did the review reveal some disturbing insights into the routine suffering of non-human primates used in neurology, but it also reported the equally concerning finding that "In most cases, however, little direct evidence was available of actual medical benefit in the form of changes in clinical practice or new treatments" (p. 13). These findings were emphasized more recently in a report published in November 2013 by the (then) Animal Procedures Committee (APC, now the Animals in Science Committee).

Further evidence of increasing concern over the validity of animal research was highlighted in a British Medical Journal review which concluded that "Funds might be better directed towards clinical rather than basic research, where there is a clearer return on investment in terms of effects on patient care" (p. 1). This article adds to a wealth of evidence on the poor performance of animals in predicting human responses, with an accuracy of approximately 20%–60%, depending on the reviews cited. Additionally, in a series of studies between 2013 and 2015, a collaboration between the Fund for the Replacement of Animals in Medical Experiments and Cruelty Free International analyzed an unprecedented level of independent data from both preclinical toxicity studies and human clinical trials. The studies revealed the inadequacy of animal toxicity studies in a number of species (i.e., dog, rabbit, mouse, rat, and non-human primate) for predicting human adverse events, and the urgency of developing more human-relevant methods (Bailey et al.).

To some extent, the pharmaceutical industry recognizes that the models it has been using are inadequate. There is encouraging research into alternative approaches and further consideration of the problem in some areas. In 2014, the National Institutes of Health (NIH) began investigating more than 100 drugs that had shown success in rodents but went on to fail in human trials.

With regard to the types of experiments covered by the reviews examined, the majority of publications focused on basic research in animals (66%). This was expected, given that this is the largest area of animal use. The remaining 34% were concerned with reviewing the safety or efficacy of substances, including new and existing drugs, herbal therapies, and food-related additives or substances (e.g., low-calorie sweeteners).

A follow-up literature review was performed to further address publications on toxicity tests in animals. This specific sample of the most recent literature was chosen to provide meaningful case studies on three well-known and widely used drugs: paracetamol (acetaminophen), aspirin, and penicillin. Each drug is briefly discussed below.

Paracetamol was first marketed in the 1950s and is well known as one of the world's most common household drugs, traded under many brand names, including Tylenol and Panadol. Despite being marketed for over five decades, and despite the vast availability of data on global human use, paracetamol continues to be routinely tested on animals, both for "blue sky" research and in attempts to market it for new purposes. Using a similar methodology to the previous review, the general key search terms "paracetamol toxicity animals" returned 2,431 listings in PubMed. (Note: the similar terms "acetaminophen toxicity animals" returned 2,358 listings, and a brief review established, as expected, that many of these were the same results.)

A review of the first five listings under the above search term, published between 2014 and 2016, provided extensive evidence of ongoing experimental research into paracetamol in animals. For example, hepatotoxicity has been known for decades as a risk of paracetamol overdose in humans; yet inducing such effects in mice is still carried out routinely, worldwide. Recent studies show that macaques are considered a poor model due to their resistance to paracetamol poisoning compared to humans. Experimental dosing and killing of newborn mice continues, despite paracetamol's widespread global use in children and pregnant mothers, as shown by far more directly relevant clinical or observational studies checking for effects on offspring. Other experimental studies included force-feeding GM mice a drug to inhibit an enzyme that activates the toxic response to paracetamol in order to investigate resistance (Pu et al., 2016), and numerous similar experimental tests in mice, despite much earlier, advanced human-based studies investigating resistance to paracetamol toxicity.

Another publication investigated aniline, a widely used industrial chemical to which humans are routinely exposed in the environment via air, diet, and water. The aim of the study was to investigate aniline's conversion to paracetamol and its effects on male fertility. Yet, instead of employing the directly relevant approach of investigating the vast amount of already available clinical and observational exposure data, groups of mice were injected before being killed and dissected, along with their offspring, for examination.

As well as this review of specific experiments on paracetamol toxicity in animals, further publications, published between 1996 and 2012, on systematic reviews of paracetamol toxicity were analyzed. These included reviews of the translation of animal models of paracetamol toxicity to humans, one stating that "Considerable effort has been made to predict and model drug-induced liver injury in humans using laboratory animals with only little success and even some controversy" (McGill and Jaeschke, 2014, p. 10). A further review of paracetamol and similar drugs in its class concluded that there was insufficient evidence, based on animal (and human) tests, to assess toxic effects on the human kidney (Feinstein et al., 2000). An analysis of clinical treatment for paracetamol-induced injury during liver surgery documented 19 different studies carried out on mice, rats, dogs, and pigs with varying results, concluding that the evidence was insufficient to suggest the therapy was clinically relevant (Jegatheeswaran and Siriwardena, 2010).

The remaining reviews noted how paracetamol, over two decades ago (as one of a group of "well-studied" hepatotoxicants), highlighted the need to evaluate links between in vitro and in vivo testing strategies (Huggett et al., 1996); and, more recently, that despite the extensive toxicity testing of paracetamol, evidence supporting its use in specific groups of patients (e.g., the critically ill) was considered lacking (Jefferies et al., 2012), highlighting the value of data that can only be gathered in clinical research. When taken in normal regular doses, paracetamol is largely considered safe in humans for a number of pain-associated conditions. Yet it causes a wide range of toxicities in many species, for example cancer in mice and rats (Hueper et al., 2012). In fact, given today's requirements for extensive regulatory toxicity testing in animals, it is highly likely that paracetamol would be denied approval based on its poor safety profile in animals.

Acetylsalicylic acid, commonly known as aspirin, has been in human use for more than a century. It is still considered successful and, given its relatively cheap production costs and widespread use for a number of indications, is still considered a "blockbuster" drug in terms of revenue. Yet the human-relevant dose of aspirin is lethal to rats and causes toxic effects in many animal species, including embryonic deformities in dogs, cats, mice, rats, monkeys, and rabbits. Like paracetamol, given its poor safety record in animals, aspirin would very likely be denied approval for human use if newly marketed according to today's regulatory testing requirements. Aspirin continues to be routinely tested on animals, despite the availability of vast libraries of both historical and new human data.

Using the same methodology and sampling as the previous reviews, the search terms "aspirin toxicity animals" were used. PubMed revealed experiments carried out between 2012 and 2016, including 15-day oral toxicity studies of aspirin derivatives in Wistar rats and subchronic toxicity studies in mice. Searching for publications under the terms "acetylsalicylic acid toxicity in animals" returned specific studies published between 2000 and 2013. These included the administration of large doses to pregnant rabbits, concluding that aspirin is not teratogenic to them, and highlighting inconsistencies with previous rabbit experiments, as well as species differences with rats, which have been "extensively studied" and exhibit birth defects (Cappon et al., 2003).

Alexander Fleming's pioneering work on penicillin is well known. Following this, Florey and Chain won a Nobel Prize in the 1940s for successful results with penicillin in mice; yet they considered themselves fortunate to have chosen to test mice instead of guinea pigs, in whom the drug has lethal side effects, as Florey later remarked: "Mice were used in the initial toxicity tests because of their small size, but what a lucky chance it was, for in this respect man is like the mouse and not the guinea pig. If we had used guinea pigs extensively we should have said that penicillin was toxic and we probably should not have proceeded to try and overcome the difficulties of producing the substance for trial in man" (p. 12).

In fact, penicillin is safe, to some extent, in mice and rats but has severe, often lethal, effects in hamsters and guinea pigs, whose very sensitive intestinal microbiota make them particularly susceptible compared to other species. Animal users are quick to respond to this issue, stating that multiple species are used to assess the most appropriate "model" for humans and to account for differences (heterogeneity) between animals. Again, no dispute is made on this; indeed, this has been the tenet in toxicology for decades: testing in different species at varying doses, modifying animals' condition genetically, chemically, or physically in an attempt to elicit the reaction needed. Yet the high attrition rate of new pharmaceuticals and the lack of progress in key areas of disease research should suggest that something is wrong. The use of animals is the only area of scientific research where the same dated techniques are still in use 60–70 years later, despite their limitations being well known; no other area of science continues with such a dogmatic approach. As evidence of this, a general literature review of the search terms "penicillin toxicity animals" returns numerous publications spanning decades. Several of the most recent listings (2011–2016) involve rats, rabbits, and other animals, even using penicillin in repeated experiments to induce effects including anxiety and depression (to try to mimic in rats effects already seen in patients), weight loss, organ failure, and deliberate epilepsy, in order to test the effects of other drugs that, like penicillin, are already in extensive global use with a wealth of clinical toxicity data available.

It should be noted that the term replacement is used in this chapter to refer only to methods that avoid the use of animals and their tissues. It is necessary to make this distinction, given the widely used terminology of the 3Rs (replacement, reduction, and refinement), first proposed by Russell and Burch in 1959. The ultimate goal of Russell and Burch in establishing the 3Rs was replacement. While measures to refine methods or reduce animal numbers are, of course, to be encouraged, much attention is devoted to these 2Rs and, to some extent, this has diverted focus from replacement.

Given six decades of the 3R principles, dedicated attention to replacement is long overdue. This is also reflected in European Directive 2010/63/EU on animals used for scientific purposes (European Union, 2010, p. 2), which states that it “represents an important step towards achieving the final goal of full replacement of procedures on live animals for scientific and educational purposes as soon as it is scientifically possible to do so. To that end, it seeks to facilitate and promote the advancement of alternative approaches.” Although the Directive was implemented in January 2013, there has been relatively little decrease and, in many cases, an increase in animal use across individual Member States. There is therefore still great scope for improvement, particularly with regard to funding the development, acceptance, and adoption of animal-free, human-based methods.

Furthermore, the broad interpretation of the term replacement under the auspices of the 3Rs is used to describe the use of some animals as “alternatives” to others: for example, the use of zebrafish over rodents; transgenic mice to “replace” non-human primates; and even the use of minipigs, instead of dogs, as an “alternative” that may be more acceptable to the public because they are considered “food animals”. Aside from the poor ethical argument, replacing one animal with another still fails to address the problem.

Use of public opinion and political lobbying to drive legislative change remains vital to fueling research and development in animal-free science. The clearest example in recent years is the phased-in bans on animal-tested cosmetics across the EU between 2009–2013 (European Commission, 2013). A testing ban on cosmetic ingredients was enforced from March 11, 2009, along with a partial marketing ban covering 10 animal-test requirements. This was eventually followed by a further marketing ban from March 11, 2013 for endpoints considered more complex (i.e., repeat-dose toxicity, skin sensitization, reproductive toxicity, carcinogenicity, and toxicokinetics), although some of these endpoints are rarely or never required in the safety data for cosmetic substances. Despite delays in implementing the bans and legal challenges attempting to abolish them altogether, they had a monumental effect on the industry, driving the development of numerous methods to be ready in time. Despite loopholes created by conflicting chemicals safety-testing legislation, such as the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) regulation, the bans have been responsible for one of the most significant advances towards replacing animal tests in decades.

The campaign to end cosmetics tests on animals began in the 1970s, and it took until 1993 for legislative amendments to mark the implementation of official EU bans. After a further two decades of delays, the bans were finally enforced, against significant resistance amid claims that innovation would be stifled and that the development of alternatives would not be possible. Instead, the opposite was achieved. The bans stimulated the development of methods to address a number of toxicological endpoints (an endpoint being the result of a study to determine how toxic a substance is). The endpoints included skin irritation, eye irritation, skin corrosion, phototoxicity, skin absorption/penetration, acute toxicity, and genotoxicity/mutagenicity. In preparation for the forthcoming bans, around 30 new assays were validated by 2007, with more developed since and projections that the toxicity testing market would be worth US$17,227 million by 2018 (Newswire, 2014). The bans have also effected positive change outside EU borders, with similar bans now in place in India, Israel, Norway, and New Zealand, as well as partial or full enforcement in many other countries.

Replacement can be (and is being) achieved by a number of approaches, including in vitro and in silico models. Some examples are discussed below.

In its 2014 Delivery Plan, the UK Home Office devoted much of the text to supporting the continuance of animal research. However, the plan also showcases human-based methods, for example, using induced pluripotent stem cells, which it describes as “work that in the past could only have been modelled in animal systems” (p. 16).

Scientists at the University of Newcastle have developed human skin-based assays, using cells isolated from blood samples of healthy volunteers to assess new drugs, cosmetics, and household products. The technology, now marketed as Skimune by Alcyomics Ltd, could have predicted the adverse effects seen in the volunteers of the TGN1412 monoclonal antibody clinical trial in 2006.

Other high-performance initiatives include physiologically based pharmacokinetic (PBPK) modelling, which quantitatively predicts the characteristics of substances in the body (e.g., blood flow or effects on organs). The introduction of PBPK models over the past two decades is credited with reducing drug failure rates from over 40% to under 10%. Another major area of replacement research uses devices, known as multi-organ chips (MOC), to mimic the human body’s response to chemicals and disease processes, with the ultimate goal being a whole “human-on-a-chip”. Over the past few years, advances in MOC technology have been exponential. For example, the organ-on-a-chip devices developed at Harvard’s Wyss Institute can mimic events in tissue function and disease, such as air flow, bacterial infection, immune system response, blood clotting, fluid leakage and, most recently, electrical activity across cells, to predict safety and disease mechanisms in patients. For further discussion see Wilkinson, 2019, Chapter 26.
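To make the idea concrete, the sketch below shows the kind of compartmental machinery that PBPK models build on: a single central compartment with first-order absorption and elimination, solved in R with the deSolve package. The dose, rate constants, and volume of distribution are hypothetical placeholders, and a real PBPK model links many physiologically parameterized compartments; this is a toy illustration of the approach, not any of the tools or results cited above.

```r
# Toy compartmental model illustrating the ODE machinery behind PBPK approaches.
# One central compartment, first-order absorption (ka) and elimination (ke);
# all parameter values are hypothetical.
library(deSolve)

params <- c(ka = 1.0,  # absorption rate constant (1/h), assumed
            ke = 0.2,  # elimination rate constant (1/h), assumed
            V  = 40)   # volume of distribution (L), assumed

state <- c(gut = 500, central = 0)  # 500 mg oral dose sitting in the gut at t = 0

pk_model <- function(t, state, p) {
  with(as.list(c(state, p)), {
    dgut     <- -ka * gut                 # drug leaving the gut
    dcentral <-  ka * gut - ke * central  # drug entering, then cleared from, plasma
    list(c(dgut, dcentral))
  })
}

out  <- ode(y = state, times = seq(0, 24, by = 0.25), func = pk_model, parms = params)
conc <- out[, "central"] / params["V"]  # predicted plasma concentration (mg/L)
plot(out[, "time"], conc, type = "l", xlab = "Hours", ylab = "Concentration (mg/L)")
```

A full PBPK model extends this pattern with compartments for individual organs, linked by blood flows and parameterized from human physiology, which is what allows it to predict a substance’s behavior in people rather than in another species.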

At the United States NIH Chemical Genomics Center, a major testing program has been underway since 2004, involving a robotic-arm system that tests thousands of chemicals using patient-donated cells. The high-throughput system performs approximately 3 million tests per week, each week in relation to a different disease. The success of the system (also funded under the Tox21 initiative) in screening and identifying suitable candidate drugs has dramatically saved time, cost, and resources, resulting in human clinical trials starting within a year.

A further groundbreaking concept is the Adverse Outcome Pathway (AOP), a key component of the paradigm shift towards human-relevant methods and of establishing a robust system for predicting human safety. An AOP is a sequence of events that starts with a chemical effect at the molecular level (termed a Molecular Initiating Event) and progresses through changes (termed Key Events) in cells, tissues, and organs to produce an adverse effect in the body. AOPs act as a bridge between emerging methods of safety testing and, ultimately, what happens in the body in response to a particular substance (xenobiotic). With increasing knowledge, AOPs can be linked to form networks, revealing adverse outcomes that share pathways and vice versa. One example is the establishment of test methods that map the three key stages of the AOP for skin sensitization, now accepted at Organisation for Economic Co-operation and Development (OECD) level. Before the EU cosmetics testing bans were implemented, there was a high level of skepticism over the prospect of testing substances for skin sensitization (and other complex endpoints) without the use of animals; and while further work remains to be done, major progress has been made. The AOP program was established in 2012 and, including skin sensitization, there are now six AOPs approved at OECD level: five relating to human health effects and one addressing potential ecotoxicological effects on wildlife (fish, birds, and amphibians). A further 227 AOPs are in development.

In addition to new in vitro and in silico models to address safety testing, other areas of animal use previously considered essential, such as education, have seen coordinated replacement initiatives. Although animals are still used extensively in this area, great successes have been achieved to date. For example, campaigns by all involved in the International Network for Humane Education (InterNICHE) project to provide training and disseminate information on humane methods in medicine, biology, and veterinary research (e.g., mannequins and simulation techniques) continue to effect great change in universities and schools worldwide (InterNICHE, 2017). Other progress is being made in education as well. In 2016, Washington University announced it would end its 25-year use of cats for intubation training; it was the last university in the United States still using cats in this way. Following sustained public awareness campaigns and significant investment in its simulation center, which made the decision possible, it will now use mannequins and advanced simulators. Also in 2016, the Johnson & Johnson subsidiary Ethicon finally agreed to remove live pigs from its medical-device training program, stating that it “discontinued live animal use in sales training across our North America region”. For further discussion see Pawlowski et al., 2019, Chapter 22.

In addition to the vast range of human-based technologies now available, another sensible approach is to improve the use of data from clinical, epidemiological, and biomonitoring studies. All of these have been considerably underused to date and could not only improve patient safety and disease research but also avoid the unnecessary use of animals.

The ethical arguments concerning the use of animals and the problems with scientific validity are compounded further by the widespread problem of duplicated experiments. Many of the same tests are carried out over and over again, often amid claims of needing to maintain confidentiality and preserve intellectual property, despite mandates to share data. In one example, a robust analysis of safety data submitted under the REACH program recently revealed that, incredibly, the Draize eye irritation test had been carried out repeatedly on rabbits for the same two chemicals.

Not all tests need replacing; many can simply end now, being out of date or redundant. A recent case is the deletion of the single-dose toxicity test from the European Medicines Agency guidelines, after it was recognized that the information it provided could be obtained elsewhere and that the test was of limited value. Furthermore, there are many examples of animal tests that require critical analysis and retrospective assessment, not only to assess whether scientific objectives were met but also whether such procedures should have been approved at all. A case study demonstrating this is the European Coalition to End Animal Experiments (ECEAE). In 2014, the ECEAE estimated that its strategy of toxicological review of, and comment on, animal testing proposals for chemicals registered under the REACH legislation saved at least 18,000 animals through rejected and withdrawn proposals. This was achieved on the basis of existing data or evidence that the proposed tests were unnecessary or unjustified (ECEAE, 2014). Another recent example is the welcome decision that the year-long chronic-toxicity test for pesticides in dogs is no longer required, on the basis that it is not scientifically justified. The test has been dropped in the EU, the United States, and Canada. Although the one-year test is still required in some other countries, the restrictions mark a change in attitudes and a meaningful review of testing requirements. The campaign continues to see the test abolished worldwide as soon as possible.

The aim of this chapter was to provide a qualitative overview of evidence, from systematic reviews and some individual studies, not only of the flawed approach of continuing to use animals to try to predict mechanisms of human disease, but also of the success of existing and emerging animal-free methods, the opportunities for intelligent use of human-based data, and the distinction between animal tests that require replacement and those that can simply end.

Advances in science, providing better technologies on an ongoing basis, should pave the way for acceptance of non-animal methods. In some areas, such as cosmetics testing, there is unprecedented change and global recognition that animal use must end. Yet, in other areas of animal research, despite a wealth of better science, the realities of some conventional attitudes, resistance to change, and an industry reliant on the continuation of animal experimentation (e.g., major establishments funded by long-term programs of animal research, financial partnerships, GM animal breeding, commercial breeders, suppliers, and transporters of animals) mean that political lobbying, campaigning, and raising public awareness must continue to play a major role. Fortunately, there are a number of animal protection, political, and scientific stakeholders who continue to work in the field, actively pushing for change, to increase recognition that animal research must end and to achieve the paradigm shift that is urgently needed for humans and animals.

Dedication: This chapter is dedicated to Andrew Tyler.

References

Bailey, J., M. Thew and M. Balls (2013). An Analysis of the Use of Dogs in Predicting Human Toxicology and Drug Safety. ATLA, 41, pp. 335–350.

Bailey, J., M. Thew and M. Balls (2014). An Analysis of the Use of Animal Models in Predicting Human Toxicology and Drug Safety. ATLA, 42, pp. 181–199.

Bailey, J., M. Thew and M. Balls (2015). Predicting Human Drug Toxicity and Safety via Animal Tests: Can Any One Species Predict Drug Toxicity in Any Other, and Do Monkeys Help? ATLA, 43, pp. 393–403.

P. (2002). Preclinical Testing for Teratogenicity and Developmental Toxicity: Methods and Limitations. 57(2), pp. 109–114.

(2011). Report of a panel chaired by Professor Sir Patrick Bateson, FRS. London. [online; accessed 30 November 2016].

(2013). Oxford Works with Drug-makers to Reverse 90% Trial Failure Rate. [online; accessed 16 December 2016].

M., K.F. Müller, J.J. Meerpohl, E. von Elm, B. Lang, E. Motschall, V. Gloy, F. Lamontagne, G. Schwarzer and D. Bassler (2013). Publication Bias in Animal Research: A Systematic Review Protocol. 2(23). [online; accessed 30 November 2016].

(2016). Charles River Technical Sheet. [online; accessed 11 December 2016].

ECEAE (2014). ECEAE Claims 18,000 Animals Saved Through REACH Testing Proposals Process. London. [online; accessed 3 August 2017].

European Commission (2013a). Brussels. [online; accessed 30 November 2016].

European Commission (2013b). Brussels: SWD (2013) 497 final. [online; accessed 30 November 2016].

European Union (2010). Directive 2010/63/EU of the European Parliament and of the Council of 22 September 2010 on the Protection of Animals Used for Scientific Purposes. Official Journal of the European Union, L276, pp. 33–79. [online; accessed 12 August 2017].

Florey, H. (1953). The Advance of Chemotherapy by Animal Experiment. 41, p. 12.

R., G. Bode, L. Ellegaard, J.W. van der Laan and the Steering Group of the RETHINK Project (2010). The RETHINK Project: Minipigs as Models for the Toxicity Testing of New Medicines and Chemicals: An Impact Assessment. 62, pp. 158–159.

F. (2014). How Predictive and Productive is Animal Research? 348, p. 3719.

T. (2008). Food for Thought … on Alternative Methods for Cosmetics Safety Testing. p. 25.

T. (2009). Per Aspirin Ad Astra. 37(2), pp. 45–47.

T.C., R. Watzlawick, J.K. Rhodes, M.R. Macleod and P.J. Andrews (2016). Study Protocol: A Systematic Review and Meta-analysis of Hypothermia in Experimental Traumatic Brain Injury: Why Have Promising Animal Studies Not Been Replicated in Pragmatic Clinical Trials? 3(2), p. e00020.

S., R. Cardoso, F. Pinho-Ribeiro, J. Crespigio, T. Cunha, J. Alves-Filho, R. da Silva, P. Pinge-Filho, S. Ferreira, F. Cunha, R. Casagrande and W. Verri (2013). 5-Lipoxygenase Deficiency Reduces Acetaminophen-induced Hepatotoxicity and Lethality. Article 627046.

J., C. Chalmey, H. Modick, L. Jensen, G. Dierkes, T. Weiss, B. Jensen, M. Nørregard, K. Borkowski, B. Styrishave, H. Koch, M. Severine, B. Jegou, K. Kristiansen and D. Kristensen (2015). Aniline Is Rapidly Converted into Paracetamol Impairing Male Reproductive Development. 148(1), pp. 288–298.

J., R. de Haan, M. Vermeulen, P. Luiten and M. Limburg (2001). Nimodipine in Animal Model Experiments of Focal Cerebral Ischemia: A Systematic Review. 32(10), pp. 2433–2438.

W., I. Fegert, R. Billington, R. Lewis, K. Bentley, W. Bomann, P. Botham, B. Stahl, B. Ravenzwaay and H. Spielmann (2010). A 1-year Toxicity Study in Dogs Is No Longer a Scientifically Justifiable Core Data Requirement for the Safety Assessment of Pesticides. 40(1), pp. 1–15.

J. (2014). An Out of Body Experience. [online; accessed 30 November 2016].

Z., B. Ritz, J. Virk and J. Olsen (2014). Maternal Use of Acetaminophen During Pregnancy and Risk of Autism Spectrum Disorders in Childhood: A Danish National Birth Cohort Study. 9(9), pp. 951–958.

C., L. Criens-Poublon, C. Cockrell and R.J. de Haan (2002). Wound Healing in Cell Studies and Animal Model Experiments by Low Level Laser Therapy: Were Clinical Studies Justified? A Systematic Review. 17(2), pp. 110–134.

T., A. Maertens, D. Russo, C. Rovida, H. Zhu and T. Hartung (2016). Analysis of Draize Eye Irritation Testing and Its Prediction by Mining Publicly Available 2008–2014 REACH Data. 33(2), pp. 123–134.

P., R. Edwards, R. Tootle, C. Selden, E. Roberts and H. Hodgson (1999). Resistance of Three Immortalized Human Hepatocyte Cell Lines to Acetaminophen and N-acetyl-p-benzoquinoneimine Toxicity. 31(5), pp. 841–851.

J. Jr. (2010). Building a Tiered Approach to Predictive Toxicity Screening: A Focus on Assays with Relevance. 13(2), pp. 188–206.

G., E. Antignac, T. Re and H. Toutain (2010). Safety Assessment of Personal Care Products/Cosmetics and Their Ingredients. 243, pp. 239–259.

P., I. Roberts, E. Sena, P. Wheble, C. Briscoe, P. Sandercock, M. Macleod, M. Luciano, P. Jayaram and K. Khan (2007). Comparison of Treatment Effects Between Animal Experiments and Clinical Trials: Systematic Review. 334(7586), p. 197.

PCRM (2016). Washington University Ends Live Cat Labs for Pediatrics Training. [online] Available at: www.pcrm.org/pcrm.org/media/news/washington-university-ends-live-cat-labs-for-pediatrics-training [Accessed 6 August 2017].

R., A. Pawak and S. Challa (2015). Systemic Exposure of Paracetamol (Acetaminophen) Was Enhanced by Quercetin and Chrysin Co-administration in Wistar Rats and Model: Risk of Liver Toxicity. 41(11), pp. 1793–1800.

Newswire (2014). [online; accessed 10 July 2017].

T., S. Beken, M. Chlebus, G. Ellis, C. Griesinger, S. De Jonghe, I. Manou, A. Mehling, K. Reisinger, L. Rossi, J. van der Laan, R. Weissenhorn and U. Sauer (2015). Knowledge Sharing to Facilitate Regulatory Decision-making in Regard to Alternatives to Animal Testing: Report of an EPAA Workshop. 73(1), pp. 210–226.

H.A., M.R. Goff, S.A. Poole and G. Chen (2015). Eating Frequency, Food Intake and Weight: A Systematic Review of Human and Animal Experimental Studies. 18(2), p. 38.

J., V. Monteiro, R. de Souza Gomes, M. do Carmo, G. da Costa, P. Ribera and M. Monteiro (2016). Action Mechanism and Cardiovascular Effect of Anthocyanins: A Systematic Review of Animal and Human Studies. 14(1), p. 315.

P.J., P.S. Hogenkamp, C. de Graaf, S. Higgs, A. Lluch, A.R. Ness, C. Penfold, R. Perry, P. Putz, M.R. Yeomans and D.J. Mela (2016). Does Low-energy Sweetener Consumption Affect Energy Intake and Body Weight? A Systematic Review, Including Meta-analyses of the Evidence from Human and Animal Studies. 40(3), pp. 381–394.

Russell, W. and R. Burch (1959). The Principles of Humane Experimental Technique. [online; accessed 30 November 2016].

M.J., B. Blaauboer, C. Hendriksen and W. Bakker (2012). Regulatory Acceptance and Use of 3R Models: A Multilevel Perspective. 29(3), pp. 287–300.

E., H. van der Worp, P. Bath, D. Howells and M. Macleod (2010). Publication Bias in Reports of Animal Stroke Studies Leads to Major Overstatement of Efficacy. 8(3), p. e1000344.

W. (2005). West Sussex, UK: John Wiley & Sons, Ltd.

Meer, P., M. Kooijman, C. Gispen-de Wied, E. Moors and H. Schellekens (2012). The Ability of Animal Studies to Detect Serious Post-marketing Adverse Events Is Limited. 64(3), pp. 345–349.

H., P. Eriksson, T. Gordh and A. Fredriksson (2014). Paracetamol (Acetaminophen) Administration During Neonatal Brain Development Affects Cognitive Function and Alters Its Analgesic and Anxiolytic Response in Adult Male Mice. 138(1), pp. 139–147.

H., N. Barrass, S. Gales, E. Lenz, T. Parry, H. Powell, D. Thurman, M. Hutchison, I. Wilson, L. Bi, J. Qiao, Q. Qin and J. Ren (2015). Metabolism by Conjugation Appears to Confer Resistance to Paracetamol (Acetaminophen) Hepatotoxicity in the Cynomolgus Monkey. 45(3), pp. 270–277.

K., S.T. Rashid, H. Strick-Marchand, I. Varela, P.Q. Liu, D.E. Paschon, E. Miranda, A. Ordóñez, N.R. Hannan, F.J. Rouhani and S. Darche (2011). Targeted Gene Correction of α1-antitrypsin Deficiency in Induced Pluripotent Stem Cells. 478(7369), pp. 391–394.


Multisite Skin Biopsies vs Cerebrospinal Fluid for Prion Seeding Activity in the Diagnosis of Prion Diseases

Question   Are misfolded prion protein aggregates in skin biopsies a more sensitive diagnostic biomarker for prion diseases (PRDs) compared with those in the cerebrospinal fluid (CSF)?

Findings   In this diagnostic study involving 415 skin samples and 160 CSF samples from 101 patients with PRDs and 23 patients without PRDs, the sensitivity of single-site skin biopsies was comparable with that of the CSF. However, the combination of 2 or 3 skin biopsies exhibited greater diagnostic sensitivity compared with the CSF alone.

Meaning   Results suggest that analysis of 2 or more skin sites was superior to CSF analysis for diagnosing PRDs and may be valuable for patients with negative CSF real-time quaking-induced conversion assay results or those unable or unwilling to undergo lumbar puncture.

Importance   Recent studies have revealed that autopsy skin samples from cadavers with prion diseases (PRDs) exhibited a positive prion seeding activity similar to cerebrospinal fluid (CSF). It is worthwhile to validate the findings with a large number of biopsy skin samples and compare the clinical value of prion seeding activity between skin biopsies and concurrent CSF specimens.

Objective   To compare the prion seeding activity of skin biopsies and CSF samples, and to determine the effectiveness of combining skin biopsies from multiple sites and multiple dilutions in the diagnosis of various types of PRDs.

Design, Setting, and Participants   In the exploratory cohort, patients were enrolled from September 15, 2021, to December 15, 2023, and were followed up every 3 months until April 2024. The confirmatory cohort enrolled patients from December 16, 2023, to June 30, 2024. The exploratory cohort was conducted at a single center, the neurology department at Xuanwu Hospital. The confirmatory cohort was a multicenter study involving 4 hospitals in China. Participants included those diagnosed with probable sporadic Creutzfeldt-Jakob disease or genetically confirmed PRDs. Patients with uncertain diagnoses or those lost to follow-up were excluded. All patients with PRDs underwent skin sampling at 3 sites (drawn from among the near-ear area, upper arm, lower back, and inner thigh), and a portion of them had CSF samples taken simultaneously. In the confirmatory cohort, a single skin biopsy site and CSF samples were simultaneously collected from a portion of patients with PRDs.

Exposures   The skin and CSF prion seeding activity was assessed using the real-time quaking-induced conversion (RT-QUIC) assay, with rHaPrP90-231, a Syrian hamster recombinant prion protein, as the substrate. In the exploratory cohort, skin samples were tested at dilutions of 10⁻² through 10⁻⁴. In the confirmatory cohort, skin samples were tested at a dilution of 10⁻². A total of four 15-μL wells of CSF were used in the RT-QUIC assay.

Main Outcomes and Measures   Correlations between RT-QUIC results from the skin and CSF and the final diagnosis of enrolled patients.

Results   In the exploratory cohort, the study included 101 patients (mean [SD] age, 60.9 [10.2] years; 63 female [62.4%]) with PRD and 23 patients (mean [SD] age, 63.4 [9.1] years; 13 female [56.5%]) without PRD. A total of 94 patients had CSF samples taken simultaneously with the skin biopsy samples. In the confirmatory cohort, a single skin biopsy site and CSF sample were taken simultaneously in 43 patients with PRDs. At a 10⁻² dilution, the RT-QUIC positive rates of skin samples from different sites were comparable with those of the CSF (skin: 18 of 26 [69.2%] to 74 of 93 [79.6%] vs CSF: 71 of 94 [75.5%]). When tested at 3 different dilutions, all skin sample positivity rates increased to over 80.0% (79 of 93 for the near-ear area, 21 of 26 for the upper arm, 77 of 92 for the lower back, and 78 of 92 for the inner thigh). Combining samples from the skin sites near the ear, inner thigh, and lower back in pairs yielded positivity rates exceeding 92.1% (93 of 101), significantly higher than CSF alone (71 of 94 [75.5%]; P = .002). When all skin sample sites were combined and tested at 3 dilution concentrations for RT-QUIC, the sensitivity reached 95.0% (96 of 101). In the confirmatory cohort, the RT-QUIC positive rate of a single skin biopsy sample was slightly higher than that of the CSF (34 of 43 [79.1%] vs 31 of 43 [72.1%]; P = .45).

Conclusions and Relevance   Results of this diagnostic study suggest that the sensitivity of an RT-QUIC analysis of a combination of 2 or more skin sites was superior to that of CSF in diagnosing PRDs.


Chen Z, Shi Q, Xiao K, et al. Multisite Skin Biopsies vs Cerebrospinal Fluid for Prion Seeding Activity in the Diagnosis of Prion Diseases. JAMA Neurol. Published online October 14, 2024. doi:10.1001/jamaneurol.2024.3458


October 15, 2024


Finnish study finds good physical fitness from childhood protects mental health

by University of Eastern Finland


A recent Finnish study has found that good physical fitness from childhood to adolescence is linked to better mental health in adolescence. These results are significant and timely, as mental health problems are currently a major societal challenge, affecting up to 25–30% of young people. These findings suggest that improving physical fitness from childhood can help prevent mental health problems.

The paper is published in the journal Sports Medicine.

In a study by the Faculty of Sport and Health Sciences at the University of Jyväskylä and the Institute of Biomedicine at the University of Eastern Finland, the physical fitness of 241 adolescents was followed from childhood to adolescence for eight years. The study showed that better cardiorespiratory fitness and improvements in it from childhood to adolescence were associated with fewer stress and depressive symptoms in adolescence.

Additionally, the study found that better motor fitness from childhood to adolescence was associated with better cognitive function and fewer stress and depressive symptoms. However, the association between motor fitness and depressive symptoms was weaker than the one between cardiorespiratory fitness and depressive symptoms. Screen time measured in adolescence partly explained the associations of cardiorespiratory fitness and motor fitness with mental health.

These findings advocate for investment in physical fitness early in life as a potential strategy for mitigating mental health and cognitive issues in adolescence.

"The concern about the declining physical fitness in children and adolescents is real. However, the focus has been on physical health," says Eero Haapala, Senior Lecturer of Sports and Exercise Medicine at the Faculty of Sport and Health Sciences, University of Jyväskylä.

"Our results should encourage policymakers as well as parents and guardians to see the significance of physical fitness more holistically, as poor physical fitness can increase mental health challenges and impair cognitive skills needed for learning."

"The whole of society should support physical fitness development in children and adolescents by increasing physical activity participation at school, during leisure time , and in hobbies," says Haapala.

This study is based on longitudinal data from the ongoing Physical Activity and Nutrition in Children (PANIC) study, conducted at the Institute of Biomedicine, University of Eastern Finland, and led by Professor Timo Lakka. The study followed the physical fitness of 241 individuals for eight years, from childhood to adolescence, with mental health assessments conducted during adolescence.

The PANIC Study is part of the Metabolic Diseases Research Community at the University of Eastern Finland. The research community is dedicated to investigating major cardiometabolic diseases.

By leveraging genetics, genomics, translational research, and lifestyle interventions, the community aims to provide robust evidence on disease mechanisms and advance early diagnosis, prevention, and personalized treatment. The research community consists of 20 research groups, spanning basic research to patient care.



Research Article

Identifying and characterizing extrapolation in multivariate response data

Meridith L. Bartley, Ephraim M. Hanks, Erin M. Schliep, Patricia A. Soranno, and Tyler Wagner

Abstract

Faced with limitations in data availability, funding, and time constraints, ecologists are often tasked with making predictions beyond the range of their data. In ecological studies, it is not always obvious when and where extrapolation occurs because of the multivariate nature of the data. Previous work on identifying extrapolation has focused on univariate response data, but these methods are not directly applicable to multivariate response data, which are common in ecological investigations. In this paper, we extend previous work that identified extrapolation by applying the predictive variance from the univariate setting to the multivariate case. We propose using the trace or determinant of the predictive variance matrix to obtain a scalar value measure that, when paired with a selected cutoff value, allows for delineation between prediction and extrapolation. We illustrate our approach through an analysis of jointly modeled lake nutrients and indicators of algal biomass and water clarity in over 7000 inland lakes from across the Northeast and Mid-west US. In addition, we outline novel exploratory approaches for identifying regions of covariate space where extrapolation is more likely to occur using classification and regression trees. The use of our Multivariate Predictive Variance (MVPV) measures and multiple cutoff values when exploring the validity of predictions made from multivariate statistical models can help guide ecological inferences.

Citation: Bartley ML, Hanks EM, Schliep EM, Soranno PA, Wagner T (2019) Identifying and characterizing extrapolation in multivariate response data. PLoS ONE 14(12): e0225715. https://doi.org/10.1371/journal.pone.0225715

Editor: Bryan C. Daniels, Arizona State University & Santa Fe Institute, UNITED STATES

Received: May 7, 2019; Accepted: November 10, 2019; Published: December 5, 2019

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: All data and R code used in our analysis have been compiled into a GitHub repository, https://github.com/MLBartley/MV_extrapolation . Current releases are available at https://github.com/MLBartley/MV_extrapolation/releases , and a static version of the package has been publicly archived via Zenodo (DOI: 10.5281/zenodo.3523116 ). All data used and created in this analysis are archived via figshare (DOI: 10.6084/m9.figshare.10093460 ).

Funding: Funding was provided by the US NSF Macrosystems Biology Program grants, DEB-1638679; DEB-1638550, DEB-1638539, DEB-1638554 (EH, PS, TW and ES, https://www.nsf.gov/funding/pgm%20summ.jsp?pims%20id=503425 ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The use of ecological modeling to translate observable patterns in nature into quantitative predictions is vital for scientific understanding, policy making, and ecosystem management. However, generating valid predictions requires robust information across a well-sampled system, which is not always feasible given constraints in gathering and accessing data. Extrapolation is defined as a prediction from a model that is a projection, extension, or expansion of an estimated model (e.g., regression equation, or Bayesian hierarchical model) beyond the range of the data set used to fit that model [ 1 ]. When we use a model fit on available data to predict a value or values at a new location, it is important to consider how dissimilar this new observation is to previously observed values. If some or many covariate values of this new point are dissimilar enough from those used when the model was fitted (i.e., either because they are outside the range of individual covariates or because they are a novel combination of covariates), predictions at this point may be unreliable. Fig 1, adapted from work by Filstrup et al. [ 2 ], illustrates this risk with a simple linear regression between the log-transformed measurements of total phosphorus (TP) and chlorophyll a (Chl a) in U.S. lakes. The data shown in blue were used to fit a linear model, with the estimated regression line shown in the same color. While the selected range of data may be reasonably approximated with a linear model, the linear trend does not extend into more extreme values, and thus our model and predictions are no longer appropriate.


Fig 1. A 95% confidence interval of the mean is included around the regression line. Dashed red lines represent the 95% prediction interval. Areas shaded in darker grey indicate regions of extrapolation (using the maximum leverage value (h_ii) to identify the boundaries).

https://doi.org/10.1371/journal.pone.0225715.g001

While ecologists and other scientists know the risks associated with extrapolating beyond the range of their data, they are often tasked with making predictions beyond the range of the available data in efforts to understand processes at broad scales, or to make predictions about the effects of different policies or management actions in new locations. Forbes and Calow [ 3 ] discuss the double-edged sword of supporting cost-effective progress while exhibiting caution about potentially misleading results that would hinder environmental protections; they outline the need for extrapolation to balance these goals in ecological risk assessment. Other works [ 4 – 6 ] explore strategies for addressing the problem of ecological extrapolation, often in space and time, across applications in management tools and estimation practices. Previous work on identifying extrapolation includes Cook’s early work on detecting outliers within a simple linear regression setting [ 7 ] and recent extensions to GLMs and similar models by Conn et al. [ 8 ]. The work of Conn et al. defines extrapolation as making predictions that occur outside of a generalized independent variable hull (gIVH), defined by the estimated predictive variance of the mean at observed data points. This definition allows for predictions to be either interpolations (inside the hull) or extrapolations (outside the hull).

However, the work of Conn et al. [ 8 ] is restricted to univariate response data, which does not allow for the application of these methods to multivariate response models. This is an important limitation because many ecological and environmental research problems are inherently multivariate in nature. Elith and Leathwick [ 9 ] note the need for additional extrapolation assessments of fit in the context of using species distribution models (SDMs) for forecasting across different spatial and temporal scales. Mesgaran et al. [ 10 ] developed a new tool for identifying extrapolation using the Mahalanobis distance to detect and quantify the degree of dissimilarity for points either outside the univariate range or forming novel combinations of covariates.

In our paper, we present a general framework for quantifying and evaluating extrapolation in multivariate response models that can be applied to a broad class of problems. In brief, we extend the univariate predictive-variance approach to the multivariate case, reducing each prediction’s covariance matrix to a scalar measure that can be compared against a chosen cutoff value.

We draw on extensive tools for measures of leverage and influential points to inform decisions of a cutoff between extrapolation and interpolation. We illustrate our framework through an application of this approach on jointly modeled lake nutrients, productivity, and water clarity variables in over 7000 inland lakes from across the Northeast and Mid-west US.

Predicting lake nutrient and productivity variables

Inland lake ecosystems are threatened by cultural eutrophication, with excess nutrients such as nitrogen (N) and phosphorus (P) resulting in poor water quality, harmful algal blooms, and negative impacts to higher trophic levels [ 11 ]. Inland lakes are also critical components in the global carbon (C) cycle [ 12 ]. Understanding the water quality in lakes allows for informed ecosystem management and better predictions of the ecological impacts of environmental change. Water quality measurements are collected regularly by federal, state, local, and tribal governments, as well as citizen-science groups trained to sample water quality.

The LAGOS-NE database is a multi-scaled geospatial and temporal database for thousands of inland lakes in 17 of the most lake-rich states in the eastern Mid-west and the Northeast of the continental United States [ 13 ]. This database includes a variety of water quality measurements and variables that describe a lake’s ecological context at multiple scales and across multiple dimensions (such as hydrology, geology, land use, and climate).

Wagner and Schliep [ 14 ] jointly modelled lake nutrient, productivity, and clarity variables and found strong evidence these nutrient-productivity variables are dependent. They also found that predictive performance was greatly enhanced by explicitly accounting for the multivariate nature of these data. Filstrup et al. [ 2 ] more closely examined the relationship between Chl a and TP and found nonlinear models fit the data better than a log-linear model. Most notably for this work, the relationship of these variables differ in the extreme values of the observed ranges; while a linear model may work for a moderate range of these data it is imperative that caution is shown before extending results to more extreme values (i.e., to extremely nutrient-poor or nutrient-rich lakes).

In this study, following Wagner and Schliep, we consider four variables: total phosphorus (TP), total nitrogen (TN), Chl a, and Secchi disk depth (Secchi) as joint response variables of interest. Each lake may have observations for all four of these variables, or only a subset. Fig 2 shows response variable availability (fully observed, partially observed, or missing) for each lake in the data set. A partially observed set of response variables for a lake indicates that at least one, but not all, of the water quality measures were sampled. We consider several covariates at the individual lake and watershed scales as explanatory variables, including maximum depth (m), mean base flow (%), mean runoff (mm/yr), road density (km/ha), elevation (m), stream density (km/ha), the ratio of watershed area to lake area, and the proportion of forested and agricultural land in each lake’s watershed. One goal among many for developing this joint model is to be able to predict TN concentrations for all lakes across this region, and eventually the entire continental US. Our objective is to identify and characterize when predictions of these multivariate lake variables are extrapolations. To this end, we review and develop methods for identifying and characterizing extrapolation in multivariate settings.


Fig 2. Left: map of inland lake locations with full, partial, or missing response variables. Missing response variables are lakes where none of the water quality measures have been observed, while partial status indicates only some lake response variables are unobserved. Covariates were quantified for all locations. Right: subset of data status (observed or missing) for each response variable. All spatial plots in this paper were created using the Maps package [ 15 ] in R to provide an outline of US states.

https://doi.org/10.1371/journal.pone.0225715.g002

Materials and methods

Review of current work: Cook’s independent variable hull


This definition remains useful without any underlying distributional assumption on the data. For example, empirically obtained quantile cutoff values can serve reasonably well as thresholds for declaring outliers. However, for multivariate-normal data, the squared MD can be transformed into probabilities using a chi-squared cumulative probability distribution [ 17 ], such that points that have a very high probability of not belonging to the distribution can be classified as outliers. In either scenario, outliers can be detected using only the predictor variables, by calculating x0′(X′X)⁻¹x0 for a new point x0 and comparing it with max(diag(X(X′X)⁻¹X′)).
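A minimal R sketch of the rule just described may help; the design matrix and candidate point below are simulated purely for illustration, not drawn from the lake data.

```r
# Cook's IVH rule: flag a new point x0 as an extrapolation when
# x0' (X'X)^{-1} x0 exceeds the maximum leverage h_ii of the fitted data.
set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))  # simulated design matrix (intercept + 2 covariates)
XtX_inv <- solve(t(X) %*% X)
h <- diag(X %*% XtX_inv %*% t(X))            # leverages of the observed points

x0   <- c(1, 3, -3)                          # hypothetical new prediction point
lev0 <- drop(t(x0) %*% XtX_inv %*% x0)
lev0 > max(h)                                # TRUE: treat the prediction as an extrapolation

# Mahalanobis-distance variant: for roughly multivariate-normal covariates,
# refer the squared distance to a chi-squared distribution.
Z   <- X[, -1]                               # covariates without the intercept
md2 <- mahalanobis(rbind(x0[-1]), colMeans(Z), cov(Z))
pchisq(md2, df = ncol(Z))                    # probability-scale outlyingness
```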

Conn’s generalized IVH.


Prediction variance.


Extension to the multivariate case


Predictions of different response types covary in multivariate models, complicating our definition of a gIVH (see Eq 11), which relies on finding a maximum univariate value. Where a univariate model yields a scalar prediction variance (Eq 18), a multivariate model has a prediction covariance matrix. We propose capturing the size of this covariance matrix using univariate summary measures; note this is similar to the A-optimality and D-optimality criteria used in experimental design [ 19 ].

Further, using our novel numeric measure of extrapolation, we aim to take advantage of the multivariate response information to identify when predictions for an additional observation (i.e., the covariates of a new lake location) are extrapolations for all response values jointly. We also present an approach to identify when we cannot trust a prediction for only a single response variable at either a new lake location or a currently partially sampled lake. The latter identification would be useful for a range of applications in ecology. For example, in the inland lakes project, one important goal is to predict TN, because this essential nutrient is not well sampled across the study extent, and yet is important for understanding nutrient dynamics and for informing eutrophication management strategies for inland lakes. In this case, to accommodate TN not being observed (i.e., sampled) as often as some other water quality variables, we can leverage the knowledge gained from samples of other water quality measures taken more often than TN (e.g., Secchi disk depth [ 20 ] is a common measure of water clarity obtained on site, while other water quality measurements require samples to be sent to a lab for analysis). We first outline our approach for identifying extrapolated new observations using a measure of predictive variance for lakes that have been fully or partially sampled and used to fit the model. Then, we describe how this approach can be applied to the prediction of TN in lakes for which it has not been sampled.

Multivariate extrapolation measures.


The trace (tr) of an n × n square matrix V is defined to be the sum of the elements on its main diagonal (the diagonal from the upper left to the lower right). The trace does not take into account the correlation between variables and is not a scale-invariant measure; however, as the response variables in the inland lakes example are log-transformed, we chose to explore its use for obtaining a scalar-valued extrapolation measure. The determinant (D) takes into account the correlations among pairs of variables and is scale-invariant. In this paper, we explore both approaches, quantifying extrapolation from our multivariate model of the LAGOS-NE lake data set using each measure in turn.
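The sketch below shows one way these two summaries could be computed and turned into a binary extrapolation index, assuming a hypothetical list of per-lake 4 × 4 predictive covariance matrices and the 0.95/0.99 quantile cutoffs used later in the paper; it is a schematic reconstruction, not the authors’ released code.

```r
# MVPV(tr) and MVPV(D): reduce each predictive covariance matrix to a scalar,
# then flag locations whose value exceeds an empirical quantile cutoff.
summarize_mvpv <- function(V) c(tr = sum(diag(V)), D = det(V))

set.seed(2)
# Hypothetical stand-in for per-lake posterior predictive covariance matrices
pred_cov <- replicate(1000, { A <- matrix(rnorm(16), 4); crossprod(A) },
                      simplify = FALSE)

mvpv  <- t(sapply(pred_cov, summarize_mvpv))     # one row per lake
cut95 <- apply(mvpv, 2, quantile, probs = 0.95)  # 0.95 quantile cutoffs
cut99 <- apply(mvpv, 2, quantile, probs = 0.99)  # 0.99 quantile cutoffs

extrap95_D <- mvpv[, "D"] > cut95["D"]           # binary extrapolation index
table(extrap95_D)
```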

Conditional single variable extrapolation measures.

The chosen numeric measure of MV extrapolation includes information from the entire set of responses. In the inland lake example, this could be used to identify unsampled lakes where prediction of the whole vector of response variables (TN, TP, Chl a, Secchi) are extrapolations. However, even when a joint model is appropriate, there are important scientific questions that can be answered with prediction of a single variable.

Any of the four response variables may be considered to be variable 1, so this general partition approach may be used for any variable conditioned on all others. The values of μ−i and Σ are determined by the availability of data for the three variables we are conditioning on; these water quality measures can be fully, partially, or not observed.
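
A minimal sketch of this conditioning step, using the standard partitioned multivariate normal formulas (the variable names are ours; this is not the authors' implementation):

```python
import numpy as np

def conditional_normal(mu, Sigma, i, x_rest):
    """Conditional mean and variance of variable i in a multivariate
    normal, given observed values x_rest for the remaining variables."""
    idx = np.array([j for j in range(len(mu)) if j != i])
    S_ir = Sigma[i, idx]                    # cross-covariances with the rest
    S_rr = Sigma[np.ix_(idx, idx)]          # covariance of the conditioning block
    w = np.linalg.solve(S_rr, S_ir)
    cond_mean = mu[i] + w @ (x_rest - mu[idx])
    cond_var = Sigma[i, i] - S_ir @ w       # conditional predictive variance
    return cond_mean, cond_var
```

The conditional variance returned here plays the role of the single-variable predictive variance, with the conditioning set determined by which of the other water quality measures happen to be observed at a lake.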

Cutoffs vs continuous measures

Identifying locations as extrapolations

Choosing IVH vs PV

With several methods of identifying extrapolations available, we now provide additional guidance on choosing among the options. Cook's approach of using the maximum leverage value to define the IVH boundary may be useful for either a univariate or a joint model in a linear regression framework. However, because it depends on covariate values alone, it lacks any influence of the response data. Conn et al.'s gIVH instead uses the posterior predictive variance, rather than the hat matrix, to define the hull boundary in the case of a generalized model.

Visualization and interpretation

Exploring data and taking a principled approach to identifying potential extrapolation points is often aided by visualization and interpretation of data and predictions. With the LAGOS data we examine spatial plots of the lakes, with locations coded as extrapolation versus prediction. Plotting this for multiple cutoff choices (as in Fig 3) is useful for exploring how this choice influences which locations are considered extrapolations. This is important from both an ecological and a management perspective. For instance, if particular areas are identified as having many extrapolations, this might suggest that specific lake ecosystems or landscapes have characteristics influencing the processes governing nutrient dynamics in lakes that are not well captured by previously collected data, and thus may require further investigation.

Fig 3. Four cutoff approaches are compared and presented. Lakes in orange diamonds and red triangles indicate those where predictions were beyond the 99% and 95% cutoff values, respectively, and thus were considered extrapolations. The color and shape of extrapolated lake locations are determined by which cutoff value first identifies the prediction at that location as an extrapolation.

https://doi.org/10.1371/journal.pone.0225715.g003

In addition to an exploration of possible extrapolation in physical space (through the plot in Fig 3), we also examine possible extrapolation in covariate space. Using either the binary or the numeric Extrapolation Index values, we propose a Classification and Regression Tree (CART) analysis with the extrapolation values as the response. This classification approach allows further insight into which covariates may be influential in determining whether a newly observed location is too dissimilar to existing ones. A CART model allows for the identification of regions in covariate space where predictions are suspect, and may inform future sampling efforts where the available data have not fully characterized all lakes.
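
A sketch of such a CART analysis with scikit-learn follows; the file name and column names are hypothetical placeholders, and the authors' own implementation is not specified here:

```python
# Fit a classification tree with the binary extrapolation flag as the
# response to locate regions of covariate space where extrapolations
# concentrate. Column and file names are illustrative assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

lakes = pd.read_csv("lake_covariates.csv")                  # assumed input file
features = ["shoreline_km", "elevation_m", "stream_density", "lake_sdf"]
X, y = lakes[features], lakes["is_extrapolation"]           # y: 0/1 flag

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=features))            # inspect the splits
```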

Model fitting

Fitting our multivariate linear model to the 8,910 lakes resulted in most lakes' predictions remaining within the extrapolation index cutoff and thus not being identified as extrapolations. We explored both the trace and the determinant for obtaining a scalar representation of the multivariate posterior predictive variance, in addition to four cutoff criteria. Using MVPV(tr) with these cutoffs (max value, leverage max, 0.99 quantile, and 0.95 quantile) resulted in 0, 1, 9, and 33 multivariate response predictions being identified as extrapolations, respectively. In contrast, using MVPV(D) values combined with the four cutoffs resulted in 0, 0, 8, and 37 predictions identified as extrapolations. Unless all response variables are on the same scale, we recommend MVPV(D) over MVPV(tr). However, if a scale-invariant measure is not necessary, exploring MVPV(tr) (in addition to MVPV(D)) may reveal single response variables of interest for further exploration using our Conditional MVPV approach.

Fig 3 shows the spatial locations of lakes where the collective model predictions for TP, TN, Chl a, and Secchi depth were identified as extrapolations using MVPV(D) combined with the cutoff measures. As the cutoff values become more conservative, the number of extrapolations identified increases. The figure shows the level of cutoff that first identifies a location as an extrapolation (e.g. red squares are locations first flagged using the 99% cutoff, but they would also be included in the extrapolations found with the 95% cutoff). This increasing number of identified extrapolations highlights the importance of exploring different choices for a cutoff value. When the maximum value or the leverage-informed maximum of the predictive variance measure (kmax and klev) are used as cutoffs for determining when a prediction for an unsampled lake location should not be fully trusted, zero lakes are identified as extrapolations.

Exploratory data analysis (see S1 Fig) indicates that, for each of the lakes identified as extrapolations, the values are within the distribution of the data, with only a few exceptions. Rather than a few key variables standing out, it appears to be some combination of variables that makes a lake an extrapolation. To further characterize the type of lake more likely to be identified as an extrapolation, we used a CART model with our binary extrapolation index results using MVPV(D) and the 0.95 quantile cutoff. This approach can help identify regions in covariate space where extrapolations are more likely to occur (Fig 4). The CART analysis suggests the most important factors associated with extrapolation include shoreline length, elevation, stream density, and lake SDF. For example, a lake with a shoreline longer than 26 kilometers and above a certain elevation (≥ 279 m) is likely to be identified as an extrapolation when using this model to obtain predictions. This type of information is useful for ecologists modelling lake nutrients because it suggests lakes with these characteristics may behave differently from other lakes. In fact, lake perimeter, SDF, and elevation have been shown to be associated with reservoirs relative to natural lakes [24]. Although it is beyond the scope of our paper to fully explore this notion, because our existing database does not differentiate between natural lakes and reservoirs, these results lend support to our approach and conclusions.

Fig 4. Each level of nodes includes the thresholds and variables used to sort the data. Node color indicates whether the majority of sorted inland lake locations were identified as predictions (blue) or extrapolations (red). The first row of numbers in a node indicates the number of lakes sorted into the node that were identified as extrapolations (left) or predictions (right). The second row indicates the percentage of lakes identified as predictions (left) or extrapolations (right), with the terminal (square) nodes also including the percentage of records sorted by the decision tree.

https://doi.org/10.1371/journal.pone.0225715.g004

We also employed the conditional single-variable extrapolation approach, based on predictive variance, to leverage all information known about a lake when considering whether a prediction of a single response variable (e.g. TN, as explored here) is an extrapolation (Fig 5). These cutoffs resulted in 0, 2, 73, and 386 TN predictions out of 5,031 lakes being identified as extrapolations. To characterize the type of lake more likely to be identified as an extrapolation we used a CART model using the 95% cutoff criterion. CART revealed that the most important factors associated with extrapolation were latitude, maximum depth, and watershed to lake size ratio. Latitude may be expected, as many of the lakes without measures for TN are located in the northern region. An additional visualization and a table exploring extrapolated lakes and their covariate values may be found in S1 Tables.

Fig 5. Four cutoff approaches are compared and presented. Lakes in blue circles represent locations where TN predictions have not been identified as extrapolations for any cutoff choice. Lakes in red squares, orange triangles, and yellow diamonds indicate those where predictions were beyond the cutoff values and thus were considered extrapolations. The color and shape of extrapolated lake locations are determined by which cutoff value first identifies the prediction at that location as an extrapolation.

https://doi.org/10.1371/journal.pone.0225715.g005

We have presented different approaches for identifying and characterizing potential extrapolation points within multivariate response data. Ecological research often faces the challenge of explaining processes at broad scales with limited data. Financial, temporal, and logistical restrictions often prevent research efforts from fully exploring an ecosystem or ecological setting. Instead, ecologists rely on predictions made from a limited amount of available data that may not fully represent the breadth of a system of study. By better understanding when extrapolation is occurring, scientists may avoid making unsound inferences.

In our inland lakes example we addressed the issue of large-scale predictions to fill in missing data using a joint linear model presented by Wagner and Schliep [18]. With our novel approach for identifying and characterizing extrapolation in a multivariate setting, we were able to provide numeric measures associated with extrapolation (MVPV, CMVPV, R(C)MVPV), allowing for focus on predictions for all response variables or for a single response variable while conditioning on others. Each of these measures, when paired with a cutoff criterion, identifies novel locations that are extrapolations. Our recommendations for visualization and interpretation of these extrapolated lakes are useful for future analyses and predictions that inform policy and management decisions. Insight into identified extrapolations and their characteristics provides additional sampling locations to consider for future work. In this analysis we found that certain lakes, such as lakes located at relatively higher elevations in our study area, are more likely to be identified as extrapolations. The available data may thus not fully represent these types of lakes, resulting in them being poorly predicted, or identified as extrapolations.

The tools outlined in this work provide novel insights into identifying and characterizing extrapolations in multivariate response settings. Further extensions of this work are available but not explored in this paper. In addition to the A- and D-optimality approaches (trace and determinant, respectively) used to obtain scalar representations of the covariance matrices, one may also explore the utility of E-optimality (maximum eigenvalue) as an additional criterion. This approach would focus on the variance in the first principal component of the predictive variance matrix and, like the trace, is not a scale-invariant measure. Our work takes advantage of posterior predictive inference in a Bayesian setting to obtain an estimate of the variance of the predictive mean response vector for each lake. However, a frequentist approach using simulation-based methods may also provide an estimate of this variance through non-parametric or parametric bootstrapping (a comparison of the two for spatial abundance estimates may be found in Hedley and Buckland [25]), and the extrapolation coefficients may be obtained through the trace and/or determinant of this variance.
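
For completeness, the E-optimality summary mentioned above is a one-line computation given a prediction covariance matrix (the matrix here is illustrative only):

```python
import numpy as np

V = np.array([[0.30, 0.12],
              [0.12, 0.25]])                 # hypothetical prediction covariance
mvpv_E = np.max(np.linalg.eigvalsh(V))       # variance along the first principal component
print(mvpv_E)
```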

This work results in the identification of extrapolated lake locations as well as a further understanding of the unique covariate space they occupy. The resulting caution when using joint nutrient models to estimate water quality variables at lakes with partially or completely unsampled measures is necessary for larger goals such as estimating the overall combined levels of varying water qualities in all US inland lakes. In addition, under- or overestimating concentrations of key nutrients such as TN and TP can potentially lead to misinformed management strategies, which may have deleterious effects on water quality and the lake ecosystem. The identification of lake and landscape characteristics associated with extrapolation locations can further understanding of natural and anthropogenic sources of nutrients in lakes not well represented in the sampled population. In our database, TP is sampled more often than TN, which is likely due to the conventional wisdom that inland waters are P limited, with P contributing the most to eutrophication [26]. However, nitrogen has been shown to be an important nutrient in eutrophication in some lakes and some regions [27], and may be just as important to sample to fully understand lake eutrophication. Our results show it is possible to predict TN if other water quality variables are available, but it would be better if TN were sampled more often.

The joint model used in this work can be improved upon in several regards: no spatial component is included, response variables are averages over several years' worth of data so temporal variation is not considered, and data from different years are given equal weight. The model we use to fit these data may be considered a simple one, but the novel approach presented here may be applied to more complicated models. In a sample-based approach using a Bayesian framework, the MVPV and CMVPV values are obtained from the MCMC samples and are thus independent of model design choices.

A deeper understanding of where extrapolation is occurring will allow researchers to propagate this uncertainty forward. Follow-up analyses using model-based predictions need to acknowledge that some predictions are less trustworthy than others. Our approach and analysis show that while a model may be able to produce an estimate and a confidence or prediction interval, that does not mean the truth is captured, nor that the assumed relationship persists, especially outside the range of observed data. The methods outlined here will serve to guide future scientific inquiries involving joint distribution models.

Supporting information

S1 Fig. Violin plots of covariate densities and extrapolation points.

https://doi.org/10.1371/journal.pone.0225715.s001

S1 Tables. Tables of covariate values for lakes identified as extrapolations using MVPV(D) and CMVPV for TN.

https://doi.org/10.1371/journal.pone.0225715.s002

Acknowledgments

We thank the LAGOS Continental Limnology Research Team for helpful discussions throughout the process of this manuscript. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

This draft manuscript is distributed solely for purposes of scientific peer review. Its content is deliberative and predecisional, so it must not be disclosed or released by reviewers. Because the manuscript has not yet been approved for publication by the US Geological Survey (USGS), it does not represent any official finding or policy.

Nipocalimab demonstrates sustained disease control in adolescents living with generalized myasthenia gravis in Phase 2/3 study

First FcRn blocker to demonstrate sustained disease control over 24 weeks in antibody positive adolescents aged 12–17 years, broadening the population in which nipocalimab has been studied

SAVANNAH, Ga., Oct. 15, 2024 /PRNewswire/ -- Johnson & Johnson (NYSE: JNJ) today announced positive results from the Phase 2/3 Vibrance-MG study of nipocalimab in anti-AChR a positive adolescents (aged 12–17 years) living with generalized myasthenia gravis (gMG). Study participants who were treated with nipocalimab plus standard of care (SOC) achieved sustained disease control as measured by the primary endpoint of immunoglobulin G (IgG) reduction from baseline over 24 weeks, and the secondary endpoints of improvement in MG-ADL b and QMG c scores. These Phase 2/3 data will be featured in an oral presentation (Abstract #MG100) at the Myasthenia Gravis Foundation of America (MGFA) Scientific Session during the American Association of Neuromuscular & Electrodiagnostic Medicine (AANEM) Annual Meeting, where Johnson & Johnson will present 25 abstracts.

Experience the full interactive Multichannel News Release here:  https://www.multivu.com/johnson-johnson/9296251-en-johnson-and-johnson-nipocalimab

"Findings from the Vibrance-MG study underscore the potential of this investigational therapy for young individuals aged 12 – 17 living with gMG. Results show a significant reduction in IgG of approximately 70% in adolescents and a clinical benefit that is consistent with the Vivacity-MG3 study in adults," said Jonathan Strober , M.D., Director of Clinical Services for Child Neurology and Director of the Muscular Dystrophy Clinic at UCSF Benioff Children's Hospital. d "It is encouraging to see these positive results as there are currently no approved advanced treatment options for this adolescent population in the United States ."

About 10% of new cases of myasthenia gravis are diagnosed in adolescents (12–17 years of age), and the severity of gMG in pediatric patients is heightened: 43% have experienced over five hospitalizations in their lifetime, 46% have had at least one intensive care unit stay, and 68% have had periods of exacerbated disease. 1,2,3,4

Treatment with nipocalimab plus SOC met the study's primary endpoint of reduction in total serum IgG (-69%) and the two secondary endpoints of MG-ADL and QMG, which are measures of disease activity. 5,e Four of five patients achieved minimum symptom expression (MG-ADL score 0-1) by the end of their treatment phase. f,g Nipocalimab was well tolerated over the six-month period, similar to the tolerability seen in adult participants in the Vivacity-MG3 study. 5 There were no serious adverse events and no discontinuations due to an adverse event.

Presented for the first time, these open-label Phase 2/3 results in adolescents are consistent with findings from the pivotal study of nipocalimab in adult patients with gMG. Nipocalimab, when added to SOC, is the first FcRn blocker to demonstrate sustained disease control in a registrational trial, as measured by improvement in MG-ADL over placebo plus SOC over a period of six months of consistent dosing (every two weeks) among adults living with gMG.

"The Vibrance-MG data add to the expanding clinical profile of nipocalimab and highlight its potential for adolescents living with gMG who are in need of new treatments," said Sindhu Ramchandren , M.D., Executive Medical Director, Neuroscience, Johnson & Johnson Innovative Medicine. "We are committed to developing innovations for autoantibody-driven neurological diseases, like gMG, with the aim of transforming the lives of people living with these conditions."

Earlier this year, Johnson & Johnson announced the submission of applications to the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA) seeking approval of nipocalimab for the treatment of gMG.

Editor's notes:  

a. Patients with a positive blood test for acetylcholine receptor (anti-AChR) antibodies or muscle-specific tyrosine kinase (anti-MuSK) antibodies are eligible for the study.

b. MG-ADL (Myasthenia Gravis – Activities of Daily Living) provides a rapid clinical assessment of the patient's recall of symptoms impacting activities of daily living, with a total score range of 0 to 24; a higher score indicates greater symptom severity.

c. QMG (Quantitative Myasthenia Gravis) is a 13-item assessment by a clinician that quantifies MG disease severity through muscle weakness. The total QMG score ranges from 0 to 39, where higher scores indicate greater disease severity.

d. Dr. Jonathan Strober is a paid consultant for Johnson & Johnson. He has not been compensated for any media work.

e. Treatment with nipocalimab showed a mean percentage change from baseline to week 24 for total serum IgG of -68.98% (standard error [SE] = 7.561).

f. Adolescents who received nipocalimab plus current SOC had a mean baseline score of 4.29 (SE = 2.430) on the MG-ADL scale and a mean baseline score of 12.50 (SE = 3.708) on the QMG scale.

g. Adolescents who received nipocalimab plus current SOC had a mean change at week 24 of -2.40 (SE = 0.187) on the MG-ADL scale and -3.80 (SE = 2.683) on the QMG scale.

About Generalized Myasthenia Gravis (gMG)

Myasthenia gravis (MG) is an autoantibody disease in which the immune system mistakenly makes antibodies (e.g., anti-acetylcholine receptor [AChR], anti-muscle-specific tyrosine kinase [MuSK] or anti-low density lipoprotein-related protein 4 [LRP4]), which target proteins at the neuromuscular junction and can block or disrupt normal signaling from nerves to muscles, thus impairing or preventing muscle contraction. 6,7 The disease impacts an estimated 700,000 people worldwide. 6 Approximately 10 to 15% of new cases of MG are diagnosed in adolescents (12–17 years of age). 1,2,3 Among juvenile MG patients, girls are affected more often than boys, with over 65% of pediatric MG cases in the US diagnosed in girls. 8,9,10

Initial disease manifestations are usually ocular, but in 85% or more of cases the disease generalizes (gMG), which is characterized by fluctuating weakness of the skeletal muscles leading to symptoms like limb weakness, drooping eyelids, double vision and difficulties with chewing, swallowing, speech, and breathing. 6,11,12,13,14 Approximately 100,000 individuals in the U.S. are living with gMG. 15 Vulnerable gMG populations, such as pediatric patients, have more limited therapeutic options. 3 Currently, SOC treatments for adolescents with gMG are extrapolated from adult trials. 3 Other than symptomatic treatments, there are no approved FcRn blockers that may address the root cause of the disease for adolescents with gMG in the United States. 3

About the Phase 2/3 Vibrance-MG Study

The Phase 2/3 Vibrance-MG study (NCT05265273) is an ongoing open-label study to determine the effect of nipocalimab in pediatric participants with gMG. 16 Seven participants aged 12–17 years, with a diagnosis of gMG as reflected by a Myasthenia Gravis Foundation of America (MGFA) Class of II through IV at screening and an insufficient clinical response to ongoing, stable SOC therapy, have been enrolled in the trial. 5 Participants must have a positive blood test for either anti-AChR or anti-MuSK autoantibodies. The study consists of a screening period of up to four weeks, a 24-week open-label Active Treatment Phase during which participants receive nipocalimab intravenously every two weeks, and a Long-term Extension Phase; a safety follow-up assessment will be conducted at eight weeks after the last dose. 16 The primary outcome of the study is the effect of nipocalimab on total serum IgG, safety and tolerability, and pharmacokinetics in pediatric participants with gMG at 24 weeks. Secondary endpoints include change in MG-ADL and QMG scores at 24 weeks. 5,16

About Nipocalimab

Nipocalimab is an investigational monoclonal antibody designed to bind with high affinity to block FcRn and reduce levels of circulating immunoglobulin G (IgG) antibodies, potentially without impact on other immune functions. This includes autoantibodies and alloantibodies that underlie multiple conditions across three key segments in the autoantibody space, including Rare Autoantibody diseases, Maternal Fetal diseases mediated by maternal alloantibodies, and Prevalent Rheumatology. 17,18,19,20,21,22,23,24,25 Blockade of IgG binding to FcRn in the placenta is also believed to limit transplacental transfer of maternal alloantibodies to the fetus. 26,27

The U.S. Food and Drug Administration (FDA) and European Medicines Agency (EMA) have granted several key designations to nipocalimab including:  

About Johnson & Johnson

At Johnson & Johnson, we believe health is everything. Our strength in healthcare innovation empowers us to build a world where complex diseases are prevented, treated, and cured, where treatments are smarter and less invasive, and solutions are personal. Through our expertise in Innovative Medicine and MedTech, we are uniquely positioned to innovate across the full spectrum of healthcare solutions today to deliver the breakthroughs of tomorrow, and profoundly impact health for humanity.  

Learn more at https://www.jnj.com/  or at www.innovativemedicine.jnj.com

Follow us at @JanssenUS  and @JNJInnovMed .

Janssen Research & Development, LLC and Janssen Biotech, Inc. are both Johnson & Johnson companies. 

Cautions Concerning Forward-Looking Statements  

This press release contains "forward-looking statements" as defined in the Private Securities Litigation Reform Act of 1995 regarding product development and the potential benefits and treatment impact of nipocalimab. The reader is cautioned not to rely on these forward-looking statements. These statements are based on current expectations of future events. If underlying assumptions prove inaccurate or known or unknown risks or uncertainties materialize, actual results could vary materially from the expectations and projections of Janssen Research & Development, LLC, Janssen Biotech, Inc. and/or Johnson & Johnson. Risks and uncertainties include, but are not limited to: challenges and uncertainties inherent in product research and development, including the uncertainty of clinical success and of obtaining regulatory approvals; uncertainty of commercial success; manufacturing difficulties and delays; competition, including technological advances, new products and patents attained by competitors; challenges to patents; product efficacy or safety concerns resulting in product recalls or regulatory action; changes in behavior and spending patterns of purchasers of health care products and services; changes to applicable laws and regulations, including global health care reforms; and trends toward health care cost containment. A further list and descriptions of these risks, uncertainties and other factors can be found in Johnson & Johnson's Annual Report on Form 10-K for the fiscal year ended December 31, 2023 , including in the sections captioned "Cautionary Note Regarding Forward-Looking Statements" and "Item 1A. Risk Factors," and in Johnson & Johnson's subsequent Quarterly Reports on Form 10-Q and other filings with the Securities and Exchange Commission. Copies of these filings are available online at www.sec.gov , www.jnj.com or on request from Johnson & Johnson. None of Janssen Research & Development, LLC, Janssen Biotech, Inc. nor Johnson & Johnson undertakes to update any forward-looking statement as a result of new information or future events or developments.

1. Evoli A, Batocchi AP, Bartoccioni E, Lino MM, Minisci C, Tonali P. Juvenile myasthenia gravis with prepubertal onset. Neuromuscul Disord. 1998 Dec;8(8):561-7. doi: 10.1016/s0960-8966(98)00077-7.
2. Evoli A. Acquired myasthenia gravis in childhood. Curr Opin Neurol. 2010 Oct;23(5):536-40. doi: 10.1097/WCO.0b013e32833c32af.
3. Finnis MF, Jayawant S. Juvenile myasthenia gravis: a paediatric perspective. Autoimmune Dis. 2011;2011:404101. doi: 10.4061/2011/404101.
4. Barraud C, Desguerre I, Barnerias C, Gitiaux C, Boulay C, Chabrol B. Clinical features and evolution of juvenile myasthenia gravis in a French cohort. Muscle Nerve. 2018 Apr;57(4):603-609. doi: 10.1002/mus.25965.
5. Strober J, et al. Safety and effectiveness of nipocalimab in adolescent participants in the open label Phase 2/3 Vibrance-MG clinical study. Presentation at American Association of Neuromuscular & Electrodiagnostic Medicine (AANEM) Annual Meeting. October 2024.
6. Chen J, Tian D-C, Zhang C, et al. Incidence, mortality, and economic burden of myasthenia gravis in China: a nationwide population-based study. The Lancet Regional Health - Western Pacific. https://www.thelancet.com/action/showPdf?pii=S2666-6065%2820%2930063-8
7. Wiendl H, et al. Guideline for the management of myasthenic syndromes. Therapeutic Advances in Neurological Disorders. 16, 17562864231213240. https://doi.org/10.1177/17562864231213240. Last accessed: October 2024.
8. Haliloglu G, Anlar B, Aysun S, Topcu M, Topaloglu H, Turanli G, Yalnizoglu D. Gender prevalence in childhood multiple sclerosis and myasthenia gravis. J Child Neurol. 2002 May;17(5):390-2. doi: 10.1177/088307380201700516.
9. Parr JR, Andrew MJ, Finnis M, Beeson D, Vincent A, Jayawant S. How common is childhood myasthenia? The UK incidence and prevalence of autoimmune and congenital myasthenia. Arch Dis Child. 2014 Jun;99(6):539-42. doi: 10.1136/archdischild-2013-304788.
10. Mansukhani SA, Bothun ED, Diehl NN, Mohney BG. Incidence and ocular features of pediatric myasthenias. Am J Ophthalmol. 2019 Apr;200:242-249. doi: 10.1016/j.ajo.2019.01.004.
11. Bever CT Jr, Aquino AV, Penn AS, Lovelace RE, Rowland LP. Prognosis of ocular myasthenia. Ann Neurol. 1983;14:516-519. https://doi.org/10.1002/ana.410140504
12. Kupersmith MJ, Latkany R, Homel P. Development of generalized disease at 2 years in patients with ocular myasthenia gravis. Arch Neurol. 2003 Feb;60(2):243-8. doi: 10.1001/archneur.60.2.243. PMID: 12580710.
13. Myasthenia gravis fact sheet. Retrieved April 2024 from https://www.ninds.nih.gov/sites/default/files/migrate-documents/myasthenia_gravis_e_march_2020_508c.pdf
14. Myasthenia gravis: treatment & symptoms. (2021, April 7). Retrieved April 2024 from https://my.clevelandclinic.org/health/diseases/17252-myasthenia-gravis-mg
15. DRG EPI (2021) & Optum Claims Analysis, Jan 2012-December 2020.
16. ClinicalTrials.gov. NCT05265273. Available at: https://clinicaltrials.gov/study/NCT05265273. Last accessed: October 2024.
17. ClinicalTrials.gov. NCT04951622. Available at: https://clinicaltrials.gov/ct2/show/NCT04951622. Last accessed: October 2024.
18. ClinicalTrials.gov. NCT03842189. Available at: https://clinicaltrials.gov/ct2/show/NCT03842189. Last accessed: October 2024.
19. ClinicalTrials.gov. NCT05327114. Available at: https://www.clinicaltrials.gov/study/NCT05327114. Last accessed: October 2024.
20. ClinicalTrials.gov. NCT04119050. Available at: https://clinicaltrials.gov/study/NCT04119050. Last accessed: October 2024.
21. ClinicalTrials.gov. NCT05379634. Available at: https://clinicaltrials.gov/study/NCT05379634. Last accessed: October 2024.
22. ClinicalTrials.gov. NCT05912517. Available at: https://www.clinicaltrials.gov/study/NCT05912517. Last accessed: October 2024.
23. ClinicalTrials.gov. NCT06028438. Available at: https://clinicaltrials.gov/study/NCT06028438. Last accessed: October 2024.
24. ClinicalTrials.gov. NCT04968912. Available at: https://clinicaltrials.gov/study/NCT04968912. Last accessed: October 2024.
25. ClinicalTrials.gov. NCT04882878. Available at: https://clinicaltrials.gov/study/NCT04882878. Last accessed: October 2024.
26. Lobato G, Soncini CS. Relationship between obstetric history and Rh(D) alloimmunization severity. Arch Gynecol Obstet. 2008 Mar;277(3):245-8. doi: 10.1007/s00404-007-0446-x. Last accessed: October 2024.
27. Roy S, Nanovskaya T, Patrikeeva S, et al. M281, an anti-FcRn antibody, inhibits IgG transfer in a human ex vivo placental perfusion model. Am J Obstet Gynecol. 2019;220(5):498.e491-498.e499.


Media contact: Bridget Kimmel, Mobile: (215) 688-6033

Investor contact: Lauren Johnson

SOURCE Johnson & Johnson


ARRIS Composites Selected by U.S. Army for Research Study on Advanced Continuous Fiber Composite Insoles in Military Boots

Study aims to reduce soldier foot fatigue & prevent injuries while enhancing performance: findings expected to influence future military boot design.

Berkeley, CA, October 15, 2024—ARRIS Composites, a leader in high-performance continuous fiber thermoplastic composite manufacturing, is proud to announce its selection by the U.S. Army for a groundbreaking study on the use of advanced carbon fiber plates in military boots.  

Funded by the U.S. Army Natick Soldier Research, Development and Engineering Center (NATICK) and the U.S. Army Combat Capabilities Development Command (DEVCOM) through the University of Massachusetts at Lowell (UML) HEROES program, this collaborative research will be conducted alongside the School of Kinesiology and Nutrition at The University of Southern Mississippi (USM). Findings from the study are expected in 2025, with the potential to reshape future designs of military boots to improve soldier performance and reduce risk of musculoskeletal injuries.

Founded in 2017, ARRIS has produced a number of high-performance composite products across a wide range of industries. ARRIS previously partnered with the US Army & DEVCOM GVSC on lightweight vehicle seats. 1 Most applicable to this military boot study, ARRIS developed a novel running plate for Brooks’ Hyperion Elite 4 running shoe, where ARRIS’ carbon fiber plate reduced weight, improved energy return, and provided better recovery for elite runners. Building on its success in the “super shoe” market, ARRIS is now applying its technology to military footwear to develop cutting-edge carbon fiber insoles for use in military-issued boots. These novel insoles are designed and engineered to enhance daily movement, reduce fatigue, and mitigate musculoskeletal injuries, especially in hot-weather environments.

This ongoing study involves both bench-top and in-vivo testing to validate the benefits of carbon fiber insoles in military-issued boots. Prototypes of continuous fiber-reinforced plates have been integrated into hot-weather combat boots and will be tested to assess their impact on running economy, lower extremity biomechanics, and measures of functional performance.

“This study is an exciting opportunity to explore how the most advanced composites can be used with best-in-class foam to benefit military personnel that spend long hours on their feet,” said Riley Reese, CEO at ARRIS Composites. “Injury reduction by way of novel footwear technologies has long eluded even the most industry-leading brands with little to no science backing some egregious claims. This study with the military will provide a critical framework for demonstrating measurable, real-world benefits.”

“Footwear plays a significant role in the mechanics of movement, particularly for service members who operate in demanding environments,” said Scott Piland, PhD, Professor and Director at the School of Kinesiology and Nutrition at The University of Southern Mississippi. “This research will offer significant insight into how we might boost performance and prevent injuries through the application of this unique technology. We are enthused to be part of this collaboration with ARRIS and UML.”

The ultimate objective of this project will be to directly integrate proven performance-enhancing continuous fiber composite plates into next-generation military boots, addressing soldiers’ unique challenges in diverse field environments. The results are expected to influence both military and civilian applications, revealing even more meaningful possibilities for athletes and enthusiasts alike, as well as professionals and workers on their feet all day.

ARRIS has been ranked one of Fast Company’s “10 Most Innovative Manufacturers,” was labeled one of 13 sports startups shaking up the industry by Business Insider, and has earned the BIG Innovation Award four years straight from the Business Intelligence Group. Learn more at arriscomposites.com .

General Inquiries [email protected] 510.730.0067

Media Inquiries Elizabeth Griffin-Isabelle, ARRIS [email protected]

1 CompositesWorld. “Arris, U.S. Army and LIFT Launch Collaborative Project to Lightweight Combat Vehicles.” CompositesWorld , Gardner Business Media, 27 Sept. 2023, www.compositesworld.com/news/arris-us-army-and-lift-launch-collaborative-project-to-lightweight-combat-vehicles.


Extrapolation beyond the end of trials to estimate long term survival and cost effectiveness

Nicholas R Latimer

1 School of Health and Related Research, University of Sheffield, Sheffield, UK

Amanda I Adler

2 Diabetes Trials Unit, University of Oxford, Oxford, UK

Associated Data

No patient level data were used for this article. The survival curves illustrated in figures 1 and 3 were fitted to data simulated by NRL. Please email the corresponding author to request access to the simulated data.

Key messages

This paper explains the importance of extrapolating beyond the end of trials to estimate the long term benefits associated with new treatments, why this is done, and the limitations of various approaches.

Introduction

Policy makers worldwide use economic evaluation to inform decisions when allocating limited healthcare resources. A critical part of this evaluation involves accurately estimating long term effects of treatments. Yet, evidence is usually from clinical trials of short duration. Rarely do all participants encounter the clinical event of interest by the trial’s end. When people might benefit from a long term treatment, health technology assessment agencies recommend that economic evaluations extrapolate beyond the trial period to estimate lifetime benefits. 1 2 This kind of evaluation is common for people with cancer, when effective treatments delay disease progression and improve survival.

Use of survival modelling: rationale

To make funding decisions, health technology assessment agencies rely on accurate estimates of the benefits and costs of new treatments compared with existing treatments. For treatments that improve survival, accurate estimates of survival benefits are crucial. Policy makers use estimates of mean (average) survival rather than median survival, taking into account the probability of death over a lifetime across all patients with the disease. This mean is represented by the area under survival curves that plot the proportion of patients alive over time by treatment.

In figure 1, the purple area represents the mean survival benefit associated with an experimental treatment compared with a control treatment, but this benefit is a restricted mean, limited to the trial period. The curves separate early, and remain separated at the end of the trial, so it is reasonable to expect that benefits would continue to accrue beyond the trial’s end. The orange smooth curves represent survival models fitted to the trial data and extrapolated beyond the trial. The area between the orange curves estimates the mean lifetime survival benefit associated with the experimental treatment. This area is much larger than the purple area, and is relevant for economic evaluation.

[Figure 1: Survival modelling to extrapolate beyond the trial—mean survival restricted to the trial period, and extrapolated]
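To make the distinction concrete, here is a minimal Python sketch of the two quantities in figure 1: the restricted mean (the area under a Kaplan-Meier curve up to the end of follow-up) and the extrapolated mean (the area under a fitted parametric curve over a lifetime horizon). It uses the open source lifelines library and simulated data; the sample size, censoring time, and 40 year horizon are illustrative assumptions, not values from the article.

```python
# A minimal sketch, not the authors' analysis: restricted vs extrapolated
# mean survival on simulated data (requires numpy, scipy, lifelines).
import numpy as np
from scipy.integrate import trapezoid
from lifelines import KaplanMeierFitter, WeibullFitter

rng = np.random.default_rng(42)

# One arm of a hypothetical trial: Weibull event times, censored at 3 years.
n, trial_end = 300, 3.0                       # illustrative values
event_times = 4.0 * rng.weibull(1.3, size=n)  # true (unknown) survival process
t = np.minimum(event_times, trial_end)        # observed follow-up
died = event_times <= trial_end               # True = death observed in trial

# Restricted mean: area under the Kaplan-Meier curve up to the trial's end.
km = KaplanMeierFitter().fit(t, died)
grid = np.linspace(0.0, trial_end, 500)
rmst = trapezoid(km.survival_function_at_times(grid).values, grid)

# Extrapolated mean: area under a fitted Weibull curve over a lifetime
# horizon (40 years here, standing in for "until everyone has died").
wb = WeibullFitter().fit(t, died)
horizon = np.linspace(0.0, 40.0, 4000)
lifetime_mean = trapezoid(wb.survival_function_at_times(horizon).values, horizon)

print(f"restricted mean to year {trial_end:.0f}: {rmst:.2f} years")
print(f"extrapolated lifetime mean: {lifetime_mean:.2f} years")
```

The difference between the two printed numbers corresponds to the extra area gained by extrapolation, which is the quantity economic evaluations need.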

Description of survival models

Survival models extrapolate beyond the trial. They typically have a parametric specification, which means that they rely on an assumed distribution of probabilities of, for example, death over time, which is defined by a set of parameters such as shape and scale. The chosen parametric model is fitted to the observed trial survival data, and values estimated for each parameter. The model is then used to generate survival probabilities beyond the trial period to predict what would have happened had the trial continued until everyone died.
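The fitting step described here is maximum likelihood estimation on censored data: patients who died contribute the log density at their death time, while censored patients contribute the log of the survival probability at their last follow-up. A hand-rolled sketch for a two parameter Weibull model (toy data; real analyses would use a packaged fitter):

```python
# Sketch of censored maximum likelihood for a Weibull survival model:
# deaths contribute log f(t); censored follow-up contributes log S(t).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, t, died):
    log_shape, log_scale = params            # log scale keeps both positive
    k, lam = np.exp(log_shape), np.exp(log_scale)
    z = (t / lam) ** k                       # Weibull: S(t) = exp(-z)
    log_f = np.log(k / lam) + (k - 1.0) * np.log(t / lam) - z
    log_s = -z
    return -np.sum(np.where(died, log_f, log_s))

# Toy data: follow-up in years; died=False means censored at that time.
t = np.array([0.4, 1.1, 2.5, 3.0, 3.0, 0.9, 1.8, 3.0])
died = np.array([1, 1, 1, 0, 0, 1, 1, 0], dtype=bool)

fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(t, died))
k_hat, lam_hat = np.exp(fit.x)
print(f"estimated shape={k_hat:.2f}, scale={lam_hat:.2f}")

# The fitted model then generates survival probabilities beyond the trial.
for year in (3, 5, 10):
    print(f"S({year}) = {np.exp(-(year / lam_hat) ** k_hat):.3f}")
```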

In health technology assessments, the standard set of models typically includes exponential, Weibull, Gompertz, log-logistic, log-normal, and generalised gamma models. 3 Each survival model involves different assumptions about the shape of the hazard function—that is, the risk over time of the event of interest, which is usually death. Figure 2 shows the hazard function shapes assumed when using standard parametric models; over time these can stay the same, increase, decrease, or have one turning point (that is, the hazard increases then decreases, or decreases then increases).

[Figure 2: Survival modelling to extrapolate beyond the trial—hazard shapes associated with standard parametric survival models]
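These shapes can be reproduced numerically, because for any distribution the hazard is the density divided by the survival function, h(t) = f(t)/S(t). A sketch using scipy's distributions, with parameter values chosen only to display each characteristic shape:

```python
# Hazard shapes of the standard parametric models: h(t) = f(t) / S(t).
# Parameter values are arbitrary, chosen only to show each shape.
import numpy as np
from scipy import stats

models = {
    "exponential (constant)":           stats.expon(scale=2.0),
    "Weibull (monotone)":               stats.weibull_min(c=1.5, scale=2.0),
    "Gompertz (monotone)":              stats.gompertz(c=0.5, scale=2.0),
    "log-logistic (one turning point)": stats.fisk(c=2.0, scale=2.0),
    "log-normal (one turning point)":   stats.lognorm(s=0.8, scale=2.0),
    "generalised gamma (flexible)":     stats.gengamma(a=2.0, c=1.5, scale=1.0),
}

pts = np.array([0.25, 1.0, 4.0])   # early, middle, late follow-up times
for name, dist in models.items():
    h = dist.pdf(pts) / dist.sf(pts)
    summary = "  ".join(f"h({p:.2f})={v:.2f}" for p, v in zip(pts, h))
    print(f"{name:34s} {summary}")
```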

Selecting a model

Extrapolating survival curves predicts the unknown. No one can know which models most accurately predict survival—although it might be possible to determine which models produce extrapolations that are plausible. Different models often result in substantially different estimates of survival and cost effectiveness. 4 Figure 3 shows a range of survival models fitted to the same data. Although all the parametric models seem to fit the observed trial data well, they predict large differences in longer term and mean survival. The more immature the trial data, the more the long term predictions are likely to differ. Model choice affects estimated treatment benefits and, consequently, cost effectiveness.

[Figure 3: Survival modelling to extrapolate beyond the trial—a variety of standard parametric models fitted to the same data]
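This divergence is easy to reproduce: fit several of the standard models to the same censored sample and compare in-sample fit with the extrapolated mean. A sketch using lifelines (simulated data; a Gompertz fitter is omitted because lifelines does not provide one):

```python
# Fit several standard parametric models to the same censored sample and
# compare in-sample fit (AIC) with the extrapolated lifetime mean.
import numpy as np
from scipy.integrate import trapezoid
from lifelines import (ExponentialFitter, WeibullFitter, LogNormalFitter,
                       LogLogisticFitter, GeneralizedGammaFitter)

rng = np.random.default_rng(7)
n, trial_end = 200, 3.0
true_times = rng.lognormal(mean=1.0, sigma=0.9, size=n)
t = np.minimum(true_times, trial_end)
died = true_times <= trial_end

horizon = np.linspace(0.0, 40.0, 4000)   # illustrative lifetime grid
for Fitter in (ExponentialFitter, WeibullFitter, LogNormalFitter,
               LogLogisticFitter, GeneralizedGammaFitter):
    m = Fitter().fit(t, died)
    mean_surv = trapezoid(m.survival_function_at_times(horizon).values, horizon)
    print(f"{Fitter.__name__:24s} AIC={m.AIC_:7.1f}  lifetime mean={mean_surv:5.2f} y")
```

Models with similar AIC values, that is, near-indistinguishable fits to the observed data, can still imply lifetime means that differ by years.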

To choose clinically plausible survival models, modellers must assess fit to the trial data, but also, crucially, assess the credibility of the extrapolations. 4 5 This approach involves considering external data sources with longer term data such as other trials, disease registries, and general population mortality rates. Biological plausibility, pharmacological mechanisms, and clinical opinion should also be considered. Although identifying a single best model might not be possible, this approach ensures that policy makers use credible models.
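One such credibility check compares the model's extrapolated hazard with general population mortality: for a disease population, an extrapolated hazard that falls below the all-cause mortality of age-matched members of the general population is implausible. A sketch of the check (both the fitted parameters and the mortality rates are invented placeholders, not life-table values):

```python
# Credibility check: the extrapolated hazard for a disease population should
# not fall below general population mortality. All numbers are placeholders.
import numpy as np

def weibull_hazard(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1.0)

shape, scale = 0.5, 2.0                  # a decreasing fitted hazard (invented)
for year in (5, 10, 15, 20):
    h_model = weibull_hazard(year, shape, scale)
    h_population = 0.02 * 1.08 ** year   # made-up age-matched mortality rate
    verdict = ("IMPLAUSIBLE: below population mortality"
               if h_model < h_population else "ok")
    print(f"year {year:2d}: model={h_model:.3f}  "
          f"population={h_population:.3f}  {verdict}")
```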

Limitations of standard survival models

Standard parametric survival models have limitations. They might rely on hazard functions with implausible shapes ( figure 2 ), and might neither fit the data well nor provide credible extrapolations. As illustrated in figure 3 , the implications of choosing the wrong survival model are serious, because the choice of model affects survival predictions. Figure 4 illustrates a hypothetical hazard function of death from a cancer. No standard parametric models could capture the shape of this function, although more complex survival models can, such as flexible parametric models, fractional polynomials, piecewise models, or mixture cure models.

[Figure 4: Survival modelling to extrapolate beyond the trial—a hypothesised, realistic hazard function]

Flexible parametric models (such as restricted cubic spline models) segment the survival curve into portions, using knots to model hazard functions that have many turning points. 6 However, flexible parametric models will not generate turning points beyond the period of observed trial data unless modellers incorporate external information, such as longer term hazard rates from registry data, which they rarely do. Indeed, while flexible parametric models are likely to fit the data well, beyond the data they reduce to standard Weibull, log-normal, or log-logistic models (therefore assuming that a transformation of the survival function is a linear function of log-time), and might generate implausible extrapolations. In figure 4, if the trial were short and ended in the period where the hazard function is rising, a flexible parametric model would extrapolate that rising hazard, based on the observed segment of data.
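The "linear in log time beyond the data" behaviour can be demonstrated directly, because a restricted cubic spline is, by construction, linear outside its boundary knots. The sketch below fits such a spline to the log cumulative hazard against log time by least squares (a simplification of the likelihood-based Royston-Parmar approach, on invented data) and shows that the extrapolated values advance by equal steps per doubling of time, exactly as a straight line in log time would:

```python
# A restricted cubic spline is linear beyond its boundary knots, so a spline
# on log H(t) vs log t extrapolates as a straight line (Weibull-like).
# Least-squares fit to a toy cumulative hazard; a simplification of the
# likelihood-based Royston-Parmar approach.
import numpy as np

def rcs_basis(x, knots):
    """Basis [1, x, v_1..v_{K-2}]; each v_j is cubic inside, linear outside."""
    k = np.asarray(knots, dtype=float)

    def pos(u):
        return np.clip(u, 0.0, None) ** 3

    cols = [np.ones_like(x), x]
    for j in range(len(k) - 2):
        lam = (k[-1] - k[j]) / (k[-1] - k[-2])
        cols.append(pos(x - k[j]) - lam * pos(x - k[-2])
                    + (lam - 1.0) * pos(x - k[-1]))
    return np.column_stack(cols)

# A wiggly "observed" log cumulative hazard over a 3 year trial (invented).
log_t = np.log(np.linspace(0.1, 3.0, 60))
log_H = -1.0 + 1.2 * log_t + 0.15 * np.sin(3.0 * log_t)

knots = np.quantile(log_t, [0.05, 0.50, 0.95])
beta, *_ = np.linalg.lstsq(rcs_basis(log_t, knots), log_H, rcond=None)

# Beyond the data the fit is linear in log time: each doubling of time
# (5 -> 10 -> 20 years) adds the same increment to the predicted log H.
extrapolated = rcs_basis(np.log(np.array([5.0, 10.0, 20.0])), knots) @ beta
print("log H at 5, 10, 20 years:", np.round(extrapolated, 3))
print("increments per doubling: ", np.round(np.diff(extrapolated), 3))
```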

An alternative option is to use fractional polynomials to model a hazard function with a complex shape, placing no restrictions on the hazard and survival functions beyond the period of observed data. However, while these models might fit the observed data well, the lack of restrictions on the extrapolation can lead to implausible predictions. 7 Other options include piecewise models, where separate survival models are fitted to defined portions of the observed survival data using cut-off points. The extrapolation is based on the model fitted to the final observed period. Piecewise models can be sensitive to the choice of cut-off points, and their extrapolations rest on the last portion of data, where the numbers of trial participants still at risk, and of deaths among them, are often low. 8 Generalised additive models and dynamic survival models have recently been suggested as potentially valuable novel approaches for modelling and extrapolating survival data. 7
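A piecewise exponential model makes the cut point sensitivity easy to see: within each interval the hazard estimate is simply the number of deaths divided by the person-time at risk, and the extrapolation carries the final interval's hazard forward. A minimal sketch (the cut points and simulated data are illustrative):

```python
# Piecewise exponential model: within each interval the hazard estimate is
# deaths / person-time; the final interval's hazard drives the extrapolation.
import numpy as np

def piecewise_hazards(t, died, cuts):
    edges = np.concatenate([[0.0], cuts, [np.inf]])
    hazards = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        person_time = np.sum(np.clip(t, lo, hi) - lo)   # time spent in [lo, hi)
        deaths = np.sum(died & (t >= lo) & (t < hi))
        hazards.append(deaths / person_time)
    return np.array(hazards)

rng = np.random.default_rng(3)
true_times = 3.5 * rng.weibull(1.4, size=250)
t = np.minimum(true_times, 3.0)
died = true_times <= 3.0

# The extrapolated (final-interval) hazard shifts with the cut points chosen.
for cuts in ([1.0, 2.0], [0.5, 2.5]):
    h = piecewise_hazards(t, died, np.array(cuts))
    print(f"cuts={cuts}: interval hazards={np.round(h, 3)}, "
          f"extrapolated hazard={h[-1]:.3f}")
```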

Mixture cure models can capture complex hazard functions because they predict survival separately for cured and uncured patients, 9 and estimate a cure fraction—that is, the proportion of patients who would be cured. Predicting survival for cured and uncured patients separately could result in a model that generates credible extrapolations. However, estimating the cure fraction reliably from short term data is difficult, and perhaps impossible. When the cure fraction is estimated inaccurately, cure models can result in poor survival predictions.
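The structure of a mixture cure model fits in a few lines: overall survival is S(t) = pi + (1 - pi) * S_u(t), where pi is the cure fraction and S_u(t) the survival function of uncured patients. The sketch below (all parameters invented) shows how strongly the extrapolated mean depends on pi, here varied across a 35 percentage point spread like the one noted in the tisagenlecleucel appraisal discussed later:

```python
# Mixture cure model: S(t) = pi + (1 - pi) * S_uncured(t).
# Cured patients are assumed alive over the whole horizon here; a real
# model would apply general population mortality to them instead.
import numpy as np
from scipy.integrate import trapezoid

def mixture_cure_survival(t, pi, shape, scale):
    s_uncured = np.exp(-(t / scale) ** shape)   # Weibull survival, uncured
    return pi + (1.0 - pi) * s_uncured

t = np.linspace(0.0, 40.0, 4000)                # 40 year horizon, illustrative
for pi in (0.10, 0.25, 0.45):                   # cure fractions 35 points apart
    mean_surv = trapezoid(mixture_cure_survival(t, pi, shape=1.2, scale=2.0), t)
    print(f"cure fraction {pi:.0%}: extrapolated mean = {mean_surv:5.1f} years")
```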

Extrapolation in practice

Decision makers, such as those on committees of the National Institute for Health and Care Excellence (NICE), discuss, document, and assess the approaches that pharmaceutical companies use to predict long term survival. Often the approach has a large impact on cost effectiveness estimates ( box 1 ). Typically, NICE reviews appraisals three years after the initial recommendation, and some drugs are placed in the Cancer Drugs Fund, providing an opportunity for checking extrapolations once longer term data are available, often from the key trial. However, while drugs in the Cancer Drugs Fund undergo rigorous reappraisal, other reviews are rarely done comprehensively, leaving extrapolations unchecked.

Impact of survival modelling in technology appraisals by the National Institute for Health and Care Excellence (NICE)

When NICE appraised pembrolizumab for untreated, advanced oesophageal and gastro-oesophageal junction cancer, the appraisal committee identified four approaches to survival modelling that it considered to be credible. 10 These approaches were a log-logistic piecewise model, a log-logistic piecewise model incorporating an assumed waning of the treatment effect over time, a log-logistic model not fitted using a piecewise approach, and a generalised gamma piecewise model. The incremental gains in quality adjusted life years (QALYs) associated with pembrolizumab ranged from 0.50 to 1.07 QALYs per person over a lifetime, with the estimated cost per incremental QALY doubling between the most and least optimistic analysis. 11

When NICE appraised tisagenlecleucel (a chimeric antigen receptor T cell treatment) for relapsed or refractory, diffuse, large B cell, acute lymphoblastic leukaemia, the committee acknowledged that survival was a key uncertainty, considered cure possible, and discussed several mixture cure models. Cure fractions varied by 35 percentage points depending on the model, with cost effectiveness estimates that varied from potentially acceptable to unacceptable. 12 The committee accepted using a mixture cure model based on clinical experts suggesting that some patients could be cured. However, the committee preferred a model that estimated a lower cure fraction than that estimated by the manufacturer’s preferred model, because the manufacturer’s model predicted a cure fraction that was higher than the proportion of patients who remained event-free in the tisagenlecleucel trials. Tisagenlecleucel was recommended for use in the Cancer Drugs Fund to allow the trial to accrue more data on overall survival before making a final decision on its routine use in the NHS. 12

Conclusions

When treatments make people live longer, it is important to extrapolate beyond the end of clinical trials to estimate mean survival gains and cost effectiveness over a period longer than the trial. Several survival models are available, and these result in widely varying estimates. To choose a model, researchers should consider a model’s fit to the observed trial survival data, and the credibility of predictions beyond the trial. More complex models could, but do not necessarily, result in better extrapolations. To inform decision making, survival models must be scrutinised while considering a range of plausible models and their impact on cost effectiveness. Analysts should follow recommended processes, report analyses clearly, justify chosen models by describing why and how the models have been selected, detail how well models fit the observed data, and describe what the models predict about hazards and survival. 4 8 This approach provides decision makers with the reassurance needed to make decisions when allocating healthcare resources.

Twitter: @NRLatimer, @DrAmandaAdler

Contributors: NRL had the idea for this article, discussed it with AIA, and wrote the initial draft. AIA reviewed the draft and made substantial revisions and additions. NRL and AIA both approved the final version. NRL is the guarantor of this work.

Funding: NRL is supported by Yorkshire Cancer Research (award reference No S406NL). AIA is funded by the Biomedical Research Centre-OCDEM.

Competing interests: We have read and understood the BMJ policy on declaration of interests and declare no relevant interests. NRL is a health economist who specialises in survival analysis; he is a member of the National Institute for Health and Care Excellence's (NICE) Decision Support Unit and of NICE Appraisal Committee B, and regularly uses the methods described in this paper in his own research, and reviews analyses submitted to NICE as part of the technology appraisal process. NRL also teaches survival analysis and occasionally provides short training courses for pharmaceutical companies and consultancy companies. AIA is a physician and directs the Diabetes Trials Unit at the University of Oxford. She chaired NICE Technology Appraisal Committee B, and chairs NICE’s committee to address models for evaluating and purchasing of antimicrobials in the setting of antimicrobial resistance.

Patient and public involvement: Patients and the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review: Commissioned; internally peer reviewed.
