What Is Generalizability In Research?
Generalizability is about making sure the conclusions and recommendations from your research apply to more than just the sample you studied. Think of it as a way to figure out whether your findings hold for a larger group, not just the small group you observed.
In this guide, we explore research generalizability, factors that influence it, how to assess it, and the challenges that come with it.
So, let’s dive into the world of generalizability in research!
Defining Generalizability
Generalizability refers to the extent to which a study’s findings can be extrapolated to a larger population. It’s about making sure that your findings apply to a large number of people, rather than just a small group.
Generalizability ensures research findings are credible and reliable. If your results are only true for a small group, they might not be valid.
Also, generalizability ensures your work is relevant to as many people as possible. For example, if a drug has only been tested on a small number of patients, prescribing it to all patients could put them at risk until you are confident it is safe for the wider population.
Factors Influencing Generalizability
Here are some of the factors that determine whether your research can be generalized to a larger population or to different contexts:
1. Sample Selection and Size
The size of the group you study and how you choose its members affect how well your results can be applied to others. Think of it like asking one person out of a friendship group of 16 whether a game is fun: that single answer doesn’t accurately represent the opinion of the whole group.
2. Research Methods and Design
Different methods have different levels of generalizability. For example, if you only observe people in a particular city, your findings may not apply to other locations. But if you use multiple methods, you get a better idea of the big picture.
3. Population Characteristics
Not everyone is the same. People from different countries, different age groups, or different cultures may respond differently. That’s why the characteristics of the people you’re looking at have a significant impact on the generalizability of the results.
4. Context and Environment
Think of your research as a weather forecast. A forecast of sunny weather in one location may not be accurate in another. Context and environment play a role in how well your results translate to other environments or contexts.
Internal vs. External Validity
You can only generalize a study when it has high validity, but there are two types of validity: internal and external. Let’s see the role each plays in generalizability:
1. Understanding Internal Validity
Internal validity is a measure of how well a study has ruled out alternative explanations for its findings. For example, if a study investigates the effects of a new drug on blood pressure, internal validity would be high if the study was designed to rule out other factors that could affect blood pressure, such as exercise, diet, and other medications.
2. Understanding External Validity
External validity is the extent to which a study’s findings can be generalized to other populations, settings, and times. It focuses on how well your study’s results apply to the real world.
For example, if a new blood pressure-lowering drug were to be studied in a laboratory with a sample of young healthy adults, the study’s external validity would be limited. This is because the study doesn’t consider people outside the population such as older adults, patients with other medical conditions, and more.
3. The Relationship Between Internal and External Validity
Internal validity and external validity are often inversely related. This means that studies with high internal validity may have lower external validity, and vice versa.
For example, a study that randomly assigns participants to different treatment groups may have high internal validity, but it may have lower external validity if the participants are not representative of the population of interest.
Strategies for Enhancing Generalizability
Several strategies can enhance the generalizability of your findings. Here are some of them:
1. Random Sampling Techniques
This involves selecting participants from a population in a way that gives everyone an equal chance of being selected. This helps to ensure that the sample is representative of the population.
Let’s say you want to find out how people feel about a new policy. Randomly picking people from the list of registered voters helps ensure your sample is representative of the population.
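As a quick illustration, here is a minimal Python sketch of simple random sampling from a made-up voter roll; the list, the seed, and the sample size are all invented for the example:

```python
import random

# Hypothetical sampling frame: one entry per registered voter.
voter_roll = [f"voter_{i:06d}" for i in range(100_000)]

random.seed(42)                               # fixed seed only so the example is reproducible
sample = random.sample(voter_roll, k=1_000)   # each voter has an equal chance of selection

print(len(sample), sample[:3])
```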
2. Diverse Sample Selection
Choose samples that are representative of different age groups, genders, races, ethnicities, and economic backgrounds. This helps to ensure that the findings are generalizable to a wider range of people.
3. Careful Research Design
Meticulously design your studies to minimize the risk of bias and confounding variables. A confounding variable is a factor that makes it hard to tell the real cause of your results.
For example, suppose you are studying the effect of a new drug on cholesterol levels. Even if you draw a random sample of participants and randomly assign them to receive either the new drug or a placebo, failing to account for each participant’s diet could make your results misleading: you might attribute a change in cholesterol to the drug when it is actually due to diet.
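To make the confounding problem concrete, here is a small toy simulation in Python (all numbers are invented): cholesterol is driven by diet alone, yet because low-fat dieters happen to be concentrated in the drug arm, the naive comparison looks like a drug effect until you compare within diet groups:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 40                                     # a small trial: 20 participants per arm
drug = np.repeat([0, 1], n // 2)           # 0 = placebo, 1 = new drug
# Build in an unlucky imbalance: low-fat dieters are more common in the drug arm.
low_fat_diet = rng.binomial(1, np.where(drug == 1, 0.7, 0.3))

# Data-generating process: diet lowers cholesterol, the drug itself does nothing.
cholesterol = 230 - 25 * low_fat_diet + rng.normal(0, 10, n)

naive = cholesterol[drug == 1].mean() - cholesterol[drug == 0].mean()
print(f"Naive drug-minus-placebo difference: {naive:.1f} mg/dL")   # looks like a drug effect

# Comparing within diet strata removes the distortion: the apparent effect shrinks.
for d in (0, 1):
    in_stratum = low_fat_diet == d
    diff = cholesterol[(drug == 1) & in_stratum].mean() - cholesterol[(drug == 0) & in_stratum].mean()
    print(f"Within diet stratum {d}: {diff:.1f} mg/dL")
```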
4. Robust Data Collection Methods
Use robust data collection methods to minimize the risk of errors and biases. This includes using well-validated measures and carefully training data collectors.
For instance, an online survey tool could be used to poll how voters change their minds during an election cycle, rather than relying on phone interviews, which make it harder to reach the same voters repeatedly and track their views over time.
Challenges to Generalizability
1. Sample Bias
Sample bias happens when the group you study doesn’t represent everyone you want to draw conclusions about. For example, if you’re researching ice cream preferences and only ask your friends, your results might not apply to everyone, because your friends aren’t representative of all ice cream eaters.
2. Ethical Considerations
Ethical considerations can limit your research’s generalizability because some studies simply wouldn’t be right or fair to conduct. For example, it’s not ethical to test a new medicine on people without their consent, which restricts who you can include in a study.
3. Resource Constraints
A limited budget also restricts your research’s generalizability. For example, if you want to conduct a large-scale study but don’t have the resources, time, or personnel, you may opt for a small-scale study instead, which makes your findings less likely to apply to a larger population.
4. Limitations of Research Methods
Tools are just as much a part of your research as the research itself. If you use an ineffective tool, you might not be able to apply what you’ve learned to other situations.
Assessing Generalizability
Evaluating generalizability allows you to understand the implications of your findings and make realistic recommendations. Here are some of the most effective ways to assess generalizability:
Statistical Measures and Techniques
Several statistical tools and methods allow you to assess the generalizability of your study. Here are the top two:
- Confidence Interval
A confidence interval is a range of values that is likely to contain the true population value. So if a researcher finds that the mean test score is 78 with a 95% confidence interval of 70-80, they can be 95% confident that the true population mean lies between 70 and 80.
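As a rough sketch, here is one common way to compute such an interval in Python, using simulated scores and a t-based margin of error rather than real data:

```python
import numpy as np
from scipy import stats

# Simulated test scores for a random sample of 50 students (not real data).
rng = np.random.default_rng(1)
scores = rng.normal(loc=78, scale=12, size=50)

mean = scores.mean()
sem = stats.sem(scores)                                   # standard error of the mean
margin = stats.t.ppf(0.975, df=len(scores) - 1) * sem     # t-based 95% margin of error

print(f"mean = {mean:.1f}, 95% CI = ({mean - margin:.1f}, {mean + margin:.1f})")
```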
- P-value
The p-value indicates the likelihood of obtaining the study’s results, or more extreme results, if the null hypothesis were true. A null hypothesis is the supposition that there is no association between the variables being analyzed.
A good example is a researcher surveying 1,000 college students to study the relationship between study habits and GPA. The researcher finds that students who study for more hours per week have higher GPAs.
A p-value below 0.05 indicates that there is a statistically significant association between study habits and GPA, meaning the findings are unlikely to be due to chance alone.
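Here is a hedged sketch of that kind of analysis in Python, using simulated stand-in data rather than the survey described above; scipy's pearsonr returns both the correlation and its p-value:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the survey: weekly study hours and GPA for 1,000 students,
# generated so that more study time is weakly associated with a higher GPA.
rng = np.random.default_rng(7)
hours = rng.uniform(0, 30, size=1_000)
gpa = np.clip(2.0 + 0.03 * hours + rng.normal(0, 0.5, size=1_000), 0, 4)

r, p_value = stats.pearsonr(hours, gpa)
print(f"r = {r:.2f}, p = {p_value:.3g}")
print("Statistically significant at 0.05" if p_value < 0.05 else "Not significant at 0.05")
```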
Peer Review and Expert Evaluation
Reviewers and experts can look at sample selection, study design, data collection, and analysis methods to spot areas for improvement. They can also look at the survey’s results to see if they’re reliable and if they match up with other studies.
Transparency in Reporting
Clearly and concisely report the survey design, sample selection, data collection methods, data analysis methods, and findings of the survey. This allows other researchers to assess the quality of the survey and to determine whether the results are generalizable.
The Balance Between Generalizability and Specificity
Generalizability refers to the degree to which the findings of a study can be applied to a larger population or context. Specificity, on the other hand, refers to the focus of a study on a particular population or context.
a. When Generalizability Matters Most
Generalizability comes into play when you want to make predictions about the world outside of your sample. For example, you want to look at the impact of a new viewing restrictions policy on the population as a whole.
b. Situations Where Specificity is Preferred
Specificity is important when researchers want to gain a deep understanding of a particular group or phenomenon. For example, a researcher may want to study the experiences of people with a rare disease in depth.
Finding the Right Balance Between Generalizability and Specificity
The right balance between generalizability and specificity depends on the research question.
Case 1- Specificity over Generalizability
Sometimes, you have to give up some generalizability to get more specific results. For example, if you are studying a rare genetic condition, you might not be able to get a sample that’s representative of the general population.
Case 2- Generalizability over Specificity
In other cases, you may need to sacrifice some specificity to achieve greater generalizability. For example, when studying the effects of a new drug, you need a sample that includes a wide range of people with different characteristics.
Keep in mind that generalizability and specificity are not mutually exclusive. You can design studies that are both generalizable and specific.
Real-World Examples
Here are a few real-world examples of studies that turned out to be generalizable, as well as some that are not:
1. Case Studies of Research with High Generalizability
We’ve been talking about how important a generalizable study is and how to tell if your research is generalizable. Let’s take a look at some studies that have achieved this:
a. The Framingham Heart Study
This is a long-running study that has been tracking the health of over 15,000 participants since 1948. The study has provided valuable insights into the risk factors for heart disease, stroke, and other chronic diseases.
The findings of the Framingham Heart Study are highly generalizable because the study participants were recruited from a representative sample of the general population.
b. The Cochrane Database of Systematic Reviews
This is a collection of systematic reviews that evaluate the evidence for the effectiveness of different healthcare interventions. The Cochrane Database of Systematic Reviews is a highly respected source of information for healthcare professionals and policymakers.
The findings of Cochrane reviews are highly generalizable because they are based on a comprehensive review of all available evidence.
2. Case Studies of Research with Limited Generalizability
Let’s look at some studies that would fail to prove their validity to the general population:
- A study that examines the effects of a new drug on a small sample of participants with a rare medical condition. The findings of this study would not be generalizable to the general population because the study participants were not representative of the general population.
- A study that investigates the relationship between culture and values using a sample of participants from a single country. The findings of this study would not be generalizable to other countries because the study participants were not representative of people from other cultures.
Implications of Generalizability in Different Fields
Research generalizability has significant effects in the real world. Here are some ways to leverage it across different fields:
1. Medicine and Healthcare
Generalizability is a key concept of medicine and healthcare. For example, a single study that found a new drug to be effective in treating a specific condition in a limited number of patients might not apply to all patients.
Healthcare professionals also leverage generalizability to create guidelines for clinical practice. For example, a guideline for the treatment of diabetes may not be generalizable to all patients with diabetes if it is based on research studies that only included patients with a particular type of diabetes or a particular level of severity.
2. Social Sciences
Generalizability allows you to make accurate inferences about the behavior and attitudes of large populations. However, this is challenging because people are influenced by multiple factors, including their culture, personality, and social environment.
For example, a study that finds that a particular educational intervention is effective in improving student achievement in one school may not be generalizable to all schools.
3. Business and Economics
Generalizability also allows companies to draw conclusions about how customers and competitors behave. Factors like economic conditions, consumer tastes, and tech trends can change quickly, so it’s hard to generalize results from one study to the next.
For example, a study that finds that a new marketing campaign is effective in increasing sales of a product in one region may not be generalizable to other regions.
The Future of Generalizability in Research
Let’s take a look at new and future developments geared at improving the generalizability of research:
1. Evolving Research Methods and Technologies
The evolution of research methods and technologies is changing the way that we think about generalizability. In the past, researchers were often limited to studying small samples of people in specific settings. This made it difficult to generalize the findings to the larger population.
Today, you can use various new techniques and technologies to gather data from a larger and more varied sample size. For example, online surveys provide you with a large sample size in a very short period.
2. The Growing Emphasis on Reproducibility
The growing emphasis on reproducibility is also changing the way that we think about generalizability. Reproducibility is the ability to reproduce the results of a study by following the same methods and using a similar sample.
For example, suppose you publish a study claiming that a new drug is effective in treating a certain disease, and two other researchers then replicate the study and confirm the findings. This replication helps to build confidence in the findings of the original study and makes it more likely that the drug will be approved for use.
3. The Ongoing Debate on Generalizability vs. Precision
Generalizability refers to the ability to apply the findings of a study to a wider population. Precision refers to the ability to accurately measure a particular phenomenon.
For some researchers, generalizability matters more than precision because it means their findings apply to a larger number of people and have an impact on the real world. For others, precision matters more than generalizability because it enables you to understand the underlying mechanisms of a phenomenon.
The debate over generalizability versus precision is likely to continue because both concepts are very important. However, it is important to note that the two concepts are not mutually exclusive. It is possible to achieve both generalizability and precision in research by using carefully designed methods and technologies.
Generalizability allows you to apply the findings of a study to a larger population. This is important for making informed decisions about policy and practice, identifying and addressing important social problems, and advancing scientific knowledge.
With more advanced tools such as online surveys, generalizability research is here to stay. Sign up with Formplus to seamlessly collect data from a global audience.
Extrapolation
Extrapolation is the process of estimating unknown values by extending or projecting from known data points. This technique is crucial in understanding how results observed in a specific sample or experimental setting might apply to a broader population or different contexts, which relates closely to issues of external validity and the generalizability of findings.
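A minimal illustration with invented numbers: a straight line fitted to a limited observed range drifts further and further from the truth when the real relationship flattens out beyond that range.

```python
import numpy as np

# The "true" relationship flattens out, but we only observe x between 0 and 5.
x_obs = np.linspace(0, 5, 20)
y_obs = 10 * (1 - np.exp(-0.3 * x_obs))

slope, intercept = np.polyfit(x_obs, y_obs, deg=1)   # straight line fitted to the observed range

for x_new in (5, 10, 20):                            # project further and further beyond the data
    y_line = slope * x_new + intercept
    y_true = 10 * (1 - np.exp(-0.3 * x_new))
    print(f"x={x_new:>2}: linear extrapolation={y_line:5.1f}  true value={y_true:5.1f}")
```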
Five Must-Know Facts
- Extrapolation can introduce errors if the relationship between variables changes outside the observed range of data, potentially leading to misleading conclusions.
- In machine learning for causal inference, extrapolation is often necessary when applying learned models to new datasets, but caution must be exercised to avoid overestimating the model's applicability.
- External validity is fundamentally linked to extrapolation, as it assesses whether study results are applicable to settings or populations beyond those studied.
- Understanding the limits of extrapolation is critical; for instance, applying results from a controlled environment directly to real-world situations can yield inaccurate predictions.
- The validity of extrapolated conclusions heavily depends on the robustness of the underlying causal assumptions made during analysis.
Review Questions
- Extrapolation can significantly impact the reliability of findings because it involves making predictions about unobserved data based on known values. If the underlying relationships remain stable across contexts, then extrapolated conclusions may hold true. However, if those relationships change or do not apply outside the studied sample, it can lead to erroneous interpretations and flawed decision-making. Thus, careful consideration of the context and assumptions is vital when relying on extrapolated results.
- Extrapolating machine learning models poses several challenges, including overfitting and potential changes in underlying data distributions. When a model is overfit to training data, it may not perform well when applied to new datasets due to its lack of generalization. Strategies such as cross-validation, regularization techniques, and ensuring diverse training datasets can help improve model robustness and accuracy. Additionally, conducting sensitivity analyses can assess how variations in input affect output predictions, helping validate extrapolations.
- External validity is inherently linked to extrapolation as it assesses whether research findings can be applied beyond the specific conditions of a study. If researchers fail to establish strong external validity, their ability to extrapolate results confidently to broader populations or different contexts becomes compromised. This impacts generalizability since findings that cannot be reliably extrapolated may misrepresent real-world scenarios or lead to ineffective interventions. Therefore, establishing external validity through careful study design and consideration of contextual factors is crucial for valid extrapolation.
Related terms
Generalization : The process of applying findings from a study sample to a larger population, which relies on the assumption that the sample accurately represents the population.
Overfitting : A modeling error that occurs when a machine learning model learns the details and noise in the training data to the extent that it negatively impacts its performance on new data.
Transferability : The extent to which findings from one context can be applied to another, often assessed in qualitative research settings.
" Extrapolation " also found in:
Subjects ( 34 ).
- AP Statistics
- Advanced quantitative methods
- Algebra and Trigonometry
- Approximation Theory
- Blockchain and Cryptocurrency
- Business Analytics
- Business Valuation
- College Algebra
- College Introductory Statistics
- Computational Mathematics
- Contemporary Mathematics for Non-Math Majors
- Forecasting
- Honors Pre-Calculus
- Honors Statistics
- Intermediate Financial Accounting 2
- Intro to Business Statistics
- Introduction to Demographic Methods
- Introduction to Econometrics
- Introduction to Film Theory
- Mathematical Biology
- Mathematical Fluid Dynamics
- Numerical Analysis I
- Numerical Analysis for Data Science and Statistics
- Numerical Solution of Differential Equations
- Population and Society
- Preparatory Statistics
- Principles of Finance
- Programming for Mathematical Applications
- Screenwriting II
- Thermodynamics I
- Variational Analysis
© 2024 Fiveable Inc. All rights reserved.
Ap® and sat® are trademarks registered by the college board, which is not affiliated with, and does not endorse this website..
Study Design 101: Systematic Review
A document often written by a panel that provides a comprehensive review of all relevant studies on a particular clinical or health-related topic/question. The systematic review is created after reviewing and combining all the information from both published and unpublished studies (focusing on clinical trials of similar treatments) and then summarizing the findings.
Advantages
- Exhaustive review of the current literature and other sources (unpublished studies, ongoing research)
- Less costly to review prior studies than to create a new study
- Less time required than conducting a new study
- Results can be generalized and extrapolated into the general population more broadly than individual studies
- More reliable and accurate than individual studies
- Considered an evidence-based resource
Disadvantages
- Very time-consuming
- May not be easy to combine studies
Design pitfalls to look out for
Studies included in systematic reviews may be of varying study designs, but should collectively be studying the same outcome.
Is each study included in the review studying the same variables?
Some reviews may group and analyze studies by variables such as age and gender, factors that were not allocated to participants.
Do the analyses in the systematic review fit the variables being studied in the original studies?
Fictitious Example
Does the regular wearing of ultraviolet-blocking sunscreen prevent melanoma? An exhaustive literature search was conducted, resulting in 54 studies on sunscreen and melanoma. Each study was then evaluated to determine whether the study focused specifically on ultraviolet-blocking sunscreen and melanoma prevention; 30 of the 54 studies were retained. The thirty studies were reviewed and showed a strong positive relationship between daily wearing of sunscreen and a reduced diagnosis of melanoma.
Real-life Examples
Yang, J., Chen, J., Yang, M., Yu, S., Ying, L., Liu, G., ... Liang, F. (2018). Acupuncture for hypertension. The Cochrane Database of Systematic Reviews, 11 (11), CD008821. https://doi.org/10.1002/14651858.CD008821.pub2
This systematic review analyzed twenty-two randomized controlled trials to determine whether acupuncture is a safe and effective way to lower blood pressure in adults with primary hypertension. Due to the low quality of evidence in these studies and lack of blinding, it is not possible to link any short-term decrease in blood pressure to the use of acupuncture. Additional research is needed to determine if there is an effect due to acupuncture that lasts at least seven days.
Parker, H.W. and Vadiveloo, M.K. (2019). Diet quality of vegetarian diets compared with nonvegetarian diets: a systematic review. Nutrition Reviews, https://doi.org/10.1093/nutrit/nuy067
This systematic review was interested in comparing the diet quality of vegetarian and non-vegetarian diets. Twelve studies were included. Vegetarians more closely met recommendations for total fruit, whole grains, seafood and plant protein, and sodium intake. In nine of the twelve studies, vegetarians had higher overall diet quality compared to non-vegetarians. These findings may explain better health outcomes in vegetarians, but additional research is needed to remove any possible confounding variables.
Related Terms
Cochrane Database of Systematic Reviews
A highly-regarded database of systematic reviews prepared by The Cochrane Collaboration , an international group of individuals and institutions who review and analyze the published literature.
Exclusion Criteria
The set of conditions that characterize some individuals which result in being excluded in the study (i.e. other health conditions, taking specific medications, etc.). Since systematic reviews seek to include all relevant studies, exclusion criteria are not generally utilized in this situation.
Inclusion Criteria
The set of conditions that studies must meet to be included in the review (or for individual studies - the set of conditions that participants must meet to be included in the study; often comprises age, gender, disease type and status, etc.).
Now test yourself!
1. Systematic Reviews are similar to Meta-Analyses, except they do not include a statistical analysis quantitatively combining all the studies.
a) True b) False
2. The panels writing Systematic Reviews may include which of the following publication types in their review?
a) Published studies b) Unpublished studies c) Cohort studies d) Randomized Controlled Trials e) All of the above
Extrapolate findings
An evaluation usually involves some level of generalising of the findings to other times, places or groups of people.
For many evaluations, this simply involves generalising from data about the current situation or the recent past to the future.
For example, an evaluation might report that a practice or program has been working well (finding), therefore it is likely to work well in the future (generalisation), and therefore we should continue to do it (recommendation). In this case, it is important to understand whether or not future times are likely to be similar to the time period of the evaluation. If the program had been successful because of support from another organisation, and this support was not going to continue, then it would not be correct to assume that the program would continue to succeed in the future.
For some evaluations, there are other types of generalising needed. Impact evaluations which aim to learn from the evaluation of a pilot to make recommendations about scaling up must be clear about the situations and people to whom results can be generalised.
There are often two levels of generalisation. For example, an evaluation of a new nutrition program in Ghana collected data from a random sample of villages. This allowed statistical generalisation to the larger population of villages in Ghana. In addition, because there was international interest in the nutrition program, many organisations, including governments in other countries, were interested to learn from the evaluation for possible implementation elsewhere.
Analytical generalisation involves making projections about the likely transferability of findings from an evaluation, based on a theoretical analysis of the factors producing outcomes and the effect of context.
Statistical generalisation involves statistically calculating the likely parameters of a population using data from a random sample of that population.
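As a small sketch of statistical generalisation, with hypothetical figures standing in for the village survey, one could estimate the proportion of villages reached and attach a normal-approximation 95% confidence interval:

```python
import math

# Invented figures: 120 villages sampled at random; in 84 of them the programme
# reached at least half of the children under five.
n_villages = 120
n_reached = 84

p_hat = n_reached / n_villages
se = math.sqrt(p_hat * (1 - p_hat) / n_villages)      # normal-approximation standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"Estimated coverage: {p_hat:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```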
Horizontal evaluation is an approach that combines self-assessment by local participants and external review by peers.
Positive deviance (PD), a behavioural and social change approach, involves learning from those who find unique and successful solutions to problems despite facing the same challenges, constraints and resource deprivation as others.
Realist evaluation aims to identify the underlying generative causal mechanisms that explain how outcomes were caused and how context influences these.
This blog post and its associated replies, written by Jed Friedman for the World Bank, describe a process of using analytic methods to overcome some of the assumptions that must be made when extrapolating results from evaluations to other settings.
Extrapolation beyond the end of trials to estimate long term survival and cost effectiveness
Key messages
Extrapolation beyond time periods studied in clinical trials is usually necessary to estimate long term effects of treatments
Many statistical survival models can be used to extrapolate data, but these can have widely varying results, which affects estimated clinical effectiveness and cost effectiveness
The choice of survival model and credibility of the extrapolations should be inspected carefully when making policy decisions that inform the allocation of healthcare resources
This paper explains the importance of extrapolating beyond the end of trials to estimate the long term benefits associated with new treatments, why this is done, and the limitations of various approaches.
- Introduction
Policy makers worldwide use economic evaluation to inform decisions when allocating limited healthcare resources. A critical part of this evaluation involves accurately estimating long term effects of treatments. Yet, evidence is usually from clinical trials of short duration. Rarely do all participants encounter the clinical event of interest by the trial’s end. When people might benefit from a long term treatment, health technology assessment agencies recommend that economic evaluations extrapolate beyond the trial period to estimate lifetime benefits. 1 2 This kind of evaluation is common for people with cancer, when effective treatments delay disease progression and improve survival.
Use of survival modelling: rationale
To make funding decisions, health technology assessment agencies rely on accurate estimates of the benefits and costs of new treatments compared with existing treatments. For treatments that improve survival, accurate estimates of survival benefits are crucial. Policy makers use estimates of mean (average) survival rather than median survival, taking into account the probability of death over a lifetime across all patients with the disease. This mean is represented by the area under survival curves that plot the proportion of patients alive over time by treatment.
In figure 1 , the purple area represents a mean survival benefit associated with an experimental compared with a control treatment, but this benefit is a restricted mean, limited to the trial period. The curves separate early, and remain separated at the end of the trial, so it is reasonable to expect that benefits would continue to accrue beyond the trial’s end. The orange smooth curves represent survival models fitted to the trial data and extrapolated beyond the trial. The area between the orange curves estimates the mean lifetime survival benefit associated with the experimental treatment. This area is much larger than the purple area, and is relevant for economic evaluation.
Fig 1: Survival modelling to extrapolate beyond the trial—mean survival restricted to the trial period, and extrapolated
Description of survival models
Survival models extrapolate beyond the trial. They typically have a parametric specification, which means that they rely on an assumed distribution of probabilities of, for example, death over time, which is defined by a set of parameters such as shape and scale. The chosen parametric model is fitted to the observed trial survival data, and values estimated for each parameter. The model is then used to generate survival probabilities beyond the trial period to predict what would have happened had the trial continued until everyone died.
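As a rough illustration of this workflow, the sketch below uses the open-source lifelines Python package and simulated, administratively censored trial data (all parameters invented) to fit a Weibull model, extrapolate survival probabilities beyond the 36-month trial window, and approximate mean survival as the area under the extrapolated curve over an assumed 20-year horizon:

```python
import numpy as np
from lifelines import WeibullFitter

# Simulated trial arm (all parameters invented): follow-up is capped at 36 months,
# so patients still alive at that point are censored.
rng = np.random.default_rng(3)
true_times = rng.weibull(1.3, size=300) * 40          # unobserved "true" survival times
observed = np.minimum(true_times, 36)                 # administrative censoring at 36 months
event = true_times <= 36                              # True = death observed, False = censored

wf = WeibullFitter()
wf.fit(observed, event_observed=event)

# Survival probabilities inside the trial window and extrapolated well beyond it.
print(wf.survival_function_at_times([12, 36, 60, 120]))

# Mean survival approximated as the area under the extrapolated curve over a
# 20-year horizon (an assumption), using a simple rectangle-rule integral.
grid = np.linspace(0, 240, 2_000)
dt = grid[1] - grid[0]
surv = wf.survival_function_at_times(grid).values
print("Approximate mean survival (months):", round((surv * dt).sum(), 1))
```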
In health technology assessments, a set of standard models typically includes: exponential, Weibull, Gompertz, log-logistic, log-normal, and generalised gamma models. 3 Each survival model involves different assumptions about the shape of the hazard function—that is, the risk over time of the event of interest, which is usually death. Figure 2 shows the hazard function shapes assumed when using standard parametric models; over time these can stay the same, increase, decrease, or have one turning point (that is, the hazard increases then decreases, or decreases then increases).
Fig 2: Survival modelling to extrapolate beyond the trial—hazard shapes associated with standard parametric survival models
Selecting a model
Extrapolating survival curves predicts the unknown. No one can know which models most accurately predict survival—although it might be possible to determine which models produce extrapolations that are plausible. Different models often result in substantially different estimates of survival and cost effectiveness. 4 Figure 3 shows a range of survival models fitted to the same data. While all the parametric models seem to fit the observed trial data well, they predict large differences in longer term and mean survival. The more immature the trial data, the more likely the long term predictions will differ. Model choice affects estimated treatment benefits and, consequently, cost effectiveness.
Fig 3: Survival modelling to extrapolate beyond the trial—a variety of standard parametric models fitted to the same data
To choose clinically plausible survival models, modellers must assess fit to the trial data, but also, crucially, assess the credibility of the extrapolations. 4 5 This approach involves considering external data sources with longer term data such as other trials, disease registries, and general population mortality rates. Biological plausibility, pharmacological mechanisms, and clinical opinion should also be considered. Although identifying a single best model might not be possible, this approach ensures that policy makers use credible models.
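Continuing the same toy data, the following sketch fits several standard parametric models and compares within-trial fit (AIC) with the mean survival each implies once extrapolated. It is only an illustration of how similar fit statistics can coexist with very different lifetime estimates, not a recommended selection procedure:

```python
import numpy as np
from lifelines import ExponentialFitter, WeibullFitter, LogNormalFitter, LogLogisticFitter

# Re-create the same simulated, censored trial data as in the previous sketch.
rng = np.random.default_rng(3)
true_times = rng.weibull(1.3, size=300) * 40
observed = np.minimum(true_times, 36)
event = true_times <= 36

grid = np.linspace(0, 240, 2_000)                     # assumed 20-year extrapolation horizon
dt = grid[1] - grid[0]

for Model in (ExponentialFitter, WeibullFitter, LogNormalFitter, LogLogisticFitter):
    m = Model().fit(observed, event_observed=event)
    mean_surv = (m.survival_function_at_times(grid).values * dt).sum()
    print(f"{Model.__name__:<18} AIC = {m.AIC_:7.1f}   extrapolated mean = {mean_surv:5.1f} months")
```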
Limitations of standard survival models
Standard parametric survival models have limitations. They might rely on hazard functions with implausible shapes ( figure 2 ), and might neither fit the data well nor provide credible extrapolations. As illustrated in figure 3 , the implications of choosing the wrong survival model are serious, because the choice of model affects survival predictions. Figure 4 illustrates a hypothetical hazard function of death from a cancer. No standard parametric models could capture the shape of this function, although more complex survival models can, such as flexible parametric models, fractional polynomials, piecewise models, or mixture cure models.
Fig 4: Survival modelling to extrapolate beyond the trial—a hypothesised, realistic hazard function
Flexible parametric models (such as restricted cubic spline models) segment the survival curve into portions, using knots to model hazard functions that have many turning points. 6 However, flexible parametric models will not generate turning points beyond the period of observed trial data unless modellers use external information, which they rarely do, such as longer term hazard rates from registry data. Indeed, while flexible parametric models are likely to fit the data well, beyond the data they reduce to standard Weibull, log-normal, or log-logistic models (therefore assuming that a transformation of the survival function is a linear function of log-time), and might generate implausible extrapolations. In figure 4 , if the trial were short and ended in the period where the hazard function is rising, a flexible parametric model would extrapolate that rising hazard, based on the observed segment of data.
An alternative option is to use fractional polynomials to model a hazard function with a complex shape, placing no restrictions on the hazard and survival functions beyond the period of observed data. However, while these models might fit the observed data well, the lack of restrictions on the extrapolation can lead to implausible predictions. 7 Other options include piecewise models, where separate survival models are fitted to defined portions of the observed survival data using cut-off points. The extrapolation is based on the model fitted to the final observed period. Piecewise models can be sensitive to the choice of cut-off points, and lead to extrapolations based on the last portion of data where numbers of trial participants and numbers of deaths among these participants are often low. 8 Generalised additive models and dynamic survival models have recently been suggested as potentially valuable novel approaches for modelling and extrapolating survival data. 7
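A piecewise model along these lines can be sketched with lifelines' PiecewiseExponentialFitter; the cut-off points used here (6 and 18 months) are arbitrary choices for illustration, and the extrapolation rests on the hazard estimated from the final segment:

```python
import numpy as np
from lifelines import PiecewiseExponentialFitter

# Same simulated, censored trial data as in the earlier sketches.
rng = np.random.default_rng(3)
true_times = rng.weibull(1.3, size=300) * 40
observed = np.minimum(true_times, 36)
event = true_times <= 36

# Piecewise exponential model with arbitrary cut-offs at 6 and 18 months: the hazard is
# constant within each segment, and the extrapolation beyond 36 months is driven entirely
# by the hazard estimated from the final (often sparsely populated) segment.
pw = PiecewiseExponentialFitter(breakpoints=[6, 18]).fit(observed, event_observed=event)
print(pw.summary[["coef"]])                       # one rate parameter per segment
print(pw.survival_function_at_times([36, 60, 120]))
```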
Mixture cure models can capture complex hazard functions because they predict survival separately for cured and uncured patients, 9 and estimate a cure fraction—that is, the proportion of patients who would be cured. Predicting survival for cured and uncured patients separately could result in a model that generates credible extrapolations. However, a key issue that is difficult—or perhaps impossible—is to estimate a cure fraction reliably based on short term data. When the cure fraction is estimated inaccurately, cure models can result in poor survival predictions.
Extrapolation in practice
Decision makers, such as those on committees of the National Institute for Health and Care Excellence (NICE), discuss, document, and assess the approaches that pharmaceutical companies use to predict long term survival. Often the approach has a large impact on cost effectiveness estimates ( box 1 ). Typically, NICE reviews appraisals three years after the initial recommendation, and some drugs are placed in the Cancer Drugs Fund, providing an opportunity for checking extrapolations once longer term data are available, often from the key trial. However, while drugs in the Cancer Drugs Fund undergo rigorous reappraisal, other reviews are rarely done comprehensively, leaving extrapolations unchecked.
Impact of survival modelling in technology appraisals by the National Institute for Health and Care Excellence (NICE)
When NICE appraised pembrolizumab for untreated, advanced oesophageal and gastro-oesophageal junction cancer, the appraisal committee identified four approaches to survival modelling that it considered to be credible. 10 These approaches were a log-logistic piecewise model, a log-logistic piecewise model incorporating an assumed waning of the treatment effect over time, a log-logistic model not fitted using a piecewise approach, and a generalised gamma piecewise model. The incremental gains in quality adjusted life years (QALYs) associated with pembrolizumab ranged from 0.50 to 1.07 QALYs per person over a lifetime, with the estimated cost per incremental QALY doubling between the most and least optimistic analysis. 11
When NICE appraised tisagenlecleucel (a chimeric antigen receptor T cell treatment) for relapsed or refractory, diffuse, large B cell, acute lymphoblastic leukaemia, the committee acknowledged that survival was a key uncertainty, considered cure possible, and discussed several mixture cure models. Cure fractions varied by 35 percentage points depending on the model, with cost effectiveness estimates that varied from potentially acceptable to unacceptable. 12 The committee accepted using a mixture cure model based on clinical experts suggesting that some patients could be cured. However, the committee preferred a model that estimated a lower cure fraction than that estimated by the manufacturer’s preferred model, because the manufacturer’s model predicted a cure fraction that was higher than the proportion of patients who remained event-free in the tisagenlecleucel trials. Tisagenlecleucel was recommended for use in the Cancer Drugs Fund to allow the trial to accrue more data on overall survival before making a final decision on its routine use in the NHS. 12
- Conclusions
When treatments make people live longer, it is important to extrapolate beyond the end of clinical trials to estimate mean survival gains and cost effectiveness over a period longer than the trial. Several survival models are available, and these result in widely varying estimates. To choose a model, researchers should consider a model’s fit to the observed trial survival data, and the credibility of predictions beyond the trial. More complex models could, but do not necessarily, result in better extrapolations. To inform decision making, survival models must be scrutinised while considering a range of plausible models and their impact on cost effectiveness. Analysts should follow recommended processes, report analyses clearly, justify chosen models by describing why and how the models have been selected, detail how well models fit the observed data, and describe what the models predict about hazards and survival. 4 8 This approach provides decision makers with the reassurance needed to make decisions when allocating healthcare resources.
Participants in research: Routine extrapolation of randomised controlled trials is absurd
- Bruce G Charlton, reader in evolutionary psychiatry ([email protected])
- School of Biology, University of Newcastle upon Tyne, Newcastle upon Tyne NE1 7RU
EDITOR—For more than a decade it has been an article of faith in evidence based medicine that randomised controlled trials are “best evidence” and their findings can routinely be extrapolated to clinical situations. 1 In his editorial Sackett, the founder of evidence based medicine, seeks retrospectively to reassure clinicians that this practice was justifiable, but the accompanying study by Vist et …
Extrapolating baseline trend in single-case data: Problems and tentative solutions
Rumen Manolov, Antonio Solanas & Vicenta Sierra
Published: 27 November 2018 | Volume 51, pages 2847–2869 (2019)
Single-case data often contain trends. Accordingly, to account for baseline trend, several data-analytical techniques extrapolate it into the subsequent intervention phase. Such extrapolation led to forecasts that were smaller than the minimal possible value in 40% of the studies published in 2015 that we reviewed. To avoid impossible predicted values, we propose extrapolating a damping trend, when necessary. Furthermore, we propose a criterion for determining whether extrapolation is warranted and, if so, how far out it is justified to extrapolate a baseline trend. This criterion is based on the baseline phase length and the goodness of fit of the trend line to the data. These proposals were implemented in a modified version of an analytical technique called Mean phase difference. We used both real and generated data to illustrate how unjustified extrapolations may lead to inappropriate quantifications of effect, whereas our proposals help avoid these issues. The new techniques are implemented in a user-friendly website via the Shiny application, offering both graphical and numerical information. Finally, we point to an alternative not requiring either trend line fitting or extrapolation.
Several features of single-case experimental design (SCED) data have been mentioned as potential reasons for the difficulty of analyzing such data quantitatively, for the lack of consensus regarding the most appropriate statistical analyses, and for the continued use of visual analysis (Campbell & Herzinger, 2010 ; Kratochwill, Levin, Horner, & Swoboda, 2014 ; Parker, Cryer, & Byrns, 2006 ; Smith, 2012 ). Some of the data features that have received the most attention are serial dependence (Matyas & Greenwood, 1997 ; Shadish, Rindskopf, Hedges, & Sullivan, 2013 ), the common use of counts or other outcome measures that are not continuous or normally distributed (Pustejovsky, 2015 ; Sullivan, Shadish, & Steiner, 2015 ), the shortness of the data series (Arnau & Bono, 1998 ; Huitema, McKean, & McKnight, 1999 ), and the presence of trends (Mercer & Sterling, 2012 ; Parker et al., 2006 ; Solomon, 2014 ). In the present article we focus on trends. The reason for this focus is that trend is a data feature whose presence, if not taken into account, can invalidate conclusions regarding an intervention’s effectiveness (Parker et al., 2006 ). Even when there is an intention to take the trend into account, several challenges arise. First, linear trend has been defined in several ways in the context of SCED data (Manolov, 2018 ). Second, there has been recent emphasis on the need to consider nonlinear trends (Shadish, Rindskopf, & Boyajian, 2016 ; Swan & Pustejovsky, 2018 ; Verboon & Peters, 2018 ). Third, some techniques for controlling trend may provide insufficient control (see Tarlow, 2017 , regarding Tau-U by Parker, Vannest, Davis, & Sauber, 2011 ), leading applied researchers to think that their results represent an intervention effect beyond baseline trend, which may not be justified. Fourth, other techniques may extrapolate baseline trend regardless of the degree to which the trend line is a good representation of the baseline data, and despite the possibility of impossible values being predicted (see Parker et al.’s, 2011 , comments on the regression model by Allison & Gorman, 1993 ). The latter two challenges compromise the interpretation of results.
Aim, focus, and organization of the article
The aim of the present article is to provide further discussion on four issues related to baseline trend extrapolation, based on the comments by Parker et al. ( 2011 ). As part of this discussion, we propose tentative solutions to the issues identified. Moreover, we specifically aim to improve one analytical procedure, which extrapolates baseline trend and compares this extrapolation to the actual intervention-phase data: the mean phase difference (MPD; Manolov & Solanas, 2013 ; see also the modification and extension in Manolov & Rochat, 2015 ).
Most single-case data-analytical techniques focus on linear trend, although there are certain exceptions. One exception is a regression-based analysis (Swaminathan, Rogers, Horner, Sugai, & Smolkowski, 2014 ), for which the possibility of modeling quadratic trend has been discussed explicitly. Another is Tau-U, developed by Parker et al. ( 2011 ), which deals more broadly with monotonic (not necessarily linear) trends. We stick here to linear trends and their extrapolation, a decision that reflects Chatfield’s ( 2000 ) statement that relatively simple forecasting methods are preferred, because they are potentially more easily understood. Moreover, this focus is well aligned with our willingness to improve the MPD, a procedure for fitting a linear trend line to baseline data. Despite this focus, three of the four issues identified by Parker et al. ( 2011 ), and the corresponding solutions we propose, are also applicable to nonlinear trends.
Organization
In the following sections, first we mention procedures that include extrapolating the trend line fitted in the baseline, and distinguish them from procedures that account for baseline trend but do not extrapolate it. Second, we perform a review of published research in order to explore how frequently trend extrapolation leads to out-of-bounds predicted values for the outcome variable. Third, we deal separately with the four main issues of extrapolating a baseline trend, as identified by Parker et al. ( 2011 ), and we offer tentative solutions to these issues. Fourth, on the basis of the proposals from the previous two points, we propose a modification of the MPD. In the same section, we also provide examples, based on previously published data, of the extent to which our modification helps avoid misleading results. Fifth, we include a small proof-of-concept simulation study.
Analytical techniques that entail extrapolating baseline trend
Visual analysis
When discussing how visual analysis should be carried out, Kratochwill et al. (2010) stated that “[t]he six visual analysis features are used collectively to compare the observed and projected patterns for each phase with the actual pattern observed after manipulation of the independent variable” (p. 18). Moreover, the conservative dual criteria for carrying out structured visual analysis (Fisher, Kelley, & Lomas, 2003) entail extrapolating split-middle trend in addition to extrapolating mean level. This procedure has received considerable attention recently as a means of improving decision accuracy (Stewart, Carr, Brandt, & McHenry, 2007; Wolfe & Slocum, 2015; Young & Daly, 2016).
Regression-based analyses
Among the procedures based on regression analysis, the last treatment day procedure (White, Rusch, Kazdin, & Hartmann, 1989 ) entails fitting ordinary least squares (OLS) trend lines to the baseline and intervention phases separately, and comparison between the two is performed for the last intervention phase measurement occasion. In the Allison and Gorman ( 1993 ) regression model, baseline trend is extrapolated before it is removed from both the A and B phases’ data. Apart from OLS regression, the generalized least squares proposal by Swaminathan et al. ( 2014 ) fits trend lines separately to the A and B phases, but baseline trend is still extrapolated for carrying out the comparisons. The overall effect size described by the authors entails comparing the treatment data as estimated from the treatment-phase trend line to the treatment data as estimated from the baseline-phase trend line.
Apart from the procedures based on the general linear model (assuming normal errors), generalized linear models (Fox, 2016 ) need to be mentioned as well in the present subsection. Such models can deal with count data, which are ubiquitous in single-case research (Pustejovsky, 2018a ), specifying a Poisson model (rather than a normal one) for the conditional distribution of the response variable (Shadish, Kyse, & Rindskopf, 2013 ). Other useful models are based on the binomial distribution, specifying a logistic model (Shadish et al., 2016 ), when the data are proportions that have a natural floor (0) and ceiling (100). Despite dealing with certain issues arising from single-case data, these models are not flawless. Note that a Poisson model may present limitations when the data are more variable than expected (i.e., alternative models have been proposed for overdispersed count data; Fox, 2016 ), whereas a logistic model may present the difficulty of not knowing the floor or ceiling (i.e., the upper asymptote) or of forcing artificial limits. Finally, what is most relevant to the topic of the present text is that none of these generalized linear models necessarily includes an extrapolation of baseline trend. Actually, some of them (Rindskopf & Ferron, 2014 ; Verboon & Peters, 2018 ) consider the baseline data together with the intervention-phase data in order to detect when the greatest change is produced. Other models (Shadish, Kyse, & Rindskopf, 2013 ) include an interaction term between the dummy phase variable and the time variable, making possible the estimation of change in slope.
Nonregression procedures
MPD involves estimating baseline trend and extrapolating it into the intervention phase in order to compare the predictions with the actual intervention-phase data. Another nonregression procedure, Slope and level change (SLC; Solanas, Manolov, & Onghena, 2010), involves estimating baseline trend and removing it from the whole series before quantifying the change in slope and the net change in level (hence, SLC). In one of the steps of the SLC, baseline trend is removed from the \(n_A\) baseline measurements and the \(n_B\) intervention-phase measurements by subtracting from each value (\(y_i\)) the slope estimate (\(b_1\)), multiplied by the measurement occasion (\(i\)). Formally, \( \tilde{y}_i = y_i - i \times b_1,\quad i = 1, 2, \dots, n_A + n_B \). This step does resemble extrapolating baseline trend, but there is no estimation of the intercept of the baseline trend line, and thus a trend line is not fitted to the baseline data and then extrapolated, which would lead to obtaining residuals as in Allison and Gorman’s (1993) model. Therefore, we consider that it is more accurate to conceptualize this step as removing baseline trend from the intervention-phase trend for the purpose of comparison.
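The general logic of extrapolating a fitted baseline trend and comparing it with intervention-phase data can be sketched as follows; this toy example uses an ordinary least-squares trend line and invented data, whereas the actual MPD procedure uses its own slope estimator:

```python
import numpy as np

# Invented single-case data: 8 baseline (A) and 10 intervention (B) observations.
baseline = np.array([12, 13, 11, 14, 15, 14, 16, 15], dtype=float)
intervention = np.array([18, 19, 21, 20, 22, 23, 24, 23, 25, 26], dtype=float)

# Fit a straight line to the baseline phase (here: ordinary least squares).
t_a = np.arange(1, len(baseline) + 1)
slope, intercept = np.polyfit(t_a, baseline, deg=1)

# Extrapolate the baseline trend across the intervention-phase measurement occasions.
t_b = np.arange(len(baseline) + 1, len(baseline) + len(intervention) + 1)
predicted = intercept + slope * t_b

# Summary in the spirit of MPD: mean difference between observed intervention data
# and what the extrapolated baseline trend alone would have predicted.
print("Mean difference (observed - projected):", round(float((intervention - predicted).mean()), 2))
```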
Nonoverlap indices
Among nonoverlap indices, the percentage of data points exceeding median trend (Wolery, Busick, Reichow, & Barton, 2010 ) involves fitting a split-middle (i.e., bi-split) trend line and extrapolating it into the subsequent phase. Regarding Tau-U (Parker et al., 2011 ), it only takes into account the number of baseline measurements that improve previous baseline measurements, and this number is subtracted from the number of intervention-phase values that improve the baseline-phase values. Therefore, no intercept or slope is estimated, and no trend line is fitted or extrapolated, either. The way in which trend is controlled for in Tau-U cannot be described as trend extrapolation in a strict sense.
Two other nonoverlap indices also entail baseline trend control. According to the “additional output” calculated at http://ktarlow.com/stats/tau/ , the baseline-corrected Tau (Tarlow, 2017 ) removes baseline trend from the data using the expression \( {\overset{\sim }{y}}_i={y}_i-i\times {b}_{1(TS)};i=1,2,\dots, \left({n}_A+{n}_B\right) \) , where b 1( TS ) is the Theil–Sen estimate of slope. In the percentage of nonoverlapping corrected data (Manolov & Solanas, 2009 ), baseline trend is eliminated from the n values via the same expression as for baseline-corrected Tau, \( {\overset{\sim }{y}}_i={y}_i-i\times {b}_{1(D)};i=1,2,\dots, \left({n}_A+{n}_B\right) \) , but slope is estimated via b 1( D ) (see Appendix B ) instead of via b 1( TS ) . Therefore, as we discussed above for SLC, there is actually no trend extrapolation in the baseline-corrected Tau or percentage-of-nonoverlapping-corrected data.
Procedures not extrapolating trend
The analytical procedures included in the present subsection do not extrapolate baseline trend, but they do take baseline trend into account. We decided to mention these techniques for three reasons. First, we wanted to provide a broader overview of analytical techniques applicable to single-case data. Second, we wanted to make it explicit that not all analytical procedures entail baseline trend extrapolation, and that such extrapolation is therefore not an indispensable step in single-case data analysis. In other words, it is possible to deal with baseline trend without extrapolating it. Third, the procedures mentioned here are among the most recently developed or suggested for single-case data analysis, and so they may be less widely known. Moreover, they can be deemed more sophisticated and more strongly grounded in statistical theory than MPD, which is the focus of the present article.
The between-case standardized mean difference, also known as the d statistic (Shadish, Hedges, & Pustejovsky, 2014), assumes stable data, but the possibility of detrending has been mentioned (Marso & Shadish, 2015) if baseline trend is present. It is not clear that a regression model using time and its interaction with a dummy variable representing phase entails baseline trend extrapolation. Moreover, a different approach was suggested by Pustejovsky, Hedges, and Shadish (2014) for obtaining a d statistic—namely, in relation to multilevel analysis. In multilevel analysis, also referred to as hierarchical linear models, the trend in each phase can be modeled separately, and the slopes can be compared (Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009). Another statistical option is to use generalized additive models (GAMs; Sullivan et al., 2015), in which there is greater flexibility for modeling the exact shape of the trend in each phase, without the need to specify a particular model a priori. GAMs that have been specifically suggested include the use of cubic polynomial curves fitted to different portions of the data and joined at the specific places (called knots) that divide the data into portions. Just as when using multilevel models, trend lines are fitted separately to each phase, without the need to extrapolate baseline trend.
A review of research published in 2015
Aim of the review.
It has already been stated (Parker et al., 2011) and illustrated (Tarlow, 2017) that baseline trend extrapolation can lead to impossible forecasts for the subsequent intervention-phase data. Accordingly, the research question we addressed was: in what percentage of studies does extrapolating the baseline trend of the data set (across several different techniques for fitting the trend line) lead to values that are below the lower bound or above the upper bound of the outcome variable?
Search strategy
We focused on the four journals that have published most SCED research, according to the review by Shadish and Sullivan ( 2011 ). These journals are Journal of Applied Behavior Analysis , Behavior Modification , Research in Autism Spectrum Disorders , and Focus on Autism and Other Developmental Disabilities . Each of these four journals published more than ten SCED studies in 2008, and the 76 studies they published represent 67% of all studies included in the Shadish and Sullivan review. Given that the bibliographic search was performed in September 2016, we focused on the year 2015 and looked for any articles using phase designs (AB designs, variations, or extensions) or alternation designs with a baseline phase and providing a graphical representation of the data, with at least three measurements in the initial baseline condition.
Techniques for finding a best fitting straight line
For the present review, we selected five techniques for finding a best-fitting straight line: OLS, split-middle, tri-split, Theil–Sen, and differencing. The motivation for this choice was that these five techniques are included in single-case data-analytical procedures (Manolov, 2018 ), and therefore, applied researchers can potentially use them. The R code used for checking whether out-of-bounds forecasts are obtained is available at https://osf.io/js3hk/ .
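As an illustration of some of these techniques, the R sketch below computes the OLS, Theil–Sen, and differencing slope estimates for a hypothetical baseline; the split-middle and tri-split lines are omitted for brevity, and the code is not the implementation used for the review.

```r
# Minimal sketches of three of the five slope estimators named above,
# applied to a hypothetical baseline; split-middle and tri-split omitted.
y_A <- c(12, 15, 13, 18, 20, 19)
t_A <- seq_along(y_A)

b1_ols <- unname(coef(lm(y_A ~ t_A))[2])       # OLS slope

prs   <- combn(length(y_A), 2)                 # Theil-Sen: median pairwise slope
b1_ts <- median((y_A[prs[2, ]] - y_A[prs[1, ]]) /
                (t_A[prs[2, ]] - t_A[prs[1, ]]))

b1_diff <- mean(diff(y_A))                     # differencing estimator

c(OLS = b1_ols, TheilSen = b1_ts, Differencing = b1_diff)
```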
Upper and lower bounds
The data were retrieved using Plot Digitizer for Windows (https://plotdigitizer.sourceforge.net). We counted the number and percentage of studies in which values out of logical bounds were obtained after extrapolating the baseline trend, estimated either from an initial baseline phase or from a subsequent withdrawal phase (e.g., in ABAB designs), for at least one of the data sets reported graphically in the article. The “logical bounds” were defined as 0 as a minimum and 1 or 100 as a maximum, when the measurement provided was a proportion or a percentage, respectively. Additional upper bounds included the maximal scores obtainable for an exam (e.g., Cheng, Huang, & Yang, 2015; Knight, Wood, Spooner, Browder, & O’Brien, 2015), for the number of steps in a task (e.g., S. J. Gardner & Wolfe, 2015), for the number of trials in the session (Brandt, Dozier, Juanico, Laudont, & Mick, 2015; Cannella-Malone, Sabielny, & Tullis, 2015), for the duration of transition between a stimulus and reaching a location (Siegel & Lien, 2015), or for the total duration of a session, when quantifying latency (Hine, Ardoin, & Foster, 2015). We chose a conservative approach and did not speculate Footnote 1 about upper bounds for behaviors that were expressed as either a frequency (e.g., Fiske et al., 2015; Ledbetter-Cho et al., 2015) or a rate (e.g., Austin & Tiger, 2015; Fahmie, Iwata, & Jann, 2015; Rispoli et al., 2015; Saini, Greer, & Fisher, 2015). Footnote 2
Results of the review
The numbers of articles included per journal are as follows. From the Journal of Applied Behavior Analysis, 27 SCED studies were included from the 46 “research articles” published (excluding three alternating-treatments designs without a baseline), and 20 more SCED studies were included from the 30 “reports” published (excluding two alternating-treatments designs without a baseline and one changing-criterion design). From Behavior Modification, eight SCED studies were included from the 39 “articles” published (excluding two alternating-treatments design studies without a baseline, two studies with other designs without phases, one study with phases but only two measurements in the baseline phase, meta-analyses of single cases, and articles on data analysis for single cases). From Research in Autism Spectrum Disorders, seven SCED studies were included from the 67 “original research articles” published (excluding one SCED study that did not have a minimum of three measurements per phase, as per Kratochwill et al., 2010). From Focus on Autism and Other Developmental Disabilities, six SCED studies were included from the 21 “articles” published. The references to all 68 articles reviewed are available in Appendix A at https://osf.io/js3hk/.
The results of this review are as follows. Extrapolation led to impossibly small values for all five trend estimators in 27 studies (39.71%), in contrast to 34 studies (50.00%) in which that did not happen for any of the trend estimators. Complementarily, extrapolation led to impossibly large values for all five trend estimators in eight studies (11.76%), in contrast to 56 studies (82.35%) in which that did not happen for any of the trend estimators. In terms of when the extrapolation led to an impossible value, a summary is provided in Table 1 . Note that this table refers to the data set in each article, including the earliest out-of-bounds forecast. Thus, it can be seen that for all trend-line-fitting techniques, it was most common to have out-of-bounds forecasts already before the third intervention phase measurement occasion. This is relevant, considering that an immediate effect can be understood to refer to the first three intervention data points (Kratochwill et al., 2010 ).
These results suggest that researchers using techniques that extrapolate baseline trend should be cautious about downward trends that, if continued, would lead to negative values. We do not claim that the four journals and the year 2015 are representative of all published SCED research, but the evidence obtained suggests that trend extrapolation may affect the meaningfulness of the quantitative operations performed with the predicted data frequently enough for it to be considered an issue worth investigating.
Main issues when extrapolating baseline trend, and tentative solutions
The main issues when extrapolating baseline trend that were identified by Parker et al. ( 2011 ) include (a) unreliable trend lines being fitted; (b) the assumption that trends will continue unabated; (c) no consideration of the baseline phase length; and (d) the possibility of out-of-bounds forecasts. In this section, we comment on each of these four issues identified by Parker et al. ( 2011 ) separately (although they are related), and we propose tentative solutions, based on the existing literature. However, we begin by discussing in brief how these issues could be avoided rather than simply addressed.
Avoiding the issues
Three decisions can be made in relation to trend extrapolation. First, the researcher may wonder whether there is any clear trend at all. For that purpose, a tool such as a trend stability envelope (Lane & Gast, 2014 ) can be used. According to Lane and Gast, a within-phase trend would be considered stable (or clear) when at least 80% of the data points fell within the envelope defined by the split-middle trend line plus/minus 25% of the baseline median. Similarly, Mendenhall and Sincich ( 2012 ) suggested, although not in the context of single-case data, that a good fit of an OLS trend line would be represented by a coefficient of variation of 10% or smaller. We consider that either of these descriptive approaches is likely to be more reasonable than testing the statistical significance of the baseline trend before deciding whether or not to take it into account, because such a statistical test might lack power for short baselines (Tarlow, 2017 ). Using Kendall’s tau as a measure of the percentage of improving data points (Vannest, Parker, Davis, Soares, & Smith, 2012 ) would not inform one about whether a clear linear trend were present, because it refers more generally to a monotonic trend.
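A rough R sketch of this kind of descriptive check is given below: it fits a simple split-middle line (implementations differ in how they handle odd-length halves, so this is only one possibility) and counts how many baseline points fall within plus/minus 25% of the baseline median around that line, using hypothetical data.

```r
# Sketch of a trend-stability-envelope check in the spirit of Lane and Gast
# (2014): at least 80% of baseline points within +/- 25% of the baseline
# median around a split-middle (bi-split) trend line. Hypothetical data.
y_A <- c(30, 34, 31, 36, 38, 35, 40, 41)
t_A <- seq_along(y_A)

halves <- split(seq_along(y_A),
                rep(1:2, each = ceiling(length(y_A) / 2),
                    length.out = length(y_A)))
mid_t <- sapply(halves, function(idx) median(t_A[idx]))
mid_y <- sapply(halves, function(idx) median(y_A[idx]))

slope_sm  <- (mid_y[2] - mid_y[1]) / (mid_t[2] - mid_t[1])
int_sm    <- mid_y[1] - slope_sm * mid_t[1]
fitted_sm <- int_sm + slope_sm * t_A

envelope <- 0.25 * median(y_A)
mean(abs(y_A - fitted_sm) <= envelope) >= 0.80   # TRUE = "stable"/clear trend
```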
Second, if the data show considerable variability and no clear trend, it is possible to use a quantification that does not rely on (a) linear trend, (b) any specific nonlinear trend, or (c) any average level whatsoever, by using a nonoverlap index. Specifically, the nonoverlap of all pairs (NAP; Parker & Vannest, 2009 ) can be used when the baseline data do not show a natural improvement, whereas Tau-U (Parker et al., 2011 ) can be used when such an improvement is apparent but it is not necessarily linear. Footnote 3 A different approach could be to quantify the difference in level (e.g., using a d statistic) after showing that the assumption of no trend is plausible via a GAM (Sullivan et al., 2015 ). Thus, there would be no trend line fitting and no trend extrapolation.
Third, if the trend looks clear (visually or according to a formal rule) and the researcher decides to take it into account, it is also possible not to extrapolate trend lines. For instance, it is possible to fit separate trend lines to the different phases and compare the slopes and intercepts of these trend lines, as in piecewise regression (Center, Skiba, & Casey, 1985–1986 ).
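For instance, a minimal R sketch of such a phase-by-phase comparison, using one common piecewise parameterization (the exact centering of the time-since-intervention variable varies across sources), could look as follows, with hypothetical data.

```r
# Sketch of piecewise regression for an AB comparison: time, a phase dummy,
# and a time-since-intervention term, so that trends are effectively fitted
# separately to each phase. Hypothetical data; centering conventions vary.
y   <- c(10, 11, 13, 12, 14, 8, 7, 6, 6, 5, 4)
n_A <- 5
t     <- seq_along(y)
phase <- rep(c(0, 1), c(n_A, length(y) - n_A))   # 0 = baseline, 1 = intervention
t_B   <- pmax(t - n_A, 0)                        # time elapsed within phase B

fit <- lm(y ~ t + phase + t_B)
coef(fit)   # 'phase' ~ change in level; 't_B' ~ change in slope
```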
Although these potential solutions seem reasonable, here we deal with another option: namely, the case in which baseline extrapolation is desired (because it is part of the analytical procedure chosen prior to data collection), but the researcher is willing to improve the way in which such extrapolation is performed.
First issue: Unreliable trend lines fitted
If an unreliable linear trend is fitted (e.g., if the relation between the time variable and the measurements is described by a small R² value), then the degree of confidence we have in the representation of the baseline data is reduced. If the fit of the baseline trend line to the data is poor, its extrapolation would also be problematic. All else being equal (i.e., for the same amount of variability), shorter baselines are expected to result in more uncertain estimates. In that sense, this issue is related to the next one.
Focusing specifically on reliability, we advocate quantifying the amount of fit of the trend line and using this information when deciding on baseline trend extrapolation. Regarding the comparison between actual and fitted values, Hyndman and Koehler (2006) reviewed the drawbacks of several measures of forecast accuracy, including widely known options such as the mean square error ( \( \frac{\sum_{i=1}^n{\left({y}_i-{\widehat{y}}_i\right)}^2}{n} \), based on a quadratic loss function and inversely related to R²) or the mean absolute error ( \( \frac{\sum_{i=1}^n\left|{y}_i-{\widehat{y}}_i\right|}{n} \), based on a linear loss function). Hyndman and Koehler proposed the mean absolute scaled error (MASE). For a trend line fitted to the n A baseline measurements, MASE can be written as follows:
\( MASE=\frac{\frac{1}{n_A}{\sum}_{i=1}^{n_A}\left|{y}_i-{\widehat{y}}_i\right|}{\frac{1}{n_A-1}{\sum}_{i=2}^{n_A}\left|{y}_i-{y}_{i-1}\right|} \)
Hyndman and Koehler (2006, p. 687) stated that MASE is “easily interpretable, because values of MASE greater than one indicate that the forecasts are worse, on average, than in-sample one-step forecasts from the naïve method.” (The naïve method entails predicting a value from the previous one—i.e., the random-walk model that has frequently been used to assess the degree to which more sophisticated methods provide more accurate forecasts than this simple procedure; Chatfield, 2000.) Thus, values of MASE greater than one could indicate that a general trend (e.g., a linear one, as in MPD) does not provide a good enough fit to the data from which it was estimated, because it does not improve on the fit of the naïve method.
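A minimal R sketch of this computation is shown below, comparing the absolute residuals of a baseline trend line (here an OLS fit, although any of the estimators mentioned earlier could be used) to the in-sample one-step naïve errors; the data are hypothetical.

```r
# Sketch of computing MASE for a baseline trend line: mean absolute error of
# the trend line divided by the mean absolute error of in-sample one-step
# naive (random-walk) forecasts. Hypothetical baseline data; OLS fit used.
y_A <- c(22, 25, 24, 28, 27, 31)
t_A <- seq_along(y_A)

fit_A   <- lm(y_A ~ t_A)
e_trend <- abs(y_A - fitted(fit_A))   # absolute errors of the trend line
e_naive <- abs(diff(y_A))             # each value predicted from the previous one

mase <- mean(e_trend) / mean(e_naive)
mase   # values > 1: the trend line does not improve on the naive method
```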
Second issue: Assuming that trend continues unabated
This issue refers to treating baseline trend as if it were always the same for the whole period of extrapolation. By default, all the analytical techniques mentioned in the “Analytical Techniques That Entail Extrapolating Baseline Trend” section extrapolate baseline trend until the end of the intervention phase. Thus, one way of dealing with this issue would be to limit the extrapolation, following Rindskopf and Ferron (2014), who stated that “for a short period, behavior may show a linear trend, but we cannot project that linear behavior very far into the future” (p. 229). Similarly, when discussing the gradual-effects model, Swan and Pustejovsky (2018) also cautioned against long extrapolations, although their focus was on the intervention phase and not on the baseline phase.
An initial approach could be to select how far out to extrapolate baseline trend prior to gathering and plotting the data, by selecting a number that would be the same across studies. When discussing an approach for comparing levels when trend lines are fitted separately to each phase, it has been suggested that a comparison can be performed at the fifth intervention-phase measurement occasion (Rindskopf & Ferron, 2014; Swaminathan et al., 2014). It is possible to extend this recommendation to the present situation and state that the baseline trend should be extrapolated until the fifth intervention-phase measurement occasion. The choice of five measurements is arbitrary, but it is well-aligned with the minimal phase length required in the What Works Clearinghouse Standards (Kratochwill et al., 2010). Nonetheless, our review (Table 1) suggests that impossible extrapolations are common even before the fifth intervention-phase measurement occasion, and thus a comparison at that point might not avoid comparison with an impossible projection from the baseline. Similarly, when presenting the gradual-effects model, Swan and Pustejovsky (2018) defined the calculation of the effect size for an a-priori-set number of intervention-phase measurement occasions. In their study, this number depends on the actually observed intervention-phase lengths. Moreover, Swan and Pustejovsky suggested a sensitivity analysis, comparing the results of several possible a-priori-set numbers. It could be argued that a fixed choice would avoid making data-driven decisions that could favor finding results in line with the expectations of the researchers (Wicherts et al., 2016). A second approach would be to choose how far away to extrapolate on the basis of both a design feature (baseline phase length; see the next section) and a data feature (the amount of fit of the trend line to the data, expressed as the MASE). In the following discussion, we present a tentative solution including both these aspects.
Third issue: No consideration of baseline-phase length
Parker et al. ( 2011 ) expressed a concern that baseline trend correction procedures do not take into consideration the length of the baseline phase. The problem is that a short baseline is potentially related to unreliable trend, and it could also entail predicting many values (i.e., a longer intervention phase) from few values, which is not justified.
To take baseline length ( n A ) into account, one approach would be to limit the extrapolation of baseline trend to the first n A treatment-phase measurement occasions. This approach introduces an objective criterion based on a characteristic of the design. A conservative version of this alternative would be to estimate how far out to extrapolate using the following expression: \( {\widehat{n}}_B=\left\lfloor {n}_A\times \left(1- MASE\right)\right\rfloor \), applying the restriction that \( 0\le {\widehat{n}}_B\le {n}_B \). Thus, the extrapolation is determined by both the number of baseline measurements ( n A ) and the goodness of fit of the trend line to the data. When MASE > 1, the expression for \( {\widehat{n}}_B \) would give a negative value, precluding extrapolation. For data in which MASE < 1, the better the fit of the trend line to the data, the further out extrapolation could be considered justified. From the expression presented for \( {\widehat{n}}_B \), it can be seen that if the result of the multiplication is not an integer, the value representing the number of intervention-phase measurement occasions to which to extend the baseline trend ( \( {\widehat{n}}_B \) ) is truncated. Finally, note the restriction that \( {\widehat{n}}_B \) should be equal to or smaller than n B : it is possible that the baseline is longer than the intervention phase ( n A > n B ) and that, even after applying the correction factor representing the fit of the trend line, \( {\widehat{n}}_B>{n}_B \) would result. Thus, whenever \( {\widehat{n}}_B>{n}_B \), it is reset to \( {\widehat{n}}_B={n}_B \).
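A small R helper implementing this rule could be written as follows; the function name is illustrative, and the MASE value is assumed to have been computed as in the earlier sketch.

```r
# Sketch of the rule described above: n_B_hat = floor(n_A * (1 - MASE)),
# restricted to lie between 0 and n_B. The function name is illustrative.
limit_extrapolation <- function(n_A, n_B, mase) {
  n_B_hat <- floor(n_A * (1 - mase))
  max(0, min(n_B_hat, n_B))
}

limit_extrapolation(n_A = 6, n_B = 12, mase = 0.55)   # floor(6 * 0.45) = 2
```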
Fourth issue: Out-of-bounds forecasts
Extrapolating baseline trend for five, n A , or \( {\widehat{n}}_B \) measurement occasions may make trend extrapolation more reasonable (or, at least, less unreasonable), but none of these options precludes out-of-bounds forecasts. When Parker et al. ( 2011 ) discussed the issue that certain procedures to control for baseline trend could lead to projecting trend beyond rational limits, they proposed the conservative trend correction procedure implemented in Tau-U. This procedure could be useful for statistically controlling baseline trend, although the evidence provided by Tarlow ( 2017 ) suggests that the trend control incorporated in Tau-U is insufficient (i.e., leads to false positive results), especially as compared to other procedures, including MPD. An additional limitation of this trend correction procedure is that it cannot be used when extrapolating baseline trend. Therefore, we consider other options in the following text.
Nonlinear models
One option, suggested by Rindskopf and Ferron ( 2014 ), is to use nonlinear models for representing situations in which a stable and low initial level during the baseline phase experiences a change due to the intervention (e.g., an upward trend) before settling at a stable high level. Rindskopf and Ferron suggested using logistic regression with an additional term for identifying the moment at which the response has gone halfway between the floor and the ceiling. Similarly, Shadish et al. ( 2016 ) and Verboon and Peters ( 2018 ) used a logistic model for representing data with clear floor and ceiling effects. The information that can be obtained by fitting a generalized logistic model is in terms of the floor and ceiling levels, the rate of change, and the moments at which the change from the floor to the ceiling plateau starts and stops (Verboon & Peters, 2018 ). Shadish et al. ( 2016 ) acknowledged that not all analysts are expected to be able to fit intrinsically nonlinear models and that choosing one model over another is always partly arbitrary, suggesting nonparametric smoothing as an alternative.
Focusing on the need to improve MPD, the proposals by Rindskopf and Ferron ( 2014 ) and Verboon and Peters ( 2018 ) are not applicable, since the logistic model they present deals with considering the data of a baseline phase and an intervention phase jointly, whereas in MPD baseline trend is estimated and extrapolated in order to allow for a comparison between projected and observed patterns of the outcome variable (as suggested by Kratochwill et al., 2010 , and Horner, Swaminathan, Sugai, & Smolkowski, 2012 , when performing visual analysis). In contrast, Shadish et al. ( 2016 ) used the logistic model for representing the data within one of the phases in order to explore whether any within-phase change took place, but they were not aiming to use the within-phase model for extrapolating to the subsequent phase.
Although not all systematic changes in the behavior of interest are necessarily linear, there are three drawbacks to applying nonlinear models to single-case data, or even to usually longer time-series data (Chatfield, 2000 ). First, there has not been extensive research with short-time-series data and any of the possible nonlinear models (e.g., logistic, Gompertz, or polynomial) applicable for modeling growth curves in order to ensure that known minimal and maximal values of the measurements are not exceeded. Second, it may be difficult to distinguish between a linear model with disturbance and an inherently nonlinear model. Third, a substantive justification is necessary, based either on theory or on previously fitted nonlinear models, for preferring one nonlinear model instead of another or for preferring a nonlinear model instead of the more parsimonious linear model. However, the latter two challenges are circumvented by GAMs, because they allow one to avoid the need to explicitly posit a specific model for the data (Sullivan et al., 2015 ).
Winsorizing
Faith, Allison, and Gorman (1997) suggested manually rescaling out-of-bounds predicted scores to within the limits, a manipulation similar to winsorization. Thus, a trend is extrapolated until the predicted values are no longer possible, and then a flat line is set at the minimum/maximum possible value (e.g., 0 when the aim is to eliminate a behavior, or 100% when the aim is to improve the completion of a certain task). The “manual” rescaling of out-of-bounds forecasts could be supported by Chatfield’s (2000, pp. 175–179) claim that it is possible to make judgmental adjustments to forecasts and also to use the “eyeball test” for checking whether forecasts are intuitively reasonable, given that background knowledge (albeit background as simple as knowing the bounds of the outcome variable) is part of nonautomatic univariate methods for forecasting in time-series analysis. In summary, just as in the logistic model, winsorizing the trend line depends on the data at hand. As a limitation, Parker et al. (2011) claimed that such a correction would impose an artificial ceiling on the effect size. However, it could also be argued that computing an effect size on the basis of impossible values is equally (or more) artificial, since it involves only crunching numbers, some of which (e.g., negative frequencies) are meaningless.
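In computational terms, this correction amounts to clamping the extrapolated values at the known bounds, as in the brief R sketch below (illustrative function name and values).

```r
# Sketch of "winsorizing" out-of-bounds forecasts: once the extrapolated trend
# crosses a bound of the outcome variable, forecasts are held at that bound.
winsorize_forecasts <- function(forecasts, lower = 0, upper = Inf) {
  pmin(pmax(forecasts, lower), upper)
}

winsorize_forecasts(c(12, 7, 2, -3, -8), lower = 0)   # 12 7 2 0 0
```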
Damping trend
A third option arises from time-series forecasting, in which exponential smoothing is one of the methods commonly used (Billah, King, Snyder, & Koehler, 2006 ). Specifically, in double exponential smoothing, which can be seen as a special case of Holt’s ( 2004 ) linear trend procedure, it is possible to include a damping parameter (E. S. Gardner & McKenzie, 1985 ) that indicates how much the slope of the trend is reduced in subsequent time periods. According to the review performed by E. S. Gardner ( 2006 ), the damped additive trend is the model of choice when using exponential smoothing. A damped trend can be interpreted as an attenuation reflecting the gradual reduction of the trend until the behavior eventually settles at an upper or a lower asymptote. This would address Parker et al.’s ( 2011 ) concern that it may not be reasonable to consider that the baseline trend will continue unabated until the end of the intervention phase in the absence of an effect. Moreover, the behavioral progression is more gradual than the one implied when winsorizing. Furthermore, a gradual change is also the basis of recent proposals for modeling longitudinal data using generalized additive models (Bringmann et al., 2017 ).
Aiming for a tentative solution for out-of-bounds forecasts for techniques such as MPD, we consider it reasonable to borrow the idea of damping the trend from the linear trend model by Holt ( 2004 ). In contrast, the application of that model in its entirety to short SCED baselines (Shadish & Sullivan, 2011 ; Smith, 2012 ; Solomon, 2014 ) is limited by the need to estimate several parameters (a smoothing parameter for level, a smoothing parameter for trend, a damping parameter, the initial level, and the initial trend).
We consider that a gradually reduced trend conceptualization seems more substantively defensible than abruptly winsorizing the trend line. In that sense, instead of extrapolating the linear trend until the lower or upper bound is reached and then flattening the trend line, it is possible to estimate the damping coefficient in such a way as to ensure that impossible forecasts are not obtained during the period of extrapolation (i.e., in the \( {\widehat{n}}_B \) or n B measurement occasions after the last baseline data point, according to whether extrapolation is limited, as we propose here, or not). The damping parameter is usually represented by the Greek letter phi ( φ ), so that the trend line extrapolated into the intervention phase would be based on the baseline trend ( b 1 ) as follows: \( {b}_1\times {\varphi}^i;i=1,2,\dots, {\widehat{n}}_B \) , so that the first predicted intervention-phase measurement is \( {\widehat{y}}_1={\widehat{y}}_{n_A}+{b}_1\times \varphi \) , and the subsequent forecasts (for \( i=2,3,\dots, {\widehat{n}}_B \) ) are obtained via \( {\widehat{y}}_i={\widehat{y}}_{i-1}+{b}_1\times {\varphi}^i \) . The previous expressions are presented using \( {\widehat{n}}_B \) , but they can be rewritten using n B in the case that extrapolation is not limited in time. For avoiding extrapolation to impossible values, the damping parameter would be estimated from the data in such a way that the final predicted value \( {\widehat{y}}_{{\widehat{n}}_B} \) would still be within the bounds of the outcome variable. We propose an iterative process checking the values of φ from 0.05 to 1.00 in steps of 0.001, in order to identify the largest φ value k for which there are no out-of-bounds values, whereas for ( k + 0.001) there is one or more such values. The closer φ is to 1, the farther away in the intervention phase is the first out-of-bounds forecast produced. Estimating φ from the data and not setting it to an a-priori-chosen value is in accordance with the usually recommended practice in exponential smoothing (Billah et al., 2006 ).
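The following R sketch illustrates this estimation of φ with hypothetical inputs; it searches the same grid from the upper end downward, which identifies the same largest admissible value in the usual case in which larger φ values produce more extreme forecasts.

```r
# Sketch of the grid search for the damping parameter phi: the largest value
# in {0.05, 0.051, ..., 1.00} for which all damped forecasts stay within the
# bounds of the outcome variable. y_hat_nA is the last fitted baseline value.
damped_forecasts <- function(y_hat_nA, b1, n_ahead, phi) {
  y_hat_nA + cumsum(b1 * phi^(seq_len(n_ahead)))
}

find_phi <- function(y_hat_nA, b1, n_ahead, lower = 0, upper = Inf) {
  for (phi in rev(seq(0.05, 1, by = 0.001))) {
    f <- damped_forecasts(y_hat_nA, b1, n_ahead, phi)
    if (all(f >= lower & f <= upper)) return(phi)
  }
  0   # no admissible value in the grid: trend damped completely (flat line)
}

find_phi(y_hat_nA = 3, b1 = -1.2, n_ahead = 5, lower = 0)
```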
Justification of the tentative solutions
Our main proposal is to combine the quantitative criterion for how far out to extrapolate baseline trend ( \( {\widehat{n}}_B \) ) with damping, in case the latter is necessary within the \( {\widehat{n}}_B \) limit. The fact that both \( {\widehat{n}}_B \) and the damping parameter φ are estimated from the data rather than being predetermined implies that this proposal is data-driven. We consider that the data-driven quantification of \( {\widehat{n}}_B \) is also not necessarily a drawback, due to three reasons: (a) An objective formula was proposed for estimating how far out it is reasonable to extrapolate the baseline trend, according to the data at hand; that is, the choice is not made subjectively by the researcher in order to favor his/her hypotheses. (b) This formula is based on both a design feature (i.e., the baseline phase length) and a data feature (i.e., the MASE as a measure of the accuracy of the trend line fitted). And (c) no substantive reason may be available a priori regarding when extrapolation becomes unjustified.
We also consider that estimating the damping parameter from the data is not a drawback, either, given that (a) φ is estimated from the data in Holt’s linear trend model for which it was proposed; (b) damping trend can be considered conceptually similar to choosing a function, in a growth curve model, that makes possible incorporating an asymptote (Chatfield, 2000 ), because both methods model decisions made by the researcher on the basis of knowing the characteristics of the data and, in both cases, the moment at which the asymptote is reached depends on the data at hand and not on a predefined criterion; and (c) the use of regression splines (Bringmann et al., 2017 ; Sullivan et al., 2015 ) for modeling a nonlinear relation is also data-driven, despite the fact that a predefined number of knots may be used.
The combined use of \( {\widehat{n}}_B \) plus the estimation of φ can be applied to the OLS baseline trend (as used in the Allison & Gorman, 1993 , model), to the split-middle trend (as used in the conservative dual criterion, Fisher et al., 2003 ; or in the percentage of data points exceeding the median trend, Wolery et al., 2010 ), or to the trend extrapolation that is part of MPD (Manolov & Solanas, 2013 ). In the following section, we focus on MPD.
The present proposal is also well-aligned with Bringmann et al.’s ( 2017 ) recommendation for models that do not require existing theories about the expected nature of the change in the behavior, excessively high computational demands, or long series of measurements. Additionally, as these authors suggested, the methods need to be readily usable by applied researchers, which is achieved by the software implementations we have created.
Limitations of the tentative solutions
As we mentioned previously, it could be argued that the tentative solutions are not necessary if the researcher simply avoids extrapolation. Moreover, we do not argue that the expressions presented for deciding whether and how far to extrapolate are the only possible, or necessarily the optimal, ones; we rather aimed at defining an objective rule on a solid, albeit arbitrary, basis. An additional limitation, as was suggested by a reviewer, is that for a baseline with no variability, MASE would not be defined. In such a case, when the same value is repeated n A times (e.g., when the value is 0 because the individual is unable to perform the action required), we do consider that an unlimited extrapolation would be warranted, because the reference to which the intervention-phase data would be compared would be clear and unambiguous.
Incorporating the tentative solutions in a data-analytical procedure
Modifying the MPD.
The revised version of the MPD includes the following steps:
1. Estimate the slope of the baseline trend as the average of the differenced data ( b 1( D ) ).
2. Fit the trend line, choosing Footnote 4 one of the three definitions of the intercept (see Appendix B at https://osf.io/js3hk/ ), according to the value of the MASE.
3. Extrapolate the baseline trend, if justified (i.e., if MASE < 1), for as many intervention-phase measurement occasions as is justified (i.e., for the first \( {\widehat{n}}_B \) measurement occasions of the intervention phase), considering the need for damping the trend to avoid out-of-bounds forecasts. The damping parameter φ would be equal to 1 when all \( {\widehat{n}}_B \) forecasts are within bounds, or φ < 1 otherwise.
4. Compute MPD as the difference between the actually obtained and the forecast first \( {\widehat{n}}_B \) intervention-phase values.
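To show how these steps fit together, the following compact R sketch applies them to a single AB comparison. It is an illustration rather than the implementation in the Shiny application referenced below: only one intercept definition is used (a Theil–Sen-like intercept, median of the baseline values minus b1 times the median baseline occasion, since Appendix B is not reproduced here), and MPD is taken as the mean of the differences between the actual and the forecast values, following step 4.

```r
# Compact sketch of the modified MPD for one AB comparison (hypothetical data
# and bounds; one illustrative intercept definition only).
modified_mpd <- function(y_A, y_B, lower = 0, upper = Inf) {
  n_A <- length(y_A); n_B <- length(y_B)
  t_A <- seq_len(n_A)

  # Step 1: slope via differencing
  b1 <- mean(diff(y_A))

  # Step 2: fit the trend line and compute MASE
  b0       <- median(y_A) - b1 * median(t_A)     # illustrative intercept
  fitted_A <- b0 + b1 * t_A
  mase     <- mean(abs(y_A - fitted_A)) / mean(abs(diff(y_A)))

  # Step 3: extrapolate only as far as justified, damping if needed
  n_B_hat <- max(0, min(floor(n_A * (1 - mase)), n_B))
  if (n_B_hat < 1) return(list(MASE = mase, n_B_hat = n_B_hat, MPD = NA))

  phi <- 0
  for (p in rev(seq(0.05, 1, by = 0.001))) {
    f <- fitted_A[n_A] + cumsum(b1 * p^(seq_len(n_B_hat)))
    if (all(f >= lower & f <= upper)) { phi <- p; break }
  }
  forecasts <- fitted_A[n_A] + cumsum(b1 * phi^(seq_len(n_B_hat)))

  # Step 4: MPD over the first n_B_hat intervention-phase values
  list(MASE = mase, n_B_hat = n_B_hat, phi = phi,
       MPD = mean(y_B[seq_len(n_B_hat)] - forecasts))
}

modified_mpd(y_A = c(9, 8, 8, 6, 5), y_B = c(3, 2, 1, 1, 0, 0, 0), lower = 0)
```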
Illustration of the proposal for modifying MPD
In the present section, we chose three of the studies included in the review that we performed (all three data sets are available at https://osf.io/js3hk/ , in the format required by the Shiny application, http://manolov.shinyapps.io/MPDExtrapolation , implementing the modified version of MPD). From the illustrations it is clear that, although the focus of the present text is comparison between a pair of phases, such a comparison can be conceptualized to be part of a more appropriate design structure, such as ABAB or multiple-baseline designs (Kratochwill et al., 2010 ; Tate et al., 2013 ), by replicating the same procedure for each AB comparison. Such a means of analyzing data corresponds to the suggestion by Scruggs and Mastropieri ( 1998 ) to perform comparisons only for data that maintain the AB sequence.
The Ciullo, Falcomata, Pfannenstiel, and Billingsley ( 2015 ) data were chosen because their multiple-baseline design includes short baselines and extrapolation to out-of-bounds forecasts (impossibly low values) for both the first tier Footnote 5 (Fig. 1 ) and the third tier. In Fig. 1 , trend extrapolation was not limited (i.e., the baseline trend was extrapolated for all n B = 7 values), to allow for comparing winsorizing and damping the trend. Limiting the extrapolation to \( {\widehat{n}}_B \) = 2 would have made either winsorizing or damping the trend unnecessary, because no out-of-bound forecasts would have been obtained; MPD would have been equal to 40.26.
Fig. 1 Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. The results in the left panel are based on winsorizing the trend line when the lower bound is reached. The results in the right panel are based on damping the trend. Trend extrapolation is not limited. The data correspond to the first tier (a participant called Salvador) in the Ciullo et al. ( 2015 ) multiple-baseline design study
Limiting the amount of extrapolation seems reasonable, because for both of these tiers the intervention phase is almost three times as long as the baseline phase; using \( {\widehat{n}}_B \) leads to avoiding impossibly low forecasts for these data and to more conservative estimates of the magnitude of the effect. Damping the trend line was necessary for three of the four tiers, where it also led to more conservative estimates, given that the out-of-bounds forecasts were in a direction opposite from the one desired with the intervention. The numerical results are available in Table 2 .
The data from Allen, Vatland, Bowen, and Burke ( 2015 ) were chosen, because this study represents a different data pattern: Longer baselines are available, which could allow for better estimation of the trend, but the baseline data are apparently very variable. Intervention phases were also longer, which required extrapolations farther out in time. Thus, we wanted to illustrate how limiting extrapolations affects the quantification of an effect.
For Tier 1, out-of-bounds forecasts (impossibly high values in the same direction as desired for the intervention) are obtained. However, damping the trend led to avoiding such forecasts and also to greater estimates of the effect. For Tiers 2 and 3 (the latter is represented in Fig. 2 ), limiting the amount of extrapolation had a very strong effect, due to the high MASE values, and only a very short extrapolation was justified. The limited extrapolation is also related to greater estimates of the magnitude of the effect for these two tiers.
Fig. 2 Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. Trend extrapolation was not limited (left) versus limited (right). Damping the trend was not necessary in either case ( φ = 1). The data correspond to the third tier of the Allen et al. ( 2015 ) multiple-baseline design study
Therefore, using only the first \( {\widehat{n}}_B \) intervention-phase data points for the comparison reflects a reasonable doubt regarding whether the (not sufficiently clear) improving baseline trend would have continued unchanged throughout the whole intervention phase (i.e., for 23 or 16 measurement occasions, for Tiers 2 and 3, respectively). The numerical results are available in Table 3 .
The data from Eilers and Hayes ( 2015 ) were chosen because they include baselines of varying lengths, out-of-bounds forecasts for Tiers 1 and 2, and a nonlinear pattern in Tier 3 (to which a linear trend line is expected to show poor fit). For these data, damping and limiting the extrapolation, when applied separately, both correct overestimation of the effect that would arise from out-of-bounds (high) forecasts in a direction opposite from the one desired in the intervention. Such an overestimation, in the absence of damping, would lead to MPD values implying more than a 100% reduction, which is meaningless (see Fig. 3 ).
Fig. 3 Results for mean phase difference (MPD) with the slope estimated through differencing and the intercept computed as in the Theil–Sen estimator. Trend was damped completely (right; φ = 0) versus not damped (left; φ = 1). Trend extrapolation is not limited in this figure. The data correspond to the second tier of the Eilers and Hayes ( 2015 ) multiple-baseline design study
Specifically, damping the trend is necessary in Tiers 1 and 2 to avoid such forecasts. Note that for Tier 3, the fact that a straight line does not represent the baseline data well is reflected by MASE > 1 and \( {\widehat{n}}_B<1 \) , leading to a recommendation not to extrapolate the baseline trend. The numerical results are available in Table 4 .
General comments
In general, the modifications introduced in MPD achieve the aims to (a) avoid extrapolating from a short baseline to a much longer intervention phase (Example 1); (b) avoid assuming that the trend will continue exactly the same for many measurement occasions beyond the baseline phase (Example 2); (c) follow an objective criterion regarding a baseline trend line that is not justified in being extrapolated at all (Example 3); and (d) avoid excessively large quantifications of effect when comparing to impossibly bad (countertherapeutic) forecasts in the absence of an effect (Examples 1 and 3). Furthermore, note that for all the data sets included in this illustration, the smallest MASE values were obtained using the Theil–Sen definition of the intercept.
Small-scale simulation study
To obtain additional evidence regarding the performance of the proposals, an application to generated data was a necessary complement to the application of our proposals to previously published real behavioral data. The simulation presented in this section should be understood as a proof of concept, rather than as a comprehensive source of evidence. We consider that further thought and research should be dedicated to simulating discrete bounded data (e.g., counts, percentages) and to studying the present proposals for deciding how far to extrapolate baseline trend and how to deal with impossible extrapolations.
Data generation
We simulated independent and autocorrelated count data using a Poisson model, following the article by Swan and Pustejovsky (2018) and adapting the R code available in the supplementary material to their article ( https://osf.io/gaxrv and https://www.tandfonline.com/doi/suppl/10.1080/00273171.2018.1466681 ). The adaptation consisted of adding the general trend for certain conditions (denoted here by β 1 , whereas β 2 denotes the change-in-level parameter, unlike in Swan & Pustejovsky, 2018, who denoted the change in level by β 1 ) and simulating immediate instead of delayed effects (i.e., we set ω = 0). Given that ω = 0, the simulation model, as described by Swan and Pustejovsky, is as follows. The mathematical expectation for each measurement occasion is μ t = exp( β 0 + β 1 t + β 2 D ), where t is the time variable taking the values 1, 2, . . . , n A + n B , and D is a dummy variable for change in level, taking n A values of 0 followed by n B values of 1. The first value, Y 1 , is simulated from a Poisson distribution with a mean set to λ 1 = μ 1 . Subsequent values ( j = 2, 3, . . . , n A + n B ) are simulated taking autocorrelation into account ( φ j = min { φ , μ j / μ j − 1 }), leading to the following mean for the Poisson distribution: λ j = μ j − φ j μ j − 1 . Finally, the values from the second to the last were simulated as Y j = X j + Z j , where Z j follows a Poisson distribution with mean λ j , and X j follows a binomial distribution with Y j − 1 trials and a probability of φ j .
The specific simulation parameters for defining μ t were e β0 = 50 (representing the baseline frequency), β 1 = 0, − 0.1, − 0.2, β 2 = − 0.4 (representing the intervention effect as an immediate change in level), and autocorrelation φ = 0 or 0.4. Regarding the intervention effect, according to the formula % change = 100 % × [exp( β 2 ) − 1] (Pustejovsky, 2018b ), the effect was a reduction of approximately 33%, or 16.5 points, from the baseline level ( e β0 ), set to 50. The phase lengths ( n A = n B ) were 5, 7, and 10.
The specific simulation parameters β , as well as simulating the intervention effect as a reduction, were chosen in such a way as to produce a floor effect for certain simulation conditions. That is, for some of the conditions, the values of the dependent variable were equal or close to zero before the end of the intervention phase, and thus could not improve any more. For these conditions, extrapolating the baseline trend would lead to impossible negative forecasts. Such a data pattern represents well the findings from our review, according to which in almost 40% of the articles at least one AB comparison led to impossible negative predictions if the baseline trend were continued. Example data sets of the simulation conditions are presented as figures at https://osf.io/js3hk/ . A total of 10,000 iterations were performed for each condition using R code ( https://cran.r-project.org ).
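A minimal R sketch of this data-generating process, following the steps described above (with parameter values passed in as arguments, so the values shown below are only one example condition), is as follows.

```r
# Sketch of the data-generating process described above (adapted conceptually
# from Swan & Pustejovsky, 2018): Poisson counts with a log-linear mean, an
# immediate change in level, and autocorrelation via binomial thinning.
simulate_ab_counts <- function(n_A, n_B, beta0, beta1, beta2, phi) {
  n  <- n_A + n_B
  t  <- seq_len(n)
  D  <- rep(c(0, 1), c(n_A, n_B))
  mu <- exp(beta0 + beta1 * t + beta2 * D)

  y <- numeric(n)
  y[1] <- rpois(1, mu[1])
  for (j in 2:n) {
    phi_j    <- min(phi, mu[j] / mu[j - 1])
    lambda_j <- mu[j] - phi_j * mu[j - 1]
    y[j]     <- rbinom(1, size = y[j - 1], prob = phi_j) + rpois(1, lambda_j)
  }
  y
}

set.seed(1)   # one example condition: e^beta0 = 50, beta1 = -0.1, beta2 = -0.4
simulate_ab_counts(n_A = 5, n_B = 5, beta0 = log(50), beta1 = -0.1,
                   beta2 = -0.4, phi = 0.4)
```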
Data analysis
Six different quantifications of the intervention effect were computed. First, an immediate effect was computed, as defined in piecewise regression (Center et al., 1985–1986) and by extension in multilevel models (Van den Noortgate & Onghena, 2008). This immediate effect represents a comparison, for the first intervention-phase measurement occasion, between the extrapolated baseline trend and the fitted intervention-phase trend. Second, an average effect was computed, as defined in the generalized least squares proposal by Swaminathan et al. (2014). This average effect ( δ AB ) is based on the expression by Rogosa (1980), initially proposed for computing an overall effect in the context of the analysis of covariance when the regression slopes were not parallel. The specific expressions are (1) for the baseline data, \( {y}_t^A={\beta}_0^A+{\beta}_1^At+{e}_t \), where t = 1, 2, . . ., n A ; (2) for the intervention-phase data, \( {y}_t^B={\beta}_0^B+{\beta}_1^Bt+{e}_t \), where t = n A + 1, n A + 2, . . ., n A + n B ; and (3) \( {\delta}_{AB}=\left({\beta}_0^A-{\beta}_0^B\right)+\left({\beta}_1^A-{\beta}_1^B\right)\frac{2{n}_A+{n}_B+1}{2} \). Additionally, four versions of the MPD were computed: (a) one estimating the baseline trend line using the Theil–Sen estimator, with no limitation of the extrapolation and no correction for impossible forecasts; (b) MPD incorporating \( {\widehat{n}}_B \) for limiting the extrapolation [MPD Limited]; (c) MPD incorporating \( {\widehat{n}}_B \) and using flattening to correct impossible forecasts [MPD Limited Flat]; and (d) MPD incorporating \( {\widehat{n}}_B \) and using damping to correct impossible forecasts [MPD Limited Damping]. Finally, we obtained two additional pieces of information: the percentage of iterations in which \( {\widehat{n}}_B<1 \) (due to MASE being greater than 1) and the quartiles (plus minimum and maximum) corresponding to \( {\widehat{n}}_B \) for each experimental condition.
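As an illustration of the second quantification, the following R sketch fits OLS lines separately to each phase and combines the intercept and slope differences using expression (3) as written above; the data are hypothetical.

```r
# Sketch of the average effect delta_AB: separate OLS fits per phase, combined
# according to expression (3) above. Hypothetical AB data.
delta_AB <- function(y_A, y_B) {
  n_A <- length(y_A); n_B <- length(y_B)
  t_A <- seq_len(n_A)
  t_B <- n_A + seq_len(n_B)                 # intervention occasions n_A + 1, ...
  cA  <- coef(lm(y_A ~ t_A))                # intercept and slope, phase A
  cB  <- coef(lm(y_B ~ t_B))                # intercept and slope, phase B
  unname((cA[1] - cB[1]) + (cA[2] - cB[2]) * (2 * n_A + n_B + 1) / 2)
}

delta_AB(y_A = c(48, 52, 47, 50, 51), y_B = c(35, 33, 34, 30, 31))
```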
The results of the simulation are presented in Tables 5 , 6 , and 7 , for phase lengths of five, seven, and ten measurements, respectively. When there is an intervention effect ( β 2 = − 0.4) but no general trend ( β 1 = 0), all quantifications lead to very similar results, which are also very similar to the expected overall difference of 16.5. The most noteworthy result for these conditions is that, when there is autocorrelation, for phase lengths of seven and ten data points, the naïve method is more frequently a better model for the baseline data than the Theil–Sen trend (e.g., 17.51% for autocorrelated data vs. 6.61% for independent data when n A = n B = 10). This is logical because, according to the naïve method each data point is predicted from the previous one, and positive first-order autocorrelation entails that adjacent values are more similar to each other than would be expected by chance.
When there is a general trend and n A = n B = 5 (Table 5 ), the floor effect means that only the immediate effect remains favorable for the intervention (i.e., lower values for the dependent variable in the intervention phase). In contrast, a comparison between the baseline extrapolation and the treatment data leads to overall quantifications ( δ AB and MPD) suggesting deterioration. This is because of the impossible (negative) predicted values. The other versions of MPD entail quantifications that are less “overall” (i.e., based on \( {\widehat{n}}_B<{n}_B \) forecasts), and the MPD version that both limits extrapolation and uses damping to avoid impossible projections is the one that leads to values most similar to the immediate effect.
For conditions with n A = n B = 7 (Table 6 ), the results and the comments are equivalent. The only difference is that for a general trend expressed as β 1 = − 0.2, the baseline “spontaneous” reduction is already large enough to reach the floor values, and thus even the immediate effect is unfavorable for the intervention. The results for n A = n B = 10 (Table 7 ) are similar. For n A = n B = 10, we added another condition in which the general trend was not so pronounced (i.e., β 1 = − 0.1) as to lead to a floor effect already during the baseline. For these conditions, the results are similar to the ones for n A = n B = 5 and β 1 = − 0.2.
In summary, when there is a change in level in the absence of a general trend, the proposals for limiting the extrapolation and avoiding impossible forecasts do not affect the quantification of an overall effect. Additionally, in situations in which impossible forecasts would be obtained, these proposals lead to quantifications that better represent the data pattern. We consider that for data patterns in which the floor is reached soon after introducing the intervention, an immediate effect and subsequent values at the floor level (e.g., as quantified by the percentage zero data; Scotti, Evans, Meyer, & Walker, 1991 ) should be considered sufficient evidence (if they are replicated) for an intervention effect. That is, we consider that such quantifications would be a more appropriate evaluation of the data pattern than an overall quantification, such as δ AB and MPD in absence of the proposals. Thus, we consider the proposals to be useful. Still, the specific quantifications obtained when the proposals are applied to MPD should not be considered perfect, because they will depend on the extent to which the observed data pattern matches the expected data pattern (e.g., whether a spontaneous improvement is expected, whether an immediate effect is expected) and on the type of quantification preferred (e.g., a raw difference as in MPD, a percentage change such as the one that could be obtained from the log response ratio [Pustejovsky, 2018b ], or a difference in standard deviations, such as the BC-SMD [Shadish et al., 2014 ]).
In terms of the \( {\widehat{n}}_B \) values obtained, Tables 5 , 6 , and 7 show that most typically (i.e., for the central 50%), extrapolations were considered justified for two to four measurement occasions into the intervention phase. This is well-aligned with the idea of an immediate effect consisting of the first three intervention-phase measurement occasions (Kratochwill et al., 2010 ) and is broader than the immediate effect defined in piecewise regressions and multilevel models (focusing only on the first measurement occasion). Such a short extrapolation would avoid the untenable assumption that the baseline trend would continue unabated for too long. Moreover, damping the baseline trend helps identify a more appropriate reference against which to compare the actual intervention-phase data points.
General discussion
Extrapolating baseline trend: issues, breadth of these issues, and tentative solutions.
Several single-case analytical techniques entail extrapolating baseline trend—for instance, the Allison and Gorman ( 1993 ) regression model, the nonregression technique called mean phase difference (Manolov & Solanas, 2013 ), and the nonoverlap index called the percentage of data points exceeding the median trend (Wolery et al., 2010 ). An initial aspect to take into account is that these three techniques estimate the intercept and slope of the trend line in three different ways. When a trend line is fitted to the baseline data, the amount of fit of the trend line to the data has to be considered, plus whether it is reasonable to consider that the trend will continue unchanged and whether extrapolating the trend would lead to predicted values that are impossible in real data. The latter issue appeared to be present in SCED data published in 2015, given that in approximately 10% of the studies reviewed, forecasts above the maximal possible value were obtained, and in 40% the forecasts were below the minimal possible value, for all five trend line fitting procedures investigated. The proposals we make here take into account the length of the baseline phase, the amount of fit of the trend line to the data, and the need to avoid meaningless comparisons between actual values and impossible predicted values. Moreover, limiting the extrapolation emphasizes the idea that a linear trend is only a model that serves as an approximation of how the data would behave if the baseline continued for a limited amount of time, rather than assuming that a linear trend is necessarily the correct model for the progression of the measurements in the absence of an intervention.
The examples provided with real data and the simulation results from applying the proposals to the MPD illustrate how the present proposal for correcting out-of-bounds forecasts avoids both excessively low and excessively high effect estimates when the bounds of the measurement units are considered. Moreover, the quantitative criterion for deciding how far out to extrapolate baseline trend serves as an objective rule for not extrapolating a trend line into the intervention phase when the baseline data are not represented well by such a line.
Recommendations for applied researchers
In relation to our proposals, we recommend both limiting the extrapolation and allowing for damping the trend. Limiting the extrapolation leads to a quantification that combines two criteria mentioned in the What Works Clearinghouse Standards (Kratochwill et al., 2010), immediate change and comparison of the projected versus observed data pattern, whereas damping the trend avoids completely meaningless comparisons. Moreover, in relation to the MPD, we advocate defining its intercept according to the smallest MASE value. In relation to statistical analysis in general, we do not recommend that applied researchers necessarily always use analytical techniques that extrapolate a baseline trend (e.g., MPD, the generalized least squares analysis by Swaminathan et al., 2014, or the Allison & Gorman, 1993, OLS model). Rather, we caution regarding the use of such techniques for certain data sets and propose a modification of MPD that avoids obtaining quantifications of effects that are based on unreasonable comparisons. Additionally, we caution researchers that, when a trend line is fitted to the data, it is important, in order to improve transparency, to report the technique used for estimating the intercept and slope of this trend line, given that several such techniques are available (Manolov, 2018). Finally, for cases in which the data show substantial variability and are not represented well by a straight line, or even by a curved line, we recommend applying the nonoverlap of all pairs, which makes use of all the data and not only of the first \( {\widehat{n}}_B \) measurements of the intervention-phase data.
Beyond the present focus on trend, some desirable features of analytical techniques have been suggested by Wolery et al. ( 2010 ) and expanded on by Manolov, Gast, Perdices, and Evans ( 2014 ). Readers interested in broader reviews of analytical techniques can also consult Gage and Lewis ( 2013 ) and Manolov and Moeyaert ( 2017 ). In general, we echo the recommendation to use quantitative analysis together with visual analysis (e.g., Campbell & Herzinger, 2010 ; Harrington & Velicer, 2015 ; Houle, 2009 ), and we further reflect on this point in the following section.
Validating the quantifications and enhancing their interpretation: Software developments
Visual analysis is regarded as a tool for verifying the meaningfulness of the quantitative results yielded by statistical techniques (Parker et al., 2006). In that sense, visually representing the trend line fitted and extrapolated, or the transformed data after baseline trend has been removed, is crucial. Accordingly, recent efforts have focused on using visual analysis to help choose the appropriate multilevel model (Baek, Petit-Bois, Van Den Noortgate, Beretvas, & Ferron, 2016). To make more transparent what exactly is being done with the data to obtain the quantifications, the output of the modified MPD is both graphical and numerical (see http://manolov.shinyapps.io/MPDExtrapolation , which allows for choosing whether to limit the extrapolation of the baseline trend and whether to use damping or winsorizing in the case of out-of-bounds forecasts). For MPD, in which the quantification is the average difference between the extrapolated baseline trend and the actual intervention-phase measurements, the graphical output clearly indicates which are the forecast values (plus whether the trend is maintained or damped) and how far away the baseline trend is extrapolated. Moreover, the color of the arrows from predicted to actual intervention-phase values used in the figures of this article indicates, for each comparison, whether the difference was in the desired direction (green) or not (red). In summary, the graphical representation of the comparisons performed in MPD makes it easier to use visual analysis to validate and help interpret the information obtained.
Limitations in relation to the alternatives for extrapolating linear baseline trend for forecasting
In the present study, we discussed extrapolating linear trends because the MPD, our focal analytical technique, fits a straight line to the baseline data before extrapolating it. Nevertheless, it would be possible to fit a nonlinear (e.g., logistic) model to the baseline data (Shadish et al., 2016 ). Furthermore, there are many other alternative procedures for estimating and extrapolating trend, especially in the context of time-series analysis.
Among univariate time-series procedures for forecasting, Chatfield ( 2000 ) distinguished between formal statistical models, that is, mathematical representations of reality (e.g., ARIMA; state space; growth curve models, such as the logistic and Gompertz; nonlinear models, including artificial neural networks), and ad hoc methods, that is, formulas for computing forecasts. Among the ad hoc methods, the most well-known and frequently used options are exponential smoothing (which can be expressed within the framework of state space models; De Gooijer & Hyndman, 2006 ), the related Holt linear-trend procedure, and the Holt–Winters procedure, which includes a seasonal component. As we mentioned previously, the idea of damping a trend is borrowed from the Holt linear-trend procedure, on the basis of the work of E. S. Gardner and McKenzie ( 1985 ).
Regarding ARIMA, according to the Box–Jenkins approach already introduced in the single-case design context, the aim is to identify the best parsimonious model by means of three steps: model identification, parameter estimation, and diagnostic checking. An appropriate model would then be used for forecasting. The difficulties of correctly identifying the ARIMA model for single-case data, via the analysis of autocorrelations and partial autocorrelations, have been documented (Velicer & Harrop, 1983 ), which has led to proposals of a reduced set of plausible models that avoid this initial step (Velicer & McDonald, 1984 ). The simulation evidence available for these models (Harrop & Velicer, 1985 ) refers to data series of 40 measurements (i.e., 20 per phase), which is longer than typical single-case baselines (almost half of the initial baselines contained four or fewer data points) or series lengths (a median of 20, according to the review by Shadish & Sullivan, 2011 , with most series containing fewer than 40 measurements). Moreover, to the best of our knowledge, the possibility of obtaining out-of-bounds predicted values has not been discussed, nor have tentative solutions been proposed for this issue.
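For orientation only, the snippet below sketches the Box–Jenkins workflow with automatic model identification via the forecast package; the baseline series is hypothetical and, as discussed above, such identification is unreliable for series as short as typical single-case baselines.

```r
# Illustrative only: automatic ARIMA identification, estimation, and forecasting.
# Not a recommendation for short single-case baselines.
library(forecast)

yA <- ts(c(8, 7, 9, 6, 7, 5, 6, 4))   # hypothetical baseline series
fit <- auto.arima(yA)                 # model identification + parameter estimation
fc  <- forecast(fit, h = 10)          # forecasts for 10 intervention-phase occasions
summary(fit)
plot(fc)
```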
Holt’s ( 2004 ) linear-trend procedure is another option for forecasting that is available in textbooks (e.g., Mendenhall & Sincich, 2012 ), and therefore is potentially accessible to applied researchers. Holt’s model is an extension of simple exponential smoothing including a linear trend. This procedure can be extended further by including a damping parameter (E. S. Gardner & McKenzie, 1985 ) that indicates how much the slope of the trend is reduced in subsequent time periods. The latter model is called the additive damped trend model , and according to the review by E. S. Gardner ( 2006 ), it is the model of choice when using exponential smoothing. The main issue with the additive damped trend model is that it requires estimating three parameters—one smoothing parameter for the level, one smoothing parameter for the trend, and the damping parameter—and it is also recommended to estimate the initial level and trend via optimization. It is unclear whether reliable estimates can be obtained with the usually short baseline phases in single-case data. We performed a small-scale check using the R code by Hyndman and Athanasopoulos ( 2013 , chap. 7.4). For instance, for the Ciullo et al. ( 2015 ) data with n A ≤ 4 and the multiple-baseline data by Eilers and Hayes ( 2015 ) with n A equal to 3, 5, and 8, the number of measurements was not sufficient to estimate the damping parameter, and thus only a linear trend was extrapolated. The same was the case for the Allen et al. ( 2015 ) data for n A = 5 and 9, whereas for n A = 16, it was possible to use the additive damped trend model. Our check suggested that the minimum baseline length required for applying the additive damped trend model is 10, which is greater than (a) the value found in at least 50% of the data sets reviewed by Shadish and Sullivan ( 2011 ); (b) the modal value of six baseline data points reported in Smith’s ( 2012 ) review; and (c) the average baseline length in the Solomon ( 2014 ) review.
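A hedged sketch of the kind of small-scale check just described is shown below, using the forecast package that accompanies Hyndman and Athanasopoulos's text; the baseline values are hypothetical, and with very short baselines the damped model may not be estimable, as our check indicated.

```r
# Sketch: Holt's linear-trend forecasts with and without damping for a
# hypothetical baseline of 16 measurements.
library(forecast)

yA <- ts(c(12, 10, 11, 9, 8, 9, 7, 6, 7, 5, 4, 5, 3, 4, 2, 3))
fit_holt   <- holt(yA, h = 10)                  # linear trend extrapolated as-is
fit_damped <- holt(yA, h = 10, damped = TRUE)   # additive damped trend
rbind(linear = fit_holt$mean, damped = fit_damped$mean)
```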
Therefore, the reader should be aware that there are alternatives for estimating and extrapolating trend for forecasting. However, to the best of our knowledge, none of these alternatives is directly applicable to single-case data without any issues, or without the need to explore which model or method is more appropriate, and in which circumstances, questions that do not have clear answers even for the usually longer time-series data (Chatfield, 2000 ).
Future research
One line of future research could be to test the proposals via a broader simulation, such as one that applies them in combination with different analytical techniques: for instance, within the MPD, before computing δAB in the context of regression analysis, or before computing the percentage of data points exceeding the median trend. Another line of research could focus on a comparison between the version of MPD incorporating the proposals and the recently developed generalized logistic model of Verboon and Peters ( 2018 ). Such a comparison could entail a field test and a survey among applied researchers on the perceived ease of use and the utility of the information provided.
Author note
The authors thank Patrick Onghena for his feedback on previous versions of this article.
In contrast, in the meta-analysis by Chiu and Roberts ( 2018 ), for outcomes for which there was no true maximum, the largest value actually obtained was treated as a maximum, before converting the values into percentages. If we had followed the same procedure, we would have found a greater frequency of impossibly high forecasts.
The references in this paragraph correspond to the studies included in the review and are available in Appendix A at our Open Science Framework site: https://osf.io/js3hk/ .
Note that Tarlow ( 2017 ) identified several issues with Tau-U and proposed the “baseline-corrected Tau,” which, however, corrects the data using the linear trend as estimated with the Theil–Sen estimator, and thus implicitly assumes that a straight line is a good representation of the baseline data.
It could be argued that having three different ways of defining the intercept available (i.e., in the Shiny application) may prompt applied researchers to choose the definition that favors their hypotheses or expectations. Nevertheless, we advocate using the definition of the intercept that provides a better fit to the data, both visually and quantitatively, as assessed via the MASE.
Following Tate and Perdices ( 2018 ), we use the term “tier” to refer to each AB comparison within a multiple-baseline design. Therefore, “tiers” could refer to different individuals, if the multiple-baseline design entails a staggered replication across participants, or to different behaviors or settings, if there is replication across behaviors or settings. Additionally, the term “tier” enables us to avoid confusion with the term “baseline,” which denotes only the A phase of the AB comparison.
Allen, K. D., Vatland, C., Bowen, S. L., & Burke, R. V. (2015). Parent-produced video self-modeling to improve independence in an adolescent with intellectual developmental disorder and an autism spectrum disorder: A controlled case study. Behavior Modification , 39 , 542–556.
Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for meta-analysis: The case of the single case. Behaviour Research and Therapy , 31 , 621−631.
Arnau, J., & Bono, R. (1998). Short time series analysis: C statistic vs. Edgington model. Quality & Quantity , 32 , 63–75.
Austin, J. E., & Tiger, J. H. (2015). Providing alternative reinforcers to facilitate tolerance to delayed reinforcement following functional communication training. Journal of Applied Behavior Analysis , 48 , 663−668.
Baek, E. K., Petit-Bois, M., Van Den Noortgate, W., Beretvas, S. N., & Ferron, J. M. (2016). Using visual analysis to evaluate and refine multilevel models of single-case studies. Journal of Special Education , 50 , 18–26.
Billah, B., King, M. L., Snyder, R. D., & Koehler, A. B. (2006). Exponential smoothing model selection for forecasting. International Journal of Forecasting , 22 , 239–247.
Brandt, J. A. A., Dozier, C. L., Juanico, J. F., Laudont, C. L., & Mick, B. R. (2015). The value of choice as a reinforcer for typically developing children. Journal of Applied Behavior Analysis , 48 , 344−362.
Bringmann, L. F., Hamaker, E. L., Vigo, D. E., Aubert, A., Borsboom, D., & Tuerlinckx, F. (2017). Changing dynamics: Time-varying autoregressive models using generalized additive modeling. Psychological Methods , 22 , 409–425. https://doi.org/10.1037/met0000085
Campbell, J. M., & Herzinger, C. V. (2010). Statistics and single subject research methodology. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 417–453). London: Routledge.
Cannella-Malone, H. I., Sabielny, L. M., & Tullis, C. A. (2015). Using eye gaze to identify reinforcers for individuals with severe multiple disabilities. Journal of Applied Behavior Analysis , 48 , 680–684. https://doi.org/10.1002/jaba.231
Center, B. A., Skiba, R. J., & Casey, A. (1985–1986). A methodology for the quantitative synthesis of intra-subject design research. Journal of Special Education , 19 , 387–400.
Chatfield, C. (2000). Time-series forecasting. London: Chapman & Hall/CRC.
Cheng, Y., Huang, C. L., & Yang, C. S. (2015). Using a 3D immersive virtual environment system to enhance social understanding and social skills for children with autism spectrum disorders. Focus on Autism and Other Developmental Disabilities , 30 , 222−236.
Chiu, M. M., & Roberts, C. A. (2018). Improved analyses of single cases: Dynamic multilevel analysis. Developmental Neurorehabilitation , 21 , 253–265.
Ciullo, S., Falcomata, T. S., Pfannenstiel, K., & Billingsley, G. (2015). Improving learning with science and social studies text using computer-based concept maps for students with disabilities. Behavior Modification , 39 , 117–135.
De Gooijer, J. G., & Hyndman, R. J. (2006). 25 years of time series forecasting. International Journal of Forecasting , 22 , 443–473.
Eilers, H. J., & Hayes, S. C. (2015). Exposure and response prevention therapy with cognitive defusion exercises to reduce repetitive and restrictive behaviors displayed by children with autism spectrum disorder. Research in Autism Spectrum Disorders , 19 , 18–31.
Fahmie, T. A., Iwata, B. A., & Jann, K. E. (2015). Comparison of edible and leisure reinforcers. Journal of Applied Behavior Analysis , 48 , 331−343.
Faith, M. S., Allison, D. B., & Gorman, D. B. (1997). Meta-analysis of single-case research. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 245–277). Mahwah: Erlbaum.
Ferron, J. M., Bell, B. A., Hess, M. R., Rendina-Gobioff, G., & Hibbard, S. T. (2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods , 41 , 372–384. https://doi.org/10.3758/BRM.41.2.372
Fisher, W. W., Kelley, M. E., & Lomas, J. E. (2003). Visual aids and structured criteria for improving visual inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis , 36 , 387–406.
Fiske, K. E., Isenhower, R. W., Bamond, M. J., Delmolino, L., Sloman, K. N., & LaRue, R. H. (2015). Assessing the value of token reinforcement for individuals with autism. Journal of Applied Behavior Analysis , 48 , 448−453.
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). London: Sage.
Gage, N. A., & Lewis, T. J. (2013). Analysis of effect for single-case design research. Journal of Applied Sport Psychology , 25 , 46–60.
Gardner, E. S., Jr. (2006). Exponential smoothing: The state of the art—Part II. International Journal of Forecasting , 22 , 637–666.
Gardner, E. S., Jr., & McKenzie, E. (1985). Forecasting trends in time series. Management Science , 31 , 1237–1246.
Gardner, S. J., & Wolfe, P. S. (2015). Teaching students with developmental disabilities daily living skills using point-of-view modeling plus video prompting with error correction. Focus on Autism and Other Developmental Disabilities , 30 , 195−207.
Harrington, M., & Velicer, W. F. (2015). Comparing visual and statistical analysis in single-case studies using published studies. Multivariate Behavioral Research , 50 , 162–183.
Harrop, J. W., & Velicer, W. F. (1985). A comparison of alternative approaches to the analysis of interrupted time-series. Multivariate Behavioral Research , 20 , 27–44.
Hine, J. F., Ardoin, S. P., & Foster, T. E. (2015). Decreasing transition times in elementary school classrooms: Using computer-assisted instruction to automate intervention components. Journal of Applied Behavior Analysis , 48 , 495–510. https://doi.org/10.1002/jaba.233
Holt, C. C. (2004). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting , 20 , 5–10.
Horner, R. H., Swaminathan, H., Sugai, G., & Smolkowski, K. (2012). Considerations for the systematic analysis and use of single-case research. Education and Treatment of Children , 35 , 269–290.
Houle, T. T. (2009). Statistical analyses for single-case experimental designs. In D. H. Barlow, M. K. Nock, & M. Hersen (Eds.), Single case experimental designs: Strategies for studying behavior change (3rd ed., pp. 271–305). Boston: Pearson.
Huitema, B. E., McKean, J. W., & McKnight, S. (1999). Autocorrelation effects on least-squares intervention analysis of short time series. Educational and Psychological Measurement , 59 , 767–786.
Hyndman, R. J., & Athanasopoulos, G. (2013). Forecasting: Principles and practice. Retrieved March 24, 2018, from https://www.otexts.org/fpp/7/4
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting , 22 , 679–688.
Knight, V. F., Wood, C. L., Spooner, F., Browder, D. M., & O’Brien, C. P. (2015). An exploratory study using science eTexts with students with Autism Spectrum Disorder. Focus on Autism and Other Developmental Disabilities , 30 , 86−99.
Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single case designs technical documentation. In What Works Clearinghouse: Procedures and standards handbook (Version 1.0). Available at http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_scd.pdf
Kratochwill, T. R., Levin, J. R., Horner, R. H., & Swoboda, C. M. (2014). Visual analysis of single-case intervention research: Conceptual and methodological issues. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 91–125). Washington, DC: American Psychological Association.
Lane, J. D., & Gast, D. L. (2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation , 24 , 445–463.
Ledbetter-Cho, K., Lang, R., Davenport, K., Moore, M., Lee, A., Howell, A., . . . O’Reilly, M. (2015). Effects of script training on the peer-to-peer communication of children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 785−799.
Manolov, R. (2018). Linear trend in single-case visual and quantitative analyses. Behavior Modification , 42 , 684–706.
Manolov, R., Gast, D. L., Perdices, M., & Evans, J. J. (2014). Single-case experimental designs: Reflections on conduct and analysis. Neuropsychological Rehabilitation , 24 , 634−660. https://doi.org/10.1080/09602011.2014.903199
Manolov, R., & Moeyaert, M. (2017). Recommendations for choosing single-case data analytical techniques. Behavior Therapy , 48 , 97−114.
Manolov, R., & Rochat, L. (2015). Further developments in summarising and meta-analysing single-case data: An illustration with neurobehavioural interventions in acquired brain injury. Neuropsychological Rehabilitation , 25 , 637−662.
Manolov, R., & Solanas, A. (2009). Percentage of nonoverlapping corrected data. Behavior Research Methods , 41 , 1262–1271. https://doi.org/10.3758/BRM.41.4.1262
Manolov, R., & Solanas, A. (2013). A comparison of mean phase difference and generalized least squares for analyzing single-case data. Journal of School Psychology , 51 , 201−215.
Marso, D., & Shadish, W. R. (2015). Software for meta-analysis of single-case design: DHPS macro . Retrieved January 22, 2017, from http://faculty.ucmerced.edu/wshadish/software/software-meta-analysis-single-case-design
Matyas, T. A., & Greenwood, K. M. (1997). Serial dependency in single-case time series. In R. D. Franklin, D. B. Allison, & B. S. Gorman (Eds.), Design and analysis of single-case research (pp. 215–243). Mahwah: Erlbaum.
Mendenhall, W., & Sincich, T. (2012). A second course in statistics: Regression analysis (7th ed.). Boston: Prentice Hall.
Mercer, S. H., & Sterling, H. E. (2012). The impact of baseline trend control on visual analysis of single-case data. Journal of School Psychology , 50 , 403–419.
Parker, R. I., Cryer, J., & Byrns, G. (2006). Controlling baseline trend in single-case research. School Psychology Quarterly , 21 , 418−443.
Parker, R. I., & Vannest, K. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy , 40 , 357–367. https://doi.org/10.1016/j.beth.2008.10.006
Parker, R. I., Vannest, K. J., Davis, J. L., & Sauber, S. B. (2011). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy , 42 , 284−299. https://doi.org/10.1016/j.beth.2010.08.006
Pustejovsky, J. E. (2015). Measurement-comparable effect sizes for single-case studies of free-operant behavior. Psychological Methods , 20 , 342−359.
Pustejovsky, J. E. (2018a). Procedural sensitivities of effect sizes for single-case designs with directly observed behavioral outcome measures. Psychological Methods . Advance online publication. https://doi.org/10.1037/met0000179
Pustejovsky, J. E. (2018b). Using response ratios for meta-analyzing single-case designs with behavioral outcomes. Journal of School Psychology , 68 , 99–112.
Pustejovsky, J. E., Hedges, L. V., & Shadish, W. R. (2014). Design-comparable effect sizes in multiple baseline designs: A general modeling framework. Journal of Educational and Behavioral Statistics , 39 , 368–393.
Rindskopf, D. M., & Ferron, J. M. (2014). Using multilevel models to analyze single-case design data. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 221−246). Washington, DC: American Psychological Association.
Rispoli, M., Ninci, J., Burke, M. D., Zaini, S., Hatton, H., & Sanchez, L. (2015). Evaluating the accuracy of results for teacher implemented trial-based functional analyses. Behavior Modification , 39 , 627−653.
Rogosa, D. (1980). Comparing nonparallel regression lines. Psychological Bulletin , 88 , 307–321. https://doi.org/10.1037/0033-2909.88.2.307
Saini, V., Greer, B. D., & Fisher, W. W. (2015). Clarifying inconclusive functional analysis results: Assessment and treatment of automatically reinforced aggression. Journal of Applied Behavior Analysis , 48 , 315–330. https://doi.org/10.1002/jaba.203
Scotti, J. R., Evans, I. M., Meyer, L. H., & Walker, P. (1991). A meta-analysis of intervention research with problem behavior: Treatment validity and standards of practice. American Journal on Mental Retardation , 96 , 233–256.
Scruggs, T. E., & Mastropieri, M. A. (1998). Summarizing single-subject research: Issues and applications. Behavior Modification , 22 , 221–242.
Shadish, W. R., Hedges, L. V., & Pustejovsky, J. E. (2014). Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications. Journal of School Psychology , 52 , 123–147.
Shadish, W. R., Kyse, E. N., & Rindskopf, D. M. (2013). Analyzing data from single-case designs using multilevel models: New applications and some agenda items for future research. Psychological Methods , 18 , 385–405. https://doi.org/10.1037/a0032964
Shadish, W. R., Rindskopf, D. M., & Boyajian, J. G. (2016). Single-case experimental design yielded an effect estimate corresponding to a randomized controlled trial. Journal of Clinical Epidemiology , 76 , 82–88.
Shadish, W. R., Rindskopf, D. M., Hedges, L. V., & Sullivan, K. J. (2013). Bayesian estimates of autocorrelations in single-case designs. Behavior Research Methods , 45 , 813–821.
Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods , 43 , 971−980. https://doi.org/10.3758/s13428-011-0111-y
Siegel, E. B., & Lien, S. E. (2015). Using photographs of contrasting contextual complexity to support classroom transitions for children with Autism Spectrum Disorders. Focus on Autism and Other Developmental Disabilities , 30 , 100−114.
Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods , 17 , 510–550. https://doi.org/10.1037/a0029312
Solanas, A., Manolov, R., & Onghena, P. (2010). Estimating slope and level change in N = 1 designs. Behavior Modification , 34 , 195−218.
Solomon, B. G. (2014). Violations of assumptions in school-based single-case data: Implications for the selection and interpretation of effect sizes. Behavior Modification , 38 , 477−496.
Stewart, K. K., Carr, J. E., Brandt, C. W., & McHenry, M. M. (2007). An evaluation of the conservative dual-criterion method for teaching university students to visually inspect AB-design graphs. Journal of Applied Behavior Analysis , 40 , 713−718.
Sullivan, K. J., Shadish, W. R., & Steiner, P. M. (2015). An introduction to modeling longitudinal data with generalized additive models: Applications to single-case designs. Psychological Methods , 20 , 26−42. https://doi.org/10.1037/met0000020
Swaminathan, H., Rogers, H. J., Horner, R., Sugai, G., & Smolkowski, K. (2014). Regression models for the analysis of single case designs. Neuropsychological Rehabilitation , 24 , 554−571.
Swan, D. M., & Pustejovsky, J. E. (2018). A gradual effects model for single-case designs. Multivariate Behavioral Research , 53 , 574–593. https://doi.org/10.1080/00273171.2018.1466681
Tarlow, K. (2017). An improved rank correlation effect size statistic for single-case designs: Baseline corrected Tau. Behavior Modification , 41 , 427–467.
Tate, R. L., & Perdices, M. (2018). Single-case experimental designs for clinical research and neurorehabilitation settings: Planning, conduct, analysis and reporting. London: Routledge.
Tate, R. L., Perdices, M., Rosenkoetter, U., Wakima, D., Godbee, K., Togher, L., & McDonald, S. (2013). Revision of a method quality rating scale for single-case experimental designs and n -of-1 trials: The 15-item Risk of Bias in N -of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation , 23 , 619–638. https://doi.org/10.1080/09602011.2013.824383
Van den Noortgate, W., & Onghena, P. (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention , 2 , 142–151.
Vannest, K. J., Parker, R. I., Davis, J. L., Soares, D. A., & Smith, S. L. (2012). The Theil–Sen slope for high-stakes decisions from progress monitoring. Behavioral Disorders , 37 , 271–280.
Velicer, W. F., & Harrop, J. (1983). The reliability and accuracy of time series model identification. Evaluation Review , 7 , 551–560.
Velicer, W. F., & McDonald, R. P. (1984). Time series analysis without model identification. Multivariate Behavioral Research , 19 , 33–47.
Verboon, P., & Peters, G. J. (2018). Applying the generalized logistic model in single case designs: Modeling treatment-induced shifts. Behavior Modification . Advance online publication. https://doi.org/10.1177/0145445518791255
White, D. M., Rusch, F. R., Kazdin, A. E., & Hartmann, D. P. (1989). Applications of meta-analysis in individual subject research. Behavioral Assessment , 11 , 281–296.
Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., van Aert, R. C., & Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p -hacking. Frontiers in Psychology , 7 , 1832. https://doi.org/10.3389/fpsyg.2016.01832
Wolery, M., Busick, M., Reichow, B., & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing single-subject data. Journal of Special Education , 44 , 18–29.
Wolfe, K., & Slocum, T. A. (2015). A comparison of two approaches to training visual analysis of AB graphs. Journal of Applied Behavior Analysis , 48 , 472–477. https://doi.org/10.1002/jaba.212
Young, N. D., & Daly, E. J., III. (2016). An evaluation of prompting and reinforcement for training visual analysis skills. Journal of Behavioral Education , 25 , 95–119.
Author information
Authors and affiliations.
Department of Social Psychology and Quantitative Psychology, Faculty of Psychology, University of Barcelona, Barcelona, Spain
Rumen Manolov & Antonio Solanas
Department of Operations, Innovation and Data Sciences, ESADE Business School, Ramon Llull University, Barcelona, Spain
Rumen Manolov & Vicenta Sierra
Corresponding author
Correspondence to Rumen Manolov .
Additional information
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References to the studies included in the present review of single-case research published in 2015 in four journals: Journal of Applied Behavior Analysis , Behavior Modification , Research in Autism Spectrum Disorders , and Focus on Autism and Other Developmental Disabilities .
Allen, K. D., Vatland, C., Bowen, S. L., & Burke, R. V. (2015). An evaluation of parent-produced video self-modeling to improve independence in an adolescent with intellectual developmental disorder and an autism spectrum disorder: A controlled case study. Behavior Modification , 39 , 542–556.
Austin, J. E., & Tiger, J. H. (2015). Providing alternative reinforcers to facilitate tolerance to delayed reinforcement following functional communication training. Journal of Applied Behavior Analysis , 48 , 663–668.
Austin, J. L., Groves, E. A., Reynish, L. C., & Francis, L. L. (2015). Validating trial-based functional analyses in mainstream primary school classrooms. Journal of Applied Behavior Analysis , 48 , 274–288.
Boudreau, B. A., Vladescu, J. C., Kodak, T. M., Argott, P. J., & Kisamore, A. N. (2015). A comparison of differential reinforcement procedures with children with autism. Journal of Applied Behavior Analysis , 48 , 918–923.
Brandt, J. A. A., Dozier, C. L., Juanico, J. F., Laudont, C. L., & Mick, B. R. (2015). The value of choice as a reinforcer for typically developing children. Journal of Applied Behavior Analysis , 48 , 344–362.
Cannella-Malone, H. I., Sabielny, L. M., & Tullis, C. A. (2015). Using eye gaze to identify reinforcers for individuals with severe multiple disabilities. Journal of Applied Behavior Analysis , 48 , 680–684.
Carroll, R. A., Joachim, B. T., St Peter, C. C., & Robinson, N. (2015). A comparison of error-correction procedures on skill acquisition during discrete-trial instruction. Journal of Applied Behavior Analysis , 48 , 257–273.
Cheng, Y., Huang, C. L., & Yang, C. S. (2015). Using a 3D immersive virtual environment system to enhance social understanding and social skills for children with autism spectrum disorders. Focus on Autism and Other Developmental Disabilities , 30 , 222–236.
Ciccone, F. J., Graff, R. B., & Ahearn, W. H. (2015). Increasing the efficiency of paired-stimulus preference assessments by identifying categories of preference. Journal of Applied Behavior Analysis , 48 , 221–226.
Ciullo, S., Falcomata, T. S., Pfannenstiel, K., & Billingsley, G. (2014). Improving learning with science and social studies text using computer-based concept maps for students with disabilities. Behavior Modification , 39 , 117–135.
Daar, J. H., Negrelli, S., & Dixon, M. R. (2015). Derived emergence of WH question–answers in children with autism. Research in Autism Spectrum Disorders , 19 , 59–71.
DeQuinzio, J. A., & Taylor, B. A. (2015). Teaching children with autism to discriminate the reinforced and nonreinforced responses of others: Implications for observational learning. Journal of Applied Behavior Analysis , 48 , 38–51.
Derosa, N. M., Fisher, W. W., & Steege, M. W. (2015). An evaluation of time in establishing operation on the effectiveness of functional communication training. Journal of Applied Behavior Analysis , 48 , 115–130.
Ditzian, K., Wilder, D. A., King, A., & Tanz, J. (2015). An evaluation of the performance diagnostic checklist–human services to assess an employee performance problem in a center-based autism treatment facility. Journal of Applied Behavior Analysis , 48 , 199–203.
Donaldson, J. M., Wiskow, K. M., & Soto, P. L. (2015). Immediate and distal effects of the good behavior game. Journal of Applied Behavior Analysis , 48 , 685–689.
Downs, H. E., Miltenberger, R., Biedronski, J., & Witherspoon, L. (2015). The effects of video self-evaluation on skill acquisition with yoga postures. Journal of Applied Behavior Analysis , 48 , 930–935.
Dupuis, D. L., Lerman, D. C., Tsami, L., & Shireman, M. L. (2015). Reduction of aggression evoked by sounds using noncontingent reinforcement and time-out. Journal of Applied Behavior Analysis , 48 , 669–674.
Engstrom, E., Mudford, O. C., & Brand, D. (2015). Replication and extension of a check-in procedure to increase activity engagement among people with severe dementia. Journal of Applied Behavior Analysis , 48 , 460–465.
Fahmie, T. A., Iwata, B. A., & Jann, K. E. (2015). Comparison of edible and leisure reinforcers. Journal of Applied Behavior Analysis , 48 , 331–343.
Fichtner, C. S., & Tiger, J. H. (2015). Teaching discriminated social approaches to individuals with Angelman syndrome. Journal of Applied Behavior Analysis , 48 , 734–748.
Fisher, W. W., Greer, B. D., Fuhrman, A. M., & Querim, A. C. (2015). Using multiple schedules during functional communication training to promote rapid transfer of treatment effects. Journal of Applied Behavior Analysis , 48 , 713–733.
Fiske, K. E., Isenhower, R. W., Bamond, M. J., Delmolino, L., Sloman, K. N., & LaRue, R. H. (2015). Assessing the value of token reinforcement for individuals with autism. Journal of Applied Behavior Analysis , 48 , 448–453.
Fox, A. E., & Belding, D. L. (2015). Reducing pawing in horses using positive reinforcement. Journal of Applied Behavior Analysis , 48 , 936–940.
Frewing, T. M., Rapp, J. T., & Pastrana, S. J. (2015). Using conditional percentages during free-operant stimulus preference assessments to predict the effects of preferred items on stereotypy preliminary findings. Behavior Modification , 39 , 740–765.
Fu, S. B., Penrod, B., Fernand, J. K., Whelan, C. M., Griffith, K., & Medved, S. (2015). The effects of modeling contingencies in the treatment of food selectivity in children with autism. Behavior Modification , 39 , 771–784.
Gardner, S. J., & Wolfe, P. S. (2014). Teaching students with developmental disabilities daily living skills using point-of-view modeling plus video prompting with error correction. Focus on Autism and Other Developmental Disabilities , 30 , 195–207.
Gilroy, S. P., Lorah, E. R., Dodge, J., & Fiorello, C. (2015). Establishing deictic repertoires in autism. Research in Autism Spectrum Disorders , 19 , 82–92.
Groskreutz, M. P., Peters, A., Groskreutz, N. C., & Higbee, T. S. (2015). Increasing play-based commenting in children with autism spectrum disorder using a novel script-frame procedure. Journal of Applied Behavior Analysis , 48 , 442–447.
Haq, S. S., & Kodak, T. (2015). Evaluating the effects of massed and distributed practice on acquisition and maintenance of tacts and textual behavior with typically developing children. Journal of Applied Behavior Analysis , 48 , 85–95.
Hayes, L. B., & Van Camp, C. M. (2015). Increasing physical activity of children during school recess. Journal of Applied Behavior Analysis , 48 , 690–695.
Hine, J. F., Ardoin, S. P., & Foster, T. E. (2015). Decreasing transition times in elementary school classrooms: Using computer-assisted instruction to automate intervention components. Journal of Applied Behavior Analysis , 48 , 495–510.
Kelley, M. E., Liddon, C. J., Ribeiro, A., Greif, A. E., & Podlesnik, C. A. (2015). Basic and translational evaluation of renewal of operant responding. Journal of Applied Behavior Analysis , 48 , 390–401.
Kodak, T., Clements, A., Paden, A. R., LeBlanc, B., Mintz, J., & Toussaint, K. A. (2015). Examination of the relation between an assessment of skills and performance on auditory–visual conditional discriminations for children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 52–70.
Knight, V. F., Wood, C. L., Spooner, F., Browder, D. M., & O’Brien, C. P. (2015). An exploratory study using science eTexts with students with Autism Spectrum Disorder. Focus on Autism and Other Developmental Disabilities , 30 , 86–99.
Kuhl, S., Rudrud, E. H., Witts, B. N., & Schulze, K. A. (2015). Classroom-based interdependent group contingencies increase children’s physical activity. Journal of Applied Behavior Analysis , 48 , 602–612.
Lambert, A. M., Tingstrom, D. H., Sterling, H. E., Dufrene, B. A., & Lynne, S. (2015). Effects of tootling on classwide disruptive and appropriate behavior of upper-elementary students. Behavior Modification , 39 , 413–430.
Lambert, J. M., Bloom, S. E., Samaha, A. L., Dayton, E., & Rodewald, A. M. (2015). Serial alternative response training as intervention for target response resurgence. Journal of Applied Behavior Analysis , 48 , 765–780.
Ledbetter-Cho, K., Lang, R., Davenport, K., Moore, M., Lee, A., Howell, A., . . . O’Reilly, M. (2015). Effects of script training on the peer-to-peer communication of children with autism spectrum disorder. Journal of Applied Behavior Analysis , 48 , 785–799.
Lee, G. P., Miguel, C. F., Darcey, E. K., & Jennings, A. M. (2015). A further evaluation of the effects of listener training on derived categorization and speaker behavior in children with autism. Research in Autism Spectrum Disorders , 19 , 72–81.
Lerman, D. C., Hawkins, L., Hillman, C., Shireman, M., & Nissen, M. A. (2015). Adults with autism spectrum disorder as behavior technicians for young children with autism: Outcomes of a behavioral skills training program. Journal of Applied Behavior Analysis , 48 , 233–256.
Mechling, L. C., Ayres, K. M., Foster, A. L., & Bryant, K. J. (2014). Evaluation of generalized performance across materials when using video technology by students with autism spectrum disorder and moderate intellectual disability. Focus on Autism and Other Developmental Disabilities , 30 , 208–221.
Miller, S. A., Rodriguez, N. M., & Rourke, A. J. (2015). Do mirrors facilitate acquisition of motor imitation in children diagnosed with autism? Journal of Applied Behavior Analysis , 48 , 194–198.
Mitteer, D. R., Romani, P. W., Greer, B. D., & Fisher, W. W. (2015). Assessment and treatment of pica and destruction of holiday decorations. Journal of Applied Behavior Analysis , 48 , 912–917.
Neely, L., Rispoli, M., Gerow, S., & Ninci, J. (2014). Effects of antecedent exercise on academic engagement and stereotypy during instruction. Behavior Modification , 39 , 98–116.
O’Handley, R. D., Radley, K. C., & Whipple, H. M. (2015). The relative effects of social stories and video modeling toward increasing eye contact of adolescents with autism spectrum disorder. Research in Autism Spectrum Disorders , 11 , 101–111.
Paden, A. R., & Kodak, T. (2015). The effects of reinforcement magnitude on skill acquisition for children with autism. Journal of Applied Behavior Analysis , 48 , 924–929.
Pence, S. T., & St Peter, C. C. (2015). Evaluation of treatment integrity errors on mand acquisition. Journal of Applied Behavior Analysis , 48 , 575–589. https://doi.org/10.1002/jaba.238
Peters, L. C., & Thompson, R. H. (2015). Teaching children with autism to respond to conversation partners’ interest. Journal of Applied Behavior Analysis , 48 , 544–562.
Peterson, K. M., Volkert, V. M., & Zeleny, J. R. (2015). Increasing self-drinking for children with feeding disorders. Journal of Applied Behavior Analysis , 48 , 436–441.
Protopopova, A., & Wynne, C. D. (2015). Improving in-kennel presentation of shelter dogs through response-dependent and response-independent treat delivery. Journal of Applied Behavior Analysis , 48 , 590–601.
Putnam, B. C., & Tiger, J. H. (2015). Teaching braille letters, numerals, punctuation, and contractions to sighted individuals. Journal of Applied Behavior Analysis , 48 , 466–471.
Quinn, M. J., Miltenberger, R. G., & Fogel, V. A. (2015). Using TAGteach to improve the proficiency of dance movements. Journal of Applied Behavior Analysis , 48 , 11–24.
Rispoli, M., Ninci, J., Burke, M. D., Zaini, S., Hatton, H., & Sanchez, L. (2015). Evaluating the accuracy of results for teacher implemented trial-based functional analyses. Behavior Modification , 39 , 627–653.
Rosales, R., Gongola, L., & Homlitas, C. (2015). An evaluation of video modeling with embedded instructions to teach implementation of stimulus preference assessments. Journal of Applied Behavior Analysis , 48 , 209–214.
Saini, V., Greer, B. D., & Fisher, W. W. (2015). Clarifying inconclusive functional analysis results: Assessment and treatment of automatically reinforced aggression. Journal of Applied Behavior Analysis , 48 , 315–330.
Saini, V., Gregory, M. K., Uran, K. J., & Fantetti, M. A. (2015). Parametric analysis of response interruption and redirection as treatment for stereotypy. Journal of Applied Behavior Analysis , 48 , 96–106.
Scalzo, R., Henry, K., Davis, T. N., Amos, K., Zoch, T., Turchan, S., & Wagner, T. (2015). Evaluation of interventions to reduce multiply controlled vocal stereotypy. Behavior Modification , 39 , 496–509.
Siegel, E. B., & Lien, S. E. (2014). Using photographs of contrasting contextual complexity to support classroom transitions for children with Autism Spectrum Disorders. Focus on Autism and Other Developmental Disabilities , 30 , 100–114.
Slocum, S. K., & Vollmer, T. R. (2015). A comparison of positive and negative reinforcement for compliance to treat problem behavior maintained by escape. Journal of Applied Behavior Analysis , 48 , 563–574.
Smith, K. A., Shepley, S. B., Alexander, J. L., Davis, A., & Ayres, K. M. (2015). Self-instruction using mobile technology to learn functional skills. Research in Autism Spectrum Disorders , 11 , 93–100.
Sniezyk, C. J., & Zane, T. L. (2014). Investigating the effects of sensory integration therapy in decreasing stereotypy. Focus on Autism and Other Developmental Disabilities , 30 , 13–22.
Speelman, R. C., Whiting, S. W., & Dixon, M. R. (2015). Using behavioral skills training and video rehearsal to teach blackjack skills. Journal of Applied Behavior Analysis , 48 , 632–642.
Still, K., May, R. J., Rehfeldt, R. A., Whelan, R., & Dymond, S. (2015). Facilitating derived requesting skills with a touchscreen tablet computer for children with autism spectrum disorder. Research in Autism Spectrum Disorders , 19 , 44–58.
Vargo, K. K., & Ringdahl, J. E. (2015). An evaluation of resistance to change with unconditioned and conditioned reinforcers. Journal of Applied Behavior Analysis , 48 , 643–662.
Vedora, J., & Grandelski, K. (2015). A comparison of methods for teaching receptive language to toddlers with autism. Journal of Applied Behavior Analysis , 48 , 188–193.
Wilder, D. A., Majdalany, L., Sturkie, L., & Smeltz, L. (2015). Further evaluation of the high-probability instructional sequence with and without programmed reinforcement. Journal of Applied Behavior Analysis , 48 , 511–522.
Wunderlich, K. L., & Vollmer, T. R. (2015). Data analysis of response interruption and redirection as a treatment for vocal stereotypy. Journal of Applied Behavior Analysis , 48 , 749–764.
Appendix B: Versions of the mean phase difference
In the initial proposal (Manolov & Solanas, 2013 ), MPD.2013 entails the following steps:
Estimate baseline trend as the average of the differenced baseline phase data: \( {b}_{1(D)}=\frac{1}{n_A-1}{\sum}_{i=2}^{n_A}\left({y}_i-{y}_{i-1}\right) \).
Extrapolate baseline trend, adding the trend estimate (\( {b}_{1(D)} \)) to the last baseline phase data point (\( {y}_{n_A} \)) to predict the first intervention-phase data point (\( {\widehat{y}}_{n_A+1} \)). Formally, \( {\widehat{y}}_{n_A+1}={y}_{n_A}+{b}_{1(D)} \). This entails that the intercept of the baseline trend line is \( {b}_{0(MPD.2013)}={y}_{n_A}-{n}_A\times {b}_{1(D)} \).
Continue extrapolating adding the trend estimate to the previously obtained forecast. Formally, \( {\widehat{y}}_{n_A+j}={\widehat{y}}_{n_A+j-1}+{b}_{1(D)};j=2,3,\dots, {n}_B \) .
Obtain MPD as the difference between the actually obtained treatment data (\( {y}_j \)) and the treatment measurements as predicted from baseline trend (\( {\widehat{y}}_j \)): \( {MPD}_{2013}=\frac{\sum_{j=1}^{n_B}\left({y}_j-{\widehat{y}}_j\right)}{n_B} \).
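A minimal R sketch of these steps (our own illustration, not published code) follows; yA and yB denote the baseline and intervention-phase measurements.

```r
# MPD.2013 as described above: slope from the differenced baseline data,
# extrapolation from the last baseline point, and the mean of
# actual-minus-predicted differences.
mpd_2013 <- function(yA, yB) {
  b1   <- mean(diff(yA))                        # b1(D)
  nB   <- length(yB)
  yhat <- yA[length(yA)] + b1 * seq_len(nB)     # extrapolated baseline trend
  mean(yB - yhat)                               # MPD
}

mpd_2013(yA = c(3, 4, 4, 5), yB = c(7, 8, 9, 9, 10))   # hypothetical AB data
```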
In its modified version (Manolov & Rochat, 2015 ), MPD.2015 entails the following steps:
Estimate baseline trend as the average of the differenced baseline phase data: the same \( {b}_{1(D)} \) defined previously.
Establish the pivotal point in the baseline at the crossing of \( Md(x)= Md\left(1,2,\dots, {n}_A\right) \) on the abscissa and \( Md(y)= Md\left({y}_1,{y}_2,\dots, {y}_{n_A}\right) \) on the ordinate.
Establish a fitted value at an existing baseline measurement occasion around Md(y). Formally, \( {\widehat{y}}_{\left\lfloor Md(x)\right\rfloor }= Md(y)-\left( Md(x)-\left\lfloor Md(x)\right\rfloor \right)\times {b}_1 \).
Fit the baseline trend to the whole baseline, subtracting and adding the estimated baseline slope from the fitted value obtained in the previous step, according to the measurement occasion.
Therefore, the intercept of the baseline trend line is defined as \( {b}_{0(MPD.2015)}= Md(y)- Md(x)\times {b}_{1(D)} \).
Extrapolate the baseline trend into the treatment phase, starting from the last fitted baseline value: \( {\widehat{y}}_{n_A+1}={\widehat{y}}_{n_A}+{b}_{1(D)} \) .
Continue extrapolating adding the trend estimate to the previously obtained forecast: \( {\widehat{y}}_{n_A+j}={\widehat{y}}_{n_A+j-1}+{b}_{1(D)};j=2,3,\dots, {n}_B \) .
Obtain MPD as the difference between the actually obtained treatment data and the treatment measurements as predicted from baseline trend: \( {MPD}_{2015}=\frac{\sum_{j=1}^{n_B}\left({y}_j-{\widehat{y}}_j\right)}{n_B} \) .
We propose a third way of defining the intercept, namely, in the same way as in the Theil–Sen estimator, that is, as the median of the differences between the actual data points and the slope multiplied by the measurement occasion: \( {b}_{0(TS)}= Md\left({y}_i-{b}_{1(D)}\times i\right);i=1,2,\dots, {n}_A \). Note that the slope is still estimated as in the original proposal (Manolov & Solanas, 2013 ).
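The sketch below contrasts the three intercept definitions for the same slope; it reflects our reading of the steps above (in particular, the MPD.2015 intercept uses the formula as reconstructed here), and the data are hypothetical.

```r
# Sketch: the three intercept definitions discussed for the MPD, all combined
# with the same slope b1(D) (mean of the differenced baseline data).
mpd_intercepts <- function(yA) {
  nA <- length(yA)
  b1 <- mean(diff(yA))
  i  <- seq_len(nA)
  b0_2013 <- yA[nA] - nA * b1             # MPD.2013: trend through the last baseline point
  b0_2015 <- median(yA) - median(i) * b1  # MPD.2015: trend through the median point
  b0_ts   <- median(yA - b1 * i)          # Theil-Sen-style intercept
  c(MPD.2013 = b0_2013, MPD.2015 = b0_2015, TheilSen = b0_ts)
}

mpd_intercepts(c(3, 4, 4, 5, 6))          # hypothetical baseline
```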
About this article
Manolov, R., Solanas, A., & Sierra, V. Extrapolating baseline trend in single-case data: Problems and tentative solutions. Behavior Research Methods, 51, 2847–2869 (2019). https://doi.org/10.3758/s13428-018-1165-x
Published: 27 November 2018
Issue Date: December 2019
DOI: https://doi.org/10.3758/s13428-018-1165-x
Keywords: Single-case designs, Extrapolation, Forecasting
- Open access
- Published: 29 May 2023
Extrapolating empirical long-term survival data: the impact of updated follow-up data and parametric extrapolation methods on survival estimates in multiple myeloma
- LJ Bakker 1,2,
- FW Thielen 1,2,
- WK Redekop 1,2,
- CA Uyl-de Groot 1,2 &
- HM Blommestein 1,2
BMC Medical Research Methodology, volume 23, Article number: 132 (2023)
Background
In economic evaluations, survival is often extrapolated to smooth out the Kaplan-Meier estimate and because the available data (e.g., from randomized controlled trials) are often right censored. Validation of the accuracy of extrapolated results can depend on the length of follow-up and the assumptions made about the survival hazard. Here, we analyze the accuracy of different extrapolation techniques while varying the data cut-off to estimate long-term survival in newly diagnosed multiple myeloma (MM) patients.
Methods
Empirical data were available from a randomized controlled trial and a registry for MM patients treated with melphalan + prednisone, thalidomide-, and bortezomib-based regimens. Standard parametric and spline models were fitted while artificially reducing follow-up by introducing database locks. The maximum follow-up for these locks varied from 3 to 13 years. The extrapolated (conditional) restricted mean survival time (RMST) was compared to the Kaplan-Meier RMST, and models were selected according to statistical tests and visual fit.
Results
For all treatments, the RMST error decreased as follow-up and the absolute number of events increased and as censoring decreased. The decline in RMST error was largest when the maximum follow-up exceeded six years. However, even when censoring was low, there could still be considerable deviations between the extrapolated RMST, conditional on surviving until the start of extrapolation, and the Kaplan-Meier estimate.
Conclusions
We demonstrate that both standard parametric and spline models could be worthy candidates when extrapolating survival for the populations examined. Nevertheless, researchers and decision makers should be wary of uncertainty in the results even when censoring has decreased and the number of events has increased.
Introduction
The data available for assessing the efficacy of novel healthcare technologies in oncology often come from randomized controlled trials (RCTs). However, RCTs do not provide all the information needed to assess the cost-effectiveness of these technologies. RCTs often have limited follow-up times, and thus increased censoring at the time of market approval, while a lifetime horizon is usually recommended in best-practice guidelines for economic evaluations [1, 2]. This lifetime horizon ensures that all differences (i.e., short- and long-term) between the technologies compared are accounted for. Since a lifetime horizon almost always exceeds the follow-up duration of RCTs or of other data sources used in economic evaluations (e.g., registries), empirical survival data are typically right censored [3]. For the novel treatment assessed, this can result in considerable uncertainty regarding the parametric survival function. For the comparator, this depends on whether the treatment administered in the trial is representative of current care and whether alternative sources of data are available to inform long-term survival.
There is substantial variation in the percentage of patients that are right censored, depending on the type of disease [4]. For hematological malignancies, for instance, the average percentage censored was 84% in initial publications and 54% in updated publications, whereas for other malignancies it varied from 28 to 73% in initial publications and from 13 to 47% in updated results [4]. With novel immunotherapies such as daratumumab and lenalidomide prolonging survival for multiple myeloma patients [5, 6], this issue has become even more prominent in recent years.
To address the issue of right-censoring, parametric survival functions and other methods for extrapolation are used to estimate long-term survival, making assumptions about the underlying hazard function for the extrapolated period based on the data observed [ 7 ]. Many types of models can be used to extrapolate survival from empirical evidence. Standard parametric models are generally included (e.g., Weibull, lognormal), but it is recommended to also consider more flexible models (e.g., spline, parametric mixture models) that allow for multiple turning points in the hazard function [ 8 ]. More flexible parametric spline models for instance were found in a previous study by Gray et al. to predict 10-year survival quite accurately for large cohorts of registry patients for which there was little uncertainty in the data [ 9 ].
Assessing the suitability of models and selecting the best-fitting model for extrapolation can be done through inspection of log cumulative hazard plots, inspection of visual fit, and statistical tests (e.g., Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC)) [ 10 ]. Real-world data may also guide model selection by assessing whether extrapolated results are plausible when compared to patient survival outside the context of a clinical trial [ 11 ]. Prior research has suggested that model selection should consider the length of follow-up of the data available [ 12 ]. In a case study, Bullement et al. assessed the accuracy of extrapolations for four different data-cuts of the JAVELIN Merkel 200 trial which studied the treatment effect of avelumab for patients with Merkel cell carcinoma. The authors found that extrapolations using longer follow-up (e.g., 36 months) favored more flexible spline-based models [ 12 ].
Despite this guidance, selecting a good-fitting model and analyzing the uncertainty surrounding model choice remain challenging, and several publications have already assessed the accuracy of extrapolations (e.g., [4, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]). These studies vary in the type of disease treated (e.g., melanoma, lung cancer), the type of treatment evaluated (e.g., immunotherapy, surgery), the types of models compared, the duration of empirical follow-up, the availability of individual patient data (IPD) or of data recreated from published RCTs, their sample size, and the inclusion of external data sources. The overall accuracy of extrapolations has been found to be correlated with the percentage censored [4]. Everest et al. conducted a systematic review to find published RCTs with initial and updated results. For the 32 eligible RCTs, the accuracy of extrapolations based on the initial publication was assessed after reconstructing individual patient data and fitting standard parametric models. The authors found that the difference between the extrapolated survival and the empirical survival increased as the percentage of patients censored increased [4].
In this study, we aim to compare extrapolation methods to assess the relationship between data maturity and survival projection accuracy, in the presence of several data sources. Both standard parametric models and spline models were fitted to RCT and patient registry data from patients with multiple myeloma while varying the maximum data cut-off (DCO) times. These extrapolations were not informed by alternative sources of information assuming that solely the dataset at hand with its particular DCO would be the best source available for extrapolation. The resulting extrapolations were compared to long-term empirical survival to determine the best candidate models. The results of our study may assist researchers in assessing whether the IPD is sufficiently mature for cost-effectiveness analysis and guide their decision-making concerning the sensitivity analyses that should be conducted.
Methods
Patient population & data
All details on the data sources, treatment arms, inclusion dates, and the data cuts can be found in Table 1 . IPD from an RCT performed by the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON), and data from the Dutch National Cancer Registry (NKR) were used to assess the accuracy of extrapolations compared to long-term empirical survival. The HOVON49 study compared melphalan + prednisone (HOVON - MP) with melphalan + prednisone + thalidomide (HOVON - Thal) in newly diagnosed multiple myeloma patients > 65 years of age [ 19 ]. Patients were included between September 2002 and July 2007 and long-term follow-up was available until December 2015.
Data from the NKR registry, including the Dutch Population based HAematological Registry for Observational Studies (PHAROS), were also used [20, 21]. From the PHAROS database, newly diagnosed patients who received first-line treatment with MP- (PHAROS - MP), thalidomide- (PHAROS - Thal), and bortezomib-based (PHAROS - Bort) regimens were included. Patients receiving melphalan + prednisone + bortezomib (NKR+ - MPV) in the NKR+ data were also included as a separate cohort. The mean age of the PHAROS - Bort cohort was slightly lower (Table 1) because, at the time, bortezomib was not the recommended first-line treatment for all multiple myeloma patients. In 2006–2011, bortezomib-based regimens were mainly prescribed to younger patients, followed by patients with kidney failure [22]. All dates for inclusion and exclusion can be found in Table 1 and Figure S1. For all patients in the PHAROS and NKR+ databases, follow-up was included up until January 2022.
Overall survival was extrapolated using data sets which varied in the maximum follow-up time after the start of patient inclusion. For the MP arm from the HOVON49 study for instance, four datasets were created. In the first, patients were included from September 2002 until September 2005. Thus, the maximum follow-up of patients was three years and only patients that were enrolled before September 2005 were included. For the second HOVON - MP dataset, all enrolled patients were included (since enrollment ended in July 2007) but the final follow-up date was six years after starting enrollment (i.e., September 2008) and so forth.
The DCOs (i.e., < 3, 6, 8, 10, and 13 years) were chosen based on previously reported results from Everest et al. and Bullement et al. [ 4 , 12 ] and according to the maximum potential follow-up in the dataset. For instance, if inclusion started in 2002 and the DCO was 2005 the longest that a patient could have been followed was 3 years (Table 1 ). The minimum amount of potential follow-up time for all patients included in the data is also reported in Table 1 . For instance, when the maximum follow-up was < 6 years for the HOVON-MP arm the minimum potential follow-up for all patients included was 1 year and when the maximum follow-up was < 8 years the minimum potential follow-up was 3 years.
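To make the construction of the database locks concrete, the sketch below shows one way such administrative censoring could be applied to individual patient data; the data frame and variable names (enrol_date, event_date, event) are hypothetical and not taken from the HOVON or NKR data sets.

```r
# Hedged sketch: create an artificial database lock (data cut-off, DCO) from IPD.
# Patients enrolled after the lock are dropped; events after the lock are censored.
library(dplyr)

apply_dco <- function(ipd, dco_date) {
  ipd %>%
    filter(enrol_date <= dco_date) %>%
    mutate(
      event  = ifelse(event_date <= dco_date, event, 0L),              # censor post-lock events
      futime = as.numeric(pmin(event_date, dco_date) - enrol_date) / 365.25
    )
}

# e.g., a lock giving at most ~3 years of follow-up for the earliest enrollees
# ipd_3y <- apply_dco(ipd, as.Date("2005-09-01"))
```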
Fitted models
The models used to extrapolate results included all commonly used standard parametric models recommended by Technical Support Documents 14 and 21 from the National Institute for Health and Care Excellence Decision Support Unit (i.e., exponential, Weibull, Gompertz, gamma, log-logistic, lognormal, generalized gamma) and spline models. Spline-based models are flexible models in which the survival function is transformed by a link function using natural cubic splines [23]. Natural cubic splines impose monotonicity in the tails, where the number at risk is low, whereas at earlier time points monotonicity is guaranteed by the data density if the sample size is reasonable [23]. The transformed survival function is then smoothed, reducing the risk of sudden deviations, especially in the tail. Knots are placed at extreme values of the survival times and internally [23]. Here, the number of knots was varied from one to three, and hazard, odds, and normal scales were used.
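The sketch below illustrates how such models can be fitted with the flexsurv package mentioned later in the Methods; the simulated data, variable names, and censoring time are illustrative assumptions, not the study data.

```r
# Sketch: one standard parametric model and one spline model fitted to
# simulated IPD with overall survival in years and an event indicator.
library(flexsurv)

set.seed(1)
ipd <- data.frame(os_years = rweibull(300, shape = 1.2, scale = 4))
ipd$died     <- as.integer(ipd$os_years < 8)      # administrative censoring at 8 years
ipd$os_years <- pmin(ipd$os_years, 8)

fit_weibull <- flexsurvreg(Surv(os_years, died) ~ 1, data = ipd, dist = "weibull")
fit_spline  <- flexsurvspline(Surv(os_years, died) ~ 1, data = ipd,
                              k = 2, scale = "hazard")   # two knots, hazard scale
plot(fit_spline, t = seq(0, 35, by = 0.25), ci = FALSE)  # model overlaid on the Kaplan-Meier curve
```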
Model selection & accuracy of predictions
In the results, we present the models that had the lowest AIC, the lowest BIC, and the best visual fit based on survival and hazard plots. To select the models with the best visual fit, two authors (LB, HB) reviewed all curves independently. Curves were judged on four criteria: their fit to the Kaplan-Meier survival curve, the plausibility of the extrapolated survival, their fit to the smoothed hazard, and the plausibility of the extrapolated hazard. If, based on these four criteria, multiple models were still eligible for 'best' fit, the model with the smallest number of parameters to be estimated was selected. For instance, an exponential distribution, for which one parameter needs to be estimated, would be preferred over a generalized gamma distribution (three parameters). After individual selection, any remaining discrepancies were resolved by discussion to reach consensus. A third author (FT) participated in these discussions to resolve any ties in model selection. In preparation for the discussion, the third author randomly assessed one-third of all curves according to the criteria noted above.
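As a small illustration of the statistical part of this selection, the snippet below ranks the candidate models from the previous sketch by AIC; in practice, visual fit to the Kaplan-Meier curve and to the smoothed hazard would be inspected alongside such statistics.

```r
# Sketch: compare penalized fit of candidate models (objects assumed to exist
# from the previous snippet); lower AIC indicates better penalized fit.
fits <- list(weibull = fit_weibull, spline_k2 = fit_spline)
sort(sapply(fits, AIC))
```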
The accuracy of predictions was estimated using the restricted mean survival time (RMST). The RMST is the mean survival restricted to a maximum time t rather than over a lifetime. It can be calculated by estimating the area under the survival curve (AUC) up until time t using integration [24]. All models were fitted and the RMST estimated using the flexsurv package (version 2.1) in R [25]. First, a lifetime RMST was estimated for all cohorts; here, the AUC was estimated for the extrapolated survival curves with the time horizon set to 35 years. Hereafter, the extrapolated survival was compared to the empirical survival, with the horizon for the RMST depending on the length of follow-up in the empirical survival data (Table 1). The RMST error was defined as the difference between the RMST from the extrapolated curves and the RMST of the Kaplan-Meier estimate. In the second set of analyses, the RMST was limited to the extrapolated proportion of the survival curve; here, the RMST was estimated conditional on surviving up until the point from which extrapolation was required. Thus, for the dataset with a maximum of three years of follow-up, the RMST was estimated conditional on having survived 3 years. Variations in the RMST error were also plotted according to the percentage censored, the absolute number of events, and the type of model (i.e., standard parametric or spline). For the spline models, knots were automatically placed at centiles of the uncensored survival times, following the recommendations of Royston & Parmar, as implemented in the flexsurv package [23]. R version 4.0.3 was used for all analyses, together with the packages flexsurv, muhaz, survRM2, and lme4.
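The sketch below illustrates, under the same hypothetical objects as before (`cut_3y`, plus a hypothetical long-follow-up data frame `full_data` with the same `time`/`status` columns), how the lifetime RMST, the RMST error against the Kaplan-Meier estimate, and the RMST conditional on surviving to the start of extrapolation can be obtained. Recent flexsurv versions (the study used 2.1) support `summary(type = "rmst")`; if that option is unavailable, the fitted survival curve can instead be integrated numerically as in `cond_rmst()` below.

```r
library(flexsurv)
library(survival)

# A fitted spline model on the hypothetical 3-year data cut from the earlier sketches
fit <- flexsurvspline(Surv(time, status) ~ 1, data = cut_3y, k = 2, scale = "hazard")

# Lifetime RMST: area under the extrapolated survival curve up to a 35-year horizon
summary(fit, type = "rmst", t = 35, ci = FALSE)

# RMST error: extrapolated RMST minus the Kaplan-Meier RMST over the same horizon;
# 'full_data' is a hypothetical data frame holding the long-term empirical follow-up
km       <- survfit(Surv(time, status) ~ 1, data = full_data)
km_tab   <- summary(km, rmean = 8)$table               # includes the restricted mean ("rmean") to 8 years
km_rmst  <- unname(km_tab[grep("rmean", names(km_tab))[1]])
mod_rmst <- summary(fit, type = "rmst", t = 8, ci = FALSE)[[1]]$est
rmst_error <- mod_rmst - km_rmst

# RMST conditional on surviving to the start of extrapolation t0:
# integrate S(t)/S(t0) from t0 to the horizon
cond_rmst <- function(fit, t0, horizon) {
  S <- function(t) sapply(t, function(ti)
    summary(fit, t = ti, type = "survival", ci = FALSE)[[1]]$est)
  integrate(function(u) S(u) / S(t0), lower = t0, upper = horizon)$value
}
cond_rmst(fit, t0 = 3, horizon = 8)
```

The conditional version isolates the error in the extrapolated tail by normalizing the fitted survival to the survival probability at the point where the observed data end.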
Ethical approval
Approval for use of the PHAROS and NKR+ data was granted through the supervisory committee of the Dutch Integral Cancer Registry. Approval for secondary use of the data from the HOVON49 study was provided by HOVON.
Overall, 1853 patients were included, who received a variety of treatment regimens in a regular clinical care setting (PHAROS & NKR+) or in an RCT (HOVON) (Table S1). For all patient cohorts, the percentage censored was initially high but decreased quickly with longer follow-up (Table S1). Kaplan-Meier estimates and the numbers at risk at the respective time points were plotted grouped according to the treatment received (i.e., MP, thalidomide, or bortezomib-based), the data source (i.e., HOVON, PHAROS, NKR+), and the maximum follow-up (i.e., 3, 6, 8, 10, and 13 years) (Fig. 1, S2-S7).
Fig. 1 Long-term overall survival of patients treated with bortezomib-based regimens for the PHAROS (registry) and NKR+ (registry) data, patients treated with MP-based regimens for the HOVON (RCT) and PHAROS (registry) data, and patients treated with thalidomide-based regimens for the HOVON (RCT) and PHAROS (registry) data
Lifetime RMST
The extrapolated lifetime RMST varied considerably according to the data source and the types of models fitted (Figure S8). Overall, the variation in the extrapolated lifetime RMST was high for models estimated with limited follow-up. For example, for HOVON-Thal with a maximum follow-up of 3 years, the RMST varied from 5 years to 22.5 years. The variation for HOVON-MP, PHAROS-MP, and NKR+-MPV was considerably smaller than for all other arms (Figures S8, S9), ranging from 2.5 years to less than 10 years. The survival estimates declined considerably as the percentage censored decreased (Fig. 2), but also as the absolute number of events increased, for almost all models (Figure S10).
Fig. 2 Lifetime RMST according to the percentage censored and the type of model
Observed and estimated RMST from the RCT
In Table S2 we present a comparison between the observed long-term survival (i.e., 11 years) and the estimated RMST for four different data cuts using data from the HOVON RCT. RMST estimates were restricted to the maximum follow-up. The mean survival estimates were considerably smaller than those for the 35-year time horizon, but the uncertainty was also large when follow-up was short. Standard parametric models were often selected based on AIC, BIC, and visual fit, whereas no clear preference for either standard parametric or spline models could be seen for the model with the lowest RMST error. Curves often overlapped and the differences between them were often negligible, making selection based on model fit difficult. We also observed that the RMST error of the model selected using BIC was almost always lower than that of the models selected based on AIC and visual fit. However, this was usually the exponential distribution, which tended to under- or overestimate the hazard in the earlier months and do the opposite in later months.
The RMST error was higher for the short-term follow-up (< 3 years), for which the censoring percentages were also higher (HOVON-MP: 73%, HOVON-Thal: 77%) relative to the number of events (HOVON-MP: 29, HOVON-Thal: 25) (Table S1). However, as the length of follow-up increased, the error decreased, with the largest absolute difference in RMST occurring between < 3 years and < 6 years of follow-up, which coincided with a large reduction in censoring (HOVON-MP: 73% to 38%, HOVON-Thal: 77% to 48%). Confidence intervals of the selected models almost always overlapped.
Observed and estimated RMST from registries
For the registries, the maximum follow-up was slightly longer and the RMST was therefore estimated for 14 years (Table S3). Here, the model with the best visual fit changed less often as follow-up increased, and standard parametric models were almost always selected based on AIC, BIC, and best visual fit. For the NKR+ data, the absolute RMST error was much smaller owing to the shorter time horizon for which the RMST was estimated (i.e., 8 years).
Overall, standard parametric models regularly had the smallest absolute RMST error (i.e., in 67% of cases), but as censoring decreased, the lowest absolute RMST error was more often achieved by a spline model (i.e., for PHAROS-MP and NKR+-MPV). The error in the extrapolations based on the datasets with short follow-up (< 3 years) was large, irrespective of the sample size of the dataset used and the percentage censored. The error decreased as the follow-up increased and censoring thus decreased.
The RMST error for all models decreased when follow-up increased (Figures S11-S14). The RMST errors for all treatments (regardless of the sample size, censoring, events, and the time horizon of the RMST) were low when 8 years of follow-up or more was available (Figures S11-S14). Decreased censoring and more events coincided with smaller RMST errors (Fig. 3, S15-S17).
Fig. 3 RMST error according to the percentage censored and the type of model. RMST is estimated for a time horizon of 8 years and a maximum follow-up of 3 and 6 years
RMST error conditional on survival
For the RMST error conditional on having survived until extrapolation, the decline in error was less pronounced as censoring decreased and the number of events increased (Fig. 4, S18). Moreover, the spread in error was much wider for standard parametric models than for spline models (Fig. 4, S18, S19). In Fig. 4, the spread in the conditional RMST error across models is smaller for data cuts with lower percentages censored than for those with higher percentages censored. However, even for the lowest percentages censored (e.g., 30-40%), there were some considerable deviations of the extrapolated RMST from the KM estimate. This was also observed when the number of events was higher (e.g., > 100 events) (Figure S18).
Fig. 4 The RMST error conditional on surviving until extrapolation, plotted according to the percentage censored and the type of model. RMST is estimated for a time horizon of 8 years and a maximum follow-up of 3 and 6 years
In this study, we analyzed the accuracy of extrapolations for a non-solid tumor while varying the percentage censored, using trial and registry data with sample sizes representative of those generally available to health economic researchers. We compared the RMST estimated from extrapolated survival with the long-term Kaplan-Meier estimate in patients with multiple myeloma for a variety of treatments, data sources, and maximum follow-up times. When reimbursement dossiers are drafted, the length of follow-up of patients included in the pivotal trial is often limited. Insight into the consequences of the uncertainty in extrapolations and in the different models fitted is therefore essential, since these are used to inform (conditional) reimbursement decisions by policy makers. This is an even bigger issue for clinical trials of novel immunotherapies such as daratumumab, where the percentage censored for overall survival is high [5, 26].
These results align with those of Everest et al., in that the RMST error increases when the percentage censored increases. For trials of hematologic diseases, the average percentage censored was 84% for initial publications and 54% for the final publication [4]. Although Fig. 3 and S15 show that the RMST error declines substantially at a percentage censored of 54% or lower, there can still be considerable uncertainty in these extrapolations. This was more pronounced when the RMST error was estimated conditional on having survived until extrapolation (Fig. 4). Decision makers should critically review whether decisions on reimbursement can be made when the extrapolated survival is based on high percentages censored. In the economic evaluations that support these decisions, the models fitted should be those that are relevant to the disease at hand and informed by clinical expertise. Here, the sensitivity analyses adopted by health economic researchers can demonstrate the potential impact of this uncertainty on cost-effectiveness, for instance when the percentage censored is high, but also when it is low.
In this study, we found no conclusive evidence that standard parametric models are better than spline models or vice versa. The highest absolute RMST error was regularly seen with a standard parametric model. This suggests that uncertainty analyses for health economic evaluations that include all standard parametric models could adequately capture the extent (i.e., the upper and lower limits) of the uncertainty in the incremental cost-effectiveness ratio. The relationship between the percentage censored and the RMST error further underlines the need to identify methods that lead to the lowest RMST error, even when the percentage censored is high. Further research should assess whether spline models perform better or worse when the percentage censored is large and the absolute number of events is small.
Limitations
This study focused on the RMST error as an outcome measure, which enables a comparison between the extrapolated and observed survival. There are, however, some drawbacks to this outcome measure. First, underestimation and overestimation over time can compensate for each other and ultimately result in a relatively small RMST error. This aligns with results from a prior study in which large cohorts of registry data were used to extrapolate 10-year survival [9]. Gray et al. observed that the exponential distribution both under- and overestimated the hazard, resulting in a low RMST error [9]. Second, obtaining the RMST requires a maximum time. While we could apply a lifetime horizon for estimating the RMST, we were bound by the observation time when calculating the RMST error, which differed across data sources.
Another limitation of the outcome used is that the Kaplan-Meier estimate is itself an estimate of the true survival function for a given cohort of patients. Although inherent to this kind of research, the RMST error could be influenced by the fact that the number at risk decreases as time progresses. This is, for instance, reflected in the conditional survival estimated for HOVON-MP with > 10 years of follow-up, where none of the few patients remaining in the sample died between 10 and 11 years of follow-up.
Overall, the cohort sizes in our study were relatively small (the smallest cohort of Gray et al., by comparison, comprised N = 5407 patients [9]), which increases the uncertainty in extrapolated survival. This can also (partially) explain why our findings differ from those of Gray et al., who found spline models to perform well even for short follow-up times. While larger cohorts are preferred and might be available for some treatments, our sample sizes are representative of the clinical trials in hematology generally used as input for economic evaluations [5, 27, 28]. This makes our research applicable to current practice, where health economic modelling is often performed using data from RCTs with a similar sample size. Another limitation was the heterogeneity in the PHAROS-Bort cohort. The considerable uncertainty in the extrapolations for this cohort might be (partially) explained by the small sample size, but perhaps also by the heterogeneity in the cohort. Due to the small sample size, further stratification by age was not feasible, but such stratification would be recommended when performing an economic evaluation in the presence of this variation.
We employed commonly used parametric and spline models and did not consider more recent and complex models such as cure, parametric mixture, and landmark models [8, 15]. In Technical Support Document 21 from the National Institute for Health and Care Excellence Decision Support Unit, Rutherford et al. provide recommendations for their appropriate use and, although we did not include them in this analysis, they could be a relevant addition, for instance when modelling survival for potentially curative treatments (e.g., CAR-T) [8]. Another topic for which an increasing amount of research is available concerns the inclusion of external data (e.g., registry data, national statistics). Including such external data to correct for overestimated survival in the extrapolations has been recommended when extrapolating survival from RCTs [7, 11]. Although this can sometimes reduce the overestimation of survival, it was beyond the scope of this study.
The generalizability of our findings to other areas of disease, particularly other hematological malignancies for which little evidence concerning the accuracy of extrapolations is available, will strongly depend on the similarities between the populations studied. The six datasets used in this study differ in the types of patients included, treatments administered, and hence in their hazard function. Similarly, the generalizability of these findings to other hematological malignancies will strongly depend on these features.
In this study, we compared the extrapolated survival of multiple myeloma patients with prolonged empirical survival for a wide variety of DCOs, using data from an RCT and registries. Uncertainty in extrapolations can have a large impact on the use of healthcare services when the error in long-term survival is large and leads to incorrect conclusions for decision makers.
We found that the RMST error can become quite small for both standard parametric and spline models, but also that the RMST error increases for all models as censoring increases. The RMST error for the extrapolated period only also decreased as the percentage censored decreased and the number of events increased; however, this reduction was much less pronounced.
Health economic researchers should consider a variety of models in their (uncertainty) analyses when extrapolating survival in economic evaluations. Here, although the RMST error is high when the percentage censored is high, careful consideration of uncertainty analyses also seems warranted when longer follow-up is available.
Data Availability
The data that support the findings of this study are available from the Dutch Cancer Registry (IKNL) and the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON). Data are available upon reasonable request through the corresponding author (LB), on the condition that permission for access is granted by IKNL and HOVON.
Abbreviations
AIC: Akaike Information Criterion
AUC: Area under the curve
BIC: Bayesian Information Criterion
DCO: Data cut-off
IPD: Individual patient data
HOVON: Dutch Haemato-Oncology Foundation for Adults in the Netherlands
LCI: Lower confidence interval
MP: Melphalan + prednisone
NKR: Dutch National Cancer Registry
PHAROS: Dutch Population based HAematological Registry for Observational Studies
RCT: Randomized controlled trial
RMST: Restricted mean survival time
UCI: Upper confidence interval
Sharma D, Aggarwal AK, Downey LE, Prinja S. National healthcare economic evaluation guidelines: a cross-country comparison. PharmacoEconomics-Open. 2021 Sep;5(3):349–64.
Dutch Pharmacoeconomic Guidelines [Internet] Diemen: National Health Care Institute the Netherlands. Available from: https://www.zorginstituutnederland.nl/publicaties/publicatie/2016/02/29/richtlijn-voor-het-uitvoeren-van-economische-evaluaties-in-de-gezondheidszorg .
Latimer NR. Survival analysis for economic evaluations alongside clinical trials—extrapolation with patient-level data: inconsistencies, limitations, and a practical guide. Med Decis Making. 2013 Aug;33(6):743–54.
Everest L, Blommaert S, Chu RW, Chan KK, Parmar A. Parametric Survival Extrapolation of early Survival data in economic analyses: a comparison of projected Versus observed updated survival. Value in Health. 2021 Nov 24.
Mateos MV, Cavo M, Blade J, Dimopoulos MA, Suzuki K, Jakubowiak A, Knop S, Doyen C, Lucio P, Nagy Z, Pour L. Overall survival with daratumumab, bortezomib, melphalan, and prednisone in newly diagnosed multiple myeloma (ALCYONE): a randomised, open-label, phase 3 trial. The Lancet. 2020 Jan 11;395(10218):132-41.
Jackson GH, Davies FE, Pawlyn C, Cairns DA, Striha A, Collett C, Hockaday A, Jones JR, Kishore B, Garg M, Williams CD. Lenalidomide maintenance versus observation for patients with newly diagnosed multiple myeloma (Myeloma XI): a multicentre, open-label, randomised, phase 3 trial. The Lancet Oncology. 2019 Jan 1;20(1):57–73.
Jackson C, Stevens J, Ren S, Latimer N, Bojke L, Manca A, Sharples L. Extrapolating survival from randomized trials using external data: a review of methods. Med Decis Making. 2017 May;37(4):377–90.
Rutherford MJ, Lambert PC, Sweeting MJ, Pennington R, Crowther MJ, Abrams KR, Latimer NR. NICE DSU Technical Support Document 21. Flexible Methods for Survival Analysis. Department of Health Sciences, University of Leicester, Leicester, UK. 2020 Jan 23:1–97.
Gray J, Sullivan T, Latimer NR, Salter A, Sorich MJ, Ward RL, Karnon J. Extrapolation of survival curves using standard parametric models and flexible parametric spline models: comparisons in large registry cohorts with advanced cancer. Med Decis Making. 2021 Feb;41(2):179–93.
Latimer N. NICE DSU technical support document 14: survival analysis for economic evaluations alongside clinical trials-extrapolation with patient-level data. Rep Decis Support Unit. 2011 Jun.
Vickers A. An evaluation of survival curve extrapolation techniques using long-term observational cancer data. Med Decis Making. 2019 Nov;39(8):926–38.
Bullement A, Willis A, Amin A, Schlichting M, Hatswell AJ, Bharmal M. Evaluation of survival extrapolation in immuno-oncology using multiple pre-planned data cuts: learnings to aid in model selection. BMC Med Res Methodol. 2020 Dec;20(1):1–4.
Davies C, Briggs A, Lorgelly P, Garellick G, Malchau H. The “hazards” of extrapolating survival curves. Med Decis Making. 2013 Apr;33(3):369–80.
Kearns B, Stevenson MD, Triantafyllopoulos K, Manca A. Comparing current and emerging practice models for the extrapolation of survival data: a simulation study and case-study. BMC Med Res Methodol. 2021 Dec;21(1):1–1.
Bullement A, Latimer NR, Gorrod HB. Survival extrapolation in cancer immunotherapy: a validation-based case study. Value in Health. 2019 Mar 1;22(3):276 – 83.
Ouwens MJ, Mukhopadhyay P, Zhang Y, Huang M, Latimer N, Briggs A. Estimating lifetime benefits associated with immuno-oncology therapies: challenges and approaches for overall survival extrapolations. PharmacoEconomics. 2019 Sep;37(9):1129–38.
Gibson E, Koblbauer I, Begum N, Dranitsaris G, Liew D, McEwan P, Monfared AA, Yuan Y, Juarez-Garcia A, Tyas D, Lees M. Modelling the survival outcomes of immuno-oncology drugs in economic evaluations: a systematic approach to data analysis and extrapolation. PharmacoEconomics. 2017 Dec;35(12):1257–70.
Lanitis T, Proskorovsky I, Ambavane A, Hunger M, Zheng Y, Bharmal M, Phatak H. Survival analysis in patients with metastatic merkel cell carcinoma treated with Avelumab. Advances in therapy. 2019 Sep;36(9):2327–41.
Wijermans P, Schaafsma M, Termorshuizen F, Ammerlaan R, Wittebol S, Sinnige H, Zweegman S, van Marwijk Kooy M, Van Der Griend R, Lokhorst H, Sonneveld P. Phase III study of the value of thalidomide added to melphalan plus prednisone in elderly patients with newly diagnosed multiple myeloma: the HOVON 49 Study. Journal of Clinical Oncology. 2010 Jul 1;28(19):3160-6.
Blommestein HM, Franken MG, Uyl-de Groot CA. A practical guide for using registry data to inform decisions about the cost effectiveness of new cancer drugs: lessons learned from the PHAROS registry. PharmacoEconomics. 2015 Jun;33(6):551–60.
Verelst SGR, Blommestein HM, De Groot S, Gonzalez-McQuire S, DeCosta L, de Raad JB, Uyl-de Groot CA, Sonneveld P. Long-term outcomes in patients with multiple myeloma: a retrospective analysis of the Dutch Population-based HAematological Registry for Observational Studies (PHAROS). Hemasphere 2018 May 4;2(4):e45. doi: https://doi.org/10.1097/HS9.0000000000000045 . PMID: 31723779; PMCID: PMC6746001.
Blommestein H, Uyl-de Groot C, Visser O, Oerlemans S, Verelst S, van den Broek E, Issa D, Aarts M, Louwman M, Sonneveld P, Postuma W, Coebergh JW, van de Poll L, Huijgens P. Impact of new systemic treatments of patients with hematological malignancies in the Netherlands: population-based cohort studies of process and outcome as a basis for assessments of cost-effectiveness. Report, PHAROS, Netherlands; 2014.
Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional‐odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in medicine. 2002 Aug 15;21(15):2175–97.
Royston P, Parmar MK. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statistics in medicine. 2011 Aug 30;30(19):2409-21.
Jackson CH. Flexsurv: a platform for parametric survival modeling in R. Journal of statistical software. 2016 May 12;70.
Palumbo A, Chanan-Khan A, Weisel K, Nooka AK, Masszi T, Beksac M, Spicka I, Hungria V, Munder M, Mateos MV, Mark TM. Daratumumab, bortezomib, and dexamethasone for multiple myeloma. New Engl J Med 2016 Aug 25;375(8):754–66.
Facon T, Kumar SK, Plesner T, Orlowski RZ, Moreau P, Bahlis N, Basu S, Nahi H, Hulin C, Quach H, Goldschmidt H. Daratumumab, lenalidomide, and dexamethasone versus lenalidomide and dexamethasone alone in newly diagnosed multiple myeloma (MAIA): overall survival results from a randomised, open-label, phase 3 trial. The Lancet Oncology. 2021 Nov 1;22(11):1582-96.
Zweegman S, van der Holt B, Mellqvist UH, Salomo M, Bos GM, Levin MD, Visser-Wisselaar H, Hansson M, van der Velden AW, Deenik W, Gruber A. Melphalan, prednisone, and lenalidomide versus melphalan, prednisone, and thalidomide in untreated multiple myeloma. Blood, The Journal of the American Society of Hematology. 2016 Mar 3;127(9):1109-16.
Acknowledgements
This study used data from the Dutch Cancer Registry (IKNL), PHAROS, and the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON). The authors are grateful to the various registration teams of the Netherlands Cancer Registry and HOVON for the data collection and delivery.
The authors received no financial support for this research.
Author information
Authors and Affiliations
Erasmus School of Health Policy and Management, Erasmus University Rotterdam, P.O. Box 1738, Rotterdam, 3000 DR, The Netherlands
LJ Bakker, FW Thielen, WK Redekop, CA Uyl-de Groot & HM Blommestein
Erasmus Centre for Health Economics Rotterdam, Erasmus University, Rotterdam, The Netherlands
Contributions
Concept and design: LB, HB, FT, CUG, WR. Acquisition of data: LB, HB. Analysis and interpretation of data: HB, LB, WR, FT, CUG. Drafting of the manuscript: HB, LB, WR, FT, CUG. Critical revision of the paper for important intellectual content: HB, LB, WR, FT, CUG. Statistical analysis: LB. Supervision: HB, CUG.
Corresponding author
Correspondence to LJ Bakker .
Ethics declarations
Ethical approval and consent to participate.
Neither obtaining informed consent from patients nor approval by a medical ethics committee is obligatory for this type of observational study containing no directly identifiable data (art. 9.2 sub j General Data Protection Regulation, art. 24 Dutch GDPR Implementation Act jo). Administrative permission for use of the anonymized data from the Netherlands Cancer Registry (NCR) was granted through the supervisory committee of the NCR. Administrative permission for use of the anonymized data from the Dutch Haemato-oncology Foundation for Adults in the Netherlands (HOVON) was granted by the HOVON executive board and the HOVON multiple myeloma working group. All data provided to the researchers by the NCR and HOVON were in anonymized format. This study was conducted according to the principles of the Declaration of Helsinki.
Consent for publication
Not applicable.
Competing interests
FT reports previous consultation for AstraZeneca, Optimax Access, and Dark Peak Analytics, and grants from Celgene outside the submitted work; previous and ongoing research was or is partly funded by CADTH (Canadian Agency for Drugs and Technologies in Health), the Dutch Ministry of Health, Welfare and Sport, and the European Haematology Association. HB reports previous research grants from BMS (Celgene BV) and an advisory board fee from Pfizer, outside the submitted work and paid to the institute; previous and ongoing research was or is partly funded by CADTH, the Dutch Healthcare Institute, and Medical Delta. LB reports previous and ongoing research grants from the European H2020 Research Programme and the Convergence Program outside the submitted work. WR reports previous and ongoing research grants from the European H2020 Research Programme and the Convergence Program outside the submitted work. CUG reports unrestricted grants from Boehringer Ingelheim, Astellas, Sanofi, Janssen-Cilag, Bayer, Amgen, Merck, Gilead, Novartis, AstraZeneca, and Roche, and grants from European Research Programmes, CADTH, the Dutch Healthcare Institute, the European Haematology Association, and the Dutch Ministry of Health. All grants were outside the submitted work.
Additional information
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary Material 1
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article.
Bakker, L., Thielen, F., Redekop, W. et al. Extrapolating empirical long-term survival data: the impact of updated follow-up data and parametric extrapolation methods on survival estimates in multiple myeloma. BMC Med Res Methodol 23 , 132 (2023). https://doi.org/10.1186/s12874-023-01952-2
Received : 09 January 2023
Accepted : 16 May 2023
Published : 29 May 2023
DOI : https://doi.org/10.1186/s12874-023-01952-2
- Parametric Extrapolation
- Multiple myeloma
- Kaplan-Meier
Chapter 15 Extrapolation of Animal Research Data to Humans: An Analysis of the Evidence
The ethical arguments against animal experimentation remain as strong as ever. In addition, the scientific case against the use of animals in research grows more compelling, with exponential progress in the development of alternative methods and new research technologies. The Dutch authorities recently announced an ambitious, but welcome, proposal to phase out "the use of laboratory animals in regulatory safety testing of chemicals, food ingredients, pesticides and (veterinary) medicines" by 2025, as well as "the use of laboratory animals for the release of biological products, such as vaccines" (Netherlands National Committee for the protection of animals used for scientific purposes, NCad, 2016, p. 3). National government departments (e.g., the United Kingdom (UK) Home Office) have stated that alternatives to animals are now considered necessary for scientific as much as ethical reasons, also conceding that pressure exists within the research community to use animals in order to get published. Furthermore, only 20% of animal tests across the European Union (EU) each year are conducted to meet regulatory requirements, with the vast majority carried out as basic research (including basic medical research) or breeding of genetically modified (GM) animals at academic institutions (European Commission, 2013b).
Despite the strength of both scientific and moral arguments, animal research continues to increase worldwide, especially given the rising use of GM animals. A Catch-22 situation also exists, with regulators largely refusing to break with tradition and continuing to accept only animal data, even when robust human-based data exist. Additionally, when new animal-free, human-relevant methods are developed, regulators often insist that research still be performed on animals; this is considered one of the major barriers to achieving change and, in turn, results in an industry reluctant to invest in non-animal research if its results are unlikely to be accepted (Schiffelers et al., 2012).
Whilst public engagement, via campaigns to highlight animal suffering, remains vital, a renewed focus on scientific, political, and financial interests is needed. This focus should emphasize the fundamental message that animal research simply does not deliver what is needed, in order to influence those who regulate, finance, or approve animal experiments and to have a meaningful impact on their ongoing reduction but, primarily, their replacement. Scientific evidence of the inadequacy of animal experiments in predicting human outcomes is needed on an ongoing basis, combined with a focus on the modern, non-animal techniques that have the potential to replace them, to drive recognition of the need for genuine, significant investment in human-relevant research. Additionally, not all animal tests need replacing; many can simply end, so providing appropriate evidence about these types of tests is also essential.
In striving to achieve a paradigm shift to end animal experimentation, for scientific as much as ethical reasons, an evidence-based approach is required. There remains a vital need for a combination of drivers in innovative, animal-free scientific research, training, and education, as well as continued lobbying and campaigning to key stakeholders (i.e., scientists, regulators, and political audiences).
Animal experimentation falls into two broad categories: basic research (including basic medical research) and a relatively smaller category, toxicity (or safety) testing of new substances, which includes chemicals for use in personal care, household products, industrial substances, foodstuffs, or pharmaceuticals (the latter are also tested for efficacy). There is overlap, to some extent, in these categories, with some animal procedures categorized as “fundamental toxicology”, for example. A two-fold strategy is suggested to end the use of animals in all experimental research. The first should focus on how a large number of procedures performed, both in basic research and product-safety testing, can simply end today; in other words, they do not need non-animal replacements. The second should focus on procedures that are considered to require replacement. This could be through intelligent and strategic combinations of existing non-animal tests (integrated testing strategies) and/or further development of new human-relevant models. Examples of these and their success in replacing animals to date are discussed later in this chapter.
A popular argument in support of continuing animal research is that animals have been used for decades in the research and development of new medicines. The fact that millions of animals have been used over the years, often in the same repeated experiments, is not in dispute. However, their continued use does not prove necessity. It is also relevant to note that, from early on in a scientific career, one is discouraged from saying that experiments "didn't work" and instead encouraged to conclude that further research or new approaches must be tried next, in light of unsuccessful or unexpected results. The use of animals has been grandfathered in through convention, anecdotal evidence, or belief, rather than robust scientific validity. "We must use a living system" … but it is the wrong living system, and no matter how many animals are used, they will never provide an appropriate model for humans. This needs to change, particularly when considering the growing industry of breeding and supplying millions of GM animals worldwide each year in repeated attempts to mimic the human condition.
The vast majority of animals are used either for basic research or for breeding of GM strains. This is clear when reviewing recent official statistics for the three highest animal-using countries in the EU: the UK, Germany, and France. For example, more than 3.9 million procedures on animals (mice, rats, rabbits, guinea pigs, dogs, horses, cats, non-human primates, pigs, sheep, cattle, birds, xenopus, and fish, among other species) were carried out in the UK in 2016. Of these, 729,390 involved genetically modified animals, including more than 149,000 animals deliberately bred to suffer a harmful phenotype (a deliberately induced condition, such as cancer, a failed immune system, or organ failure, to try to simulate disease in humans). There were also increases in the number of experiments across several species, and a significant number of experiments for ingredients in household products (1700 procedures) to meet industrial chemicals legislation requirements, despite a policy on testing for such purposes (Home Office, 2017). In fact, of the total 3.9 million procedures conducted in the UK in 2016, only 13% were carried out for regulatory purposes. Germany bred 1.2 million GM animals in 2015 (with numbers of harmful-phenotype animals similar to the UK), representing 42% of the 2.8 million animals used annually (Federal Ministry of Food and Agriculture, 2016). Figures reported for France in 2014 show that 1.8 million animals were used; however, the proportion of GM animals was not reported (Ministry of Higher Education & Research, 2016).
Several thousand diseases affect humans. Of these, only 500 currently have FDA-approved treatments available (National Center for Advancing Translational Sciences, 2017). In every discipline of disease research, animals are used on an ongoing basis, yet it is continually reported that the mechanisms of the human conditions investigated in such animals are still not understood. This is because basic research in animals is a demand-driven and self-perpetuating system, with much research being proposed and licensed on the basis of being repetitively performed on animals (often termed "well-established" or "well-documented" models). Such research is neither legally required, nor does it have to be relevant or applicable to human disease to be licensed. Another key barrier to replacing animals, even when scientifically valid alternatives are available, is awareness and acceptance of their use by both researchers and regulators (Ramirez et al., 2015).
The first part of this chapter provides an analysis of the extrapolation of animal studies to humans, by sampling systematic reviews carried out to assess evidence of clinical translation and by incorporating a review of the literature on animal toxicity studies for some well-known, established case-study drugs (e.g., paracetamol, aspirin, penicillin), comparing animal and human findings. The second part addresses drivers for change and the development of animal-free (or rather, human-relevant) research methods, as well as some examples of procedures that do not need replacing because they can simply stop, given that they can logically be avoided or rejected on the basis of a correctly performed (and legally required) harm-benefit assessment. The chapter aims to provide an overview of the above topics and suggestions for the way forward as part of a new paradigm for a global, animal-research-free future.
1 Part 1: Analysis of Abstracts from Systematic Reviews of Animal Studies
To carry out an analysis of systematic reviews on animal experiments, a review of a sample of the available literature was performed. The intention of this analysis was to provide a generally qualitative review of the literature. Two separate sources were used. First, a search in PubMed (National Centre for Biotechnology Information, 2016) was made using the keyword search "systematic review animal studies." This resulted in a total of 163,585 publications. PubMed allows searches to be filtered by Article Type, and selecting "systematic review" further narrowed the results to 8,291 listings, sorted by relevance. Second, the Google Scholar database, searched with the same terms, "systematic review animal studies," for consistency, yielded 2,530,000 results (Google Scholar, 2016). Publication dates ranged from 1999 to the present. Generally, PubMed provided more recent listings than Google Scholar, which returned older publications; this was useful for providing greater scope for review over the past two decades as well as for avoiding duplication.
To account for time constraints, while still providing a reasonable sample size, the first 50 abstract listings from each source were reviewed, giving a sample total of 100 (see Table 15.1). Publications that appeared in both sources were accounted for, although duplicates were relatively few. Where publications were found not to be relevant, further listings were reviewed to compensate and to maintain a total of 100.