
Statistics By Jim

Making statistics intuitive

What is Power in Statistics?

By Jim Frost

Power in statistics is the probability that a hypothesis test can detect an effect in a sample when it exists in the population. It is the sensitivity of a hypothesis test: when an effect exists in the population, how likely is the test to detect it in your sample?


High statistical power occurs when a hypothesis test is likely to find an effect that exists in the population. A low power test is unlikely to detect that effect.

For example, if statistical power is 80%, a hypothesis test has an 80% chance of detecting an effect that actually exists. Now imagine you’re performing a study that has only 10% power. That’s not good because the test is far more likely to miss the effect.

In this post, learn about statistical power, why it matters, how to increase it, and how to calculate it for a study.

Why Power in Statistics Matters

In all hypothesis tests, the researchers are testing an effect of some sort. It can be the effectiveness of a new medication, the strength of a new product, etc. There is a relationship or difference between groups that the researchers hope to identify. Learn more about Effects in Statistics .

Unfortunately, a hypothesis test can fail to detect an effect even when it does exist. This problem happens more frequently when the test has low statistical power.

Consequently, power is a crucial concept to understand before starting a study. Imagine the scenario where an effect exists in the population, but the test fails to detect it in the sample. Not only have the researchers wasted their time and money on the project, but they’ve also failed to identify an effect that exists, missing out on the benefits it would have provided!

Clearly, researchers want an experimental design that produces high statistical power! Unfortunately, if the design is lacking, a study can be doomed to fail from the start.

Power matters in statistics because you don’t want to spend time and money on a project only to miss an effect that exists! It is vital to estimate the power of a statistical test before beginning a study to help ensure it has a reasonable chance of detecting an effect if one exists.

Statistical Power and Hypothesis Testing Errors

To better understand power in statistics, you first need to know why and how hypothesis tests can make incorrect decisions.

Related post : Overview of Hypothesis Testing

Why do hypothesis tests make errors?

Hypothesis tests use samples to draw conclusions about entire populations. Researchers use these tests because it’s rarely possible to measure a whole population. So, they’re stuck with samples.

Unfortunately, samples don’t always accurately reflect the population. Statisticians define sampling error as the difference between a sample and the target population. Occasionally, this error can be large enough to cause hypothesis tests to draw the wrong conclusions. Consequently, statistical power becomes a crucial issue because increasing it reduces the chance of errors. Learn more about Sampling Error: Definition, Sources & Minimizing .

How do they make errors?

Samples sometimes show effects that don’t exist in the population, or they don’t display effects that do exist. Hypothesis tests try to manage these errors, but they’re not perfect. Statisticians have devised clever names for these two types of errors— Type I and Type II errors!

  • Type I : The hypothesis test rejects a true null hypothesis (false positive).
  • Type II : The hypothesis test fails to reject a false null hypothesis (false negative).

Power in statistics relates only to Type II errors, the false negatives. The effect exists in the population, but the test doesn’t detect it in the sample. Hence, we won’t deal with Type I errors for the rest of this post. If you want to know more about both errors, read my post, Types of Errors in Hypothesis Testing.

The Type II error rate (known as beta or β) is the probability of a false negative for a hypothesis test. The complement of the Type II error rate is the probability of correctly detecting an effect (i.e., a true positive), which is the definition of statistical power. In mathematical terms, statistical power = 1 – β.

For example, if the Type II error rate is 0.2, then statistical power is 1 – 0.2 = 0.8. It logically follows that a lower Type II error rate equates to higher power.

Analysts are typically more interested in estimating power than beta.

How to Increase Statistical Power

Now that you know why power in statistics is essential, how do you ensure that your hypothesis test has high power?

Let’s start by understanding the factors that affect power in statistics. The following conditions increase a hypothesis test’s ability to detect an effect:

  • Larger sample sizes.
  • Larger effect sizes.
  • Lower variability in the population.
  • Higher significance level (alpha) (e.g., 5% → 10%).

Of these factors, researchers typically have the most control over the sample size. Consequently, that’s your go-to method for increasing statistical power.

Effect sizes and variability are often inherent to the subject area you’re studying. Researchers have less control over them than the sample size. However, there might be some steps you can take to increase the effect size (e.g., larger treatments) or reduce the variability (e.g., tightly controlled lab conditions).

Do not choose a significance level to increase statistical power. Instead, set it based on your risk tolerance for a false positive. Usually, you’ll want to leave it at 5% unless you have a compelling reason to change it. To learn more, read my post about Understanding Significance Levels .

Power Analysis

Studies typically want at least 80% power, but sometimes they need even more. How do you plan for a study to have that much capability from the start? Perform a power analysis before collecting data!

A statistical power analysis helps determine how large your sample must be to detect an effect. This process requires entering the following information into your statistical software:

  • Effect size estimate
  • Population variability estimate
  • Statistical power target
  • Significance level

Notice that the effect size and population variability values are estimates. Typically, you’ll produce these estimates through literature reviews and subject-area knowledge. The quality of your power analysis depends on having reasonable estimates!

After entering the required information, your statistical software displays the sample size necessary to achieve your target value for statistical power. I recommend using G*Power for this type of analysis. It’s free!

I’ve written an article about this process in more detail, complete with examples: How to Calculate Sample Size Needed for Power.
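If you prefer code to a point-and-click tool, the same calculation can be scripted. Below is a minimal sketch using Python's statsmodels package (one alternative to the G*Power tool recommended above; the library isn't mentioned in this article). The effect size, alpha, and power target are illustrative assumptions, not values from the post.

```python
# A minimal power-analysis sketch in Python's statsmodels
# (an alternative to G*Power). All inputs are illustrative
# assumptions for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # estimated standardized effect (Cohen's d) from the literature
    alpha=0.05,       # significance level
    power=0.80,       # target statistical power
)
print(f"Required sample size per group: {n_per_group:.1f}")  # ~63.8, so 64 per group
```

G*Power performs the same calculation through a GUI; both should agree on roughly 64 subjects per group for these inputs.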

For readers who are up for a slightly more complex topic, failing to detect an effect is not the only problem with low power studies. When such a study does happen to produce a significant result, it will tend to report an exaggerated effect size! For more information, read Low Power Tests Exaggerate Effect Sizes.

Reader Interactions

May 6, 2022 at 5:08 am

I am a physician, and I am a die-hard fan of Jim’s series on statistics, which has helped me a lot to establish an intuitive understanding of various statistical concepts and procedures. Therefore, thanks in earnest for your fabulous books. I wonder whether you could consider writing another one on modern Bayesian statistical analysis of data, since the frequentist approaches, especially NHST, are now under scathing attack by many statisticians, and many of them strongly advocate using Bayesian methods as an alternative instead. However, as an applied user of statistics with a very weak mathematics background, I find it extremely hard to grasp these Bayesian methods, so I am eager to read such a book written by Jim. Thanks for your consideration.


Statistical Power and Why It Matters | A Simple Introduction

Published on February 16, 2021 by Pritha Bhandari . Revised on June 22, 2023.

Statistical power , or sensitivity, is the likelihood of a significance test detecting an effect when there actually is one.

A true effect is a real, non-zero relationship between variables in a population. An effect is usually indicated by a real difference between groups or a correlation between variables.

High power in a study indicates a large chance of a test detecting a true effect. Low power means that your test only has a small chance of detecting a true effect, or that the results are likely to be distorted by random and systematic error.

Power is mainly influenced by sample size, effect size, and significance level. A power analysis can be used to determine the necessary sample size for a study.

Table of contents

  • Why does power matter in statistics?
  • What is a power analysis?
  • Other factors that affect power
  • How do you increase power?
  • Other interesting articles
  • Frequently asked questions about statistical power

Having enough statistical power is necessary to draw accurate conclusions about a population using sample data.

In hypothesis testing , you start with null and alternative hypotheses : a null hypothesis of no effect and an alternative hypothesis of a true effect (your actual research prediction).

The goal is to collect enough data from a sample to statistically test whether you can reasonably reject the null hypothesis in favor of the alternative hypothesis.

  • Null hypothesis: Spending 10 minutes daily outdoors in a natural environment has no effect on stress in recent college graduates.
  • Alternative hypothesis: Spending 10 minutes daily outdoors in a natural environment will reduce symptoms of stress in recent college graduates.

There’s always a risk of making Type I or Type II errors when interpreting study results:

  • Type I error : rejecting the null hypothesis of no effect when it is actually true.
  • Type II error : not rejecting the null hypothesis of no effect when it is actually false.

In the example above:

  • Type I error : you conclude that spending 10 minutes in nature daily reduces stress when it actually doesn’t.
  • Type II error : you conclude that spending 10 minutes in nature daily doesn’t affect stress when it actually does.

Power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error.

Power is usually set at 80%. This means that if there are true effects to be found in 100 different studies with 80% power, only 80 out of 100 statistical tests will actually detect them.

If you don’t ensure sufficient power, your study may not be able to detect a true effect at all. This means that resources like time and money are wasted, and it may even be unethical to collect data from participants (especially in clinical trials).

On the flip side, too much power means your tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world.

To balance these pros and cons of low versus high statistical power, you should use a power analysis to set an appropriate level.


A power analysis is a calculation that aids you in determining a minimum sample size for your study.

A power analysis is made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.

  • Statistical power: the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
  • Sample size: the minimum number of observations needed to observe an effect of a certain size with a given power level.
  • Significance level (alpha) : the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Expected effect size: a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.

Before starting a study, you can use a power analysis to calculate the minimum sample size for a desired power level, a chosen significance level, and an expected effect size.

Traditionally, the significance level is set to 5% and the desired power level to 80%. That means you only need to figure out an expected effect size to calculate a sample size from a power analysis.

To calculate sample size or perform a power analysis, use online tools or statistical software like G*Power .
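As a hypothetical sketch of how the "know three, solve for the fourth" idea works in software, here is an example in Python's statsmodels (the article itself points to G*Power and online tools). All input values below are assumptions for illustration.

```python
# Hypothetical sketch: fix any three of the four components and solve for
# the fourth. Here sample size, effect size, and alpha are fixed; power is solved.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().solve_power(
    effect_size=0.3,  # assumed expected effect size (Cohen's d)
    nobs1=100,        # sample size per group
    alpha=0.05,       # significance level
    power=None,       # the unknown component, left as None
)
print(f"Achieved power: {power:.2f}")  # ~0.56 -> underpowered at this sample size
```

Leaving any one argument as None asks the solver for that component; passing nobs1=None instead would return the required sample size.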

Sample size

Sample size is positively related to power. A small sample (less than 30 units) may only have low power while a large sample has high power.

Increasing the sample size enhances power, but only up to a point. When you have a large enough sample, every observation that’s added to the sample only marginally increases power. This means that collecting more data will increase the time, costs and efforts of your study without yielding much more benefit.

Your research design is also related to power and sample size:

  • In a within-subjects design , each participant is tested in all treatments of a study, so individual differences will not unevenly affect the outcomes of different treatments.
  • In a between-subjects design , each participant only takes part in a single treatment, so with different participants in each treatment, there is a chance that individual differences can affect the results.

A within-subjects design is more powerful, so fewer participants are needed. More participants are needed in a between-subjects design to establish relationships between variables .

Significance level

The significance level of a study is the Type I error probability, and it’s usually set at 5%. This means your findings have to have a less than 5% chance of occurring under the null hypothesis to be considered statistically significant.

Significance level is correlated with power: increasing the significance level (e.g., from 5% to 10%) increases power. When you decrease the significance level, your significance test becomes more conservative and less sensitive to detecting true effects.

Researchers have to balance the risks of committing Type I and II errors by considering the amount of risk they’re willing to take in making a false positive versus a false negative conclusion.

Effect size

Effect size is the magnitude of a difference between groups or a relationship between variables. It indicates the practical significance of a finding.

While high-powered studies can detect medium and large effects, low-powered studies may only catch large ones.

To determine an expected effect size, you perform a systematic literature review to find similar studies. You narrow down the list of relevant studies to only those that manipulate time spent in nature and use stress as a main measure.

There’s always some sampling error involved when using data from samples to make inferences about populations. This means there’s always a discrepancy between the observed effect size and the true effect size. Effect sizes in a study can vary due to random factors, measurement error, or natural variation in the sample.

Low-powered studies will mostly detect true effects only when those effects are large. That means that, in a low-powered study, any observed effect is more likely to have been boosted by unrelated factors.

If low-powered studies are the norm in a particular field, such as neuroscience , the observed effect sizes will consistently exaggerate or overestimate true effects.

Aside from the four major components, other factors need to be taken into account when determining power.

Variability

The variability of the population characteristics affects the power of your test. High population  variance reduces power.

In other words, using a population that takes on a large range of values for a variable will lower the sensitivity of your test, while using a population where the variable is relatively narrowly distributed will heighten the sensitivity of the test.

Using a fairly specific population with defined demographic characteristics can lower the spread of the variable of interest and improve power.

Measurement error

Measurement error is the difference between the true value and the observed or recorded value of something. Measurements can only be as precise as the instruments and researchers that measure them, so some error is almost always present.

The higher the measurement error in a study, the lower the statistical power of a test. Measurement error can be random or systematic:

  • Random errors are unpredictable and unevenly alter measurements due to chance factors (e.g., mood changes can influence survey responses, or having a bad day may lead to researchers misrecording observations).
  • Systematic errors affect data in predictable ways from one measurement to the next (e.g., an incorrectly calibrated device will consistently record inaccurate data, or problematic survey questions may lead to biased responses).

Since many research aspects directly or indirectly influence power, there are various ways to improve power. While some of these can usually be implemented, others are costly or involve a tradeoff with other important considerations.

Increase the effect size. To increase the expected effect in an experiment, you could manipulate your independent variable more widely (e.g., spending 1 hour instead of 10 minutes in nature) to increase the effect on the dependent variable (stress level). This may not always be possible because there are limits to how much the outcomes in an experiment may vary.

Increase sample size. Based on sample size calculations, you may have room to increase your sample size while still meaningfully improving power. But there is a point at which increasing your sample size may not yield high enough benefits.

Increase the significance level. While this makes a test more sensitive to detecting true effects, it also increases the risk of making a Type I error.

Reduce measurement error. Increasing the precision and accuracy of your measurement devices and procedures reduces variability, improving reliability and power.  Using multiple measures or methods, known as triangulation , can also help reduce systematic research bias .

Use a one-tailed test instead of a two-tailed test. When using a t test or z tests, a one-tailed test has higher power. However, a one-tailed test should only be used when there’s a strong reason to expect an effect in a specific direction (e.g., one mean score will be higher than the other), because it won’t be able to detect an effect in the other direction. In contrast, a two-tailed test is able to detect an effect in either direction.
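To make the one- versus two-tailed comparison concrete, here is a small sketch under a normal approximation; the standardized effect of 2.5 standard errors is an assumed value.

```python
# Rough one- vs two-tailed power comparison for a z test under a normal
# approximation; the standardized effect (2.5 standard errors) is an assumption.
from scipy.stats import norm

alpha = 0.05
z_effect = 2.5  # assumed true effect, in standard-error units

power_one_tailed = norm.sf(norm.ppf(1 - alpha) - z_effect)      # ~0.80
power_two_tailed = norm.sf(norm.ppf(1 - alpha / 2) - z_effect)  # ~0.71 (far tail ignored)

print(f"one-tailed power: {power_one_tailed:.3f}")
print(f"two-tailed power: {power_two_tailed:.3f}")
```

The one-tailed test gains power (about 0.80 versus 0.71 here) precisely because it spends all of α on one side, which is also why it cannot detect an effect in the other direction.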


If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient
  • Null hypothesis

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is less likely to produce a false negative (a Type II error).

If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

A power analysis is a calculation that helps you determine a minimum sample size for your study. It’s made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.

  • Statistical power : the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
  • Sample size : the minimum number of observations needed to observe an effect of a certain size with a given power level.
  • Significance level (alpha) : the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Expected effect size : a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.

There are various ways to improve power:

  • Increase the potential effect size by manipulating your independent variable more strongly,
  • Increase sample size,
  • Increase the significance level (alpha),
  • Reduce measurement error by increasing the precision and accuracy of your measurement devices and procedures,
  • Use a one-tailed test instead of a two-tailed test for t tests and z tests.


Statistical Power: What it is, How to Calculate it

In order to follow this article, you may want to read these articles first: What is a Hypothesis Test? What are Type I and Type II Errors?

What is Power?

The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from chance. It’s the likelihood that the test correctly rejects the null hypothesis when your research hypothesis is true. For example, a study with 80% power has an 80% chance of producing statistically significant results when the effect it is testing for actually exists.

  • A high statistical power means that the test results are likely valid: as the power increases, the probability of making a Type II error decreases.
  • A low statistical power means that a nonsignificant result is questionable, because the test may simply have missed a real effect.

Statistical power helps you to determine if your sample size is large enough. It is possible to perform a hypothesis test without calculating the statistical power, but if your sample size is too small, your results may be inconclusive when they could have been conclusive with a large enough sample.

Statistical Power and Beta

Beta (β) is the probability that you won’t reject the null hypothesis when it is false. Statistical power is the complement of this probability: power = 1 − β.

How to Calculate Statistical Power

Statistical power is quite complex to calculate by hand. This article on MoreSteam explains it well.

Software is normally used to calculate the power.

  • Calculate power in SAS .
  • Calculate power in PASS.

Power Analysis

Power analysis is a method for finding statistical power: the probability of finding an effect, assuming that the effect is actually there. To put it another way, power is the probability of rejecting a null hypothesis when it’s false. Note that power is distinct from a Type II error, which happens when you fail to reject a false null hypothesis; power is your probability of not making a Type II error.

A Simple Example of Power Analysis

Let’s say you were conducting a drug trial and that the drug works. You run a series of trials with the effective drug and a placebo. If you had a power of .9, that means 90% of the time you would get a statistically significant result. In 10% of the cases, your results would not be statistically significant. The power in this case tells you the probability of finding a difference between the two means, which is 90%. But 10% of the time, you wouldn’t find a difference.
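One way to see where a number like 90% comes from is simulation. The sketch below repeatedly simulates the trial under the assumption that the drug truly works, and counts how often a t test reaches significance; the group size and standardized effect are assumptions chosen to give roughly 90% power.

```python
# Monte Carlo sketch of the drug-trial example: simulate many trials in which
# the drug truly works and count how often a t test is significant. The group
# size (86) and effect (0.5 SD) are assumptions chosen to give ~90% power.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, n_per_group, true_effect = 5000, 86, 0.5

significant = 0
for _ in range(n_sims):
    placebo = rng.normal(0.0, 1.0, n_per_group)       # control group outcomes
    drug = rng.normal(true_effect, 1.0, n_per_group)  # treated group outcomes
    if ttest_ind(drug, placebo).pvalue < 0.05:
        significant += 1

print(f"Estimated power: {significant / n_sims:.2f}")  # ~0.90
```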

Reasons to run a Power Analysis

You can run a power analysis for many reasons, including:

  • To find the number of trials needed to get an effect of a certain size. This is probably the most common use for power analysis: it tells you how many trials you need to avoid a Type II error (failing to detect a real effect).
  • To find the power, given an effect size and the number of trials available. This is often useful when you have a limited budget for, say, 100 trials, and you want to know if that number of trials is enough to detect an effect.
  • To validate your research. Conducting a power analysis is, simply put, good science.

Calculating power is complex and is almost always performed with a computer. You can find a list of links to online power calculators here.



Teach yourself statistics

Power of a Hypothesis Test

The probability of not committing a Type II error is called the power of a hypothesis test.

Effect Size

To compute the power of the test, one offers an alternative view about the "true" value of the population parameter, assuming that the null hypothesis is false. The effect size is the difference between the true value and the value specified in the null hypothesis.

Effect size = True value - Hypothesized value

For example, suppose the null hypothesis states that a population mean is equal to 100. A researcher might ask: What is the probability of rejecting the null hypothesis if the true population mean is equal to 90? In this example, the effect size would be 90 - 100, which equals -10.
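To turn that effect size into a power figure, you also need the population standard deviation and the sample size, which this example does not specify. The sketch below assumes σ = 20 and n = 25 purely for illustration.

```python
# Power for the example above (H0: mean = 100, true mean = 90, effect size -10).
# The population sigma (20) and sample size (25) are assumed values that the
# original example does not specify.
import math
from scipy.stats import norm

mu0, mu_true, sigma, n, alpha = 100, 90, 20, 25, 0.05

se = sigma / math.sqrt(n)          # standard error of the sample mean
z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value (~1.96)
shift = abs(mu_true - mu0) / se    # effect size in standard-error units (2.5)

# probability the sample mean falls in either rejection region
power = norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)
print(f"Power to detect the effect: {power:.3f}")  # ~0.705
```

With those assumptions, the test has about a 70% chance of rejecting the null hypothesis when the true mean is 90.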

Factors That Affect Power

The power of a hypothesis test is affected by three factors.

  • Sample size ( n ). Other things being equal, the greater the sample size, the greater the power of the test.
  • Significance level (α). The lower the significance level, the lower the power of the test. If you reduce the significance level (e.g., from 0.05 to 0.01), the region of acceptance gets bigger. As a result, you are less likely to reject the null hypothesis. This means you are less likely to reject the null hypothesis when it is false, so you are more likely to make a Type II error. In short, the power of the test is reduced when you reduce the significance level; and vice versa.
  • The "true" value of the parameter being tested. The greater the difference between the "true" value of a parameter and the value specified in the null hypothesis, the greater the power of the test. That is, the greater the effect size, the greater the power of the test.

Test Your Understanding

Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.
II. Changing the significance level from 0.01 to 0.05.
III. Increasing beta, the probability of a Type II error.

(A) I only (B) II only (C) III only (D) All of the above (E) None of the above

The correct answer is (C). Increasing sample size makes the hypothesis test more sensitive - more likely to reject the null hypothesis when it is, in fact, false. Changing the significance level from 0.01 to 0.05 makes the region of acceptance smaller, which makes the hypothesis test more likely to reject the null hypothesis, thus increasing the power of the test. Since, by definition, power is equal to one minus beta, the power of a test will get smaller as beta gets bigger.

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.
II. The effect size of the hypothesis test.
III. The probability of making a Type II error.

The correct answer is (A). Increasing sample size makes the hypothesis test more sensitive - more likely to reject the null hypothesis when it is, in fact, false. Thus, it increases the power of the test. The effect size is not affected by sample size. And the probability of making a Type II error gets smaller, not bigger, as sample size increases.

Power Analysis

By Manuel C. Voelkle and Edgar Erdfelder

Synonyms: probability of a true positive decision; sensitivity

The power of a statistical hypothesis test is the probability of rejecting the null hypothesis given that the null hypothesis is in fact false.

Description

There are four possible outcomes of a statistical hypothesis test: (1) the null hypothesis is maintained given that it is in fact true (a true negative decision); (2) the null hypothesis is rejected even though it is true (a false positive decision or Type I error); (3) the null hypothesis is maintained even though it is false (a false negative decision or Type II error); and (4) the null hypothesis is rejected given that it is in fact false (a true positive decision). The probabilities of Type I and Type II errors are often denoted by the Greek letters α and β, respectively. Accordingly, the power (i.e., the probability of a true positive decision, also referred to as the sensitivity of a test) is 1 − β, whereas 1 − α denotes the probability of a true negative decision.


References

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Berger, M. P. F., & Wong, W. K. (2009). An introduction to optimal design for social and biomedical research. Chichester: Wiley.

Brandmaier, A. M., von Oertzen, T., Ghisletta, P., Hertzog, C., & Lindenberger, U. (2015). LIFESPAN: A tool for the computer-aided design of longitudinal studies. Frontiers in Psychology, 6, 272. https://doi.org/10.3389/fpsyg.2015.00272.

Champely, S. (2020). pwr: Basic functions for power analysis. (Version 1.3-0) . Retrieved from https://CRAN.R-project.org/package=pwr

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65 (3), 145–153. https://doi.org/10.1037/h0045186 .

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Erlbaum.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112 (1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155 .

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Hillsdale: Erlbaum.

Erdfelder, E., Faul, F., Buchner, A., & Cüpper, L. (2010). Effektgröße und Teststärke. In H. Holling & B. Schmitz (Eds.), Handbuch der Psychologischen Methoden und Evaluation (pp. 358–369). Göttingen: Hogrefe.

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39 (2), 175–191. https://doi.org/10.3758/BF03193146 .

Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41 (4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149 .

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power. The American Statistician, 55 (1), 19–24. https://doi.org/10.1198/000313001300339897 .

Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25 (1), 178–206. https://doi.org/10.3758/s13423-016-1221-4 .

Liu, X., & Wang, L. (2019). Sample size planning for detecting mediation effects: A power analysis procedure considering uncertainty in effect size estimates. Multivariate Behavioral Research, 54 (6), 822–839. https://doi.org/10.1080/00273171.2019.1593814 .

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9 (2), 147–163. https://doi.org/10.1037/1082-989X.9.2.147 .

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59 , 537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735 .

Maxwell, S. E., Delaney, H. D., & Kelley, K. (2018). Designing experiments and analyzing data: a model comparison perspective (3rd ed.). New York: Routledge.

Moshagen, M., & Erdfelder, E. (2016). A new strategy for testing structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 23 , 54–60. https://doi.org/10.1080/10705511.2014.950896 .

Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9 (4), 599–620. https://doi.org/10.1207/S15328007SEM0904_8 .

Onwuegbuzie, A. J., & Leech, N. L. (2004). Post hoc power: A concept whose time has come. Understanding Statistics, 3 (4), 201–230. https://doi.org/10.1207/s15328031us0304_1 .

Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. (2001). Monte Carlo experiments: Design and implementation. Structural Equation Modeling: A Multidisciplinary Journal, 8 , 287–312. https://doi.org/10.1207/S15328007SEM0802_7 .

R Core Team. (2021). R: A language and environment for statistical computing . Vienna: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/ .

Schnuerch, M., & Erdfelder, E. (2020). Controlling decision errors with minimal costs: The sequential probability ratio t test. Psychological Methods, 25 (2), 206–226. https://doi.org/10.1037/met0000234 .

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105 (2), 309–316. https://doi.org/10.1037/0033-2909.105.2.309 .

von Oertzen, T. (2010). Power equivalence in structural equation modelling. British Journal of Mathematical and Statistical Psychology, 63 , 257–272. https://doi.org/10.1348/000711009X441021 .

Wald, A. (1947). Sequential analysis . New York: Wiley.



In Brief: Statistics in Brief: Statistical Power: What Is It and When Should It Be Used?

Frederick J. Dorey

Department of Pediatrics, Children’s Hospital Los Angeles, 4650 Sunset Blvd, Mailstop 54, Los Angeles, CA 90027 USA

Although any report formally testing a hypothesis should include an associated p value and confidence interval, another statistical concept that is in some ways more important is the power of a study. Unlike the p value and confidence interval, the issue of power should be considered before even embarking on a clinical study.

What is statistical power, when should it be used, and what information is needed for calculating power?

Like the p value, the power is a conditional probability. In a hypothesis test, the alternative hypothesis is the statement that the null hypothesis is false. If the alternative hypothesis is actually true, the power is the probability that one will correctly reject the null hypothesis. The most meaningful application of statistical power is to decide before initiation of a clinical study whether it is worth doing, given the needed effort, cost, and, in the case of clinical experiments, patient involvement. A hypothesis test with little power will likely yield large p values and large confidence intervals. Thus, when the power of a proposed study is low, even when there are real differences between the treatments under investigation, the most likely result will be that there is not enough evidence to reject the H0, and meaningful clinical differences will remain in question. In that situation, a reasonable question to ask would be: was the study worth the needed time and effort to get so little additional information?

The usual question asked involving statistical power is: what sample size will result in a reasonable power (however defined) for the primary hypothesis being investigated. In many cases however, a more realistic question would be: what will the statistical power be for the important hypothesis tests, given the most likely sample size that can be obtained during the duration of the proposed study?

For any given statistical procedure and significance level, there are three statistical concepts closely related to each other. These are the sample size, effect size, and power. If you know any two of them, the third can be determined. To determine the effect size the investigator first must estimate the magnitude of the minimum clinically important difference (MCID) that the experiment is designed to detect. This value then is divided by an estimate of the variability of the data as interpretation of numbers only makes sense relative to the variability of the estimated parameters. Although investigators usually can provide a reasonable estimate of the MCID for a study, they frequently have little idea about the variability of their data. In many cases the standard deviation of the control group will provide a good estimate of that variability. As intuitively it should be easier to determine if two groups differ by a large rather than a small clinically meaningful difference, it follows that a larger effect size usually will result in more power. Also, a larger sample size results in more precision of the parameters being estimated thus resulting in more power as the estimates are more likely to be closer to the true values in the target population. (A more-detailed article by Biau et al. [ 1 ] discusses the relationships between power and sample size along with examples.)
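The sample size/effect size/power relationship described above can be made concrete with the standard normal-approximation formula for comparing two group means. In the sketch below, the MCID and standard deviation are illustrative assumptions, not values from the article.

```python
# Sketch of the sample-size / effect-size / power relationship using the
# standard normal-approximation formula for comparing two group means.
# The MCID and standard deviation are illustrative assumptions.
from scipy.stats import norm

mcid, sd = 5.0, 10.0    # minimum clinically important difference and data SD
alpha, power = 0.05, 0.80

d = mcid / sd                      # standardized effect size (0.5)
z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~0.84

n_per_group = 2 * (z_alpha + z_beta) ** 2 / d**2
print(f"Approximate n per group: {n_per_group:.0f}")  # ~63
```

Dividing the MCID by the standard deviation is exactly the standardization step the author describes; because n is proportional to 1/d², halving the effect size quadruples the required sample size.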

For power calculations to be meaningful, it first is necessary to decide on the proper effect size. The effect size must be decided first because, for any proposed sample size, an effect size always can be chosen that will result in any desired power. In short, the goals of the experiment alone should determine the effect size. Once a study has been completed and analyzed, the confidence interval reveals how much, or little, has been learned and the power will not contribute any meaningful additional information. In a detailed discussion of post hoc power calculations in general, Hoenig and Heisey [ 2 ] showed that if a hypothesis test has been performed with a resulting p value greater than the 5% significance level, then the power for detecting the observed difference will only be approximately 50% or less. However, it can be verified easily with examples that hypothesis tests resulting in very small p values (such as 0.015) could still have a post hoc power even less than 70%; in such a case it is difficult to see how a post hoc power calculation will contribute any more information than what already is known.
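Hoenig and Heisey's roughly-50% figure is easy to reproduce with a normal approximation: if the observed two-sided p value exactly equals the significance level, the post hoc power computed at the observed effect is about one half. A hypothetical sketch:

```python
# Reproducing the ~50% figure with a normal approximation: when the observed
# two-sided p value equals the significance level, post hoc power computed at
# the observed effect is about one half (the far rejection tail is negligible).
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)           # rejection threshold (~1.96)

p_observed = 0.05                          # hypothetical observed p value
z_observed = norm.ppf(1 - p_observed / 2)  # equals z_crit when p == alpha

post_hoc_power = norm.sf(z_crit - z_observed)  # norm.sf(0) = 0.5
print(f"Post hoc power at p = 0.05: {post_hoc_power:.2f}")  # 0.50
```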

There is a very nice relationship between the concepts of hypothesis testing and diagnostic testing. Let the null hypothesis represent the absence of a given disease, the alternative hypothesis represent the presence of the disease, and the rejection of the null hypothesis represent having a positive diagnostic test. With these assumptions, the power is simply equivalent to the sensitivity of the test (the probability the test is positive when the disease is present). In addition, the significance level is equivalent to one minus the specificity of the test, or in other words, the error you are willing to risk of falsely rejecting the null hypothesis simply corresponds to the probability of getting a positive test among patients without the disease.

Myths and Misconceptions

As discussed above the notion of power after the data have been collected does not provide very much additional information about the hypothesis test results. This is illustrated by considering the experiment of flipping a coin 10 times to see if the coin is fair, that is, the probability of heads is 0.5. Suppose you flip the coin 10 times and you get 10 heads. This experiment with only 10 flips has very little power for testing if the coin is fair. However the p value for obtaining 10 heads in 10 flips with a fair coin (the null hypothesis) is very small, so the null hypothesis certainly will be rejected. Thus, even though the experiment has little power, it does not change the fact that an experiment has been conducted and provided convincing evidence that the coin is biased in favor of heads. I do not recommend that you bet on tails.
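For the record, the "very small" p value in the coin example can be computed exactly; a sketch using scipy (binomtest requires scipy 1.7 or later):

```python
# The coin example in numbers: the exact two-sided p value for 10 heads in
# 10 flips of a fair coin (binomtest requires scipy >= 1.7).
from scipy.stats import binomtest

result = binomtest(k=10, n=10, p=0.5)
print(f"p value: {result.pvalue:.5f}")  # ~0.00195, so we reject 'the coin is fair'
```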

Another myth is that the power always has to be at least 80% or greater. That might be a reasonable expectation for a clinical study potentially involving great inconvenience or risk to patients. However in a laboratory study or a retrospective correlation study, there is usually no necessity for the power to be that high.

Conclusions

The concept of statistical power should be used before initiating a study to help determine whether it is reasonable and ethical to proceed with a study. Calculation of statistical power also sometimes is useful post hoc when statistically insignificant but potentially clinically important trends are noted, say in the study of two treatments for cancer. Such post hoc tests can inform the reader or future researchers how many patients might be needed to show statistical differences. The power and effect size needed for a study to be reasonable also will depend on the medical question being asked and the information already available in the literature.


Example

The variance of the sample mean is related to the population variance through the relationship σ²/n, and because the sample variance s² estimates the population variance σ², s²/n estimates σ²/n. The square root of this variance, s/√n, is the standard deviation of the sample mean, also called the standard error of the mean (SEM).

To test the null hypothesis (H0: μ = 6, σ² = 2.5), we accepted a type 1 error probability (α) of 0.05 and calculated two sample means that formed the boundaries of the rejection regions. Suppose we instead set α at 0.20, a one in five chance of making a type 1 error. (Admittedly, 0.20 may be a larger probability than we'd accept comfortably in practice.) Raising α enlarges the rejection region: inserting this value for α yields critical values of 5.308 and 6.692 for the sample means that bound the rejection regions. With a larger rejection region, we are more likely to draw a sample mean that leads us to reject the null hypothesis; of course, we are also more likely to reject the null mistakenly.

If we draw a sample of 10 observations to test a slightly different null hypothesis, one in which we estimate the population variance to be 16, the same calculation yields values of 3.139 and 8.861 for the sample means that bound the rejection regions: with more variable data, a sample mean must fall farther from the hypothesized mean of 6 before we can reject the null hypothesis. By comparison, testing the original null hypothesis (H0: μ = 6, σ² = 2.5) by setting α at 0.05 and drawing a sample of 10 observations gives rejection regions bounded by sample means of 4.87 and 7.13.

Drawing a larger sample, say one with 50 observations, estimates more precisely the mean of the population from which we've drawn the sample: the larger the sample, the smaller the sample mean's SEM. When we estimate the population mean more precisely, a sample mean need not be as distant from the hypothesized mean to cause us to reject the null hypothesis. For n = 50, the calculation yields critical values of 5.551 and 6.449 for the sample means that bound the rejection regions, much closer to the hypothesized mean of 6 than the values (4.87 and 7.13) that we calculated for a sample of 10.
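The critical values quoted above can be reproduced in a few lines, assuming they come from a t distribution with n − 1 degrees of freedom (an assumption, but one that matches the quoted figures):

```python
# Reproducing the critical values quoted above, assuming they come from a
# t distribution with n - 1 degrees of freedom (which matches the figures).
import math
from scipy.stats import t

def rejection_bounds(mu0, variance, n, alpha):
    sem = math.sqrt(variance / n)            # standard error of the mean
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value
    return mu0 - t_crit * sem, mu0 + t_crit * sem

print(rejection_bounds(6, 2.5, 10, 0.05))   # ~(4.87, 7.13)
print(rejection_bounds(6, 2.5, 10, 0.20))   # ~(5.31, 6.69)
print(rejection_bounds(6, 16.0, 10, 0.05))  # ~(3.14, 8.86)
print(rejection_bounds(6, 2.5, 50, 0.05))   # ~(5.55, 6.45)
```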

Power in Tests of Significance

Teaching students the concept of power in tests of significance can be daunting. Happily, the AP Statistics curriculum requires students to understand only the concept of power and what affects it; they are not expected to compute the power of a test of significance against a particular alternate hypothesis.

What Does Power Mean?

The easiest definition for students to understand is: power is the probability of correctly rejecting the null hypothesis. We’re typically only interested in the power of a test when the null is in fact false. This definition also makes it more clear that power is a conditional probability: the null hypothesis makes a statement about parameter values, but the power of the test is conditional upon what the values of those parameters really are.

The following tree diagram may help students appreciate the fact that α, β, and power are all conditional probabilities.

Figure 1: Reality to Decision

Power may be expressed in several different ways, and it might be worthwhile sharing more than one of them with your students, as one definition may “click” with a student where another does not. Here are a few different ways to describe what power is:

  • Power is the probability of rejecting the null hypothesis when in fact it is false.
  • Power is the probability of making a correct decision (to reject the null hypothesis) when the null hypothesis is false.
  • Power is the probability that a test of significance will pick up on an effect that is present.
  • Power is the probability that a test of significance will detect a deviation from the null hypothesis, should such a deviation exist.
  • Power is the probability of avoiding a Type II error.

To help students better grasp the concept, I continually restate what power means with different language each time. For example, if we are doing a test of significance at level α = 0.1, I might say, “That’s a pretty big alpha level. This test is ready to reject the null at the drop of a hat. Is this a very powerful test?” (Yes, it is. Or at least, it’s more powerful than it would be with a smaller alpha value.) Another example: If a student says that the consequences of a Type II error are very severe, then I may follow up with “So you really want to avoid Type II errors, huh? What does that say about what we require of our test of significance?” (We want a very powerful test.)

What Affects Power?

There are four things that primarily affect the power of a test of significance. They are:

  • The significance level α of the test. If all other things are held constant, then as α increases, so does the power of the test. This is because a larger α means a larger rejection region for the test and thus a greater probability of rejecting the null hypothesis. That translates to a more powerful test. The price of this increased power is that as α goes up, so does the probability of a Type I error should the null hypothesis in fact be true.
  • The sample size n . As n increases, so does the power of the significance test. This is because a larger sample size narrows the distribution of the test statistic. The hypothesized distribution of the test statistic and the true distribution of the test statistic (should the null hypothesis in fact be false) become more distinct from one another as they become narrower, so it becomes easier to tell whether the observed statistic comes from one distribution or the other. The price paid for this increase in power is the higher cost in time and resources required for collecting more data. There is usually a sort of “point of diminishing returns” up to which it is worth the cost of the data to gain more power, but beyond which the extra power is not worth the price.
  • The inherent variability in the measured response variable. As the variability increases, the power of the test of significance decreases. One way to think of this is that a test of significance is like trying to detect the presence of a “signal,” such as the effect of a treatment, and the inherent variability in the response variable is “noise” that will drown out the signal if it is too great. Researchers can’t completely control the variability in the response variable, but they can sometimes reduce it through especially careful data collection and conscientiously uniform handling of experimental units or subjects. The design of a study may also reduce unexplained variability, and one primary reason for choosing such a design is that it allows for increased power without necessarily having exorbitantly costly sample sizes. For example, a matched-pairs design usually reduces unexplained variability by “subtracting out” some of the variability that individual subjects bring to a study. Researchers may do a preliminary study before conducting a full-blown study intended for publication. There are several reasons for this, but one of the more important ones is so researchers can assess the inherent variability within the populations they are studying. An estimate of that variability allows them to determine the sample size they will require for a future test having a desired power. A test lacking statistical power could easily result in a costly study that produces no significant findings.
  • The difference between the hypothesized value of a parameter and its true value. This is sometimes called the “magnitude of the effect” in the case when the parameter of interest is the difference between parameter values (say, means) for two treatment groups. The larger the effect, the more powerful the test is. This is because when the effect is large, the true distribution of the test statistic is far from its hypothesized distribution, so the two distributions are distinct, and it’s easy to tell which one an observation came from. The intuitive idea is simply that it’s easier to detect a large effect than a small one. This principle has two consequences that students should understand, and that are essentially two sides of the same coin. On the one hand, it’s important to understand that a subtle but important effect (say, a modest increase in the life-saving ability of a hypertension treatment) may be demonstrable but could require a powerful test with a large sample size to produce statistical significance. On the other hand, a small, unimportant effect may be demonstrated with a high degree of statistical significance if the sample size is large enough. Because of this, too much power can almost be a bad thing, at least so long as many people continue to misunderstand the meaning of statistical significance. For your students to appreciate this aspect of power, they must understand that statistical significance is a measure of the strength of evidence of the presence of an effect. It is not a measure of the magnitude of the effect. For that, statisticians would construct a confidence interval.

Two Classroom Activities

The two activities described below are similar in nature. The first one relates power to the “magnitude of the effect,” by which I mean here the discrepancy between the (null) hypothesized value of a parameter and its actual value. 2 The second one relates power to sample size. Both are described for classes of about 20 students, but you can modify them as needed for smaller or larger classes or for classes in which you have fewer resources available. Both of these activities involve tests of significance on a single population proportion, but the principles are true for nearly all tests of significance.

Activity 1: Relating Power to the Magnitude of the Effect

In advance of the class, you should prepare 21 bags of poker chips or some other token that comes in more than one color. Each of the bags should have a different number of blue chips in it, ranging from 0 out of 200 to 200 out of 200, by 10s. These bags represent populations with different proportions; label them by the proportion of blue chips in the bag: 0 percent, 5 percent, 10 percent,... , 95 percent, 100 percent. Distribute one bag to each student. Then instruct them to shake their bags well and draw 20 chips at random. Have them count the number of blue chips out of the 20 that they observe in their sample and then perform a test of significance whose null hypothesis is that the bag contains 50 percent blue chips and whose alternate hypothesis is that it does not. They should use a significance level of α = 0.10. It’s fine if they use technology to do the computations in the test.

They are to record whether they rejected the null hypothesis or not, then replace the tokens, shake the bag, and repeat the simulation a total of 25 times. When they are done, they should compute what proportion of their simulations resulted in a rejection of the null hypothesis.

Meanwhile, draw on the board a pair of axes. Label the horizontal axis “Actual Population Proportion” and the vertical axis “Fraction of Tests That Rejected.”

When they and you are done, students should come to the board and draw a point on the graph corresponding to the proportion of blue tokens in their bag and the proportion of their simulations that resulted in a rejection. The resulting graph is an approximation of a “power curve,” for power is precisely the probability of rejecting the null hypothesis.

Figure 2 is an example of what the plot might look like. The lesson from this activity is that the power is affected by the magnitude of the difference between the hypothesized parameter value and its true value. Bigger discrepancies are easier to detect than smaller ones.

Figure 2: Power Curve
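If you want to preview the activity's outcome before class (or run far more than 25 repetitions), the simulation is easy to script. Below is a minimal sketch in Python; it is my own construction, not part of the article, and it assumes the draw of 20 chips from a bag of 200 can be approximated by a binomial sample and that each student runs a two-sided one-proportion z test (students may use any equivalent test).

```python
# A minimal simulation of Activity 1 (my own sketch, not from the article).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def rejects(p_true, n=20, p0=0.5, alpha=0.10):
    """Draw one sample of size n and test H0: p = p0 with a two-sided z test."""
    p_hat = rng.binomial(n, p_true) / n
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    return 2 * norm.sf(abs(z)) < alpha  # True if the null is rejected

# One point per "bag": proportions 0.00, 0.05, ..., 1.00, each with 25 tests.
for p in np.arange(0.0, 1.01, 0.05):
    rate = np.mean([rejects(p) for _ in range(25)])
    print(f"true p = {p:.2f}: rejected in {rate:.0%} of 25 tests")
```

The printed rejection rates trace out the same U-shaped power curve the students draw on the board: near 1 at the extreme bags, and near the significance level of 0.10 when the true proportion is actually 0.50.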

Activity 2: Relating Power to Sample Size

For this activity, prepare 11 paper bags, each containing 780 blue chips (65 percent) and 420 nonblue chips (35 percent). 3 This activity requires 8,580 blue chips and 4,620 nonblue chips.

Pair up the students. Assign each student pair a sample size from 20 to 120, in increments of 10.

The activity proceeds as the last one did. Students are to take 25 samples of their assigned size, recording what proportion of those samples leads to a rejection of the null hypothesis p = 0.5 against a two-sided alternative, at a significance level of 0.10. While they're sampling, you make axes on the board labeled "Sample Size" and "Fraction of Tests That Rejected." The students put points on the board as they complete their simulations. The resulting graph is a "power curve" relating power to sample size. Below is an example of what the plot might look like. It should show clearly that when p = 0.65, the null hypothesis of p = 0.50 is rejected with a higher probability when the sample size is larger.

(If you do both of these activities with students, it might be worth pointing out to them that the point on the first graph corresponding to the population proportion p = 0.65 was estimating the same power as the point on the second graph corresponding to the sample size n = 20.)
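Activity 2 drops straight into the same sketch: reuse the `rejects()` helper from the Activity 1 simulation above, fix the true proportion at 0.65, and vary the sample size (again my own illustration, under the same binomial and z-test assumptions).

```python
# Activity 2 with the rejects() helper above: p_true = 0.65, n = 20, 30, ..., 120.
for n in range(20, 121, 10):
    rate = np.mean([rejects(0.65, n=n) for _ in range(25)])
    print(f"n = {n:3d}: rejected in {rate:.0%} of 25 tests")
```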

The AP Statistics curriculum is designed primarily to help students understand statistical concepts and become critical consumers of information. Being able to perform statistical computations is of, at most, secondary importance and for some topics, such as power, is not expected of students at all. Students should know what power means and what affects the power of a test of significance. The activities described above can help students understand power better. If you teach a 50-minute class, you should spend one or at most two class days teaching power to your students. Don’t get bogged down with calculations. They’re important for statisticians, but they’re best left for a later course.

  • In the context of an experiment in which one of two groups is a control group and the other receives a treatment, "magnitude of the effect" is an apt phrase, as it quite literally expresses how big an impact the treatment has on the response variable. But here I use the term more generally for other contexts as well.
  • I know that’s a lot of chips. The reason this activity requires so many chips is that it is a good idea to adhere to the so-called “10 percent rule of thumb,” which says that the standard error formula for proportions is approximately correct so long as your sample is less than 10 percent of the population. The largest sample size in this activity is 120, which requires 1,200 chips for that student’s bag. With smaller sample sizes you could get away with fewer chips and still adhere to the 10 percent rule, but it’s important in this activity for students to understand that they are all essentially sampling from the same population. If they perceive that some bags contain many fewer chips than others, you may end up in a discussion you don’t want to have, about the fact that only the proportion is what’s important, not the population size. It’s probably easier to just bite the bullet and prepare bags with a lot of chips in them.

Authored by

Floyd Bullard North Carolina School of Science and Mathematics Durham, North Carolina


Institute for Digital Research and Education

Introduction to Power Analysis

This seminar treats power and the various factors that affect power on both a conceptual and a mechanical level. While we will not cover the formulas needed to actually run a power analysis, later on we will discuss some of the software packages that can be used to conduct power analyses.

OK, let's start off with a basic definition of what power is. Power is the probability of detecting an effect, given that the effect is really there. In other words, it is the probability of rejecting the null hypothesis when it is in fact false. For example, let's say that we have a simple study with drug A and a placebo group, and that the drug truly is effective; the power is the probability of finding a difference between the two groups. So, imagine that we had a power of .8 and that this simple study was conducted many times. Having power of .8 means that 80% of the time, we would get a statistically significant difference between the drug A and placebo groups. This also means that 20% of the time that we run this experiment, we will not obtain a statistically significant effect between the two groups, even though there really is an effect.

There are several reasons why one might do a power analysis. Perhaps the most common use is to determine the necessary number of subjects needed to detect an effect of a given size. Note that trying to find the absolute, bare minimum number of subjects needed in the study is often not a good idea. Additionally, power analysis can be used to determine power, given an effect size and the number of subjects available. You might do this when you know, for example, that only 75 subjects are available (or that you only have the budget for 75 subjects), and you want to know if you will have enough power to justify actually doing the study. In most cases, there is really no point to conducting a study that is seriously underpowered. Besides the issue of the number of necessary subjects, there are other good reasons for doing a power analysis. For example, a power analysis is often required as part of a grant proposal. And finally, doing a power analysis is often just part of doing good research. A power analysis is a good way of making sure that you have thought through every aspect of the study and the statistical analysis before you start collecting data.

Despite these advantages of power analyses, there are some limitations. One limitation is that power analyses do not typically generalize very well. If you change the methodology used to collect the data or change the statistical procedure used to analyze the data, you will most likely have to redo the power analysis. In some cases, a power analysis might suggest a number of subjects that is inadequate for the statistical procedure. For example, a power analysis might suggest that you need 30 subjects for your logistic regression, but logistic regression, like all maximum likelihood procedures, requires much larger sample sizes. Perhaps the most important limitation is that a standard power analysis gives you a "best case scenario" estimate of the number of subjects needed to detect the effect. In most cases, this "best case scenario" is based on assumptions and educated guesses. If any of these assumptions or guesses are incorrect, you may have less power than you need to detect the effect. Finally, because power analyses are based on assumptions and educated guesses, you often get a range of the number of subjects needed, not a precise number. For example, if you do not know what the standard deviation of your outcome measure will be, you guess at this value, run the power analysis and get X number of subjects. Then you guess a slightly larger value, rerun the power analysis and get a slightly larger number of necessary subjects. You repeat this process over the plausible range of values of the standard deviation, which gives you a range of the number of subjects that you will need.

After all of this discussion of power analyses and the necessary number of subjects, we need to stress that power is not the only consideration when determining the necessary sample size.  For example, different researchers might have different reasons for conducting a regression analysis.  One might want to see if the regression coefficient is different from zero, while the other wants to get a very precise estimate of the regression coefficient with a very small confidence interval around it.  This second purpose requires a larger sample size than does merely seeing if the regression coefficient is different from zero.  Another consideration when determining the necessary sample size is the assumptions of the statistical procedure that is going to be used.  The number of statistical tests that you intend to conduct will also influence your necessary sample size:  the more tests that you want to run, the more subjects that you will need.  You will also want to consider the representativeness of the sample, which, of course, influences the generalizability of the results.  Unless you have a really sophisticated sampling plan, the greater the desired generalizability, the larger the necessary sample size.  Finally, please note that most of what is in this presentation does not readily apply to people who are developing a sampling plan for a survey or psychometric analyses.

Definitions

Before we move on, let's make sure we are all using the same definitions. We have already defined power as the probability of detecting a "true" effect, when the effect exists. Most recommendations for power fall between .8 and .9. We have also been using the term "effect size", and while intuitively it is an easy concept, there are lots of definitions and lots of formulas for calculating effect sizes. For example, the current APA manual has a list of more than 15 effect sizes, and there are more than a few books mostly dedicated to the calculation of effect sizes in various situations. For now, let's stick with one of the simplest definitions, which is that an effect size is the difference of two group means divided by the pooled standard deviation. Going back to our previous example, suppose the mean of the outcome variable for the drug A group was 10 and it was 5 for the placebo group. If the pooled standard deviation was 2.5, we would have an effect size of (10-5)/2.5 = 2 (a very large effect size).
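As a quick sketch of that arithmetic (the function name is mine, and this assumes the pooled standard deviation is supplied directly):

```python
def cohens_d(mean_1, mean_2, pooled_sd):
    """Effect size: difference of two group means divided by the pooled SD."""
    return (mean_1 - mean_2) / pooled_sd

print(cohens_d(10, 5, 2.5))  # (10 - 5) / 2.5 = 2.0
```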

We also need to think about "statistical significance" versus "clinical relevance". This issue comes up often when considering effect sizes. For example, for a given number of subjects, you might only need a small effect size to have a power of .9. But that effect size might correspond to a difference between the drug and placebo groups that isn't clinically meaningful, say reducing blood pressure by two points. So even though you would have enough power, it still might not be worth doing the study, because the results would not be useful for clinicians.

There are a few other definitions that we will need later in this seminar.  A Type I error occurs when the null hypothesis is true (in other words, there really is no effect), but you reject the null hypothesis.  A Type II error occurs when the alternative hypothesis is correct, but you fail to reject the null hypothesis (in other words, there really is an effect, but you failed to detect it).  Alpha inflation refers to the increase in the nominal alpha level when the number of statistical tests conducted on a given data set is increased.

When discussing statistical power, we have four inter-related concepts: power, effect size, sample size and alpha.  These four things are related such that each is a function of the other three.  In other words, if three of these values are fixed, the fourth is completely determined (Cohen, 1988, page 14).  We mention this because, by increasing one, you can decrease (or increase) another.  For example, if you can increase your effect size, you will need fewer subjects, given the same power and alpha level.  Specifically, increasing the effect size, the sample size and/or alpha will increase your power.
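Power software exploits this relationship by solving for whichever of the four quantities you leave unspecified. Here is a minimal sketch using Python's statsmodels (my choice of tool; the seminar itself demonstrates SPSS Sample Power):

```python
# Fix any three of {effect size, alpha, power, sample size}; solve for the fourth.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group n for a medium effect (d = 0.5) at alpha = .05 and power = .80:
print(analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80))  # ~64

# Conversely, the power you would have with 64 subjects per group:
print(analysis.power(effect_size=0.5, nobs1=64, alpha=0.05))          # ~0.80
```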

While we are thinking about these related concepts and the effect of increasing things, let’s take a quick look at a standard power graph.  (This graph was made in SPSS Sample Power, and for this example, we’ve used .61 and 4 for our two proportion positive values.)

We like these kinds of graphs because they make clear the diminishing returns you get for adding more and more subjects. For example, let's say that we have only 10 subjects per group. We can see that we have a power of about .15, which is really, really low. If we add 50 subjects per group, we now have a power of about .6, an increase of .45. However, if we started with 100 subjects per group (power of about .8) and added 50 per group, we would have a power of .95, an increase of only .15. So each additional subject gives you less additional power. This curve also illustrates the "cost" of increasing your desired power from .8 to .9.

Knowing your research project

As we mentioned before, one of the big benefits of doing a power analysis is making sure that you have thought through every detail of your research project.

Now most researchers have thought through most, if not all, of the substantive issues involved in their research. While this is absolutely necessary, it often is not sufficient. Researchers also need to carefully consider all aspects of the experimental design, the variables involved, and the statistical analysis technique that will be used. As you will see in the next sections of this presentation, a power analysis is the union of substantive knowledge (i.e., knowledge about the subject matter), experimental or quasi-experimental design issues, and statistical analysis. Almost every aspect of the experimental design can affect power. For example, the type of control group that is used or the number of time points that are collected will affect how much power you have. So knowing about these issues and carefully considering your options is important. There are plenty of excellent books that cover these issues in detail, including Shadish, Cook and Campbell (2002); Cook and Campbell (1979); Campbell and Stanley (1963); Bickman (2000a, 2000b); Campbell and Russo (2001); Webb, Campbell, Schwartz and Sechrest (2000); and Anderson (2001).

Also, you want to know as much as possible about the statistical technique that you are going to use. If you learn that you need to use a binary logistic regression because your outcome variable is 0/1, don't stop there; rather, get a sample data set (there are plenty of sample data sets on our web site) and try it out. You may discover that the statistical package that you use doesn't do the type of analysis that you need to do. For example, if you are an SPSS user and you need to do a weighted multilevel logistic regression, you will quickly discover that SPSS doesn't do that (as of version 25), and you will have to find (and probably learn) another statistical package that will do that analysis. Maybe you want to learn another statistical package, or maybe that is beyond what you want to do for this project. If you are writing a grant proposal, maybe you will want to include funds for purchasing the new software. You will also want to learn what the assumptions are and what the "quirks" are with this particular type of analysis. Remember that the number of necessary subjects given to you by a power analysis assumes that all of the assumptions of the analysis have been met, so knowing what those assumptions are is important in deciding if they are likely to be met or not.

The point of this section is to make clear that knowing your research project involves many things, and you may find that you need to do some research about experimental design or statistical techniques before you do your power analysis.

We want to emphasize that this is time and effort well spent.  We also want to remind you that for almost all researchers, this is a normal part of doing good research.  UCLA researchers are welcome and encouraged to come by walk-in consulting at this stage of the research process to discuss issues and ideas, check out books and try out software.

What you need to know to do a power analysis

In the previous section, we discussed in general terms what you need to know to do a power analysis.  In this section we will discuss some of the actual quantities that you need to know to do a power analysis for some simple statistics.  Although we understand very few researchers test their main hypothesis with a t-test or a chi-square test, our point here is only to give you a flavor of the types of things that you will need to know (or guess at) in order to be ready for a power analysis.

– For an independent samples t-test, you will need to know the population means of the two groups (or the difference between the means), and the population standard deviations of the two groups.  So, using our example of drug A and placebo, we would need to know the difference in the means of the two groups, as well as the standard deviation for each group (because the group means and standard deviations are the best estimate that we have of those population values).  Clearly, if we knew all of this, we wouldn’t need to conduct the study.  In reality, researchers make educated guesses at these values.  We always recommend that you use several different values, such as decreasing the difference in the means and increasing the standard deviations, so that you get a range of values for the number of necessary subjects.

In SPSS Sample Power, we would have a screen that looks like the one below, and we would fill in the necessary values.  As we can see, we would need a total of 70 subjects (35 per group) to have a power of .91 if we had a mean of 5 and a standard deviation of 2.5 in the drug A group, and a mean of 3 and a standard deviation of 2.5 in the placebo group.  If we decreased the difference in the means and increased the standard deviations such that for the drug A group, we had a mean of 4.5 and a standard deviation of 3, and for the placebo group a mean of 3.5 and a standard deviation of 3, we would need 190 subjects per group, or a total of 380 subjects, to have a power of .90.  In other words, seemingly small differences in means and standard deviations can have a huge effect on the number of subjects required.

[Image: SPSS Sample Power setup screen for the independent samples t-test]
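You can get roughly the same numbers without SPSS Sample Power. For instance, here is a sketch with statsmodels (an assumption of mine; it is not the tool the seminar used):

```python
# Reproducing the two scenarios above with a two-sample t-test power calculation.
from statsmodels.stats.power import TTestIndPower

t = TTestIndPower()

# Scenario 1: means 5 vs 3, common SD 2.5  ->  d = (5 - 3) / 2.5 = 0.8
print(t.power(effect_size=0.8, nobs1=35, alpha=0.05))          # ~0.91

# Scenario 2: means 4.5 vs 3.5, common SD 3  ->  d = 1 / 3
print(t.solve_power(effect_size=1/3, alpha=0.05, power=0.90))  # ~190 per group
```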

– For a correlation, you need to know/guess at the correlation in the population.  This is a good time to remember back to an early stats class where they emphasized that correlation is a large N procedure (Chen and Popovich, 2002).  If you guess that the population correlation is .6, a power analysis would suggest (with an alpha of .05 and for a power of .8) that you would need only 16 subjects.  There are several points to be made here.  First, common sense suggests that N = 16 is pretty low.  Second, a population correlation of .6 is pretty high, especially in the social sciences.  Third, the power analysis assumes that all of the assumptions of the correlation have been met.  For example, we are assuming that there is no restriction of range issue, which is common with Likert scales; the sample data for both variables are normally distributed; the relationship between the two variables is linear; and there are no serious outliers.  Also, whereas you might be able to say that the sample correlation does not equal zero, you likely will not have a very precise estimate of the population correlation coefficient.

[Image: SPSS Sample Power setup screen for a correlation]
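To see roughly where a number like 16 comes from, here is a sketch of the classic Fisher z approximation (my own illustration, not from the seminar; with these inputs a one-sided test lands near 16, while a two-sided test needs closer to 20):

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80, two_sided=False):
    """Approximate n for testing H0: rho = 0 via the Fisher z transform."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.6))                  # 16 (one-sided)
print(n_for_correlation(0.6, two_sided=True))  # 20 (two-sided)
```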

– For a chi-square test, you will need to know the proportion positive for both populations (i.e., rows and columns).  Let’s assume that we will have a 2 x 2 chi-square, and let’s think of both variables as 0/1.  Let’s say that we wanted to know if there was a relationship between drug group (drug A/placebo) and improved health.  In SPSS Sample Power, you would see a screen like this.

[Image: SPSS Sample Power setup screen for a 2 x 2 chi-square test]

In order to get the .60 and the .30, we would need to know (or guess at) the number of people whose health improved in both the drug A and placebo groups.

We would also need to know (or guess at) either the number of people whose health did not improve in those two groups, or the total number of people in each group.

               Improved health (positive)   Not improved health   Row total
Drug A         33 (33/55 = .60)             22                    55
Placebo        17 (17/55 ≈ .30)             38                    55
Column total   50                           60                    Grand total = 110
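Treated as a comparison of two independent proportions (.60 versus .30), the same calculation can be sketched in statsmodels (again my choice of tool, not the seminar's):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.60, 0.30)  # Cohen's h via the arcsine transform, ~0.61
n1 = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
print(h, n1)  # roughly 42 subjects per group
```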

– For an ordinary least squares regression, you would need to know things like the R² for the full and reduced model. For a simple logistic regression analysis with only one continuous predictor variable, you would need to know the probability of a positive outcome (i.e., the probability that the outcome equals 1) at the mean of the predictor variable and the probability of a positive outcome at one standard deviation above the mean of the predictor variable. Especially for the various types of logistic models (e.g., binary, ordinal and multinomial), you will need to think very carefully about your sample size, and information from a power analysis will only be part of your considerations. For example, according to Long (1997, pages 53-54), 100 is a minimum sample size for logistic regression, and you want at least 10 observations per predictor. This does not mean that if you have only one predictor you need only 10 observations.

Also, if you have categorical predictors, you may need to have more observations to avoid computational difficulties caused by empty cells or cells with few observations.  More observations are needed when the outcome variable is very lopsided; in other words, when there are very few 1s and lots of 0s, or vice versa.  These cautions emphasize the need to know your data set well, so that you know if your outcome variable is lopsided or if you are likely to have a problem with empty cells.
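Because the quantities involved are exactly the ones described above, simulation is often the most transparent way to check power for a logistic regression. Here is a sketch (entirely my own construction; the probabilities and sample sizes are illustrative guesses, not the seminar's numbers):

```python
# Simulation-based power for a one-predictor logistic regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

def logit_power(n, p_at_mean=0.30, p_at_plus1sd=0.45, reps=500, alpha=0.05):
    """Fraction of simulated datasets whose fitted slope is significant."""
    b0 = np.log(p_at_mean / (1 - p_at_mean))             # log-odds at x = 0
    b1 = np.log(p_at_plus1sd / (1 - p_at_plus1sd)) - b0  # slope per SD of x
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = rng.binomial(1, 1 / (1 + np.exp(-(b0 + b1 * x))))
        fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
        hits += fit.pvalues[1] < alpha
    return hits / reps

print(logit_power(100), logit_power(300))
```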

The point of this section is to give you a sense of the level of detail about your variables that you need to be able to estimate in order to do a power analysis. Also, when doing power analyses for regression models, power programs will start to ask for values that most researchers are not accustomed to providing. Guessing at the mean and standard deviation of your response variable is one thing, but an increment to R² is a metric in which few researchers are used to thinking. In our next section we will discuss how you can guestimate these numbers.

Obtaining the necessary numbers to do a power analysis

There are at least three ways to guestimate the values that are needed to do a power analysis: a literature review, a pilot study and using Cohen’s recommendations.  We will review the pros and cons of each of these methods.  For this discussion, we will focus on finding the effect size, as that is often the most difficult number to obtain and often has the strongest impact on power.

Literature review: Sometimes you can find one or more published studies that are similar enough to yours that you can get an idea of the effect size. If you can find several such studies, you might be able to use meta-analysis techniques to get a robust estimate of the effect size. However, oftentimes there are no studies similar enough to your study to get a good estimate of the effect size. Even if you can find such a study, the necessary effect sizes or other values are often not clearly stated in the article and need to be calculated (if they can be) from the information provided.

Pilot studies: There are lots of good reasons to do a pilot study prior to conducting the actual study. From a power analysis perspective, a pilot study can give you a rough estimate of the effect size, as well as a rough estimate of the variability in your measures. You can also get some idea about where missing data might occur, and as we will discuss later, how you handle missing data can greatly affect your power. Other benefits of a pilot study include the chance to identify coding problems, set up the database, and input the data for a practice analysis. This will allow you to determine if the data are input in the correct shape, etc.

Of course, there are some limitations to the information that you can get from a pilot study.  (Many of these limitations apply to small samples in general.)  First of all, when estimating effect sizes based on nonsignificant results, the effect size estimate will necessarily have an increased error; in other words, the standard error of the effect size estimate will be larger than when the result is significant. The effect size estimate that you obtain may be unduly influenced by some peculiarity of the small sample.  Also, you often cannot get a good idea of the degree of missingness and attrition that will be seen in the real study.  Despite these limitations, we strongly encourage researchers to conduct a pilot study.  The opportunity to identify and correct “bugs” before collecting the real data is often invaluable.  Also, because of the number of values that need to be guestimated in a power analysis, the precision of any one of these values is not that important.  If you can estimate the effect size to within 10% or 20% of the true value, that is probably sufficient for you to conduct a meaningful power analysis, and such fluctuations can be taken into account during the power analysis.

Cohen’s recommendations:  Jacob Cohen has many well-known publications regarding issues of power and power analyses, including some recommendations about effect sizes that you can use when doing your power analysis.  Many researchers (including Cohen) consider the use of such recommendations as a last resort, when a thorough literature review has failed to reveal any useful numbers and a pilot study is either not possible or not feasible.  From Cohen (1988, pages 24-27):

– Small effect:  1% of the variance; d = 0.2 (too small to detect other than statistically; lower limit of what is clinically relevant)

– Medium effect:  6% of the variance; d = 0.5 (apparent with careful observation)

– Large effect: at least 15% of the variance; d = 0.8 (apparent with a superficial glance; unlikely to be the focus of research because it is too obvious)

Lipsey and Wilson (1993) did a meta-analysis of 302 meta-analyses covering over 10,000 studies and found that the average effect size was .5, adding support to Cohen's recommendation that, as a last resort, you guess that the effect size is .5 (cited in Bausell and Li, 2002). Sedlmeier and Gigerenzer (1989) found that the average effect size for articles in The Journal of Abnormal Psychology was a medium effect. According to Keppel and Wickens (2004), when you really have no idea what the effect size is, go with the smallest effect size of practical value. In other words, you need to know how small of a difference is meaningful to you. Keep in mind that research suggests that most researchers are overly optimistic about the effect sizes in their research, and that most research studies are underpowered (Keppel and Wickens, 2004; Tversky and Kahneman, 1971). This is part of the reason why we stress that a power analysis gives you a lower limit to the number of necessary subjects.

Factors that affect power

From the preceding discussion, you might be starting to think that the number of subjects and the effect size are the most important factors, or even the only factors, that affect power.  Although effect size is often the largest contributor to power, saying it is the only important issue is far from the truth.  There are at least a dozen other factors that can influence the power of a study, and many of these factors should be considered not only from the perspective of doing a power analysis, but also as part of doing good research.  The first couple of factors that we will discuss are more “mechanical” ways of increasing power (e.g., alpha level, sample size and effect size). After that, the discussion will turn to more methodological issues that affect power.

1.  Alpha level:  One obvious way to increase your power is to increase your alpha (from .05 to, say, .1); the short sketch after item 1a below illustrates the effect.  While this might be an advisable strategy when doing a pilot study, increasing your alpha usually is not a viable option.  We should also point out that many researchers are starting to prefer .01 as an alpha level instead of .05, as a crude attempt to ensure results are clinically relevant; this reduction in alpha also reduces power.

1a.  One- versus two-tailed tests:  In some cases, you can test your hypothesis with a one-tailed test.  For example, if your hypothesis was that drug A is better than the placebo, then you could use a one-tailed test.  However, you would fail to detect a difference, even if it was a large difference, if the placebo was better than drug A.  The advantage of one-tailed tests is that they put all of your power “on one side” to test your hypothesis.  The disadvantage is that you cannot detect differences that are in the opposite direction of your hypothesis.  Moreover, many grant and journal reviewers frown on the use of one-tailed tests, believing it is a way to feign significance (Stratton and Neil, 2004).
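A quick sketch of both points (items 1 and 1a), holding d = 0.5 and 50 subjects per group fixed; the tool choice here is mine, not the seminar's:

```python
from statsmodels.stats.power import TTestIndPower

t = TTestIndPower()
for alpha in (0.10, 0.05, 0.01):  # power drops as alpha shrinks
    print(alpha, t.power(effect_size=0.5, nobs1=50, alpha=alpha))
# A one-sided test concentrates the rejection region on the hypothesized side:
print(t.power(effect_size=0.5, nobs1=50, alpha=0.05, alternative='larger'))
```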

2.  Sample size:  A second obvious way to increase power is simply to collect data on more subjects.  In some situations, though, the subjects are difficult to get or extremely costly to run.  For example, you may have access to only 20 autistic children or only have enough funding to interview 30 cancer survivors.  If possible, you might try increasing the number of subjects in groups that do not have these restrictions, for example, if you are comparing to a group of normal controls.  While it is true that, in general, it is often desirable to have roughly the same number of subjects in each group, this is not absolutely necessary.  However, you get diminishing returns for additional subjects in the control group:  adding an extra 100 subjects to the control group might not be much more helpful than adding 10 extra subjects.

3.  Effect size:  Another obvious way to increase your power is to increase the effect size.  Of course, this is often easier said than done. A common way of increasing the effect size is to increase the experimental manipulation.  Going back to our example of drug A and placebo, increasing the experimental manipulation might mean increasing the dose of the drug. While this might be a realistic option more often than increasing your alpha level, there are still plenty of times when you cannot do this.  Perhaps the human subjects committee will not allow it, it does not make sense clinically, or it doesn’t allow you to generalize your results the way you want to.  Many of the other issues discussed below indirectly increase effect size by providing a stronger research design or a more powerful statistical analysis.

4.  Experimental task:  Well, maybe you cannot increase the experimental manipulation, but perhaps you can change the experimental task, if there is one.  If a variety of tasks have been used in your research area, consider which of these tasks provides the most power (weighed against other important issues, such as relevancy, participant discomfort, and the like).  However, if various tasks have not been reviewed in your field, designing a more sensitive task might be beyond the scope of your research project.

5.  Response variable:  How you measure your response variable(s) is just as important as what task you have the subject perform.  When thinking about power, you want to use a measure that is as high in sensitivity and low in measurement error as is possible.  Researchers in the social sciences often have a variety of measures from which they can choose, while researchers in other fields may not.  For example, there are numerous established measures of anxiety, IQ, attitudes, etc.  Even if there are not established measures, you still have some choice.  Do you want to use a Likert scale, and if so, how many points should it have?  Modifications to procedures can also help reduce measurement error.  For example, you want to make sure that each subject knows exactly what he or she is supposed to be rating.  Oral instructions need to be clear, and items on questionnaires need to be unambiguous to all respondents.  When possible, use direct instead of indirect measures.  For example, asking people what tax bracket they are in is a more direct way of determining their annual income than asking them about the square footage of their house.  Again, this point may be more applicable to those in the social sciences than those in other areas of research.  We should also note that minimizing the measurement error in your predictor variables will also help increase your power.

Just as an aside, most texts on experimental design strongly suggest collecting more than one measure of the response in which you are interested. While this is very good methodologically and provides marked benefits for certain analyses and missing data, it does complicate the power analysis.

6.  Experimental design:  Another thing to consider is that some types of experimental designs are more powerful than others.  For example, repeated measures designs are virtually always more powerful than designs in which you only get measurements at one time.  If you are already using a repeated measures design, increasing the number of time points a response variable is collected to at least four or five will also provide increased power over fewer data collections.  There is a point of diminishing return when a researcher collects too many time points, though this depends on many factors such as the response variable, statistical design, age of participants, etc.

7.  Groups:  Another point to consider is the number and types of groups that you are using.  Reducing the number of experimental conditions will reduce the number of subjects that is needed, or you can keep the same number of subjects and just have more per group.  When thinking about which groups to exclude from the design, you might want to leave out those in the middle and keep the groups with the more extreme manipulations.  Going back to our drug A example, let’s say that we were originally thinking about having a total of four groups: the first group will be our placebo group, the second group would get a small dose of drug A, the third group a medium dose, and the fourth group a large dose.  Clearly, much more power is needed to detect an effect between the medium and large dose groups than to detect an effect between the large dose group and the placebo group.  If we found that we were unable to increase the power enough such that we were likely to find an effect between small and medium dose groups or between the medium and the large dose groups, then it would probably make more sense to run the study without these groups.  In some cases, you may even be able to change your comparison group to something more extreme.  For example, we once had a client who was designing a study to compare people with clinical levels of anxiety to a group that had subclinical levels of anxiety.  However, while doing the power analysis and realizing how many subjects she would need to detect the effect, she found that she needed far fewer subjects if she compared the group with the clinical levels of anxiety to a group of “normal” people (a number of subjects she could reasonably obtain).

8.  Statistical procedure:  Changing the type of statistical analysis may also help increase power, especially when some of the assumptions of the test are violated.  For example, as Maxwell and Delaney (2004) noted, "Even when ANOVA is robust, it may not provide the most powerful test available when its assumptions have been violated."  In particular, violations of the independence, normality and homogeneity of variance assumptions can reduce power.  In such cases, nonparametric alternatives may be more powerful.

9.  Statistical model:  You can also modify the statistical model.  For example, interactions often require more power than main effects.  Hence, you might find that you have reasonable power for a main effects model, but not enough power when the model includes interactions.  Many (perhaps most?) power analysis programs do not have an option to include interaction terms when describing the proposed analysis, so you need to keep this in mind when using these programs to help you determine how many subjects will be needed.  When thinking about the statistical model, you might want to consider using covariates or blocking variables.  Ideally, both covariates and blocking variables reduce the variability in the response variable.  However, it can be challenging to find such variables.  Moreover, your statistical model should use as many of the response variable time points as possible when examining longitudinal data.  Using a change-score analysis when one has collected five time points makes little sense and ignores the added power from these additional time points.  The more the statistical model “knows” about how a person changes over time, the more variance that can be pulled out of the error term and ascribed to an effect.

9a. Correlation between time points:  Understanding the expected correlation between a response variable measured at one time in your study with the same response variable measured at another time can provide important and power-saving information.  As noted previously, when the statistical model has a certain amount of information regarding the manner by which people change over time, it can enhance the effect size estimate.  This is largely dependent on the correlation of the response measure over time.  For example, in a before-after data collection scenario, response variables with a .00 correlation from before the treatment to after the treatment would provide no extra benefit to the statistical model, as we can't better understand a subject's score by knowing how he or she changes over time.  Rarely, however, do variables have a .00 correlation on the same outcomes measured at different times.  It is important to know that outcome variables with larger correlations over time provide enhanced power when used in a complementary statistical model.

10.  Modify response variable:  Besides modifying your statistical model, you might also try modifying your response variable.  Possible benefits of this strategy include reducing extreme scores and/or meeting the assumptions of the statistical procedure.  For example, some response variables might need to be log transformed.  However, you need to be careful here.  Transforming variables often makes the results more difficult to interpret, because now you are working in, say, a logarithm metric instead of the metric in which the variable was originally measured.  Moreover, if you use a transformation that adjusts the model too much, you can lose more power than is necessary.  Categorizing continuous response variables (sometimes used as a way of handling extreme scores) can also be problematic, because logistic or ordinal logistic regression often requires many more subjects than does OLS regression.  It makes sense that categorizing a response variable will lead to a loss of power, as information is being "thrown away."

11.  Purpose of the study:  Different researchers have different reasons for conducting research.  Some are trying to determine if a coefficient (such as a regression coefficient) is different from zero.  Others are trying to get a precise estimate of a coefficient.  Still others are replicating research that has already been done.  The purpose of the research can affect the necessary sample size.  Going back to our drug A and placebo study, let’s suppose our purpose is to test the difference in means to see if it equals zero.   In this case, we need a relatively small sample size.  If our purpose is to get a precise estimate of the means (i.e., minimizing the standard errors), then we will need a larger sample size.  If our purpose is to replicate previous research, then again we will need a relatively large sample size.  Tversky and Kahneman (1971) pointed out that we often need more subjects in a replication study than were in the original study.  They also noted that researchers are often too optimistic about how much power they really have.  They claim that researchers too readily assign “causal” reasons to explain differences between studies, instead of sampling error. They also mentioned that researchers tend to underestimate the impact of sampling and think that results will replicate more often than is the case.

12.  Missing data:  A final point that we would like to make here regards missing data.  Almost all researchers have issues with missing data.  When designing your study and selecting your measures, you want to do everything possible to minimize missing data.  Handling missing data via imputation methods can be very tricky and very time-consuming.  If the data set is small, the situation can be even more difficult.  In general, missing data reduces power; poor imputation methods can greatly reduce power.  If you have to impute, you want to have as few missing data points on as few variables as possible.  When designing the study, you might want to collect data specifically for use in an imputation model (which usually involves a different set of variables than the model used to test your hypothesis).  It is also important to note that the default technique for handling missing data by virtually every statistical program is to remove the entire case from an analysis (i.e., listwise deletion).  This process is undertaken even if the analysis involves 20 variables and a subject is missing only one datum of the 20.  Listwise deletion is one of the biggest contributors to loss of power, both because of the omnipresence of missing data and because of the omnipresence of this default setting in statistical programs (Graham et al., 2003).

This ends the section on the various factors that can influence power.  We know that was a lot, and we understand that much of this can be frustrating because there is very little that is “black and white”.  We hope that this section made clear the close relationship between the experimental design, the statistical analysis and power.

Cautions about small sample sizes and sampling variation

We want to take a moment here to mention some issues that frequently arise when using small samples.  (We aren't going to put a lower limit on what we mean by "small sample size.")  While there are situations in which a researcher can either only get or afford a small number of subjects, in most cases, the researcher has some choice in how many subjects to include.  Considerations of time and effort argue for running as few subjects as possible, but there are some difficulties associated with small sample sizes, and these may outweigh any gains from the saving of time, effort or both.  One obvious problem with small sample sizes is that they have low power.  This means that you need to have a large effect size to detect anything.  You will also have fewer options with respect to appropriate statistical procedures, as many common procedures, such as correlations, logistic regression and multilevel modeling, are not appropriate with small sample sizes.  It may also be more difficult to evaluate the assumptions of the statistical procedure that is used (especially assumptions like normality).  In most cases, the statistical model must be smaller when the data set is small.  Interaction terms, which often test interesting hypotheses, are frequently the first casualties.  Generalizability of the results may also be compromised, and it can be difficult to argue that a small sample is representative of a large and varied population.  Missing data are also more problematic; fewer imputation methods are available to you, and those that remain (such as mean imputation) are not considered desirable.  Finally, with a small sample size, alpha inflation issues can be more difficult to address, and you are more likely to run as many tests as you have subjects.

While the issue of sampling variability is relevant to all research, it is especially relevant to studies with small sample sizes.  To quote Murphy and Myors (2004, page 59), "The lack of attention to power analysis (and the deplorable habit of placing too much weight on the results of small sample studies) are well documented in the literature, and there is no good excuse to ignore power in designing studies."  In an early article entitled Belief in the Law of Small Numbers, Tversky and Kahneman (1971) stated that many researchers act as if the Law of Large Numbers also applies to small numbers.  People often believe that small samples are more representative of the population than they really are.

The last two points to be made here are that there is usually no point to conducting an underpowered study, and that underpowered studies can cause chaos in the literature because studies that are similar methodologically may report conflicting results.

We will briefly discuss some of the programs that you can use to assist you with your power analysis.  Most programs are fairly easy to use, but you still need to know effect sizes, means, standard deviations, etc.

Among the programs specifically designed for power analysis, we use SPSS Sample Power, PASS and GPower.  These programs have a friendly point-and-click interface and will do power analyses for things like correlations, OLS regression and logistic regression.  We have also started using Optimal Design for repeated measures, longitudinal and multilevel designs.  We should note that Sample Power is a stand-alone program that is sold by SPSS; it is not part of SPSS Base or an add-on module.  PASS can be purchased directly from NCSS at http://www.ncss.com/index.htm .  GPower and Optimal Design (please see http://sitemaker.umich.edu/group-based/home for Optimal Design details) are free.

Several general use stat packages also have procedures for calculating power.  SAS has proc power , which has a lot of features and is pretty nice.  Stata has the sampsi command, as well as many user-written commands, including fpower , powerreg and aipe (written by our IDRE statistical consultants).  Statistica has an add-on module for power analysis.  There are also many programs online that are free.

For more advanced/complicated analyses, Mplus is a good choice.  It will allow you to do Monte Carlo simulations, and there are some examples at http://www.statmodel.com/power.shtml and http://www.statmodel.com/ugexcerpts.shtml .

Most of the programs that we have mentioned do roughly the same things, so when selecting a power analysis program, the real issue is your comfort; all of the programs require you to provide the same kind of information.

Multiplicity

This issue of multiplicity arises when a researcher has more than one outcome of interest in a given study.  While it is often good methodological practice to have more than one measure of the response variable of interest, additional response variables mean that more statistical tests need to be conducted on the data set, and this leads to the question of experimentwise alpha control.  Returning to our example of drug A and placebo, if we have only one response variable, then only one t test is needed to test our hypothesis.  However, if we have three measures of our response variable, we would want to do three t tests, hoping that each would show results in the same direction.  The question is how to control the Type I error (AKA false alarm) rate.  Most researchers are familiar with the Bonferroni correction, which calls for dividing the prespecified alpha level (usually .05) by the number of tests to be conducted.  In our example, we would have .05/3 = .0167.  Hence, .0167 would be our new critical alpha level, and statistics with a p-value greater than .0167 would be classified as not statistically significant.  It is well-known that the Bonferroni correction is very conservative; there are other ways of adjusting the alpha level.
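A sketch of the Bonferroni bookkeeping with statsmodels (the p-values below are made up for illustration):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.030, 0.055]  # one p-value per response measure
reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05, method='bonferroni')
print(alpha_bonf)  # per-test critical alpha: 0.05 / 3 ~ 0.0167
print(reject)      # only the p-value below .0167 is declared significant
```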

Afterthoughts:  A post-hoc power analysis

In general, just say “No!” to post-hoc analyses.  There are many reasons, both mechanical and theoretical, why most researchers should not do post-hoc power analyses.  Excellent summaries can be found in Hoenig and Heisey (2001) The Abuse of Power:  The Pervasive Fallacy of Power Calculations for Data Analysis and Levine and Ensom (2001) Post Hoc Power Analysis:  An Idea Whose Time Has Passed? .  As Hoenig and Heisey show, power is mathematically directly related to the p-value; hence, calculating power once you know the p-value associated with a statistic adds no new information.  Furthermore, as Levine and Ensom clearly explain, the logic underlying post-hoc power analysis is fundamentally flawed.

However, there are some things that you should look at after your study is completed.  Have a look at the means and standard deviations of your variables and see how close they are (or are not) to the values that you used in the power analysis.  Many researchers do a series of related studies, and this information can aid in making decisions in future research.  For example, if you find that your outcome variable had a standard deviation of 7, when in your power analysis you were guessing it would have a standard deviation of 2, you may want to consider using a different measure that has less variance in your next study.

The point here is that in addition to answering your research question(s), your current research project can also assist with your next power analysis.

Conclusions

Conducting research is kind of like buying a car.  While buying a car isn’t the biggest purchase that you will make in your life, few of us enter into the process lightly.  Rather, we consider a variety of things, such as need and cost, before making a purchase.  You would do your research before you went and bought a car, because once you drove the car off the dealer’s lot, there is nothing you can do about it if you realize this isn’t the car that you need.  Choosing the type of analysis is like choosing which kind of car to buy.  The number of subjects is like your budget, and the model is like your expenses.  You would never go buy a car without first having some idea about what the payments will be.  This is like doing a power analysis to determine approximately how many subjects will be needed.  Imagine signing the papers for your new Maserati only to find that the payments will be twice your monthly take-home pay.  This is like wanting to do a multilevel model with a binary outcome, 10 predictors and lots of cross-level interactions and realizing that you can’t do this with only 50 subjects.  You don’t have enough “currency” to run that kind of model.  You need to find a model that is “more in your price range.”  If you had $530 a month budgeted for your new car, you probably wouldn’t want exactly $530 in monthly payments. Rather you would want some “wiggle-room” in case something cost a little more than anticipated or you were running a little short on money that month. Likewise, if your power analysis says you need about 300 subjects, you wouldn’t want to collect data on exactly 300 subjects.  You would want to collect data on 300 subjects plus a few, just to give yourself some “wiggle-room” just in case.

Don't be afraid of what you don't know.  Get in there and try it BEFORE you collect your data.  Correcting things is easy at this stage; after you collect your data, all you can do is damage control.  If you are in a hurry to get a project done, perhaps the worst thing that you can do is start collecting data now and worry about the rest later.  The project will take much longer if you do this than if you do what we are suggesting and do the power analysis and other planning steps.  If you have everything all planned out, things will go much more smoothly and you will have fewer and/or less intense panic attacks.  Of course, something unexpected will always happen, but it is unlikely to be as big of a problem.  UCLA researchers are always welcome and strongly encouraged to come into our walk-in consulting and discuss their research before they begin the project.

Power analysis = planning.  You will want to plan not only for the test of your main hypothesis, but also for follow-up tests and tests of secondary hypotheses.  You will want to make sure that “confirmation” checks will run as planned (for example, checking to see that interrater reliability was acceptable).  If you intend to use imputation methods to address missing data issues, you will need to become familiar with the issues surrounding the particular procedure as well as including any additional variables in your data collection procedures.  Part of your planning should also include a list of the statistical tests that you intend to run and consideration of any procedure to address alpha inflation issues that might be necessary.

The number output by any power analysis program is often just a starting point of thought more than a final answer to the question of how many subjects will be needed.  As we have seen, you also need to consider the purpose of the study (coefficient different from 0, precise point estimate, replication), the type of statistical test that will be used (t-test versus maximum likelihood technique), the total number of statistical tests that will be performed on the data set, generalizability from the sample to the population, and probably several other things as well.

The take-home message from this seminar is “do your research before you do your research.”

References

Anderson, N. H. (2001). Empirical Direction in Design and Analysis. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Bausell, R. B. and Li, Y. (2002). Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences. New York: Cambridge University Press.

Bickman, L., Editor. (2000a). Research Design: Donald Campbell's Legacy, Volume 2. Thousand Oaks, CA: Sage Publications.

Bickman, L., Editor. (2000b). Validity and Social Experimentation. Thousand Oaks, CA: Sage Publications.

Campbell, D. T. and Russo, M. J. (2001). Social Measurement. Thousand Oaks, CA: Sage Publications.

Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-experimental Designs for Research. Reprinted from Handbook of Research on Teaching. Palo Alto, CA: Houghton Mifflin Co.

Chen, P. and Popovich, P. M. (2002). Correlation: Parametric and Nonparametric Measures. Thousand Oaks, CA: Sage Publications.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and Analysis Issues for Field Settings. Palo Alto, CA: Houghton Mifflin Co.

Graham, J. W., Cumsille, P. E., and Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka and W. F. Velicer (Eds.), Handbook of Psychology (Vol. 2, pp. 87-114). New York: Wiley.

Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26(3), 499-510.

Hoenig, J. M. and Heisey, D. M. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55(1), 19-24.

Kelley, K. and Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8(3), 305-321.

Keppel, G. and Wickens, T. D. (2004). Design and Analysis: A Researcher's Handbook, Fourth Edition. Upper Saddle River, New Jersey: Pearson Prentice Hall.

Kline, R. B. (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, D.C.: American Psychological Association.

Levine, M. and Ensom, M. H. H. (2001). Post Hoc Power Analysis: An Idea Whose Time Has Passed? Pharmacotherapy, 21(4), 405-409.

Lipsey, M. W. and Wilson, D. B. (1993). The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-analysis. American Psychologist, 48(12), 1181-1209.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Maxwell, S. E. (2000). Sample size and multiple regression analysis. Psychological Methods, 5(4), 434-458.

Maxwell, S. E. and Delaney, H. D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Second Edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Murphy, K. R. and Myors, B. (2004). Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Publication Manual of the American Psychological Association, Fifth Edition. (2001). Washington, D.C.: American Psychological Association.

Sedlmeier, P. and Gigerenzer, G. (1989). Do Studies of Statistical Power Have an Effect on the Power of Studies? Psychological Bulletin, 105(2), 309-316.

Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin Co.

Stratton, I. M. and Neil, A. (2004). How to ensure your paper is rejected by the statistical reviewer. Diabetic Medicine, 22, 371-373.

Tversky, A. and Kahneman, D. (1971). Belief in the Law of Small Numbers. Psychological Bulletin, 76(2), 105-110.

Webb, E., Campbell, D. T., Schwartz, R. D., and Sechrest, L. (2000). Unobtrusive Measures, Revised Edition. Thousand Oaks, CA: Sage Publications.



9.1: Introduction to Hypothesis Testing


Kyle Siegrist, University of Alabama in Huntsville (via Random Services)


Basic Theory

Preliminaries.

As usual, our starting point is a random experiment with an underlying sample space and a probability measure \(\P\). In the basic statistical model, we have an observable random variable \(\bs{X}\) taking values in a set \(S\). In general, \(\bs{X}\) can have quite a complicated structure. For example, if the experiment is to sample \(n\) objects from a population and record various measurements of interest, then \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th object. The most important special case occurs when \((X_1, X_2, \ldots, X_n)\) are independent and identically distributed. In this case, we have a random sample of size \(n\) from the common distribution.

The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing . Collectively, these concepts are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized them.

A statistical hypothesis is a statement about the distribution of \(\bs{X}\). Equivalently, a statistical hypothesis specifies a set of possible distributions of \(\bs{X}\): the set of distributions for which the statement is true. A hypothesis that specifies a single distribution for \(\bs{X}\) is called simple ; a hypothesis that specifies more than one distribution for \(\bs{X}\) is called composite .

In hypothesis testing , the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis . The null hypothesis is usually denoted \(H_0\) while the alternative hypothesis is usually denoted \(H_1\).

An hypothesis test is a statistical decision ; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value \(\bs{x}\) of the data vector \(\bs{X}\). Thus, we will find an appropriate subset \(R\) of the sample space \(S\) and reject \(H_0\) if and only if \(\bs{x} \in R\). The set \(R\) is known as the rejection region or the critical region . Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in \(\bs{x}\) to overturn this assumption in favor of the alternative.

An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that \(H_1\) is a statement in a mathematical theory and that \(H_0\) is its negation. One way that we can prove \(H_1\) is to assume \(H_0\) and work our way logically to a contradiction. In an hypothesis test, we don't prove anything of course, but there are similarities. We assume \(H_0\) and then see if the data \(\bs{x}\) are sufficiently at odds with that assumption that we feel justified in rejecting \(H_0\) in favor of \(H_1\).

Often, the critical region is defined in terms of a statistic \(w(\bs{X})\), known as a test statistic , where \(w\) is a function from \(S\) into another set \(T\). We find an appropriate rejection region \(R_T \subseteq T\) and reject \(H_0\) when the observed value \(w(\bs{x}) \in R_T\). Thus, the rejection region in \(S\) is then \(R = w^{-1}(R_T) = \left\{\bs{x} \in S: w(\bs{x}) \in R_T\right\}\). As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.

The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true.

Types of errors:

  • A type 1 error is rejecting the null hypothesis \(H_0\) when \(H_0\) is true.
  • A type 2 error is failing to reject the null hypothesis \(H_0\) when the alternative hypothesis \(H_1\) is true.

Similarly, there are two ways to make a correct decision: we could reject \(H_0\) when \(H_1\) is true or we could fail to reject \(H_0\) when \(H_0\) is true. The possibilities are summarized in the following table:

Hypothesis Test

State \ Decision | Fail to reject \(H_0\) | Reject \(H_0\)
\(H_0\) True | Correct | Type 1 error
\(H_1\) True | Type 2 error | Correct

Of course, when we observe \(\bs{X} = \bs{x}\) and make our decision, either we will have made the correct decision or we will have committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we can consider the probabilities of the various errors.

If \(H_0\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_0\)), then \(\P(\bs{X} \in R)\) is the probability of a type 1 error for this distribution. If \(H_0\) is composite, then \(H_0\) specifies a variety of different distributions for \(\bs{X}\) and thus there is a set of type 1 error probabilities.

The maximum probability of a type 1 error, over the set of distributions specified by \( H_0 \), is the significance level of the test or the size of the critical region.

The significance level is often denoted by \(\alpha\). Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01).

If \(H_1\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_1\)), then \(\P(\bs{X} \notin R)\) is the probability of a type 2 error for this distribution. Again, if \(H_1\) is composite then \(H_1\) specifies a variety of different distributions for \(\bs{X}\), and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region \(R\) smaller, we necessarily increase the probability of a type 2 error because the complementary region \(S \setminus R\) is larger.

The extreme cases can give us some insight. First consider the decision rule in which we never reject \(H_0\), regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = \emptyset\). A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by \(H_1\). At the other extreme, consider the decision rule in which we always reject \(H_0\), regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = S\). A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by \(H_0\). In between these two worthless tests are meaningful tests that take the evidence \(\bs{x}\) into account.
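A small numerical sketch makes the tradeoff concrete. The setup below (a right-tailed z-test with \(\sigma = 1\), \(n = 25\), and true mean 0.5 under \(H_1\)) is an illustrative assumption, not part of the text.

```python
# Shrinking the rejection region R (smaller alpha) necessarily enlarges
# the complementary region, so the type 2 error probability beta grows.
from scipy.stats import norm

n, sigma, mu1 = 25, 1.0, 0.5          # illustrative assumptions
for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)      # boundary of R for the z statistic
    beta = norm.cdf(z_crit - mu1 * n**0.5 / sigma)   # P(fail to reject | H1)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}")
```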

If \(H_1\) is true, so that the distribution of \(\bs{X}\) is specified by \(H_1\), then \(\P(\bs{X} \in R)\), the probability of rejecting \(H_0\) is the power of the test for that distribution.

Thus the power of the test for a distribution specified by \( H_1 \) is the probability of making the correct decision.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with region \(R_1\) is uniformly more powerful than the test with region \(R_2\) if \[ \P(\bs{X} \in R_1) \ge \P(\bs{X} \in R_2) \text{ for every distribution of } \bs{X} \text{ specified by } H_1 \]

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by \(H_1\) while the other test will be more powerful for other distributions specified by \(H_1\).

If a test has significance level \(\alpha\) and is uniformly more powerful than any other test with significance level \(\alpha\), then the test is said to be a uniformly most powerful test at level \(\alpha\).

Clearly a uniformly most powerful test is the best we can do.

\(P\)-value

In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region \(R_\alpha\)) for any given significance level \(\alpha \in (0, 1)\). Typically, \(R_\alpha\) decreases (in the subset sense) as \(\alpha\) decreases.

The \(P\)-value of the observed value \(\bs{x}\) of \(\bs{X}\), denoted \(P(\bs{x})\), is defined to be the smallest \(\alpha\) for which \(\bs{x} \in R_\alpha\); that is, the smallest significance level for which \(H_0\) is rejected, given \(\bs{X} = \bs{x}\).

Knowing \(P(\bs{x})\) allows us to test \(H_0\) at any significance level for the given data \(\bs{x}\): If \(P(\bs{x}) \le \alpha\) then we would reject \(H_0\) at significance level \(\alpha\); if \(P(\bs{x}) \gt \alpha\) then we fail to reject \(H_0\) at significance level \(\alpha\). Note that \(P(\bs{X})\) is a statistic . Informally, \(P(\bs{x})\) can often be thought of as the probability of an outcome as or more extreme than the observed value \(\bs{x}\), where extreme is interpreted relative to the null hypothesis \(H_0\).
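As a sketch of this "smallest \(\alpha\)" characterization, consider a right-tailed z-test (an illustrative setup, not from the text): the observed statistic lies in \(R_\alpha\) exactly when \(\alpha\) is at least the \(P\)-value.

```python
# The P-value is the smallest alpha for which the observed statistic
# falls in the rejection region R_alpha of a right-tailed z-test.
from scipy.stats import norm

z_obs = 1.85                              # illustrative observed statistic
p_value = 1 - norm.cdf(z_obs)             # P(Z >= z_obs) under H_0
print(f"P-value = {p_value:.4f}")

for alpha in (0.10, 0.05, 0.01):
    in_R = z_obs >= norm.ppf(1 - alpha)   # R_alpha shrinks as alpha shrinks
    print(f"alpha = {alpha:.2f}: reject? {in_R}  "
          f"(alpha >= P-value? {alpha >= p_value})")
```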

Analogy with Justice Systems

There is a helpful analogy between statistical hypothesis testing and the criminal justice system in the US and various other countries. Consider a person charged with a crime. The presumed null hypothesis is that the person is innocent of the crime; the conjectured alternative hypothesis is that the person is guilty of the crime. The test of the hypotheses is a trial, with the evidence presented by both sides playing the role of the data. After considering the evidence, the jury delivers the decision as either not guilty or guilty . Note that innocent is not a possible verdict of the jury, because it is not the point of the trial to prove the person innocent. Rather, the point of the trial is to see whether there is sufficient evidence to overturn the null hypothesis that the person is innocent in favor of the alternative hypothesis that the person is guilty. A type 1 error is convicting a person who is innocent; a type 2 error is acquitting a person who is guilty. Generally, a type 1 error is considered the more serious of the two possible errors, so in an attempt to hold the chance of a type 1 error to a very low level, the standard for conviction in serious criminal cases is beyond a reasonable doubt .

Tests of an Unknown Parameter

Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable \(\bs{X}\) depends on a parameter \(\theta\) taking values in a parameter space \(\Theta\). The parameter may be vector-valued, so that \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) and \(\Theta \subseteq \R^k\) for some \(k \in \N_+\). The hypotheses generally take the form \[ H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \notin \Theta_0 \] where \(\Theta_0\) is a prescribed subset of the parameter space \(\Theta\). In this setting, the probabilities of making an error or a correct decision depend on the true value of \(\theta\). If \(R\) is the rejection region, then the power function \( Q \) is given by \[ Q(\theta) = \P_\theta(\bs{X} \in R), \quad \theta \in \Theta \] The power function gives a lot of information about the test.

The power function satisfies the following properties (a numerical sketch follows the list):

  • \(Q(\theta)\) is the probability of a type 1 error when \(\theta \in \Theta_0\).
  • \(\max\left\{Q(\theta): \theta \in \Theta_0\right\}\) is the significance level of the test.
  • \(1 - Q(\theta)\) is the probability of a type 2 error when \(\theta \notin \Theta_0\).
  • \(Q(\theta)\) is the power of the test when \(\theta \notin \Theta_0\).
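Here is a minimal numerical sketch of these properties for a right-tailed z-test of \(H_0: \theta \le 0\) versus \(H_1: \theta \gt 0\); the setup (\(\sigma = 1\), \(n = 25\), \(\alpha = 0.05\)) is an illustrative assumption.

```python
# Power function Q(theta) = P_theta(Xbar >= c) for a right-tailed z-test.
# On Theta_0 it gives the type 1 error probability; off Theta_0, the power.
from scipy.stats import norm

n, sigma, alpha = 25, 1.0, 0.05
c = norm.ppf(1 - alpha) * sigma / n**0.5        # reject H_0 when Xbar >= c

def Q(theta):
    return 1 - norm.cdf((c - theta) * n**0.5 / sigma)

for theta in (-0.2, 0.0, 0.2, 0.5):
    print(f"theta = {theta:+.1f}: Q(theta) = {Q(theta):.4f}")
# Q is maximized over Theta_0 = (-inf, 0] at theta = 0, where Q(0) = 0.05,
# the significance level; 1 - Q(theta) for theta > 0 is the type 2 error.
```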

If we have two tests, we can compare them by means of their power functions.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with rejection region \(R_1\) is uniformly more powerful than the test with rejection region \(R_2\) if \( Q_1(\theta) \ge Q_2(\theta)\) for all \( \theta \notin \Theta_0 \).

Most hypothesis tests of an unknown real parameter \(\theta\) fall into three special cases:

Suppose that \( \theta \) is a real parameter and \( \theta_0 \in \Theta \) a specified value. The tests below are respectively the two-sided test , the left-tailed test , and the right-tailed test .

  • \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\)
  • \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\)
  • \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\)

Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides \(\theta\) (known as nuisance parameters ).

Equivalence Between Hypothesis Test and Confidence Sets

There is an equivalence between hypothesis tests and confidence sets for a parameter \(\theta\).

Suppose that \(C(\bs{x})\) is a \(1 - \alpha\) level confidence set for \(\theta\). The following test has significance level \(\alpha\) for the hypothesis \( H_0: \theta = \theta_0 \) versus \( H_1: \theta \ne \theta_0 \): Reject \(H_0\) if and only if \(\theta_0 \notin C(\bs{x})\)

By definition, \(\P[\theta \in C(\bs{X})] = 1 - \alpha\). Hence if \(H_0\) is true so that \(\theta = \theta_0\), then the probability of a type 1 error is \(\P[\theta_0 \notin C(\bs{X})] = \alpha\).

Equivalently, we fail to reject \(H_0\) at significance level \(\alpha\) if and only if \(\theta_0\) is in the corresponding \(1 - \alpha\) level confidence set. In particular, this equivalence applies to interval estimates of a real parameter \(\theta\) and the common tests for \(\theta\) given above .

In each case below, the confidence interval has confidence level \(1 - \alpha\) and the test has significance level \(\alpha\); a code sketch follows the list.

  • Suppose that \(\left[L(\bs{X}), U(\bs{X})\right]\) is a two-sided confidence interval for \(\theta\). Reject \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\) or \(\theta_0 \gt U(\bs{X})\).
  • Suppose that \(L(\bs{X})\) is a confidence lower bound for \(\theta\). Reject \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\).
  • Suppose that \(U(\bs{X})\) is a confidence upper bound for \(\theta\). Reject \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\) if and only if \(\theta_0 \gt U(\bs{X})\).
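A brief sketch of the first correspondence, using a two-sided z-interval for a normal mean (the numbers are illustrative assumptions):

```python
# Reject H_0: theta = theta_0 at level alpha exactly when theta_0 falls
# outside the 1 - alpha two-sided confidence interval [L(X), U(X)].
from scipy.stats import norm

xbar, sigma, n, alpha = 172.0, 9.0, 25, 0.05    # illustrative numbers
half = norm.ppf(1 - alpha / 2) * sigma / n**0.5
L, U = xbar - half, xbar + half
print(f"95% CI: [{L:.2f}, {U:.2f}]")

for theta0 in (170.0, 176.0):
    print(f"theta_0 = {theta0}: reject? {theta0 < L or theta0 > U}")
```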

Pivot Variables and Test Statistics

Recall that confidence sets of an unknown parameter \(\theta\) are often constructed through a pivot variable , that is, a random variable \(W(\bs{X}, \theta)\) that depends on the data vector \(\bs{X}\) and the parameter \(\theta\), but whose distribution does not depend on \(\theta\) and is known. In this case, a natural test statistic for the basic tests given above is \(W(\bs{X}, \theta_0)\).

Power function

by Marco Taboga, PhD

In statistics, the power function is a function that links the true value of a parameter to the probability of rejecting a null hypothesis about the value of that parameter.


Here is a more formal definition. In a test of hypothesis about a parameter \(\theta\), with null hypothesis \(H_0: \theta = \theta_0\) and rejection region \(R\), the power function gives, for any \(\theta\), the probability of rejecting the null hypothesis when the true parameter is equal to \(\theta\):

\[ \pi(\theta) = \P_\theta(\bs{X} \in R) \]

The size of a test is the probability of rejecting the null hypothesis when it is true; for the simple null hypothesis above, it is

\[ \alpha = \pi(\theta_0) \]

The graph below shows a typical power function: that of a z-test for the mean of a normal distribution, where the size of the test is equal to 5% and the sample is made of 100 independent draws from the distribution.

[Figure: power function of a z-test for the mean of a normal distribution. The minimum of the graph occurs at the null value and equals the size of the test, 5%.]

For examples of how to derive the power function, see the lectures:

Hypothesis testing about the mean (z-test and t-test);

Hypothesis testing about the variance (Chi-square test).

Usually, the power of a test is an increasing function of sample size : the more observations we have, the more powerful the test.
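To see this dependence on sample size numerically, here is a small sketch for a right-tailed z-test; the true mean, \(\sigma\), and size below are illustrative assumptions.

```python
# Power of a right-tailed z-test grows with the number of observations n.
from scipy.stats import norm

sigma, alpha, mu_true = 1.0, 0.05, 0.2    # illustrative assumptions
z_crit = norm.ppf(1 - alpha)
for n in (25, 50, 100, 200, 400):
    power = 1 - norm.cdf(z_crit - mu_true * n**0.5 / sigma)
    print(f"n = {n:3d}: power = {power:.3f}")
```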

You can find a more exhaustive explanation of the concept of power function in the lecture entitled Hypothesis testing .

Some related concepts are found in the following glossary entries:

alternative hypothesis ;

Type I error ;

Type II error .



What is the correct definition of a Power Function?

In Casella and Berger's Statistical Inference, they define the power function of a hypothesis test with rejection region $R$ to be the function of $\theta$ defined by $\beta(\theta) = P_\theta(X\in R)$ for some data $X$. Suppose that $H_0: \theta\in \Theta_0$ and $H_1: \theta \in \Theta_0^c$.

Furthermore, they state that:

$$ P_\theta(X\in R) = \begin{cases} \text{probability of a Type 1 error} &\mbox{if } \theta\in \Theta_0\\ \text{one minus the probability of a Type 2 error} & \mbox{if } \theta\in \Theta_0^c\end{cases} $$

However, my understanding has always been that the power function is the probability of rejecting the null, given that the null is false. This doesn't match the above. What is wrong here? Thanks!

  • hypothesis-testing
  • statistical-significance
  • mathematical-statistics

2 Answers

Consider if you have a simple null, like $\mu=\mu_0$ against a two-sided alternative. Then your power function has a "hole" at $\mu_0$.

The usual definition of power function fills in the hole, making the power function defined for all possible values of $\theta$.

Sure, at that point it's not power, but calling it a "rejection rate function" just because you defined the function at one point where it isn't measuring power is a little clumsy.

– Glen_b

Power is the probability that the observation is in the rejection region when some value in the parameter space of the alternative is correct (falsely rejecting the null hypothesis). But when the two distributions are identical, the rejection region for the null hypothesis also corresponds to the non-rejection region for the alternative, so $\alpha =1-\beta$ . Think of the case of two univariate normal distributions with variance 1 and mean 0 under the null hypothesis and a one-sided alternative mean >0. Then as the alternative mean gets closer to zero, the power drops all the way down to $\alpha$ . A drawing showing the critical region with the standard normal and the normal shift to the right of a mean $\mu>0$ should make this clear.

– utobi

  • $\begingroup$ "Power is the probability [of] [...] (falsely rejecting the null hypothesis)" - sorry if I'm misinterpreting, but isn't power correctly rejecting the null hypothesis? $\endgroup$ –  HeyJude Commented Dec 14, 2023 at 22:11
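A quick numerical sketch of the second answer's claim (the setup mirrors that answer's example; the code itself is not part of the original thread): for a right-tailed z-test on a single N(\(\mu\), 1) observation, the power decreases to \(\alpha\) as the alternative mean \(\mu\) approaches 0.

```python
# As the alternative mean approaches the null value, power -> alpha.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)              # rejection region: X >= z_crit
for mu in (1.0, 0.5, 0.1, 0.01, 0.0):
    power = 1 - norm.cdf(z_crit - mu)     # P(X >= z_crit) when X ~ N(mu, 1)
    print(f"mu = {mu:4.2f}: power = {power:.4f}")
```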



Definition of hypothesis

Did You Know?

The Difference Between Hypothesis and Theory

A hypothesis is an assumption, an idea that is proposed for the sake of argument so that it can be tested to see if it might be true.

In the scientific method, the hypothesis is constructed before any applicable research has been done, apart from a basic background review. You ask a question, read up on what has been studied before, and then form a hypothesis.

A hypothesis is usually tentative; it's an assumption or suggestion made strictly for the objective of being tested.

A theory , in contrast, is a principle that has been formed as an attempt to explain things that have already been substantiated by data. It is used in the names of a number of principles accepted in the scientific community, such as the Big Bang Theory . Because of the rigors of experimentation and control, it is understood to be more likely to be true than a hypothesis is.

In non-scientific use, however, hypothesis and theory are often used interchangeably to mean simply an idea, speculation, or hunch, with theory being the more common choice.

Since this casual use does away with the distinctions upheld by the scientific community, hypothesis and theory are prone to being wrongly interpreted even when they are encountered in scientific contexts—or at least, contexts that allude to scientific study without making the critical distinction that scientists employ when weighing hypotheses and theories.

The most common occurrence is when theory is interpreted—and sometimes even gleefully seized upon—to mean something having less truth value than other scientific principles. (The word law applies to principles so firmly established that they are almost never questioned, such as the law of gravity.)

This mistake is one of projection: since we use theory in general to mean something lightly speculated, then it's implied that scientists must be talking about the same level of uncertainty when they use theory to refer to their well-tested and reasoned principles.

The distinction has come to the forefront particularly on occasions when the content of science curricula in schools has been challenged—notably, when a school board in Georgia put stickers on textbooks stating that evolution was "a theory, not a fact, regarding the origin of living things." As Kenneth R. Miller, a cell biologist at Brown University, has said , a theory "doesn’t mean a hunch or a guess. A theory is a system of explanations that ties together a whole bunch of facts. It not only explains those facts, but predicts what you ought to find from other observations and experiments.”

While theories are never completely infallible, they form the basis of scientific reasoning because, as Miller said, "to the best of our ability, we’ve tested them, and they’ve held up."

Synonyms: proposition, supposition
hypothesis , theory , law mean a formula derived by inference from scientific data that explains a principle operating in nature.

hypothesis implies insufficient evidence to provide more than a tentative explanation.

theory implies a greater range of evidence and greater likelihood of truth.

law implies a statement of order and relation in nature that has been found to be invariable under the same conditions.


Word History

Greek, from hypotithenai to put under, suppose, from hypo- + tithenai to put — more at do

1641, in the meaning defined at sense 1a

Phrases Containing hypothesis

  • counter - hypothesis
  • nebular hypothesis
  • null hypothesis
  • planetesimal hypothesis
  • Whorfian hypothesis



Harvard Scientists Say There May Be an Unknown, Technologically Advanced Civilization Hiding on Earth

A provocative hypothesis.


What if — stick with us here — an unknown technological civilization is hiding right here on Earth, sheltering in bases deep underground and possibly even emerging with UFOs or disguised as everyday humans?

In a new paper that's bound to raise eyebrows in the scientific community, a team of researchers from Harvard and Montana Technological University speculates that sightings of "Unidentified Anomalous Phenomena" (UAP) — bureaucracy-speak for UFOs, basically — "may reflect activities of intelligent beings concealed in stealth here on Earth (e.g., underground), and/or its near environs (e.g., the Moon), and/or even 'walking among us' (e.g., passing as humans)."

Yes, that's a direct quote from the paper. Needless to say, the researchers admit, this idea of hidden "cryptoterrestrials" is a highly exotic hypothesis that's "likely to be regarded skeptically by most scientists." Nonetheless, they argue, the theory "deserves genuine consideration in a spirit of epistemic humility and openness."

The interest in unexplained sightings of UFOs by military personnel has grown considerably over the past decade or so. This attention grew to a peak last summer, when former Air Force intelligence officer and whistleblower David Grusch testified in front of Congress , claiming that the US had already recovered alien spacecraft as part of a decades-long UFO retrieval program.

Even NASA has opened its doors for researchers to explore mysterious, high-speed objects that have been spotted by military pilots over the years.

But several Pentagon reports later, we have yet to find any evidence of extraterrestrial life.

That hasn't dissuaded these Harvard researchers, though. In the paper, they suggest a range of possibilities, each more outlandish than the last.

First is that a "remnant form" of an ancient, highly advanced human civilization is still hanging around, observing us. Second is that an intelligent species evolved independently of humans in the distant past, possibly from "intelligent dinosaurs," and is now hiding their presence from us. Third is that these hidden occupants of Earth traveled here from another planet or time period. And fourth — please keep a straight face, everybody — is that these unknown inhabitants of Earth are "less technological than magical," which the researchers liken to "earthbound angels."

UFO sightings of "craft and other phenomena (e.g., 'orbs') appearing to enter/exit potential underground access points, like volcanoes," they write, could be evidence that these cryptoterrestrials are not merely drawn to these spots but actually reside in underground or underwater bases.

The paper quotes former House Representative Mike Gallagher, who suggested last year that one explanation for the UFO sightings might be "an ancient civilization that’s just been hiding here, for all this time, and is suddenly showing itself right now," following Grusch's testimony.

The researchers didn't stop there, even suggesting that these cryptoterrestrials may take on different, non-human primate or even reptile forms.

Beyond residing deep underground, they speculate that this mysterious species could even be concealing themselves on the Moon or could have mastered the art of blending in as human beings, a folk theory that has inspired countless works of science fiction.

Another explanation, as put forward by controversial Harvard astrophysicist Avi Loeb, suggests that other ancient civilizations may have lived on "planets like Mars or Earth" but a "billion years apart and hence were not aware of each other."

Of course, these are all "far-fetched" hypotheses, as the scientists admit, and deserve to be regarded with plenty of skepticism.

"We entertain them here because some aspects of UAP are strange enough that they seem to call for unconventional explanations," the paper reads.

"It may be exceedingly improbable, but hopefully this paper has shown it should nevertheless be kept on the table as we seek to understand the ongoing empirical mystery of UAP," the researchers conclude.

More on UFOs: New Law Would Force Government to Declassify Every UFO Document



S.5 Power Analysis

Why is power analysis important?

Consider a research experiment where the p -value computed from the data was 0.12. As a result, one would fail to reject the null hypothesis because this p -value is larger than \(\alpha\) = 0.05. However, there still exist two possible cases for which we failed to reject the null hypothesis:

  • the null hypothesis is a reasonable conclusion,
  • the sample size is not large enough to either accept or reject the null hypothesis, i.e., additional samples might provide additional evidence.

Power analysis is the procedure that researchers can use to determine whether a test has enough power to make a reasonable conclusion. From another perspective, power analysis can also be used to calculate the number of samples required to achieve a specified level of power.

Example S.5.1

Let's take a look at an example that illustrates how to compute the power of the test.

Let X denote the height of randomly selected Penn State students. Assume that X is normally distributed with unknown mean \(\mu\) and a standard deviation of 9. Take a random sample of n = 25 students, so that, after setting the probability of committing a Type I error at \(\alpha = 0.05\), we can test the null hypothesis \(H_0: \mu = 170\) against the alternative hypothesis that \(H_A: \mu > 170\).

What is the power of the hypothesis test if the true population mean were \(\mu = 175\)?

\[\begin{align}z&=\frac{\bar{x}-\mu}{\sigma / \sqrt{n}} \\ \bar{x}&= \mu + z \left(\frac{\sigma}{\sqrt{n}}\right) \\ \bar{x}&=170+1.645\left(\frac{9}{\sqrt{25}}\right) \\ &=172.961\\ \end{align}\]

So we should reject the null hypothesis when the observed sample mean is 172.961 or greater:

\[\begin{align}\text{Power}&=P(\bar{x} \ge 172.961 \text{ when } \mu =175)\\ &=P\left(z \ge \frac{172.961-175}{9/\sqrt{25}} \right)\\ &=P(z \ge -1.133)\\ &= 0.8713\\ \end{align}\]

and illustrated below:

[Figure: two overlapping normal distributions with means 170 and 175; the power of 0.871 is shown on the right curve.]

In summary, we have determined that we have an 87.13% chance of rejecting the null hypothesis \(H_0: \mu = 170\) in favor of the alternative hypothesis \(H_A: \mu > 170\) if the true unknown population mean is, in reality, \(\mu = 175\).
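For readers who prefer to check such computations in code, here is a minimal sketch reproducing this example with scipy (the package choice is an assumption; the numbers come from the example above):

```python
# Power of the right-tailed z-test H_0: mu = 170 vs H_A: mu > 170
# with sigma = 9, n = 25, alpha = 0.05, evaluated at mu = 175.
from scipy.stats import norm

mu0, mu_a, sigma, n, alpha = 170, 175, 9, 25, 0.05
se = sigma / n**0.5
xbar_crit = mu0 + norm.ppf(1 - alpha) * se        # 172.961
power = 1 - norm.cdf((xbar_crit - mu_a) / se)     # about 0.8713
print(f"critical mean = {xbar_crit:.3f}, power = {power:.4f}")
```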

Calculating Sample Size

If the sample size is fixed, then decreasing the Type I error rate \(\alpha\) will increase the Type II error rate \(\beta\). If one wants to decrease both, then one has to increase the sample size.

To calculate the smallest sample size needed for specified \(\alpha\), \(\beta\), and \(\mu_a\) (where \(\mu_a\) is the value of \(\mu\) under the alternative at which you want to evaluate the power), proceed as in the example below.

Let's investigate by returning to our previous example.

Example S.5.2

Let X denote the height of randomly selected Penn State students. Assume that X is normally distributed with unknown mean \(\mu\) and standard deviation 9. We are interested in testing, at the \(\alpha = 0.05\) level, the null hypothesis \(H_0: \mu = 170\) against the alternative hypothesis \(H_A: \mu > 170\).

Find the sample size n that is necessary to achieve 0.90 power at the alternative \(\mu_a = 175\).

\[\begin{align}n&= \dfrac{\sigma^2(Z_{\alpha}+Z_{\beta})^2}{(\mu_0−\mu_a)^2}\\ &=\dfrac{9^2 (1.645 + 1.28)^2}{(170-175)^2}\\ &=27.72\\ n&=28\\ \end{align}\]
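The same calculation in code (a sketch with scipy, assumed available; it reproduces the arithmetic above and rounds up to the next whole subject):

```python
# Smallest n with 0.90 power at mu_a = 175 for the one-sided z-test above.
from math import ceil
from scipy.stats import norm

mu0, mu_a, sigma = 170, 175, 9
alpha, power = 0.05, 0.90
z_alpha = norm.ppf(1 - alpha)     # 1.645
z_beta = norm.ppf(power)          # about 1.28
n = (sigma**2 * (z_alpha + z_beta)**2) / (mu0 - mu_a)**2
print(f"n = {n:.2f} -> use n = {ceil(n)}")   # about 27.7 -> 28
```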

In summary, you should see how power analysis is very important for making the correct decision when the data indicate that one cannot reject the null hypothesis. You should also see how power analysis can be used to calculate the minimum sample size required to detect a difference that meets the needs of your research.
