Statistics By Jim

Making statistics intuitive

Hypothesis Testing: Uses, Steps & Example

By Jim Frost

What is Hypothesis Testing?

Hypothesis testing in statistics uses sample data to infer the properties of a whole population. These tests determine whether a random sample provides sufficient evidence to conclude an effect or relationship exists in the population. Researchers use them to help separate genuine population-level effects from false effects that random chance can create in samples. These methods are also known as significance testing.

For example, researchers are testing a new medication to see if it lowers blood pressure. They compare a group taking the drug to a control group taking a placebo. If their hypothesis test results are statistically significant, the medication’s effect of lowering blood pressure likely exists in the broader population, not just the sample studied.

Using Hypothesis Tests

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement the sample data best supports. These two statements are called the null hypothesis and the alternative hypothesis. The following are typical examples:

  • Null Hypothesis: The effect does not exist in the population.
  • Alternative Hypothesis: The effect does exist in the population.

Hypothesis testing accounts for the inherent uncertainty of using a sample to draw conclusions about a population, which reduces the chances of false discoveries. These procedures determine whether the sample data are sufficiently inconsistent with the null hypothesis that you can reject it. If you can reject the null, your data favor the alternative statement that an effect exists in the population.

Statistical significance in hypothesis testing indicates that an effect you see in sample data also likely exists in the population after accounting for random sampling error, variability, and sample size. Your results are statistically significant when the p-value is less than your significance level or, equivalently, when your confidence interval excludes the null hypothesis value.
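To see why the p-value rule and the confidence interval rule agree, here is a minimal sketch using a two-sided z-test for one mean. The function name `z_test_and_ci`, the known standard deviation, and all the numbers are assumptions made up for illustration; they are not from the example later in this post.

```python
from statistics import NormalDist

def z_test_and_ci(sample_mean, null_mean, sigma, n, alpha=0.05):
    """Two-sided z-test and the matching confidence interval for one mean."""
    std_normal = NormalDist()
    standard_error = sigma / n ** 0.5
    z = (sample_mean - null_mean) / standard_error
    # Two-sided p-value: chance of a |z| at least this large if H0 is true.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # (1 - alpha) confidence interval centered on the sample mean.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * standard_error,
          sample_mean + z_crit * standard_error)
    return p_value, ci

# Hypothetical numbers, chosen only for illustration.
alpha = 0.05
null_mean = 100.0
p_value, (low, high) = z_test_and_ci(sample_mean=103.0, null_mean=null_mean,
                                     sigma=15.0, n=144, alpha=alpha)
significant_by_p = p_value < alpha
significant_by_ci = not (low <= null_mean <= high)
print(f"p-value = {p_value:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
print("Significant by p-value:", significant_by_p)
print("Significant by CI:", significant_by_ci)
```

Both checks use the same standard error and the same critical value, which is why they always reach the same verdict for a given significance level.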

Conversely, non-significant results indicate that despite an apparent sample effect, you can’t be sure it exists in the population. It could be chance variation in the sample and not a genuine effect.

Learn more about Failing to Reject the Null.

5 Steps of Significance Testing

Hypothesis testing involves five key steps, each critical to validating a research hypothesis using statistical methods:

  • Formulate the Hypotheses: Write your research hypotheses as a null hypothesis (H0) and an alternative hypothesis (HA).
  • Data Collection: Gather data specifically aimed at testing the hypothesis.
  • Conduct a Test: Use a suitable statistical test to analyze your data.
  • Make a Decision: Based on the statistical test results, decide whether to reject the null hypothesis or fail to reject it.
  • Report the Results: Summarize and present the outcomes in your report’s results and discussion sections.

While the specifics of these steps can vary depending on the research context and the data type, the fundamental process of hypothesis testing remains consistent across different studies.

Let’s work through these steps in an example!

Hypothesis Testing Example

Researchers want to determine if a new educational program improves student performance on standardized tests. They randomly assign 30 students to a control group, which follows the standard curriculum, and another 30 students to a treatment group, which participates in the new educational program. After a semester, they compare the test scores of both groups.

Download the CSV data file to perform the hypothesis testing yourself: Hypothesis_Testing.

The researchers write their hypotheses. These statements apply to the population, so they use the mu (μ) symbol for the population mean parameter.

  • Null Hypothesis (H0): The population means of the test scores for the two groups are equal (μ1 = μ2).
  • Alternative Hypothesis (HA): The population means of the test scores for the two groups are unequal (μ1 ≠ μ2).

Choosing the correct hypothesis test depends on attributes such as data type and number of groups. Because they’re using continuous data and comparing two means, the researchers use a 2-sample t-test.
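For reference, a 2-sample t-test boils the comparison down to a single t-value. The sketch below shows one common form, Welch's version, which does not assume equal variances. The post does not say which variant the researchers ran, so treat the choice as an assumption. The symbols are notation introduced here: x̄ is a sample mean, s a sample standard deviation, and n a group size, with subscripts 1 and 2 for the two groups.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

The numerator is the observed difference between the group means, and the denominator is its standard error, so t measures the difference in standard-error units.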

Here are the results.

Hypothesis testing results for the example.

The treatment group’s mean is 58.70, compared to the control group’s mean of 48.12. The mean difference is 10.67 points. Use the test’s p-value and significance level to determine whether this difference is likely a product of random fluctuation in the sample or a genuine population effect.

Because the p-value (displayed as 0.000, meaning it is less than 0.0005) is less than the standard significance level of 0.05, the results are statistically significant, and we can reject the null hypothesis. The sample data provide sufficient evidence to conclude that the new program’s effect exists in the population.


Hypothesis testing improves your effectiveness in making data-driven decisions. However, it is not 100% accurate because random samples occasionally produce fluky results. Hypothesis tests have two types of errors, both relating to drawing incorrect conclusions.

  • Type I error: The test rejects a true null hypothesis—a false positive.
  • Type II error: The test fails to reject a false null hypothesis—a false negative.
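A quick simulation can make the Type I error rate concrete. This is a sketch under simple assumptions: every simulated study samples from a population where the null hypothesis really is true, and a z-test with a known standard deviation of 1 stands in for whatever test you would actually run. The function name `false_positive_rate` and its defaults are hypothetical.

```python
import random
from statistics import NormalDist, mean

def false_positive_rate(n_experiments=2000, n=30, alpha=0.05, seed=1):
    """Share of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_experiments):
        # Sample from a population whose true mean is 0, so H0 holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / n ** 0.5)  # known sigma = 1
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1  # a false positive (Type I error)
    return rejections / n_experiments

rate = false_positive_rate()
print(f"Rejected a true null in {rate:.1%} of simulated studies")
```

The share of false positives should land close to the significance level, because the significance level is the Type I error rate you agree to accept when the null is true.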

Learn more about Type I and Type II Errors.

Our exploration of hypothesis testing using a practical example of an educational program reveals its powerful ability to guide decisions based on statistical evidence. Whether you’re a student, researcher, or professional, understanding and applying these procedures can open new doors to discovering insights and making informed decisions. Let this tool empower your analytical endeavors as you navigate through the vast seas of data.

Learn more about the Hypothesis Tests for Various Data Types.

June 10, 2024 at 10:51 am

Thank you, Jim, for another helpful article; timely too since I have started reading your new book on hypothesis testing and, now that we are at the end of the school year, my district is asking me to perform a number of evaluations on instructional programs. This is where my question/concern comes in. You mention that hypothesis testing is all about testing samples. However, I use all the students in my district when I make these comparisons. Since I am using the entire “population” in my evaluations (I don’t select a sample of third grade students, for example, but I use all 700 third graders), am I somehow misusing the tests? Or can I rest assured that my district’s student population is only a sample of the universal population of students?

June 10, 2024 at 1:50 pm

I hope you are finding the book helpful!

Yes, the purpose of hypothesis testing is to infer the properties of a population while accounting for random sampling error.

In your case, it comes down to how you want to use the results. Who do you want the results to apply to?

If you’re summarizing the sample, looking for trends and patterns, or evaluating those students and don’t plan to apply those results to other students, you don’t need hypothesis testing because there is no sampling error. They are the population and you can just use descriptive statistics. In this case, you’d only need to focus on the practical significance of the effect sizes.

On the other hand, if you want to apply the results from this group to other students, you’ll need hypothesis testing. However, there is the complicating issue of what population your sample of students represents. I’m sure your district has its own unique characteristics, demographics, etc. Your district’s students probably don’t adequately represent a universal population. At the very least, you’d need to recognize any special attributes of your district and how they could bias the results when trying to apply them outside the district. Or they might apply to similar districts in your region.

However, I’d imagine your 3rd graders probably adequately represent future classes of 3rd graders in your district. You need to be alert to changing demographics. At least in the short run I’d imagine they’d be representative of future classes.

Think about how these results will be used. Do they just apply to the students you measured? Then you don’t need hypothesis tests. However, if the results are being used to infer things about other students outside of the sample, you’ll need hypothesis testing along with considering how well your students represent the other students and how they differ.

I hope that helps!

June 10, 2024 at 3:21 pm

Thank you so much, Jim, for the suggestions in terms of what I need to think about and consider! You are always so clear in your explanations!!!!

June 10, 2024 at 3:22 pm

You’re very welcome! Best of luck with your evaluations!

Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.

In this blog post we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing
      2.1. Set up Hypotheses: Null and Alternative
      2.2. Choose a Significance Level (α)
      2.3. Calculate a test statistic and P-Value
      2.4. Make a Decision
  • Example: Testing a new drug.
  • Example in Python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a die and asked if it’s biased. By rolling it a few times and analyzing the outcomes, you’d be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses: Begin with a null hypothesis (H0) and an alternative hypothesis (H1).
  • Choose a Significance Level (α): Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate Test Statistic and P-Value: Gather evidence (data) and calculate a test statistic.
  • P-value: This is the probability of observing data at least as extreme as yours, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule: If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing, H0: “The new drug is no better than the existing one,” H1: “The new drug is superior.”

2.2. Choose a Significance Level (α)

You collect and analyze data to test the H0 and H1 hypotheses. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative or fail to reject it. Failing to reject the null hypothesis is not the same as accepting it.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive):

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis. In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error is set by the significance level of a test. Commonly, tests are conducted at the 0.05 significance level, which means there’s a 5% chance of making a Type I error when the null hypothesis is true.
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example: If a drug is not effective (truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.

Type II Error (False Negative):

  • Symbolized by the Greek letter beta (β).
  • Occurs when you fail to reject a false null hypothesis. This means you find no evidence of an effect or difference when, in reality, there is one.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example: If a drug is effective (truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors:

In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.
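The trade-off can be put into numbers with a small power calculation. This sketch assumes a two-sided, one-sample z-test with a known standard deviation. The function name `power_two_sided_z`, the effect size of 0.5 standard deviations, and the sample sizes are all hypothetical choices for illustration.

```python
from statistics import NormalDist

def power_two_sided_z(effect, sigma, n, alpha):
    """Power (1 - beta) of a two-sided one-sample z-test."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / n ** 0.5)  # true effect in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# Hypothetical true effect of 0.5 standard deviations.
powers = {}
for alpha, n in [(0.05, 30), (0.01, 30), (0.01, 50)]:
    powers[(alpha, n)] = power_two_sided_z(effect=0.5, sigma=1.0, n=n, alpha=alpha)
    beta = 1 - powers[(alpha, n)]
    print(f"alpha={alpha}, n={n}: power={powers[(alpha, n)]:.2f}, beta={beta:.2f}")
```

Under these assumptions, tightening α from 0.05 to 0.01 at the same sample size raises β, and collecting more data brings the power back up.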

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.

2.3. Calculate a test statistic and P-Value

Test statistic: A test statistic is a single number that helps us understand how far our sample data is from what we’d expect under the null hypothesis (the basic assumption we’re testing against). Generally, the larger the magnitude of the test statistic, the more evidence we have against the null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or if there’s an actual effect.

P-value: The P-value tells us how likely we would be to get our observed results (or something more extreme) if the null hypothesis were true. It’s a value between 0 and 1.

  • A smaller P-value (typically below 0.05) means that the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger P-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.

2.4. Make a Decision

Relationship between $α$ and P-Value

When conducting a hypothesis test:

We first choose a significance level $α$, before looking at the data.

We then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen $α$:

  • If $p\text{-value} ≤ α$: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If $p\text{-value} > α$: We fail to reject the null hypothesis. There isn’t enough statistical evidence to support the alternative hypothesis.
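The comparison above is simple enough to write as a tiny helper. This is only a sketch: the function name `decide` and the wording of its return values are made up here.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p-value <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject H0 (statistically significant)"
    return "fail to reject H0 (not statistically significant)"

print(decide(0.03))              # compared against alpha = 0.05
print(decide(0.20))              # compared against alpha = 0.05
print(decide(0.03, alpha=0.01))  # same p-value, stricter alpha
```

Note that the same p-value can lead to different decisions under different significance levels, which is why α should be fixed before the data are analyzed.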

3. Example: Testing a new drug.

Imagine we are investigating whether a new drug relieves headaches faster than a placebo.

Setting Up the Experiment: You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (let’s call this the ‘Drug Group’), and the other half are given a sugar pill, which doesn’t contain any medication (the ‘Placebo Group’).

  • Set up Hypotheses: Before starting, you make a prediction:
  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups reflects a real effect and is not just due to chance.

Calculate Test Statistic and P-Value: After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.
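Here is a small sketch of that idea. The group means (3 hours and 2 hours) and group sizes (50 each) come from the example; the standard deviations are hypothetical, picked only to contrast a tight spread with a wide one. The helper `welch_t` is a name introduced here, and it computes a t statistic from summary numbers.

```python
def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for a difference in means, from summary statistics."""
    standard_error = (sd_a ** 2 / n_a + sd_b ** 2 / n_b) ** 0.5
    return (mean_a - mean_b) / standard_error

# Placebo Group mean 3 hours, Drug Group mean 2 hours, 50 people each.
# The standard deviations below are assumptions, not values from the example.
t_small_spread = welch_t(3.0, 0.5, 50, 2.0, 0.5, 50)
t_large_spread = welch_t(3.0, 5.0, 50, 2.0, 5.0, 50)
print(f"sd = 0.5 hours: t = {t_small_spread:.2f}")
print(f"sd = 5.0 hours: t = {t_large_spread:.2f}")
```

The 1-hour difference is the same in both cases, but the test statistic is ten times larger when healing times vary little within each group.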

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see a difference at least this large just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the P-value is less than $α$ (0.05): the results are “statistically significant,” and we reject the null hypothesis, concluding the new drug has an effect.
  • If the P-value is greater than $α$ (0.05): the results are not statistically significant, and we don’t reject the null hypothesis, remaining unsure if the drug has a genuine effect.
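One way to answer the “just by random chance” question directly is a permutation test, sketched below. This is a different technique from the t-test used in the next section: it shuffles the group labels many times and counts how often chance alone produces a difference at least as large as the observed one. The healing times are invented for illustration, and `permutation_p_value` is a hypothetical helper name.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # If the drug has no effect, the group labels are arbitrary,
        # so reshuffling them shows what chance alone can produce.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Counting the observed arrangement itself keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Hypothetical healing times in hours, invented for illustration.
drug = [1.8, 2.1, 2.4, 1.6, 2.0, 2.3, 1.9, 2.2]
placebo = [2.9, 3.2, 2.7, 3.4, 3.0, 2.8, 3.3, 3.1]
p_value = permutation_p_value(drug, placebo)
print(f"permutation p-value = {p_value:.4f}")
```

With these made-up numbers every placebo time is longer than every drug time, so almost no reshuffling matches the observed gap and the p-value comes out very small.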

4. Example in Python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
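A minimal sketch of such a test, using only the Python standard library, might look like this. It runs a two-sided Welch t-test on simulated healing times. The simulated data, the standard deviation of 1.0, the seed, and the helper names `t_sf` and `welch_t_test` are all assumptions for illustration. In practice you would more likely call a library routine such as SciPy's `scipy.stats.ttest_ind`.

```python
import math
import random
from statistics import mean, variance

def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) for t >= 0, by Simpson's rule."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        # Density of Student's t distribution with df degrees of freedom.
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    # The density is symmetric, so P(T > t) = 0.5 - P(0 < T < t).
    return max(0.0, 0.5 - area * h / 3)

def welch_t_test(a, b):
    """Two-sided Welch t-test; returns (t statistic, p-value)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, 2 * t_sf(abs(t), df)

# Simulated healing times in hours. The means echo the example above;
# the standard deviation of 1.0 is an assumption.
rng = random.Random(42)
drug = [rng.gauss(2.0, 1.0) for _ in range(50)]
placebo = [rng.gauss(3.0, 1.0) for _ in range(50)]

alpha = 0.05
t_stat, p_value = welch_t_test(drug, placebo)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4g}")
if p_value < alpha:
    print("Statistically significant: reject H0.")
else:
    print("Not statistically significant: fail to reject H0.")
```

The t statistic is negative here because the Drug Group is listed first and its healing times are shorter; the two-sided p-value depends only on the magnitude.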

Making a Decision: If the p-value is below 0.05, we’d say, “The results are statistically significant! The drug seems to have an effect!” If not, we’d say, “Looks like the drug isn’t as miraculous as we thought.”

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.

Hypothesis Testing in Data Science: It's Usage and Types

Table of Contents  

1) What is Hypothesis Testing in Data Science? 

2) Importance of Hypothesis Testing in Data Science 

3) Types of Hypothesis Testing 

4) Basic steps in Hypothesis Testing 

5) Real-world use cases of Hypothesis Testing 

6) Conclusion 

What is Hypothesis Testing in Data Science?  

Hypothesis Testing in Data Science is a statistical method used to assess the validity of assumptions or claims about a population based on sample data. It involves formulating two Hypotheses, the null Hypothesis (H0) and the alternative Hypothesis (Ha or H1), and then using statistical tests to find out if there is enough evidence to support the alternative Hypothesis.  

Hypothesis Testing is a critical tool for making data-driven decisions, evaluating the significance of observed effects or differences, and drawing meaningful conclusions from data, allowing Data Scientists to uncover patterns, relationships, and insights that inform various domains, from medicine to business and beyond. 

Importance of Hypothesis Testing in Data Science  

The significance of Hypothesis Testing in Data Science cannot be overstated. It serves as the cornerstone of data-driven decision-making. Systematically testing Hypotheses gives Data Scientists the following benefits: 

Objective decision-making 

Hypothesis Testing provides a structured and impartial method for making decisions based on data. In a world where biases can skew perceptions, Data Scientists rely on this method to ensure that their conclusions are grounded in empirical evidence, making their decisions more objective and trustworthy. 

Statistical rigour 

Data Scientists deal with large amounts of data, and Hypothesis Testing helps them make sense of it. It quantifies the significance of observed patterns, differences, or relationships. This statistical rigour is essential in distinguishing between mere coincidences and meaningful findings, reducing the likelihood of making decisions based on random chance. 

Resource allocation 

Resources, whether they are financial, human, or time-related, are often limited. Hypothesis Testing enables efficient resource allocation by guiding Data Scientists towards strategies or interventions that are statistically significant. This ensures that efforts are directed where they are most likely to yield valuable results. 

Risk management 

In domains like healthcare and finance, where lives and livelihoods are at stake, Hypothesis Testing is a critical tool for risk assessment. For instance, in drug development, Hypothesis Testing is used to determine the safety and efficacy of new treatments, helping mitigate potential risks to patients. 

Innovation and progress 

Hypothesis Testing fosters innovation by providing a systematic framework to evaluate new ideas, products, or strategies. It encourages a cycle of experimentation, feedback, and improvement, driving continuous progress and innovation. 

Strategic decision-making 

Organisations base their strategies on data-driven insights. Hypothesis Testing enables them to make informed decisions about market trends, customer behaviour, and product development. These decisions are grounded in empirical evidence, increasing the likelihood of success. 

Scientific integrity 

In scientific research, Hypothesis Testing is integral to maintaining the integrity of research findings. It ensures that conclusions are drawn from rigorous statistical analysis rather than conjecture. This is essential for advancing knowledge and building upon existing research. 

Regulatory compliance 

Many industries, such as pharmaceuticals and aviation, operate under strict regulatory frameworks. Hypothesis Testing is essential for demonstrating compliance with safety and quality standards. It provides the statistical evidence required to meet regulatory requirements. 

Types of Hypothesis Testing  

Hypothesis Testing involves several types of Hypotheses. The five main types are described below: 

Alternative Hypothesis

The Alternative Hypothesis, denoted as Ha or H1, is the assertion or claim that researchers aim to support with their data analysis. It represents the opposite of the null Hypothesis (H0) and suggests that there is a significant effect, relationship, or difference in the population. In simpler terms, it's the statement that researchers hope to find evidence for during their analysis. For example, if you are testing a new drug's efficacy, the alternative Hypothesis might state that the drug has a measurable positive effect on patients' health. 

Null Hypothesis 

The Null Hypothesis, denoted as H0, is the default assumption in Hypothesis Testing. It posits that there is no significant effect, relationship, or difference in the population being studied. In other words, it represents the status quo or the absence of an effect. Researchers typically set out to challenge or disprove the Null Hypothesis by collecting and analysing data. Using the drug efficacy example again, the Null Hypothesis might state that the new drug has no effect on patients' health. 

Non-directional Hypothesis 

A Non-directional Hypothesis, also known as a two-tailed Hypothesis, is used when researchers are interested in whether there is any significant difference, effect, or relationship in either direction (positive or negative). This type of Hypothesis allows for the possibility of finding effects in both directions. For instance, in a study comparing the performance of two groups, a Non-directional Hypothesis would suggest that there is a significant difference between the groups, without specifying which group performs better. 

Directional Hypothesis 

A Directional Hypothesis, also called a one-tailed Hypothesis, is employed when researchers have a specific expectation about the direction of the effect, relationship, or difference they are investigating. In this case, the Hypothesis predicts an outcome in a particular direction—either positive or negative. For example, if you expect that a new teaching method will improve student test scores, a directional Hypothesis would state that the new method leads to higher test scores. 
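The choice between a Non-directional and a Directional Hypothesis changes how the p-value is computed. The sketch below assumes a z statistic whose sign is positive when the effect goes in the predicted direction. The value 1.8 and the helper name `p_values_from_z` are made up for illustration.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Two-tailed and upper-tailed p-values for a z statistic."""
    std_normal = NormalDist()
    # Directional (one-tailed): only large positive z counts as evidence.
    upper_tail = 1 - std_normal.cdf(z)
    # Non-directional (two-tailed): large |z| in either direction counts.
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return two_tailed, upper_tail

# Hypothetical z statistic for "new method minus old method".
two_tailed, upper_tail = p_values_from_z(1.8)
print(f"two-tailed p-value = {two_tailed:.4f}")
print(f"one-tailed p-value = {upper_tail:.4f}")
```

For a result in the predicted direction, the one-tailed p-value is half the two-tailed one, so the same data can be significant at 0.05 under a Directional Hypothesis and not under a Non-directional one. That is why the direction should be chosen before seeing the data.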

Statistical Hypothesis 

A Statistical Hypothesis is a Hypothesis formulated in a way that it can be tested using statistical methods. It involves specific numerical values or parameters that can be measured or compared. Statistical Hypotheses are crucial for quantitative research and often involve means, proportions, variances, correlations, or other measurable quantities. These Hypotheses provide a precise framework for conducting statistical tests and drawing conclusions based on data analysis. 

Basic steps in Hypothesis Testing  

Hypothesis Testing is a systematic approach used in statistics to make informed decisions based on data. It is a critical tool in Data Science, research, and many other fields where data analysis is employed. The following are the basic steps involved in Hypothesis Testing: 

1) Formulate Hypotheses 

The first step in Hypothesis Testing is to clearly define your research question and translate it into two mutually exclusive Hypotheses: 

a) Null Hypothesis (H0): This is the default assumption, often representing the status quo or the absence of an effect. It states that there is no significant difference, relationship, or effect in the population. 

b) Alternative Hypothesis (Ha or H1): This is the statement that contradicts the null Hypothesis. It suggests that there is a significant difference, relationship, or effect in the population. 

The formulation of these Hypotheses is crucial, as they serve as the foundation for your entire Hypothesis Testing process. 

2) Collect data 

With your Hypotheses in place, the next step is to gather relevant data through surveys, experiments, observations, or any other suitable method. The data collected should be representative of the population you are studying. The quality and quantity of data are essential factors in the success of your Hypothesis Testing. 

3) Choose a significance level (α) 

Before conducting the statistical test, you need to decide on the level of significance, denoted as α. The significance level represents the threshold for statistical significance and determines how confident you want to be in your results. A common choice is α = 0.05, which implies a 5% chance of making a Type I error (rejecting the null Hypothesis when it's true). You can choose a different α value based on the specific requirements of your analysis. 

4) Perform the test 

Based on the nature of your data and the Hypotheses you've formulated, select the appropriate statistical test. There are various tests available, including t-tests, chi-squared tests, ANOVA, regression analysis, and more. The chosen test should align with the type of data (e.g., continuous or categorical) and the research question (e.g., comparing means or testing for independence). 

Execute the selected statistical test on your data to obtain test statistics and p-values. The test statistics quantify the difference or effect you are investigating, while the p-value represents the probability of obtaining results at least as extreme as those observed if the null Hypothesis were true. 

5) Analyse the results 

Once you have the test statistics and p-value, it's time to interpret the results. The primary focus is on the p-value: 

a) If the p-value is less than or equal to your chosen significance level (α), typically 0.05, you have evidence to reject the null Hypothesis. This suggests that there is a significant difference, relationship, or effect in the population. 

b) If the p-value is more than α, you fail to reject the null Hypothesis, showing that there is insufficient evidence to support the alternative Hypothesis. 

6) Draw conclusions 

Based on the analysis of the p-value and the comparison to the significance level, you can draw conclusions about your research question: 

a) In case you reject the null Hypothesis, you can accept the alternative Hypothesis and make inferences based on the evidence provided by your data. 

b) In case you fail to reject the null Hypothesis, you do not accept the alternative Hypothesis, and you acknowledge that there is no significant evidence to support your claim. 

It's important to communicate your findings clearly, including the implications and limitations of your analysis. 

Real-world use cases of Hypothesis Testing  

The following are some of the real-world use cases of Hypothesis Testing. 

a) Medical research: Hypothesis Testing is crucial in determining the efficacy of new medications or treatments. For instance, in a clinical trial, researchers use Hypothesis Testing to assess whether a new drug is significantly more effective than a placebo in treating a particular condition. 

b) Marketing and advertising: Businesses employ Hypothesis Testing to evaluate the impact of marketing campaigns. A company may test whether a new advertising strategy leads to a significant increase in sales compared to the previous approach. 

c) Manufacturing and quality control: Manufacturing industries use Hypothesis Testing to ensure product quality. For example, in the automotive industry, Hypothesis Testing can be applied to test whether a new manufacturing process results in a significant reduction in defects. 

d) Education: In the field of education, Hypothesis Testing can be used to assess the effectiveness of teaching methods. Researchers may test whether a new teaching approach leads to statistically significant improvements in student performance. 

e) Finance and investment: Investment strategies are often evaluated using Hypothesis Testing. Investors may test whether a new investment strategy outperforms a benchmark index over a specified period.  

Conclusion 


To sum it up, Hypothesis Testing in Data Science is a powerful tool that enables Data Scientists to make evidence-based decisions and draw meaningful conclusions from data. Understanding the types, methods, and steps involved in Hypothesis Testing is essential for any Data Scientist. By rigorously applying Hypothesis Testing techniques, you can gain valuable insights and drive informed decision-making in various domains. 

Hypothesis tests

Formal hypothesis testing is perhaps the most prominent and widely-employed form of statistical analysis. It is sometimes seen as the most rigorous and definitive part of a statistical analysis, but it is also the source of many statistical controversies. The currently-prevalent approach to hypothesis testing dates to developments that took place between 1925 and 1940, especially the work of Ronald Fisher, Jerzy Neyman, and Egon Pearson.

In recent years, many prominent statisticians have argued that less emphasis should be placed on the formal hypothesis testing approaches developed in the early twentieth century, with a correspondingly greater emphasis on other forms of uncertainty analysis. Our goal here is to give an overview of some of the well-established and widely-used approaches for hypothesis testing. We will also provide some perspectives on how these tools can be effectively used, and discuss their limitations. We will also discuss some new approaches to hypothesis testing that may eventually come to be as prominent as these classical approaches.

A falsifiable hypothesis is a statement, or hypothesis, that can be contradicted with evidence. In empirical (data-driven) research, this evidence will always be obtained through the data. In statistical hypothesis testing, the hypothesis that we formally test is called the null hypothesis. The alternative hypothesis is a second hypothesis that is our proposed explanation for what happens if the null hypothesis is wrong.

Test statistics

The key element of a statistical hypothesis test is the test statistic, which (like any statistic) is a function of the data. A test statistic takes our entire dataset, and reduces it to one number. This one number ideally should contain all the information in the data that is relevant for assessing the two hypotheses of interest, and exclude any aspects of the data that are irrelevant for assessing the two hypotheses. The test statistic measures evidence against the null hypothesis. Most test statistics are constructed so that a value of zero represents the lowest possible level of evidence against the null hypothesis. Test statistic values that deviate from zero represent greater levels of evidence against the null hypothesis. The larger the magnitude of the test statistic, the stronger the evidence against the null hypothesis.

A major theme of statistical research is to devise effective ways to construct test statistics. Many useful ways to do this have been devised, and there is no single approach that is always the best. In this introductory course, we will focus on tests that start with an estimate of a quantity that is relevant for assessing the hypotheses, then proceed by standardizing this estimate by dividing it by its standard error. This approach is sometimes referred to as “Wald testing”, after Abraham Wald.

Testing the equality of two proportions

As a basic example, let’s consider risk perception related to COVID-19. As you will see below, hypothesis testing can appear at first to be a fairly elaborate exercise. Using this example, we describe each aspect of this exercise in detail below.

The data and research question

The data shown below are simulated but are designed to reflect actual surveys conducted in the United States in March of 2020. Participants were asked whether they perceive that they have a substantial risk of dying if they are infected with the novel coronavirus. The number of people giving each response, stratified by age, is shown below (only two age groups are shown):

Age group    High risk    Not high risk
Age < 30     25           202
Age 60-69    30           124

Each subject’s response is binary – they either perceive themselves to be high risk, or not to be at high risk. When working with this type of data, we are usually interested in the proportion of people who provide each response within each stratum (age group). These are conditional proportions, conditioning on the age group. The numerical values of the conditional proportions are given below:

Age group    High risk    Not high risk
Age < 30     0.110        0.890
Age 60-69    0.195        0.805

There are four conditional proportions in the table above – the proportion of younger people who perceive themselves to be at higher risk, 0.110=25/(25+202); the proportion of younger people who do not perceive themselves to be at high risk, 0.890=202/(25+202); the proportion of older people who perceive themselves to be at high risk 0.195=30/(30+124); and the proportion of older people who do not perceive themselves to be at high risk, 0.805=124/(30+124).

The trend in the data is that younger people perceive themselves to be at lower risk of dying than older people, by a difference of 0.195-0.110=0.085 (in terms of proportions). But is this trend only present in this sample, or is it generalizable to a broader population (say the entire US population)? That is the goal of conducting a statistical hypothesis test in this setting.
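The arithmetic behind the two tables can be sketched in a few lines of Python. This is only an illustration; the dictionary layout and the helper name `conditional_proportions` are assumptions made here, not part of the original analysis.

```python
# Counts from the table above: (high risk, not high risk) per age group.
counts = {
    "Age < 30": (25, 202),
    "Age 60-69": (30, 124),
}


def conditional_proportions(high, not_high):
    """Return the row proportions (high risk, not high risk) for one age group."""
    total = high + not_high
    return high / total, not_high / total


# Proportions are conditional on age group, so each row sums to 1.
props = {group: conditional_proportions(*c) for group, c in counts.items()}
for group, (p_high, p_not) in props.items():
    print(f"{group}: high risk {p_high:.3f}, not high risk {p_not:.3f}")

# Difference in the proportion perceiving high risk (older minus younger).
diff = props["Age 60-69"][0] - props["Age < 30"][0]
print(f"difference: {diff:.3f}")
```

Rounded to three decimals, the printed values should agree with the conditional proportions and the difference of 0.085 quoted in the text.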

The population structure

Corresponding to our data above is the unobserved population structure, which we can denote as follows

Age group    High risk    Not high risk
Age < 30     \(p\)        \(1-p\)
Age 60-69    \(q\)        \(1-q\)

The symbols \(p\) and \(q\) in the table above are population parameters. These are quantities that we do not know, and wish to assess using the data. In this case, our null hypothesis can be expressed as the statement \(p = q\). We can estimate \(p\) using the sample proportion \(\hat{p} = 0.110\), and similarly estimate \(q\) using \(\hat{q} = 0.195\). However, these estimates do not immediately provide us with a way of expressing the evidence relating to the hypothesis that \(p=q\). This is provided by the test statistic.

A test statistic

As noted above, a test statistic is a reduction of the data to one number that captures all of the relevant information for assessing the hypotheses. A natural first choice for a test statistic here would be the difference in sample proportions between the two age groups, which is 0.195 - 0.110 = 0.085. There is a difference of 0.085 between the perceived risks of death in the younger and older age groups.

The difference in rates (0.085) does not on its own make a good test statistic, although it is a good start toward obtaining one. The reason for this is that the evidence underlying this difference in rates depends also on the absolute rates (0.110 and 0.195), and on the sample sizes (227 and 154). If we only know that the difference in rates is 0.085, this is not sufficient to evaluate the hypothesis in a statistical manner. A given difference in rates is much stronger evidence if it is obtained from a larger sample. If we have a difference of 0.085 with a very large sample, say one million people, then we should be almost certain that the true rates differ (i.e. the data are highly incompatible with the hypothesis that \(p=q\)). If we have the same difference in rates of 0.085, but with a small sample, say 50 people per age group, then there would be almost no evidence for a true difference in the rates (i.e. the data are compatible with the hypothesis \(p=q\)).

To address this issue, we need to consider the uncertainty in the estimated rate difference, which is 0.085. Recall that the estimated rate difference is obtained from the sample and therefore is almost certain to deviate somewhat from the true rate difference in the population (which is unknown). Recall from our study of standard errors that the standard error for an estimated proportion is \(\sqrt{p(1-p)/n}\), where \(p\) is the outcome probability (here the outcome is that a person perceives a high risk of dying), and \(n\) is the sample size.

In the present analysis, we are comparing two proportions, so we have two standard errors. The estimated standard error for the younger people is \(\sqrt{0.11\cdot 0.89/227} \approx 0.021\). The estimated standard error for the older people is \(\sqrt{0.195\cdot 0.805/154} \approx 0.032\). Note that both standard errors are estimated, rather than exact, because we are plugging in estimates of the rates (0.11 and 0.195). Also note that the standard error for the rate among older people is greater than that for younger people. This is because the sample size for older people is smaller, and also because the estimated rate for older people is closer to 1/2.

In our previous discussion of standard errors, we saw how standard errors for independent quantities \(A\) and \(B\) can be used to obtain the standard error for the difference \(A-B\). Applying that result here, we see that the standard error for the estimated difference in rates 0.195-0.11=0.085 is \(\sqrt{0.021^2 + 0.032^2} \approx 0.038\).

The final step in constructing our test statistic is to construct a Z-score from the estimated difference in rates. As with all Z-scores, we proceed by taking the estimated difference in rates, and then divide it by its standard error. Thus, we get a test statistic value of \(0.085 / 0.038 \approx 2.24\).
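The construction of this Z-score can be sketched as a small Python function. This is a minimal sketch under the assumptions stated in the text (independent groups, plug-in standard errors); the function name `two_proportion_wald_z` and its return layout are hypothetical choices made here.

```python
import math


def two_proportion_wald_z(x1, n1, x2, n2):
    """Wald Z-score for the difference between two independent sample proportions.

    x1, x2 are the counts of "high risk" responses; n1, n2 are the group sizes.
    Returns (se1, se2, se_diff, z), where z is (p2 - p1) / se_diff.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    # Plug-in standard error of each estimated proportion: sqrt(p(1-p)/n).
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    # Standard error of a difference of independent estimates.
    se_diff = math.sqrt(se1**2 + se2**2)
    z = (p2 - p1) / se_diff
    return se1, se2, se_diff, z


# Younger group: 25 of 227; older group: 30 of 154.
se_young, se_old, se_diff, z = two_proportion_wald_z(25, 227, 30, 154)
print(f"SE younger: {se_young:.3f}, SE older: {se_old:.3f}")
print(f"SE of difference: {se_diff:.3f}, Z: {z:.2f}")
```

Because the code carries full precision through every step, it gives a Z-score of about 2.22 rather than the 2.24 obtained above from rounded intermediate values. The two figures lead to the same conclusion.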

A test statistic value of 2.24 is not very close to zero, so there is some evidence against the null hypothesis. But the strength of this evidence remains unclear. Thus, we must consider how to calibrate this evidence in a way that makes it more interpretable.

Calibrating the evidence in the test statistic

By the central limit theorem (CLT), a Z-score approximately follows a normal distribution. When the null hypothesis holds, the Z-score approximately follows the standard normal distribution (recall that a standard normal distribution is a normal distribution with expected value equal to 0 and variance equal to 1). If the null hypothesis does not hold, then the test statistic continues to approximately follow a normal distribution, but it is not the standard normal distribution.

A test statistic of zero represents the least possible evidence against the null hypothesis. Here, we will obtain a test statistic of zero when the two proportions being compared are identical, i.e. exactly the same proportions of younger and older people perceive a substantial risk of dying from a disease. Even if the test statistic is exactly zero, this does not guarantee that the null hypothesis is true. However it is the least amount of evidence that the data can present against the null hypothesis.

In a hypothesis testing setting using normally-distributed Z-scores, as is the case here (due to the CLT), the standard normal distribution is the reference distribution for our test statistic. If the Z-score falls in the center of the reference distribution, there is no evidence against the null hypothesis. If the Z-score falls into either tail of the reference distribution, then there is evidence against the null hypothesis, and the further into the tails of the reference distribution the Z-score falls, the greater the evidence.

The most conventional way to quantify the evidence in our test statistic is through a probability called the p-value. The p-value has a somewhat complex definition that many people find difficult to grasp. It is the probability of observing as much or more evidence against the null hypothesis as we actually observe, calculated when the null hypothesis is assumed to be true. We will discuss some ways to think about this more intuitively below.

For our purposes, “evidence against the null hypothesis” is reflected in how far into the tails of the reference distribution the Z-score (test statistic) falls. We observed a test statistic of 2.24 in our COVID risk perception analysis. Recall that due to the “empirical rule”, 95% of the time, a draw from a standard normal distribution falls between -2 and 2. Thus, the p-value must be less than 0.05, since 2.24 falls outside this interval. The p-value can be calculated using a computer; in this case it happens to be approximately 0.025.
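One way to carry out this computer calculation uses only the Python standard library. The sketch below assumes a two-sided test against a standard normal reference distribution, and the function name `two_sided_p_value` is a label chosen here for illustration.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


p_value = two_sided_p_value(2.24)
print(f"p-value for Z = 2.24: {p_value:.3f}")

# The empirical rule: a test statistic of exactly 2 sits just below the 0.05 threshold.
print(f"p-value for Z = 2.00: {two_sided_p_value(2.0):.3f}")
```

The first value rounds to the 0.025 quoted above. The second is slightly below 0.05, which is why "larger than 2 in magnitude" serves as a rough stand-in for "p-value less than 0.05".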

As stated above, the p-value tells us how likely it would be for us to obtain as much evidence against the null hypothesis as we observed in our actual data analysis, if we were certain that the null hypothesis were true. When the null hypothesis holds, any evidence against the null hypothesis is spurious. Thus, we will want to see stronger evidence against the null from our actual analysis than we would see if we knew that the null hypothesis were true. A smaller p-value therefore reflects more evidence against the null hypothesis than a larger p-value.

By convention, p-values of 0.05 or smaller are considered to represent sufficiently strong evidence against the null hypothesis to make a finding “statistically significant”. This threshold of 0.05 was chosen arbitrarily 100 years ago, and there is no objective reason for it. In recent years, people have argued that either a lesser or a greater p-value threshold should be used. But largely due to convention, the practice of deeming p-values smaller than 0.05 to be statistically significant continues.

Summary of this example

Here is a restatement of the above discussion, using slightly different language. In our analysis of COVID risk perceptions, we found a difference in proportions of 0.085 between younger and older subjects, with younger people perceiving a lower risk of dying. This is a difference based on the sample of data that we observed, but what we really want to know is whether there is a difference in COVID risk perception in the population (say, all US adults).

Suppose that in fact there is no difference in risk perception between younger and older people. For instance, suppose that in the population, 15% of people believe that they have a substantial risk of dying should they become infected with the novel coronavirus, regardless of their age. Even though the rates are equal in this imaginary population (both being 15%), the rates in our sample would typically not be equal. Around 3% of the time (0.025 = 2.5%, to be more precise), if the rates are actually equal in the population, we would see a test statistic that is 2.24 or larger in magnitude. Since 3% represents a fairly rare event, we can conclude that our observed data are not very compatible with the null hypothesis. We can also say that there is statistically significant evidence against the null hypothesis, and that we have “rejected” the null hypothesis at the 3% level.

In this data analysis, as in any data analysis, we cannot confirm definitively that the alternative hypothesis is true. But based on our data and the analysis performed above, we can claim that there is substantial evidence against the null hypothesis, using standard criteria for what is considered to be “substantial evidence”.

Comparison of means

A very common setting where hypothesis testing is used arises when we wish to compare the means of a quantitative measurement obtained for two populations. Imagine, for example, that we have two ways of manufacturing a battery, and we wish to assess which approach yields batteries that are longer-lasting in actual use. To do this, suppose we obtain data that tells us the number of charge cycles that were completed in 200 batteries of type A, and in 300 batteries of type B. For the test developed below to be meaningful, the data must be independent and identically distributed samples.

The raw data for this study consists of 500 numbers, but it turns out that the most relevant information from the data is contained in the sample means and sample standard deviations computed within each battery type. Note that this is a huge reduction in complexity, since we started with 500 measurements and are able to summarize this down to just four numbers.

Suppose the summary statistics are as follows, where \(\bar{x}\), \(\hat{\sigma}_x\), and \(n\) denote the sample mean, sample standard deviation, and sample size, respectively.

Type    \(\bar{x}\)    \(\hat{\sigma}_x\)    \(n\)
A       420            70                    200
B       403            90                    300

The simplest measure comparing the two manufacturing approaches is the difference 420 - 403 = 17. That is, batteries of type A tend to have 17 more charge cycles compared to batteries of type B. This difference is present in our sample, but is it also true that the entire population of type A batteries has more charge cycles than the entire population of type B batteries? That is the goal of conducting a hypothesis test.

The next step in the present analysis is to divide the mean difference, which is 17, by its standard error. As we have seen, the standard error of the mean, or SEM, is \(\sigma/\sqrt{n}\), where \(\sigma\) is the standard deviation and \(n\) is the sample size. Since \(\sigma\) is almost never known, we plug in its estimate \(\hat{\sigma}\). For the type A batteries, the estimated SEM is thus \(70/\sqrt{200} \approx 4.95\), and for the type B batteries the estimated SEM is \(90/\sqrt{300} \approx 5.2\).

Since we are comparing two estimated means that are obtained from independent samples, we can combine the two standard errors to obtain the standard error of the difference in means, \(\sqrt{4.95^2 + 5.2^2} \approx 7.18\). We can now obtain our test statistic \(17/7.18 \approx 2.37\).

The test statistic can be calibrated against a standard normal reference distribution. The probability of observing a standard normal value that is greater in magnitude than 2.37 is 0.018 (this can be obtained from a computer). This is the p-value, and since it is smaller than the conventional threshold of 0.05, we can claim that there is a statistically significant difference between the average number of charge cycles for the two types of batteries, with the A batteries having more charge cycles on average.
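The whole battery calculation can be reproduced from the four summary statistics. The following is a minimal sketch, assuming independent samples and a standard normal reference distribution as in the text; the function name `two_sample_z_from_summary` is hypothetical.

```python
import math


def two_sample_z_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample Z-test computed from summary statistics only.

    Returns (sem_a, sem_b, se_diff, z, p), where p is the two-sided p-value.
    """
    # Estimated standard error of each sample mean: sd / sqrt(n).
    sem_a = sd_a / math.sqrt(n_a)
    sem_b = sd_b / math.sqrt(n_b)
    # Standard error of the difference of two independent means.
    se_diff = math.sqrt(sem_a**2 + sem_b**2)
    z = (mean_a - mean_b) / se_diff
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return sem_a, sem_b, se_diff, z, p


sem_a, sem_b, se_diff, z, p = two_sample_z_from_summary(420, 70, 200, 403, 90, 300)
print(f"SEM A: {sem_a:.2f}, SEM B: {sem_b:.2f}, SE of difference: {se_diff:.2f}")
print(f"Z: {z:.2f}, two-sided p-value: {p:.3f}")
```

The results should match the values worked out above to the precision quoted: a test statistic near 2.37 and a p-value near 0.018.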

The analysis illustrated here is called a two independent samples Z-test, or just a two sample Z-test. It may be the most commonly employed of all statistical tests. It is also common to see the very similar two sample t-test, which is different only in that it uses the Student t distribution rather than the normal (Gaussian) distribution to calculate the p-values. In fact, there are quite a few minor variations on this testing framework, including “one sided” and “two sided” tests, and tests based on different ways of pooling the variance. Due to the CLT, if the sample size is modestly large (which is the case here), the results of all of these tests will be almost identical. For simplicity, we only cover the Z-test in this course.

Assessment of a correlation

The tests for comparing proportions and means presented above are quite similar in many ways. To provide one more example of a hypothesis test that is somewhat different, we consider a test for a correlation coefficient.

Recall that the sample correlation coefficient \(\hat{r}\) is used to assess the relationship, or association, between two quantities X and Y that are measured on the same units. For example, we may ask whether two biomarkers, serum creatinine and D-dimer, are correlated with each other. These biomarkers are both commonly used in medical settings and are obtained using blood tests. D-dimer is used to assess whether a person has blood clots, and serum creatinine is used to measure kidney performance.

Suppose we are interested in whether there is a correlation in the population between D-dimer and serum creatinine. The population correlation coefficient between these two quantities can be denoted \(r\). Our null hypothesis is \(r=0\). Suppose that we observe a sample correlation coefficient of \(\hat{r}=0.15\), using an independent and identically distributed sample of pairs \((x, y)\), where \(x\) is a D-dimer measurement and \(y\) is a serum creatinine measurement. Are these data consistent with the null hypothesis?

As above, we proceed by constructing a test statistic by taking the estimated statistic and dividing it by its standard error. The approximate standard error for \(\hat{r}\) is \(1/\sqrt{n}\), where \(n\) is the sample size. The test statistic is therefore \(\sqrt{n}\cdot \hat{r} \approx 1.48\).

We now calibrate this test statistic by comparing it to a standard normal reference distribution. Recall from the empirical rule that 5% of the time, a standard normal value falls outside the interval (-2, 2). Therefore, if the test statistic is smaller than 2 in magnitude, as is the case here, its p-value is greater than 0.05. Thus, in this case we know that the p-value will exceed 0.05 without calculating it, and therefore there is no basis for claiming that D-dimer and serum creatinine levels are correlated in this population.
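The correlation test can be sketched the same way. The text does not state the sample size, so the value `n = 97` below is purely an assumption, chosen because it reproduces the quoted test statistic of about 1.48; the function name `correlation_z_test` is also hypothetical.

```python
import math


def correlation_z_test(r_hat, n):
    """Approximate Z-test of the null hypothesis r = 0.

    Uses the approximate standard error 1/sqrt(n) for the sample correlation,
    so the test statistic is sqrt(n) * r_hat. Returns (z, two-sided p-value).
    """
    z = math.sqrt(n) * r_hat
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# ASSUMPTION: n = 97 is not given in the text; it is an illustrative value
# that happens to reproduce the test statistic of about 1.48.
z, p = correlation_z_test(r_hat=0.15, n=97)
print(f"Z: {z:.2f}, two-sided p-value: {p:.2f}")
```

Under that assumed sample size, the p-value comes out well above 0.05, in line with the conclusion reached from the empirical rule.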

Sampling properties of p-values

A p-value is the most common way of calibrating evidence. Smaller p-values indicate stronger evidence against a null hypothesis. By convention, if the p-value is smaller than some threshold, usually 0.05, we reject the null hypothesis and declare a finding to be “statistically significant”. How can we understand more deeply what this means? One major concern should be obtaining a small p-value when the null hypothesis is true. If the null hypothesis is true, then it is incorrect to reject it. If we reject the null hypothesis, we are making a false claim. This can never be prevented with complete certainty, but we would like to have a very clear understanding of how likely it is to reject the null hypothesis when the null hypothesis is in fact true.

P-values have a special property: when the null hypothesis is true, the probability of observing a p-value smaller than 0.05 is 0.05 (5%). In fact, the probability of observing a p-value smaller than \(t\) is equal to \(t\), for any threshold \(t\) between 0 and 1. For example, the probability of observing a p-value smaller than 0.1, when the null hypothesis is true, is 10%.

This fact gives a more concrete understanding of how strong the evidence is for a particular p-value. If we always reject the null hypothesis when the p-value is 0.1 or smaller, then over the long run we will reject the null hypothesis 10% of the time when the null hypothesis is true. If we always reject the null hypothesis when the p-value is 0.05 or smaller, then over the long run we will reject the null hypothesis 5% of the time when the null hypothesis is true.
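This long-run behavior can be checked with a small simulation. The sketch below reuses the two-proportion setting from earlier, with both age groups given the same true rate of 15% so that the null hypothesis holds by construction. The number of simulated studies, the seed, and the function names are arbitrary choices made here. Because the normal reference distribution is only an approximation for binomial data, the observed rejection rates should be close to, but not exactly equal to, the thresholds.

```python
import math
import random


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_p_values(n_sims=5000, n1=227, n2=154, rate=0.15, seed=0):
    """Simulate p-values from the two-proportion Z-test when both rates are equal."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_sims):
        # Draw each group's "high risk" count as a sum of Bernoulli trials.
        x1 = sum(rng.random() < rate for _ in range(n1))
        x2 = sum(rng.random() < rate for _ in range(n2))
        p1, p2 = x1 / n1, x2 / n2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        if se == 0:
            continue  # degenerate sample with no variation; skip it
        p_values.append(two_sided_p_value((p2 - p1) / se))
    return p_values


p_values = simulate_null_p_values()
for threshold in (0.05, 0.10):
    frac = sum(p < threshold for p in p_values) / len(p_values)
    print(f"fraction of p-values below {threshold:.2f}: {frac:.3f}")
```

Each printed fraction estimates how often a true null hypothesis would be rejected at that threshold, which is the false-rejection rate described above.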

The approach to hypothesis testing discussed above largely follows the framework developed by RA Fisher around 1925. Note that although we mentioned the alternative hypothesis above, we never actually used it. A more elaborate approach to hypothesis testing was developed somewhat later by Egon Pearson and Jerzy Neyman. The “Neyman-Pearson” approach to hypothesis testing is even more formal than Fisher’s approach, and is most suited to highly planned research efforts in which the study is carefully designed, then executed. While ideally all research projects should be carried out this way, in reality we often conduct research using data that are already available, rather than using data that are specifically collected to address the research question.

Neyman-Pearson hypothesis testing involves specifying an alternative hypothesis that we anticipate encountering. Usually this alternative hypothesis represents a realistic guess about what we might find once the data are collected. In each of the three examples above, imagine that the data are not yet collected, and we are asked to specify an alternative hypothesis. We may arrive at the following:

In comparing risk perceptions for COVID, we may anticipate that older people will perceive a 30% risk of dying, and younger people will anticipate a 5% risk of dying.

In comparing the number of charge cycles for two types of batteries, we may anticipate that battery type A will have on average 500 charge cycles, and battery type B will have on average 400 charge cycles.

In assessing the correlation between D-dimer and serum creatinine levels, we may anticipate a correlation of 0.3.

Note that none of the numbers stated here are data-driven – they are specified before any data are collected, so they do not match the results from the data, which were collected only later. These alternative hypotheses are all essentially speculations, based perhaps on related data or theoretical considerations.

There are several benefits of specifying an explicit alternative hypothesis, as done here, even though it is not strictly necessary and can be avoided entirely by adopting Fisher’s approach to hypothesis testing. One benefit of specifying an alternative hypothesis is that we can use it to assess the power of our planned study, which can in turn inform the design of the study, in particular the sample size. The power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. That is, it is the probability of discovering something real. The power should be contrasted with the level of a hypothesis test, which is the probability of rejecting the null hypothesis when the null hypothesis is true. That is, the level is the probability of “discovering” something that is not real.

To calculate the power, recall that for many of the test statistics that we are considering here, the test statistic has the form \(\hat{\theta}/{\rm SE}(\hat{\theta})\), where \(\hat{\theta}\) is an estimate. For example, \(\hat{\theta}\) may be the sample correlation coefficient between D-dimer and serum creatinine levels, in which case the test statistic is \(\sqrt{n}\hat{r}\). As stated above, the power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. Suppose we decide to reject the null hypothesis when the test statistic is greater than 2, which is approximately equivalent to rejecting the null hypothesis when the p-value is less than 0.05. The following calculation tells us how to obtain the power in this setting:

\[ P(\sqrt{n}\hat{r} > 2) = P\big(\sqrt{n}(\hat{r} - r) > 2 - \sqrt{n}r\big) \]

Under the alternative hypothesis, \(\sqrt{n}(\hat{r} - r)\) approximately follows a standard normal distribution. Therefore, if \(r\) and \(n\) are given, we can easily use the computer to obtain the probability of observing a value greater than \(2 - \sqrt{n}r\). This gives us the power of the test. For example, if we anticipate \(r=0.3\) and plan to collect data for \(n=100\) observations, the power is 0.84. This is generally considered to be good power – if the true value of \(r\) is in fact 0.3, we would reject the null hypothesis 84% of the time.
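The power calculation just described can be sketched as follows. It follows the rule used in the text, rejecting when the test statistic exceeds 2, so only the upper tail is counted; the function names are hypothetical.

```python
import math


def normal_upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def correlation_test_power(r, n, threshold=2.0):
    """Approximate power of the correlation Z-test against the alternative r.

    Rejects when sqrt(n) * r_hat > threshold and uses the approximation that
    sqrt(n) * (r_hat - r) is standard normal under the alternative.
    """
    return normal_upper_tail(threshold - math.sqrt(n) * r)


power = correlation_test_power(r=0.3, n=100)
print(f"power for r = 0.3, n = 100: {power:.2f}")

# Power grows with the planned sample size for a fixed anticipated correlation.
for n in (25, 50, 100, 200):
    print(n, round(correlation_test_power(0.3, n), 2))
```

The first line of output should reproduce the power of 0.84 given above. The loop illustrates how the same function could inform the choice of sample size when planning a study.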

A study usually has poor power because its sample size is too small. Poorly powered studies can be very misleading, but since large samples are expensive to collect, a lot of research is conducted using sample sizes that yield moderate or even low power. If a study has low power, it is unlikely to reject the null hypothesis even when the alternative hypothesis is true, but it remains possible to reject the null hypothesis when the null hypothesis is true (usually this probability is 5%). Therefore, when a poorly powered study does reject the null hypothesis, there is a real chance that the rejection is incorrect.

