null hypothesis of factor analysis


Confirmatory Factor Analysis (CFA) | Meaning & Interpretation

In this tutorial, I’ll introduce Confirmatory Factor Analysis (CFA), a multivariate statistical technique researchers use to confirm hypotheses or theories about the underlying structure of certain datasets.


Let’s dive into it!

Introduction

Confirmatory Factor Analysis (CFA) is a statistical technique used primarily in the social sciences. CFA allows researchers to validate their proposed measurement models by testing how well the observed variables* (e.g., questionnaire items) represent the underlying factors (latent variables) they are theorized to measure. Unlike exploratory factor analysis (EFA), which seeks to identify potential underlying factor structures, CFA tests whether a particular structure fits the data.

Instead of “discovering” or exploring potential relationships between variables, as in exploratory factor analysis, CFA is designed to test a predefined model based on theoretical expectations. Accordingly, the starting point of a CFA should be a theory or empirical findings, which could come from existing literature, previous EFA results, or well-established theoretical models.

Like other statistical techniques, CFA operates under certain assumptions. It’s essential to verify these assumptions, as violating them can lead to biased, misleading, or incorrect results. Let’s elaborate on them!

Assumptions

Linearity: The relationships between observed variables and the underlying unobserved latent variables are assumed to be linear. In a typical CFA, each observed variable is defined as a linear function of the latent variables. An observed variable x is formalized as x = λξ + δ, where λ is the factor loading, ξ is the underlying factor, and δ is the measurement error of x.

Multivariate Normality: The observed variables should ideally follow a multivariate normal distribution, meaning that all combinations of variables are jointly normally distributed. Violations of this assumption can substantially affect some fit indices and standard errors; in that case, robust estimation techniques can be used to reduce the effect of the violation.

Sample Size Adequacy: An adequate sample size is crucial for stable factor solutions. A common recommendation is a ratio of at least 5 participants per variable, but larger samples are generally better.

Overidentification: The model should be over-identified, meaning there are more observed variances and covariances than freely estimated parameters (estimated variances, covariances, and factor loadings). A common related convention is to fix one factor loading per factor to 1, which also defines the scale of the latent variable. For further details, see page 80 in Confirmatory Factor Analysis for Applied Research (Brown, 2006).

If your data meets these assumptions and you intend to test your measurement model, then you can perform CFA. Let’s see the steps of CFA!

Steps to Perform CFA

In this section, the steps of performing CFA are theoretically explained. For the practical implementation in R, see Confirmatory Factor Analysis in R to be published soon.

Model Specification

Based on theory or previous analyses, the practitioner should decide which observed variables are connected to which latent variables, in other words, which observed variables will load onto which latent variables.

Imagine that, based on prior empirical research, you hypothesize that an individual’s general well-being can be measured by two latent factors: physical health and emotional health.

To measure these constructs, you prepared a questionnaire. In your measurement model, the first three questions measure physical health with the following:

  • Q1 : I feel physically active and energetic.
  • Q2 : I rarely get sick.
  • Q3 : I am satisfied with my overall physical health.

The next three questions measure emotional health with the following:

  • Q4 : I feel emotionally stable.
  • Q5 : I generally feel happy and contented.
  • Q6 : I rarely feel overwhelmed or anxious.

When you have a designed measurement structure, the next step will be gathering the related data.
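For illustration, this hypothesized structure can be written down directly as a measurement model, for example in the syntax of the R package lavaan (a minimal sketch; Q1–Q6 are the questionnaire items listed above):

# Sketch of the hypothesized two-factor measurement model in lavaan syntax:
# "=~" reads as "is measured by"
wellbeing_model <- '
  physical_health  =~ Q1 + Q2 + Q3
  emotional_health =~ Q4 + Q5 + Q6
'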

Data Collection & Exploration

The data measuring the observed variables should be collected, cleaned and preprocessed before the analysis.

Parameter Estimation

In CFA, the process of parameter estimation revolves around determining the optimal values for factor loadings, factor covariances, and measurement error variances. The ultimate goal is that the covariance matrix implied by these estimated parameters closely mirrors the observed covariance matrix of the data. Let’s take a look at the estimation output** for the well-being example.

Traditionally, the Maximum Likelihood (ML) estimation method is employed when there is no evidence that assumptions such as normality and sample size adequacy are violated. Otherwise, robust techniques like WLSMV or MLR are used.
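As a sketch of how this looks in practice, the model above could be fitted with lavaan’s cfa() function; the data frame name wellbeing_data is a hypothetical placeholder for your collected item responses, and the estimator argument can be switched if the normality assumption is in doubt:

library(lavaan)

# Default: maximum likelihood estimation (first loading of each factor is fixed to 1)
fit <- cfa(wellbeing_model, data = wellbeing_data, estimator = "ML")

# Robust alternative if multivariate normality is questionable
fit_robust <- cfa(wellbeing_model, data = wellbeing_data, estimator = "MLR")

# Estimated loadings, variances, and covariances
summary(fit, standardized = TRUE)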

The chosen estimation method uses an optimization algorithm, which iteratively adjusts parameter values to find the best-fitting parameters. Once the model converges, the software of choice (e.g., SPSS, R, Mplus) provides estimates for all parameters (as shown above), along with some fit statistics. Let’s take a closer look at these fit statistics next!

Fit Evaluation

It’s crucial to evaluate the fit of the model to ensure it adequately represents the data. Several fit indices can be used, each with its strengths and limitations. Here are some of the common fit indices:

Standardized Root Mean Square Residual (SRMR): This is the average standardized difference between the observed correlation matrix and the correlation matrix implied by the model. Values less than 0.08 are generally considered indicative of a good fit.

Root Mean Square Error of Approximation (RMSEA): This chi-square-based index evaluates how far the model is from a perfect fit while penalizing model complexity. Values less than 0.05 indicate a close fit, values between 0.05 and 0.08 indicate a reasonable fit, and values greater than 0.10 suggest a poor model fit.

Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI) : Both these indices compare the fit of the specified model to the null model . Values close to 0.95 or above are generally considered indicative of a good fit. The primary distinction between CFI and TLI is that TLI penalizes model complexity, whereas CFI does not.

It’s advisable to present and evaluate various fit indicators because each one offers its own advantages and drawbacks. For formulae, see this article .
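In lavaan, for instance, these indices can be requested together from the fitted object of the estimation sketch above:

# Chi-square plus the fit indices discussed above
fitMeasures(fit, c("chisq", "df", "pvalue", "srmr", "rmsea", "cfi", "tli"))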

Now, let’s take a look at the next step!

Model Respecification

If the fit indices indicate that the initial model does not fit the data well, then the hypothesized factor structure can be modified to achieve a better fit as long as the changes make theoretical sense or are based on substantive reasons.

The modification indices (MIs) and expected parameter change (EPC) are the statistics to consider in this step. They provide suggestions about which parameters to unconstrain (conventionally, the cross-loadings and error covariances are constrained to be zero in CFA).

MIs show how much the chi-square statistic would decrease if a fixed parameter were freely estimated. EPC shows the expected change in the magnitude of the parameter if it were unconstrained.

After making the modifications, the fit should be reassessed by a chi-square difference test or other comparing indices like AIC and BIC . Once the final structure is determined, the result can be interpreted and communicated visually.
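A sketch of this step in lavaan, continuing the well-being example (the added error covariance between Q5 and Q6 is purely hypothetical and should only be kept if it is theoretically defensible):

# Modification indices (MI) and expected parameter changes (EPC) for fixed parameters
modindices(fit)

# Hypothetical respecification: allow the errors of Q5 and Q6 to covary
model2 <- paste(wellbeing_model, "Q5 ~~ Q6", sep = "\n")
fit2   <- cfa(model2, data = wellbeing_data)

# Chi-square difference test plus AIC/BIC for the nested models
anova(fit, fit2)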

Let’s see what kind of visual can be used!

Visualisation

The most common way to visualize CFA models is through path or factor diagrams. These diagrams consist of observed variables, latent factors, arrows indicating relationships, and error terms. You can see how the estimated covariance components are visualized below.

[Factor diagram of the CFA model for the well-being example]

In the diagram above, the covariances are represented by the double-headed blue arrows indicating the direction of the relation. For instance, the arrowheads pointing to the same component represent the component’s covariance with itself, which is also known as its variance.

The orange arrows refer to the factor loadings. They point to the observed variables (Q1, Q2, etc.) since, conceptually, the latent variables cause/influence the observed variables. The relation for Q2 can be described as Q2 = Physical Health * 0.75 + e2, where e2 represents the associated measurement error with variance 0.09.

Please be aware that the blue arrows pointing to the observed variables (Q1, Q2, etc.) indicate the measurement error variance, not the observed variable variance. However, the measurement error variance is a part of the estimated observed variable variance, which for Q2 can be written as Var(Q2) = 0.75² * Var(Physical Health) + Var(e2) = 0.75² * 0.92 + 0.09.
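Such a path diagram can be drawn directly from a fitted lavaan object, for example with the semPlot package (a sketch; layout and labeling options are a matter of taste):

library(semPlot)

# Factor diagram with standardized estimates printed on the arrows
semPaths(fit, what = "std", whatLabels = "std", layout = "tree", edge.label.cex = 0.8)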

*Observed variables can also be referred to as manifest variables, indicators, or endogenous variables, whereas latent variables can be referred to as factors, constructs, unobserved/underlying variables, or exogenous variables in the context of CFA.

**The example data is randomly generated; hence does not reflect any real analysis output.

Video, Further Resources & Summary

Do you need more explanations on how to perform CFA? Then, you might check out the following video on the Statistics Globe YouTube channel.

In the video tutorial, we explain how to conduct CFA to validate the proposed measurement model.

The YouTube video will be added soon.

Furthermore, you could have a look at some of the other tutorials on Statistics Globe:

  • Introduction to Factor Analysis
  • Exploratory Factor Analysis (EFA)

This article has demonstrated the steps of performing CFA. If you have further questions, you may leave a comment below.

Rana Cansu Kebabci Statistician & Data Scientist

This page was created in collaboration with Cansu Kebabci. You might have a look at Cansu’s author page for more information about her academic background and the other articles she has written for Statistics Globe.

Institute for Digital Research and Education

Factor Analysis | SPSS Annotated Output

This page shows an example of a factor analysis with footnotes explaining the output.  The data used in this example were collected by Professor James Sidanius, who has generously shared them with us.  You can download the data set M255.sav .

Overview:  The “what” and “why” of factor analysis

Factor analysis is a method of data reduction. It does this by seeking underlying unobservable (latent) variables that are reflected in the observed variables (manifest variables). There are many different methods that can be used to conduct a factor analysis (such as principal axis factor, maximum likelihood, generalized least squares, unweighted least squares). There are also many different types of rotations that can be done after the initial extraction of factors, including orthogonal rotations, such as varimax and equimax, which impose the restriction that the factors cannot be correlated, and oblique rotations, such as promax, which allow the factors to be correlated with one another. You also need to determine the number of factors that you want to extract. Given the number of factor analytic techniques and options, it is not surprising that different analysts could reach very different results analyzing the same data set. However, all analysts are looking for simple structure. Simple structure is a pattern of results such that each variable loads highly onto one and only one factor.

Factor analysis is a technique that requires a large sample size. Factor analysis is based on the correlation matrix of the variables involved, and correlations usually need a large sample size before they stabilize. Tabachnick and Fidell (2001, page 588) cite Comrey and Lee’s (1992) advice regarding sample size: 50 cases is very poor, 100 is poor, 200 is fair, 300 is good, 500 is very good, and 1000 or more is excellent. As a rule of thumb, a bare minimum of 10 observations per variable is necessary to avoid computational difficulties.

For the example below, we are going to do a rather “plain vanilla” factor analysis.  We will use iterated principal axis factor with three factors as our method of extraction, a varimax rotation, and for comparison, we will also show the promax oblique solution.  The determination of the number of factors to extract should be guided by theory, but also informed by running the analysis extracting different numbers of factors and seeing which number of factors yields the most interpretable results.

In this example we have included many options, including the original and reproduced correlation matrix, the scree plot and the plot of the rotated factors.  While you may not wish to use all of these options, we have included them here to aid in the explanation of the analysis.  We have also created a page of annotated output for a principal components analysis that parallels this analysis.  For general information regarding the similarities and differences between principal components analysis and factor analysis, see Tabachnick and Fidell (2001), for example.

Orthogonal (Varimax) Rotation

Let’s start with the orthogonal varimax rotation. First open the file M255.sav and then copy, paste, and run the following syntax into the SPSS Syntax Editor.
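As a rough analogue of this analysis (a sketch in R with the psych package, not the SPSS syntax itself), assuming the twelve items are stored in a data frame called items:

# Iterated principal axis factoring with three factors and a varimax rotation,
# plus a promax (oblique) solution for comparison
library(psych)

fa_varimax <- fa(items, nfactors = 3, fm = "pa", rotate = "varimax")
fa_promax  <- fa(items, nfactors = 3, fm = "pa", rotate = "promax")

# Suppress loadings of .30 or less, as the blank(.30) option does in SPSS
print(fa_varimax$loadings, cutoff = 0.30)
print(fa_promax$loadings,  cutoff = 0.30)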

[SPSS output: Descriptive Statistics table]

The table above is output because we used the univariate option on the /print subcommand.  Please note that the only way to see how many cases were actually used in the factor analysis is to include the univariate option on the /print subcommand.  The number of cases used in the analysis will be less than the total number of cases in the data file if there are missing values on any of the variables used in the factor analysis, because, by default, SPSS does a listwise deletion of incomplete cases.  If the factor analysis is being conducted on the correlations (as opposed to the covariances), it is not much of a concern that the variables have very different means and/or standard deviations (which is often the case when variables are measured on different scales).

a. Mean – These are the means of the variables used in the factor analysis.

b. Std. Deviation – These are the standard deviations of the variables used in the factor analysis.

c. Analysis N – This is the number of cases used in the factor analysis.

The table above is included in the output because we used the det option on the /print subcommand.  All we want to see in this table is that the determinant is not 0.  If the determinant is 0, then there will be computational problems with the factor analysis, and SPSS may issue a warning message or be unable to complete the factor analysis.

a.  Kaiser-Meyer-Olkin Measure of Sampling Adequacy – This measure varies between 0 and 1, and values closer to 1 are better.  A value of .6 is a suggested minimum.

b.  Bartlett’s Test of Sphericity – This tests the null hypothesis that the correlation matrix is an identity matrix.  An identity matrix is a matrix in which all of the diagonal elements are 1 and all off-diagonal elements are 0.  You want to reject this null hypothesis.

Taken together, these tests provide a minimum standard which should be passed before a factor analysis (or a principal components analysis) should be conducted.
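Analogous checks are available outside SPSS as well; for example, a sketch with R’s psych package, again assuming the items data frame from above:

# Kaiser-Meyer-Olkin measure of sampling adequacy (values closer to 1 are better)
KMO(items)

# Bartlett's test of sphericity on the item correlation matrix
cortest.bartlett(cor(items), n = nrow(items))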

a.  Communalities – This is the proportion of each variable’s variance that can be explained by the factors (e.g., the underlying latent continua).  It is also denoted h² and can be defined as the sum of squared factor loadings for the variable.

b.  Initial – With principal axis factoring, the initial values on the diagonal of the correlation matrix are determined by the squared multiple correlation of the variable with the other variables.  For example, if you regressed item 13 on items 14 through 24, the squared multiple correlation coefficient would be .564.

c.  Extraction – The values in this column indicate the proportion of each variable’s variance that can be explained by the retained factors. Variables with high values are well represented in the common factor space, while variables with low values are not well represented.  (In this example, we don’t have any particularly low values.)  They are the reproduced variances from the factors that you have extracted.  You can find these values on the diagonal of the reproduced correlation matrix.

a.  Factor – The initial number of factors is the same as the number of variables used in the factor analysis.  However, not all 12 factors will be retained.  In this example, only the first three factors will be retained (as we requested).

b.  Initial Eigenvalues – Eigenvalues are the variances of the factors.  Because we conducted our factor analysis on the correlation matrix, the variables are standardized, which means that each variable has a variance of 1, and the total variance is equal to the number of variables used in the analysis, in this case, 12.

c.  Total – This column contains the eigenvalues.  The first factor will always account for the most variance (and hence have the highest eigenvalue), and the next factor will account for as much of the left over variance as it can, and so on.  Hence, each successive factor will account for less and less variance.

d.  % of Variance – This column contains the percent of total variance accounted for by each factor.

e.  Cumulative % – This column contains the cumulative percentage of variance accounted for by the current and all preceding factors. For example, the third row shows a value of 68.313.  This means that the first three factors together account for 68.313% of the total variance.

f.  Extraction Sums of Squared Loadings – The number of rows in this panel of the table correspond to the number of factors retained.  In this example, we requested that three factors be retained, so there are three rows, one for each retained factor.  The values in this panel of the table are calculated in the same way as the values in the left panel, except that here the values are based on the common variance.  The values in this panel of the table will always be lower than the values in the left panel of the table, because they are based on the common variance, which is always smaller than the total variance.

g.  Rotation Sums of Squared Loadings – The values in this panel of the table represent the distribution of the variance after the varimax rotation. Varimax rotation tries to maximize the variance of each of the factors, so the total amount of variance accounted for is redistributed over the three extracted factors.

The scree plot graphs the eigenvalue against the factor number.  You can see these values in the first two columns of the table immediately above.  From the third factor on, you can see that the line is almost flat, meaning that each successive factor is accounting for smaller and smaller amounts of the total variance.

b.  Factor Matrix – This table contains the unrotated factor loadings, which are the correlations between the variable and the factor.  Because these are correlations, possible values range from -1 to +1.  On the /format subcommand, we used the option blank(.30) , which tells SPSS not to print any of the correlations that are .3 or less.  This makes the output easier to read by removing the clutter of low correlations that are probably not meaningful anyway.

c.  Factor – The columns under this heading are the unrotated factors that have been extracted.  As you can see by the footnote provided by SPSS (a.), three factors were extracted (the three factors that we requested).

c.  Reproduced Correlations – This table contains two tables, the reproduced correlations in the top part of the table, and the residuals in the bottom part of the table.

d.  Reproduced Correlation – The reproduced correlation matrix is the correlation matrix based on the extracted factors.  You want the values in the reproduced matrix to be as close to the values in the original correlation matrix as possible.  In other words, you want the residual matrix, which contains the differences between the original and the reproduced matrices, to be close to zero.  If the reproduced matrix is very similar to the original correlation matrix, then you know that the factors that were extracted accounted for a great deal of the variance in the original correlation matrix, and these few factors do a good job of representing the original data.  The numbers on the diagonal of the reproduced correlation matrix are presented in the Communalities table in the column labeled Extraction.

e.  Residual – As noted in the first footnote provided by SPSS (a.), the values in this part of the table represent the differences between original correlations (shown in the correlation table at the beginning of the output) and the reproduced correlations, which are shown in the top part of this table. For example, the original correlation between item13 and item14 is .661, and the reproduced correlation between these two variables is .646.  The residual is .016 = .661 – .646 (with some rounding error).

b.  Rotated Factor Matrix – This table contains the rotated factor loadings, which represent both how the variables are weighted for each factor but also the correlation between the variables and the factor.  Because these are correlations, possible values range from -1 to +1.  On the /format subcommand, we used the option blank(.30) , which tells SPSS not to print any of the correlations that are .3 or less.  This makes the output easier to read by removing the clutter of low correlations that are probably not meaningful anyway.

For orthogonal rotations, such as varimax, the factor pattern and factor structure matrices are the same.

c.  Factor – The columns under this heading are the rotated factors that have been extracted.  As you can see by the footnote provided by SPSS (a.), three factors were extracted (the three factors that we requested).  These are the factors that analysts are most interested in and try to name.  For example, the first factor might be called “instructor competence” because items like “instructor well prepare” and “instructor competence” load highly on it. The second factor might be called “relating to students” because items like “instructor is sensitive to students” and “instructor allows me to ask questions” load highly on it.  The third factor has to do with comparisons to other instructors and courses.

Oblique (Promax) Rotation

The table below is from another run of the factor analysis program shown above, except with a promax rotation.  We have included it here to show how different the rotated solutions can be, and to better illustrate what is meant by simple structure.

As you can see with an oblique rotation, such as a promax rotation, the factors are permitted to be correlated with one another.  With an orthogonal rotation, such as the varimax shown above, the factors are not permitted to be correlated (they are orthogonal to one another).  Oblique rotations, such as promax, produce both factor pattern and factor structure matrices.  For orthogonal rotations, such as varimax and equimax, the factor structure and the factor pattern matrices are the same.  The factor structure matrix represents the correlations between the variables and the factors.  The factor pattern matrix contains the coefficients for the linear combination of the variables.

The table below indicates that the rotation done is an oblique rotation. If an orthogonal rotation had been done (like the varimax rotation shown above), this table would not appear in the output because the correlations between the factors are set to 0.  Here, you can see that the factors are highly correlated.

The rest of the output shown below is part of the output generated by the SPSS syntax shown at the beginning of this page.

a.  Factor Transformation Matrix – This is the matrix by which you multiply the unrotated factor matrix to get the rotated factor matrix.

The plot above shows the items (variables) in the rotated factor space.  While this picture may not be particularly helpful, when you get this graph in the SPSS output, you can interactively rotate it.  This may help you to see how the items (variables) are organized in the common factor space.

a.  Factor Score Coefficient Matrix – This is the factor weight matrix and is used to compute the factor scores.

a.  Factor Score Covariance Matrix – Because we used an orthogonal rotation, this should be a diagonal matrix, meaning that the off-diagonal elements should be zero.  In actuality the factors are uncorrelated; however, because factor scores are estimated there may be slight correlations among the factor scores.



Engineering LibreTexts

13.12: Factor analysis and ANOVA


  • Alexander Voice, Andrew Wilkins, Rohan Parambi, & Ibrahim Oraiqat
  • University of Michigan


First invented in the early 1900s by psychologist Charles Spearman, factor analysis is the process by which a complicated system of many variables is simplified by completely defining it with a smaller number of "factors." If these factors can be studied and determined, they can be used to predict the value of the variables in a system. A simple example would be using a person's intelligence (a factor) to predict their verbal, quantitative, writing, and analytical scores on the GRE (variables).

Analysis of variance (ANOVA) is the method used to compare continuous measurements to determine if the measurements are sampled from the same or different distributions. It is an analytical tool used to determine the significance of factors on measurements by looking at the relationship between a quantitative "response variable" and a proposed explanatory "factor." This method is similar to the process of comparing the statistical difference between two samples, in that it invokes the concept of hypothesis testing. Instead of comparing two samples, however, a variable is correlated with one or more explanatory factors, typically using the F-statistic. From this F-statistic, the P-value can be calculated to see if the difference is significant. For example, if the P-value is low (P-value < 0.05 or P-value < 0.01, depending on the desired level of significance), then there is a low probability that the two groups are the same. The method is highly versatile in that it can be used to analyze complicated systems, with numerous variables and factors. In this article, we will discuss the computation involved in Single-Factor , Two-Factor: Without Replicates , and Two-Factor: With Replicates ANOVA. Below is a brief overview of the different types of ANOVA and some examples of when they can be applied.

Overview and Examples of ANOVA Types

ANOVA Types

Single-Factor ANOVA (One-Way):

One-way ANOVA is used to test for variance among two or more independent groups of data, in the instance that the variance depends on a single factor. It is most often employed when there are at least three groups of data, otherwise a t-test would be a sufficient statistical analysis.

Two-Factor ANOVA (Two-Way):

Two-way ANOVA is used in the instance that the variance depends on two factors. There are two cases in which two-way ANOVA can be employed:

  • Data without replicates : used when collecting a single data point for a specified condition
  • Data with replicates : used when collecting multiple data points for a specified condition (the number of replicates must be specified and must be the same among data groups)

When to Use Each ANOVA Type

  • Example: There are three identical reactors (R1, R2, R3) that generate the same product.
  • One-way ANOVA: You want to analyze the variance of the product yield as a function of the reactor number.
  • Two-way ANOVA without replicates: You want to analyze the variance of the product yield as a function of the reactor number and the catalyst concentration.
  • Two-way ANOVA with replicates: For each catalyst concentration, triplicate data were taken. You want to analyze the variance of the product yield as a function of the reactor number and the catalyst concentration.

ANOVA is a Linear Model

Though ANOVA will tell you whether factors are significantly different, it does so according to a linear model. Because ANOVA always assumes a linear model, it is important to consider strong nonlinear interactions that ANOVA may not capture when determining significance. ANOVA works by modeling each observation as an overall mean plus a mean effect plus noise. If there are nonlinear relationships between the variables (for example, if the difference between column 1 and column 2 on the same row is that column2 = column1^2), then there is a chance that ANOVA will not catch it.

Before further explanation, please review the terms below, which are used throughout this Wiki.

[Figure: key terms used throughout this section]

Comparison of Sample Means Using the F-Test

The F-test is based on the F-statistic, the ratio of two sample variances. The F-statistic and the corresponding F-test are used in single-factor ANOVA for purposes of hypothesis testing.

Null hypothesis ( H o ): all sample means arising from different factors are equal

Alternative hypothesis ( H a ): the sample means are not all equal

Several assumptions are necessary to use the F-test:

  • The samples are independent and random
  • The distribution of the response variable is a normal curve within each population
  • The different populations may have different means
  • All populations have the same standard deviation

Introduction to the F-Statistic

The F-statistic is the ratio of two variance estimates: the variance between groups divided by the variance within groups. The larger the F-statistic, the more likely it is that the difference between samples is due to the factor being tested, and not just the natural variation within a group. A standardized table can be used to find F critical for any system. F critical will depend on alpha, which is a measure of the confidence level. Typically, a value of alpha = 0.05 is used, which corresponds to 95% confidence. If F observed > F critical , we conclude with 95% confidence that the null hypothesis is false. For an explanation of how to read an F-Table, see Interpreting the F-statistic (below). In a similar manner, F tables can also be used to determine the p-value for a given data set. The p-value for a given data set is the probability that you could obtain this data set if the null hypothesis were true: that is, if the results were strictly due to chance. When H o is true, the F-statistic has an F distribution.
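Instead of a printed table, F critical and the p-value can also be looked up with the F-distribution functions in R (a sketch; the degrees of freedom and the observed F-statistic are placeholders):

# Critical F for alpha = 0.05 with 2 and 10 degrees of freedom
qf(0.95, df1 = 2, df2 = 10)

# p-value for a hypothetical observed F-statistic with the same degrees of freedom
f_obs <- 5.2
pf(f_obs, df1 = 2, df2 = 10, lower.tail = FALSE)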

F-Distributions

The F-distribution is important to ANOVA, because it is used to find the p-value for an ANOVA F-test. The F-distribution arises from the ratio of two Chi squared distributions. Thus, this family has a numerator and a denominator degrees of freedom. (For information on the Chi squared test, click here.) Every function of this family has a skewed distribution and a minimum value of zero.


Single-Factor Analysis of Variance

In the case of single-factor analysis, also called single classification or one-way, a factor is varied while observing the result on the set of dependent variables. These dependent variables belong to a specific related set of values and hence, the results are expected to be related.

This section will describe some of the computational details for the F-statistic in one-way ANOVA. Although these equations provide insight into the concept of analysis of variance and how the F-test is constructed, it is not necessary to learn the formulas or to do this analysis by hand. In practice, computers are always used to do one-way ANOVA.

Setting up an Analysis of Variance Table

The fundamental concept in one-way analysis of variance is that the variation among data points in all samples can be divided into two categories: variation between group means and variation between data points in a group. The theory for analysis of variance stems from a simple equation, stating that the total variance is equal to the sum of the variance between groups and the variation within groups –

Total variation = variation between groups + variation within groups

An analysis of variance table is used to organize data points, indicating the value of a response variable, into groups according to the factor used in each case. For example, Table 1 is an ANOVA table for comparing the amount of weight lost over a three month period by dieters on various weight-loss programs.

Table 1 - Amount of weight lost by dieters on various programs over a 3 month period
Program 1 Program 2 Program 3
7 9 15
9 11 12
5 7 18
7    

A reasonable question is, can the type of program (a factor) be used to predict the amount of weight a dieter would lose on that program (a response variable)? Or, in other words, is any one program superior to the others?

Measuring Variation Between Groups

The variation between group means is measured with a weighted sum of squared differences between the sample means and the overall mean of all the data. Each squared difference is multiplied by the appropriate group sample size, ni, in this sum. This quantity is called sum of squares between groups or SS Groups .

\[\text { SSGroups }=n_{1}\left(\bar{x}_{1}-\bar{x}\right)^{2}+n_{2}\left(\bar{x}_{2}-\bar{x}\right)^{2}+\ldots+n_{k}\left(\bar{x}_{k}-\bar{x}\right)^{2}=\sum_{\text {all groups }} n_{j}\left(\bar{x}_{j}-\bar{x}\right)^{2}\nonumber \]

The numerator of the F-statistic for comparing means is called the mean square between groups or MS Groups , and it is calculated as -

\[\text { MSGroups }=\frac{\text { SSGroups }}{k-1}\nonumber \]

Measuring Variation Within Groups

To measure the variation among data points within the groups, find the sum of squared deviations between data values and the sample mean in each group, and then add these quantities. This is called the sum of squared errors , SSE , or sum of squares within groups .

\[S S E=\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}+\ldots+\left(n_{k}-1\right) s_{k}^{2}=\sum_{\text {allgroups }}\left(n_{j}-1\right) s_{j}^{2}\nonumber \]

\[s_{j}^{2}=\sum_{\text {group } j} \frac{\left(x_{i j}-\bar{x}_{j}\right)^{2}}{n_{j}-1}\nonumber \]

where \(s_{j}^{2}\) is the variance within each group. The denominator of the F-statistic is called the mean square error , MSE , or mean squares within groups . It is calculated as

\[M S E=\frac{S S E}{N-k}=\frac{\left(n_{1}-1\right) s_{1}^{2}+\left(n_{2}-1\right) s_{2}^{2}+\ldots+\left(n_{k}-1\right) s_{k}^{2}}{n_{1}+n_{2}+\ldots+n_{k}-k}\nonumber \] MSE is simply a weighted average of the sample variances for the k groups. Therefore, if all n i are equal, MSE is simply the average of the k sample variances. The square root of MSE ( s p ), called the pooled standard deviation, estimates the population standard deviation of the response variable (keep in mind that all of the samples being compared are assumed to have the same standard deviation σ).

Measuring the Total Variation

The total variation in all samples combined is measured by computing the sum of squared deviations between data values and the mean of all data points. This quantity is referred to as the total sum of squares or SS Total. The total sum of squares may also be referred to as SSTO. A formula for the sum of squared differences from the overall mean is

\[\text { SSTotal }=\sum_{\text {all values }}\left(x_{i j}-\bar{x}\right)^{2}\nonumber \]

where \(\bar{x}\) is the mean of all data points.

Overall, the relationship between the total variation, the variation between groups, and the variation within a group is illustrated by Figure 2.

A general table for performing the one-way ANOVA calculations required to compute the F-statistic is given below

Table 2 - One-Way ANOVA Table
Source | Degrees of Freedom | Sum of Squares | Mean Sum of Squares | F-Statistic
Between groups | k-1 | SSGroups | MSGroups = SSGroups/(k-1) | F = MSGroups/MSE
Within groups (error) | N-k | SSE | MSE = SSE/(N-k) |
Total | N-1 | SSTotal | |

Interpreting the F-statistic

Once the F-statistic has been found, it can be compared with a critical F value from a table, such as this one: F Table. This F table is calculated for a value of alpha = 0.05, indicating a 95% confidence level. This means that if F observed is larger than F critical from the table, then we can reject the null hypothesis and say with 95% confidence that the variance between groups is not due to random chance, but rather due to the influence of a tested factor. Tables are also available for other values of alpha and can be used to find a more exact probability that the difference between groups is (or is not) caused by random chance.

Finding the Critical F value

In this F Table, the first row of the F table is the number of degrees of freedom between groups (number of groups - 1), and the first column is the number of degrees of freedom within groups (total number of samples - number of groups).

[Figure: how to read an F-table]

For the diet example in Table 1, the degrees of freedom between groups is (3-1) = 2 and the degrees of freedom within groups is (13-3) = 10. Thus, the critical F value is 4.10.
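As a cross-check, a minimal R sketch of this one-way analysis using the values shown in Table 1 (the degrees of freedom reported by aov() follow from however many observations are actually entered):

# One-way ANOVA on the weight-loss data from Table 1
loss    <- c(7, 9, 5, 7,  9, 11, 7,  15, 12, 18)
program <- factor(rep(c("Program 1", "Program 2", "Program 3"), times = c(4, 3, 3)))

diet_fit <- aov(loss ~ program)
summary(diet_fit)   # df, SS, MS, F-statistic, and p-value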

Computing the 95% Confidence Interval for the Population Means

Sample mean ± Multiplier * Standard error

An example of using the F-test is the following:

You have two assembly lines. Suppose you sample 10 parts from each of the two assembly lines.

H o : σ1² = σ2²
H a : the variances are not equal

Are the two lines producing similar outputs? Assume alpha = 0.05.

F 0.025, 9, 9 = 4.03
F 1-0.025, 9, 9 = ?


Are variances different?


How would we test if the means are different?

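A sketch of the same two questions in R, using hypothetical measurements for the ten parts sampled from each line:

# Hypothetical measurements from the two assembly lines (10 parts each)
line1 <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 9.9, 10.3)
line2 <- c(10.4, 9.6, 10.7, 9.5, 10.6, 9.4, 10.8, 9.3, 10.5, 10.2)

# Are the variances different?  F-test on the ratio of the sample variances
var.test(line1, line2, conf.level = 0.95)

# How would we test if the means are different?  Two-sample t-test
t.test(line1, line2)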

Two-Factor Analysis of Variance

A two-factor or two-way analysis of variance is used to examine how two qualitative categorical variables (e.g., male/female) affect the mean of a quantitative response variable. For example, a psychologist might want to study how the type and volume of background music affect worker productivity. Alternatively, an economist may be interested in determining the effect of gender and race on mean income. In both of these examples, there is interest in the effect of each separate explanatory factor, as well as the combined effect of both factors.

Assumptions

In order to use the two-way ANOVA, the following assumptions are required:

  • Samples must be independent.
  • Population variances must be equal.
  • Groups must have the same sample size.
  • The populations from which the samples were obtained must be normally distributed (or at least approximately so).
  • The null hypothesis is assumed to be true.

The null hypothesis is as follows:

  • The population means for the first factor have to be equal. This is similar to the one-way ANOVA for the row factor.
  • The population means for the second factor must also be equal. This is similar to the one-way ANOVA for the column factor.
  • There isn’t an interaction between the two factors. This is similar to performing an independence test using contingency tables.

More simply, the null hypothesis implies that the populations are all similar and any differences in the populations are caused by chance, not by the influence of a factor. After carrying out two-way ANOVA it will be possible to analyze the validity of this assumption.

Terms Used in Two-Way ANOVA

The interaction between two factors is the most unique part of a two-way analysis of variance problem. When two factors interact, the effect of one factor on the response variable depends on the level of the other factor. For example, the statement being overweight caused greater increases in blood pressure for men than women describes an interaction. In other words, the effect of weight (factor) on blood pressure (response) depends on gender (factor).

The term main effect is used to describe the overall effect of a single explanatory variable. In the music example, the main effect of the factor "music volume" is the effect on productivity averaged over all types of music. Clearly, the main effect may not always be useful if the interaction is unknown.

In a two-way analysis of variance, three F-statistics are constructed. One is used to test the statistical significance of the interaction, while the other two are used to test the significance of the two separate main effects. The p-value for each F-statistic is also reported--a p-value of <.05 is usually used to indicate significance. When an F-factor is found to have statistical significance, it is considered a main effect. The p-value is also used as an indicator to determine if the two factors have a significant interaction when considered simultaneously. If one factor depends strongly on the other, the F-statistic for the interaction term will have a low p-value. An example output of two-way analysis of variance of restaurant tip data is given in Table 4.

Table 4 - Two-Way Analysis of Variance of Restaurant Tipping Data
Source DF Adj SS Adj MS F-Statistic P-Value
Message 1 14.7 14.7 .13 .715
Sex 1 2602.0 2602.0 23.69 0.00
Interaction 1 438.7 438.7 3.99 .049
Error 85 9335.5 109.8    
Total 88 12407.9      

In this case, the factors being studied are sex (male or female) and message on the receipt ( :-) or none). The p-values in the last column are the most important information contained in this table. A lower p-value indicates a higher level of significance. Message has a significance value of .715. This is much greater than .05 (the cutoff corresponding to 95% confidence), indicating that this factor on its own has no significance (no strong correlation between the presence of a message and the amount of the tip). The reason this occurs is that there is a relationship between the message and the sex of the waiter. The interaction term, which was significant with a value of p = 0.049, showed that drawing a happy face increased the tip for women but decreased it for men. The main effect of waiter sex (with a p-value of approximately 0) shows that there is a statistical difference in average tips for men and women.

Two-Way ANOVA Calculations

As in one-way ANOVA, the main tool used is the sum of squares of each group. Two-way ANOVA can be split between two different types: with repetition and without repetition. With repetition means that every case is repeated a set number of times. For the above example that would mean that the :-) was given to females 10 times and males 10 times, and no message was given to females 10 times and males 10 times.

Using the SS values as a start the F-statistics for two-way ANOVA with repetition are calculated using the chart below where a is the number of levels of main effect A, b is the number of levels of main effect B, and n is the number of repetitions.

Source SS DF Adj MS F-Statistic
Main Effect A From data given a-1 SS/df MS(A)/MS(W)
Main Effect B From data given b-1 SS/df MS(B)/MS(W)
Interaction Effect From data given (a-1)(b-1) SS/df MS(A*B)/MS(W)
Within From data given ab(n-1) SS/df  
Total sum of others abn-1    

Without repetition means there is one reading for every case. For example, if you were investigating whether differences in yield are driven more by the day the readings were taken or by the reactor the readings were taken from, you would have one reading for Reactor 1 on Monday, one reading for Reactor 2 on Monday, and so on. The results for two-way ANOVA without repetition are slightly different in that there is no interaction effect measured and the Within row is replaced with a similar (but not equal) Error row. The calculations needed are shown in the table below.

Source SS DF MS F-Statistic
Main Effect A From data given a-1 SS/df MS(A)/MS(E)
Main Effect B From data given b-1 SS/df MS(B)/MS(E)
Error From data given (a-1)(b-1) SS/df  
Total sum of others ab-1    

These calculations are almost never done by hand. In this class you will usually use Excel or Mathematica to create these tables. Sections describing how to use these programs are found later in this chapter.
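For readers working in R rather than Excel or Mathematica, a hedged sketch of both layouts with small made-up data sets:

# Two-way ANOVA without replication: one yield reading per reactor/day cell,
# so only the two main effects are estimated
dat_norep <- data.frame(
  yield   = c(92, 95, 91, 88, 90, 87),
  reactor = factor(rep(c("R1", "R2"), each = 3)),
  day     = factor(rep(c("Mon", "Tue", "Wed"), times = 2))
)
summary(aov(yield ~ reactor + day, data = dat_norep))

# Two-way ANOVA with replication: three replicates per reactor/catalyst combination,
# so the interaction term can be estimated as well
dat_rep <- data.frame(
  yield    = c(90, 91, 89, 85, 86, 84, 93, 94, 92, 88, 87, 89),
  reactor  = factor(rep(c("R1", "R2"), each = 6)),
  catalyst = factor(rep(rep(c("low", "high"), each = 3), times = 2))
)
summary(aov(yield ~ reactor * catalyst, data = dat_rep))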

Other Methods of Comparison

Unfortunately, the conditions for using the ANOVA F-test do not hold in all situations. In this section, several other methods are presented that do not rely on equal population standard deviations or normally distributed data. It is important to realize that no method of factor analysis is appropriate if the data given are not representative of the group being studied.

Hypotheses About Medians

In general, it is best to construct hypotheses about a population median, rather than the mean. Using the median accounts for samples that are skewed by extreme outliers. Median hypotheses should also be used when dealing with ordinal variables (variables which are described only as being higher or lower than one another and do not have a precise value). When several populations are compared, the hypotheses are stated as -

H 0 : Population medians are equal
H a : Population medians are not all equal

Kruskal-Wallis Test for Comparing Medians

The Kruskal-Wallis Test provides a method of comparing medians by comparing the relative rankings of data in the observed samples. This test is therefore referred to as a rank test or non-parametric test because the test does not make any assumptions about the distribution of data.

To conduct this test, the values in the total data set are first ranked from lowest to highest, with 1 being lowest and N being highest. The ranks of the values within each group are averaged, and the test statistic measures the variation among the average ranks for each group. A p-value can be determined by finding the probability that the variation among the set of rank averages for the groups would be as large or larger as it is if the null hypothesis is true. More information on the Kruskal-Wallis test can be found [here].
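The test is available in base R; as a sketch, here it is applied to the Table 1 weight-loss data (vectors redefined for completeness):

# Kruskal-Wallis rank test on the Table 1 weight-loss data
loss    <- c(7, 9, 5, 7,  9, 11, 7,  15, 12, 18)
program <- factor(rep(c("Program 1", "Program 2", "Program 3"), times = c(4, 3, 3)))

kruskal.test(loss ~ program)   # Kruskal-Wallis chi-squared statistic and p-value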

Mood's Median Test for Comparing Medians

Another nonparametric test used to compare population medians is Mood's Median Test . Also called the Sign Scores Test, this test involves multiple steps.

1. Calculate the median (M) using all data points from every group in the study.
2. Create a contingency table as follows:

  A B C Total
Number of values greater than M        
Number of values less than or equal to M        

3. Calculate the expected value for each data set using the following formula:

\[\text { expected }=\frac{(\text { rowtotal })(\text { columntotal })}{\text { grandtotal }}\nonumber \]

4. Calculate the chi-square value using the following formula

\[\chi^{2}=\sum \frac{(\text { actual }-\text { expected })^{2}}{\text { expected }}\nonumber \]

A chi-square statistic for two-way tables is used to test the null hypothesis that the population medians are all the same. The test is equivalent to testing whether or not the two variables are related.
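Base R has no built-in Mood's median test, but the steps above are easy to carry out by hand; a sketch, again using the Table 1 data purely as an illustration (with so few observations R will warn that the chi-square approximation may be inaccurate):

# Mood's median test, step by step, on the Table 1 weight-loss data
loss    <- c(7, 9, 5, 7,  9, 11, 7,  15, 12, 18)
program <- factor(rep(c("Program 1", "Program 2", "Program 3"), times = c(4, 3, 3)))

M   <- median(loss)                 # step 1: grand median
tab <- table(loss > M, program)     # step 2: counts above vs. at-or-below M per group
tab
chisq.test(tab)                     # steps 3-4: expected counts and chi-square statistic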

ANOVA and Factor Analysis in Process Control

ANOVA and factor analysis are typically used in process control for troubleshooting purposes. When a problem arises in a process control system, these techniques can be used to help solve it. A factor can be defined as a single variable or simple process that has an effect on the system. For example, a factor can be temperature of an inlet stream, flow rate of coolant, or the position of a specific valve. Each factor can be analyzed individually to determine the effect that changing the input has on the process control system as a whole. The input variable can have a large, small, or no effect on what is being analyzed. The amount that the input variable affects the system is called the “factor loading”, and is a numerical measure of how much a specific variable influences the system or the output variable. In general, the larger the factor loading is for a variable, the more of an effect it has on the output variable.

A simple equation for this would be:

Output = f 1 * input 1 + f 2 * input 2 + ... + f n * input n

where f n is the factor loading for the n th input.

Factor analysis is used in this case study to determine the fouling in an alcohol plant reboiler. This article provides some additional insight as to how factor analysis is used in an industrial situation.

Using Mathematica to Conduct ANOVA

Mathematica can be used for one-way and two-way factor analyses. Before this can be done, the ANOVA package must be loaded into Mathematica using the following command:

Needs[ "ANOVA`" ]

Once this command is executed, the 'ANOVA' command can be utilized.

One-Way Factor Analysis

The basic form of the 'ANOVA' command to perform a one-way factor analysis is as follows:

ANOVA[data]

An example set of data with five elements would look like the following (hypothetical values; each row is a factor level followed by the measured response):

ANOVA[{{1, 7}, {1, 9}, {2, 9}, {2, 11}, {3, 15}}]

An output table that includes the degrees of freedom, sum of the squares, mean sum of the squares, F-statistic, and the P-value for the model, error, and total will be displayed when this line is executed. A list of cell means for each model will be displayed beneath the table.

Two-Way Factor Analysis

The basic form of the 'ANOVA' command to perform a two-way factor analysis is as follows:

ANOVA[data, model, vars]

An example set of data with seven elements would look like the following (hypothetical values; each row is the two factor levels followed by the response):

ANOVA[{{1, 1, 5.2}, {1, 2, 5.8}, {1, 2, 5.9}, {2, 1, 6.1}, {2, 1, 6.0}, {2, 2, 6.4}, {2, 2, 6.3}}, {x, y}, {x, y}]

An output table will appear similar to the one that is displayed in the one-way analysis except that there will be a row of statistics for each variable (i.e. x, y).


ANOVA in Microsoft Excel 2007

In order to access the ANOVA data analysis tool, install the package:

  • Click on the Microsoft Office button (big circle with office logo)
  • Click 'Excel Options'
  • Click 'Add-ins' on the left side
  • In the manage drop-down box at the bottom of the window, select 'Excel Add-ins'
  • Click 'Go...'
  • On the Add-Ins window, check the Analysis ToolPak box and click 'OK'

To use this package:

  • Click on the 'Data' tab and select 'Data Analysis'
  • Choose the desired ANOVA type- 'Anova: Single Factor', 'Anova: Two Factor with Replication', or 'Anova: Two Factor without Replication'(see note below for when to use replication)
  • Select the desired data points including data labels at top of the corresponding columns. Make sure the box is checked for 'Labels in first row' in the ANOVA parameter window.
  • Specify alpha in the ANOVA parameter window. Alpha represents the level of significance.
  • Output the results into a new worksheet.

NOTE: Anova: Two Factor with Replication is used in the cases where there are multiple readings for a single factor. For instance, in the input below, there are 2 factors: control architecture and unit. This input shows that there are 3 readings corresponding to each control architecture (FB, MPC, and cascade). In this sense, the control architecture is replicated 3 times, each time providing different data relating to each unit. So, in this case, you would want to use the Anova: Two Factor with Replication option.

[Figure: example input for Anova: Two Factor with Replication]

Anova: Two Factor without Replication is used in cases where there is only one reading pertaining to a particular factor. For example, in the case below, each sample (row) is independent of the other samples since they are based on the day they were taken. Since multiple readings were not taken within the same day, the "without Replication" option should be chosen.

[Figure: example input for Anova: Two Factor without Replication]

Excel outputs:

Summary:
1. Count - number of data points in a set
2. Sum - sum of the data points in a set
3. Average - mean of the data points in a set
4. Variance - variance of the data points in a set

ANOVA:
1. Sum of squares (SS)
2. The degrees of freedom (df)
3. The mean squares (MS)
4. F-statistic (F)
5. P-value
6. F critical

See the figure below for an example of the inputs and outputs using Anova: Single Factor. Note the location of the Data Analysis tab. The data was obtained from the dieting programs described in Table 1. Since the F-statistic is greater than F critical , the null hypothesis can be rejected at a 95% confidence level (since alpha was set at 0.05). Thus, weight loss was not random and in fact depends on diet type chosen.

[Figure: Excel Anova: Single Factor input and output for the dieting data]

Example 1

Determine the fouling rate of the reboiler at the following parameters:

  • T = 410 K
  • C c = 16.7 g/L
  • R T = 145 min

Which process variable has the greatest effect (per unit) on the fouling rate of the reboiler?

Note that the tables below contain made-up data. The output data for each single input were gathered assuming that the contribution of the other input variables to the output is negligible. Although the factors that affect the fouling of the reboiler are similar to the ones found in the article linked in the "ANOVA and Factor Analysis in Process Control" section, the data are not.

Temperature of Reboiler (K) 400 450 500
Fouling Rate (mg/min) 0.8 0.86 0.95
Catalyst Concentration (g/L) 10 20 30
Fouling Rate (mg/min) 0.5 1.37 2.11
Residence Time (min) 60 120 180
Fouling Rate (mg/min) 0.95 2.3 3.81

1) Determine the "factor loading" for each variable.

This can be done using any linearization tool. In this case, the factor loading is just the slope of the line for each set of data. Using Microsoft Excel, the equations for each set of data are the following:

Temperature of Reboiler: y = 0.0015 * x + 0.195; factor loading = 0.0015
Catalyst Concentration: y = 0.0805 * x − 0.2833; factor loading = 0.0805
Residence Time: y = 0.0238 * x − 0.5067; factor loading = 0.0238

[Figure: Excel linear fits of fouling rate versus each process variable]

2) Determine the fouling rate for the given process conditions and which process variable affects the fouling rate the most (per unit). Note that the units of the factor loading value are always the units of the output divided by the units of the input.

Plug the factor loading values into the following equation:

Output = f 1 * input 1 + f 2 * input 2 + ... + f n * input n

You will end up with:

Fouling Rate = 0.0015 * T + 0.0805 * C c + 0.0238 * R T

Now plug in the process variables:

Fouling Rate = 0.0015 * 410 + 0.0805 * 16.7 + 0.0238 * 145 = 5.41 mg/min

The process variable that affects the fouling rate the most (per unit) is the catalyst concentration because it has the largest factor loading value.
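As an illustration, the slopes and the predicted fouling rate can be reproduced in R with ordinary least squares. This is only a sketch using the made-up data from the tables above; the object names are illustrative:

```r
# Made-up data from the tables above
temp  <- data.frame(x = c(400, 450, 500), y = c(0.8, 0.86, 0.95))   # reboiler temperature (K)
conc  <- data.frame(x = c(10, 20, 30),    y = c(0.5, 1.37, 2.11))   # catalyst concentration (g/L)
rtime <- data.frame(x = c(60, 120, 180),  y = c(0.95, 2.3, 3.81))   # residence time (min)

# The "factor loadings" here are simply the fitted slopes
loadings <- c(T  = coef(lm(y ~ x, data = temp))[["x"]],
              Cc = coef(lm(y ~ x, data = conc))[["x"]],
              RT = coef(lm(y ~ x, data = rtime))[["x"]])
loadings   # roughly 0.0015, 0.0805, 0.0238

# Predicted fouling rate at T = 410 K, Cc = 16.7 g/L, RT = 145 min
sum(loadings * c(410, 16.7, 145))   # roughly 5.4 mg/min
```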

Example \(\PageIndex{2}\)

The exit flow rate leaving a tank is being tested for 3 cases. The first case is under the normal operating conditions, while the second (A) and the third (B) cases are for new conditions that are being tested. A flow value of 7 gallons/hour is desired, with a maximum of 10. A total of 24 runs are tested, with 8 runs for each case. The tests are run to determine whether any of the new conditions will result in a more accurate flow rate. First, we determine whether the new conditions A and B affect the flow rate. The results are as follows:

[Figure: Flow rate data and ANOVA computations for the three cases]

The recorded values for the 3 cases are tabulated. Following this, the values for each case are squared and the sums of all of these are taken. For the 3 cases, the sums are squared and then their means are found.

These values are used to help determine the table above (the equations give an idea as to how they are calculated). With the help of ANOVA, these values can be determined faster; this can be done using the Mathematica function explained above.

Conclusion:

F critical equals 3.4668, from an F-table. Since the calculated F value is greater than F critical, we know that there is a statistically significant difference between at least 2 of the conditions, so the null hypothesis can be rejected. However, we do not know between which 2 conditions there is a difference; a post-hoc analysis will help us determine this. We are, however, able to confirm that there is a difference.
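A rough R sketch of this analysis and of the suggested post-hoc step is shown below; the data frame flow and its columns rate and condition are hypothetical placeholders, since the actual readings appear only in the figure above:

```r
# Hypothetical layout: 24 flow readings (8 per case) in a data frame `flow`
# with columns `rate` (gal/hr) and `condition` ("Normal", "A", "B")
fit <- aov(rate ~ condition, data = flow)
summary(fit)   # overall F test on df1 = 2 and df2 = 21 (F critical = 3.4668)

# Post-hoc pairwise comparisons to see which conditions actually differ
TukeyHSD(fit)
```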

Example \(\PageIndex{3}\)

As the new engineer on site, one of your assigned tasks is to install a new control architecture for three different units. You test three units in triplicate, each with 3 different control architecture: feedback (FB), model predictive control (MPC) and cascade control. In each case you measure the yield and organize the data as follows:

[Figure: Yield data organized by unit and control architecture (FB, MPC, cascade)]

Do the units differ significantly? Do the control architectures differ significantly?

This problem can be solved using ANOVA Two factor with replication analysis.

[Figure: Anova: Two Factor with Replication output for the yield data]
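The same analysis can also be run outside Excel. The following is only a sketch, since the yield values appear only in the figure above; the data frame yields with columns yield, unit, and architecture is a hypothetical layout:

```r
# Hypothetical layout: 27 yield measurements (3 units x 3 architectures x 3 replicates)
fit <- aov(yield ~ unit * architecture, data = yields)
summary(fit)   # tests the unit effect, the architecture effect, and their interaction
```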

Exercise \(\PageIndex{1}\)

ANOVA analysis works best for which type of model?

  • Non-linear models
  • Linear models
  • Exponential models
  • All of the above

Exercise \(\PageIndex{2}\)

Two-Way ANOVA analysis is used to compare what?

  • Any two sets of data
  • Two One-Way ANOVA models to each other
  • Two factors and their effects on the output

High Dimensional Statistics with R: Factor Analysis

Overview

Teaching: 30 min | Exercises: 10 min

Questions:

  • What is factor analysis and when can it be used?
  • What are communality and uniqueness in factor analysis?
  • How to decide on the number of factors to use?
  • How to interpret the output of factor analysis?

Objectives:

  • Perform a factor analysis on high-dimensional data.
  • Select an appropriate number of factors.
  • Interpret the output of factor analysis.

Introduction

Biologists often encounter high-dimensional datasets from which they wish to extract underlying features – they need to carry out dimensionality reduction. The last episode dealt with one method to achieve this, called principal component analysis (PCA), which expressed new dimension-reduced components as linear combinations of the original features in the dataset. Principal components can therefore be difficult to interpret. Here, we introduce a related but more interpretable method called factor analysis (FA), which constructs new components, called factors, that explicitly represent underlying (latent) constructs in our data. Like PCA, FA uses linear combinations of the original features, but the factors are constructed to represent latent constructs rather than purely mathematical summaries of variance. FA is therefore often more interpretable and useful when we would like to extract meaning from our dimension-reduced set of variables.

There are two types of FA, called exploratory and confirmatory factor analysis (EFA and CFA). Both EFA and CFA aim to reproduce the observed relationships among a group of features with a smaller set of latent variables. EFA is used in a descriptive (exploratory) manner to uncover which measured variables are reasonable indicators of the various latent dimensions. In contrast, CFA is conducted in an a priori , hypothesis-testing manner that requires strong empirical or theoretical foundations. We will mainly focus on EFA here, which is used to group features into a specified number of latent factors.

Unlike with PCA, researchers using FA have to specify the number of latent variables (factors) at the point of running the analysis. Researchers may use exploratory data analysis methods (including PCA) to provide an initial estimate of how many factors adequately explain the variation observed in a dataset. In practice, a range of different values is usually tested.

Motivating example: student scores

One scenario for using FA would be testing whether student scores in different subjects can be summarised by certain subject categories. Take a look at the hypothetical dataset below. If we were to run an EFA on this, we might find that the scores can be summarised well by two factors, which we can then interpret. We have labelled these hypothetical factors "mathematical ability" and "writing ability".

[Figure: Table of student scores across several subjects. A curly bracket labelled "Factor 1: mathematical ability" groups Arithmetic, Algebra, Geometry, and Statistics; a second bracket labelled "Factor 2: writing ability" groups Creative Writing, Literature, and Spelling/Grammar.]

Student scores data across several subjects with hypothesised factors.

So, EFA is designed to identify a specified number of unobservable factors from observable features contained in the original dataset. This is slightly different from PCA, which does not do this directly. Just to recap, PCA creates as many principal components as there are features in the dataset, each component representing a different linear combination of features. The principal components are ordered by the amount of variance they account for.

Prostate cancer patient data

We revisit the prostate dataset (https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html) of 97 men who have prostate cancer. Although not strictly a high-dimensional dataset, as with other episodes, we use this dataset to explore the method.

In this example, we use the clinical variables to identify factors representing various clinical variables from prostate cancer patients. Two principal components have already been identified as explaining a large proportion of variance in the data when these data were analysed in the PCA episode. We may expect a similar number of factors to exist in the data.

Let’s subset the data to just include the log-transformed clinical variables for the purposes of this episode:
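(The lesson's own code is not reproduced here; the following is a minimal sketch, assuming the full data are held in a data frame called prostate whose log-transformed clinical variables carry the names seen in the output below.)

```r
# Keep only the log-transformed clinical variables for this episode
pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")]
head(pros2)
```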

Performing exploratory factor analysis

EFA may be implemented in R using the factanal() function from the stats package (which is a built-in package in base R). This function fits a factor analysis by maximising the log-likelihood using a data matrix as input. The number of factors to be fitted in the analysis is specified by the user using the factors argument.

Challenge 1 (3 mins)

Use the factanal() function to identify the minimum number of factors necessary to explain most of the variation in the data.

Solution

    # Include one factor only
    pros_fa <- factanal(pros2, factors = 1)
    pros_fa

    Call:
    factanal(x = pros2, factors = 1)

    Uniquenesses:
     lcavol lweight    lbph     lcp    lpsa
      0.149   0.936   0.994   0.485   0.362

    Loadings:
            Factor1
    lcavol  0.923
    lweight 0.253
    lbph
    lcp     0.718
    lpsa    0.799

                   Factor1
    SS loadings      2.074
    Proportion Var   0.415

    Test of the hypothesis that 1 factor is sufficient.
    The chi square statistic is 29.81 on 5 degrees of freedom.
    The p-value is 1.61e-05

    # p-value < 0.05 suggests that one factor is not sufficient;
    # we reject the null hypothesis that one factor captures the full
    # dimensionality in the dataset

    # Include two factors
    pros_fa <- factanal(pros2, factors = 2)
    pros_fa

    Call:
    factanal(x = pros2, factors = 2)

    Uniquenesses:
     lcavol lweight    lbph     lcp    lpsa
      0.121   0.422   0.656   0.478   0.317

    Loadings:
            Factor1 Factor2
    lcavol  0.936
    lweight 0.165   0.742
    lbph            0.586
    lcp     0.722
    lpsa    0.768   0.307

                   Factor1 Factor2
    SS loadings      2.015   0.992
    Proportion Var   0.403   0.198
    Cumulative Var   0.403   0.601

    Test of the hypothesis that 2 factors are sufficient.
    The chi square statistic is 0.02 on 1 degree of freedom.
    The p-value is 0.878

    # p-value > 0.05 suggests that two factors are sufficient;
    # we cannot reject the null hypothesis that two factors capture the
    # full dimensionality in the dataset

    # Include three factors
    pros_fa <- factanal(pros2, factors = 3)

    Error in factanal(pros2, factors = 3): 3 factors are too many for 5 variables

    # The error shows that fitting three factors is not appropriate
    # for only 5 variables (number of factors too high)

The output of factanal() shows the loadings for each of the input variables associated with each factor. The loadings are values between -1 and 1 which represent the relative contribution each input variable makes to the factors. Positive values show that these variables are positively related to the factors, while negative values show a negative relationship between variables and factors. Loading values are missing for some variables because R does not print loadings less than 0.1.

How many factors do we need?

There are numerous ways to select the “best” number of factors. One is to use the minimum number of features that does not leave a significant amount of variance unaccounted for. In practice, we repeat the factor analysis for different numbers of factors (by specifying different values in the factors argument). If we have an idea of how many factors there will be before analysis, we can start with that number. The final section of the analysis output then shows the results of a hypothesis test in which the null hypothesis is that the number of factors used in the model is sufficient to capture most of the variation in the dataset. If the p-value is less than our significance level (for example 0.05), we reject the null hypothesis that the number of factors is sufficient and we repeat the analysis with more factors. When the p-value is greater than our significance level, we do not reject the null hypothesis that the number of factors used captures variation in the data. We may therefore conclude that this number of factors is sufficient.

As with PCA, the fewer factors needed to explain most of the variation in the dataset, the better. It is easier to explore and interpret results using a smaller number of factors which represent underlying features in the data.

Variance accounted for by factors - communality and uniqueness

The communality of a variable is the sum of its squared loadings. It represents the proportion of the variance in a variable that is accounted for by the FA model.

Uniqueness is the opposite of communality and represents the amount of variation in a variable that is not accounted for by the FA model. Uniqueness is calculated by subtracting the communality value from 1. If uniqueness is high for a given variable, that means this variable is not well explained/accounted for by the factors identified.
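As a small sketch in R, assuming the two-factor fit pros_fa from the challenge above, communalities and uniquenesses can be computed directly from the loadings:

```r
# Loadings matrix from the two-factor fit (strip the "loadings" class)
L <- unclass(pros_fa$loadings)

# Communality: sum of squared loadings for each variable
communality <- rowSums(L^2)

# Uniqueness: variation in each variable not accounted for by the factors
uniqueness <- 1 - communality

# These should closely match the uniquenesses reported by factanal()
cbind(communality, uniqueness, reported = pros_fa$uniquenesses)
```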

Visualising the contribution of each variable to the factors

Similar to a biplot as we produced in the PCA episode, we can “plot the loadings”. This shows how each original variable contributes to each of the factors we chose to visualise.
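A minimal sketch of such a loadings plot in base R, again assuming the two-factor fit pros_fa from above:

```r
# Extract the loadings matrix and plot factor 2 loadings against factor 1 loadings
L <- unclass(pros_fa$loadings)

plot(L[, 1], L[, 2], xlab = "Factor 1", ylab = "Factor 2",
     xlim = c(-0.2, 1), ylim = c(-0.2, 1), pch = 19)
text(L[, 1], L[, 2], labels = rownames(L), pos = 3)   # label each variable
abline(h = 0, v = 0, lty = 2)                         # reference lines at zero
```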

[Figure: Scatter plot of the factor 2 loadings versus the factor 1 loadings for each feature. The lpsa, lcavol and lcp points lie to the east of the plot, indicating a high loading on factor 1 and a close to zero loading on factor 2. The lbph and lweight points lie to the north of the plot, indicating a close to zero loading on factor 1 and a high loading on factor 2.]

Factor 2 loadings versus factor 1 loadings for each feature.

Challenge 2 (3 mins)

Use the output from your factor analysis and the plots above to interpret the results of your analysis. What variables are most important in explaining each factor? Do you think this makes sense biologically? Consider or discuss in groups.

Solution

This plot suggests that the variables lweight and lbph are associated with high values on factor 2 (but lower values on factor 1), and the variables lcavol, lcp and lpsa are associated with high values on factor 1 (but lower values on factor 2). There appear to be two 'clusters' of variables which can be represented by the two factors. The grouping of weight and enlargement (lweight and lbph) makes sense biologically, as we would expect prostate enlargement to be associated with greater weight. The groupings of lcavol, lcp, and lpsa also make sense biologically, as larger cancer volume may be expected to be associated with greater cancer spread and therefore higher PSA in the blood.

Advantages and disadvantages of Factor Analysis

There are several advantages and disadvantages of using FA as a dimensionality reduction method.

Advantages:

  • FA is a useful way of combining different groups of data into known representative factors, thus reducing dimensionality in a dataset.
  • FA can take into account researchers’ expert knowledge when choosing the number of factors to use, and can be used to identify latent or hidden variables which may not be apparent from using other analysis methods.
  • It is easy to implement with many software tools available to carry out FA.
  • Confirmatory FA can be used to test hypotheses.

Disadvantages:

  • Justifying the choice of number of factors to use may be difficult if little is known about the structure of the data before analysis is carried out.
  • Sometimes, it can be difficult to interpret what factors mean after analysis has been completed.
  • Like PCA, standard methods of carrying out FA assume that input variables are continuous, although extensions to FA allow ordinal and binary variables to be included (after transforming the input matrix).

Further reading

  • Gundogdu et al. (2019) Comparison of performances of Principal Component Analysis (PCA) and Factor Analysis (FA) methods on the identification of cancerous and healthy colon tissues. International Journal of Mass Spectrometry 445:116204.
  • Kustra et al. (2006) A factor analysis model for functional genomics. BMC Bioinformatics 7: doi:10.1186/1471-2105-7-21.
  • Yong, A.G. & Pearce, S. (2013) A beginner’s guide to factor analysis: focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology 9(2):79-94.
  • Confirmatory factor analysis can be carried out with the package Lavaan .
  • A more sophisticated implementation of EFA is available in the packages EFA.dimensions and psych .
Key Points

  • Factor analysis is a method used for reducing dimensionality in a dataset by reducing variation contained in multiple variables into a smaller number of uncorrelated factors.
  • PCA can be used to identify the number of factors to initially use in factor analysis.
  • The factanal() function in R can be used to fit a factor analysis, where the number of factors is specified by the user.
  • Factor analysis can take into account expert knowledge when deciding on the number of factors to use, but a disadvantage is that the output requires careful interpretation.

Hypothesis Testing - Analysis of Variance (ANOVA)

Lisa Sullivan, PhD

Professor of Biostatistics

Boston University School of Public Health


Introduction

This module will continue the discussion of hypothesis testing, where a specific statement or hypothesis is generated about a population parameter, and sample statistics are used to assess the likelihood that the hypothesis is true. The hypothesis is based on available information and the investigator's belief about the population parameters. The specific test considered here is called analysis of variance (ANOVA) and is a test of hypothesis that is appropriate to compare means of a continuous variable in two or more independent comparison groups. For example, in some clinical trials there are more than two comparison groups. In a clinical trial to evaluate a new medication for asthma, investigators might compare an experimental medication to a placebo and to a standard treatment (i.e., a medication currently being used). In an observational study such as the Framingham Heart Study, it might be of interest to compare mean blood pressure or mean cholesterol levels in persons who are underweight, normal weight, overweight and obese.  

The technique to test for a difference in more than two independent means is an extension of the two independent samples procedure discussed previously which applies when there are exactly two independent comparison groups. The ANOVA technique applies when there are two or more than two independent groups. The ANOVA procedure is used to compare the means of the comparison groups and is conducted using the same five step approach used in the scenarios discussed in previous sections. Because there are more than two groups, however, the computation of the test statistic is more involved. The test statistic must take into account the sample sizes, sample means and sample standard deviations in each of the comparison groups.

If one is examining the means observed among, say, three groups, it might be tempting to perform three separate group-to-group comparisons, but this approach is incorrect because each of these comparisons fails to take into account the total data, and it increases the likelihood of incorrectly concluding that there are statistically significant differences, since each comparison adds to the probability of a type I error. Analysis of variance avoids these problems by asking a more global question, i.e., whether there are significant differences among the groups, without addressing differences between any two groups in particular (although there are additional tests that can do this if the analysis of variance indicates that there are differences among the groups).

The fundamental strategy of ANOVA is to systematically examine variability within groups being compared and also examine variability among the groups being compared.

Learning Objectives

After completing this module, the student will be able to:

  • Perform analysis of variance by hand
  • Appropriately interpret results of analysis of variance tests
  • Distinguish between one and two factor analysis of variance tests
  • Identify the appropriate hypothesis testing procedure based on type of outcome variable and number of samples

The ANOVA Approach

Consider an example with four independent groups and a continuous outcome measure. The independent groups might be defined by a particular characteristic of the participants such as BMI (e.g., underweight, normal weight, overweight, obese) or by the investigator (e.g., randomizing participants to one of four competing treatments, call them A, B, C and D). Suppose that the outcome is systolic blood pressure, and we wish to test whether there is a statistically significant difference in mean systolic blood pressures among the four groups. The sample data are organized as follows:

 

[Table: For each of the four comparison groups, the sample size (n), the sample mean, and the sample standard deviation (s) are recorded.]

The hypotheses of interest in an ANOVA are as follows:

  • H 0 : μ 1 = μ 2 = μ 3 ... = μ k
  • H 1 : Means are not all equal.

where k = the number of independent comparison groups.

In this example, the hypotheses are:

  • H 0 : μ 1 = μ 2 = μ 3 = μ 4
  • H 1 : The means are not all equal.

The null hypothesis in ANOVA is always that there is no difference in means. The research or alternative hypothesis is always that the means are not all equal and is usually written in words rather than in mathematical symbols. The research hypothesis captures any difference in means and includes, for example, the situation where all four means are unequal, where one is different from the other three, where two are different, and so on. The alternative hypothesis, as shown above, captures all possible situations other than equality of all means specified in the null hypothesis.

Test Statistic for ANOVA

The test statistic for testing H 0 : μ 1 = μ 2 = ... = μ k is:

F = MSB / MSE

and the critical value is found in a table of probability values for the F distribution with degrees of freedom df 1 = k-1 and df 2 = N-k. The table can be found in "Other Resources".

NOTE: The test statistic F assumes equal variability in the k populations (i.e., the population variances are equal, or σ 1 2 = σ 2 2 = ... = σ k 2 ). This means that the outcome is equally variable in each of the comparison populations. This assumption is the same as that assumed for appropriate use of the test statistic to test equality of two independent means. It is possible to assess the likelihood that the assumption of equal variances is true, and the test can be conducted in most statistical computing packages. If the variability in the k comparison groups is not similar, then alternative techniques must be used.

The F statistic is computed by taking the ratio of what is called the "between treatment" variability to the "residual or error" variability. This is where the name of the procedure originates. In analysis of variance we are testing for a difference in means (H 0 : means are all equal versus H 1 : means are not all equal) by evaluating variability in the data. The numerator captures between treatment variability (i.e., differences among the sample means) and the denominator contains an estimate of the variability in the outcome. The test statistic is a measure that allows us to assess whether the differences among the sample means (numerator) are more than would be expected by chance if the null hypothesis is true. Recall in the two independent sample test, the test statistic was computed by taking the ratio of the difference in sample means (numerator) to the variability in the outcome (estimated by Sp).  

The decision rule for the F test in ANOVA is set up in a similar way to decision rules we established for t tests. The decision rule again depends on the level of significance and the degrees of freedom. The F statistic has two degrees of freedom. These are denoted df 1 and df 2 , and called the numerator and denominator degrees of freedom, respectively. The degrees of freedom are defined as follows:

df 1 = k-1 and df 2 =N-k,

where k is the number of comparison groups and N is the total number of observations in the analysis. If the null hypothesis is true, the between treatment variation (numerator) will not exceed the residual or error variation (denominator) and the F statistic will be small. If the null hypothesis is false, then the F statistic will be large. The rejection region for the F test is always in the upper (right-hand) tail of the distribution as shown below.

Rejection Region for F Test with α = 0.05, df 1 = 3 and df 2 = 36 (k = 4, N = 40)

Graph of rejection region for the F statistic with alpha=0.05

For the scenario depicted here, the decision rule is: Reject H 0 if F > 2.87.
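For reference, this critical value can also be looked up with the F quantile function in R rather than in a printed table (a small sketch):

```r
# Upper-tail F critical value for alpha = 0.05 with df1 = 3 and df2 = 36
qf(0.95, df1 = 3, df2 = 36)   # roughly 2.87
```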

The ANOVA Procedure

We will next illustrate the ANOVA procedure using the five step approach. Because the computation of the test statistic is involved, the computations are often organized in an ANOVA table. The ANOVA table breaks down the components of variation in the data into variation between treatments and error or residual variation. Statistical computing packages also produce ANOVA tables as part of their standard output for ANOVA, and the ANOVA table is set up as follows: 

Source of Variation     Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)      F
Between Treatments      SSB                     k - 1                      MSB = SSB/(k-1)        F = MSB/MSE
Error (or Residual)     SSE                     N - k                      MSE = SSE/(N-k)
Total                   SST                     N - 1

where  

  • X = individual observation,
  • k = the number of treatments or independent comparison groups, and
  • N = total number of observations or total sample size.

The ANOVA table above is organized as follows.

  • The first column is entitled "Source of Variation" and delineates the between treatment and error or residual variation. The total variation is the sum of the between treatment and error variation.
  • The second column is entitled "Sums of Squares (SS)". The between treatment sums of squares is

SSB = Σ n j ( X̄ j - X̄ )²

and is computed by summing the squared differences between each treatment (or group) mean and the overall mean. The squared differences are weighted by the sample sizes per group (n j ). The error sums of squares is

SSE = ΣΣ ( X - X̄ j )²

and is computed by summing the squared differences between each observation and its group mean (i.e., the squared differences between each observation in group 1 and the group 1 mean, the squared differences between each observation in group 2 and the group 2 mean, and so on). The double summation (ΣΣ) indicates summation of the squared differences within each treatment and then summation of these totals across treatments to produce a single value. (This will be illustrated in the following examples.) The total sums of squares is

SST = ΣΣ ( X - X̄ )²

and is computed by summing the squared differences between each observation and the overall sample mean. In an ANOVA, data are organized by comparison or treatment groups. If all of the data were pooled into a single sample, SST would reflect the numerator of the sample variance computed on the pooled or total sample. SST does not figure into the F statistic directly. However, SST = SSB + SSE, thus if two sums of squares are known, the third can be computed from the other two.

  • The third column contains degrees of freedom . The between treatment degrees of freedom is df 1 = k-1. The error degrees of freedom is df 2 = N - k. The total degrees of freedom is N-1 (and it is also true that (k-1) + (N-k) = N-1).
  • The fourth column contains "Mean Squares (MS)" which are computed by dividing sums of squares (SS) by degrees of freedom (df), row by row. Specifically, MSB = SSB/(k-1) and MSE = SSE/(N-k). Dividing SST by (N-1) produces the variance of the total sample. The F statistic is in the rightmost column of the ANOVA table and is computed by taking the ratio of MSB/MSE.

A clinical trial is run to compare weight loss programs and participants are randomly assigned to one of the comparison programs and are counseled on the details of the assigned program. Participants follow the assigned program for 8 weeks. The outcome of interest is weight loss, defined as the difference in weight measured at the start of the study (baseline) and weight measured at the end of the study (8 weeks), measured in pounds.  

Three popular weight loss programs are considered. The first is a low calorie diet. The second is a low fat diet and the third is a low carbohydrate diet. For comparison purposes, a fourth group is considered as a control group. Participants in the fourth group are told that they are participating in a study of healthy behaviors with weight loss only one component of interest. The control group is included here to assess the placebo effect (i.e., weight loss due to simply participating in the study). A total of twenty patients agree to participate in the study and are randomly assigned to one of the four diet groups. Weights are measured at baseline and patients are counseled on the proper implementation of the assigned diet (with the exception of the control group). After 8 weeks, each patient's weight is again measured and the difference in weights is computed by subtracting the 8 week weight from the baseline weight. Positive differences indicate weight losses and negative differences indicate weight gains. For interpretation purposes, we refer to the differences in weights as weight losses and the observed weight losses are shown below.

Low Calorie    Low Fat    Low Carbohydrate    Control
8              2          3                   2
9              4          5                   2
6              3          4                   -1
7              5          2                   0
3              1          3                   3

Is there a statistically significant difference in the mean weight loss among the four diets?  We will run the ANOVA using the five-step approach.

  • Step 1. Set up hypotheses and determine level of significance

H 0 : μ 1 = μ 2 = μ 3 = μ 4
H 1 : Means are not all equal
α = 0.05

  • Step 2. Select the appropriate test statistic.  

The test statistic is the F statistic for ANOVA, F=MSB/MSE.

  • Step 3. Set up decision rule.  

The appropriate critical value can be found in a table of probabilities for the F distribution (see "Other Resources"). In order to determine the critical value of F we need degrees of freedom, df 1 = k-1 and df 2 = N-k. In this example, df 1 = k-1 = 4-1 = 3 and df 2 = N-k = 20-4 = 16. The critical value is 3.24 and the decision rule is as follows: Reject H 0 if F > 3.24.

  • Step 4. Compute the test statistic.  

To organize our computations we complete the ANOVA table. In order to compute the sums of squares we must first compute the sample means for each group and the overall mean based on the total sample.  

 

              Low Calorie    Low Fat    Low Carbohydrate    Control
n             5              5          5                   5
Group mean    6.6            3.0        3.4                 1.2

We can now compute the between treatment sums of squares, SSB = Σ n j ( X̄ j - X̄ )². If we pool all N = 20 observations, the overall mean is X̄ = 3.55. So, in this case:

SSB = 5(6.6 - 3.55)² + 5(3.0 - 3.55)² + 5(3.4 - 3.55)² + 5(1.2 - 3.55)² = 75.8

Next we compute the error sums of squares, SSE.

SSE requires computing the squared differences between each observation and its group mean. We will compute SSE in parts. For the participants in the low calorie diet:  

Group mean = 6.6

X         X - 6.6    (X - 6.6)²
8         1.4        2.0
9         2.4        5.8
6         -0.6       0.4
7         0.4        0.2
3         -3.6       13.0
Totals    0          21.4

For the participants in the low fat diet:  

Group mean = 3.0

X         X - 3.0    (X - 3.0)²
2         -1.0       1.0
4         1.0        1.0
3         0.0        0.0
5         2.0        4.0
1         -2.0       4.0
Totals    0          10.0

For the participants in the low carbohydrate diet:  

Group mean = 3.4

X         X - 3.4    (X - 3.4)²
3         -0.4       0.2
5         1.6        2.6
4         0.6        0.4
2         -1.4       2.0
3         -0.4       0.2
Totals    0          5.4

For the participants in the control group:

Group mean = 1.2

X         X - 1.2    (X - 1.2)²
2         0.8        0.6
2         0.8        0.6
-1        -2.2       4.8
0         -1.2       1.4
3         1.8        3.2
Totals    0          10.6

We can now construct the ANOVA table.

Source of Variation     Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)    F
Between Treatments      75.8                    4-1=3                      75.8/3=25.3          25.3/3.0=8.43
Error (or Residual)     47.4                    20-4=16                    47.4/16=3.0
Total                   123.2                   20-1=19

  • Step 5. Conclusion.  

We reject H 0 because 8.43 > 3.24. We have statistically significant evidence at α=0.05 to show that there is a difference in mean weight loss among the four diets.    

ANOVA is a test that provides a global assessment of a statistical difference in more than two independent means. In this example, we find that there is a statistically significant difference in mean weight loss among the four diets considered. In addition to reporting the results of the statistical test of hypothesis (i.e., that there is a statistically significant difference in mean weight losses at α=0.05), investigators should also report the observed sample means to facilitate interpretation of the results. In this example, participants in the low calorie diet lost an average of 6.6 pounds over 8 weeks, as compared to 3.0 and 3.4 pounds in the low fat and low carbohydrate groups, respectively. Participants in the control group lost an average of 1.2 pounds which could be called the placebo effect because these participants were not participating in an active arm of the trial specifically targeted for weight loss. Are the observed weight losses clinically meaningful?
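For comparison, here is a minimal R sketch of the same one-way ANOVA, using the weight loss data tabulated above (object and column names are illustrative):

```r
# Weight loss (pounds) after 8 weeks for the four diet groups
weight_loss <- data.frame(
  loss = c(8, 9, 6, 7, 3,      # low calorie
           2, 4, 3, 5, 1,      # low fat
           3, 5, 4, 2, 3,      # low carbohydrate
           2, 2, -1, 0, 3),    # control
  diet = rep(c("Low Calorie", "Low Fat", "Low Carbohydrate", "Control"), each = 5)
)

# One-factor ANOVA: F = MSB/MSE with df1 = k-1 = 3 and df2 = N-k = 16
fit <- aov(loss ~ diet, data = weight_loss)
summary(fit)
# The F statistic is roughly 8.5 (8.43 in the rounded hand computation above),
# which exceeds the critical value of 3.24, so H0 is rejected.
```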

Another ANOVA Example

Calcium is an essential mineral that regulates the heart, is important for blood clotting and for building healthy bones. The National Osteoporosis Foundation recommends a daily calcium intake of 1000-1200 mg/day for adult men and women. While calcium is contained in some foods, most adults do not get enough calcium in their diets and take supplements. Unfortunately some of the supplements have side effects such as gastric distress, making them difficult for some patients to take on a regular basis.  

 A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the study. Each participant's daily calcium intake is measured based on reported food intake and supplements. The data are shown below.   

Normal Bone Density    Osteopenia    Osteoporosis
1200                   1000          890
1000                   1100          650
980                    700           1100
900                    800           900
750                    500           400
800                    700           350

Is there a statistically significant difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and osteoporosis? We will run the ANOVA using the five-step approach.

H 0 : μ 1 = μ 2 = μ 3
H 1 : Means are not all equal
α = 0.05

In order to determine the critical value of F we need degrees of freedom, df 1 =k-1 and df 2 =N-k.   In this example, df 1 =k-1=3-1=2 and df 2 =N-k=18-3=15. The critical value is 3.68 and the decision rule is as follows: Reject H 0 if F > 3.68.

To organize our computations we will complete the ANOVA table. In order to compute the sums of squares we must first compute the sample means for each group and the overall mean.  

              Normal Bone Density    Osteopenia    Osteoporosis
n             6                      6             6
Group mean    938.3                  800.0         715.0

If we pool all N = 18 observations, the overall mean is 817.8.

We can now compute the between treatment sums of squares, SSB = Σ n j ( X̄ j - X̄ )². Substituting (and carrying full precision in the group means):

SSB = 6(938.3 - 817.8)² + 6(800.0 - 817.8)² + 6(715.0 - 817.8)² = 152,477.7

SSE requires computing the squared differences between each observation and its group mean. We will compute SSE in parts. For the participants with normal bone density:

Group mean = 938.3

X         X - group mean    (X - group mean)²
1200      261.6667          68,486.9
1000      61.6667           3,806.9
980       41.6667           1,738.9
900       -38.3333          1,466.9
750       -188.333          35,456.9
800       -138.333          19,126.9
Total     0                 130,083.3

For participants with osteopenia:

Group mean = 800.0

X         X - group mean    (X - group mean)²
1000      200               40,000
1100      300               90,000
700       -100              10,000
800       0                 0
500       -300              90,000
700       -100              10,000
Total     0                 240,000

For participants with osteoporosis:

Group mean = 715.0

X         X - group mean    (X - group mean)²
890       175               30,625
650       -65               4,225
1100      385               148,225
900       185               34,225
400       -315              99,225
350       -365              133,225
Total     0                 449,750

We can now construct the ANOVA table.

Source of Variation     Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)    F
Between Treatments      152,477.7               2                          76,238.6             1.395
Error (or Residual)     819,833.3               15                         54,655.5
Total                   972,311.0               17

We do not reject H 0 because 1.395 < 3.68. We do not have statistically significant evidence at α = 0.05 to show that there is a difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and osteoporosis. Are the differences in mean calcium intake clinically meaningful? If so, what might account for the lack of statistical significance?

One-Way ANOVA in R

The video below by Mike Marin demonstrates how to perform analysis of variance in R. It also covers some other statistical issues, but the initial part of the video will be useful to you.
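As a complement to the video, here is a minimal sketch of a one-way ANOVA in R using the calcium intake data from the example above (object and column names are illustrative):

```r
# Daily calcium intake (mg) by bone density group
calcium <- data.frame(
  intake = c(1200, 1000, 980, 900, 750, 800,   # normal bone density
             1000, 1100, 700, 800, 500, 700,   # osteopenia
             890, 650, 1100, 900, 400, 350),   # osteoporosis
  group  = rep(c("Normal", "Osteopenia", "Osteoporosis"), each = 6)
)

fit <- aov(intake ~ group, data = calcium)
summary(fit)
# F is about 1.4 on (2, 15) degrees of freedom, below the critical value of 3.68,
# so H0 is not rejected - consistent with the hand computation above.
```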

Two-Factor ANOVA

The ANOVA tests described above are called one-factor ANOVAs. There is one treatment or grouping factor with k > 2 levels and we wish to compare the means across the different categories of this factor. The factor might represent different diets, different classifications of risk for disease (e.g., osteoporosis), different medical treatments, different age groups, or different racial/ethnic groups. There are situations where it may be of interest to compare means of a continuous outcome across two or more factors. For example, suppose a clinical trial is designed to compare five different treatments for joint pain in patients with osteoarthritis. Investigators might also hypothesize that there are differences in the outcome by sex. This is an example of a two-factor ANOVA where the factors are treatment (with 5 levels) and sex (with 2 levels). In the two-factor ANOVA, investigators can assess whether there are differences in means due to the treatment, by sex or whether there is a difference in outcomes by the combination or interaction of treatment and sex. Higher order ANOVAs are conducted in the same way as one-factor ANOVAs presented here and the computations are again organized in ANOVA tables with more rows to distinguish the different sources of variation (e.g., between treatments, between men and women). The following example illustrates the approach.

Consider the clinical trial outlined above in which three competing treatments for joint pain are compared in terms of their mean time to pain relief in patients with osteoarthritis. Because investigators hypothesize that there may be a difference in time to pain relief in men versus women, they randomly assign 15 participating men to one of the three competing treatments and randomly assign 15 participating women to one of the three competing treatments (i.e., stratified randomization). Participating men and women do not know to which treatment they are assigned. They are instructed to take the assigned medication when they experience joint pain and to record the time, in minutes, until the pain subsides. The data (times to pain relief) are shown below and are organized by the assigned treatment and sex of the participant.

Table of Time to Pain Relief by Treatment and Sex

Treatment    Male    Female
A            12      21
             15      19
             16      18
             17      24
             14      25
B            14      21
             17      20
             19      23
             20      27
             17      25
C            25      37
             27      34
             29      36
             24      26
             22      29

The analysis in two-factor ANOVA is similar to that illustrated above for one-factor ANOVA. The computations are again organized in an ANOVA table, but the total variation is partitioned into that due to the main effect of treatment, the main effect of sex and the interaction effect. The results of the analysis are shown below (and were generated with a statistical computing package - here we focus on interpretation). 

 ANOVA Table for Two-Factor ANOVA

Source of Variation    Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)    F      P-Value
Model                  967.0                   5                          193.4                20.7   0.0001
Treatment              651.5                   2                          325.7                34.8   0.0001
Sex                    313.6                   1                          313.6                33.5   0.0001
Treatment * Sex        1.9                     2                          0.9                  0.1    0.9054
Error or Residual      224.4                   24                         9.4
Total                  1191.4                  29

There are 4 statistical tests in the ANOVA table above. The first test is an overall test to assess whether there is a difference among the 6 cell means (cells are defined by treatment and sex). The F statistic is 20.7 and is highly statistically significant with p=0.0001. When the overall test is significant, focus then turns to the factors that may be driving the significance (in this example, treatment, sex or the interaction between the two). The next three statistical tests assess the significance of the main effect of treatment, the main effect of sex and the interaction effect. In this example, there is a highly significant main effect of treatment (p=0.0001) and a highly significant main effect of sex (p=0.0001). The interaction between the two does not reach statistical significance (p=0.91). The table below contains the mean times to pain relief in each of the treatments for men and women (Note that each sample mean is computed on the 5 observations measured under that experimental condition).  

Mean Time to Pain Relief by Treatment and Gender

Treatment    Men     Women
A            14.8    21.4
B            17.4    23.2
C            25.4    32.4

Treatment A appears to be the most efficacious treatment for both men and women. The mean times to relief are lower in Treatment A for both men and women and highest in Treatment C for both men and women. Across all treatments, women report longer times to pain relief (See below).  

Graph of two-factor ANOVA

Notice that there is the same pattern of time to pain relief across treatments in both men and women (treatment effect). There is also a sex effect - specifically, time to pain relief is longer in women in every treatment.  
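A minimal R sketch of this two-factor ANOVA, using the clinical site 1 data tabulated above (object and column names are illustrative):

```r
# Time to pain relief (minutes) by treatment and sex, clinical site 1
relief <- data.frame(
  time = c(12, 15, 16, 17, 14,    # Treatment A, men
           21, 19, 18, 24, 25,    # Treatment A, women
           14, 17, 19, 20, 17,    # Treatment B, men
           21, 20, 23, 27, 25,    # Treatment B, women
           25, 27, 29, 24, 22,    # Treatment C, men
           37, 34, 36, 26, 29),   # Treatment C, women
  treatment = rep(c("A", "B", "C"), each = 10),
  sex       = rep(rep(c("Male", "Female"), each = 5), times = 3)
)

# Two-factor ANOVA with main effects and the treatment-by-sex interaction
fit2 <- aov(time ~ treatment * sex, data = relief)
summary(fit2)
# Expect highly significant main effects of treatment and sex and a
# non-significant interaction, as in the ANOVA table above.
```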

Suppose that the same clinical trial is replicated in a second clinical site and the following data are observed.

Table - Time to Pain Relief by Treatment and Sex - Clinical Site 2

Treatment    Male    Female
A            22      21
             25      19
             26      18
             27      24
             24      25
B            14      21
             17      20
             19      23
             20      27
             17      25
C            15      37
             17      34
             19      36
             14      26
             12      29

The ANOVA table for the data measured in clinical site 2 is shown below.

Table - Summary of Two-Factor ANOVA - Clinical Site 2

Source of Variation    Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)    F      P-Value
Model                  907.0                   5                          181.4                19.4   0.0001
Treatment              71.5                    2                          35.7                 3.8    0.0362
Sex                    313.6                   1                          313.6                33.5   0.0001
Treatment * Sex        521.9                   2                          260.9                27.9   0.0001
Error or Residual      224.4                   24                         9.4
Total                  1131.4                  29

Notice that the overall test is significant (F=19.4, p=0.0001), there is a significant treatment effect, sex effect and a highly significant interaction effect. The table below contains the mean times to relief in each of the treatments for men and women.  

Table - Mean Time to Pain Relief by Treatment and Gender - Clinical Site 2

Treatment    Men     Women
A            24.8    21.4
B            17.4    23.2
C            15.4    32.4

Notice that now the differences in mean time to pain relief among the treatments depend on sex. Among men, the mean time to pain relief is highest in Treatment A and lowest in Treatment C. Among women, the reverse is true. This is an interaction effect (see below).  

Graphic display of the results in the preceding table

Notice above that the treatment effect varies depending on sex. Thus, we cannot summarize an overall treatment effect (in men, treatment C is best, in women, treatment A is best).    

When interaction effects are present, some investigators do not examine main effects (i.e., do not test for treatment effect because the effect of treatment depends on sex). This issue is complex and is discussed in more detail in a later module. 


Confirmatory Factor Analysis: An Introduction for Psychosomatic Medicine Researchers

Babyak, Michael A. PhD; Green, Samuel B. PhD

From the Department of Psychiatry and Behavioral Sciences (M.A.B.), Duke University Medical Center, Durham, North Carolina; and the Division of Psychology in Education (S.B.G.), Arizona State University, Tempe, Arizona.

Address correspondence and reprint requests to Michael Babyak, PhD, Department of Psychiatry and Behavioral Sciences, Box 3119, Duke University Medical Center, Durham, NC 27710. E-mail: [email protected]

Received for publication January 30, 2009; revision received February 2, 2010.

We present an introduction to the basic concepts essential to understanding confirmatory factor analysis (CFA). We initially discuss the underlying mathematical model and its graphical representation. We then show how parameters are estimated for the CFA model based on the maximum likelihood function. Finally, we discuss several ways in which model fit is evaluated as well as introduce the concept of model identification. In our presentation, we use an example to illustrate the application of CFA to psychosomatic research and touch on the more general role of structural equation modeling in psychosomatic research.

SEM = structural equation modeling;

CFA = confirmatory factor analysis;

CAD = coronary artery disease;

EFA = exploratory factor analysis;

CFI = comparative fit index;

RMSEA = root mean square error of approximation.

INTRODUCTION

Structural equation modeling (SEM) is a general data analytic method for the assessment of models that specify relationships among variables. SEM involves investigating two primary models: the measurement model that delineates the relationships between observed measures and unobserved underlying factors and the structural model that defines the relationships between underlying factors. In this paper, we argue that psychosomatic medicine researchers frequently hypothesize complex relationships within and between measures and constructs (specified as factors in SEM) and, thus, should more frequently apply SEM. To encourage its use, we provide an intuitive presentation of the basic concepts of SEM: model specification, model estimation, and assessment of fit between the specified model and the data. We introduce these concepts within the framework of confirmatory factor analysis (CFA), which restricts analyses to those used to evaluate measurement models. We focus our presentation on CFA for a number of reasons: a) CFA is a method that psychosomatic medicine researchers are likely to apply; b) discussion of basic concepts based on a broader set of SEM methods is likely to make our presentation overly complex, particularly given page space limitations; and c) CFA is not limiting in the sense that it allows us to discuss the basic SEM concepts without loss of generality. Where possible, we consider how to expand applications beyond CFA to a broader range of SEM applications.

SEM in Psychosomatic and Medical Research

Although SEM is used frequently in some fields, such as psychology, education, sociology, and genetics, research using SEM appears comparatively infrequently in psychosomatic and medical journals. There is at least a small irony to the scarcity of SEM in psychosomatic and medical research in that the technique actually has its direct roots in biology. In the 1920s, geneticist Sewall Wright ( 1 ) first developed path analysis, a special case of SEM, in an attempt to better understand the complex relationships among variables that might determine the birth weight of guinea pig offspring.

Lately SEM has begun to appear more often in psychosomatic research. For example, Rosen et al. ( 2 ) applied SEM to estimate the association of global subjective health with psychological distress, social support, and physical function. In their analyses, these four constructs were specified as factors, which were imperfectly measured by their observed measures. SEM has even made an occasional foray into high-profile medical journals. For example, in a paper published in the New England Journal of Medicine in 2008, Calis et al. ( 3 ) used a path model to estimate a set of complex associations among malaria, human immunodeficiency virus, and various nutritional deficiencies. In a recent commentary on posttraumatic stress syndrome that appeared in JAMA , Bell and Orcutt ( 4 ) explicitly pointed out the potential utility of SEM in their area of study: “Structural equation modeling is particularly well suited for examining complex associations between multiple constructs; such constructs are often represented as latent constructs and are assumed to be free of measurement error.” Despite the applicability of SEM to assess research hypotheses posed by researchers in at least some areas of medicine, very few papers have graced the pages of JAMA using SEM to address any topic.

Even fewer papers in medicine and psychosomatic medicine focus exclusively on the CFA component of SEM. This is largely because the majority of research questions posed by researchers in these fields involve evaluating the relationship between multiple predictors and one or more outcome variables. CFA, by itself, does not address this type of question, but rather it assesses the quality and nature of the variables under study from a measurement perspective. CFA, however, can be used effectively as a preliminary step to evaluate the measurement properties of predictor and outcome measures—before conducting a larger structural model that includes not only the relationships defined by the measurement model but also the relationship between predictors and outcome variables.

A Pedagogical Example

We created a research scenario to illustrate various aspects of CFA. The scenario bears similarity to the one studied by Rosen et al. ( 2 ), although ours is much simpler so that the focus is on understanding CFA rather than the complexities associated with a real life study. We draw on the literature suggesting that a variety of psychosocial constructs, including trait hostility, anger, anxiety, and depressive symptoms, seem to be risk factors for the development of coronary artery disease (CAD) ( 5 ). Despite the large number of papers published on this topic, fundamental questions remain. For example, do depression, hostility, anger, and anxiety each uniquely pose a risk for heart disease, or is it simply a more general “negative affect” factor that poses the risk? 1 To help us answer this question, we first need to assess whether the relationships among measures of depression, hostility, anger, and anxiety allow us to interpret these measures as manifestations of a general, negative affect factor. Such an analysis might support the general factor conjecture but also might indicate that some measures are better than others in assessing it. Alternatively, we might discover that there is more than one underlying dimension or even that the measures are too distinct to be jointly related to underlying factors. CFA is ideally suited to address these types of questions.

In the ensuing pages, we attempt to convey an understanding of CFA through our example, which focuses on the dimensions underlying scales assessing depression, hostility, anger, and anxiety. We use this example in that we believe that studying the measurement properties of these scales in the context of substantive theory is a critical step in our understanding of the usefulness of these scales in the prediction of CAD and should allow researchers to better design studies that include health outcome measures.

Purpose of CFA

CFA, as well as the more familiar exploratory factor analysis (EFA), defines factors that account for covariability or shared variance among measured variables and ignores the variance that is unique to each of the measures. Broadly speaking, either can be a useful technique for a) understanding the structure underlying a set of measures; b) reducing redundancy among a set of measured variables by representing them with a fewer number of factors; and c) exploiting redundancy and, in so doing, improving the reliability and validity of measures. However, the purposes of EFA and CFA, and accordingly the methods associated with them, are different. The goal of EFA is to discover a set of as-yet-unknown factors based on the data, although a priori hypotheses based on the literature may help guide some decisions in the EFA process. In other words, EFA may be conceptualized as primarily an inductive, data-driven method to obtain factors. In contrast, in CFA, we start with an explicit hypothesis about the number of factors underlying measures and the parameters of the model, such as the effects of the factors on the measures (i.e., weights or loadings). In practice, researchers impose constraints on a factor model based on a priori hypotheses about measures. In imposing constraints, we are forcing the model to be consistent with our substantive theory or beliefs. For example, if a measure was not designed to assess factor A (but rather to assess factor B), we force the weight between this measure and factor A to be equal to zero. We will discuss more about how and why we do this soon. If these constraints are inconsistent with the data and, more specifically, the pattern of relationships among measured variables, the model with its imposed constraints is rejected. If the constraints are consistent with the data, the estimated parameters of a model are interpreted in the context of the substantive area. Given the focus of CFA is on hypothesized models, let’s first describe how these models are specified before considering how the parameters of models are estimated and how the fit of models to data are assessed.

Model Specification

With CFA, we hypothesize a model that specifies the relationship between measured variables and presumed underlying factors. 2 The model includes parameters (e.g., factor loadings) we want to estimate based on the data (i.e., freely estimated parameters) and parameters we constrain to particular values based on our understanding of our data and the literature (i.e., constrained or fixed parameters). The constraints on model parameters are what produce lack of fit to the data, which in turn informs us about how well our hypothesized model is supported.

In this section, we will consider three prototypical CFA models, each with a different substantive interpretation. We will present each prototypical model and discuss it in the context of our negative affect example. The example for the first two prototypes involves postulating a factor structure underlying four measures: scales of hostility, anger, anxiety, and depressive symptoms. Scores for each of the scales are computed by summing the items on a self-report instrument. For the third prototypical model, we extend this example by having three rather than one measure of depressive symptoms: affective, somatic, and cognitive.

Single-Factor Model

The single-factor model is the simplest CFA model. Nevertheless, we devote considerable attention to it in order to introduce the basic concept of CFA and conventional SEM terminology. A one-factor model specifies a single dimension underlying a set of measures and, thus, provides a parsimonious explanation for the responses on these measures. As with any structural equation model, a single-factor model can be presented pictorially as a path diagram or in equation form. Figure 1 is a graphical representation of a model with a single factor (F 1 ) underlying four measures: hostility (X 1 ); anger (X 2 ); anxiety (X 3 ); and depressive symptoms (X 4 ). By convention, the factor is depicted as a circle, which represents a latent variable, whereas the observed measures are squares, which represent observable or indicator variables. A single-headed arrow between two variables indicates the direction of the effect of the one variable on the other. Within the context of our example, we are postulating that a factor called negative affect (F 1 ) underlies or determines the observed scores on the hostility, anger, anxiety, and depressive symptom measures (as well as error). Statistically, we believe these four measures are correlated because they have a common underlying factor, negative affect. In other words, the model reflects the belief that changes in the unobserved latent variable, negative affect, are presumed to result in changes in the four variables that we have measured.

[Figure 1: Path diagram of the single-factor model, with the negative affect factor (F 1 ) underlying hostility (X 1 ), anger (X 2 ), anxiety (X 3 ), and depressive symptoms (X 4 ), each with its own error term.]

Continuing with Figure 1 , a variable with arrows pointing only away from it is called exogenous . A variable with one or more arrows pointing to it, even if one or more arrows are pointing away from it, is called endogenous . An equation is associated with each endogenous variable. Accordingly, the model in Figure 1 involves four endogenous variables and therefore four equations:

X 1 = λ 11 F 1 + E 1

X 2 = λ 21 F 1 + E 2

X 3 = λ 31 F 1 + E 3

X 4 = λ 41 F 1 + E 4 .

The lambdas (λ) in these equations are factor weights or loadings, which can be interpreted essentially like regression coefficients. For example, for every 1-unit increase in the negative affect factor, F1, the expected change in hostility, X1, is λ11.
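To make the measurement equations concrete, here is a minimal simulation sketch (ours, not part of the original article): it generates four measures from a single factor using the equations above. The loadings, error variances, and sample size are illustrative assumptions, not estimates from real data.

import numpy as np

# Illustrative (assumed) loadings for hostility, anger, anxiety, depressive symptoms
loadings = np.array([0.7, 0.6, 0.8, 0.5])      # lambda_11 ... lambda_41
error_sd = np.sqrt(1 - loadings**2)            # keeps each measure's variance near 1

rng = np.random.default_rng(1)
n = 500                                        # hypothetical sample size
F = rng.normal(size=n)                         # the latent negative affect factor
E = rng.normal(size=(n, 4)) * error_sd         # unique components (errors)
X = F[:, None] * loadings + E                  # X = lambda * F + E for each measure

# The four measures correlate only because they share the common factor
print(np.round(np.corrcoef(X, rowvar=False), 2))

Because the errors are independent, any correlation among the simulated measures is produced entirely by the shared factor, which is exactly the belief the single-factor model expresses.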

Observed measures are not likely to be pure indicators of a factor but almost certainly contain unique components, frequently referred to as errors. A unique component for a measure includes reliable information that is specific to that measure—that is, unrelated to the factor—and measurement error. Because errors are not directly observable, they are also latent variables and are represented in our path diagram as circles. For the hostility measure in our example, the unique component might include the specific component of agitation as well as measurement error due to inattentiveness of respondents and ambiguity of the items on this measure.

Finally, our path diagram also includes double-headed curved arrows. If an arrow begins and returns to the same exogenous variable, it represents the variance of that variable. A double-headed arrow drawn between any two exogenous variables represents a covariance between them. In Figure 1 , we could have drawn a double-headed arrow between any two errors, but we chose not to include error covariances in our model to avoid unnecessary complexity.

The model parameters, which we seek to estimate or constrain based on our understanding of a study, are associated with the single-headed and double-headed arrows in our diagrams and, by convention, are shown as Greek letters. In addition to the lambdas, the parameters for the model in Figure 1 are the variance of the factor (σ²F1) and the variances of the errors (σ²E1 through σ²E4). As shown at the bottom of the figure, the model parameters can also be presented in three matrices: the phi matrix (Φ), containing the variances and covariances among factors; the lambda matrix (Λ), containing all factor weights; and the theta matrix (Θ), containing the variances and covariances among the errors.

When we specify a model, we must stipulate whether parameters are “free” or “constrained.” Free parameters are estimated based on data. We are familiar with free parameters in that we routinely interpret them when conducting various types of statistical analyses, such as predictor weights in regression analysis or factor loadings in exploratory factor analysis. On the other hand, a constrained or fixed parameter is not estimated but rather is restricted by researchers to be equal to a specific value or to the value of another parameter. A common constraint is to fix the value of a parameter to zero and, in so doing, to indicate that it is unnecessary. For example, for our single factor model ( Fig. 1 ), we could fix the loading for anxiety to be equal to 0 if we wanted to assess the hypothesis that the negative affect factor underlying the other three measures does not influence anxiety.

As shown in Figure 1 , we have imposed a number of constraints on our single-factor model, although almost all the constraints are represented by what is not shown rather than what is shown in the figure. For example, our model includes no covariances among errors (i.e., all zeros in the off-diagonal positions of the theta matrix) in that there are no double-headed arrows between errors. Substantively, these constraints on the error covariances reflect the belief that only a single factor is necessary to explain the covariability among the measures. In addition, the variance associated with a second factor and its factor loadings have been implicitly fixed to zero, implying that an additional factor is unnecessary. If one or more of these constraints are incorrect, the model should fit the data poorly and should be rejected.

A second type of constraint, the metric constraint, must be imposed to define the metric or units of latent variables. The metric constraint is often a bit mysterious to SEM novices, so we will take an intuitive approach to understanding it. The metric of a factor is arbitrary because a factor is a latent variable with no inherent metric or scale; that is, it could assume any one of a number of alternative metrics. For example, it is arbitrary whether a factor representing length is measured in inches, feet, or meters. We typically assign a metric for a factor by fixing either its variance to 1, as we did in the one-factor model, or one of its weights to 1. Fixing the variance of a factor to 1 essentially defines the units of the factor to be in Z score (conventional standardized) units. If researchers choose to define the metric by fixing the weight of a measured indicator variable to 1, they should choose the weight associated with the measured variable believed to have the strongest relationship with the underlying factor, to avoid empirical identification issues. In our negative affect example, we might choose to fix the weight of the best developed depressive symptom measure to 1. The metric of the factor is then the same as the metric of the measure with the loading of 1 (in our example, the depressive symptom scale). It is important to know that factor metric constraints in CFA models have no effect on the fit of the model to the data, but all other model constraints can result in a decrease in model fit to the extent that they are not supported by the data.

Once the model is specified by indicating what parameters are free to be estimated and which are fixed, the free parameters are estimated based on the data. If our one-factor model provides adequate fit (which we will define later), we could conclude that the data are consistent with the hypothesis that a single latent variable underlies the four observed measures, although we cannot rule out other models also producing good fit. We would also examine the standardized weights (which are analogous to standardized weights in regression analysis) to assess whether the measures are strong indicators of the factor and whether some measures are better indicators than others. A good fitting model would offer some support for using a single negative affect latent variable as a predictor (or as an outcome or mediator) in a more extensive structural model. If the model fails to fit, we should not interpret the estimated parameters because their values are likely to be inaccurate due to model misspecification. Instead, we would assess alternative models, such as the correlated factors model, to understand whether they might fit the observed data better.

Correlated Factors Model

Our second model is a correlated factors model, which specifies that two or more factors underlie a set of measured variables and that these factors are correlated. For simplicity, we will consider a two-factor model, but our discussion is relevant to models with more than two factors.

In Figure 2 , we present a model for our four measures but now with two correlated factors. As with our path diagram for a single-factor model, we have circles for latent variables (i.e., factors and errors), squares for measured variables, single-headed arrows for effects of one variable on another, double-headed curved arrows for variances of exogenous variables, and a double-headed curved arrow for the covariance between the two factors. Within the context of our negative affect example, we might speculate that the hostility and anger measures are related to one another due to the shared characteristic of outward-directed agitation and distinct, to some degree, from the anxiety and depressive symptom measures, which share the characteristic of self-directed negativity. In other words, the model should include a factor (F 1 ) affecting the hostility and anger measures (X 1 and X 2 ), and another factor (F 2 ) affecting the anxiety and depressive symptom measures (X 3 and X 4 ).

[Figure 2. Path diagram of the correlated two-factor model, with F1 underlying hostility (X1) and anger (X2), and F2 underlying anxiety (X3) and depressive symptoms (X4).]

Model parameters are associated with all single-headed and double-headed arrows and are presented in matrix form at the bottom of Figure 2. Constraints can be imposed on the model parameters. As previously discussed, we can define the metric for factors by constraining their variances to 1 or one of their weights to 1. In this instance, we arbitrarily chose to set the factor variances to 1 (i.e., σ²F1 = 1 and σ²F2 = 1).

All constraints besides those to determine the metric of factors can produce lack of fit and are evaluated in assessing the quality of a model. For example, the effects of factors on measures, as shown by arrows between factors and measures in the path diagram, can be represented as equations:

X1 = λ11F1 + 0·F2 + E1
X2 = λ21F1 + 0·F2 + E2
X3 = 0·F1 + λ32F2 + E3
X4 = 0·F1 + λ42F2 + E4

As shown, the equations indicate that a number of factor loadings are constrained to zero such that each measured variable is associated with one and only one factor. The specified structure is consistent with the idea of simple structure, an objective frequently felt to be desirable with EFA. In addition, a measure is less likely to be misinterpreted if it is a function of only one factor. Given the advantages of this structure, researchers frequently begin with specifying models that constrain factor loadings for a measure to be associated with one and only one factor. In other words, each measure has one weight that is freely estimated, and all other weights (potential cross-loadings) between that measure and other factors are constrained to 0.

Other parameters in our model that may be freely estimated or constrained are the covariance between the factors and the variances and covariances among the errors. (a) With CFA, we typically allow the factors to be correlated by freely estimating the covariances between factors. If we constrained all factor covariances to be equal to zero (which requires three or more measures per factor) 3 and imposed constraints so that any one measure is a function of only one factor, we would be hypothesizing a model that requires correlations among measures associated with different factors to be equal to zero. This model is likely to conflict with reality and be rejected empirically; it would also be inconsistent with many psychological theories that posit correlated underlying dimensions. The decision to allow correlated factors stands in stark contrast with practice in EFA, where researchers routinely choose varimax rotation, resulting in orthogonal factors. However, in EFA, we can still obtain good fit to data because all factor loadings are freely estimated (i.e., all measured variables are a function of all factors), permitting correlations among all measured variables. (b) We usually think of our measured variables as being unreliable to some degree and thus must freely estimate the error variances. In most CFA models, we begin by constraining all covariances between errors to be 0. By imposing these constraints, we are implying that the correlations among measures are purely a function of the specified factors.

If this model fits our data, we would have a structure that is consistent with the data, but we could not rule out that other models might fit the data as well or even better. If this model fits, the standardized loadings are high, and the correlation between the factors is not too high, we could argue that anger and hostility represent an underlying construct that is relatively distinct from the construct underlying the anxiety and depressive symptom measures. We might interpret the anger and hostility factor as outward-directed agitation, whereas the depression and anxiety factor might be interpreted as self-directed negativity. Besides addressing measurement issues, the finding of two factors would indicate that we would want to include measures associated with both of these factors in studies predicting coronary artery disease (CAD). On the other hand, if neither the one- nor the two-factor model fits, we should probably include measures assessing all four constructs in predicting CAD.

In practice, we would not only assess the fit of the two-factor model but also its relative fit to a single-factor model. This comparison would evaluate whether the increase in model complexity associated with a two-factor model is warranted. This additional step is necessary in that a good-fitting model does not necessarily imply a correct model. In our example, to the extent that the factors are highly correlated, we would expect the fit of the one-factor model to be similar to the fit of the two-factor model. SEM procedures, including CFA, are at their scientific best when there are several theoretically plausible models available to compare. We will discuss fit later in this article. For now, we turn to one more type of model structure, just to further illustrate the kinds of models that can be represented.

Bifactor Model

A bifactor model may include a general factor associated with all measures and one or more group factors associated with a limited number of measures ( 7,8 ). In Figure 3 , we present a bifactor model for six measures with one general factor and one group factor. The six measures include hostility, anger, and anxiety scales (X 1 , X 2 , and X 3 , respectively) and three scales of depression that distinguish among affective, cognitive, and somatic symptoms (X 4 , X 5 , and X 6 , respectively). Due to space limitations, we will only briefly describe the specification of this model.

[Figure 3. Path diagram of the bifactor model: a general factor underlying all six measures and a group factor underlying the three depressive symptom measures.]

As typically applied, we are unlikely to obtain a bifactor model with EFA, in that an objective of this method (with rotation) is to obtain simple structure, which generally does not accommodate a general factor. In contrast, in CFA, we choose which parameters to estimate freely and which to constrain to 0. Thus, we can simultaneously allow for a general factor as well as group factors. Bifactor models have been suggested as appropriate for item measures associated with psychological scales ( 9 ). Although measures are likely to assess a general trait or factor, they are also likely to include more specific aspects of that trait, that is, group factors. In contrast with the previous model, the group factors for a bifactor model are typically specified to be uncorrelated (i.e., the factor covariances are constrained to 0). In our example, this model suggests that the three depressive symptom measures are to some extent distinct from the other three measures, but that a broader general factor, which might be called negative affect, also underlies all six measures. 4

Although the results of the CFA models above will not have an immediate impact on the question of how these variables might be related to cardiac risk, they do inform our conceptual understanding and interpretation of the four measured variables. For example, if we were developing a prediction model with cardiac disease as an outcome, and the two-factor model turned out to be more consistent with the data in our CFA, we might choose to use those two factors as separate potential risk factors rather than just an overall negative affect variable. At this point in our exposition, we do not know how to determine whether a model is “good” or not, nor do we even know how the various unknown parameters in the models are actually estimated. We now turn to these more technical aspects of CFA.

Estimation of Free Parameters

Next, we consider how free parameters are estimated. We discuss estimation using the model presented in Figure 1 with four measures and a single factor. Hats (^) are placed on top of model parameters in recognition that we are estimating parameters based on sample data rather than considering them at the population level.

SEM software typically allows a variety of input data formats, including raw case-level data, sample variances and covariances among measures, or correlations and standard deviations among measures. Regardless of how you enter your data, the most popular method for estimation of CFA models (i.e., maximum likelihood estimation) involves fitting the variances and covariances among measures. In other words, based on your inputted data, the software creates a covariance matrix (denoted as S ) that contains the variances and covariances of the measures, and the analysis is conducted on this covariance matrix. From this perspective, CFA treats this covariance matrix as the data.

The SEM software calculates values for the freely estimated parameters of a CFA model so that these estimated parameters are as consistent as possible with the data (i.e., the sample covariance matrix of the measures). More specifically, parameters are estimated so that a reproduced covariance matrix based on the estimated parameters (denoted Σ_Model and called the model-implied covariance matrix) is as similar as possible to the sample covariance matrix, S. Using our example with the four observed measures, the equation linking Σ_Model to the estimated model parameters is

\(\Sigma_{\text{Model}} = \hat{\Lambda}\hat{\Phi}\hat{\Lambda}' + \hat{\Theta}\)   (Equation 1)

The details of the equation are unimportant; what is crucial to understand is that the values of the estimated model parameters dictate the quantities in the reproduced covariance matrix among the measured variables. The objective of the estimation procedure is to have the variances and covariances based on the estimated model parameters (i.e., the values in Σ_Model) be as close as possible to the variances and covariances among the measures in our sample data (i.e., the values in S).

Stepping back from the technical details, we are assuming that some process exists that has generated the set of variances and covariances that we have observed among our four measures. In SEM, in general, we use substantive knowledge of the field to make specific hypotheses about what this process might be and then translate these hypotheses into a coherent model. To the extent that we have specified a model that approximates reality, the values in the covariance matrix implied by our model (and its parameters) ought to be similar, within sampling error, to the sample covariance matrix among the measures. In practice, constraints imposed on the model are likely to produce imperfect reproduction of S; that is, S ≠ Σ_Model. Next, we consider in greater detail how model parameters are estimated and then how the implied matrix is used to evaluate the fit of the model.

In contrast to regression analysis and many other statistical methods, equations are not available for directly computing the freely estimated parameters. The estimates are computed instead by an iterative process, initially making arbitrary guesses about the values of the model parameters and then repeatedly modifying these values in an attempt to make S and Σ_Model as similar as possible. The process stops when a prespecified criterion is met suggesting that the differences between S and Σ_Model cannot be made any smaller.

A very simple example with constructed data might be helpful at this point. 5 Let's say that the variances for the hostility, anger, anxiety, and depressive symptom measures are all 1 (the diagonal elements of S, the sample covariance matrix among measures), and all covariances among measures are 0.36 (the off-diagonal elements of S).

\(S = \begin{bmatrix} 1.00 & 0.36 & 0.36 & 0.36 \\ 0.36 & 1.00 & 0.36 & 0.36 \\ 0.36 & 0.36 & 1.00 & 0.36 \\ 0.36 & 0.36 & 0.36 & 1.00 \end{bmatrix}\)

This set of values is highly improbable in the real world, but for convenience we have created a covariance matrix S as a correlation matrix. In specifying the model for these data, let’s say we fix the variance of our underlying factor to 1 to define its metric and estimate the loadings between the four measures and the factor, and the error variances for these measures. The software package (EQS, for our example) begins with very rough estimates of 1 for all factor loadings and all error variances. For these estimated parameters, the reproduced covariance matrix among the measured variables based on Equation 1 is:

\(\Sigma_{\text{Model}} = \begin{bmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}\)

The reproduced variances and covariances (2s along the diagonal and 1s in the off-diagonal positions) are not very similar to the 1s and 0.36s in S. The software then takes another guess, revising its estimates so that the factor loadings are all 0.68 and the error variances are 0.64.

\(\Sigma_{\text{Model}} = \begin{bmatrix} 1.10 & 0.46 & 0.46 & 0.46 \\ 0.46 & 1.10 & 0.46 & 0.46 \\ 0.46 & 0.46 & 1.10 & 0.46 \\ 0.46 & 0.46 & 0.46 & 1.10 \end{bmatrix}\)   (values rounded to two decimals)

Now the values in Σ_Model are more similar to the values in S, but still not exactly the same. In the next iteration, all factor loadings are estimated to be 0.605, whereas the error variances are estimated to be 0.640. With two additional iterations, the final estimates are 0.60 for all factor loadings and 0.64 for all error variances.

\(\Sigma_{\text{Model}} = \begin{bmatrix} 1.00 & 0.36 & 0.36 & 0.36 \\ 0.36 & 1.00 & 0.36 & 0.36 \\ 0.36 & 0.36 & 1.00 & 0.36 \\ 0.36 & 0.36 & 0.36 & 1.00 \end{bmatrix}\)

For this artificial example, the model parameters reproduce the sample covariance matrix among the measures perfectly. In other words, the fit of the model to the data is perfect, a highly unlikely result in practice.
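As a quick check, the reproduction can be verified directly. The following numpy sketch (ours, not from the article) builds Σ_Model = ΛΦΛ' + Θ from the final estimates and compares it with S:

import numpy as np

# Sample covariance matrix from the constructed example: 1s on the diagonal, 0.36 elsewhere
S = np.full((4, 4), 0.36)
np.fill_diagonal(S, 1.0)

Lambda = np.full((4, 1), 0.60)   # final factor loading estimates
Phi = np.array([[1.0]])          # factor variance fixed to 1 to set the metric
Theta = np.diag([0.64] * 4)      # final error variance estimates

Sigma_model = Lambda @ Phi @ Lambda.T + Theta
print(np.allclose(Sigma_model, S))   # True: the reproduction is perfect here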

How does the algorithm know when S and Σ_Model are similar? Mathematically, it is necessary to specify a function that defines the similarity. The most popular estimation approach is maximum likelihood, and with this approach the iterative estimation procedure is designed to minimize the following function:

\(F_{ML} = \ln|\Sigma_{\text{Model}}| - \ln|S| + \text{tr}\!\left(S\,\Sigma_{\text{Model}}^{-1}\right) - p\)   (Equation 2)

where p is the number of measured variables. It is not crucial to understand the details of the equation. What is important to know is that each iteration (set of parameter guesses) produces a value for F_ML, and that F_ML is a mathematical reflection of the difference between S and Σ_Model for a given set of estimated parameters. When F_ML is at its smallest value, S and Σ_Model are as similar as they can be, given the data and the hypothesized model. The values of the parameter estimates at this point in the iterative process are the maximum likelihood estimates for the CFA model.
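For intuition, the fit function can be evaluated by hand for the artificial example above. This sketch (an illustration we added, with parameter values taken from the iterations described in the text) shows F_ML shrinking to zero as the estimates approach the maximum likelihood solution:

import numpy as np

def f_ml(S, Sigma_model):
    # F_ML = ln|Sigma_Model| - ln|S| + tr(S * Sigma_Model^-1) - p
    p = S.shape[0]
    return (np.log(np.linalg.det(Sigma_model)) - np.log(np.linalg.det(S))
            + np.trace(S @ np.linalg.inv(Sigma_model)) - p)

S = np.full((4, 4), 0.36)
np.fill_diagonal(S, 1.0)

def implied(loading, error_var):
    # Single-factor model with the factor variance fixed to 1
    Lam = np.full((4, 1), loading)
    return Lam @ Lam.T + np.diag([error_var] * 4)

print(f_ml(S, implied(1.00, 1.00)))   # rough start values: F_ML is large
print(f_ml(S, implied(0.68, 0.64)))   # an intermediate iteration: smaller
print(f_ml(S, implied(0.60, 0.64)))   # maximum likelihood solution: F_ML = 0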

For our example, F_ML becomes smaller through the first four iterations, as shown in Table 1. At step 5, there is no change in F_ML and no change in the values of the parameter estimates, so the process stops, and the estimates at the last step are the maximum likelihood estimates.

[Table 1. Values of F_ML across the iterations of the estimation process for the artificial example.]

Although researchers most frequently minimize F ML to obtain estimates in SEM, it is sometimes preferable to choose other functions to minimize, especially when data diverge too far from multivariate normality or have missing values on some measures. For example, a different function—the full information maximum likelihood (FIML) function—is preferable if some data on measures are missing. On the other hand, a weighted least squares function is generally preferred for estimating model parameters when analyzing ordinal, item-level data (such as Likert-type items). 6

To summarize our steps so far, we specify a model with free and constrained parameters and then estimate the parameters of the model in an iterative fashion so that the reproduced covariance matrix (Σ_Model) based on the model is as similar as possible to the observed covariance matrix (S). The constraints imposed on the model are likely to produce differences between Σ_Model and S. We now turn to methods for assessing the fit of a model or, alternatively stated, the lack of fit due to the constraints imposed on a model's parameters.

Assessment of Global Fit

We must assess the quality of a model by examining the output from SEM software to determine whether the model and its estimated parameters are interpretable. We first scan the output for warning messages and, if the model was improperly specified, correct it and rerun the analysis. Second, we assess local fit. Examples include evaluating individual parameter estimates to ensure they are within mathematical bounds (e.g., no negative variances or correlations greater than 1.0), are within interpretational bounds (i.e., no parameter estimates with values that defy interpretation), and are significantly different from zero based on hypothesis tests. Third, we examine global fit to determine whether the constrained parameters of a model allow for good reproduction of the sample covariance matrix among measures. We will concentrate our attention on global judgments of fit.

The fit function is used to determine the estimates of the model parameters. Given this function is deemed useful for computing parameter estimates, it is not surprising that this same fit function is also central in assessing global fit.

Testing the Hypothesis of Perfect Fit in the Population

We can assess the hypothesis that the researcher's model is correct in the population. More specifically, we can ask whether the covariance matrix reproduced from the model (Σ_Model) is equal to the population covariance matrix among the measures (Σ). As shown in Equation 3, the null hypothesis, H0, states that the model-implied and population covariance matrices are equal, whereas the alternative hypothesis, HA, states that these two matrices are different:

\(H_0\colon \Sigma = \Sigma_{\text{Model}} \qquad H_A\colon \Sigma \neq \Sigma_{\text{Model}}\)   (Equation 3)

Two comments are worth noting about how this question is posed in SEM. First, in most non-SEM applications of hypothesis testing, rejection of the null hypothesis implies support for the researcher's hypothesis. In contrast, in SEM, rejection of the null hypothesis indicates that the researcher's hypothesized model does not hold in the population; that is, the model-implied and population matrices are different. Second, no model is likely to fit perfectly in the population, and thus we know a priori that the null hypothesis concerning the researcher's model is false.

The test of the null hypothesis is straightforward. The test statistic, T , is a simple function of sample size ( n ) and the fit function:

\(T = (n - 1)\,F_{ML}\)

(or T = n·F_ML, as computed in some SEM software packages). In large samples, and assuming the p measured variables are normally distributed in the population, T is distributed approximately as a χ². The degrees of freedom for the χ² are equal to the number of unique variances and covariances in the covariance matrix among measured variables (i.e., p(p + 1)/2, where p is the number of measured variables) minus the number of freely estimated model parameters (q), that is,

\(df = \dfrac{p(p + 1)}{2} - q\)   (Equation 5)

In most applications with some degree of model complexity, a sample size of at least 200 is recommended for T to be distributed approximately as a χ². However, a larger sample size may be required to have sufficient power to reject hypotheses of interest, including hypotheses about a particular parameter or set of parameters.
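To illustrate the arithmetic of the test, here is a short sketch (ours, with hypothetical values for n and the minimized fit function, not taken from the article's example):

from scipy import stats

n = 300        # hypothetical sample size
F_ml = 0.045   # hypothetical minimized value of the ML fit function
p = 4          # number of measured variables
q = 8          # freely estimated parameters (e.g., 4 loadings + 4 error variances)

T = (n - 1) * F_ml                 # chi-square test statistic
df = p * (p + 1) // 2 - q          # 10 - 8 = 2 degrees of freedom
p_value = stats.chi2.sf(T, df)     # probability of a T this large if the model is correct
print(round(T, 2), df, round(p_value, 4))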

Unfortunately, this test of global fit suffers from the same problems as conventional hypothesis tests. If the null hypothesis is not rejected, it may be due to insufficient sample size, that is, a lack of power. In addition, nonrejection does not imply that the researcher's model is correct; it is incorrect to "accept the null hypothesis." It is likely that a number of alternative models would produce similar T values. If the hypothesis is rejected, we can only conclude what we knew initially: the model is imperfect. If the sample size is large, the T value will tend to be large, and even small and possibly unimportant discrepancies between the model-implied and observed covariance matrices will yield significance. It is our observation that tests of models are routinely significant (meaning that we conclude our model does not fit) when the sample size exceeds 200.

Fit Indices: Assessing Degree of Fit

Because the χ 2 fit test is affected by sample size, a wide variety of other measures of fit have been proposed. Two indices that are used frequently are Bentler’s comparative fit index (CFI) and the root mean square error of approximation (RMSEA).

The CFI compares the fit of the researcher’s model with the fit of a null model. The null model is highly constrained and unrealistic. More specifically, the model parameters are constrained such that all covariances among measured variables are equal to zero (implying all correlations are equal to zero). Accordingly, we expect a researcher’s model to fit much better than a null model.

In the population, CFI is defined as:

\(\text{CFI}_{\text{pop}} = 1 - \dfrac{\lambda_{\text{researcher's model}}}{\lambda_{\text{null model}}}\)

λ is a noncentrality parameter that is an index of lack of fit of a model to a population covariance matrix. λ is zero if a model is correct and becomes larger to the degree that the model is misspecified. We would expect the null model to be a badly misspecified model in most applications of SEM; therefore, λ_null model would be large. In comparison, λ_researcher's model should be much smaller. Accordingly, we expect to obtain high CFI values to the extent that the researcher's model is superior to a null model.

In the formula for CFI_pop, we can substitute T − df for λ to obtain a sample estimate of the CFI:

\(\text{CFI} = 1 - \dfrac{T_{\text{researcher's model}} - df_{\text{researcher's model}}}{T_{\text{null model}} - df_{\text{null model}}}\)

According to Hu and Bentler ( 12 ), a value of 0.95 or greater indicates good fit. This cutoff value is consistent with the belief that a researcher's model should fit much better than the unrealistic null model. We emphasize here that cutoffs for fit indices are problematic; a preferable approach is to use these indices to compare the fits of various alternative models.

The RMSEA is a fit index that assesses lack of fit of a model, but not in comparison with any other model. Instead, it evaluates the absolute fit of a model (i.e., T_researcher's model/(n − 1)), taking into account how complex the model is relative to the amount of data (i.e., df_researcher's model). The sample estimate of the RMSEA is:

\(\text{RMSEA} = \sqrt{\dfrac{T_{\text{researcher's model}} - df_{\text{researcher's model}}}{df_{\text{researcher's model}}\,(n - 1)}}\)

To the extent that the model fits [i.e., small T_researcher's model/(n − 1)] and the model involves estimating few parameters (large df_researcher's model), the RMSEA should approach zero. RMSEAs of less than 0.06 indicate good fit, according to Hu and Bentler ( 12 ), but again this cutoff should be treated as a rough rule of thumb.
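Both indices are simple functions of the chi-square statistics. The following sketch computes them from hypothetical values (the T and df figures are ours, chosen only to show the arithmetic; the max(..., 0) truncation is the commonly used sample version):

import numpy as np

n = 300
T_model, df_model = 13.5, 2     # researcher's model (hypothetical values)
T_null, df_null = 310.0, 6      # null model: all covariances among measures fixed to 0

cfi = 1 - max(T_model - df_model, 0) / max(T_null - df_null, T_model - df_model, 0)
rmsea = np.sqrt(max(T_model - df_model, 0) / (df_model * (n - 1)))
print(round(cfi, 3), round(rmsea, 3))   # here CFI looks good while RMSEA does not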

Underidentification and Other Problems in Estimation

We turn now to one technical issue. A requirement for estimating the parameters of a model is that the model must be identified. Broadly speaking, underidentification is simply a problem of algebra, a matter of trying to estimate too many parameters given the data available. A model is identified if the information in your sample equals or exceeds the needs defined by the estimation of your model parameters. The information about the sample is captured by the unique variances and covariances in the covariance matrix of the observed measures. This information is used to estimate the free model parameters: the factor loadings, the variances and covariances among the factors, and the variances and covariances among the errors. One formal rule of thumb to help assess identification is called the t rule, which states that the number of freely estimated parameters (q) must be less than or equal to the number of unique variances and covariances among the measured variables, which is equal to p(p + 1)/2. Another way to express the t rule is that the df for the χ² test cannot be negative (Eq. 5). For our two-factor example (Fig. 2), the number of unique variances and covariances is p(p + 1)/2 = 4(4 + 1)/2 = 10. The number of free parameters is 9 (4 factor weights + 1 covariance between factors + 4 error variances). Therefore, this model passes the t rule for identification because 10 − 9 = 1 ≥ 0.
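The t rule is easy to automate; this tiny helper (ours, purely illustrative) reports whether the rule is satisfied and what the resulting df would be:

def t_rule(p, q):
    # p: number of measured variables; q: number of freely estimated parameters
    pieces = p * (p + 1) // 2          # unique variances and covariances available
    return q <= pieces, pieces - q     # (passes the t rule?, resulting df)

print(t_rule(p=4, q=9))   # two-factor model in Figure 2: (True, 1)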

The bad news is that even if your model passes the t rule, the model may still be underidentified (i.e., not identified). This occurs if the number of free parameters for a portion of the model exceeds the available sample information. For our two-factor model (Fig. 2), we might have chosen to constrain the covariance between the two factors to be equal to zero. This model passes the t rule in that there are ten unique variances and covariances and only eight freely estimated parameters (4 factor weights + 4 error variances), but it is not identified. The variances and covariances for each pair of measures associated with a factor are available to estimate only the model parameters for those measures. Because each pair of measures is linked to one and only one factor, it is as if two CFAs are being conducted: one for the pair of measures associated with one factor and another for the pair of measures associated with the second factor. The consequence is that the model cannot be estimated, because the number of freely estimated parameters for any pair of measures (two loadings and two error variances) exceeds the amount of sample information (two variances and one covariance between these measures).

Additional identification rules are available. The three-indicator rule may be applied for the example just described. If each measured variable is associated with only one estimated factor loading (others constrained to 0), the covariances among factors are constrained to 0, and the covariances among the errors are constrained to 0, then a model is identified if each factor affects at least three measures (as opposed to two measures, as described for our two-factor model with uncorrelated factors). In most CFA applications, factors are allowed to be correlated, and then a model is identified if each factor affects at least two measures. In other words, for the two-indicator rule, the same conditions must hold as with the three-indicator rule except the covariances among factors are freely estimated.

There is both bad and good news about the use of the two- and three-indicator rules. The bad news is that they are not applicable for many CFA models. For example, they are not helpful in determining if a bifactor model with both group and general factors is identified. The good news is that available software is very likely to give warning messages if the model is underidentified. More bad news is that it is not always obvious what the warning messages mean and what, if anything, should be done to remedy the problem.

The messages might suggest other estimation problems, such as empirical underidentification or bad start values. With empirical underidentification, the model is identified mathematically, but nevertheless the parameters of the model cannot be estimated because of the data. For example, a CFA model with two factors might meet all the requirements of the two-indicator rule, but it may still not be able to be estimated if the freely estimated covariance between factors is 0 (or close to 0). In this case, because the factors are uncorrelated, three measures are required per factor. Alternatively, for the same example, if the estimated factor loading for a measure is 0, it cannot be counted as one of the indicators for a factor.

The other estimation problem is bad start values. The estimation process in CFA is iterative and requires start values that are created by the SEM software. With more complex models, the start values created by the program may be bad in that they do not produce adequate estimates. In this instance, the researcher may ask the program to conduct more iterations to get a good solution or may be forced to supply their own start values for parameter estimation. Researchers might use estimates from EFA or other CFA models to supply start values.

In conducting CFA, researchers are relieved when they receive no warning messages about parameter estimates. When they do receive messages, they should consult their SEM software manual to differentiate between messages that suggest minimal problems and those that require careful exploration of the model. Most importantly, it is essential not to deny the presence of warning messages, but rather to acknowledge their presence and, when in doubt, work through them with someone you trust (i.e., your local SEM expert).

CONCLUSIONS

In many applications, researchers who apply EFA could use CFA. To the extent that researchers have some knowledge about the measures that they are analyzing, they should be conducting CFA. There are real benefits to stating rigorously one’s beliefs about measures, ideally by specifying alternative models, assessing those beliefs with indices that allow for their disconfirmation, and at the end being able to specify which alternative model produces the best fit. It may require more thoughtfulness upfront than EFA, but the outcome is likely to be more informative if the methods of CFA are applied skillfully. We offer some suggested readings in an appendix to allow you to develop a better understanding of CFA and SEM in general.

As we noted at the beginning of this piece, we have considered only CFA in the article, one of many analytic procedures that can be conducted with SEM. In our example of hostility, anger, anxiety, and depressive symptoms, CFA was used to help understand the fundamental question of how those measures relate to one another—a worthy pursuit in and of itself but one that is often ignored in the psychosomatic literature. Typically, we would go a step further and use the factors we derive from CFA as predictors of cardiac disease risk, using the full structural model. In the full structural model, the paths between factors are assessed in addition to the relationship between factor and measures. A key advantage to this model is that the factors are free of measurement error, if correctly specified; thus, the paths between factors are essentially corrected for measurement error.

We include a list of suggested readings to allow researchers to learn about analyzing data, using a full structural equation model as well as to develop a more in-depth understanding of CFA and the various methods associated with it.


Suggested Readings

Introductory reading.

Brown TA. Confirmatory factor analysis for applied research. New York: Guilford Press; 2006.

Green SB, Thompson MS. Structural equation modeling in clinical research. In: Roberts MC, Illardi SS, editors. Methods of Research in Clinical Psychology: A Handbook. London: Blackwell; 2003. p. 138–175.

Kline RB. Principles and practice of structural equation modeling (2nd ed). New York: Guilford Press; 2005.

Glaser D. Structural Equation Modeling Texts: A primer for the beginner. Journal of Clinical Child Psychology 2002;31:573–578. Available at: http://dx.doi.org/10.1207/S15374424JCCP3104_16.

More Advanced Reading

Bollen KA. Structural Equations with Latent Variables. New York: Wiley; 1989.

Edwards JR, Bagozzi RP. On the nature and direction of relationships between constructs and measures. Psychol Methods 2000;5:155–174. Available at http://dx.doi.org/10.1037/1082-989X.5.2.155.

MacCallum RC, Roznowski M, Necowitz LB. Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychol Bull 1992;111:490–504. Available at: http://dx.doi.org/10.1037/0033-2909.111.3.490.

McDonald R, Ho M-HR. Principles and practice in reporting structural equation analyses. Psychol Methods 2002;7:64–82. Available at: http://dx.doi.org/10.1037/1082-989X.7.1.64.

Wirth RJ, Edwards MC. Item factor analysis: Current approaches and future directions. Psychol Methods 2007;12:58–79. Available at: http://dx.doi.org/10.1037/1082-989X.12.1.58.

1 Frasure-Smith and Lespérance ( 6 ) make an excellent attempt to answer a very similar question about psychosocial variables and cardiac risk, using conventional factor analytic techniques.

2 We use the terms "factor" and "latent variable" interchangeably throughout this paper.

3 For our example, we are limited in what changes we can make in terms of whether a parameter is free or fixed. For example, the covariance between the factors must be free to obtain proper estimates of the model parameters. This problem is due to the simplicity of the model, which includes only two indicators per factor. More measures for each of the factors would obviate this difficulty. This problem with model specification and estimation is discussed more generally later in the article when we consider model identification.

4 See reference 10 for a discussion of bifactor models. They suggest, for example, that items on an appropriately developed scale of depression would assess not only the general factor of depression, but also subsets of items would assess group factors representing such aspects as somatization and feelings of hopelessness.

5 The input data and the SEM software code, using EQS and Mplus, can be found for all examples at http://www.duke.edu/web/behavioralmed .

6 Item response models can also be useful with item-level data. Analyses based on CFA and item response models yield comparable results under certain conditions. For further discussion, the reader is referred to the study by Brown ( 11 ).

confirmatory factor analysis; mathematical model; model fit and identification; structural equation modeling; coronary artery disease; psychosomatic research

3.1 - Experiments with One Factor and Multiple Levels

Lesson 3 is the beginning of the one-way analysis of variance part of the course, which extends the two-sample situation to k samples.

Text Reading : In addition to these notes, read Chapter 3 of the text and the online supplement.  (If you have the 7th edition, also read 13.1.)

We review the issues related to a single factor experiment, which we see in the context of a Completely Randomized Design (CRD). In a single factor experiment with a CRD, the levels of the factor are randomly assigned to the experimental units. Alternatively, we can think of randomly assigning the experimental units to the treatments or in some cases, randomly selecting experimental units from each level of the factor.

Example 3-1: Cotton Tensile Strength


This is an investigation into the formulation of synthetic fibers that are used to make cloth. The response is tensile strength, the strength of the fiber. The experimenter wants to determine the best level of the cotton in terms of percent, to achieve the highest tensile strength of the fiber. Therefore, we have a single quantitative factor, the percent of cotton combined with synthetic fabric fibers.

The five treatment levels of percent cotton are evenly spaced from 15% to 35%. We have five replicates, five runs on each of the five cotton weight percentages.

Cotton Weight         Observations
Percentage       1    2    3    4    5    Total   Average
15               7    7   15   11    9     49       9.8
20              12   17   12   18   18     77      15.4
25              14   18   18   19   19     88      17.6
30              19   25   22   19   23    108      21.6
35               7   10   11   15   11     54      10.8

The box plot of the results shows an increase in strength as the cotton percentage increases, and then strength seems to drop off rather dramatically after 30%.

[Box plots of tensile strength versus cotton weight percentage]

Makes you wonder about all of those 50% cotton shirts that you buy?!

The first question is: does the cotton percent make a difference? (The null hypothesis is that it does not.) Now, it seems that it doesn't take statistics to answer this question. All we have to do is look at the side-by-side box plots of the data and there appears to be a difference; however, this difference is not so obvious by looking at the table of raw data. A second question, frequently asked when the factor is quantitative: what is the optimal level of cotton if you only want to consider strength?

There is a point that I probably should emphasize now and repeatedly throughout this course. There is often more than one response measurement that is of interest. You need to think about multiple responses in any given experiment. In this experiment, for some reason, we are interested in only one response, tensile strength, whereas in practice the manufacturer would also consider comfort, ductility, cost, etc.

This single factor experiment can be described as a completely randomized design (CRD). The completely randomized design means there is no structure among the experimental units. There are 25 runs which differ only in the percent cotton, and these will be done in random order. If there were different machines or operators, or other factors such as the order or batches of material, this would need to be taken into account. We will talk about these kinds of designs later. This is an example of a completely randomized design where there are no other factors that we are interested in other than the treatment factor percentage of cotton.

Reference: Problem 3.10 of Montgomery (3.8 in the \(7^{th}\) edition)

Analysis of Variance

The Analysis of Variance (ANOVA) is a somewhat misleading name for this procedure. But we call it the analysis of variance because we are partitioning the total variation in the response measurements.

The Model Statement

Each measured response can be written as the overall mean plus the treatment effect plus a random error.

\(Y_{ij} = \mu + \tau_i +\epsilon_{ij}\)

\(i = 1, ... , a,\) and \( j = 1, ... n_i\)

Generally, we will define our treatment effects so that they sum to 0, a constraint on our definition of our parameters, \(\sum \tau_{i}=0\). This is not the only constraint we could choose; one treatment level could serve as a reference, such as the zero level for cotton, and then everything else would be a deviation from that. However, generally, we will let the effects sum to 0. The experimental error terms are assumed to be normally distributed with zero mean, and if the experiment has constant variance then there is a single variance parameter \(\sigma^2\). All of these assumptions need to be checked. This is called the effects model.

An alternative way to write the model is in terms of the expected value of each observation, \(E\left(Y_{ij}\right)=\mu+\tau_{i}=\mu_{i}\), an overall mean plus the treatment effect. This is called the means model and is written as:

\(Y_{ij} = \mu_{i} +\epsilon_{ij}\)

In looking ahead there is also the regression model. Regression models can also be employed but for now, we consider the traditional analysis of variance model and focus on the effects of the treatment.

Analysis of variance formulas that you should be familiar with by now are provided in the textbook.

The total variation is the sum of the squared deviations of the observations from the overall mean, \(SS_{Total}=\sum_{i=1}^{a}\sum_{j=1}^{n}\left(y_{ij}-\bar{y}_{..}\right)^2\), summed over all a × n observations.

The analysis of variance simply takes this total variation and partitions it into the treatment component and the error component, \(SS_{Total}=SS_{Treatment}+SS_{Error}\). The treatment component is based on the differences between the treatment means and the overall mean. The error component is based on the differences between the observations and their treatment means, i.e., the variation not explained by the treatments.

Notice when you square the deviations there are also cross-product terms, (see equation 3-5), but these sum to zero when you sum over the set of observations. The analysis of variance is the partition of the total variation into treatment and error components. We want to test the hypothesis that the means are equal versus at least one is different, i.e.

\(H_0 \colon \mu_{1}=\ldots=\mu_{a}\) versus \(H_a \colon \mu_{i} \neq \mu_{i'}\) for at least one pair \(i \neq i'\).

Corresponding to the sum of squares (SS) are the degrees of freedom associated with the treatments, \(a - 1\), and the degrees of freedom associated with the error, \(a × (n - 1)\), and finally one degree of freedom is due to the overall mean parameter. These add up to the total \(N = a × n\), when the \(n_i\) are all equal to \(n\), or \( N=\sum n_{i}\) otherwise.

The mean square treatment (MST) is the sum of squares due to treatment divided by its degrees of freedom.

The mean square error (MSE) is the sum of squares due to error divided by its degrees of freedom.

If the true treatment means are equal to each other, i.e. the \(\mu_i\) are all equal, then these two quantities should have the same expectation. If they are different then the treatment component, MST will be larger. This is the basis for the F -test.

The basic test statistic for testing the hypothesis that the means are all equal is the F ratio, MST/MSE, with degrees of freedom \(a - 1\) and \(a(n - 1)\), or equivalently \(a - 1\) and \(N - a\).

We reject \(H_0\) if this quantity exceeds the \(100(1-\alpha)\) percentile of the F distribution.
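As a quick check on the calculations in the next section, here is a sketch (ours, not part of the course notes) that runs the one-way ANOVA on the cotton data from the table above using SciPy:

from scipy import stats

# Tensile strength observations for each cotton weight percentage
pct15 = [7, 7, 15, 11, 9]
pct20 = [12, 17, 12, 18, 18]
pct25 = [14, 18, 18, 19, 19]
pct30 = [19, 25, 22, 19, 23]
pct35 = [7, 10, 11, 15, 11]

F, p_value = stats.f_oneway(pct15, pct20, pct25, pct30, pct35)
print(round(F, 2), p_value)   # F should be about 14.76 with a very small p-value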

Example 3-1: Continued - Cotton Weight Percent

Here is the Analysis of Variance table from the Minitab output:

One-way ANOVA: Observations versus Cotton Weight %

Source             DF       SS        MS        F       P
Cotton Weight %     4     475.76    118.94    14.76   0.000
Error              20     161.20      8.06
Total              24     636.96

S = 2.839   R-Sq = 74.69%   R-Sq(adj) = 69.63%

Individual 95% CIs for Mean based on Pooled StDev

Level    N    Mean     StDev
15       5     9.800    3.347
20       5    15.400    3.130
25       5    17.600    2.074
30       5    21.600    2.608
35       5    10.800    2.864

(The Minitab output also displays these individual 95% confidence intervals graphically; that plot is omitted here.)

Note the very large F statistic, 14.76. The p-value for this F statistic is < 0.0005, which comes from an F distribution with 4 and 20 degrees of freedom.

We can see that most of the distribution lies between zero and about four. Our statistic, 14.76, is far out in the tail, obvious confirmation about what the data show, that indeed the means are not the same. Hence, we reject the null hypothesis.

Model Assumption Checking

We should check if the data are normal - they should be approximately normal - they should certainly have constant variance among the groups. Independence is harder to check but plotting the residuals in the order in which the operations are done can sometimes detect if there is lack of independence. The question, in general, is how do we fit the right model to represent the data observed. In this case, there's not too much that can go wrong since we only have one factor and it is a completely randomized design. It is hard to argue with this model.

Let's examine the residuals, which are just the observations minus the predicted values, in this case, treatment means. Hence, \(e_{ij}=y_{ij}-\bar{y}_{i}\).

These plots don't look exactly normal, but at least they don't seem to have any wild outliers. The normal scores plot looks reasonable. The residuals versus order plot shows the residuals in the order in which the observations were taken. This looks a little suspect in that the first six data points all have small negative residuals, a pattern not reflected in the following data points. Does this look like it might be a startup problem? These are the kinds of clues that you look for... if you were conducting this experiment you would certainly want to find out what was happening in the beginning.

Post-ANOVA Comparison of Means

So, we found the means are significantly different. Now what? In general, if we had a qualitative factor rather than a quantitative factor we would want to know which means differ from which other ones. We would probably want to do t -tests or Tukey maximum range comparisons, or some set of contrasts to examine the differences in means. There are many multiple comparison procedures.

Two methods, in particular, are Fisher's Least Significant Difference (LSD) and the Bonferroni method. Both of these are based on the t-test. Fisher's LSD says to do an F-test first and, if you reject the null hypothesis, then just do ordinary t-tests between all pairs of means. The Bonferroni method is similar but only requires that you decide in advance how many pairs of means you wish to compare, say g, and then perform the g t-tests at a type I level of \(\alpha / g\) each. This protects the entire family of g tests so that the familywise type I error rate is no more than \(\alpha\). For this setting, with a treatments, \(g = a(a-1)/2\) when comparing all pairs of treatments.
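Here is a rough sketch (ours) of the Bonferroni approach applied to the cotton data; note that, for simplicity, each pairwise test below uses only the two groups being compared rather than the pooled MSE from the ANOVA, which is a slight departure from the usual LSD/Bonferroni calculations:

from itertools import combinations
from scipy import stats

groups = {15: [7, 7, 15, 11, 9], 20: [12, 17, 12, 18, 18], 25: [14, 18, 18, 19, 19],
          30: [19, 25, 22, 19, 23], 35: [7, 10, 11, 15, 11]}

alpha = 0.05
pairs = list(combinations(groups, 2))   # g = a(a - 1)/2 = 10 pairwise comparisons
for i, j in pairs:
    t, p = stats.ttest_ind(groups[i], groups[j])
    # Each test is judged at level alpha / g to control the familywise error rate
    print(f"{i}% vs {j}%: p = {p:.4f}, significant at alpha/g: {p < alpha / len(pairs)}")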

All of these multiple comparison procedures are simply aimed at interpreting or understanding the overall F -test --which means are different? They apply to many situations especially when the factor is qualitative. However, in this case, since cotton percent is a quantitative factor, doing a test between two arbitrary levels e.g. 15% and 20% level, isn't really what you want to know. What you should focus on is the whole response function as you increase the level of the quantitative factor, cotton percent.

Whenever you have a quantitative factor you should be thinking about modeling that relationship with a regression function.

Review the video that demonstrates the use of polynomial regression to help explain what is going on.

Here is the Minitab output where regression was applied:

Polynomial Regression Analysis: Observation versus Cotton Weight %

The regression equation is:

Observations = 62.61 − 9.011 (Cotton Weight %) + 0.4814 (Cotton Weight %)² − 0.007600 (Cotton Weight %)³

S = 3.04839   R-Sq = 69.4%   R-Sq(adj) = 65.0%

Analysis of Variance

Source         DF       SS         MS        F       P
Regression      3     441.814    147.271    15.85   0.000
Error          21     195.146      9.293
Total          24     636.960

Sequential Analysis of Variance

Source        DF       SS        F       P
Linear         1      33.620    1.28    0.269
Quadratic      1     343.214   29.03    0.000
Cubic          1      64.980    6.99    0.015

Here is a link to the Cotton Weight % dataset ( cotton_weight.mwx | cotton_weight.csv ). Open this in Minitab so that you can try this yourself.
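If you would rather reproduce the fit outside Minitab, here is a small sketch (ours) that fits the same cubic polynomial to the cotton data with numpy; the coefficients should agree with the Minitab equation above up to rounding:

import numpy as np

x = np.repeat([15, 20, 25, 30, 35], 5).astype(float)   # cotton weight percentages
y = np.array([7, 7, 15, 11, 9,  12, 17, 12, 18, 18,  14, 18, 18, 19, 19,
              19, 25, 22, 19, 23,  7, 10, 11, 15, 11], dtype=float)

coefs = np.polyfit(x, y, deg=3)                  # cubic fit, highest power first
print(np.round(coefs[::-1], 4))                  # intercept, linear, quadratic, cubic terms
print(np.round(np.polyval(coefs, [25.0, 27.5, 30.0]), 2))   # fitted strength near the peak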

You can see that the linear term in the regression model is not significant but the quadratic is highly significant. Even the cubic term is significant with p -value = 0.015. In Minitab we can plot this relationship in the fitted line plot as seen below:

This shows the actual fitted equation. Why wasn't the linear term significant? If you just fit a straight line to these data, it would be almost flat, not quite but almost. As a result, the linear term by itself is not significant. We should still leave it in the polynomial regression model, however, because we like to have a hierarchical model when fitting polynomials. What we can learn from this model is that tensile strength is probably best between 25% and 30% cotton weight.

This is a more focused conclusion than we get from simply comparing the means of the actual levels in the experiment because the polynomial model reflects the quantitative relationship between the treatment and the response.

We should also check whether the observations have constant variance \(\sigma^2\) for all treatments. If the group variances are all equal, we can say that they are equal to \(\sigma^2\). This is an assumption of the analysis, and we need to confirm it. We can either test it with Bartlett's test or Levene's test, or simply use the 'eyeball' technique of plotting the residuals versus the fitted values and seeing whether they are roughly equal. The eyeball approach is almost as good as using these tests, since by testing we cannot 'prove' the null hypothesis.

Bartlett's test is very susceptible to non-normality because it is based on the sample variances, which are not robust to outliers. We must assume that the data are normally distributed and thus not very long-tailed. When one of the residuals is large and you square it, you get a very large value which explains why the sample variance is not very robust. One or two outliers can cause any particular variance to be very large. Thus simply looking at the data in a box plot is as good as these formal tests. If there is an outlier you can see it. If the distribution has a strange shape you can also see this in a histogram or a box plot. The graphical view is very useful in this regard.

Levene's test is preferred to Bartlett’s in my view because it is more robust. To calculate the Levene's test you take the observations and obtain (not the squared deviations from the mean but) the absolute deviations from the median. Then, you simply do the usual one way ANOVA F -test on these absolute deviations from the medians. This is a very clever and simple test that has been around for a long time, created by Levene back in the 1950s. It is much more robust to outliers and non-normality than Bartlett's test.
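Both tests are available in SciPy; the sketch below (ours) applies them to the cotton data, using the median-centered version of Levene's test described above:

from scipy import stats

groups = [[7, 7, 15, 11, 9], [12, 17, 12, 18, 18], [14, 18, 18, 19, 19],
          [19, 25, 22, 19, 23], [7, 10, 11, 15, 11]]

# Bartlett's test: based on the sample variances, so sensitive to non-normality
print(stats.bartlett(*groups))

# Levene's test with median centering (the Brown-Forsythe variant): more robust
print(stats.levene(*groups, center='median'))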

13.6: Post-hoc Analysis – Tukey's Honestly Significant Difference (HSD) Test


  • Maurice A. Geraghty
  • De Anza College


When the Null Hypothesis is rejected in one factor ANOVA, the conclusion is that not all means are the same. This however leads to an obvious question: which particular means are different? Seeking further information after the results of a test is called post‐hoc analysis.

The problem of multiple tests

One attempt to answer this question is to conduct multiple pairwise independent-samples t‐tests and determine which ones are significant. We would compare \(\mu_{1}\) to \(\mu_{2}\), \(\mu_{1}\) to \(\mu_{3}\), \(\mu_{2}\) to \(\mu_{3}\), \(\mu_{1}\) to \(\mu_{4}\), etc. There is a major flaw in this methodology: each test would be run at a significance level of \(\alpha\), so the probability of making at least one Type I error across the whole set of tests would be considerably larger than the desired \(\alpha\). Furthermore, these pairwise tests would NOT be mutually independent. Several statisticians designed tests that effectively deal with the problem of determining an "honest" significance level for a set of tests; we will cover the one developed by John Tukey, the Honestly Significant Difference (HSD) test. To use this test, we need the critical value from the Studentized Range Distribution (\(q\)), which is used to determine when the difference between a pair of sample means is significant.

The Tukey HSD test

Tests : \(H_{o}: \mu_{i}=\mu_{j} \quad H_{a}: \mu_{i} \neq \mu_{j}\) where the subscripts \(i\) and \(j\) represent two different populations

Overall significance level of \(\alpha\) : This means that all pairwise tests can be run at the same time with an overall significance level of \(\alpha\)

Test Statistic : \(\mathrm{HSD}=q \sqrt{\dfrac{\mathrm{MSE}}{n_{c}}}\)

\(q\) = critical value from Studentized Range table

\(\mathrm{MSE}\) = Mean Square Error from ANOVA table

\(n_c\) = number of replicates per treatment. An adjustment is made for unbalanced designs.

Decision : Reject \(H_o\) if \(\left|\overline{X}_{i}-\overline{X}_{j}\right|>\mathrm{HSD}_\text{critical value}\) 

Computer software, such as Minitab, will calculate the critical values and test statistics for these series of tests. We will not perform the manual calculations in this text.
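For readers who want to see where the critical value comes from, here is a minimal R sketch of the formula above; k (number of groups), df_error, MSE, and n_c are placeholders to be read off your own ANOVA output.

# Studentized range critical value and the Tukey HSD cutoff (placeholder inputs)
q_crit <- qtukey(0.95, nmeans = k, df = df_error)  # 5% overall significance level
HSD    <- q_crit * sqrt(MSE / n_c)                 # reject H0 when |mean_i - mean_j| > HSD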

Example: Party Pizza

Let us return to the Tofu pizza example where we rejected the Null Hypothesis and supported the claim that there was a difference in means among the three restaurants.  

[Figure: graph of the sample mean sales for the three restaurants]

In reviewing the graph of the sample means, it appears that Santa Clara has a much higher number of sales than Cupertino and San Jose. There will be three pairwise post‐hoc tests to run.

\(H_{o}: \mu_{1}=\mu_{2} \qquad H_{a}: \mu_{1} \neq \mu_{2} \qquad H_{o}: \mu_{1}=\mu_{3} \qquad H_{a}: \mu_{1} \neq \mu_{3} \qquad H_{o}: \mu_{2}=\mu_{3} \qquad H_{a}: \mu_{2} \neq \mu_{3}\)

These three tests will be conducted with an overall significance level of \(\alpha\) = 5%.

The model will be the Tukey \(\mathrm{HSD}\) test.

Here are the differences of the sample means for each pair ranked from lowest to highest:

Test 1: Cupertino to San Jose : \(\left|\overline{X}_{1}-\overline{X}_{2}\right|=|12.75-11.50|=1.25\)

Test 2: Cupertino to Santa Clara : \(\left|\overline{X}_{1}-\overline{X}_{3}\right|=|12.75-17.00|=4.25\)

Test 3:  San Jose to Santa Clara : \(\left|\overline{X}_{2}-\overline{X}_{3}\right|=|11.50-17.00|=5.50\)

The \(\mathrm{HSD}\) critical values (using statistical software) for this particular test:

\(\mathrm{HSD}_\text{crit}\) at 5% significance level = 1.85        \(\mathrm{HSD}_\text{crit}\) at 1% significance level = 2.51

For each test, reject \(H_o\) if the difference of means is greater than \(\mathrm{HSD}_\text{crit}\)

Test 2 and Test 3 show significantly different means at both the 1% and 5% level.

The Minitab approach for the decision rule will be to reject \(H_o\) for each pair that does not share a common group. Here are the results for the test conducted at the 5% level of significance:

Data/Results

Refer to the Minitab output. Santa Clara is in group A while Cupertino and San Jose are in Group B.

[Figure: Minitab Tukey pairwise comparison output with grouping letters]

Conclusion    

Santa Clara has a significantly higher mean number of tofu pizzas sold compared to both San Jose and Cupertino. There is no significant difference in mean sales between San Jose and Cupertino.


One-way ANOVA | When and How to Use It (With Examples)

Published on March 6, 2020 by Rebecca Bevans . Revised on May 10, 2024.

ANOVA , which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of more than two groups.

A one-way ANOVA uses one independent variable , while a two-way ANOVA uses two independent variables.

Table of contents

  • When to use a one-way ANOVA
  • How does an ANOVA test work
  • Assumptions of ANOVA
  • Performing a one-way ANOVA
  • Interpreting the results
  • Post-hoc testing
  • Reporting the results of ANOVA
  • Frequently asked questions about one-way ANOVA

Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable . The independent variable should have at least three levels (i.e. at least three different groups or categories).

ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example:

  • Your independent variable is social media use , and you assign groups to low , medium , and high levels of social media use to find out if there is a difference in hours of sleep per night .
  • Your independent variable is brand of soda , and you collect data on Coke , Pepsi , Sprite , and Fanta to find out if there is a difference in the price per 100ml .
  • Your independent variable is type of fertilizer , and you treat crop fields with mixtures 1 , 2 and 3 to find out if there is a difference in crop yield .

The null hypothesis ( H 0 ) of ANOVA is that there is no difference among group means. The alternative hypothesis ( H a ) is that at least one group differs significantly from the overall mean of the dependent variable.

If you only want to compare two groups, use a t test instead.


ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable.

If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.

ANOVA uses the F test for statistical significance . This allows for comparison of multiple means at once, because the error is calculated for the whole set of comparisons rather than for each individual two-way comparison (which would happen with a t test).

The F test compares the variance between the group means with the variance within the groups. If the variance within groups is smaller than the variance between groups, the F test will return a higher F value, and therefore a higher likelihood that the difference observed is real and not due to chance.

The assumptions of the ANOVA test are the same as the general assumptions for any parametric test:

  • Independence of observations : the data were collected using statistically valid sampling methods , and there are no hidden relationships among observations. If your data fail to meet this assumption because you have a confounding variable that you need to control for statistically, use an ANOVA with blocking variables.
  • Normally-distributed response variable : The values of the dependent variable follow a normal distribution .
  • Homogeneity of variance : The variation within each group being compared is similar for every group. If the variances are different among the groups, then ANOVA probably isn’t the right fit for the data.

While you can perform an ANOVA by hand , it is difficult to do so with more than a few observations. We will perform our analysis in the R statistical program because it is free, powerful, and widely available. For a full walkthrough of this ANOVA example, see our guide to performing ANOVA in R .

The sample dataset from our imaginary crop yield experiment contains data about:

  • fertilizer type (type 1, 2, or 3)
  • planting density (1 = low density, 2 = high density)
  • planting location in the field (blocks 1, 2, 3, or 4)
  • final crop yield (in bushels per acre).

This gives us enough information to run various different ANOVA tests and see which model is the best fit for the data.

For the one-way ANOVA, we will only analyze the effect of fertilizer type on crop yield.

Sample dataset for ANOVA

After loading the dataset into our R environment, we can use the command aov() to run an ANOVA. In this example we will model the differences in the mean of the response variable , crop yield, as a function of type of fertilizer.


To view the summary of a statistical model in R, use the summary() function.
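A minimal sketch of those two calls is shown below; the file name and the column names yield and fertilizer are assumptions based on the dataset description above.

# One-way ANOVA in R (file and column names assumed)
crop.data <- read.csv("crop.data.csv")
crop.data$fertilizer <- as.factor(crop.data$fertilizer)  # treat fertilizer type as a grouping factor
one.way <- aov(yield ~ fertilizer, data = crop.data)
summary(one.way)                                         # Df, Sum Sq, Mean Sq, F value, Pr(>F)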

The summary of an ANOVA test (in R) looks like this:

One-way ANOVA summary

The ANOVA output provides an estimate of how much variation in the dependent variable can be explained by the independent variable.

  • The first column lists the independent variable along with the model residuals (aka the model error).
  • The Df column displays the degrees of freedom for the independent variable (the number of levels within the variable minus 1), and the degrees of freedom for the residuals (the total number of observations minus 1, minus the degrees of freedom used by each of the independent variables).
  • The Sum Sq column displays the sum of squares (a.k.a. the total variation) between the group means and the overall mean explained by that variable. The sum of squares for the fertilizer variable is 6.07, while the sum of squares of the residuals is 35.89.
  • The Mean Sq column is the mean of the sum of squares, which is calculated by dividing the sum of squares by the degrees of freedom.
  • The F value column is the test statistic from the F test: the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation associated with the independent variable is real and not due to chance.
  • The Pr(>F) column is the p value of the F statistic. This shows how likely it is that the F value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.

Because the p value of the independent variable, fertilizer, is statistically significant ( p < 0.05), it is likely that fertilizer type does have a significant effect on average crop yield.

ANOVA will tell you if there are differences among the levels of the independent variable, but not which differences are significant. To find how the treatment levels differ from one another, perform a TukeyHSD (Tukey’s Honestly-Significant Difference) post-hoc test.

The Tukey test runs pairwise comparisons among each of the groups, and uses a conservative error estimate to find the groups which are statistically different from one another.
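In R this is a single call on the fitted model object; the name one.way is carried over from the sketch above.

# Tukey's HSD post-hoc test on the fitted one-way ANOVA
TukeyHSD(one.way, conf.level = 0.95)  # pairwise differences, 95% CIs, adjusted p values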

The output of the TukeyHSD looks like this:

Tukey summary one-way ANOVA

First, the table reports the model being tested (‘Fit’). Next it lists the pairwise differences among groups for the independent variable.

Under the ‘$fertilizer’ section, we see the mean difference between each fertilizer treatment (‘diff’), the lower and upper bounds of the 95% confidence interval (‘lwr’ and ‘upr’), and the p value , adjusted for multiple pairwise comparisons.

The pairwise comparisons show that fertilizer type 3 has a significantly higher mean yield than both fertilizer 2 and fertilizer 1, but the difference between the mean yields of fertilizers 2 and 1 is not statistically significant.

When reporting the results of an ANOVA, include a brief description of the variables you tested, the  F value, degrees of freedom, and p values for each independent variable, and explain what the results mean.

If you want to provide more detailed information about the differences found in your test, you can also include a graph of the ANOVA results , with grouping letters above each level of the independent variable to show which groups are statistically different from one another:

One-way ANOVA graph


The only difference between one-way and two-way ANOVA is the number of independent variables . A one-way ANOVA has one independent variable, while a two-way ANOVA has two.

  • One-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka) and race finish times in a marathon.
  • Two-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka), runner age group (junior, senior, master’s), and race finishing times in a marathon.

All ANOVAs are designed to test for differences among three or more groups. If you are only testing for a difference between two groups, use a t-test instead.

A factorial ANOVA is any ANOVA that uses more than one categorical independent variable . A two-way ANOVA is a type of factorial ANOVA.

Some examples of factorial ANOVAs include:

  • Testing the combined effects of vaccination (vaccinated or not vaccinated) and health status (healthy or pre-existing condition) on the rate of flu infection in a population.
  • Testing the effects of marital status (married, single, divorced, widowed), job status (employed, self-employed, unemployed, retired), and family history (no family history, some family history) on the incidence of depression in a population.
  • Testing the effects of feed type (type A, B, or C) and barn crowding (not crowded, somewhat crowded, very crowded) on the final weight of chickens in a commercial farming operation.

In ANOVA, the null hypothesis is that there is no difference among group means. If any group differs significantly from the overall group mean, then the ANOVA will report a statistically significant result.

Significant differences among group means are calculated using the F statistic, which is the ratio of the mean sum of squares (the variance explained by the independent variable) to the mean square error (the variance left over).

If the F statistic is higher than the critical value (the value of F that corresponds with your alpha value, usually 0.05), then the difference among groups is deemed statistically significant.

Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).

Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical test for your data and interpret your results .


Teach yourself statistics

One-Way Analysis of Variance (ANOVA)

Researchers use one-way analysis of variance in controlled experiments to test for significant differences among group means. This lesson explains when, why, and how to use one-way analysis of variance. The discussion covers fixed-effects models and random-effects models .

Note: One-way analysis of variance is also known as simple analysis of variance or as single-factor analysis of variance.

When to Use One-Way ANOVA

You should only use one-way analysis of variance when you have the right data from the right experimental design .

Experimental Design

One-way analysis of variance should only be used with one type of experimental design - a completely randomized design with one factor (also known as a single-factor, independent groups design). This design is distinguished by the following attributes:

  • The design has one, and only one, factor (i.e., one independent variable ) with two or more levels .
  • Treatment groups are defined by a unique combination of non-overlapping factor levels.
  • The design has k treatment groups, where k is greater than one.
  • Experimental units are randomly selected from a known population .
  • Each experimental unit is randomly assigned to one, and only one, treatment group.
  • Each experimental unit provides one dependent variable score.

Data Requirements

One-way analysis of variance requires that the dependent variable be measured on an interval scale or a ratio scale . In addition, you need to know three things about the experimental design:

  • k = Number of treatment groups
  • n j = Number of subjects assigned to Group j (i.e., number of subjects that receive treatment j )
  • X i,j = The dependent variable score for the i th subject in Group j

For example, the table below shows the critical information that a researcher would need to conduct a one-way analysis of variance, given a typical single-factor, independent groups design:

Group 1: X 1,1 , X 2,1 , X 3,1
Group 2: X 1,2 , X 2,2
Group 3: X 1,3 , X 2,3 , X 3,3 , X 4,3

The design has three treatment groups ( k =3). Nine subjects have been randomly assigned to the groups: three subjects to Group 1 ( n 1 = 3), two subjects to Group 2 ( n 2  = 2), and four subjects to Group 3 ( n 3  = 4). The dependent variable score is X 1,1 for the first subject in Group 1; X 1,2 for the first subject in Group 2; X 1,3 for the first subject in Group 3; X 2,1 for the second subject in Group 1; and so on.

Assumptions of ANOVA

One-way analysis of variance makes three assumptions about dependent variable scores:

  • Independence . The dependent variable score for each experimental unit is independent of the score for any other unit.
  • Normality . In the population, dependent variable scores are normally distributed within treatment groups.
  • Equality of variance . In the population, the variance of dependent variable scores in each treatment group is equal. (Equality of variance is also known as homogeneity of variance or homoscedasticity.)

The assumption of independence is the most important assumption. When that assumption is violated, the resulting statistical tests can be misleading. This assumption is tenable when (a) experimental units are randomly sampled from the population and (b) sampled units are randomly assigned to treatments.

With respect to the other two assumptions, analysis of variance is more forgiving. Violations of normality are less problematic when the sample size is large. And violations of the equal variance assumption are less problematic when the sample size within groups is equal.

Before conducting an analysis of variance, it is best practice to check for violations of normality and homogeneity assumptions. For further information, see:

  • How to Test for Normality: Three Simple Tests
  • How to Test for Homogeneity of Variance: Hartley's Fmax Test
  • How to Test for Homogeneity of Variance: Bartlett's Test

Why to Use One-Way ANOVA

Researchers use one-way analysis of variance to assess the effect of one independent variable on one dependent variable. The analysis answers two research questions:

  • Is the mean score in any treatment group significantly different from the mean score in another treatment group?
  • What is the magnitude of the effect of the independent variable on the dependent variable?

Notice that analysis of variance tells us whether treatment groups differ significantly, but it doesn't tell us how the groups differ. Understanding how the groups differ requires additional analysis.

How to Use One-Way ANOVA

To implement one-way analysis of variance with a single-factor, independent groups design, a researcher takes the following steps:

  • Specify a mathematical model to describe the causal factors that affect the dependent variable.
  • Write statistical hypotheses to be tested by experimental data.
  • Specify a significance level for a hypothesis test.
  • Compute the grand mean and the mean scores for each group.
  • Compute sums of squares for each effect in the model.
  • Find the degrees of freedom associated with each effect in the model.
  • Based on sums of squares and degrees of freedom, compute mean squares for each effect in the model.
  • Find the expected value of the mean squares for each effect in the model.
  • Compute a test statistic , based on observed mean squares and their expected values.
  • Find the P value for the test statistic.
  • Accept or reject the null hypothesis , based on the P value and the significance level.
  • Assess the magnitude of the effect of the independent variable, based on sums of squares.

Whew! Altogether, the steps to implement one-way analysis of variance may look challenging, but each step is simple and logical. That makes the whole process easy to implement, if you just focus on one step at a time. So let's go over each step, one-by-one.

Mathematical Model

For every experimental design, there is a mathematical model that accounts for all of the independent and extraneous variables that affect the dependent variable.

Fixed Effects

For example, here is the fixed-effects mathematical model for a completely randomized design:

X i j = μ + β j + ε i ( j )

where X i j is the dependent variable score for subject i in treatment group j , μ is the population mean, β j is the treatment effect in group j ; and ε i ( j ) is the effect of all other extraneous variables on subject i in treatment j .

For this model, it is assumed that ε i ( j ) is normally and independently distributed with a mean of zero and a variance of σ ε 2 . The mean ( μ ) is constant.

Note: The parentheses in ε i ( j ) indicate that subjects are nested under treatment groups. When a subject is assigned to only one treatment group, we say that the subject is nested under a treatment.

Random Effects

The random-effects mathematical model for a completely randomized design has the same functional form as the fixed-effects model. It can also be expressed as:

X i j = μ + β j + ε i ( j )

Like the fixed-effects mathematical model, the random-effects model also assumes that (1) ε i ( j ) is normally and independently distributed with a mean of zero and a variance of σ ε 2 and (2) the mean ( μ ) is constant.

Here's the difference between the two mathematical models. With a fixed-effects model, the experimenter includes all treatment levels of interest in the experiment. With a random-effects model, the experimenter includes a random sample of treatment levels in the experiment. Therefore, in the random-effects mathematical model, the treatment effect ( β j  ) is a random variable with a mean of zero and a variance of σ 2 β .

Statistical Hypotheses

For fixed-effects models, it is common practice to write statistical hypotheses in terms of the treatment effect β j ; for random-effects models, in terms of the treatment variance σ 2 β  .

H 0 : β j = 0 for all j (fixed-effects)

H 0 : σ 2 β = 0 (random-effects)

H 1 : β j ≠ 0 for some j (fixed-effects)

H 1 : σ 2 β ≠ 0 (random-effects)

If the null hypothesis is true, the mean score in each treatment group should equal the population mean. Thus, if the null hypothesis is true, sample means in the k treatment groups should be roughly equal. If the null hypothesis is false, at least one pair of sample means should be unequal.

Significance Level

The significance level (also known as alpha or α) is the probability of rejecting the null hypothesis when it is actually true. The significance level for an experiment is specified by the experimenter, before data collection begins. Experimenters often choose significance levels of 0.05 or 0.01.

A significance level of 0.05 means that there is a 5% chance of rejecting the null hypothesis when it is true. A significance level of 0.01 means that there is a 1% chance of rejecting the null hypothesis when it is true. The lower the significance level, the more persuasive the evidence needs to be before an experimenter can reject the null hypothesis.

Mean Scores

Analysis of variance begins by computing a grand mean and group means:

  • Grand mean. The grand mean ( \(\bar{X}\) ) is the mean of all observations, computed as follows: \(\bar{X} = \dfrac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} X_{ij}\), where \(n = \sum_{j=1}^{k} n_{j}\).
  • Group means. The mean of group j ( \(\bar{X}_{j}\) ) is the mean of all observations in group j , computed as follows: \(\bar{X}_{j} = \dfrac{1}{n_{j}} \sum_{i=1}^{n_{j}} X_{ij}\)

In the equations above, n is the total sample size across all groups, and \(n_{j}\) is the sample size in Group j .

Sums of Squares

A sum of squares is the sum of squared deviations from a mean score. One-way analysis of variance makes use of three sums of squares:

  • Between-groups sum of squares. The between-groups sum of squares (SSB) measures variation of the group means around the grand mean. It can be computed from the following formula: \(SSB = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} (\bar{X}_{j} - \bar{X})^{2} = \sum_{j=1}^{k} n_{j} (\bar{X}_{j} - \bar{X})^{2}\)
  • Within-groups sum of squares. The within-groups sum of squares (SSW) measures variation of all scores around their respective group means. It can be computed from the following formula: \(SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} (X_{ij} - \bar{X}_{j})^{2}\)
  • Total sum of squares. The total sum of squares (SST) measures variation of all scores around the grand mean. It can be computed from the following formula: \(SST = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} (X_{ij} - \bar{X})^{2}\)

It turns out that the total sum of squares is equal to the between-groups sum of squares plus the within-groups sum of squares, as shown below:

SST = SSB + SSW

As you'll see later on, this relationship will allow us to assess the magnitude of the effect of the independent variable on the dependent variable.
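The identity is easy to verify numerically; here is a tiny R sketch with made-up scores (the numbers have no meaning beyond the illustration).

# Made-up data: three groups with 3, 2, and 3 observations
y     <- c(3, 5, 4,  8, 9,  6, 7, 7)
group <- factor(rep(c("G1", "G2", "G3"), times = c(3, 2, 3)))

grand <- mean(y)            # grand mean
gmean <- ave(y, group)      # each observation's group mean

SSB <- sum((gmean - grand)^2)   # between-groups sum of squares
SSW <- sum((y - gmean)^2)       # within-groups sum of squares
SST <- sum((y - grand)^2)       # total sum of squares
all.equal(SST, SSB + SSW)       # TRUE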

Degrees of Freedom

The term degrees of freedom (df) refers to the number of independent sample points used to compute a statistic minus the number of parameters estimated from the sample points.

To illustrate what is going on, let's find the degrees of freedom associated with the various sum of squares computations:

Between-groups sum of squares: the formula uses k independent sample points, the group means \(\bar{X}_{j}\), and one parameter estimate, the grand mean \(\bar{X}\), which was estimated from the sample points. So, the between-groups sum of squares has k - 1 degrees of freedom.

Within-groups sum of squares: the formula uses n independent sample points, the individual subject scores \(X_{ij}\), and k parameter estimates, the group means \(\bar{X}_{j}\), which were estimated from the sample points. So, the within-groups sum of squares has n - k degrees of freedom (where n is the total sample size across all groups).

Total sum of squares: the formula uses n independent sample points, the individual subject scores \(X_{ij}\), and one parameter estimate, the grand mean \(\bar{X}\), which was estimated from the sample points. So, the total sum of squares has n - 1 degrees of freedom (where n is the total sample size across all groups).

The degrees of freedom for each sum of squares are summarized in the table below:

Sum of squares Degrees of freedom
Between-groups k - 1
Within-groups n - k
Total n - 1

Notice that there is an additive relationship between the various sums of squares. The degrees of freedom for total sum of squares (df TOT ) is equal to the degrees of freedom for between-groups sum of squares (df BG ) plus the degrees of freedom for within-groups sum of squares (df WG ). That is,

df TOT = df BG + df WG

Mean Squares

A mean square is an estimate of population variance. It is computed by dividing a sum of squares (SS) by its corresponding degrees of freedom (df), as shown below:

MS = SS / df

To conduct a one-way analysis of variance, we are interested in two mean squares:

MS WG = SSW / df WG

MS BG = SSB / df BG

Expected Value

The expected value of a mean square is the average value of the mean square over a large number of experiments.

Statisticians have derived formulas for the expected value of the within-groups mean square ( MS WG  ) and for the expected value of the between-groups mean square ( MS BG  ). For one-way analysis of variance, the expected value formulas are:

Fixed- and Random-Effects:

E( MS WG  ) = σ ε 2

Fixed-Effects:

\( E(MS_{BG}) = \sigma_{\varepsilon}^{2} + \dfrac{\sum_{j=1}^{k} n_{j} \beta_{j}^{2}}{k - 1} \)

Random-Effects:

E( MS BG  ) = σ ε 2 + nσ β 2

In the equations above, E( MS WG  ) is the expected value of the within-groups mean square; E( MS BG  ) is the expected value of the between-groups mean square; n is total sample size; k is the number of treatment groups; β  j is the treatment effect in Group j ; σ ε 2 is the variance attributable to everything except the treatment effect (i.e., all the extraneous variables); and σ β 2 is the variance due to random selection of treatment levels.

Notice that MS BG should equal MS WG when the variation due to treatment effects ( β  j for fixed effects and σ β 2 for random effects) is zero (i.e., when the independent variable does not affect the dependent variable). And MS BG should be bigger than the MS WG when the variation due to treatment effects is not zero (i.e., when the independent variable does affect the dependent variable)

Conclusion: By examining the relative size of the mean squares, we can make a judgment about whether an independent variable affects a dependent variable.

Test Statistic

Suppose we use the mean squares to define a test statistic F as follows:

F(v 1 , v 2 ) = MS BG / MS WG

where MS BG is the between-groups mean square, MS WG is the within-groups mean square, v 1 is the degrees of freedom for MS BG , and v 2 is the degrees of freedom for MS WG .

Defined in this way, the F ratio measures the size of MS BG relative to MS WG . The F ratio is a convenient measure that we can use to test the null hypothesis. Here's how:

  • When the F ratio is close to one, MS BG is approximately equal to MS WG . This indicates that the independent variable did not affect the dependent variable, so we cannot reject the null hypothesis.
  • When the F ratio is significantly greater than one, MS BG is bigger than MS WG . This indicates that the independent variable did affect the dependent variable, so we must reject the null hypothesis.

What does it mean for the F ratio to be significantly greater than one? To answer that question, we need to talk about the P-value.

Note: With a completely randomized design, the test statistic F is computed in the same way for fixed-effects and for random-effects. With more complex designs (i.e., designs with more than one factor), test statistics may be computed differently for fixed-effects models than for random-effects models.

In an experiment, a P-value is the probability of obtaining a result more extreme than the observed experimental outcome, assuming the null hypothesis is true.

With analysis of variance, the F ratio is the observed experimental outcome that we are interested in. So, the P-value would be the probability that an F statistic would be more extreme (i.e., bigger) than the actual F ratio computed from experimental data.

How does an experimenter attach a probability to an observed F ratio? Luckily, the F ratio is a random variable that has an F distribution . Therefore, we can use an F table or an online calculator to find the probability that an F statistic will be bigger than the actual F ratio observed in the experiment.

F Distribution Calculator

To find the P-value associated with an observed F ratio, use Stat Trek's free F distribution calculator .

For an example that shows how to find the P-value for an F ratio, see Problem 2 at the bottom of this page.

Hypothesis Test

Recall that the experimenter specified a significance level early on - before the first data point was collected. Once you know the significance level and the P-value, the hypothesis test is routine. Here's the decision rule for accepting or rejecting the null hypothesis:

  • If the P-value is bigger than the significance level, accept the null hypothesis.
  • If the P-value is equal to or smaller than the significance level, reject the null hypothesis.

A "big" P-value indicates that (1) none of the k treatment means ( X j ) were significantly different, so (2) the independent variable did not have a statistically significant effect on the dependent variable.

A "small" P-value indicates that (1) at least one treatment mean differed significantly from another treatment mean, so (2) the independent variable had a statistically significant effect on the dependent variable.

Magnitude of Effect

The hypothesis test tells us whether the independent variable in our experiment has a statistically significant effect on the dependent variable, but it does not address the magnitude (strength) of the effect. Here's the issue:

  • When the sample size is large, you may find that even small differences in treatment means are statistically significant.
  • When the sample size is small, you may find that even big differences in treatment means are not statistically significant.

With this in mind, it is customary to supplement analysis of variance with an appropriate measure of effect size. Eta squared (η 2 ) is one such measure. Eta squared is the proportion of variance in the dependent variable that is explained by a treatment effect. The eta squared formula for one-way analysis of variance is:

η 2 = SSB / SST

where SSB is the between-groups sum of squares and SST is the total sum of squares.

ANOVA Summary Table

It is traditional to summarize ANOVA results in an analysis of variance table. Here, filled with hypothetical data, is an analysis of variance table for a one-way analysis of variance.

Analysis of Variance Table

Source SS df MS F P
BG 230 k - 1 = 10 23 2.3 0.09
WG 220 N - k = 22 10
Total 450 N - 1 = 32

This is an ANOVA table for a single-factor, independent groups design. The experiment used 11 treatment groups, so k equals 11. And three subjects were assigned to each treatment group, so N equals 33. The table shows critical outputs for between-group (BG) treatment effects and within-group (WG) treatment effects.

Many of the table entries are derived from the sum of squares (SS) and degrees of freedom (df), based on the following formulas:

SS TOTAL = SS BG + SS WG = 230 + 220 = 450

MS BG = SS BG / df BG = 230/10 = 23

MS WG = SS WG / df WG = 220/22 = 10

F(v 1 , v 2 ) = MS BG / MS WG = 23/10 = 2.3

where MS bg is the between-groups mean square, MS wg is the within-groups mean square, v 1 and df BG are the degrees of freedom for MS BG , v 2 and df WG are the degrees of freedom for MS WG , and the F ratio is F(v 1 , v 2 ).

An ANOVA table provides all the information an experimenter needs to (1) test hypotheses and (2) assess the magnitude of treatment effects.

Hypothesis Tests

The P-value (shown in the last column of the ANOVA table) is the probability that an F statistic would be more extreme (bigger) than the F ratio shown in the table, assuming the null hypothesis is true. When the P-value is bigger than the significance level, we accept the null hypothesis; when it is smaller, we reject it.

Suppose the significance level for this experiment was 0.05. Based on the table entries, can we reject the null hypothesis? From the ANOVA table, we see that the P-value is 0.09. Since P-value is bigger than the significance level (0.05), we cannot reject the null hypothesis.

Magnitude of Effects

Since the P-value in the ANOVA table was bigger than the significance level, the treatment effect in this experiment was not statistically significant. Does that mean the treatment effect was small? Not necessarily.

To assess the strength of the treatment effect, an experimenter might compute eta squared (η 2 ). The computation is easy, using sums of squares entries from the ANOVA table, as shown below:

η 2 = SSB / SST = 230 / 450 = 0.51

For this experiment, eta squared is 0.51. This means that 51% of the variance in the dependent variable can be explained by the effect of the independent variable.

Even though the treatment effect was not statistically significant, it was not unimportant, since the independent variable accounted for more than half of the variance in the dependent variable. The moral here is that a hypothesis test by itself may not tell the whole story. It also pays to look at the magnitude of an effect.

Advantages and Disadvantages

One-way analysis of variance with a single-factor, independent groups design has advantages and disadvantages. Advantages include the following:

  • The design layout is simple - one factor with k factor levels.
  • Data analysis is easier with this design than with other designs.
  • Computational procedures are identical for fixed-effects and random-effects models.
  • The design does not require equal sample sizes for treatment groups.
  • The design requires subjects to participate in only one treatment group.

Disadvantages include the following:

  • The design does not permit repeated measures.
  • The design can test the effect of only one independent variable.

Test Your Understanding

In analysis of variance, what is a mean square?

(A) The average deviation from the mean. (B) A measure of standard deviation. (C) A measure of variance. (D) A measure of skewness. (E) A vicious geometric shape.

The correct answer is (C). Mean squares are estimates of variance within groups or across groups. Mean squares are used to calculate F ratios, such as the following:

F = MS bg / MS wg

where MS bg is the between-group mean square and MS wg is the within-group mean square.

In the ANOVA table shown below, the P-value is missing. What is the correct entry for the P-value?

Source SS df MS F P-value
BG 300 5 60 3 ???
WG 600 30 20
Total 900 35

Hint: Stat Trek's F Distribution Calculator may be helpful.

(A) 0.01 (B) 0.03 (C) 0.11 (D) 0.89 (E) 0.97

The correct answer is (B).

A P-value is the probability of obtaining a result more extreme (bigger) than the observed F ratio, assuming the null hypothesis is true. From the ANOVA table, we know the following:

  • The observed value of the F ratio is 3.
  • The degrees of freedom (v 1 ) for the between-groups mean square is 5.
  • The degrees of freedom (v 2 ) for the within-groups mean square is 30.

Therefore, the P-value we are looking for is the probability that an F with 5 and 30 degrees of freedom is greater than 3. We want to know:

P [ F(5, 30) > 3 ]

Now, we are ready to use the F Distribution Calculator . We enter the degrees of freedom (v1 = 5) for the between-groups mean square, the degrees of freedom (v2 = 30) for the within-groups mean square, and the F ratio (3) into the calculator; and hit the Calculate button.

The calculator reports that the probability that F is greater (more extreme) than 3 equals about 0.026. Hence, the correct P-value is 0.026.
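The same probability can be obtained with one line of R, without the online calculator:

pf(3, df1 = 5, df2 = 30, lower.tail = FALSE)  # upper-tail probability of F(5, 30) at 3, about 0.026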

Statology

Statistics Made Easy

A Guide to Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity compares an observed correlation matrix to the identity matrix. Essentially, it checks whether there is enough redundancy between the variables that they could be summarized with a smaller number of factors.

The null hypothesis of the test is that the variables are orthogonal, i.e. not correlated. The alternative hypothesis is that the variables are not orthogonal, i.e. they are correlated enough that the correlation matrix diverges significantly from the identity matrix.

This test is often performed before we use a data reduction technique such as principal component analysis or factor analysis to verify that a data reduction technique can actually compress the data in a meaningful way. 

Note: Bartlett’s Test of Sphericity is not the same as Bartlett’s Test for Equality of Variances . This is a common confusion, since the two have similar names.

Correlation Matrix vs. Identity Matrix

A correlation matrix is simply a matrix of values that shows the correlation coefficients between variables. For example, the following correlation matrix shows the correlation coefficients between different variables for professional basketball teams.

Example of a correlation matrix

Correlation coefficients can vary from -1 to 1. The further a value is from 0, the higher the correlation between two variables.

An identity matrix is a matrix in which all of the values along the diagonal are 1 and all of the other values are 0. 

Identity matrix example picture

In this case, if the numbers in this matrix represent correlation coefficients it means that each variable is perfectly orthogonal (i.e. “uncorrelated”) to every other variable and thus a data reduction technique like PCA or factor analysis would not be able to “compress” the data in any meaningful way. 

Thus, the reason we conduct Bartlett’s Test of Sphericity is to make sure that the correlation matrix of the variables in our dataset diverges significantly from the identity matrix, so that we know a data reduction technique is suitable to use.

If the p-value from Bartlett’s Test of Sphericity is lower than our chosen significance level (common choices are 0.10, 0.05, and 0.01), then our dataset is suitable for a data reduction technique.

How to Conduct Bartlett’s Test of Sphericity in R

To conduct Bartlett’s Test of Sphericity in R, we can use the cortest.bartlett() function from the  psych  library. The general syntax for this function is as follows:

cortest.bartlett(R, n)

  • R: a correlation matrix of the dataset
  • n: sample size of the dataset

The following code illustrates how to conduct this test on a fake dataset we created:
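The original code block is not reproduced here; the sketch below shows what such a call could look like. The data frame and its three column names are made up, so it will not return the exact statistics quoted next.

# Sketch: Bartlett's Test of Sphericity on a fake three-variable dataset
library(psych)
set.seed(1)
fake_data <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
R <- cor(fake_data)                       # observed correlation matrix
cortest.bartlett(R, n = nrow(fake_data))  # chi-square statistic, p-value, and df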

The Chi-Square test statistic is 5.252329 and the corresponding p-value is 0.1542258, which is not smaller than our significance level (let’s use 0.05). Thus, this data is likely not suitable for PCA or factor analysis. 

To put this in layman’s terms, the three variables in our dataset are fairly uncorrelated so a data reduction technique like PCA or factor analysis would have a hard time compressing these variables into linear combinations that are able to capture significant variance present in the data.


The JASP guidelines for conducting and reporting a Bayesian analysis

  • Theoretical Review
  • Open access
  • Published: 09 October 2020
  • Volume 28 , pages 813–826, ( 2021 )


  • Johnny van Doorn 1 ,
  • Don van den Bergh 1 ,
  • Udo Böhm 1 ,
  • Fabian Dablander 1 ,
  • Koen Derks 2 ,
  • Tim Draws 1 ,
  • Alexander Etz 3 ,
  • Nathan J. Evans 1 ,
  • Quentin F. Gronau 1 ,
  • Julia M. Haaf 1 ,
  • Max Hinne 1 ,
  • Šimon Kucharský 1 ,
  • Alexander Ly 1 , 4 ,
  • Maarten Marsman 1 ,
  • Dora Matzke 1 ,
  • Akash R. Komarlu Narendra Gupta 1 ,
  • Alexandra Sarafoglou 1 ,
  • Angelika Stefan 1 ,
  • Jan G. Voelkel 5 &
  • Eric-Jan Wagenmakers 1  


Despite the increasing popularity of Bayesian inference in empirical research, few practical guidelines provide detailed recommendations for how to apply Bayesian procedures and interpret the results. Here we offer specific guidelines for four different stages of Bayesian statistical reasoning in a research setting: planning the analysis, executing the analysis, interpreting the results, and reporting the results. The guidelines for each stage are illustrated with a running example. Although the guidelines are geared towards analyses performed with the open-source statistical software JASP, most guidelines extend to Bayesian inference in general.


In recent years, Bayesian inference has become increasingly popular, both in statistical science and in applied fields such as psychology, biology, and econometrics (e.g., Andrews & Baguley, 2013 ; Vandekerckhove, Rouder, & Kruschke, 2018 ). For the pragmatic researcher, the adoption of the Bayesian framework brings several advantages over the standard framework of frequentist null-hypothesis significance testing (NHST), including (1) the ability to obtain evidence in favor of the null hypothesis and discriminate between “absence of evidence” and “evidence of absence” (Dienes, 2014 ; Keysers, Gazzola, & Wagenmakers, 2020 ); (2) the ability to take into account prior knowledge to construct a more informative test (Gronau, Ly, & Wagenmakers, 2020 ; Lee & Vanpaemel, 2018 ); and (3) the ability to monitor the evidence as the data accumulate (Rouder, 2014 ). However, the relative novelty of conducting Bayesian analyses in applied fields means that there are no detailed reporting standards, and this in turn may frustrate the broader adoption and proper interpretation of the Bayesian framework.

Several recent statistical guidelines include information on Bayesian inference, but these guidelines are either minimalist (Appelbaum et al., 2018 ; The BaSiS group, 2001 ), focus only on relatively complex statistical tests (Depaoli & Schoot, 2017 ), are too specific to a certain field (Spiegelhalter, Myles, Jones, & Abrams, 2000 ; Sung et al., 2005 ), or do not cover the full inferential process (Jarosz & Wiley, 2014 ). The current article aims to provide a general overview of the different stages of the Bayesian reasoning process in a research setting. Specifically, we focus on guidelines for analyses conducted in JASP (JASP Team, 2019 ; jasp-stats.org ), although these guidelines can be generalized to other software packages for Bayesian inference. JASP is an open-source statistical software program with a graphical user interface that features both Bayesian and frequentist versions of common tools such as the t test, the ANOVA, and regression analysis (e.g., Marsman & Wagenmakers, 2017 ; Wagenmakers et al., 2018 ).

We discuss four stages of analysis: planning, executing, interpreting, and reporting. These stages and their individual components are summarized in Table  1 . In order to provide a concrete illustration of the guidelines for each of the four stages, each section features a data set reported by Frisby and Clatworthy ( 1975 ). This data set concerns the time it took two groups of participants to see a figure hidden in a stereogram—one group received advance visual information about the scene (i.e., the VV condition), whereas the other group did not (i.e., the NV condition). Footnote 1 Three additional examples (mixed ANOVA, correlation analysis, and a t test with an informed prior) are provided in an online appendix at https://osf.io/nw49j/ . Throughout the paper, we present three boxes that provide additional technical discussion. These boxes, while not strictly necessary, may prove useful to readers interested in greater detail.

Stage 1: Planning the analysis

Specifying the goal of the analysis..

We recommend that researchers carefully consider their goal, that is, the research question that they wish to answer, prior to the study (Jeffreys, 1939 ). When the goal is to ascertain the presence or absence of an effect, we recommend a Bayes factor hypothesis test (see Box 1). The Bayes factor compares the predictive performance of two hypotheses. This underscores an important point: in the Bayes factor testing framework, hypotheses cannot be evaluated until they are embedded in fully specified models with a prior distribution and likelihood (i.e., in such a way that they make quantitative predictions about the data). Thus, when we refer to the predictive performance of a hypothesis, we implicitly refer to the accuracy of the predictions made by the model that encompasses the hypothesis (Etz, Haaf, Rouder, & Vandekerckhove, 2018 ).

When the goal is to determine the size of the effect, under the assumption that it is present, we recommend to plot the posterior distribution or summarize it by a credible interval (see Box 2). Testing and estimation are not mutually exclusive and may be used in sequence; for instance, one may first use a test to ascertain that the effect exists, and then continue to estimate the size of the effect.

Box 1. Hypothesis testing

The principled approach to Bayesian hypothesis testing is by means of the Bayes factor (e.g., Etz & Wagenmakers, 2017 ; Jeffreys, 1939 ; Ly, Verhagen, & Wagenmakers, 2016 ; Wrinch & Jeffreys, 1921 ). The Bayes factor quantifies the relative predictive performance of two rival hypotheses, and it is the degree to which the data demand a change in beliefs concerning the hypotheses’ relative plausibility (see Equation  1 ). Specifically, the first term in Equation  1 corresponds to the prior odds, that is, the relative plausibility of the rival hypotheses before seeing the data. The second term, the Bayes factor, indicates the evidence provided by the data. The third term, the posterior odds, indicates the relative plausibility of the rival hypotheses after having seen the data.
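Equation 1 itself did not carry over into this excerpt; written out in the standard form that matches the ordering of terms just described, it is:

\( \underbrace{\frac{p({\mathscr{H}}_{1})}{p({\mathscr{H}}_{0})}}_{\text{prior odds}} \times \underbrace{\frac{p(\text{data} \mid {\mathscr{H}}_{1})}{p(\text{data} \mid {\mathscr{H}}_{0})}}_{\text{Bayes factor } BF_{10}} = \underbrace{\frac{p({\mathscr{H}}_{1} \mid \text{data})}{p({\mathscr{H}}_{0} \mid \text{data})}}_{\text{posterior odds}} \)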

The subscript in the Bayes factor notation indicates which hypothesis is supported by the data. BF 10 indicates the Bayes factor in favor of \({\mathscr{H}}_{1}\) over \({\mathscr{H}}_{0}\) , whereas BF 01 indicates the Bayes factor in favor of \({\mathscr{H}}_{0}\) over \({\mathscr{H}}_{1}\) . Specifically, BF 10 = 1/BF 01 . Larger values of BF 10 indicate more support for \({\mathscr{H}}_{1}\) . Bayes factors range from 0 to \(\infty \) , and a Bayes factor of 1 indicates that both hypotheses predicted the data equally well. This principle is further illustrated in Figure  4 .

Box 2. Parameter estimation

For Bayesian parameter estimation, interest centers on the posterior distribution of the model parameters. The posterior distribution reflects the relative plausibility of the parameter values after prior knowledge has been updated by means of the data. Specifically, we start the estimation procedure by assigning the model parameters a prior distribution that reflects the relative plausibility of each parameter value before seeing the data. The information in the data is then used to update the prior distribution to the posterior distribution. Parameter values that predicted the data relatively well receive a boost in plausibility, whereas parameter values that predicted the data relatively poorly suffer a decline (Wagenmakers, Morey, & Lee, 2016 ). Equation  2 illustrates this principle. The first term indicates the prior beliefs about the values of parameter 𝜃 . The second term is the updating factor: for each value of 𝜃 , the quality of its prediction is compared to the average quality of the predictions over all values of 𝜃 . The third term indicates the posterior beliefs about 𝜃 .
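Equation 2 is likewise missing from this excerpt; in the standard form that matches the description above (prior beliefs, updating factor, posterior beliefs), it is:

\( \underbrace{p(\theta)}_{\text{prior}} \times \underbrace{\dfrac{p(\text{data} \mid \theta)}{p(\text{data})}}_{\text{updating factor}} = \underbrace{p(\theta \mid \text{data})}_{\text{posterior}} \)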

The posterior distribution can be plotted or summarized by an x % credible interval. An x % credible interval contains x % of the posterior mass. Two popular ways of creating a credible interval are the highest density credible interval, which is the narrowest interval containing the specified mass, and the central credible interval, which is created by cutting off \(\frac {100-x}{2}\%\) from each of the tails of the posterior distribution.

Specifying the statistical model.

The functional form of the model (i.e., the likelihood; Etz, 2018 ) is guided by the nature of the data and the research question. For instance, if interest centers on the association between two variables, one may specify a bivariate normal model in order to conduct inference on Pearson’s correlation parameter ρ . The statistical model also determines which assumptions ought to be satisfied by the data. For instance, the statistical model might assume the dependent variable to be normally distributed. Violations of assumptions may be addressed at different points in the analysis, such as the data preprocessing steps discussed below, or by planning to conduct robust inferential procedures as a contingency plan.

The next step in model specification is to determine the sidedness of the procedure. For hypothesis testing, this means deciding whether the procedure is one-sided (i.e., the alternative hypothesis dictates a specific direction of the population effect) or two-sided (i.e., the alternative hypothesis dictates that the effect can be either positive or negative). The choice of one-sided versus two-sided depends on the research question at hand and this choice should be theoretically justified prior to the study. For hypothesis testing it is usually the case that the alternative hypothesis posits a specific direction. In Bayesian hypothesis testing, a one-sided hypothesis yields a more diagnostic test than a two-sided alternative (e.g., Jeffreys, 1961 ; Wetzels, Raaijmakers, Jakab, & Wagenmakers, 2009 , p.283). Footnote 2

For parameter estimation, we recommend always using the two-sided model rather than the one-sided model: when a positive one-sided model is specified but the observed effect turns out to be negative, all of the posterior mass will nevertheless remain on the positive values, falsely suggesting the presence of a small positive effect.

The next step in model specification concerns the type and spread of the prior distribution, including its justification. For the most common statistical models (e.g., correlations, t tests, and ANOVA), certain “default” prior distributions are available that can be used in cases where prior knowledge is absent, vague, or difficult to elicit (for more information, see Ly et al., 2016). These priors are default options in JASP. In cases where prior information is present, different “informed” prior distributions may be specified. However, the more the informed priors deviate from the default priors, the stronger the need for justification becomes (see the informed t test example in the online appendix at https://osf.io/ybszx/ ). Additionally, the robustness of the result to different prior distributions can be explored and included in the report. This is an important type of robustness check, because the choice of prior can sometimes impact our inferences, for instance in experiments with small sample sizes or missing data. In JASP, Bayes factor robustness plots show the Bayes factor for a wide range of prior distributions, allowing researchers to quickly examine the extent to which their conclusions depend on their prior specification. An example of such a plot is given later in Figure 7.
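Outside of JASP, a comparable robustness check can be sketched with the BayesFactor package in R. The data below are simulated purely for illustration; only the idea of recomputing the Bayes factor across several plausible prior widths carries over from the text above.

```r
# Minimal sketch of a prior robustness check: recompute the Bayes factor for
# several plausible widths r of the Cauchy prior on the effect size.
library(BayesFactor)
set.seed(123)
x <- rnorm(40, mean = 0.4)   # hypothetical group 1 scores
y <- rnorm(40, mean = 0.0)   # hypothetical group 2 scores
r_values <- c(0.5, 1 / sqrt(2), 1, sqrt(2))
bfs <- sapply(r_values, function(r) extractBF(ttestBF(x = x, y = y, rscale = r))$bf)
round(data.frame(r = r_values, BF10 = bfs), 3)
```

If the Bayes factors point in the same direction across this range of widths, the conclusion does not hinge on the particular prior specification.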

Specifying data preprocessing steps.

Depending on the goal of the analysis and the statistical model, different data preprocessing steps might be taken. For instance, if the statistical model assumes normally distributed data, a transformation to normality (e.g., the logarithmic transformation) might be considered (e.g., Draper & Cox, 1969). Other points to consider at this stage are when and how outliers may be identified and accounted for, which variables are to be analyzed, and whether further transformations or combinations of variables are necessary. These decisions can be somewhat arbitrary, and yet may exert a large influence on the results (Wicherts et al., 2016). In order to assess the degree to which the conclusions are robust to such arbitrary modeling decisions, it is advisable to conduct a multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016). Preferably, the multiverse analysis is specified at study onset. A multiverse analysis can easily be conducted in JASP, but doing so is not the goal of the current paper.
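As a small illustration of such preprocessing decisions, the R sketch below log-transforms a hypothetical, right-skewed reaction time variable and flags potential outliers with the common 1.5 × IQR boxplot rule; the variable and the cutoff are invented for illustration and are not part of any particular analysis plan.

```r
set.seed(1)
rt <- rlnorm(100, meanlog = 2, sdlog = 0.6)   # hypothetical skewed reaction times
log_rt <- log(rt)                             # transformation toward normality
is_outlier <- rt %in% boxplot.stats(rt)$out   # values beyond the 1.5 * IQR whiskers
c(n_outliers = sum(is_outlier))
hist(log_rt, main = "Log-transformed reaction times")
```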

Specifying the sampling plan.

As may be expected from a framework for the continual updating of knowledge, Bayesian inference allows researchers to monitor evidence as the data come in, and stop whenever they like, for any reason whatsoever. Thus, strictly speaking there is no Bayesian need to pre-specify sample size at all (e.g., Berger & Wolpert, 1988). Nevertheless, Bayesians are free to specify a sampling plan if they so desire; for instance, one may commit to stop data collection as soon as BF 10 ≥ 10 or BF 01 ≥ 10. This approach can also be combined with a maximum sample size ( N ), where data collection stops when either the maximum N or the desired Bayes factor is obtained, whichever comes first (for examples, see Matzke et al., 2015; Wagenmakers et al., 2015).
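The R sketch below illustrates such a sequential sampling plan on simulated data, using the BayesFactor package as a stand-in for the t test; the effect size, batch size, and stopping thresholds are hypothetical choices made only for this illustration.

```r
# Collect observations in batches of 5 per group and stop once BF10 >= 10,
# BF01 >= 10, or a maximum of 100 participants per group is reached.
library(BayesFactor)
set.seed(1)
x <- y <- numeric(0)
repeat {
  x <- c(x, rnorm(5, mean = 0.5))  # group 1: simulated data with a true effect
  y <- c(y, rnorm(5, mean = 0.0))  # group 2
  bf10 <- extractBF(ttestBF(x = x, y = y))$bf
  if (bf10 >= 10 || 1 / bf10 >= 10 || length(x) >= 100) break
}
c(n_per_group = length(x), BF10 = round(bf10, 2))
```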

In order to examine what sampling plans are feasible, researchers can conduct a Bayes factor design analysis (Schönbrodt & Wagenmakers, 2018 ; Stefan, Gronau, Schönbrodt, & Wagenmakers, 2019 ), a method that shows the predicted outcomes for different designs and sampling plans. Of course, when the study is observational and the data are available ‘en bloc’, the sampling plan becomes irrelevant in the planning stage.

Stereogram example

First, we consider the research goal, which was to determine if participants who receive advance visual information exhibit a shorter fuse time (Frisby & Clatworthy, 1975 ). A Bayes factor hypothesis test can be used to quantify the evidence that the data provide for and against the hypothesis that an effect is present. Should this test reveal support in favor of the presence of the effect, then we have grounds for a follow-up analysis in which the size of the effect is estimated.

Second, we specify the statistical model. The study focuses on the difference in performance between two between-subjects conditions, suggesting that a two-sample t test on the fuse times is appropriate. The main measure of the study is a reaction time variable, which can for various reasons be non-normally distributed (Lo & Andrews, 2015; but see Schramm & Rouder, 2019). If our data show signs of non-normality, we will conduct two alternative analyses: a t test on the log-transformed fuse times and a non-parametric test (i.e., the Mann–Whitney U test), which is robust to non-normality and unaffected by the log-transformation of the fuse times.

For hypothesis testing, we compare the null hypothesis (i.e., advance visual information has no effect on fuse times) to a one-sided alternative hypothesis (i.e., advance visual information shortens the fuse times), in line with the directional nature of the original research question. The rival hypotheses are thus \({\mathscr{H}}_{0}: \delta = 0\) and \({\mathscr{H}}_{+}: \delta > 0\) , where δ is the standardized effect size (i.e., the population version of Cohen’s d ), \({\mathscr{H}}_{0}\) denotes the null hypothesis, and \({\mathscr{H}}_{+}\) denotes the one-sided alternative hypothesis (note the ‘+’ in the subscript). For parameter estimation (under the assumption that the effect exists), we use the two-sided t test model and plot the posterior distribution of δ . This distribution can also be summarized by a 95 % central credible interval.

We complete the model specification by assigning prior distributions to the model parameters. Since we have little prior knowledge about the topic, we select a default prior option for the two-sample t test, that is, a Cauchy distribution Footnote 3 with spread r set to \({1}/{\sqrt {2}}\) . Since we specified a one-sided alternative hypothesis, the prior distribution is truncated at zero, such that only positive effect size values are allowed. The robustness of the Bayes factor to this prior specification can easily be assessed in JASP by means of a Bayes factor robustness plot.
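For readers working outside JASP, this specification can be sketched with the BayesFactor package in R. The fuse times below are simulated placeholders (the real data are analyzed later); what carries over from the plan is the Cauchy width of 1/√2 and the restriction of the alternative to positive effect sizes via the nullInterval argument.

```r
library(BayesFactor)
set.seed(2025)
log_nv <- rnorm(43, mean = 2.2, sd = 0.8)  # hypothetical log fuse times, NV group
log_vv <- rnorm(35, mean = 1.8, sd = 0.8)  # hypothetical log fuse times, VV group
bf <- ttestBF(x = log_nv, y = log_vv,      # delta is the standardized difference x - y
              rscale = 1 / sqrt(2),        # Cauchy prior width r = 1/sqrt(2)
              nullInterval = c(0, Inf))    # one-sided: only delta > 0 allowed
bf  # first row: BF for delta > 0 vs. delta = 0; second row: the complementary interval
```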

Since the data are already available, we do not have to specify a sampling plan. The original data set has a total sample size of 103, from which 25 participants were eliminated due to failing an initial stereo-acuity test, leaving 78 participants (43 in the NV condition and 35 in the VV condition). The data are available online at https://osf.io/5vjyt/ .

Stage 2: Executing the analysis

Before executing the primary analysis and interpreting the outcome, it is important to confirm that the intended analyses are appropriate and the models are not grossly misspecified for the data at hand. In other words, it is strongly recommended to examine the validity of the model assumptions (e.g., normally distributed residuals or equal variances across groups). Such assumptions may be checked by plotting the data, inspecting summary statistics, or conducting formal assumption tests (but see Tijmstra, 2018 ).

A powerful demonstration of the dangers of failing to check the assumptions is provided by Anscombe’s quartet (Anscombe, 1973 ; see Fig.  1 ). The quartet consists of four fictitious data sets of equal size that each have the same observed Pearson’s product moment correlation r , and therefore lead to the same inferential result both in a frequentist and a Bayesian framework. However, visual inspection of the scatterplots immediately reveals that three of the four data sets are not suitable for a linear correlation analysis, and the statistical inference for these three data sets is meaningless or even misleading. This example highlights the adage that conducting a Bayesian analysis does not safeguard against general statistical malpractice—the Bayesian framework is as vulnerable to violations of assumptions as its frequentist counterpart. In cases where assumptions are violated, an ordinal or non-parametric test can be used, and the parametric results should be interpreted with caution.

Figure 1. Model misspecification is also a problem for Bayesian analyses. The four scatterplots in the top panel show Anscombe’s quartet (Anscombe, 1973); the bottom panel shows the corresponding inference, which is identical for all four scatterplots. Except for the leftmost scatterplot, all data sets violate the assumptions of the linear correlation analysis in important ways

Once the quality of the data has been confirmed, the planned analyses can be carried out. JASP offers a graphical user interface for both frequentist and Bayesian analyses. JASP 0.10.2 features the following Bayesian analyses: the binomial test, the Chi-square test, the multinomial test, the t test (one-sample, paired sample, two-sample, Wilcoxon rank-sum, and Wilcoxon signed-rank tests), A/B tests, ANOVA, ANCOVA, repeated measures ANOVA, correlations (Pearson’s ρ and Kendall’s τ ), linear regression, and log-linear regression. After loading the data into JASP, the desired analysis can be conducted by dragging and dropping variables into the appropriate boxes; tick marks can be used to select the desired output.

The resulting output (i.e., figures and tables) can be annotated and saved as a .jasp file. Output can then be shared with peers, with or without the real data in the .jasp file; if the real data are added, reviewers can easily reproduce the analyses, conduct alternative analyses, or insert comments.

In order to check for violations of the assumptions of the t test, the top row of Fig. 2 shows boxplots and Q-Q plots of the dependent variable fuse time, split by condition. Visual inspection of the boxplots suggests that the variances of the fuse times are not equal (observed standard deviations of the NV and VV groups are 8.085 and 4.802, respectively), so the equal-variance assumption is unlikely to hold. There also appear to be a number of potential outliers in both groups. Moreover, the Q-Q plots show that the normality assumption of the t test is untenable here. Thus, in line with our analysis plan, we will apply the log-transformation to the fuse times. The standard deviations of the log-transformed fuse times in the groups are roughly equal (observed standard deviations are 0.814 and 0.818 in the NV and the VV group, respectively); the Q-Q plots in the bottom row of Fig. 2 also look acceptable for both groups and there are no apparent outliers. However, it seems prudent to assess the robustness of the result by also conducting the Bayesian Mann–Whitney U test (van Doorn, Ly, Marsman, & Wagenmakers, 2020) on the fuse times.
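Readers who want to reproduce such checks outside JASP can do so with a few lines of base R. The snippet below uses simulated stand-in data (lognormal fuse times with the group sizes from the study); the real data set is available at the OSF link given above.

```r
set.seed(2025)
fuse  <- c(rlnorm(43, meanlog = 2.2, sdlog = 0.8),   # hypothetical NV fuse times
           rlnorm(35, meanlog = 1.8, sdlog = 0.8))   # hypothetical VV fuse times
group <- rep(c("NV", "VV"), times = c(43, 35))

tapply(fuse, group, sd)        # unequal spread on the raw scale
tapply(log(fuse), group, sd)   # roughly equal spread after log-transformation

par(mfrow = c(1, 2))
boxplot(fuse ~ group, ylab = "Fuse time (s)")
qqnorm(log(fuse[group == "NV"]), main = "NV, log scale")
qqline(log(fuse[group == "NV"]))
```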

Figure 2. Descriptive plots allow a visual assessment of the assumptions of the t test for the stereogram data. The top row shows descriptive plots for the raw fuse times, and the bottom row shows descriptive plots for the log-transformed fuse times. The left column shows boxplots, including the jittered data points, for each of the experimental conditions. The middle and right columns show Q-Q plots of the dependent variable, split by experimental condition. Here we see that the log-transformed dependent variable is more appropriate for the t test, due to its distribution and absence of outliers. Figures from JASP

Following the assumption check, we proceed to execute the analyses in JASP. For hypothesis testing, we obtain a Bayes factor using the one-sided Bayesian two-sample t test. Figure  3 shows the JASP user interface for this procedure. For parameter estimation, we obtain a posterior distribution and credible interval, using the two-sided Bayesian two-sample t test. The relevant boxes for the various plots were ticked, and an annotated .jasp file was created with all of the relevant analyses: the one-sided Bayes factor hypothesis tests, the robustness check, the posterior distribution from the two-sided analysis, and the one-sided results of the Bayesian Mann–Whitney U test. The .jasp file can be found at https://osf.io/nw49j/ . The next section outlines how these results are to be interpreted.

Figure 3. JASP menu for the Bayesian two-sample t test. The left input panel offers the analysis options, including the specification of the alternative hypothesis and the selection of plots. The right output panel shows the corresponding analysis output. The prior and posterior plot is explained in more detail in Fig. 6. The input panel specifies the one-sided analysis for hypothesis testing; a two-sided analysis for estimation can be obtained by selecting “Group 1 ≠ Group 2” under “Alt. Hypothesis”

Stage 3: Interpreting the results

With the analysis outcome in hand, we are ready to draw conclusions. We first discuss the scenario of hypothesis testing, where the goal typically is to conclude whether an effect is present or absent. Then, we discuss the scenario of parameter estimation, where the goal is to estimate the size of the population effect, assuming it is present. When both hypothesis testing and estimation procedures have been planned and executed, there is no predetermined order for their interpretation. One may adhere to the adage “only estimate something when there is something to be estimated” (Wagenmakers et al., 2018 ) and first test whether an effect is present, and then estimate its size (assuming the test provided sufficiently strong evidence against the null), or one may first estimate the magnitude of an effect, and then quantify the degree to which this magnitude warrants a shift in plausibility away from or toward the null hypothesis (but see Box 3).

If the goal of the analysis is hypothesis testing, we recommend using the Bayes factor. As described in Box 1, the Bayes factor quantifies the relative predictive performance of two rival hypotheses (Wagenmakers et al., 2016). Importantly, the Bayes factor is a relative metric of the hypotheses’ predictive quality. For instance, if BF 10 = 5, this means that the data are 5 times more likely under \({\mathscr{H}}_{1}\) than under \({\mathscr{H}}_{0}\) . However, a Bayes factor in favor of \({\mathscr{H}}_{1}\) does not mean that \({\mathscr{H}}_{1}\) predicts the data well. As Figure 1 illustrates, \({\mathscr{H}}_{1}\) provides a dreadful account of three out of four data sets, yet is still supported relative to \({\mathscr{H}}_{0}\) .

There can be no hard Bayes factor bound (other than zero and infinity) for accepting or rejecting a hypothesis wholesale, but there have been some attempts to classify the strength of evidence that different Bayes factors provide (e.g., Jeffreys, 1939 ; Kass & Raftery, 1995 ). One such classification scheme is shown in Figure  4 . Several magnitudes of the Bayes factor are visualized as a probability wheel, where the proportion of red to white is determined by the degree of evidence in favor of \({\mathscr{H}}_{0}\) and \({\mathscr{H}}_{1}\) . Footnote 4 In line with Jeffreys, a Bayes factor between 1 and 3 is considered weak evidence, a Bayes factor between 3 and 10 is considered moderate evidence, and a Bayes factor greater than 10 is considered strong evidence. Note that these classifications should only be used as general rules of thumb to facilitate communication and interpretation of evidential strength. Indeed, one of the merits of the Bayes factor is that it offers an assessment of evidence on a continuous scale.

Figure 4. A graphical representation of a Bayes factor classification table. As the Bayes factor deviates from 1, which indicates equal support for \({\mathscr{H}}_{0}\) and \({\mathscr{H}}_{1}\) , more support is gained for either \({\mathscr{H}}_{0}\) or \({\mathscr{H}}_{1}\) . Bayes factors between 1 and 3 are considered to be weak, Bayes factors between 3 and 10 are considered moderate, and Bayes factors greater than 10 are considered strong evidence. The Bayes factors are also represented as probability wheels, where the ratio of white (i.e., support for \({\mathscr{H}}_{0}\) ) to red (i.e., support for \({\mathscr{H}}_{1}\) ) surface is a function of the Bayes factor. The probability wheels further underscore the continuous scale of evidence that Bayes factors represent. These classifications are heuristic and should not be misused as an absolute rule for all-or-nothing conclusions

When the goal of the analysis is parameter estimation, the posterior distribution is key (see Box 2). The posterior distribution is often summarized by a location parameter (point estimate) and uncertainty measure (interval estimate). For point estimation, the posterior median (reported by JASP), mean, or mode can be reported, although these do not contain any information about the uncertainty of the estimate. In order to capture the uncertainty of the estimate, an x % credible interval can be reported. A credible interval [L, U] implies that there is an x % probability that the true parameter lies between L and U (an interpretation that is often wrongly attributed to frequentist confidence intervals, see Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016). For example, if we obtain a 95 % credible interval of [− 1, 0.5] for effect size δ , we can be 95 % certain that the true value of δ lies between − 1 and 0.5, assuming that the alternative hypothesis we specify is true. In case one does not want to make this assumption, one can present the unconditional posterior distribution instead. For more discussion on this point, see Box 3.

Box 3. Conditional vs. unconditional inference.

A widely accepted view on statistical inference is neatly summarized by Fisher ( 1925 ), who states that “it is a useful preliminary before making a statistical estimate \(\dots \) to test if there is anything to justify estimation at all” (p. 300; see also Haaf, Ly, & Wagenmakers, 2019 ). In the Bayesian framework, this stance naturally leads to posterior distributions conditional on \({\mathscr{H}}_{1}\) , which ignores the possibility that the null value could be true. Generally, when we say “prior distribution” or “posterior distribution” we are following convention and referring to such conditional distributions. However, only presenting conditional posterior distributions can potentially be misleading in cases where the null hypothesis remains relatively plausible after seeing the data. A general benefit of Bayesian analysis is that one can compute an unconditional posterior distribution for the parameter using model averaging (e.g., Clyde, Ghosh, & Littman, 2011 ; Hinne, Gronau, Bergh, & Wagenmakers, 2020 ). An unconditional posterior distribution for a parameter accounts for both the uncertainty about the parameter within any one model and the uncertainty about the model itself, providing an estimate of the parameter that is a compromise between the candidate models (for more details see Hoeting, Madigan, Raftery, & Volinsky, 1999 ). In the case of a t test, which features only the null and the alternative hypothesis, the unconditional posterior consists of a mixture between a spike under \({\mathscr{H}}_{0}\) and a bell-shaped posterior distribution under \({\mathscr{H}}_{1}\) (Rouder, Haaf, & Vandekerckhove, 2018 ; van den Bergh, Haaf, Ly, Rouder, & Wagenmakers, 2019 ). Figure  5 illustrates this approach for the stereogram example.

Figure 5. Updating the unconditional prior distribution to the unconditional posterior distribution for the stereogram example. The left panel shows the unconditional prior distribution, which is a mixture between the prior distributions under \({\mathscr{H}}_{0}\) and \({\mathscr{H}}_{1}\) . The prior distribution under \({\mathscr{H}}_{0}\) is a spike at the null value, indicated by the dotted line; the prior distribution under \({\mathscr{H}}_{1}\) is a Cauchy distribution, indicated by the gray mass. The mixture proportion is determined by the prior model probabilities \(p({\mathscr{H}}_{0})\) and \(p({\mathscr{H}}_{1})\) . The right panel shows the unconditional posterior distribution, after updating the prior distribution with the data D. This distribution is a mixture between the posterior distributions under \({\mathscr{H}}_{0}\) and \({\mathscr{H}}_{1}\) , where the mixture proportion is determined by the posterior model probabilities \(p({\mathscr{H}}_{0} \mid D)\) and \(p({\mathscr{H}}_{1} \mid D)\) . Since \(p({\mathscr{H}}_{1} \mid D) = 0.7\) (i.e., the data provide support for \({\mathscr{H}}_{1}\) over \({\mathscr{H}}_{0}\) ), about 70% of the unconditional posterior mass is comprised of the posterior mass under \({\mathscr{H}}_{1}\) , indicated by the gray mass. Thus, the unconditional posterior distribution provides information about plausible values for δ , while taking into account the uncertainty of \({\mathscr{H}}_{1}\) being true. In both panels, the dotted line and gray mass have been rescaled such that the height of the dotted line and the highest point of the gray mass reflect the prior (left) and posterior (right) model probabilities

Common pitfalls in interpreting Bayesian results

Bayesian veterans sometimes argue that Bayesian concepts are intuitive and easier to grasp than frequentist concepts. However, in our experience there exist persistent misinterpretations of Bayesian results. Here we list five:

The Bayes factor does not equal the posterior odds; in fact, the posterior odds are equal to the Bayes factor multiplied by the prior odds (see also Equation  1 ). These prior odds reflect the relative plausibility of the rival hypotheses before seeing the data (e.g., 50/50 when both hypotheses are equally plausible, or 80/20 when one hypothesis is deemed to be four times more plausible than the other). For instance, a proponent and a skeptic may differ greatly in their assessment of the prior plausibility of a hypothesis; their prior odds differ, and, consequently, so will their posterior odds. However, as the Bayes factor is the updating factor from prior odds to posterior odds, proponent and skeptic ought to change their beliefs to the same degree (assuming they agree on the model specification, including the parameter prior distributions).

Prior model probabilities (i.e., prior odds) and parameter prior distributions play different conceptual roles. Footnote 5 The former concerns prior beliefs about the hypotheses, for instance that both \({\mathscr{H}}_{0}\) and \({\mathscr{H}}_{1}\) are equally plausible a priori. The latter concerns prior beliefs about the model parameters within a model, for instance that all values of Pearson’s ρ are equally likely a priori (i.e., a uniform prior distribution on the correlation parameter). Prior model probabilities and parameter prior distributions can be combined to one unconditional prior distribution as described in Box 3 and Fig.  5 .

The Bayes factor and credible interval have different purposes and can yield different conclusions. Specifically, the typical credible interval for an effect size is conditional on \({\mathscr{H}}_{1}\) being true and quantifies the strength of an effect, assuming it is present (but see Box 3); in contrast, the Bayes factor quantifies evidence for the presence or absence of an effect. A common misconception is to conduct a “hypothesis test” by inspecting only credible intervals. Berger ( 2006 , p. 383) remarks: “[...] Bayesians cannot test precise hypotheses using confidence intervals. In classical statistics one frequently sees testing done by forming a confidence region for the parameter, and then rejecting a null value of the parameter if it does not lie in the confidence region. This is simply wrong if done in a Bayesian formulation (and if the null value of the parameter is believable as a hypothesis).”

The strength of evidence in the data is easy to overstate: a Bayes factor of 3 provides some support for one hypothesis over another, but should not warrant the confident all-or-none acceptance of that hypothesis.

The results of an analysis always depend on the questions that were asked. Footnote 6 For instance, choosing a one-sided analysis over a two-sided analysis will impact both the Bayes factor and the posterior distribution. For an illustration of this, see Fig. 6 for a comparison between one-sided and two-sided results.

In order to avoid these and other pitfalls, we recommend that researchers who are doubtful about the correct interpretation of their Bayesian results solicit expert advice (for instance through the JASP forum at http://forum.cogsci.nl ).

For hypothesis testing, the results of the one-sided t test are presented in Fig. 6 a. The resulting BF + 0 is 4.567, indicating moderate evidence in favor of \({\mathscr{H}}_{+}\) : the data are approximately 4.6 times more likely under \({\mathscr{H}}_{+}\) than under \({\mathscr{H}}_{0}\) . To assess the robustness of this result, we also planned a Mann–Whitney U test. The resulting BF + 0 is 5.191, qualitatively similar to the Bayes factor from the parametric test. Additionally, we could have specified a multiverse analysis where data exclusion criteria (i.e., exclusion vs. no exclusion), the type of test (i.e., Mann–Whitney U vs. t test), and data transformations (i.e., log-transformed vs. raw fuse times) are varied. Typically in multiverse analyses these three decisions would be crossed, resulting in at least eight different analyses. However, in our case some of these analyses are implausible or redundant. First, because the Mann–Whitney U test is unaffected by the log transformation, the log-transformed and raw fuse times yield the same results. Second, due to the multiple assumption violations, the t test model for raw fuse times is misspecified and hence we do not trust the validity of its result. Third, we do not know which observations were excluded by Frisby and Clatworthy (1975). Consequently, only two of these eight analyses are relevant. Footnote 7 Furthermore, a more comprehensive multiverse analysis could also consider the Bayes factors from two-sided tests (i.e., BF 10 = 2.323 for the t test and BF 10 = 2.557 for the Mann–Whitney U test). However, these tests are not in line with the theory under consideration, as they answer a different theoretical question (see “Specifying the statistical model” in the Planning section).

Figure 6. Bayesian two-sample t test for the parameter δ . The probability wheel on top visualizes the evidence that the data provide for the two rival hypotheses. The two gray dots indicate the prior and posterior density at the test value (Dickey & Lientz, 1970; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010). The median and the 95 % central credible interval of the posterior distribution are shown in the top right corner. The left panel shows the one-sided procedure for hypothesis testing and the right panel shows the two-sided procedure for parameter estimation. Both figures from JASP

For parameter estimation, the results of the two-sided t test are presented in Fig. 6 b. The 95 % central credible interval for δ is relatively wide, ranging from 0.046 to 0.904: this means that, under the assumption that the effect exists and given the model we specified, we can be 95 % certain that the true value of δ lies between 0.046 and 0.904. In conclusion, there is moderate evidence for the presence of an effect, and large uncertainty about its size.

Stage 4: Reporting the results

For increased transparency, and to allow a skeptical assessment of the statistical claims, we recommend presenting an elaborate analysis report including relevant tables, figures, assumption checks, and background information. The extent to which this needs to be done in the manuscript itself depends on context. Ideally, an annotated .jasp file is created that presents the full results and analysis settings. The resulting file can then be uploaded to the Open Science Framework (OSF; https://osf.io ), where it can be viewed by collaborators and peers, even without having JASP installed. Note that the .jasp file retains the settings that were used to create the reported output. Analyses not conducted in JASP should mimic such transparency, for instance through uploading an R script. In this section, we list several desiderata for reporting, both for hypothesis testing and parameter estimation. What to include in the report depends on the goal of the analysis, regardless of whether the result is conclusive or not.

In all cases, we recommend providing a complete description of the prior specification (i.e., the type of distribution and its parameter values) and, especially for informed priors, a justification for the choices that were made. When reporting a specific analysis, we advise referring to the relevant background literature for details. In JASP, the relevant references for specific tests can be copied from the drop-down menus in the results panel.

When the goal of the analysis is hypothesis testing, it is key to outline which hypotheses are compared by clearly stating each hypothesis and including the corresponding subscript in the Bayes factor notation. Furthermore, we recommend including, if available, the Bayes factor robustness check discussed in the section on planning (see Fig. 7 for an example). This check provides an assessment of the robustness of the Bayes factor under different prior specifications: if the qualitative conclusions do not change across a range of different plausible prior distributions, this indicates that the analysis is relatively robust. If this plot is unavailable, the robustness of the Bayes factor can be checked manually by specifying several different prior distributions (see the mixed ANOVA analysis in the online appendix at https://osf.io/wae57/ for an example). When data come in sequentially, it may also be of interest to examine the sequential Bayes factor plot, which shows the evidential flow as a function of increasing sample size.

Figure 7. The Bayes factor robustness plot. The maximum BF + 0 is attained when setting the prior width r to 0.38. The plot indicates BF + 0 for the user specified prior ( \(r = {1}/{\sqrt {2}}\) ), wide prior ( r = 1), and ultrawide prior ( \(r = \sqrt {2}\) ). The evidence for the alternative hypothesis is relatively stable across a wide range of prior distributions, suggesting that the analysis is robust. However, the evidence in favor of \({\mathscr{H}}_{+}\) is not particularly strong and will not convince a skeptic

When the goal of the analysis is parameter estimation, it is important to present a plot of the posterior distribution, or report a summary, for instance through the median and a 95 % credible interval. Ideally, the results of the analysis are reported both graphically and numerically. This means that, when possible, a plot is presented that includes the posterior distribution, prior distribution, Bayes factor, 95 % credible interval, and posterior median. Footnote 8
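Outside JASP, analogous numerical summaries can be obtained by sampling from the posterior. The R sketch below uses the BayesFactor package on simulated data; the column name "delta" for the standardized effect size reflects current package behavior but should be treated as an assumption to verify.

```r
library(BayesFactor)
set.seed(1)
x <- rnorm(40, mean = 0.4)                         # hypothetical group 1
y <- rnorm(40, mean = 0.0)                         # hypothetical group 2
bf <- ttestBF(x = x, y = y, rscale = 1 / sqrt(2))  # two-sided model for estimation
post <- posterior(bf, iterations = 10000)          # MCMC samples from the posterior
median(post[, "delta"])                            # posterior median of the effect size
quantile(post[, "delta"], c(0.025, 0.975))         # 95% central credible interval
```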

Numeric results can be presented either in a table or in the main text. If relevant, we recommend reporting the results of both the estimation and the hypothesis test. For some analyses, the results are based on a numerical algorithm, such as Markov chain Monte Carlo (MCMC), which yields an error percentage. If applicable and available, the error percentage ought to be reported too, to indicate the numeric robustness of the result. Lower values of the error percentage indicate greater numerical stability of the result. Footnote 9 In order to increase numerical stability, JASP includes an option to increase the number of samples for MCMC sampling when applicable.

This is an example report of the stereograms t test example:

Here we summarize the results of the Bayesian analysis for the stereogram data. For this analysis we used the Bayesian t test framework proposed by Jeffreys (1961; see also Rouder et al., 2009). We analyzed the data with JASP (JASP Team, 2019). An annotated .jasp file, including distribution plots, data, and input options, is available at https://osf.io/25ekj/ . Due to model misspecification (i.e., non-normality, presence of outliers, and unequal variances), we applied a log-transformation to the fuse times. This remedied the misspecification. To assess the robustness of the results, we also applied a Mann–Whitney U test. First, we discuss the results for hypothesis testing. The null hypothesis postulates that there is no difference in log fuse time between the groups and therefore \({\mathscr{H}}_{0}: \delta = 0\) . The one-sided alternative hypothesis states that only positive values of δ are possible, and assigns more prior mass to values closer to 0 than extreme values. Specifically, δ was assigned a Cauchy prior distribution with \(r ={1}/{\sqrt {2}}\) , truncated to allow only positive effect size values. Figure 6 a shows that the Bayes factor indicates evidence for \({\mathscr{H}}_{+}\) ; specifically, BF + 0 = 4.567, which means that the data are approximately 4.6 times more likely to occur under \({\mathscr{H}}_{+}\) than under \({\mathscr{H}}_{0}\) . This result indicates moderate evidence in favor of \({\mathscr{H}}_{+}\) . The error percentage is < 0.001 %, which indicates great stability of the numerical algorithm that was used to obtain the result. The Mann–Whitney U test yielded a qualitatively similar result, BF + 0 = 5.191. In order to assess the robustness of the Bayes factor to our prior specification, Fig. 7 shows BF + 0 as a function of the prior width r . Across a wide range of widths, the Bayes factor appears to be relatively stable, ranging from about 3 to 5. Second, we discuss the results for parameter estimation. Of interest is the posterior distribution of the standardized effect size δ (i.e., the population version of Cohen’s d , the standardized difference in mean fuse times). For parameter estimation, δ was assigned a Cauchy prior distribution with \(r ={1}/{\sqrt {2}}\) . Figure 6 b shows that the median of the resulting posterior distribution for δ equals 0.47 with a central 95% credible interval for δ that ranges from 0.046 to 0.904. If the effect is assumed to exist, there remains substantial uncertainty about its size, with values close to 0 having the same posterior density as values close to 1.

Limitations and challenges

The Bayesian toolkit for the empirical social scientist still has some limitations to overcome. First, for some frequentist analyses, the Bayesian counterpart has not yet been developed or implemented in JASP. Second, some analyses in JASP currently provide only a Bayes factor, and not a visual representation of the posterior distributions, for instance due to the multidimensional parameter space of the model. Third, some analyses in JASP are only available with a relatively limited set of prior distributions. However, these are not principled limitations, and the software is actively being developed to overcome them. When dealing with more complex models that go beyond the staple analyses such as t tests, there exist a number of software packages that allow custom coding, such as JAGS (Plummer, 2003) or Stan (Carpenter et al., 2017). Another option for Bayesian inference is to code the analyses in a programming language such as R (R Core Team, 2018) or Python (van Rossum, 1995). This requires a certain degree of programming ability, but grants the user more flexibility. Popular packages for conducting Bayesian analyses in R are the BayesFactor package (Morey & Rouder, 2015) and the brms package (Bürkner, 2017), among others (see https://cran.r-project.org/web/views/Bayesian.html for a more exhaustive list). For Python, a popular package for Bayesian analyses is PyMC3 (Salvatier, Wiecki, & Fonnesbeck, 2016). The practical guidelines provided in this paper can largely be generalized to the application of these software programs.

Concluding comments

We have attempted to provide concise recommendations for planning, executing, interpreting, and reporting Bayesian analyses. These recommendations are summarized in Table  1 . Our guidelines focused on the standard analyses that are currently featured in JASP. When going beyond these analyses, some of the discussed guidelines will be easier to implement than others. However, the general process of transparent, comprehensive, and careful statistical reporting extends to all Bayesian procedures and indeed to statistical analyses across the board.

The variables are participant number, the time (in seconds) each participant needed to see the hidden figure (i.e., fuse time), experimental condition (VV = with visual information, NV = without visual information), and the log-transformed fuse time.

A one-sided alternative hypothesis makes a more risky prediction than a two-sided hypothesis. Consequently, if the data are in line with the one-sided prediction, the one-sided alternative hypothesis is rewarded with a greater gain in plausibility compared to the two-sided alternative hypothesis; if the data oppose the one-sided prediction, the one-sided alternative hypothesis is penalized with a greater loss in plausibility compared to the two-sided alternative hypothesis.

The fat-tailed Cauchy distribution is a popular default choice because it fulfills particular desiderata; see Jeffreys (1961), Liang, Paulo, Molina, Clyde, and Berger (2008), Ly et al. (2016), and Rouder, Speckman, Sun, Morey, and Iverson (2009) for details.

Specifically, the proportion of red is the posterior probability of \({\mathscr{H}}_{1}\) under a prior probability of 0.5; for a more detailed explanation and a cartoon see https://tinyurl.com/ydhfndxa

This confusion does not arise for the rarely reported unconditional distributions (see Box 3).

This is known as Jeffreys’s platitude: “The most beneficial result that I can hope for as a consequence of this work is that more attention will be paid to the precise statement of the alternatives involved in the questions asked. It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude” (Jeffreys, 1939 , p.vi).

The Bayesian Mann–Whitney U test results and the results for the raw fuse times are in the .jasp file at https://osf.io/nw49j/ .

The posterior median is popular because it is robust to skewed distributions and invariant under smooth transformations of parameters, although other measures of central tendency, such as the mode or the mean, are also in common use.

We generally consider error percentages below 20% acceptable: a change of 20% in the Bayes factor will typically not alter the qualitative conclusions. However, this threshold naturally increases with the magnitude of the Bayes factor. For instance, a Bayes factor of 10 with a 50% error percentage could be expected to fluctuate between 5 and 15 upon recomputation, which could be considered a large change. With a Bayes factor of 1000, however, a 50% reduction would still leave us with overwhelming evidence.

Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology , 66 , 1–7.

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician , 27 , 17–21.

Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board task force report. American Psychologist , 73 , 3–25.

Berger, J. O. (2006). Bayes factors. In S. Kotz, N. Balakrishnan, C. Read, B. Vidakovic, & N. L. Johnson (Eds.) Encyclopedia of Statistical Sciences, vol. 1, 378-386, Hoboken, NJ, Wiley .

Berger, J. O., & Wolpert, R. L. (1988) The likelihood principle , (2nd edn.) Hayward (CA): Institute of Mathematical Statistics.

Bürkner, P.C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software , 80 , 1–28.

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., & et al. (2017). Stan: A probabilistic programming language. Journal of Statistical Software , 76 , 1–37.

Clyde, M. A., Ghosh, J., & Littman, M. L. (2011). Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics , 20 , 80–101.

Depaoli, S., & Schoot, R. van de (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-checklist. Psychological Methods , 22 , 240–261.

Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics , 41 , 214–226.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology , 5 , 781.

Draper, N. R., & Cox, D. R. (1969). On distributions and their transformation to normality. Journal of the Royal Statistical Society: Series B (Methodological) , 31 , 472–476.

Etz, A. (2018). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science , 1 , 60–69.

Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (2018). Bayesian inference and testing any hypothesis you can specify. Advances in Methods and Practices in Psychological Science , 1 (2), 281–295.

Etz, A., & Wagenmakers, E. J. (2017). J. B. S. Haldane’s contribution to the Bayes factor hypothesis test. Statistical Science , 32 , 313–329.

Fisher, R. (1925). Statistical methods for research workers (12th ed.). Edinburgh: Oliver & Boyd.

Frisby, J. P., & Clatworthy, J. L. (1975). Learning to see complex random-dot stereograms. Perception , 4 , 173–178.

Gronau, Q. F., Ly, A., & Wagenmakers, E. J. (2020). Informed Bayesian t tests. The American Statistician , 74 , 137–143.

Haaf, J., Ly, A., & Wagenmakers, E. (2019). Retire significance, but still test hypotheses. Nature , 567 (7749), 461.

Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E. J. (2020). A conceptual introduction to Bayesian model averaging. Advances in Methods and Practices in Psychological Science , 3 , 200–215.

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science , 14 , 382–401.

JASP Team (2019). JASP (Version 0.9.2)[Computer software]. https://jasp-stats.org/ .

Jarosz, A. F., & Wiley, J. (2014). What are the odds? A practical guide to computing and reporting Bayes factors. Journal of Problem Solving , 7 , 2–9.

Jeffreys, H. (1939). Theory of probability (1st ed.). Oxford: Oxford University Press.

Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association , 90 , 773–795.

Keysers, C., Gazzola, V., & Wagenmakers, E. J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience , 23 , 788–799.

Lee, M. D., & Vanpaemel, W. (2018). Determining informative priors for cognitive models. Psychonomic Bulletin & Review , 25 , 114–127.

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association , 103 , 410–424.

Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized linear mixed models to analyse reaction time data. Frontiers in Psychology , 6 , 1171.

Ly, A., Verhagen, A. J., & Wagenmakers, E. J. (2016). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology , 72 , 19–32.

Marsman, M., & Wagenmakers, E. J. (2017). Bayesian benefits with JASP. European Journal of Developmental Psychology , 14 , 545–555.

Matzke, D., Nieuwenhuis, S., van Rijn, H., Slagter, H. A., van der Molen, M. W., & Wagenmakers, E. J. (2015). The effect of horizontal eye movements on free recall: A preregistered adversarial collaboration. Journal of Experimental Psychology: General , 144 , e1–e15.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review , 23 , 103–123.

Morey, R. D., & Rouder, J. N. (2015). BayesFactor 0.9.11-1. Comprehensive R Archive Network. http://cran.r-project.org/web/packages/BayesFactor/index.html .

Plummer, M. (2003). JAGS: A Program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.) Proceedings of the 3rd international workshop on distributed statistical computing, Vienna, Austria .

R Core Team (2018). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/ .

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review , 21 , 301–308.

Rouder, J. N., Haaf, J. M., & Vandekerckhove, J. (2018). Bayesian inference for psychology, part IV: Parameter estimation and Bayes factors. Psychonomic Bulletin & Review , 25 (1), 102–113.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review , 16 , 225– 237.

Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science , 2 , e55.

Schönbrodt, F.D., & Wagenmakers, E. J. (2018). Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review , 25 , 128–142.

Schramm, P., & Rouder, J. N. (2019). Are reaction time transformations really beneficial? PsyArXiv, March 5.

Spiegelhalter, D. J., Myles, J. P., Jones, D. R., & Abrams, K. R. (2000). Bayesian methods in health technology assessment: a review. Health Technology Assessment , 4 , 1–130.

Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science , 11 , 702–712.

Stefan, A. M., Gronau, Q. F., Schönbrodt, F.D., & Wagenmakers, E. J. (2019). A tutorial on Bayes factor design analysis using an informed prior. Behavior Research Methods , 51 , 1042–1058.

Sung, L., Hayden, J., Greenberg, M. L., Koren, G., Feldman, B. M., & Tomlinson, G. A. (2005). Seven items were identified for inclusion when reporting a Bayesian analysis of a clinical study. Journal of Clinical Epidemiology , 58 , 261–268.

The BaSiS group (2001). Bayesian standards in science: Standards for reporting of Bayesian analyses in the scientific literature. Internet. http://lib.stat.cmu.edu/bayesworkshop/2001/BaSis.html .

Tijmstra, J. (2018). Why checking model assumptions using null hypothesis significance tests does not suffice: a plea for plausibility. Psychonomic Bulletin & Review , 25 , 548–559.

Vandekerckhove, J., Rouder, J. N., & Kruschke, J. K. (eds.) (2018). Beyond the new statistics: Bayesian inference for psychology [special issue]. Psychonomic Bulletin & Review , p 25.

Wagenmakers, E. J., Beek, T., Rotteveel, M., Gierholz, A., Matzke, D., Steingroever, H., & et al. (2015). Turning the hands of time again: A purely confirmatory replication study and a Bayesian analysis. Frontiers in Psychology: Cognition , 6 , 494.

Wagenmakers, E. J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology , 60 , 158–189.

Wagenmakers, E. J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., & et al. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review , 25 , 58–76.

Wagenmakers, E. J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., & et al. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review , 25 , 35–57.

Wagenmakers, E. J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. Current Directions in Psychological Science , 25 , 169–176.

Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E. J. (2009). How to quantify support for and against the null hypothesis: A flexible winBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review , 16 , 752– 760.

Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology , 7 , 1832.

Wrinch, D., & Jeffreys, H. (1921). On certain fundamental principles of scientific inquiry. Philosophical Magazine , 42 , 369– 390.

van Doorn, J., Ly, A., Marsman, M., & Wagenmakers, E. J. (2020). Bayesian rank-based hypothesis testing for the rank sum test, the signed rank test, and Spearman’s rho. Journal of Applied Statistics , 1–23.

van Rossum, G. (1995). Python tutorial (Tech. Rep. No. CS-R9526). Amsterdam: Centrum voor Wiskunde en Informatica (CWI).

van den Bergh, D., Haaf, J. M., Ly, A., Rouder, J. N., & Wagenmakers, E. J. (2019). A cautionary note on estimating effect size. PsyArXiv. Retrieved from psyarxiv.com/h6pr8 .

Acknowledgments

We thank Dr. Simons, two anonymous reviewers, and the editor for comments on an earlier draft. Correspondence concerning this article may be addressed to Johnny van Doorn, University of Amsterdam, Department of Psychological Methods, Valckeniersstraat 59, 1018 XA Amsterdam, the Netherlands. E-mail may be sent to [email protected]. This work was supported in part by a Vici grant from the Netherlands Organization of Scientific Research (NWO) awarded to EJW (016.Vici.170.083) and an advanced ERC grant awarded to EJW (743086 UNIFY). DM is supported by a Veni Grant (451-15-010) from the NWO. MM is supported by a Veni Grant (451-17-017) from the NWO. AE is supported by a National Science Foundation Graduate Research Fellowship (DGE1321846). Centrum Wiskunde & Informatica (CWI) is the national research institute for mathematics and computer science in the Netherlands.

Author information

Authors and affiliations.

University of Amsterdam, Amsterdam, Netherlands

Johnny van Doorn, Don van den Bergh, Udo Böhm, Fabian Dablander, Tim Draws, Nathan J. Evans, Quentin F. Gronau, Julia M. Haaf, Max Hinne, Šimon Kucharský, Alexander Ly, Maarten Marsman, Dora Matzke, Akash R. Komarlu Narendra Gupta, Alexandra Sarafoglou, Angelika Stefan & Eric-Jan Wagenmakers

Nyenrode Business University, Breukelen, Netherlands

University of California, Irvine, California, USA

Alexander Etz

Centrum Wiskunde & Informatica, Amsterdam, Netherlands

Alexander Ly

Stanford University, Stanford, California, USA

Jan G. Voelkel

Contributions

JvD wrote the main manuscript. EJW, AE, JH, and JvD contributed to manuscript revisions. All authors reviewed the manuscript and provided feedback.

Corresponding author

Correspondence to Johnny van Doorn .

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Practices Statement

The data and materials are available at https://osf.io/nw49j/ .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

van Doorn, J., van den Bergh, D., Böhm, U. et al. The JASP guidelines for conducting and reporting a Bayesian analysis. Psychon Bull Rev 28 , 813–826 (2021). https://doi.org/10.3758/s13423-020-01798-5

Published : 09 October 2020

Issue Date : June 2021

DOI : https://doi.org/10.3758/s13423-020-01798-5

  • Bayesian inference
  • Scientific reporting
  • Statistical software

Statistics: Data analysis and modelling

Chapter 16 Introduction to Bayesian hypothesis testing

In this chapter, we will introduce an alternative to the Frequentist null-hypothesis significance testing procedure employed up to now, namely a Bayesian hypothesis testing procedure. This also consists of comparing statistical models. What is new here is that Bayesian models contain a prior distribution over the values of the model parameters. In doing so for both the null and the alternative model, Bayesian model comparisons provide a more direct measure of the relative evidence of the null model compared to the alternative. We will introduce the Bayes Factor as the primary measure of evidence in Bayesian model comparison. We then go on to discuss “default priors”, which can be useful in a Bayesian testing procedure. We end with an overview of some objections to the traditional Frequentist method of hypothesis testing, and a comparison between the two approaches.

16.1 Hypothesis testing, relative evidence, and the Bayes factor

In the Frequentist null-hypothesis significance testing procedure, we defined a hypothesis test in terms of comparing two nested models, a general MODEL G and a restricted MODEL R which is a special case of MODEL G. Moreover, we defined the testing procedure in terms of determining the probability of a test result, or one more extreme, given that the simpler MODEL R is the true model. This was necessary because MODEL G is too vague to determine the sampling distribution of the test statistic.

By supplying a prior distribution to parameters, Bayesian models can be “vague” whilst not suffering from the problem that they effectively make no predictions. As we saw for the prior predictive distributions in Figure 15.3, even MODEL 1, which assumes all possible values of the parameter \(\theta\) are equally likely, still provides a valid predicted distribution of the data. Because any Bayesian model with a valid prior distribution provides a valid prior predictive distribution, which then also provides a valid value for the marginal likelihood, we do not have to worry about complications that arise when comparing models in the Frequentist tradition, such as the fact that the maximised likelihood of a more general model is always at least as high as that of a restricted model, simply because the former has additional parameters estimated by maximum likelihood. The relative marginal likelihood of the data assigned by each model, which can be stated as a marginal likelihood ratio analogous to the likelihood ratio of Chapter 2, provides a direct measure of the relative evidence for both models. The marginal likelihood ratio is also called the Bayes factor, and can be defined for two general Bayesian models as: \[\begin{equation} \text{BF}_{12} = \text{BayesFactor}(\text{MODEL 1}, \text{MODEL 2}) = \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \tag{16.1} \end{equation}\] where \(p(Y_1,\ldots,Y_n|\text{MODEL } j)\) denotes the marginal likelihood of observed data \(Y_1,\ldots,Y_n\) according to MODEL \(j\) .
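To make Equation (16.1) concrete, the short R sketch below computes the two marginal likelihoods for a simple binomial model, anticipating the example discussed later in this chapter (12 correct predictions out of 14 attempts): one model assigns \(\theta\) a uniform prior, the other fixes \(\theta = .5\), and the marginal likelihood of each is the binomial likelihood averaged over that prior.

```r
# Marginal likelihood under a uniform prior on theta: average the likelihood
# of 12 successes in 14 trials over all values of theta between 0 and 1.
marg_uniform <- integrate(function(theta) dbinom(12, 14, theta), 0, 1)$value

# Marginal likelihood under a point prior that fixes theta at .5
marg_point <- dbinom(12, 14, 0.5)

marg_uniform / marg_point  # the Bayes factor for the uniform-prior model
```

With these numbers, the ratio works out to roughly 12, matching the value of \(\text{BF}_{1,0}\) reported for the octopus example later in this chapter.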

The Bayes factor is a central statistic of interest in Bayesian hypothesis testing. It is a direct measure of the relative evidence for two models. Its importance can also be seen when we consider the ratio of the posterior probabilities for two models, which is also called the posterior odds . In a Bayesian framework, we can assign probabilities not just to data and parameters, but also to whole models. These probabilities reflect our belief that a model is “true” in the sense that it provides a better account of the data than other models. Before observing data, we can assign a prior probability \(p(\text{model } j)\) to a model, and we can update this to a posterior probability \(p(\text{model } j|Y_1,\ldots,Y_n)\) after observing data \(Y_1,\ldots,Y_n\) . If the marginal likelihood \(p(Y_1,\ldots,Y_n|\text{MODEL } j)\) is higher than the prior-weighted average of the marginal likelihoods of all models under consideration, the posterior probability is higher than the prior probability, and hence our belief in the model would increase. If it is lower than this average, the posterior probability is lower than the prior probability, and hence our belief in the model would decrease. We can compare the relative change in our belief for two models by considering the posterior odds ratio, which is just the ratio of the posterior probabilities of the two models, and is computed by multiplying the ratio of the prior probabilities of the models (the prior odds ratio) by the marginal likelihood ratio: \[\begin{aligned} \frac{p(\text{MODEL 1}|Y_1,\ldots,Y_n)}{p(\text{MODEL 2}|Y_1,\ldots,Y_n)} &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \times \frac{p(\text{MODEL 1})}{p(\text{MODEL 2})} \\ \text{posterior odds} &= \text{Bayes factor} \times \text{prior odds} \end{aligned}\]

In terms of the relative evidence that the data provides for the two models, the Bayes factor is all that matters, as the prior probabilities do not depend on the data. Moreover, if we assign an equal prior probability to each model, then the prior odds ratio would equal 1, and hence the posterior odds ratio is identical to the Bayes factor.
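As an illustration (with made-up numbers), suppose we assign equal prior probabilities to the two models, so the prior odds equal 1, and we observe a Bayes factor of \(\text{BF}_{12} = 3\). The posterior odds are then \(3 \times 1 = 3\), which, assuming MODEL 1 and MODEL 2 are the only models under consideration, corresponds to a posterior model probability of
\[p(\text{MODEL 1}|Y_1,\ldots,Y_n) = \frac{3}{3+1} = .75\]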

In a Frequentist framework, we would evaluate the magnitude of the likelihood ratio by considering its place within the sampling distribution under the assumption that one of the models is true. Although in principle we might be able to determine the sampling distribution of the Bayes factor in a similar manner, there is no need. A main reason for going through all this work in the Frequentist procedure was that the models are on unequal footing, with the likelihood ratio always favouring a model with additional parameters. The Bayes Factor does not inherently favour a more general model compared to a restricted one. Hence, we can interpret its value “as is”. The Bayes factor is a continuous measure of relative evidential support, and there is no real need for classifications such as “significant” and “non-significant”. Nevertheless, some guidance in interpreting the magnitude might be useful. One convention is the classification provided by Jeffreys ( 1939 ) in Table 16.1 . Because small values below 1, when the Bayes factor favours the second model, can be difficult to discern, the table also provides the corresponding values of the logarithm of the Bayes factor ( \(\log \text{BF}_{1,2}\) ). On a logarithmic scale, any value above 0 favours the first model, and any value below 0 the second one. Moreover, magnitudes above and below 0 can be assigned a similar meaning.

Table 16.1: Interpretation of the values of the Bayes factor (after Jeffreys, 1939).
\(\text{BF}_{1,2}\) \(\log \text{BF}_{1,2}\) Interpretation
> 100 > 4.61 Extreme evidence for MODEL 1
30 – 100 3.4 – 4.61 Very strong evidence for MODEL 1
10 – 30 2.3 – 3.4 Strong evidence for MODEL 1
3 – 10 1.1 – 2.3 Moderate evidence for MODEL 1
1 – 3 0 – 1.1 Anecdotal evidence for MODEL 1
1 0 No evidence
1/3 – 1 -1.1 – 0 Anecdotal evidence for MODEL 2
1/10 – 1/3 -2.3 – -1.1 Moderate evidence for MODEL 2
1/30 – 1/10 -3.4 – -2.3 Strong evidence for MODEL 2
1/100 – 1/30 -4.61 – -3.4 Very strong evidence for MODEL 2
< 1/100 < -4.61 Extreme evidence for MODEL 2

The Bayes factor is a general measure that can be used to compare any Bayesian models. We do not have to focus on nested models, as we did with null-hypothesis significance testing. But such nested model comparisons are often of interest. For instance, when considering Paul’s psychic abilities, fixing \(\theta = .5\) is a useful model of an octopus without psychic abilities, while a model that allows \(\theta\) to take other values is a useful model of a (somewhat) psychic octopus. For the first model, assigning prior probability to \(\theta\) is simple: the prior probability of \(\theta = .5\) is \(p(\theta = .5) = 1\) , and \(p(\theta \neq .5) = 0\) . For the second model, we need to consider how likely each possible value of \(\theta\) is. Figure 15.3 shows two choices for this prior distribution, which are both valid representations of the belief that \(\theta\) can be different from .5. These choices will give a different marginal likelihood, and hence a different value of the Bayes factor when comparing them to the restricted null-model

\[\text{MODEL 0}: \theta = .5\] The Bayes factor comparing MODEL 1 to MODEL 0 is

\[\text{BF}_{1,0} = 12.003\] which indicates that the data (12 out of 14 correct predictions) is roughly 12 times as likely under MODEL 1 compared to MODEL 0, which in the classification of Table 16.1 means strong evidence for MODEL 1. For MODEL 2, the Bayes factor is

\[\text{BF}_{2,0} = 36.409\] which indicates that the data is roughly 36 times as likely under MODEL 2 compared to MODEL 0, which would be classified as very strong evidence for MODEL 2. In both cases, the data favours the alternative model over the null model, and this may be taken as sufficient to reject MODEL 0. However, the strength of the evidence varies with the choice of prior distribution in the alternative model. This is as it should be. A model such as MODEL 2, which places stronger belief on higher values of \(\theta\), is more consistent with Paul’s high number of correct predictions.
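For a uniform (Beta(1, 1)) prior such as that of MODEL 1, the marginal likelihood has a closed form, so the Bayes factor against MODEL 0 can be computed directly. The following R sketch reproduces the value of \(\text{BF}_{1,0}\) reported above for 12 correct predictions out of 14; the prior of MODEL 2 is not specified here, so its Bayes factor is not reproduced.

```r
# Marginal likelihoods for Paul's 12 correct predictions out of 14 attempts
n <- 14
y <- 12

# MODEL 0: theta fixed at .5, so the marginal likelihood is just the binomial probability
ml_model0 <- dbinom(y, size = n, prob = 0.5)

# MODEL 1: uniform Beta(1, 1) prior on theta; integrating the binomial likelihood
# over this prior gives choose(n, y) * beta(y + 1, n - y + 1) in closed form
ml_model1 <- choose(n, y) * beta(y + 1, n - y + 1)

ml_model1 / ml_model0  # Bayes factor BF_{1,0}, approximately 12.0
```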

Bayesian hypothesis testing with Bayes factors is, at its heart, a model comparison procedure. Bayesian models consist of a likelihood function and a prior distribution. A different prior distribution means a different model, and therefore a different result of the model comparison. Because there are an infinite number of alternative prior distributions to the one of the null model, there really isn’t a single test of the null hypothesis \(H_0: \theta = .5\). The prior distribution of MODEL 1, where each possible value of \(\theta\) is equally likely, is the Bayesian equivalent of the alternative hypothesis in null-hypothesis significance testing, and as such might seem a natural default against which to compare the null hypothesis. But there is nothing to force this choice, and other priors are in principle equally valid, as long as they reflect your a priori beliefs about likely values of the parameter. Notice the “a priori” specification in the last sentence: it is vital that the prior distribution is chosen before observing the data. If you choose the prior distribution to match the data after having looked at it, the procedure loses some of its meaning as a hypothesis test, even if the Bayes factor is still an accurate reflection of the evidential support of the models.

16.2 Parameter estimates and credible intervals

Maximum likelihood estimation provides a single point estimate for each parameter. In a Bayesian framework, estimation involves updating prior beliefs to posterior beliefs. What we end up with is a posterior distribution over the parameter values. If you want to report a single estimate, you could choose one of the measures of location: the mean, median, or mode of the posterior distribution. Unless the posterior is symmetric, these will have different values (see Figure 16.1), and one is not necessarily better than the other. I would usually choose the posterior mean, but if the posterior is very skewed, a measure such as the mode or median might provide a better reflection of the location of the distribution.

In addition to reporting an estimate, it is generally also a good idea to consider the uncertainty in the posterior distribution. A Bayesian version of a confidence interval (with a more straightforward interpretation!) is a credible interval. A credible interval is an interval in the posterior distribution which contains a given proportion of the probability mass. A common choice is the Highest Density Interval (HDI), which is the narrowest interval that contains a given proportion of the probability mass. Figure 16.1 shows the 95% HDI of the posterior probability that Paul makes a correct prediction, where the prior distribution was the uniform distribution of MODEL 1 in Figure 15.2.

Figure 16.1: Posterior distribution for the probability that Paul makes a correct prediction, for MODEL 1 in Figure 15.2 .

A slightly different way to compute a credible interval is as the central credible interval. For such an interval, the excluded left and right tails of the distribution each contain \(\tfrac{\alpha}{2}\) of the probability mass (where e.g. \(\alpha = .05\) for a 95% credible interval). Unlike the HDI, the central credible interval is not generally the narrowest interval that contains a given proportion of the posterior probability, but it is generally more straightforward to compute. Nevertheless, the HDI is more often reported than the central credible interval.
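As a sketch of how these intervals can be obtained, assume the posterior for Paul’s success probability is a Beta(13, 3) distribution (the posterior that results from a uniform prior and 12 successes out of 14). The central credible interval follows directly from the quantile function, and the HDI can be found by a simple numerical search:

```r
# 95% central credible interval and 95% HDI for a Beta(13, 3) posterior
a <- 13
b <- 3

# Central interval: cut 2.5% of the probability mass from each tail
central_95 <- qbeta(c(0.025, 0.975), a, b)

# HDI: the narrowest interval containing 95% of the posterior mass,
# found by minimising the interval width over the lower tail probability
hdi_beta <- function(a, b, mass = 0.95) {
  width <- function(p_low) qbeta(p_low + mass, a, b) - qbeta(p_low, a, b)
  p_low <- optimize(width, interval = c(0, 1 - mass))$minimum
  qbeta(c(p_low, p_low + mass), a, b)
}
hdi_95 <- hdi_beta(a, b)

central_95
hdi_95  # slightly narrower than the central interval for this skewed posterior
```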

A nice thing about credible intervals is that they have a straightforward interpretation: the (subjective) probability that the true value of a parameter lies within an \(x\)% credible interval is \(x\)%. Compare this to the correct interpretation of an \(x\)% Frequentist confidence interval, which is that over an infinite number of samples from the DGP, and the corresponding confidence intervals computed for each of them, \(x\)% of those intervals would contain the true value of the parameter.

16.3 A Bayesian t-test

As discussed above, Bayesian hypothesis testing concerns comparing models with different prior distributions for model parameters. If one model, the “null model”, restricts a parameter to take a specific value, such as \(\theta = .5\) , or \(\mu = 0\) , while another model allows the parameter to take different values, we compare a restricted model to a more general one, and hence we can think of the model comparison as a Bayesian equivalent to a null-hypothesis significance test. The prior distribution assigned to the parameter in the more general alternative model will determine the outcome of the test, and hence it is of the utmost importance to choose this sensibly. This, however, is not always easy. Therefore, much work has been conducted to derive sensible default priors to enable researchers to conduct Bayesian hypothesis tests without requiring them to define prior distributions which reflect their own subjective beliefs.

Rouder, Speckman, Sun, Morey, & Iverson ( 2009 ) developed a default prior distribution to test whether two groups have a different mean. The test is based on the two-group version of the General Linear Model (e.g. Section 7.2 ):

\[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i \quad \quad \epsilon_i \sim \textbf{Normal}(0, \sigma_\epsilon)\] where \(X_{1,i}\) is a contrast-coded predictor with the values \(X_{1,i} = \pm \tfrac{1}{2}\) for the different groups. Remember that with this contrast code, the slope \(\beta_1\) reflects the difference between the group means, i.e. \(\beta_1 = \mu_1 - \mu_2\), and the intercept represents the grand mean \(\beta_0 = \frac{\mu_1 + \mu_2}{2}\). Testing for group differences involves a test of the following hypotheses:

\[\begin{aligned} H_0\!: & \quad \beta_1 = 0 \\ H_1\!: & \quad \beta_1 \neq 0 \\ \end{aligned}\]

To do this in a Bayesian framework, we need prior distributions for all the model parameters ( \(\beta_0\), \(\beta_1\), and \(\sigma_\epsilon\) ). Rouder et al. ( 2009 ) propose to use so-called uninformative priors for \(\beta_0\) and \(\sigma_\epsilon\) (effectively meaning that for these parameters, “anything goes”). The main consideration is then the prior distribution for \(\beta_1\). Rather than defining a prior distribution for \(\beta_1\) directly, they propose to define a prior distribution for \(\frac{\beta_1}{\sigma_\epsilon}\), which is the difference between the group means divided by the standard deviation of the dependent variable within each group. This is a measure of effect size and is also known as Cohen’s \(d\):

\[\text{Cohen's } d = \frac{\mu_1 - \mu_2}{\sigma_\epsilon} \quad \left(= \frac{\beta_1}{\sigma_\epsilon}\right)\] Defining the prior distribution for the effect-size is more convenient than defining the prior distribution for the difference between the means, as the latter difference is dependent on the scale of the dependent variable, which makes it difficult to define a general prior distribution suitable for all two-group comparisons. The “default” prior distribution they propose is a so-called scaled Cauchy distribution:

\[\frac{\beta_1}{\sigma_\epsilon} \sim \mathbf{Cauchy}(r)\] The Cauchy distribution is identical to a \(t\) -distribution with one degree of freedom ( \(\text{df} = 1\) ). The scaling factor \(r\) can be used to change the width of the distribution, so that either smaller or larger effect sizes become more probable. Examples of the distribution, with three common values for the scaling factor \(r\) (“medium”: \(r = \frac{\sqrt{2}}{2}\) , “wide”: \(r = 1\) , and “ultrawide”: \(r = \sqrt{2}\) ), are depicted in Figure 16.2 .

Figure 16.2: Scaled Cauchy prior distributions on the effect size \(\frac{\beta_1}{\sigma_\epsilon}\)
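The scaled Cauchy prior is easy to inspect directly, since it is simply a Cauchy distribution with its scale parameter set to \(r\). A minimal sketch of how densities like those in Figure 16.2 could be drawn in R:

```r
# Density of the scaled Cauchy prior on the effect size for three scaling factors
delta <- seq(-4, 4, length.out = 401)
plot(delta, dcauchy(delta, location = 0, scale = sqrt(2)/2), type = "l",
     xlab = "effect size", ylab = "prior density")                    # "medium"
lines(delta, dcauchy(delta, location = 0, scale = 1), lty = 2)        # "wide"
lines(delta, dcauchy(delta, location = 0, scale = sqrt(2)), lty = 3)  # "ultrawide"
legend("topright", legend = c("medium", "wide", "ultrawide"), lty = 1:3)
```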

Rouder et al. ( 2009 ) call the combination of the priors for the effect size and error variance the Jeffreys-Zellner-Siow prior (JZS prior). The “default” Bayesian t-test is to compare the model with these priors to one which assumes \(\beta_1 = 0\) , i.e. a model with a prior \(p(\beta_1 = 0) = 1\) and \(p(\beta_1 \neq 0) = 0\) , whilst using the same prior distributions for the other parameters ( \(\beta_0\) and \(\sigma_\epsilon\) ).

As an example, we can apply the Bayesian t-test to the data from the Tetris study analysed in Chapter 7. Comparing the Tetris+Reactivation condition to the Reactivation-Only condition, and setting the scale of the prior distribution for the effect size in the alternative MODEL 1 to \(r=1\), provides a Bayes factor comparing the alternative hypothesis \(H_1\) ( \(\beta_1 \neq 0\) ) to the null hypothesis \(H_0\) ( \(\beta_1 = 0\) ) of \(\text{BF}_{1,0} = 17.225\), which can be interpreted as strong evidence against the null hypothesis.
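In R, a test like this can be carried out with the BayesFactor package. The sketch below assumes a data frame tetris_data with an outcome column intrusions and a two-level grouping column condition; these names are illustrative, not the actual variable names of the study.

```r
library(BayesFactor)

# Keep only the two conditions being compared (hypothetical column and level names)
dat <- droplevels(subset(tetris_data,
                         condition %in% c("Tetris+Reactivation", "Reactivation-Only")))

# Default Bayesian t-test with a scaled Cauchy prior on the effect size (r = 1, "wide")
bf <- ttestBF(formula = intrusions ~ condition, data = dat, rscale = 1)
bf  # reports BF_{1,0} for H1 (beta_1 != 0) against H0 (beta_1 = 0)
```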

As we indicated earlier, the value of the Bayes factor depends on the prior distribution for the tested parameter in the model representing the alternative hypothesis. This dependence is shown in Figure 16.3 by varying the scaling factor \(r\) .

Figure 16.3: Bayes factor \(\text{BF}_{1,0}\) testing equivalence of the means of the Tetris+Reactivation and Reactivation-Only conditions for different values of the scaling factor \(r\) of the scaled Cauchy distribution.

As this figure shows, the Bayes factor is small for values of \(r\) close to 0. The lower the value of \(r\) , the less wide the resulting Cauchy distribution becomes. In the limit, as \(r\) reaches 0, the prior distribution in the alternative model becomes the same as that of the null model (i.e., assigning only probability to the value \(\beta_1 = 0\) ). This makes the models indistinguishable, and the Bayes factor would be 1, regardless of the data. As \(r\) increases in value, we see that the Bayes factor quickly rises, showing support for the alternative model. For this data, the Bayes factor is largest for a scaling factor just below \(r=1\) . When the prior distribution becomes wider than this, the Bayes factor decreases again. This is because the prior distribution then effectively assigns too much probability to high values of the effect size, and as a result lower probability to small and medium values of the effect size. At some point, the probability assigned to the effect size in the data becomes so low, that the null model will provide a better account of the data than the alternative model. A plot like the one in Figure 16.3 is useful to inspect the robustness of a test result to the specification of the prior distribution. In this case, the Bayes factor shows strong evidence ( \(\text{BF}_{1,0} > 10\) ) for a wide range of sensible values of \(r\) , and hence one might consider the test result quite robust. You should not use a plot like this to determine the “optimal” choice of the prior distribution (i.e. the one with the highest Bayes factor). If you did this, then the prior distribution would depend on the data, which is sometimes referred to as “double-dipping”. You would then end up with similar issues as in Frequentist hypothesis testing, where substituting an unknown parameter with a maximum likelihood estimate biases the likelihood ratio to favour the alternative hypothesis, which we then needed to correct for by considering the sampling distribution of the likelihood ratio statistic under the assumption that the null hypothesis is true. A nice thing about Bayes factors is that we do not need to worry about such complications. But that changes if you try to “optimise” a prior distribution by looking at the data.
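A robustness plot like the one in Figure 16.3 can be produced by recomputing the Bayes factor over a grid of scaling factors. A sketch, reusing the hypothetical dat object from the previous example:

```r
# Bayes factor as a function of the Cauchy scaling factor r
r_values <- seq(0.1, 2, by = 0.1)
bf_values <- sapply(r_values, function(r) {
  extractBF(ttestBF(formula = intrusions ~ condition, data = dat, rscale = r))$bf
})

plot(r_values, bf_values, type = "l",
     xlab = "scaling factor r", ylab = "Bayes factor BF[1,0]")
abline(h = 10, lty = 2)  # reference line: conventional threshold for "strong" evidence
```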

16.4 Bayes factors for General Linear Models

The suggested default prior distributions can be generalized straightforwardly to more complex versions of the General Linear Model, such as multiple regression ( Liang, Paulo, Molina, Clyde, & Berger, 2008 ) and ANOVA models ( Rouder, Morey, Speckman, & Province, 2012 ) , by specifying analogous JZS prior distributions over all parameters. This provides a means to test each parameter in a model individually, as well as computing omnibus tests by comparing a general model to one where the prior distribution allows only a single value (i.e.  \(\beta_j = 0\) ) for multiple parameters.

Table 16.2 shows the results of a Bayesian equivalent to the moderated regression model discussed in Section 6.1.5. The results generally confirm the results of the Frequentist tests employed there, although the evidence for the interaction between fun and intelligence can be classified as “anecdotal”.

Table 16.2: Results of a Bayesian regression analysis for the Speed Dating data (cf Table ) with a default JZS prior with ‘medium’ scaling factor \(r = \sqrt{2}/4\) (for regression models, default scaling factors are \(\sqrt{2}/4\), \(1/2\), and \(\sqrt{2}/2\) for medium, wide, and ultrawide, respectively). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{attr}\) > 1000
\(\texttt{intel}\) > 1000
\(\texttt{fun}\) > 1000
\(\texttt{attr} \times \texttt{intel}\) 37.46
\(\texttt{fun} \times \texttt{intel}\) 2.05
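An analysis along the lines of Table 16.2 can be set up with the BayesFactor package by comparing a full regression model to a model with one term removed. The sketch below assumes a data frame dating_data with mean-centred predictors attr, intel, and fun and outcome like; all names are hypothetical, and the interaction terms are entered as pre-computed product columns.

```r
library(BayesFactor)

# Hypothetical mean-centred predictors; interactions entered as product terms
dating_data$attr_intel <- dating_data$attr * dating_data$intel
dating_data$fun_intel  <- dating_data$fun  * dating_data$intel

r_med <- sqrt(2) / 4  # "medium" scaling factor for regression models

full <- lmBF(like ~ attr + intel + fun + attr_intel + fun_intel,
             data = dating_data, rscaleCont = r_med)
no_fun_intel <- lmBF(like ~ attr + intel + fun + attr_intel,
                     data = dating_data, rscaleCont = r_med)

full / no_fun_intel  # Bayes factor for the fun-by-intelligence interaction
```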

Table 16.3 shows the Bayesian equivalent of the factorial ANOVA reported in Section 8.2.1 . The results show “extreme” evidence for an effect of experimenter belief, and no evidence for an effect of power prime, nor for an interaction between power prime and experimenter belief. In the Frequentist null-hypothesis significance test, the absence of a significant test result can not be taken as direct evidence for the null hypothesis. There is actually no straightforward way to quantify the evidence for the null hypothesis in a Frequentist framework. This is not so for the Bayesian hypothesis tests. Indeed, the Bayes factor directly quantifies the relative evidence for either the alternative or null hypothesis. Hence, we find “moderate” evidence that the null hypothesis is true for power prime, and for the interaction between power prime and experimenter belief. This ability to quantify evidence both for and against the null hypothesis is one of the major benefits of a Bayesian hypothesis testing procedure.

Table 16.3: Results of a Bayesian factorial ANOVA analysis for the social priming data (cf Table ) with a default JZS prior with a ‘medium’ scaling factor of \(r = 1/2\) (for ANOVA models, default scaling factors are \(1/2\), \(\sqrt{2}/2\), and \(1\) for medium, wide, and ultrawide, respectively; this assumes standard effect coding for the contrast-coded predictors, which then matches the priors to those set for the linear regression model). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{P}\) 0.127
\(\texttt{B}\) 537.743
\(\texttt{P} \times \texttt{B}\) 0.216
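Similarly, a factorial ANOVA like the one in Table 16.3 could be approached with anovaBF from the BayesFactor package. The sketch assumes a data frame priming_data with factors P (power prime) and B (experimenter belief) and an outcome column approach; these names are hypothetical.

```r
library(BayesFactor)

# Bayes factors for P, B, P + B, and P + B + P:B, each against the intercept-only model
bfs <- anovaBF(approach ~ P * B, data = priming_data, rscaleFixed = 1/2)
bfs

# The interaction test compares the full model to the main-effects-only model;
# check the printed order of models and adjust the indices if necessary
bfs[4] / bfs[3]
```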

16.5 Some objections to null-hypothesis significance testing

Above, we have presented a Bayesian alternative to the traditional Frequentist null-hypothesis significance testing (NHST) procedure. While still the dominant method of statistical inference in psychology, the appropriateness of the NHST has been hotly debated almost since its inception ( Cohen, 1994 ; Nickerson, 2000 ; Wagenmakers, 2007 ) . One issue is that a significant test result is not the same as a “theoretical” or “practical” significance. For a given true effect not equal to 0, the (expected) \(p\) -value becomes smaller and smaller as the sample size increases, because of the increased power in detecting that effect. As a result, even the smallest effect size will become significant for a sufficiently large sample size. For example, a medicine might result in a significant decrease of a symptom compared to a placebo, even if the effect is hardly noticeable to the patient. I should point out that this is more an issue with testing a “point” null hypothesis (e.g. the hypothesis that the effect is exactly equal to 0), rather than an issue with the Frequentist procedure per se. It is an important limitation of null hypothesis testing procedures in general. A similar objection to these hypotheses is that the null hypothesis is unlikely to ever be exactly true. Thompson ( 1992 ) states the potential issues strongly as:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge. ( Thompson, 1992, p. 436 )

There are other objections, which I will go into in the following sections.

16.5.1 The \(p\) -value is not a proper measure of evidential support

It is common practice to interpret the magnitude of the \(p\) -value as an indication of the strength of the evidence against the null hypothesis. That is, a smaller \(p\) -value is taken to indicate stronger evidence against the null hypothesis than a larger \(p\) -value. Indeed, Fisher himself seems to have subscribed to this view ( Wagenmakers, 2007 ) . While it is true that the magnitude is often correlated with the strength of evidence, there are some tricky issues regarding this. If a \(p\) -value were a “proper” measure of evidential support, then if two experiments provide the same \(p\) -value, they should provide the same support against the null hypothesis. But what if the first experiment had a sample size of 10, and the second a sample size of 10,000? Would a \(p\) -value of say \(p=.04\) indicate the same evidence against the null-hypothesis? The general consensus is that sample size is an important consideration in the interpretation of the \(p\) -value, although not always for the same reason. On the one hand, many researchers argue that the \(p\) -value of the larger study provides stronger evidence, possibly because the significant result in the larger study might be less likely due to random sample variability (see e.g. Rosenthal & Gaito, 1963 ) . On the other hand, it can be argued that the smaller study actually provides stronger evidence, because to obtain the same \(p\) -value, the effect size must be larger in the smaller study. Bayesian analysis suggests the latter interpretation is the correct one ( Wagenmakers, 2007 ) . That the same \(p\) -value can indicate a different strength of evidence means that the \(p\) -value does not directly reflect evidential support (at least not without considering the sample size).

Another thing worth pointing out is that, if the null hypothesis is true, any \(p\)-value is equally likely. This is so by definition. Remember that the \(p\)-value is defined as the probability of obtaining the observed value of the test statistic, or one more extreme, assuming the null hypothesis is true. A \(p\)-value of say \(p=.04\) indicates that you would expect to find an equal or more extreme value of the test statistic in 4% of all possible replications of the experiment. Conversely, in 4% of all replications you would obtain a \(p\)-value of \(p \leq .04\). For a \(p\)-value of \(p=.1\), you would expect to find a similar or smaller \(p\)-value in 10% of all replications of the experiment. The only distribution for which this relation between the value \(p\) and the probability of obtaining a value equal to or smaller than it, \(p(p\text{-value} \leq p)\), holds is the uniform distribution. So, when the null hypothesis is true, there is no reason to expect a large \(p\)-value, because every \(p\)-value is equally likely. When the null hypothesis is false, smaller \(p\)-values are more likely than higher \(p\)-values, especially as the sample size increases. This is shown by simulation for a one-sample t-test in Figure 16.4. Under the null hypothesis (left plot), the distribution of the \(p\)-values is uniform.

Figure 16.4: Distribution of \(p\) -values for 10,000 simulations of a one-sample \(t\) -test. \(\delta = \frac{\mu - \mu_0}{\sigma}\) refers to the effect size. Under the null hypothesis (left plot; \(\delta = 0\) ) the distribution of the \(p\) -values is uniform. When the null-hypothesis is false ( \(\delta = .3\) ), the distribution is skewed, with smaller \(p\) -values being more probable, especially when the sample size is larger (compare the middle plot with \(n=10\) to the right-hand plot with \(n=50\) ).
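A simulation along the lines of Figure 16.4 is straightforward. The sketch below draws samples from a Normal distribution with unit standard deviation, so that the mean equals the effect size \(\delta\), and collects the p-values of one-sample t-tests:

```r
set.seed(1)

# Under the null hypothesis (delta = 0, n = 10): p-values are approximately uniform
p_null <- replicate(10000, t.test(rnorm(10, mean = 0, sd = 1), mu = 0)$p.value)
hist(p_null, breaks = 20, main = "delta = 0, n = 10")

# Under a true effect (delta = .3, n = 50): small p-values become much more likely
p_alt <- replicate(10000, t.test(rnorm(50, mean = 0.3, sd = 1), mu = 0)$p.value)
hist(p_alt, breaks = 20, main = "delta = .3, n = 50")
```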

16.5.2 The \(p\) -value depends on researcher intentions

The sampling distribution of a test statistic is the distribution of the values of the statistic calculated for an infinite number of datasets produced by the same Data Generating Process (DGP). The DGP includes all the relevant factors that affect the data, including not only characteristics of the population under study, but also characteristics of the study itself, such as whether participants were randomly sampled, how many participants were included, which measurement tools were used, etc. Choices such as when to stop collecting data are part of the study design. That means that the same data can have a different \(p\)-value, depending on whether the sample size was fixed a priori, or whether sampling continued until some criterion was reached. The following story, paraphrased from Berger & Wolpert (1988, pp. 30–33), may highlight the issue:

A scientist has obtained 100 independent observations that are assumed to be Normal-distributed with mean \(\mu\) and standard deviation \(\sigma\). In order to test the null hypothesis that \(\mu=0\), the scientist consults a Frequentist statistician. The mean of the observations is \(\overline{Y} = 0.2\), and the sample standard deviation is \(S_Y=1\), hence the \(p\)-value is \(p = .0482\), which is a little lower than the adopted significance level of \(\alpha = .05\). This leads to a rejection of the null hypothesis, and a happy scientist. However, the statistician decides to probe deeper and asks the scientist what she would have done had the experiment not yielded a significant result after 100 observations. The scientist replies that she would have collected another 100 observations. As such, the implicit sampling plan was not to collect \(n=100\) observations and stop, but rather to first take 100 observations and check whether \(p <.05\), and collect another 100 observations (resulting in \(n=200\)) if not. This is a so-called sequential testing procedure, and it requires a different treatment than a fixed-sampling procedure. In controlling the Type 1 error of the procedure as a whole, one would need to consider the possible results after \(n=100\) observations, but also after \(n=200\) observations, which is possible, but not straightforward, as the results after \(n=200\) observations are dependent on the results after the first \(n=100\) observations. But the clever statistician works it out and then convinces the scientist that the appropriate \(p\)-value for this sequential testing procedure is no longer significant. The puzzled and disappointed scientist leaves to collect another 100 observations. After lots of hard work, the scientist returns, and the statistician computes a \(p\)-value for the new data, which is now significant. Just to make sure the sampling plan is appropriately reflected in the calculation, the statistician asks what the scientist would have done if the result had not been significant at this point. The scientist answers: “This would depend on the status of my funding; if my grant is renewed, I would test another 100 observations. If my grant is not renewed, I would have had to stop the experiment. Not that this matters, of course, because the data were significant anyway”. The statistician then explains that the correct inference depends on the grant renewal; if the grant is not renewed, the sampling plan stands and no correction is necessary. But if the grant is renewed, the scientist could have collected more data, which calls for a further correction, similar to the first one. The annoyed scientist then leaves and resolves to never again share with the statistician her hypothetical research intentions.

What this story shows is that in considering infinite possible repetitions of a study, everything about the study that might lead to variations in the results should be taken into account. This includes a scientist’s decisions made during each hypothetical replication of the study. As such, the interpretation of the data at hand (i.e., whether the hypothesis test is significant or not significant) depends on hypothetical decisions in situations that did not actually occur. If exactly the same data had been collected by a scientist who would not have collected more observations, regardless of the outcome of the first test, then the result would have been judged significant. So the same data can provide different evidence. This does not mean the Frequentist NHST is inconsistent. The procedure “does what it says on the tin”, namely providing a bound on the rate of Type 1 errors in decisions, when the null hypothesis is true. In considering the accuracy of the decision procedure, we need to consider all situations in which a decision might be made in the context of a given study. This means considering the full design of the study, including the sampling plan, as well as, to some extent, the analysis plan. For instance, if you were to “explore” the data, trying out different ways to analyse it, by e.g. including or excluding potential covariates and applying different criteria for excluding participants or their responses until you obtain a significant test result for an effect of interest, then the significance level \(\alpha\) for that test needs to be adjusted to account for such a fishing expedition. This fishing expedition is also called p-hacking ( Simmons, Nelson, & Simonsohn, 2011 ), and there really isn’t a suitable correction for it. Although corrections for multiple comparisons exist, which allow you to test all possible comparisons within a single model (e.g. the Scheffé correction), when you go on to consider different models, and different subsets of the data to apply those models to, all bets are off. This, simply put, is just really bad scientific practice. And it would render the \(p\)-value meaningless.

16.5.3 Results of a NHST are often misinterpreted

I have said it before, and I will say it again: the \(p\) -value is the probability of observing a particular value of a test statistic, or one more extreme, given that the null-hypothesis is true. This is the proper, and only, interpretation of the \(p\) -value. It is a tricky one, to be sure, and the meaning of the \(p\) -value is often misunderstood. Some common misconceptions (see e.g., Nickerson, 2000 ) are:

  • The \(p\) -value is the probability that the null-hypothesis is true, given the data, i.e.  \(p = p(H_0|\text{data})\) . This posterior probability can be calculated in a Bayesian framework, but not in a Frequentist one.
  • One minus the \(p\) -value is the probability that the alternative hypothesis is true, given the data, i.e.  \(1-p = p(H_1|\text{data})\) . Again, the posterior probability of the alternative hypothesis can be obtained in a Bayesian framework, when the alternative hypothesis is properly defined by a suitable prior distribution. In the conventional Frequentist NHST, the alternative hypothesis is so poorly defined, that it can’t be assigned any probability (apart from perhaps \(p(H_1) = p(H_1|\text{data}) = 1\) , which does not depend on the data, and just reflects that e.g.  \(-\infty \leq \mu - \mu_0 \leq \infty\) will have some value).
  • The \(p\)-value is the probability that the results were due to random chance. If you take a statistical model seriously, then all results are, to some extent, due to random chance. Trying to work out the probability that the results were due to random chance is therefore a rather pointless exercise (if you want to know the answer, it is 1. It would have been more fun if the answer was 42, but alas, the scale of probabilities does not allow this particular answer).

Misinterpretations of \(p\)-values are mistakes by practitioners, and do not indicate a problem with NHST itself. However, they do point to a mismatch between what the procedure provides, and what the practitioner would like the procedure to provide. If one desires to know the probability that the null hypothesis is true, or the probability that the alternative hypothesis is true, then one has to use a Bayesian procedure. Unless you consider a wider context, in which the truth of hypotheses can be sampled from a distribution, there is no “long-run frequency” for the truth of hypotheses, and hence no Frequentist definition of such a probability.

16.6 To Bayes or not to Bayes? A pragmatic view

At this point, you might feel slightly annoyed. Perhaps even very annoyed. We have spent all the preceding chapters focusing on the Frequentist null hypothesis significance testing procedure, and after all that work I’m informing you of these issues. Why? Was all that work for nothing?

No, obviously not. Although much of the criticism regarding the NHST is appropriate, as long as you understand what it does and apply the procedure properly, there is no need to abandon it. The NHST is designed to limit the rate of Type 1 errors (rejecting the null hypothesis when it is true). It does this well. And, when using the appropriate test statistic, in the most powerful way possible. Limiting Type 1 errors is, whilst modest, a reasonable concern in scientific practice. The Bayesian alternative allows you to do more, such as evaluate the relative evidence for and against the null hypothesis, and even calculate the posterior probability of both (as long as you are willing to assign a prior probability to both as well).

An advantage of the NHST is its “objectiveness”: once you have determined a suitable distribution of the data, and decided on a particular value for a parameter to test, there are no other decisions to make apart from setting the significance level of the test. In the Bayesian hypothesis testing procedure, you also need to specify a prior distribution for the parameter of interest in the alternative hypothesis. Although considering what parameter values you would expect if the null hypothesis were false is inherently important, it is often not straightforward when you start a research project, or rely on measures you have not used before in a particular context. Although much work has been devoted to deriving sensible “default priors”, I don’t believe there is a sensible objective prior applicable to all situations. Given the freedom to choose a prior distribution for the alternative hypothesis, this makes the Bayesian testing procedure inherently subjective. This is perfectly in keeping with the subjectivist interpretation of probability as the rational belief of an agent endowed with (subjective) prior beliefs. Moreover, as you accumulate more and more data, the effect of the prior beliefs “washes out” (as long as you don’t assign a probability of zero to the true parameter value).

My pragmatic answer to the question whether you should use a Bayesian test or a Frequentist one is then the following: if you can define a suitable prior distribution to reflect what you expect to observe in a study, before you actually conduct that study, then use a Bayesian testing procedure. This will allow you to do what you most likely would like to do, namely quantify the evidence for your hypotheses against alternative hypotheses. If you are unable to form any expectations regarding the effects within your study, you should probably consider a traditional NHST to assess whether there is an indication of any effect, limiting your Type 1 error rate in doing so. In some sense, this is a “last resort”, but in psychology, where quantitative predictions are inherently difficult, it is something I reluctantly have to rely on quite frequently. Instead of a hypothesis test, you could also consider simply estimating the effect size in that case, with a suitable credible interval.

16.7 In practice

The steps involved in conducting a Bayesian hypothesis test are not too different from the steps involved in conducting a Frequentist hypothesis test, with the additional step of choosing prior distributions over the values of the model parameters.

Explore the data. Plot distributions of the data within the conditions of the experiment (if any), pairwise scatterplots between numerical predictors and the dependent variable, etc. Consider what model you might use to analyse the data, and assess the validity of the underlying assumptions.

Choose an appropriate general statistical model. In many cases, this will be a version of the GLM or an extension such as a linear mixed-effects model, perhaps using suitably transformed dependent and independent variables.

Choose appropriate prior distributions for the model parameters. This is generally the most difficult part. If you have prior data, then you could base the prior distributions on this. If not, then ideally, formulate prior distributions which reflect your beliefs about the data. You can check whether the prior distributions lead to sensible predictions by simulating data from the resulting model (i.e., computing prior predictive distributions). Otherwise, you can resort to “default” prior distributions.
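As a minimal sketch of such a prior predictive check for the two-group model discussed earlier, one could repeatedly draw parameter values from the priors and simulate group differences. The specific priors for \(\beta_0\) and \(\sigma_\epsilon\) below are arbitrary placeholders, not the uninformative priors used by Rouder et al. (2009):

```r
set.seed(123)

n_sim <- 1000  # number of prior predictive simulations
n_grp <- 30    # hypothetical group size

sim_diff <- replicate(n_sim, {
  sigma  <- abs(rcauchy(1, scale = 1))   # placeholder prior on the error SD
  beta_0 <- rnorm(1, mean = 0, sd = 10)  # placeholder prior on the grand mean
  d      <- rcauchy(1, scale = 1)        # scaled Cauchy prior on the effect size (r = 1)
  y1 <- rnorm(n_grp, beta_0 + d * sigma / 2, sigma)
  y2 <- rnorm(n_grp, beta_0 - d * sigma / 2, sigma)
  mean(y1) - mean(y2)
})

hist(sim_diff, breaks = 50)  # are these simulated group differences plausible a priori?
```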

Conduct the analysis. To test null-hypotheses, compare the general model to a set of restricted models which fix a parameter to a particular value (e.g. 0), and compute the Bayes Factor for each of these comparisons. To help you interpret the magnitude of the Bayes Factor, you can consult Table 16.1 . Where possible, consider conducting a robustness analysis, by e.g. varying the scaling factor of the prior distributions. This will inform you about the extent to which the results hinge on a particular choice of prior, or whether they hold for a range of prior distributions.

Report the results. Make sure that you describe the statistical model, as well as the prior distributions chosen. The latter is crucial, as Bayes Factors are not interpretable without knowing the prior distributions. For example, the results of the analysis in Table 16.2 , with additional results from the posterior parameter distributions, may be reported as:

To analyse the effect of rated attractiveness, intelligence, and fun on the liking of dating partners, we used a Bayesian linear regression analysis ( Rouder & Morey, 2012 ). In the model, we allowed the effect of attractiveness and fun to be moderated by intelligence. All predictors were mean-centered before entering the analysis. We used a default JZS-prior for all parameters, with a medium scaling factor of \(r = \sqrt{2}/4\), as recommended by Morey & Rouder ( 2018 ). The analysis showed “extreme” evidence for effects of attractiveness, intelligence, and fun ( \(\text{BF}_{1,0} > 1000\); comparing the model to one with a point-prior at 0 for each effect). All effects were positive, with the posterior means of the slopes equalling \(\hat{\beta}_\text{attr} = 0.345\), 95% HDI [0.309; 0.384], \(\hat{\beta}_\text{intel} = 0.257\), 95% HDI [0.212; 0.304], and \(\hat{\beta}_\text{fun} = 0.382\), 95% HDI [0.342; 0.423]. In addition, we found “very strong” evidence for a moderation of the effect of attractiveness by intelligence ( \(\text{BF}_{1,0} = 37.459\) ). For every one-unit increase in rated intelligence, the effect of attractiveness was reduced by 0.043 ( \(\hat{\beta}_{\text{attr} \times \text{intel}} = -0.043\), 95% HDI [-0.066; -0.02]). There was only “anecdotal” evidence for a moderation of the effect of fun by intelligence ( \(\text{BF}_{1,0} = 2.052\) ). Although we don’t place too much confidence in this result, it indicates that for every one-unit increase in rated intelligence, the effect of fun increased by \(\hat{\beta}_{\text{fun} \times \text{intel}} = 0.032\), 95% HDI [0.01; 0.055].

16.8 “Summary”

Figure 16.5: ‘Piled Higher and Deeper’ by Jorge Cham www.phdcomics.com. Source: https://phdcomics.com/comics/archive.php?comicid=905


Educational and Psychological Measurement, 79(1), February 2019

Proportion of Indicator Common Variance Due to a Factor as an Effect Size Statistic in Revised Parallel Analysis

Samuel B. Green and Marilyn S. Thompson

Arizona State University, Tempe, AZ, USA

Past research suggests revised parallel analysis (R-PA) tends to yield relatively accurate results in determining the number of factors in exploratory factor analysis. R-PA can be interpreted as a series of hypothesis tests. At each step in the series, a null hypothesis is tested that an additional factor accounts for zero common variance among measures in the population. Integration of an effect size statistic, the proportion of common variance (PCV), into this testing process should allow for a more nuanced interpretation of R-PA results. In this article, we initially assessed the psychometric qualities of three PCV statistics that can be used in conjunction with principal axis factor analysis: the standard PCV statistic and two modifications of it. Based on analyses of generated data, the modification that considered only positive eigenvalues (\(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\)) overall yielded the best results. Next, we examined PCV using minimum rank factor analysis, a method that avoids the extraction of negative eigenvalues. PCV with minimum rank factor analysis generally did not perform as well as \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\), even with a relatively large sample size of 5,000. Finally, we investigated the use of \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\) in combination with R-PA and concluded that practitioners can gain additional information from \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\) and make more nuanced decisions about the number of factors when R-PA fails to retain the correct number of factors.

A number of factor analytic experts (e.g., Fabrigar, Wegener, MacCallum, & Strahan, 1999 ; Schmitt, 2011 ) have recommended the use of parallel analysis (PA) to assess the number of factors underlying measures in conducting exploratory factor analysis (EFA). Traditional PA (T-PA) compares the eigenvalues for a sample correlation matrix with the mean eigenvalues for correlation matrices for M (e.g., M = 100) parallel datasets (PDs) generated such that the variables are independent. Despite the relative accuracy of PA, Harshman and Reddon (1983) and Turner (1998) believed T-PA to be flawed. They argued against the use of a reference distribution of eigenvalues for PDs with uncorrelated variables. They contended that the proper reference distribution of eigenvalues to reach a conclusion about the k th factor should be based on PDs with k −1 underlying factors.

To address these concerns, Green, Levy, Thompson, Lu, and Lo (2012) suggested revised PA (R-PA). When assessing whether at least k factors underlie a set of measures with R-PA, PDs are generated assuming measures are a function of k −1 factors rather than 0 factors. Ideally, PDs should be generated based on the population loadings of these k −1 factors. In practice, the population factor loadings are unknown, and factor loadings based on the sample dataset are used.

R-PA differs from T-PA in two other ways. First, R-PA uses principal axis factoring (PAF) rather than principal components analysis (PCA) because unlike PCA, PAF allows for measurement error and thus is more appropriate for educational/psychological data (e.g., Ford, MacCallum, & Tait, 1986 ). A second difference is that the eigenvalues for the sample data are compared with the 95th percentile rather than the mean of eigenvalues for the referent distribution ( Buja & Eyuboglu, 1992 ; Glorfeld, 1995 ), such that each step in R-PA is similar to the hypothesis testing with a nominal alpha of .05.

Results of Monte Carlo studies offer support for R-PA ( Green et al., 2012 ; Green, Redell, Thompson, & Levy, 2016 ; Green, Thompson, Levy, & Lo, 2015 ; Green, Xu, & Thompson, 2018 ) relative to other PA methods. Ruscio and Roche (2012) also offered a PA method that uses the proper referent distribution, but Green et al. (2017) conducted a Monte Carlo study suggesting R-PA generally was preferable across the examined conditions.

R-PA, Hypothesis Testing, and Effect Size

R-PA can be viewed as a sequential, hypothesis-testing process. At each step in the process, the null hypothesis is assessed that k −1 factors are sufficient to reproduce the population correlation matrix. Alpha is set at .05 given the 95th percentile eigenvalue rule is applied. Rejection of the null hypothesis implies at least k factors underlie the population correlation matrix. Nonrejection is interpreted as k −1 factors underlie this matrix.

Because R-PA is a series of hypothesis tests, it is open to the misinterpretations that commonly occur with hypothesis testing. First, nonrejection of the null hypothesis in R-PA should not necessarily imply acceptance of the null hypothesis. Nonrejection can be due to a lack of power. In previous Monte Carlo studies ( Green et al., 2012 ; Green et al., 2015 ), R-PA underestimated the number of factors in conditions with low power (e.g., small sample size, lower factor loadings, and strong factor correlations). Second, rejection of the null hypothesis when conducting R-PA does not necessarily imply that an additional factor is nontrivial. A number of psychometricians have argued that a very large number of factors is likely to underlie any set of measures (e.g., Cudeck & Browne, 1992 ; Tucker, Koopman, & Linn, 1969 ). Within this context, the researcher conducting EFA is attempting to determine the number of major factors that come “close” to reproducing the correlations among the measures, ignoring the trivial factors ( Fabrigar et al., 1999 ).

By including an effect size statistic in the R-PA process, researchers must not only consider the results of the hypothesis tests, but also address whether each additional factor is weak and should be ignored or sufficiently strong to have psychometric meaning. In so doing, researchers must make more nuanced decisions about the number of factors.

Purpose of Article

A number of researchers have suggested the proportion of common variance of indicators explained by a factor (PCV) as an effect size index for that factor (e.g., Reise, 2012 ; Ten Berge & Sočan, 2004 ). The purpose of this article was to investigate the choice of PCV statistics for use with R-PA. The results are presented in three studies. In Study 1, we assessed the psychometric qualities of a frequently applied PCV index using PAF. Due to problems with this statistic, we also evaluated two modifications to this index. In Study 2, we assessed the quality of a PCV index that uses a lesser-known factor extraction method, minimum rank factor analysis (MRFA; Lorenzo-Seva, 2013 ; Shapiro & Ten Berge, 2002 ; Sočan, 2003 ; Ten Berge & Kiers 1991 ; Ten Berge & Sočan, 2004 ). MRFA avoids the problem in the computation of PCV using PAF, such that all factors have positive eigenvalues. In Study 3, we examined the use of these effect size statistics in the application of R-PA.

Definition of PCV as a Factor Effect Size Index

We begin by considering the PCV of a factor in the population. With factor analysis, communalities for indicators (\(\gamma_p\)) are substituted along the diagonal of a correlation matrix to yield a reduced correlation matrix, where a communality gives the proportion of variance of an indicator explained by the underlying factors. Factors are then extracted from this reduced correlation matrix. Each eigenvalue for an extracted factor gives the variance of the indicators accounted for by that factor and, thus, should be greater than or equal to zero. If we knew the underlying structure of a set of indicators, we could correctly compute the PCV for factor \(k'\) in the population (\(\pi_{k'}\)):

\[\pi_{k'} = \frac{\Lambda_{k'}}{\sum_{k=1}^{K} \Lambda_k} \tag{1}\]

where \(\Lambda_{k'}\) is the eigenvalue for the \(k'\)th factor, and \(\sum_{k=1}^{K} \Lambda_k\) is the sum of the eigenvalues for the reduced correlation matrix across the \(K\) indicators. Given the communalities are correct and the number of factors is correctly specified (\(N_{F:\text{Correct}}\)) and less than \(K\), the last \(K - N_{F:\text{Correct}}\) eigenvalues must be zeros, and the denominator of Equation 1 can be reexpressed as \(\sum_{k=1}^{N_{F:\text{Correct}}} \Lambda_k\).

There is an alternative computation of PCV if the correct model is unknown when calculating communalities for the reduced correlation matrix. Squared multiple correlations (SMCs, denoted as \(\rho_p^2\) for any indicator \(p\)) between an indicator and all other indicators can be used as estimates of the correct communalities. We will focus on SMCs in this article in order to be consistent with the choice of communality estimates for R-PA, although corrected SMCs ( Cureton & D’Agostino, 2013 ) or quantities other than SMCs have been suggested in the literature ( Mulaik, 2009 ). It should be noted that a squared multiple correlation gives the proportion of indicator variance attributable to other indicators rather than the proportion of indicator variance attributable to the underlying factors. Thus, although we are interested in \(\pi_{k'}\), we may have to focus on the population PCV with population squared multiple correlations (\(\pi_{\text{SMC}:k'}\)) as rough estimates of the communalities. Accordingly, \(\pi_{\text{SMC}:k'}\) is based on the eigenvalues (\(\Lambda_{\text{SMC}:k'}\)) of a reduced correlation matrix with squared multiple correlations along the diagonal:

\[\pi_{\text{SMC}:k'} = \frac{\Lambda_{\text{SMC}:k'}}{\sum_{k=1}^{K} \Lambda_{\text{SMC}:k}} \tag{2}\]

\(\pi_{\text{SMC}:k'}\) is a negatively biased estimator of \(\pi_{k'}\) at the population level (i.e., \(\pi_{\text{SMC}:k'} - \pi_{k'} < 0\)) because \(\rho_p^2\) is generally less than \(\gamma_p\) for any indicator \(p\).

\(\pi_{k'}\) and \(\pi_{\text{SMC}:k'}\) are PCV parameters in the population. At the sample level, we do not know the correct model and thus cannot estimate \(\pi_{k'}\), but can estimate \(\pi_{\text{SMC}:k'}\). To obtain an estimate of \(\pi_{\text{SMC}:k'}\), we substitute sample estimates for the parameters on the right side of Equation 2 to obtain

\[\hat{\pi}_{\text{SMC}:k'} = \frac{\hat{\Lambda}_{\text{SMC}:k'}}{\sum_{k=1}^{K} \hat{\Lambda}_{\text{SMC}:k}} \tag{3}\]

A Problem With \(\hat{\pi}_{\text{SMC}:k'}\)

In computing the PCV in a sample (i.e., Equation 3), we factor analyze a reduced correlation matrix with SMCs along the diagonal to obtain eigenvalues. The numerator of Equation 3 contains the eigenvalue for the \(k'\)th factor, an estimate of the common variance of the indicators explained by the \(k'\)th factor. The denominator is the sum of the eigenvalues across all factors, a rough estimate of the common variance of the indicators explained by all possible factors. A problem with this approach becomes apparent when conducting a common factor analysis: the first eigenvalues generally are positive, whereas the remaining eigenvalues are negative. Conceptually, the results are nonsensical in that an eigenvalue is the variance of the indicators due to any one factor, and a variance cannot have negative values.

It is crucial to assess the statistical properties of \(\hat{\pi}_{\text{SMC}:k'}\) in that this index is reported by popular statistical packages, including SAS and Stata. In the SAS User’s Guide ( SAS Institute Inc., 2009 ), an example (labeled Example 33.2 Principal Factor Analysis) is presented based on the factor analysis of five variables. Eigenvalues of the reduced correlations (with SMCs along the diagonal) as well as proportions of common variance are presented; the first three eigenvalues and proportions are positive in value, whereas the last two are negative. As reported in the SAS User’s Guide, the results are perplexing in that the first two eigenvalues accounted for 101.31% of the common variance. Appropriately, the SAS manual indicates that this out-of-bound estimate occurred because the reduced correlation matrix was not positive definite and accordingly yielded negative eigenvalues. A similar example is presented in the Stata User’s Guide Release 13 ( StataCorp, 2013 ) in discussing the output generated by the factor analysis procedure.

Alternative Estimators of PCVs

Given the problems with \(\hat{\pi}_{\text{SMC}:k'}\), we considered two adaptations of \(\hat{\pi}_{\text{SMC}:k'}\) to assess PCV. The first adaptation was to alter the denominator. We took a rather simple approach to this adaptation: negative eigenvalues are problematic, so let us get rid of them. In other words, rather than summing across all eigenvalues in the denominator, we sum across only the positive eigenvalues. Thus, at the population level, we can define an alternative PCV (\(\pi_{\text{SMC}:k'}^{+\Lambda}\)),

\[\pi_{\text{SMC}:k'}^{+\Lambda} = \frac{\Lambda_{\text{SMC}:k'}}{\sum_{k=1}^{N^{+\Lambda}} \Lambda_{\text{SMC}:k}} \tag{4}\]

where the denominator is the sum of the \(N^{+\Lambda}\) positive eigenvalues. At the sample level, we substitute the estimates on the right side of the equation:

\[\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}} = \frac{\hat{\Lambda}_{\text{SMC}:k'}}{\sum_{k=1}^{N^{+\hat{\Lambda}}} \hat{\Lambda}_{\text{SMC}:k}} \tag{5}\]

It should be noted that \(\sum_{k=1}^{N^{+\hat{\Lambda}}} \hat{\Lambda}_{\text{SMC}:k}\) must be greater than or equal to \(\sum_{k=1}^{K} \hat{\Lambda}_{\text{SMC}:k}\), and thus \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\) must be less than or equal to \(\hat{\pi}_{\text{SMC}:k'}\).
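A minimal sketch of how \(\hat{\pi}_{\text{SMC}:k'}\) (Equation 3) and \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\) (Equation 5) could be computed in R from a raw data matrix X; this is an illustration of the definitions above, not the authors’ own code:

```r
pcv_estimates <- function(X) {
  R <- cor(X)
  # Squared multiple correlation of each indicator with all other indicators
  smc <- 1 - 1 / diag(solve(R))
  # Reduced correlation matrix: SMCs replace the unit diagonal
  R_reduced <- R
  diag(R_reduced) <- smc
  # Eigenvalues of the reduced correlation matrix
  ev <- eigen(R_reduced, symmetric = TRUE, only.values = TRUE)$values
  list(
    pcv_smc     = ev / sum(ev),          # Equation 3: all eigenvalues in the denominator
    pcv_smc_pos = ev / sum(ev[ev > 0])   # Equation 5: only positive eigenvalues
  )
}
```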

A second adaptation is a corrected \(\hat{\pi}_{\text{SMC}:k'}\) (denoted \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\)) that takes into account our expectation that \(\hat{\pi}_{\text{SMC}:k'}\) is an overestimate of \(\pi_{k'}\). \(\hat{\pi}_{\text{SMC}:k'}\) is corrected by the mean eigenvalue for parallel samples. More specifically, for the \(k'\)th factor, the corrected effect size statistic involves (a) computing \(\hat{\Lambda}_{\text{SMC}:k'}\) based on the reduced correlation matrix; (b) calculating comparable eigenvalues for the parallel samples generated assuming \(k'-1\) factors, denoted as \(\hat{\Lambda}^{m}_{\text{SMC}:k'}\) (where \(m\) indexes the \(m\)th parallel dataset and \(M\) is the total number of parallel datasets); (c) computing the mean eigenvalue across the parallel datasets; (d) subtracting the quantity computed in Step (c) from the quantity determined in Step (a); and (e) dividing the result of Step (d) by \(\sum_{k=1}^{K} \hat{\Lambda}_{\text{SMC}:k}\). The equation for \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\) is thus

\[\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}} = \frac{\hat{\Lambda}_{\text{SMC}:k'} - \frac{1}{M}\sum_{m=1}^{M} \hat{\Lambda}^{m}_{\text{SMC}:k'}}{\sum_{k=1}^{K} \hat{\Lambda}_{\text{SMC}:k}} \tag{6}\]

The subtraction of the mean eigenvalue from Step (c) provides a downward correction of \(\hat{\pi}_{\text{SMC}:k'}\). In addition, this correction ensures that when the number of factors is zero in the population (e.g., in a null model with no common factor), \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\) is expected to be 0 for all \(k'\) in finite samples, and thus spuriously nonzero values are avoided.

Purpose of Study 1

The objective of Study 1 was to assess the psychometric quality of the presented PCV indices. Initially, we examined the bias of \(\pi_{\text{SMC}:k'}\) and \(\pi_{\text{SMC}:k'}^{+\Lambda}\) at the population level. Bias is defined as the difference between either of these parameters and \(\pi_{k'}\). Note that we did not assess \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\), in that it is undefined in the population.

At the sample level, biases of these PCV indices were assessed by calculating \(E(\hat{\pi}_{\text{SMC}:k'}) - \pi_{k'}\), \(E(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}) - \pi_{k'}\), and \(E(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}) - \pi_{k'}\), where \(E(\cdot)\) denotes an expected value. As a byproduct of assessing the biases of these PCV indices, we assessed whether the statistical properties of \(\hat{\pi}_{\text{SMC}:k'}\) warrant its use in popular statistical packages, or whether one of the alternative indices investigated here (i.e., \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\) and \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\)) has better statistical properties.

We manipulated three dimensions to evaluate the bias of \(\pi_{\text{SMC}:k'}\) and \(\pi_{\text{SMC}:k'}^{+\Lambda}\) in estimating \(\pi_{k'}\) at the population level: factor-model type, magnitude of factor loadings, and, where appropriate, correlations between factors. At the sample level, we manipulated the same three dimensions plus sample size to evaluate the psychometric qualities of \(\hat{\pi}_{\text{SMC}:k'}\), \(\hat{\pi}_{\text{SMC}:k'}^{+\hat{\Lambda}}\), and \(\hat{\pi}_{\text{SMC}:k'}^{\text{Corrected}}\). We describe these design dimensions below:

Figure 1: Models for data generation in Study 1.

  • Factor loadings on unidimensional models were .5s or .7s for indicators that were a function of a single factor. For the two-factor or three-factor perfect-cluster models, the nonzero loadings on the factors were .5s or .7s. For the bifactor models, the indicators on the general factor had loadings of .5s or .7s, and the 4 indicators on the group factor(s) had loadings of .5s.
  • Correlations between factors were .0, .4, or .8 for any two-factor or three-factor perfect-cluster model.
  • Number of observations was set at 200 or 400 for the sample-level simulation.

Data Generation and Analyses

At both the population and sample levels, we restricted our presentation to the PCVs for the first three factors, given that the maximum correct number of factors manipulated in the population was three. PCVs for the remaining factors were in general close to 0 (or negative for \(\pi_{\text{SMC}:k'}\)).

At the population level, we computed a reproduced correlation matrix based on the parameters of the models for each combination of the manipulated dimensions. Reduced correlation matrices were then created by substituting the correct γ p or SMCs along the diagonal of the generated correlation matrices. Note that the correct γ p was obtained by specifying the correct number of factors. In reality, the correct number of factors is unknown, and thus γ p is replaced by SMCs. Each of the reduced correlation matrices was analyzed using PAF to obtain eigenvalues for the unrotated factors. π SMC : k ′ and π SMC : k ′ + Λ were computed according to Equations 2 and 4 , respectively, as well as π k ′ using Equation 1 .
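To make these computations concrete, here is a minimal R sketch. It assumes R is a (population or sample) correlation matrix and uses a non-iterated principal-axis step (SMCs substituted on the diagonal, followed by an eigendecomposition), which is our simplification of the PAF step described above; the function names are ours. The two indices follow the verbal definitions used throughout: the k th eigenvalue divided by the sum of all eigenvalues ( π ^ SMC : k ′ ) or by the sum of only the positive eigenvalues ( π ^ SMC : k ′ + Λ ^ ).

```r
# Eigenvalues of the reduced correlation matrix, with SMCs as communality estimates
reduced_eigenvalues <- function(R) {
  smc <- 1 - 1 / diag(solve(R))      # squared multiple correlations
  R_reduced <- R
  diag(R_reduced) <- smc             # replace the unit diagonal with SMCs
  eigen(R_reduced, symmetric = TRUE, only.values = TRUE)$values
}

# Two PCV indices for the k-th factor, given the eigenvalues ev
pcv_smc     <- function(k, ev) ev[k] / sum(ev)           # denominator: all eigenvalues (may include negatives)
pcv_smc_pos <- function(k, ev) ev[k] / sum(ev[ev > 0])   # denominator: positive eigenvalues only
```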

At the sample level, 1,000 sample datasets were created for each combination of the manipulated dimensions. The factors and errors in the model all followed N(0, 1) distributions. Correlation matrices were computed for each sample dataset and analyzed using PAF to obtain eigenvalues to compute π ^ SMC : k ′ and π ^ SMC : k ′ + Λ ^ . In addition, to compute π ^ SMC : k ′ Corrected , we generated 100 parallel datasets based on the estimated loadings given 0, 1, and 2 factors for each sample dataset to assess the PCVs for 1, 2, and 3 factors, respectively.
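As an illustration of this data-generation step, the sketch below simulates one sample dataset from a two-factor perfect-cluster model with loadings of .5 and a factor correlation of .4. The choice of 4 indicators per factor and the scaling of the errors to give unit-variance indicators are our assumptions for the example, not details taken from the article.

```r
set.seed(1)
# Two-factor perfect-cluster model: 4 indicators per factor, loadings of .5, factor correlation .4
lambda <- rbind(cbind(rep(.5, 4), 0), cbind(0, rep(.5, 4)))   # 8 x 2 loading matrix
phi    <- matrix(c(1, .4, .4, 1), 2, 2)                        # factor correlation matrix
n      <- 200

factors <- matrix(rnorm(n * 2), n, 2) %*% chol(phi)            # correlated standard-normal factors
uniq_sd <- sqrt(1 - rowSums((lambda %*% phi) * lambda))        # error SDs giving unit-variance indicators
errors  <- matrix(rnorm(n * 8), n, 8) %*% diag(uniq_sd)
X       <- factors %*% t(lambda) + errors                      # observed data (n x 8)

R_sample <- cor(X)   # sample correlation matrix, ready for the PAF/PCV computations sketched earlier
```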

All data generation and analyses were implemented in R ( R Core Team, 2017 ).

We considered the results of the PCV indices at the population level and then at the sample level.

Bias at the Population Level

We present the results of π SMC : k ′ , π SMC : k ′ + Λ , and π k ′ for the first three factors in Table 1 . Bias at the population level was assessed by π SMC : k ′ − π k ′ and π SMC : k ′ + Λ − π k ′ . We considered a bias that was greater than or equal to .10 as substantial and bolded these values in Table 1 as well as in Tables 2 through 5. Overall, π SMC : k ′ performed poorly. π SMC : k ′ was positively biased across all conditions for the first factor. π SMC : k ′ for the first factor was greater than its upper bound of 1.0 in 10 of the 18 conditions and was higher than 1.20 in four of these 10 conditions. In the remaining 8 conditions, π SMC : k ′ exceeded π k ′ by .08 or more. For the second factor, substantial positive bias for π SMC : k ′ was observed for perfect-cluster models with zero correlation among factors. The bias for π SMC : k ′ decreased rapidly with an increase in the correlation among factors. Minimal bias (not exceeding .05) was observed for the remaining conditions. For the third factor, π SMC : k ′ evidenced negative values when π k ′ = 0 . Considerable positive bias occurred with the three-factor perfect-cluster model when the correlation among factors was 0, but this bias was minimal when correlations among factors were large.

Table 1. PCVs at the Population Level for One-, Two-, and Three-Factor Models.

For each combination of factor-model type, factor loading ( λ = .5 or .7), and factor correlation ( ρ = 0, .4, or .8, where applicable), the table reports π k ′ , π SMC : k ′ , and π SMC : k ′ + Λ for the first, second, and third factors.

Note. PCV = proportion of common variance.

Table 2. PCVs at the Population Level and Mean Effect Sizes at the Sample Level for the First Factor.

For each condition (factor-model type, λ , and ρ ), the table reports the corresponding population PCVs alongside the mean sample estimates π ^ SMC : k ′ , π ^ SMC : k ′ + Λ ^ , and π ^ SMC : k ′ Corrected for N = 200 and N = 400.

Table 5. PCVs at the Population Level and Mean Estimates of PCV Using MRFA at the Sample Level for the First Three Factors.

For each condition (factor-model type, λ , and ρ ), the table reports, for the first three factors, the population PCV and the mean π ^ MRFA : k ′ for N = 200 and N = 400.

Note. PCV = proportion of common variance; MRFA = minimum rank factor analysis.

In comparison with π SMC : k ′ , π SMC : k ′ + Λ showed much less bias in estimating π k ′ . We considered separately results when π k ′ > 0 and results when π k ′ = 0 , aggregating across the first, second, and third factors. In the conditions in which π k ′ > 0 , π SMC : k ′ + Λ was equal to π k ′ for all 12 estimates with a single factor model and perfect-cluster models with uncorrelated factors. For the remaining estimates in which π k ′ > 0 , π SMC : k ′ + Λ was within .03 of π k ′ 28 times and between .04 and .05 of π k ′ 2 times. When π k ′ = 0 , π SMC : k ′ + Λ was less than or equal to 0 in all 12 conditions; π SMC : k ′ + Λ was between .00 and −.05 for 10 of the estimates and was equal to −.07 for the remaining 2 estimates.

To explain the differences in results between π SMC : k ′ and π SMC : k ′ + Λ , we examined the eigenvalues for the extracted factors. In Figure 2 , we present a graph of eigenvalues for the condition in which the reduced correlation matrix was based on a bifactor model with one group factor having loadings of .5s. The pattern of eigenvalues for this condition was similar to the patterns of eigenvalues for the other conditions. It should be noted that (a) the eigenvalues based on SMCs as communality estimates were positive when the eigenvalues based on correct communalities were positive, (b) the eigenvalues based on SMCs as communality estimates were negative when the eigenvalues based on correct communalities were zero, and (c) the eigenvalues were uniformly lower when the communality estimates were SMCs versus when they were correct values.

Figure 2. Eigenvalues of reduced correlation matrices based on a bifactor model with one group factor having factor loadings of .5s.

Because π SMC : k ′ and π SMC : k ′ + Λ have the same numerator, differences in these estimates are due to differences in their denominators: the sum of all eigenvalues for π SMC : k ′ versus the sum of only the positive eigenvalues for π SMC : k ′ + Λ .

Bias at the Sample Level

At the sample level, we considered not only PCV estimates based on all eigenvalues and on all positive eigenvalues, but also estimates that are corrected using parallel datasets. The results of these analyses for the first, second, and third factors are presented in Tables 2, 3, and 4, respectively. The population values of the PCVs are also included in these tables to assess bias.

Table 3. PCVs at the Population Level and Mean Effect Sizes at the Sample Level for the Second Factor.

For each condition (factor-model type, λ , and ρ ), the table reports the corresponding population PCVs alongside the mean sample estimates π ^ SMC : k ′ , π ^ SMC : k ′ + Λ ^ , and π ^ SMC : k ′ Corrected for N = 200 and N = 400.

Table 4. PCVs at the Population Level and Mean Effect Sizes at the Sample Level for the Third Factor.

For each condition (factor-model type, λ , and ρ ), the table reports the corresponding population PCVs alongside the mean sample estimates π ^ SMC : k ′ , π ^ SMC : k ′ + Λ ^ , and π ^ SMC : k ′ Corrected for N = 200 and N = 400.

We begin by describing the results in Table 2 for the first factor. π ^ SMC : k ′ was positively biased in all conditions, largely attributable to the bias at the population level. Disturbingly, the mean of π ^ SMC : k ′ yielded out-of-bound values (i.e., greater than 1) in 17 of 36 conditions, and the bias was greater for the larger sample size. For π ^ SMC : k ′ + Λ ^ , the bias was minimal (within .03) in 22 conditions, moderate (between .04 and .07) in 12 conditions, and large (>.07) in the remaining 2 conditions. π ^ SMC : k ′ + Λ ^ demonstrated less bias than either of the other two estimates in 10 of the 18 conditions when N was equal to 200, and π ^ SMC : k ′ Corrected showed the least amount of bias in 7 of these conditions (with 1 tie). π ^ SMC : k ′ + Λ ^ had the least bias in all 18 conditions when N was equal to 400.

The results for the second factor are shown in Table 3 . π ^ SMC : k ′ yielded the least bias in 7 conditions; π ^ SMC : k ′ + Λ ^ yielded the least bias in 18 conditions; and π ^ SMC : k ′ Corrected showed the least bias in 8 conditions (with ties between at least two of the three estimators in the other 3 conditions). It should be noted that for the conditions in which π ^ SMC : k ′ was least biased, the other estimates were generally only slightly more biased. Also, π ^ SMC : k ′ Corrected performed best for models with one underlying factor.

The results for the third factor are shown in Table 4 . For the models with π k ′ = 0 , π ^ SMC : k ′ Corrected yielded the least biased estimates of π k ′ for 15 out of the 20 conditions, but the degree of bias was small in general. For the models with π k ′ > 0 , π ^ SMC : k ′ was the least biased estimate of π k ′ for 7 conditions; π SMC : k ′ + Λ ^ for 3 conditions; and π ^ SMC : k ′ Corrected for 1 condition (with ties between two estimates for the remaining 5 conditions). However, the degree of bias was relatively similar across conditions relative to the results for the first and second factors.

In summary, no single estimator performed consistently better than others across all 54 conditions for the first, second, and third factors. However, π SMC : k ′ + Λ ^ overall tended to demonstrate less bias than the other two estimators. In addition, the standard deviations of the three estimates across replications (i.e., the empirical standard errors) were very similar to each other, although they displayed a consistent pattern across all conditions: π SMC : k ′ + Λ ^ yielded the smallest standard deviation, followed by π ^ SMC : k ′ Corrected and then π ^ SMC : k ′ .

A major difficulty with the considered PCVs is that they are based on negative eigenvalues for later factors. These results are nonsensical in that eigenvalues represent the variance accounted for by a factor, which should be greater than or equal to zero. π ^ SMC : k ′ + Λ ^ tended to produce less biased estimates of π k ′ in comparison with π ^ SMC : k ′ and π ^ SMC : k ′ Corrected ; however, π ^ SMC : k ′ + Λ ^ is an ad hoc modification to π ^ SMC : k ′ .

A more elegant mathematical alternative is to use an EFA method that does not allow for negative eigenvalues. There is such a method: minimum rank factor analysis (MRFA; Lorenzo-Seva, 2013 ; Shapiro & Ten Berge, 2002 ; Sočan, 2003 ; Ten Berge & Kiers, 1991 ; Ten Berge & Sočan, 2004 ). MRFA yields optimal communalities for an observed correlation matrix in the sense that the reduced correlation matrix is positive semidefinite; that is, MRFA does not allow for negative eigenvalues. MRFA is not as well known as other EFA methods (e.g., PAF and maximum likelihood) and is not part of major statistical packages. However, it is available as a Windows program ( Lorenzo-Seva & Ferrando, 2006 ) and an R package ( Navarro-Gonzalez & Lorenzo-Seva, 2017 ).

MRFA eliminates the problem with negative variances for factors and thus may yield more accurate estimates at the population and sample levels (denoted π MRFA : k ′ and π ^ MRFA : k ′ , respectively). Previous research indicates that MRFA yields positively biased estimates for unexplained common variance after extracting a fixed number of factors, particularly as a function of sample size ( Shapiro & Ten Berge, 2002 ; Sočan, 2003 ). The implication is that π ^ MRFA : k ′ is likely to be biased, although the degree and type of bias (i.e., negatively or positively biased) are likely to differ across the extracted factors. Shapiro and Ten Berge (2002) offered a method to compute the asymptotic bias of the unexplained variance, which is appropriate for the analysis of covariance matrices, but not for correlation matrices.

A study by Timmerman and Lorenzo-Seva (2011) proposed a parallel analysis approach that incorporates MRFA as the factor extraction method, and judgments about the number of factors are made based on proportions of explained common variance. The results of their Monte Carlo study indicated that, under a number of conditions, the proposed method can yield relatively accurate conclusions in the assessment of the number of factors for ordered polytomous items. Their study offers some support for MRFA; however, they did not investigate the accuracy of MRFA in estimating PCV, the focus of the current study. Thus, it is unknown based on their results whether MRFA is a useful effect size statistic for parallel analysis.

Purpose of Study 2

The purpose of Study 2 is to explore the psychometric properties of π MRFA : k ′ and π ^ MRFA : k ′ and to compare their properties to those for π SMC : k ′ and π ^ SMC : k ′ as well as π SMC : k ′ + Λ and π ^ SMC : k ′ + Λ ^ . We did not include π ^ SMC : k ′ Corrected because the index is undefined in the population, is not a standard estimate of PCV, and is less accurate than π ^ SMC : k ′ + Λ ^ .

We used the same design in Study 2 as we employed in Study 1, with one exception: we included sample sizes of 1,000, 2,000, and 5,000 in addition to 200 and 400 to explore large-sample properties. We generated data and conducted analyses using methods comparable to those used in Study 1.

At the population level, π MRFA : k ′ yielded the same values as π k ′ ; that is, π MRFA : k ′ perfectly reproduced the proportion of common variance in the population. Thus, π MRFA : k ′ was superior to π SMC : k ′ and π SMC : k ′ + Λ at the population level.

At the sample level, we present in Table 5 the results of π ^ MRFA : k ′ for the first, second, and third extracted factors for sample sizes of 200 and 400. For the first extracted factor, π ^ MRFA : k ′ was consistently negatively biased. π ^ MRFA : k ′ , on average, was .14 less than π k ′ when N = 200 and .11 less than π k ′ when N = 400. π ^ MRFA : k ′ was a much less accurate estimator of π k ′ in comparison with π ^ SMC : k ′ + Λ ^ . The mean absolute differences between π ^ MRFA : k ′ and π k ′ were .10 and .08 for sample sizes of 200 and 400, respectively. In comparison, the mean absolute differences between π ^ SMC : k ′ + Λ ^ and π k ′ were .04 and .02 for sample sizes of 200 and 400, respectively.

For the second extracted factor, π ^ MRFA : k ′ was on average relatively unbiased. Similar to the first extracted factor, π ^ MRFA : k ′ was a somewhat less accurate estimator of π k ′ than π ^ SMC : k ′ + Λ ^ . The mean absolute differences between π ^ MRFA : k ′ and π k ′ were .05 and .04 for sample sizes of 200 and 400, respectively. In comparison, the mean absolute differences between π ^ SMC : k ′ + Λ ^ and π k ′ were .03 and .02 for sample sizes of 200 and 400, respectively.

For the third extracted factor, π ^ MRFA : k ′ tended to be positively biased. π ^ MRFA : k ′ , on average, was .06 greater than π k ′ when N = 200 and .05 greater than π k ′ when N = 400. π ^ MRFA : k ′ was a somewhat less accurate estimator of π k ′ than π ^ SMC : k ′ + Λ ^ . The mean absolute differences between π ^ MRFA : k ′ and π k ′ were .06 and .05 for sample sizes of 200 and 400, respectively. In comparison, the mean absolute differences between π ^ SMC : k ′ + Λ ^ and π k ′ were .03 and .02 for sample sizes of 200 and 400, respectively.

Based on these results, π ^ SMC : k ′ + Λ ^ appeared to be a more accurate estimator of π k ′ than π ^ MRFA : k ′ . Given that π MRFA : k ′ = π k ′ and π SMC : k ′ + Λ ≠ π k ′ at the population level, π ^ MRFA : k ′ should become a more accurate estimator of π k ′ relative to π ^ SMC : k ′ + Λ ^ as sample size increases. To assess whether π ^ SMC : k ′ + Λ ^ continues to demonstrate more accuracy for large sample sizes, we included conditions with sample sizes of 1,000, 2,000, and 5,000.

For the first extracted factor, π ^ MRFA : k ′ was generally negatively biased when N = 1,000, 2,000, and 5,000. π ^ MRFA : k ′ , on average, was .08 less than π k ′ when N = 1,000, .06 less than π k ′ when N = 2,000, and .04 less than π k ′ when N = 5,000. However, π ^ MRFA : k ′ was still a much less accurate estimator of π k ′ than π ^ SMC : k ′ + Λ ^ . In comparison, π ^ SMC : k ′ + Λ ^ tended to be positively biased; on average, it was .02 greater than π k ′ for all three sample sizes. The mean absolute differences between π ^ MRFA : k ′ and π k ′ were .08, .06, and .04 for sample sizes of 1,000, 2,000, and 5,000, respectively. In comparison, the mean absolute differences between π ^ SMC : k ′ + Λ ^ and π k ′ were .02 for all three sample sizes.

For the second extracted factor, both π ^ MRFA : k ′ and π ^ SMC : k ′ + Λ ^ were on average slightly negatively biased; on average, the biases for both statistics were −.01 across different sample sizes. π ^ MRFA : k ′ and π ^ SMC : k ′ + Λ ^ also displayed similar accuracies across sample size; on average, the mean absolute differences between the two alternative PCVs and π k ′ were between .01 and .02 across sample sizes.

For the third extracted factor, π ^ MRFA : k ′ tended to be slightly positively biased, with mean differences of +.01 across sample sizes. In contrast, π ^ SMC : k ′ + Λ ^ demonstrated a negative bias, with mean differences of −.02 across sample sizes. π ^ MRFA : k ′ and π ^ SMC : k ′ + Λ ^ displayed similar accuracies across sample size; on average, the mean absolute differences between the two alternative PCVs and π k ′ were between .02 and .03 across sample sizes.

In addition, the standard deviations of π ^ MRFA : k ′ across replications were very similar to those of π ^ SMC : k ′ + Λ ^ . However, π ^ SMC : k ′ + Λ ^ resulted in slightly smaller standard deviations (.01 or .02 smaller) across all conditions.

Based on the analyses of the generated data in the two previous studies, we found π ^ SMC : k ′ + Λ ^ to be the preferred PCV index. In Study 3, we focus on the usefulness of π ^ SMC : k ′ + Λ ^ in combination with R-PA.

Purpose of Study 3

The objective of Study 3 was to demonstrate how π ^ SMC : k ′ + Λ ^ can be used to yield a more nuanced interpretation of R-PA. We concentrate on two situations in which researchers may choose to make decisions that go counter to the standard interpretation of R-PA after taking into account π ^ SMC : k ′ + Λ ^ : (a) inclusion of a factor that was not significant but has a nontrivial π ^ SMC : k ′ + Λ ^ and (b) exclusion of a factor that was significant but has a trivial π ^ SMC : k ′ + Λ ^ .

In Study 3, we generated and analyzed data for two-factor and three-factor perfect-cluster models as well as two-factor bifactor models. Because the interpretation of results for the perfect-cluster models is essentially the same as those for the two-factor bifactor models, we present the results for only the latter models. The bifactor models consisted of 8 indicators, with all indicators loading on the general factor and 4 indicators loading on the group factor.

In Study 3, we manipulated the factor loadings and sample size. Factor loadings on the general factor were .5s or .7s, whereas factor loadings on the group factor were .3s, .4s, .5s, or .6s. The number of observations was set at 200, 500, or 800.

As with the previous studies, the factor and error scores for the model were generated to be normally distributed. One thousand sample datasets were created for each combination of the manipulated dimensions.

We conducted a series of hypothesis tests required in performing a revised parallel analysis. Following the steps of R-PA, we initially evaluated the null hypothesis that 0 factors underlie a correlation matrix. For all 1,000 replications in each condition, this hypothesis was rejected, implying that more than 0 factors were required. Next, we tested the null hypothesis that 1 factor explains the correlation matrix. The number of replications in which this hypothesis was not rejected and was rejected was recorded for each condition. Based on the standard use of R-PA, we reached one of two conclusions: nonrejection of this null hypothesis suggests 1 factor is sufficient to explain the correlation matrix, whereas rejection implies that 2 or more factors are required. For replications in which this hypothesis was rejected, an additional hypothesis test was conducted to evaluate the null hypothesis that 2 factors are necessary to explain the correlation matrix. The number of replications in which this hypothesis was not rejected and was rejected was recorded for each condition. Nonrejection of this null hypothesis suggests 2 factors are sufficient to explain the correlation matrix, whereas rejection implies that 3 or more factors are required. Given the bifactor model used to generate the data included 2 factors, we stopped testing and concluded that either the correct number of factors was determined if the hypothesis was nonsignificant or the number of factors was overestimated if the hypothesis was significant.
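To make this sequence concrete, here is a schematic R sketch of the decision rule. The helpers kth_eigenvalue_observed() and kth_eigenvalue_parallel() are hypothetical placeholders: the first would return the k th eigenvalue of the reduced correlation matrix for the observed data, and the second would fit a (k − 1)-factor model, generate parallel datasets from the estimated loadings, and return their k th eigenvalues. The 95th-percentile comparison follows the usual parallel-analysis criterion and is an assumption of this sketch, not a transcription of the authors' code.

```r
# Sequential R-PA decision rule (sketch): test H0 of (k - 1) factors for k = 1, 2, ...
# and stop at the first nonsignificant test.
rpa_number_of_factors <- function(data, max_factors, alpha = 0.05) {
  for (k in 1:(max_factors + 1)) {
    observed  <- kth_eigenvalue_observed(k, data)       # hypothetical helper (see text above)
    reference <- kth_eigenvalue_parallel(k, data)       # hypothetical helper: eigenvalues from parallel datasets
    if (observed <= quantile(reference, 1 - alpha)) {   # nonsignificant: k - 1 factors are sufficient
      return(k - 1)
    }
  }
  max_factors + 1   # every test was significant: at least max_factors + 1 factors indicated
}
```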

To assess whether the effect size statistic π ^ SMC : k ′ + Λ ^ augments the interpretation of R-PA, we report the mean of π ^ SMC : k ′ + Λ ^ ( π ^ ¯ SMC : k ′ + Λ ^ ) separately for nonsignificant and significant results at each step in the R-PA process. The reported π ^ ¯ SMC : k ′ + Λ ^ depended on the hypothesis that was tested. More specifically, when the null hypothesis was assessing k ′ − 1 underlying factors, the effect size was computed for the k ′ factor ( π ^ ¯ SMC : k ′ + Λ ^ ) .

In interpreting the results, it is important to keep in mind two issues. First, hypothesis test results are a function of sample size and effect size. Thus, for any one condition, the mean effect size value will be greater for significant versus nonsignificant hypothesis tests. Second, the number of observations decreases as one proceeds through the sequence of tests with R-PA in that R-PA does not proceed when a hypothesis test is nonsignificant.

In Table 6 , we present π ^ ¯ SMC : k ′ + Λ ^ within the sequence of steps of R-PA. In this table, we focus on the types of outcomes with effect sizes that suggest an alternative estimate of the number of factors relative to R-PA.

Table 6. Percent Correct With R-PA as Well as π ^ ¯ SMC : k ′ + Λ ^ (Number of Replications Out of 1,000) for Nonsignificant (NS) and Significant (Sig.) Tests of Null Hypotheses of One or Two Factors for Data Generated Using a Bifactor Model With One General Factor and One Group Factor.

For each combination of general-factor loadings ( λ = .5 or .7), group-factor loadings ( Λ = .3, .4, .5, or .6), and sample size (N = 200, 500, or 800), the table reports the percent of replications in which R-PA identified the correct number of factors, together with the mean π ^ SMC : k ′ + Λ ^ (and the number of replications, in parentheses) for nonsignificant and significant tests of H0: 1 factor (vs. ≥2 factors) and H0: 2 factors (vs. ≥3 factors).

Nonsignificant Test and Nontrivial Effect Size

We first considered conditions when R-PA failed to reach the correct number of factors due to small sample size and thus a lack of power of the R-PA significance tests. This occurred most frequently with nonsignificant tests assessing the null hypothesis of a single underlying factor, a sample size of 200, and loadings on the general factor of .5s (e.g., 83.3% and 47.1% of the replications when the loadings on the group factors were .3s and .4s, respectively). In these conditions, π ^ ¯ SMC : k ′ = 2 + Λ ^ were .07 or greater, indicating that approximately 7% of the common variance is accounted for by the second factor. Given this effect size, researchers might consider that a second factor underlies the variables rather than using the stopping rule of R-PA and reaching the conclusion that a single factor is sufficient.

Although it may be tempting to suggest a cutoff criterion for π ^ SMC : k ′ + Λ ^ (e.g., a factor is defined as relevant if π ^ SMC : k ′ + Λ ^ > .05 ), we believe that setting such a cutoff is counterproductive. Rather, we believe researchers should consider the results of R-PA, π ^ SMC : k ′ + Λ ^ , and the rotated factor solutions in the context of the variables that are being analyzed and what they are purported to measure. It is interesting to note that in the same conditions π ^ ¯ SMC : k ′ = 3 + Λ ^ (i.e., the effect size for the third factor) ranged in value from .016 to .029 for nonsignificant results when evaluating the null hypothesis that 2 factors are sufficient. Thus, the mean effect sizes for the second factor were at least twice as large as the mean effect sizes for the third factor when data were generated with two underlying factors.

Significant Test and Trivial Effect Size

We next examined conditions when R-PA reached nominally the correct number of factors, but one of the factors was sufficiently weak that it might be evaluated as psychometrically inconsequential. This result can occur if sample size is large, and thus the R-PA significance tests have high power. To address this possibility, we key on the results for a sample size of 800, factor loadings on the general factor of .7s, and factor loadings on the group factor of .3s. For this condition, 99.4% of the replications yielded significant tests; however, the π ^ ¯ SMC : k ′ = 2 + Λ ^ for the significant results was only .026. If researchers found similar findings for their data, they might decide that a single factor is adequate to explain the correlation among the variables. Before making a final decision, however, researchers should examine the one-factor solution as well as rotated factor solutions (and in particular two-factor solutions) in the context of the analyzed variables and their purported meaning.

For our Monte Carlo study, we increased the loadings on the group factor from .3 to .6, in essence defining a stronger group factor. Appropriately the π ^ ¯ SMC : k ′ = 2 + Λ ^ increased by .025 for each increase of .1 on the group factor loadings. Thus, researchers are less likely to call the second factor inconsequential as the group factor increases in strength.

Researchers conducting an exploratory factor analysis must first determine the number of factors underlying the reduced correlation matrix among variables. Methods like R-PA examine the eigenvalues associated with the extracted factors; these eigenvalues give the common variance accounted for by a factor. It is convenient to examine the eigenvalues relative to each other, more specifically the PCV accounted for by a factor. Some major statistical packages compute π ^ SMC : k ′ by dividing the eigenvalue for a factor by the sum of the eigenvalues. The problem with this approach is that some eigenvalues are negative and thus nonsensical. We suggested alternatives to this index, including π ^ SMC : k ′ + Λ ^ , which is computed by excluding the negative eigenvalues. Based on Study 1, this index overall outperformed the alternatives.

An extraction method, minimum rank factor analysis, obviates the problems with π ^ SMC : k ′ in that it does not allow for negative eigenvalues. The resulting index reproduced π k ′ exactly at the population level in Study 2, but its sample estimate π ^ MRFA : k ′ was outperformed by π ^ SMC : k ′ + Λ ^ for small to moderately large samples (i.e., sample sizes of 200 to 5,000).

Given that π ^ SMC : k ′ + Λ ^ overall tended to produce better estimates than the investigated alternatives, we explored in Study 3 the use of π ^ SMC : k ′ + Λ ^ in combination with R-PA. As described in this study, researchers are likely to seek a binary decision using a cutoff criterion: unacceptable or acceptable. Accordingly, researchers may want a cutoff criterion for π ^ SMC : k ′ + Λ ^ to aid their decisions, such that when π ^ SMC : k ′ + Λ ^ is above the cutoff, they accept the next factor, and otherwise, reject the next factor. However, we argue that researchers should resist the use of cutoffs. In making an interpretation of π ^ SMC : k ′ + Λ ^ , it is crucial to take into account that it is a “partialled” statistic; that is, it examines the proportion of common variance accounted for by a factor after partialling out previously extracted factors. Thus, the magnitude of π ^ SMC : k ′ + Λ ^ can be quite small for later extracted factors.

The decision about the number of factors is a complex one and should involve multiple methods ( Henson & Roberts, 2006 ; Velicer, Eaton, & Fava, 2000 ). In making this decision, it is important not to dismiss the context of the study. Fabrigar et al. (1999) made this point quite clearly,

Furthermore, it is important to remember that the decision of how many factors to include in a model is a substantive issue as well as a statistical issue. A model that fails to produce a rotated solution that is interpretable and theoretically sensible has little value. Therefore, a researcher should always consider relevant theory and previous research when determining the appropriate number of factors to retain. (p. 281)

R-PA, as well as any other method to determine the number of factors, can be inaccurate, particularly under conditions with small samples, measurement items that have poor quality, and factors that are highly correlated. In such cases, researchers should be aware of the fact that a single answer for the number of factors obtained from R-PA can be misleading. An effect size statistic, in combination with substantive considerations, can thus help researchers determine the number of factors in a more nuanced way. For our Monte Carlo study, we illustrated how π ^ SMC : k ′ + Λ ^ can augment the results of R-PA.

The simulation design in the present study can be extended in a couple of directions. First, we employed models that have simple or bifactor structures with fewer than four factors. Psychological and educational research can involve many factors (e.g., eight factors) with complicated cross-loading structures. It would be useful to investigate effect size statistics for parallel analyses with more complex structures. Second, we assumed that factor, error, and observed scores all follow multivariate normal distributions. When nonnormality exists or when data are collected on Likert-type scales, it remains a question whether our conclusions hold. Future studies are needed to address these concerns.

1. The last few extracted factors are likely to have negative eigenvalues. We recommend not computing π ^ SMC : k ′ + Λ ^ for these factors and essentially considering PCVs for these factors to be zero.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

  • Buja, A., & Eyuboglu, N. (1992). Remarks on parallel analysis. Multivariate Behavioral Research, 27, 509-540.
  • Cudeck, R., & Browne, M. W. (1992). Constructing a covariance matrix that yields a specified minimizer and a specified minimum discrepancy function value. Psychometrika, 57, 357-369.
  • Cureton, E. E., & D'Agostino, R. B. (2013). Factor analysis: An applied approach. London, England: Psychology Press.
  • Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272-299.
  • Ford, J. K., MacCallum, R. C., & Tait, M. (1986). The applications of exploratory factor analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39, 291-314.
  • Glorfeld, L. W. (1995). An improvement on Horn's parallel analysis methodology for selecting the correct number of factors to retain. Educational and Psychological Measurement, 55, 377-393.
  • Green, S. B., Levy, R., Thompson, M. S., Lu, M., & Lo, W.-J. (2012). A proposed solution to the problem with using completely random data to assess the number of factors with parallel analysis. Educational and Psychological Measurement, 72, 357-374.
  • Green, S. B., Redell, N., Thompson, M. S., & Levy, R. (2016). Accuracy of revised and traditional parallel analyses for assessing dimensionality with binary data. Educational and Psychological Measurement, 76, 5-21.
  • Green, S. B., Thompson, M. S., Levy, R., & Lo, W.-J. (2015). Type I and II error rates and overall accuracy of the revised parallel analysis method for determining the number of factors. Educational and Psychological Measurement, 75, 428-457.
  • Green, S. B., Xu, Y., & Thompson, M. (2018). Relative accuracy of two parallel analysis methods that use the proper reference distribution. Educational and Psychological Measurement, 78, 589-604.
  • Harshman, R. A., & Reddon, J. R. (1983). Determining the number of factors by comparing real with random data: A serious flaw and some possible corrections. Proceedings of the Classification Society of North America at Philadelphia, 14-15.
  • Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research: Common errors and some comment on improved practice. Educational and Psychological Measurement, 66, 393-416.
  • Lorenzo-Seva, U. (2013). How to report the percentage of explained common variance in exploratory factor analysis. Unpublished manuscript.
  • Lorenzo-Seva, U., & Ferrando, P. J. (2006). FACTOR: A computer program to fit the exploratory factor analysis model. Behavior Research Methods, 38(1), 88-91.
  • Mulaik, S. A. (2009). Foundations of factor analysis. Boca Raton, FL: CRC Press.
  • Navarro-Gonzalez, D., & Lorenzo-Seva, U. (2017). Dimensionality assessment using Minimum Rank Factor Analysis. Retrieved from https://cran.r-project.org/web/packages/DA.MRFA/index.html
  • R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
  • Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667-696.
  • Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24(2), 282-292.
  • SAS Institute Inc. (2009). SAS/STAT 9.2 user's guide (2nd ed.). Cary, NC: Author.
  • Schmitt, T. A. (2011). Current methodological considerations in exploratory and confirmatory factor analysis. Journal of Psychoeducational Assessment, 29, 304-321.
  • Shapiro, A., & Ten Berge, J. M. (2002). Statistical inference of minimum rank factor analysis. Psychometrika, 67, 79-94.
  • Sočan, G. (2003). The incremental value of minimum rank factor analysis (Unpublished doctoral dissertation). University of Groningen, Netherlands.
  • StataCorp. (2013). Stata: Release 13 [Statistical software]. College Station, TX: Author.
  • Ten Berge, J. M., & Kiers, H. A. (1991). A numerical approach to the approximate and the exact minimum rank of a covariance matrix. Psychometrika, 56, 309-315.
  • Ten Berge, J. M., & Sočan, G. (2004). The greatest lower bound to the reliability of a test and the hypothesis of unidimensionality. Psychometrika, 69, 613-625.
  • Timmerman, M. E., & Lorenzo-Seva, U. (2011). Dimensionality assessment of ordered polytomous items with parallel analysis. Psychological Methods, 16, 209-220.
  • Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor analytic research procedures by means of simulated correlation matrices. Psychometrika, 34, 421-459.
  • Turner, N. E. (1998). The effect of common variance and structure pattern on random data eigenvalues: Implications for the accuracy of parallel analysis. Educational and Psychological Measurement, 58, 541-556.
  • Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication through factor or component analysis: A review and evaluation of alternative procedures for determining the number of factors or components. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment: Honoring Douglas N. Jackson at seventy (pp. 41-71). New York, NY: Springer.
Null Hypothesis


Null Hypothesis, often denoted as H 0 , is a foundational concept in statistical hypothesis testing. It represents an assumption that no significant difference, effect, or relationship exists between variables within a population. It serves as a baseline assumption, positing that no change or effect has occurred, and hypothesis testing evaluates the truth or falsity of this baseline claim.

In this article, we will discuss the null hypothesis in detail, along with some solved examples and questions on the null hypothesis.

Table of Content

  • What is Null Hypothesis?
  • Null Hypothesis Symbol
  • Formula of Null Hypothesis
  • Types of Null Hypothesis
  • Null Hypothesis Examples
  • Principle of Null Hypothesis
  • How Do You Find Null Hypothesis?
  • Null Hypothesis in Statistics
  • Null Hypothesis and Alternative Hypothesis
  • Null Hypothesis and Alternative Hypothesis Examples
  • Null Hypothesis – Practice Problems

What is Null Hypothesis?

Null Hypothesis in statistical analysis suggests the absence of statistical significance within a specific set of observed data. Hypothesis testing, using sample data, evaluates the validity of this hypothesis. Commonly denoted as H 0 or simply “null,” it plays an important role in quantitative analysis, examining theories related to markets, investment strategies, or economies to determine their validity.

Null Hypothesis Meaning

Null Hypothesis represents a default position, often suggesting no effect or difference, against which researchers compare their experimental results. The Null Hypothesis, often denoted as H 0 asserts a default assumption in statistical analysis. It posits no significant difference or effect, serving as a baseline for comparison in hypothesis testing.

The null hypothesis is represented as H 0 ; it symbolizes the absence of a measurable effect or difference in the variables under examination.

A simple example would be asserting that the mean score of a group is equal to a specified value, such as stating that the average IQ of a population is 100 (H 0 : μ = 100).

The Null Hypothesis is typically formulated as a statement of equality or absence of a specific parameter in the population being studied. It provides a clear and testable prediction for comparison with the alternative hypothesis. The formulation of the Null Hypothesis typically follows a concise structure, stating the equality or absence of a specific parameter in the population.

Mean Comparison (Two-sample t-test)

H 0 : μ 1 = μ 2

This asserts that there is no significant difference between the means of two populations or groups.
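As a small illustration (the data here are simulated purely for the example), this null hypothesis can be tested in R with a two-sample t-test:

```r
set.seed(42)
group1 <- rnorm(30, mean = 100, sd = 15)   # e.g., test scores for group 1
group2 <- rnorm(30, mean = 100, sd = 15)   # e.g., test scores for group 2

# H0: mu1 = mu2; a small p-value would lead us to reject the null hypothesis
t.test(group1, group2)
```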

Proportion Comparison

H 0 : p 1 − p 2 = 0

This suggests no significant difference in proportions between two populations or conditions.

Equality in Variance (F-test in ANOVA)

H 0 : σ 1 = σ 2

This states that there’s no significant difference in variances between groups or populations.

Independence (Chi-square Test of Independence):

H 0 : Variables are independent

This asserts that there’s no association or relationship between categorical variables.

Null Hypotheses vary including simple and composite forms, each tailored to the complexity of the research question. Understanding these types is pivotal for effective hypothesis testing.

Equality Null Hypothesis (Simple Null Hypothesis)

The Equality Null Hypothesis, also known as the Simple Null Hypothesis, is a fundamental concept in statistical hypothesis testing that assumes no difference, effect or relationship between groups, conditions or populations being compared.

Non-Inferiority Null Hypothesis

In some studies, the focus might be on demonstrating that a new treatment or method is not significantly worse than the standard or existing one.

Superiority Null Hypothesis

The concept of a superiority null hypothesis comes into play when a study aims to demonstrate that a new treatment, method, or intervention is significantly better than an existing or standard one.

Independence Null Hypothesis

In certain statistical tests, such as chi-square tests for independence, the null hypothesis assumes no association or independence between categorical variables.

Homogeneity Null Hypothesis

In tests like ANOVA (Analysis of Variance), the null hypothesis suggests that there’s no difference in population means across different groups.

  • Medicine: Null Hypothesis: “No significant difference exists in blood pressure levels between patients given the experimental drug versus those given a placebo.”
  • Education: Null Hypothesis: “There’s no significant variation in test scores between students using a new teaching method and those using traditional teaching.”
  • Economics: Null Hypothesis: “There’s no significant change in consumer spending pre- and post-implementation of a new taxation policy.”
  • Environmental Science: Null Hypothesis: “There’s no substantial difference in pollution levels before and after a water treatment plant’s establishment.”

The principle of the null hypothesis is a fundamental concept in statistical hypothesis testing. It involves making an assumption about the population parameter or the absence of an effect or relationship between variables.

In essence, the null hypothesis (H 0 ) proposes that there is no significant difference, effect, or relationship between variables. It serves as a starting point or a default assumption that there is no real change, no effect or no difference between groups or conditions.

The null hypothesis is usually formulated to be tested against an alternative hypothesis (H 1 or H a ), which suggests that there is an effect, difference, or relationship present in the population.

Null Hypothesis Rejection

Rejecting the Null Hypothesis occurs when statistical evidence suggests a significant departure from the assumed baseline. It implies that there is enough evidence to support the alternative hypothesis, indicating a meaningful effect or difference.

Identifying the Null Hypothesis involves defining the status quo, asserting no effect, and formulating a statement suitable for statistical analysis.

When is Null Hypothesis Rejected?

The Null Hypothesis is rejected when statistical tests indicate a significant departure from the expected outcome, leading to the consideration of the alternative hypothesis.

In statistical hypothesis testing, researchers begin by stating the null hypothesis, often based on theoretical considerations or previous research. The null hypothesis is then tested against an alternative hypothesis (Ha), which represents the researcher’s claim or the hypothesis they seek to support.

The process of hypothesis testing involves collecting sample data and using statistical methods to assess the likelihood of observing the data if the null hypothesis were true. This assessment is typically done by calculating a test statistic, which measures the difference between the observed data and what would be expected under the null hypothesis.
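As a small illustration in R (with made-up data), the test statistic for a one-sample t-test can be computed directly from its formula and compared with the built-in function:

```r
x   <- c(1.9, 2.3, 1.7, 2.1, 1.6, 2.0, 1.8, 2.2)   # made-up sample
mu0 <- 2                                            # value stated by the null hypothesis

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # distance of the sample mean from H0, in standard errors
p_val  <- 2 * pt(-abs(t_stat), df = length(x) - 1)      # two-tailed p-value

t.test(x, mu = mu0)   # the same test using the built-in function
```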

In the realm of hypothesis testing, the null hypothesis (H 0 ) and alternative hypothesis (H₁ or Ha) play critical roles. The null hypothesis generally assumes no difference, effect, or relationship between variables, suggesting that any observed change or effect is due to random chance. Its counterpart, the alternative hypothesis, asserts the presence of a significant difference, effect, or relationship between variables, challenging the null hypothesis. These hypotheses are formulated based on the research question and guide statistical analyses.

Difference Between Null Hypothesis and Alternative Hypothesis

The null hypothesis (H 0 ) serves as the baseline assumption in statistical testing, suggesting no significant effect, relationship, or difference within the data. It often proposes that any observed change or correlation is merely due to chance or random variation. Conversely, the alternative hypothesis (H 1 or Ha) contradicts the null hypothesis, positing the existence of a genuine effect, relationship or difference in the data. It represents the researcher’s intended focus, seeking to provide evidence against the null hypothesis and support for a specific outcome or theory. These hypotheses form the crux of hypothesis testing, guiding the assessment of data to draw conclusions about the population being studied.

Criteria | Null Hypothesis | Alternative Hypothesis
Definition | Assumes no effect or difference | Asserts a specific effect or difference
Symbol | H 0 | H 1 (or H a )
Formulation | States equality or absence of a parameter | States a specific value or relationship
Testing Outcome | Rejected if evidence of a significant effect | Accepted if evidence supports the hypothesis

Let’s envision a scenario where a researcher aims to examine the impact of a new medication on reducing blood pressure among patients. In this context:

Null Hypothesis (H 0 ): “The new medication does not produce a significant effect in reducing blood pressure levels among patients.”

Alternative Hypothesis (H 1 or Ha): “The new medication yields a significant effect in reducing blood pressure levels among patients.”

The null hypothesis implies that any observed alterations in blood pressure subsequent to the medication’s administration are a result of random fluctuations rather than a consequence of the medication itself. Conversely, the alternative hypothesis contends that the medication does indeed generate a meaningful alteration in blood pressure levels, distinct from what might naturally occur or by random chance.


Example 1: A researcher claims that the average time students spend on homework is 2 hours per night.

Null Hypothesis (H 0 ): The average time students spend on homework is equal to 2 hours per night.
Data: A random sample of 30 students has an average homework time of 1.8 hours with a standard deviation of 0.5 hours.
Test Statistic and Decision: Using a one-sample t-test, t = (1.8 − 2)/(0.5/√30) ≈ −2.19 with 29 degrees of freedom, giving a two-tailed p-value of about .037. Because the test statistic falls in the rejection region at the .05 significance level, we reject the null hypothesis.
Conclusion: Based on the statistical analysis, there is enough evidence to dispute the claim that the average homework time is 2 hours per night.

Example 2: A company asserts that the error rate in its production process is less than 1%.

Null Hypothesis (H 0 ): The error rate in the production process is 1% or higher.
Data: A sample of 500 products shows an error rate of 0.8%.
Test Statistic and Decision: Using a one-sided z-test for a proportion, z = (0.008 − 0.01)/√(0.01 × 0.99/500) ≈ −0.45, giving a p-value of about .33. Because the test statistic does not fall in the rejection region, we fail to reject the null hypothesis.
Conclusion: The statistical analysis does not provide enough evidence to support the company's claim that the error rate is below 1%.
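A quick R check of the two examples above, using only the reported summary statistics (so the numbers are approximate and a .05 significance level is assumed):

```r
# Example 1: one-sample t-test from summary statistics
t1 <- (1.8 - 2) / (0.5 / sqrt(30))        # t is about -2.19
p1 <- 2 * pt(-abs(t1), df = 29)           # p is about .037 -> reject H0 at the .05 level

# Example 2: one-sided z-test for a proportion
z2 <- (0.008 - 0.01) / sqrt(0.01 * 0.99 / 500)   # z is about -0.45
p2 <- pnorm(z2)                                   # p is about .33 -> fail to reject H0
```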

Q1. A researcher claims that the average time spent by students on homework is less than 2 hours per day. Formulate the null hypothesis for this claim.

Q2. A manufacturing company states that their new machine produces widgets with a defect rate of less than 5%. Write the null hypothesis to test this claim.

Q3. An educational institute believes that their online course completion rate is at least 60%. Develop the null hypothesis to validate this assertion.

Q4. A restaurant claims that the waiting time for customers during peak hours is not more than 15 minutes. Formulate the null hypothesis for this claim.

Q5. A study suggests that the mean weight loss after following a specific diet plan for a month is more than 8 pounds. Construct the null hypothesis to evaluate this statement.

Summary – Null Hypothesis and Alternative Hypothesis

The null hypothesis (H 0 ) and alternative hypothesis (H a ) are fundamental concepts in statistical hypothesis testing. The null hypothesis represents the default assumption, stating that there is no significant effect, difference, or relationship between variables. It serves as the baseline against which the alternative hypothesis is tested. In contrast, the alternative hypothesis represents the researcher’s hypothesis or the claim to be tested, suggesting that there is a significant effect, difference, or relationship between variables. The relationship between the null and alternative hypotheses is such that they are complementary, and statistical tests are conducted to determine whether the evidence from the data is strong enough to reject the null hypothesis in favor of the alternative hypothesis. This decision is based on the strength of the evidence and the chosen level of significance. Ultimately, the choice between the null and alternative hypotheses depends on the specific research question and the direction of the effect being investigated.

FAQs on Null Hypothesis

What does the null hypothesis stand for?

The null hypothesis, denoted as H 0 ​, is a fundamental concept in statistics used for hypothesis testing. It represents the statement that there is no effect or no difference, and it is the hypothesis that the researcher typically aims to provide evidence against.

How to Form a Null Hypothesis?

A null hypothesis is formed based on the assumption that there is no significant difference or effect between the groups being compared or no association between variables being tested. It often involves stating that there is no relationship, no change, or no effect in the population being studied.

When do we reject the Null Hypothesis?

In statistical hypothesis testing, if the p-value (the probability of obtaining the observed results) is lower than the chosen significance level (commonly 0.05), we reject the null hypothesis. This suggests that the data provides enough evidence to refute the assumption made in the null hypothesis.

What is a Null Hypothesis in Research?

In research, the null hypothesis represents the default assumption or position that there is no significant difference or effect. Researchers often try to test this hypothesis by collecting data and performing statistical analyses to see if the observed results contradict the assumption.

What Are Alternative and Null Hypotheses?

The null hypothesis (H0) is the default assumption that there is no significant difference or effect. The alternative hypothesis (H1 or Ha) is the opposite, suggesting there is a significant difference, effect or relationship.

What Does it Mean to Reject the Null Hypothesis?

Rejecting the null hypothesis implies that there is enough evidence in the data to support the alternative hypothesis. In simpler terms, it suggests that there might be a significant difference, effect or relationship between the groups or variables being studied.

How to Find Null Hypothesis?

Formulating a null hypothesis often involves considering the research question and assuming that no difference or effect exists. It should be a statement that can be tested through data collection and statistical analysis, typically stating no relationship or no change between variables or groups.

How is Null Hypothesis denoted?

The null hypothesis is commonly symbolized as H 0 in statistical notation.

What is the Purpose of the Null hypothesis in Statistical Analysis?

The null hypothesis serves as a starting point for hypothesis testing, enabling researchers to assess if there’s enough evidence to reject it in favor of an alternative hypothesis.

What happens if we Reject the Null hypothesis?

Rejecting the null hypothesis implies that there is sufficient evidence to support an alternative hypothesis, suggesting a significant effect or relationship between variables.

What are the tests for the Null Hypothesis?

Various statistical tests, such as t-tests or chi-square tests, are employed to evaluate the validity of the Null Hypothesis in different scenarios.



Survey statistical analysis methods

Get more from your survey results with tried and trusted statistical tests and analysis methods. The kind of data analysis you choose depends on your survey data, so it makes sense to understand as many statistical analysis options as possible. Here’s a one-stop guide.

Why use survey statistical analysis methods?

Using statistical analysis for survey data is a best practice for businesses and market researchers. But why?

Statistical tests can help you improve your knowledge of the market, create better experiences for your customers, give employees more of what they need to do their jobs, and sell more of your products and services to the people that want them. As data becomes more available and easier to manage using digital tools, businesses are increasingly using it to make decisions, rather than relying on gut instinct or opinion.

When it comes to survey data, collection is only half the picture. What you do with your results can make the difference between uninspiring top-line findings and deep, revelatory insights. Using data processing tools and techniques like statistical tests can help you discover:

  • whether the trends you see in your data are meaningful or just happened by chance
  • what your results mean in the context of other information you have
  • whether one factor affecting your business is more important than others
  • what your next research question should be
  • how to generate insights that lead to meaningful changes

There are several types of statistical analysis for surveys. The one you choose will depend on what you want to know, what type of data you have, the method of data collection, how much time and resources you have available, and the level of sophistication of your data analysis software.

Before you start

Whichever statistical techniques or methods you decide to use, there are a few things to consider before you begin.

Nail your sampling approach

One of the most important aspects of survey research is getting your sampling technique right and choosing the right sample size. Sampling allows you to study a large population without having to survey every member of it. A sample, if it’s chosen correctly, represents the larger population, so you can study your sample data and then use the results to confidently predict what would be found in the population at large.

There will always be some discrepancy between the sample data and the population, a phenomenon known as sampling error, but with a well-designed study, this error is usually so small that the results are still valuable.

There are several sampling methods, including probability and non-probability sampling. Like statistical analysis, the method you choose will depend on what you want to know, the type of data you’re collecting and practical constraints around what is possible.

Define your null hypothesis and alternative hypothesis

A null hypothesis is a prediction you make at the start of your research process to help define what you want to find out. It’s called a null hypothesis because you predict that your expected outcome won’t happen – that it will be null and void. Put simply: you work to reject, nullify or disprove the null hypothesis.

Along with your null hypothesis, you’ll define the alternative hypothesis, which states that what you expect to happen will happen.

For example, your null hypothesis might be that you’ll find no relationship between two variables, and your alternative hypothesis might be that you’ll find a correlation between them. If you disprove the null hypothesis, either your alternative hypothesis is true or something else is happening. Either way, it points you towards your next research question.
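
For example, here is a quick sketch in Python: SciPy's pearsonr tests exactly this kind of null hypothesis of no linear relationship between two variables. The numbers below are invented for illustration.

```python
# H0: there is no linear relationship between the two variables.
# pearsonr returns the correlation coefficient and a p-value for that H0.
from scipy import stats

advertising_spend = [10, 12, 15, 17, 20, 22, 25, 30]
monthly_sales     = [95, 98, 104, 110, 118, 119, 128, 140]

r, p_value = stats.pearsonr(advertising_spend, monthly_sales)
print(f"correlation r = {r:.2f}, p-value = {p_value:.4f}")
# A small p-value is evidence against H0 (no relationship);
# otherwise we fail to reject the null hypothesis.
```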

Use a benchmark

Benchmarking is a way of standardizing – leveling the playing field – so that you get a clearer picture of what your results are telling you. It involves taking outside factors into account so that you can adjust the parameters of your research and have a more precise understanding of what’s happening.

Benchmarking techniques use weighting to adjust for variables that may affect overall results. What does that mean? Well, for example, imagine you’re interested in the growth of crops over a season. Your benchmarking will take into account variables that affect crop growth, such as rainfall, hours of sunlight, pests or diseases, and the type and frequency of fertilizer. This lets you adjust for anything unusual that might have happened, such as an unexpected plant disease outbreak on a single farm within your sample that would otherwise skew your results.

With benchmarks in place, you have a reference for what is “standard” in your area of interest, so that you can better identify and investigate variance from the norm.

The goal, as in so much of survey data analysis, is to make sure that your sample is representative of the whole population, and that any comparisons with other data are like-for-like.

Inferential or descriptive?

Statistical methods can be divided into inferential statistics and descriptive statistics.

  • Descriptive statistics shed light on how the data is distributed across the population of interest, giving you details like variance within a group and mean values for measurements.
  • Inferential statistics help you to make judgments and predict what might happen in the future, or to extrapolate from the sample you are studying to the whole population. Inferential statistics are the types of analyses used to test a null hypothesis. We’ll mostly discuss inferential statistics in this guide.

Types of statistical analysis

Regression analysis

Regression is a statistical technique used for working out the relationship between two (or more) variables.

To understand regressions, we need a quick terminology check:

  • Independent variables are “standalone” phenomena (in the context of the study) that influence dependent variables
  • Dependent variables are things that change as a result of their relationship to independent variables

Let’s use an example: if we’re looking at crop growth during the month of August in Iowa, that’s our dependent variable. It’s affected by independent variables including sunshine, rainfall, pollution levels and the prevalence of certain bugs and pests.

A change in a dependent variable depends on, and is associated with, a change in one (or more) of the independent variables.

  • Linear regression uses a single independent variable to predict an outcome of the dependent variable.
  • Multiple regression uses at least two independent variables to predict the effect on the dependent variable. A multiple regression can be linear or non-linear.

The results from a linear regression analysis are shown as a graph with the variables on the axes and a regression line (or curve, for non-linear models) showing the estimated relationship between them. Real-world data points rarely sit exactly on that line; the regression describes the overall trend through the scatter.

With this kind of statistical test, the null hypothesis is that there is no relationship between the dependent variable and the independent variable. The resulting graph would probably (though not always) look quite random rather than following a clear line.

Regression is a useful technique as you’re able to identify not only whether a relationship is statistically significant, but also the precise impact of a change in your independent variable.

[Figure: linear regression graph]
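
Here is a minimal sketch of that idea in Python, assuming the statsmodels package is available. The rainfall and crop-growth numbers are invented; the p-value on the slope tests the null hypothesis of no relationship between the independent and dependent variable.

```python
# Simple linear regression: H0 for the slope is "no relationship".
import numpy as np
import statsmodels.api as sm

rainfall_mm = np.array([40, 55, 60, 72, 80, 95, 110, 120])          # independent variable
crop_growth = np.array([1.1, 1.4, 1.5, 1.9, 2.0, 2.4, 2.9, 3.1])    # dependent variable

X = sm.add_constant(rainfall_mm)      # adds the intercept term
model = sm.OLS(crop_growth, X).fit()

print(model.params)      # estimated intercept and slope
print(model.pvalues[1])  # p-value for H0: slope = 0 (no relationship)
```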

The T-test

The T-test (aka Student’s T-test) is a tool for comparing the means of two groups. It allows you to interpret whether a difference between the group means is statistically significant or merely coincidental.

For example, do women and men have different mean heights? We can tell from running a t-test that there is a meaningful difference between the average height of a man and the average height of a woman – i.e. the difference is statistically significant.

For this test statistic, the null hypothesis would be that there’s no statistically significant difference.

The results of a T-test are expressed in terms of probability (p-value). If the p-value is below a certain threshold, usually 0.05, you have good evidence that your two groups really do differ, and that the observed gap isn’t just chance variation in your sample data.
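
A minimal sketch in Python, using SciPy's independent-samples t-test on invented height data:

```python
# H0: no difference in mean height between the two groups.
# Heights (in cm) are invented for illustration.
from scipy import stats

heights_men   = [178, 182, 175, 180, 185, 177, 181, 179]
heights_women = [165, 168, 162, 170, 166, 164, 169, 167]

t_stat, p_value = stats.ttest_ind(heights_men, heights_women)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# p below 0.05 suggests the difference in means is unlikely to be chance variation.
```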

Analysis of variance (ANOVA) test

Like the T-test, ANOVA (analysis of variance) is a way of testing the differences between groups to see if they’re statistically significant. However, ANOVA allows you to compare three or more groups rather than just two.

Also like the T-test, you’ll start off with the null hypothesis that there is no meaningful difference between your groups.

ANOVA is used with a regression study to find out what effect independent variables have on the dependent variable. It can compare multiple groups simultaneously to see if there is a relationship between them.

An example of ANOVA in action would be studying whether different types of advertisements get different consumer responses. The null hypothesis is that none of them have more effect on the audience than the others and they’re all basically as effective as one another. The audience reaction is the dependent variable here, and the different ads are the independent variables.
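
A quick sketch of that example in Python, using SciPy's one-way ANOVA; the response scores for each ad are invented.

```python
# H0: all three ads are equally effective (equal mean response).
from scipy import stats

ad_a = [6.1, 5.8, 6.4, 6.0, 5.9]
ad_b = [6.3, 6.5, 6.2, 6.6, 6.4]
ad_c = [5.2, 5.5, 5.0, 5.3, 5.4]

f_stat, p_value = stats.f_oneway(ad_a, ad_b, ad_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value means at least one ad's mean response differs from the others.
```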

Cluster analysis

Cluster analysis is a way of processing datasets by identifying how closely related the individual data points are. Using cluster analysis, you can identify whether there are defined groups (clusters) within a large pool of data, or if the data is continuously and evenly spread out.

Cluster analysis comes in a few different forms, depending on the type of data you have and what you’re looking to find out. It can be used in an exploratory way, such as discovering clusters in survey data around demographic trends or preferences, or to confirm and clarify an existing alternative or null hypothesis.

Cluster analysis is one of the more popular statistical techniques in market research, since it can be used to uncover market segments and customer groups.
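
As a rough sketch, here is a k-means cluster analysis in Python using scikit-learn; the respondent data (age and yearly spend) is invented purely for illustration.

```python
# k-means clustering: each row is one respondent (age, yearly spend).
import numpy as np
from sklearn.cluster import KMeans

respondents = np.array([
    [22, 150], [25, 180], [23, 160],   # younger, lower spend
    [41, 420], [44, 460], [39, 400],   # middle-aged, higher spend
    [63, 240], [60, 260], [66, 230],   # older, moderate spend
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(respondents)
print(kmeans.labels_)           # cluster assignment for each respondent
print(kmeans.cluster_centers_)  # the "average" member of each cluster
```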

Factor analysis

Factor analysis is a way to reduce the complexity of your research findings by trading a large number of initial variables for a smaller number of deeper, underlying ones. In performing factor analysis, you uncover “hidden” factors that explain variance (difference from the average) in your findings.

Because it delves deep into the causality behind your data, factor analysis is also a form of research in its own right, as it gives you access to drivers of results that can’t be directly measured.
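
A minimal sketch in Python, assuming scikit-learn is available: six observed survey items are reduced to two underlying factors. The data is simulated so the example runs on its own; with real survey data, X would be your respondents-by-items response matrix.

```python
# Factor analysis: recover two hidden factors from six observed items.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                          # two hidden factors
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2],
                     [0.1, 0.8], [0.0, 0.9], [0.2, 0.7]])
X = latent @ loadings.T + 0.3 * rng.normal(size=(200, 6))   # observed items + noise

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_)  # estimated loadings: how strongly each item reflects each factor
```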

Conjoint analysis

Market researchers love to understand and predict why people make the complex choices they do. Conjoint analysis comes closest to doing this: it asks people to make trade-offs when making decisions, just as they do in the real world, then analyses the results to give the most popular outcome.

For example, an investor wants to open a new restaurant in a town. They think one of the following options might be the most profitable:

|                    | Option 1                                                  | Option 2                                                                  | Option 3                                                 |
|--------------------|-----------------------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------|
| Price per head     | $20                                                       | $40                                                                       | $60                                                      |
| Distance from home | 5 miles                                                   | 2 miles                                                                   | 10 miles                                                 |
| Partner’s opinion  | It’s OK                                                   | It’s OK                                                                   | Loves it!                                                |
| Summary            | It’s cheap, fairly near home, partner is just OK with it  | It’s a bit more expensive but very near home, partner is just OK with it  | It’s expensive, quite far from home but partner loves it |

The investor commissions market research. The options are turned into a survey for the residents:

  • Which type of restaurant do you prefer? (Gourmet burger/Spanish tapas/Thai)
  • What would you be prepared to spend per head? ($20, $40, $60)
  • How far would you be willing to travel? (5 miles, 2 miles, 10 miles)
  • Would your partner…? (Love it, be OK with it)

There are lots of possible combinations of answers – 54 in this case: (3 restaurant types) x (3 price levels) x (3 distances) x (2 partner preferences). Once the survey data is in, conjoint analysis software processes it to figure out how important each option is in driving customer decisions, which levels for each option are preferred, and by how much.

So, from conjoint analysis, the restaurant investor may discover that there’s a statistically significant preference for an expensive Spanish tapas bar on the outskirts of town – something they may not have considered before.
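
A very simplified sketch of the underlying idea in Python: treat the survey answers as rated choice profiles and fit a dummy-coded regression with statsmodels to get rough "part-worths" for each attribute level. All values and column names below are invented, and a real conjoint study would use purpose-built software and many more profiles.

```python
# Rough conjoint-style analysis: which attribute levels drive the rating?
import pandas as pd
import statsmodels.formula.api as smf

profiles = pd.DataFrame({
    "cuisine":  ["burger", "tapas", "thai", "tapas", "burger", "thai", "tapas", "burger"],
    "price":    [20, 40, 60, 60, 40, 20, 40, 60],
    "distance": [5, 2, 10, 10, 5, 2, 2, 10],
    "rating":   [6.5, 7.8, 5.9, 6.8, 6.1, 6.6, 8.2, 5.2],
})

model = smf.ols("rating ~ C(cuisine) + price + distance", data=profiles).fit()
print(model.params)  # positive coefficients = attribute levels that push ratings up
```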

Crosstab analysis

Crosstab (cross-tabulation) is used in quantitative market research to analyze categorical data – that is, data that can be sorted into distinct, mutually exclusive categories, such as ‘men’ and ‘women’, or ‘under 30’ and ‘over 30’.

Also known by names like contingency table and data tabulation, crosstab analysis allows you to compare the relationship between two variables by presenting them in easy-to-understand tables.

A statistical method called chi-squared can be used to test whether the variables in a crosstab analysis are independent or not by looking at whether the differences between them are statistically significant.
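
For example, here is a minimal sketch in Python using SciPy's chi2_contingency on an invented 2x2 crosstab; H0 is that the two variables are independent.

```python
# Chi-squared test of independence on a small crosstab:
# rows = 'under 30' / 'over 30', columns = prefers product A / prefers product B.
from scipy.stats import chi2_contingency

crosstab = [[45, 30],   # under 30: A, B
            [25, 50]]   # over 30:  A, B

chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# p below 0.05 suggests age group and preference are not independent.
```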

Text analysis and sentiment analysis

Analyzing human language is a relatively new form of data processing, and one that offers huge benefits in experience management. As part of the Stats iQ package, TextiQ from Qualtrics uses machine learning and natural language processing to parse and categorize data from text feedback, assigning positive, negative or neutral sentiment to customer messages and reviews.

With this data from text analysis in place, you can then employ statistical tools to analyze trends, make predictions and identify drivers of positive change.

The easy way to run statistical analysis

As you can see, using statistical methods is a powerful and versatile way to get more value from your research data, whether you’re running a simple linear regression to show a relationship between two variables, or performing natural language processing to evaluate the thoughts and feelings of a huge population.

Knowing whether what you notice in your results is statistically significant or not gives you the green light to confidently make decisions and present findings based on your results, since statistical methods provide a degree of certainty that most people recognize as valid. So having results that are statistically significant is a hugely important detail for businesses as well as academics and researchers.

Fortunately, using statistical methods, even the highly sophisticated kind, doesn’t have to involve years of study. With the right tools at your disposal, you can jump into exploratory data analysis almost straight away.

Our Stats iQ™ product can perform the most complicated statistical tests at the touch of a button using our online survey software, or data brought in from other sources. Turn your data into insights and actions with CoreXM and Stats iQ. Powerful statistical analysis. No stats degree required.


