Multiple Linear Regression | A Quick Guide (Examples)
Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.
Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:
 How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
 The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Table of contents

- Assumptions of multiple linear regression
- How to perform a multiple linear regression
- Interpreting the results
- Presenting the results
- Other interesting articles
- Frequently asked questions about multiple linear regression
Multiple linear regression makes all of the same assumptions as simple linear regression :
Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.
In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check for this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.
Normality : The data follows a normal distribution .
Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
Multiple linear regression formula

The formula for a multiple linear regression is:

y = β0 + β1X1 + β2X2 + … + βnXn + ε

- y = the predicted value of the dependent variable
- β0 = the y-intercept (the value of y when all the independent variables are 0)
- β1X1 = the regression coefficient (β1) of the first independent variable (X1), i.e. the effect that increasing the value of that independent variable has on the predicted y value
- … = do the same for however many independent variables you are testing
- ε = model error, i.e. how much variation there is in our estimate of y
To find the best-fit line for each independent variable, multiple linear regression calculates three things:
 The regression coefficients that lead to the smallest overall model error.
 The t statistic of the overall model.
The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).
It then calculates the t statistic and p value for each regression coefficient in the model.
Multiple linear regression in R
While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.
Dataset for multiple linear regression (.csv)
Load the heart.data dataset into your R environment and run the following code:
This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm() .
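The article fits this model with R's lm() function. As a language-neutral sketch of the same least-squares fit, here is a numpy version on synthetic stand-in data; the generating coefficients (15, −0.2, 0.17) are invented for the simulation, loosely echoing the estimates the article reports, and are not the real heart.data values:

```python
# Illustrative sketch (not the article's R code): fit
# heart.disease ~ biking + smoking by ordinary least squares in numpy.
# The data below are synthetic stand-ins for the heart.data sample file.
import numpy as np

rng = np.random.default_rng(0)
n = 500
biking = rng.uniform(1, 75, n)     # % of people biking to work (hypothetical)
smoking = rng.uniform(0.5, 30, n)  # % of people smoking (hypothetical)
# invented "true" relationship used to generate the toy response
heart_disease = 15 - 0.2 * biking + 0.17 * smoking + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), biking, smoking])  # intercept column first
beta, *_ = np.linalg.lstsq(X, heart_disease, rcond=None)
print(beta)  # approximately [15, -0.2, 0.17]
```

The fitted vector recovers the generating coefficients because the model is correctly specified and the noise is small.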
Learn more by following the full step-by-step guide to linear regression in R.
To view the results of the model, you can use the summary() function:
This function takes the most important parameters from the linear model and puts them into a table that looks like this:
The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around −2 and 2), then the model probably fits the assumption of homoscedasticity.
Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:
The most important things to note in this output table are the next two rows – the estimates for the independent variables.
The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.
The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.
The t value column displays the test statistic. Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test. The larger the test statistic, the less likely it is that the results occurred by chance.
The Pr(>|t|) column shows the p value. This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.
Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.
When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.
Visualizing the results in a graph
It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.
However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.
Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.
To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.
If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.
 Chi-square test of independence
 Statistical power
 Descriptive statistics
 Degrees of freedom
 Pearson correlation
 Null hypothesis
Methodology
 Double-blind study
 Case-control study
 Research ethics
 Data collection
 Hypothesis testing
 Structured interviews
Research bias
 Hawthorne effect
 Unconscious bias
 Recall bias
 Halo effect
 Self-serving bias
 Information bias
A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).
A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.
Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a linear equation.
Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:
 measuring the distance of the observed yvalues from the predicted yvalues at each value of x;
 squaring each of these distances;
 calculating the mean of the squared distances.
Linear regression fits a line to the data by finding the regression coefficients that result in the smallest MSE.
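The three steps above can be written out directly on a toy set of observed and predicted y-values (the numbers here are made up for illustration):

```python
import numpy as np

y_observed = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.5, 6.5, 9.5])

distances = y_observed - y_predicted  # step 1: distances at each value of x
squared = distances ** 2              # step 2: square each distance
mse = squared.mean()                  # step 3: mean of the squared distances
print(mse)  # 0.25
```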
Cite this Scribbr article
If you want to cite this source, you can copy and paste the citation or click the “Cite this Scribbr article” button to automatically add the citation to our free Citation Generator.
Bevans, R. (2023, June 22). Multiple Linear Regression | A Quick Guide (Examples). Scribbr. Retrieved September 9, 2024, from https://www.scribbr.com/statistics/multiple-linear-regression/
Multiple linear regression #

Fig. 11 Multiple linear regression
Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)
Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the residual sum of squares (RSS):
Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write
Multiple linear regression answers several questions #
Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\) ?
Which subset of the predictors is most important?
How good is a linear model for these data?
Given a set of predictor values, what is a likely value for \(Y\) , and how accurate is this prediction?
The estimates \(\hat\beta\) #
Our goal again is to minimize the RSS: $ \( \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i - \hat y_i(\beta))^2 \\ &= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \) $
One can show that this is minimized by the vector \(\hat\beta\): $ \(\hat\beta = ({X}^T{X})^{-1}{X}^T{y}.\) $
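A quick numerical check of this closed-form formula against a least-squares solver, on random toy data (in practice np.linalg.solve is preferred to inv numerically; inv is used here only to mirror the formula):

```python
# Verify that (X^T X)^{-1} X^T y matches numpy's least-squares solver.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y      # the closed-form estimate
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerical solver
print(np.allclose(beta_closed, beta_lstsq))  # True
```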
We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
Which variables are important? #
Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\) .
Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)
Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.
The \(F\) statistic is defined by: $ \(F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.\) $
Under the null hypothesis (of our model), this has an \(F\) distribution with \(q\) and \(n-p-1\) degrees of freedom.
Example: If \(q=p\), we test whether any of the variables is important. $ \(\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 \) $
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 494 | 11336.29 | NA | NA | NA | NA |
| 492 | 11078.78 | 2 | 257.5076 | 5.717853 | 0.003509036 |
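Recomputing \(F\) from the (rounded) table entries, with \(q = 2\) dropped predictors and \(n-p-1 = 492\) residual degrees of freedom:

```python
# F = ((RSS_0 - RSS)/q) / (RSS/(n - p - 1)), using the table's values.
rss0, rss = 11336.29, 11078.78
q, df_resid = 2, 492

f_stat = ((rss0 - rss) / q) / (rss / df_resid)
print(round(f_stat, 4))  # 5.7179, matching the table's 5.717853 up to rounding
```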
The \(t\) statistic associated to the \(i\) th predictor is the square root of the \(F\) statistic for the null hypothesis which sets only \(\beta_i=0\) .
A low \(p\) value indicates that the predictor is important.
Warning: If there are many predictors, some of the \(t\) tests will have low p-values by chance alone, even when the model has no explanatory power.
How many variables are important? #
When we select a subset of the predictors, we have \(2^p\) choices.
A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.
Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.
Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.
Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
Choosing one model in the range produced is a form of tuning . This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…
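A bare-bones sketch of forward selection, using RSS as the only criterion (synthetic data; a real analysis would use a criterion such as AIC, adjusted \(R^2\), or cross-validation to decide when to stop):

```python
# Forward selection: at each step, add the candidate predictor that
# reduces the RSS the most. Two of the four columns carry real signal.
import numpy as np

def rss_of(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(2)
n = 200
Z = rng.normal(size=(n, 4))                    # four candidate predictors
y = 3 * Z[:, 1] - 2 * Z[:, 3] + rng.normal(scale=0.5, size=n)

intercept = np.ones((n, 1))
selected, remaining = [], list(range(4))
for _ in range(2):                             # select two variables
    best = min(remaining, key=lambda j: rss_of(
        np.column_stack([intercept] + [Z[:, k] for k in selected + [j]]), y))
    selected.append(best)
    remaining.remove(best)

print(selected)  # expect the truly relevant columns, 1 and 3
```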
How good are the predictions? #
The function predict in R outputs predictions and confidence intervals from a linear model:
| fit | lwr | upr |
|---|---|---|
| 9.409426 | 8.722696 | 10.09616 |
| 14.163090 | 13.708423 | 14.61776 |
| 18.916754 | 18.206189 | 19.62732 |
Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.
| fit | lwr | upr |
|---|---|---|
| 9.409426 | 2.946709 | 15.87214 |
| 14.163090 | 7.720898 | 20.60528 |
| 18.916754 | 12.451461 | 25.38205 |
These functions rely on our linear regression model $ \( Y = X\beta + \epsilon. \) $
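A sketch of where the two interval widths come from under this model: for a new input \(x_0\), the variance of the fitted mean is \(\sigma^2\, x_0^T(X^TX)^{-1}x_0\), while a new response additionally carries the irreducible \(\sigma^2\). Toy data, illustrative only:

```python
# Compare the standard error of a fitted mean (confidence interval)
# with that of a new response (prediction interval) at one point x0.
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = 2 + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)           # estimate of the error variance

x0 = np.array([1.0, 5.0])                  # intercept term + predictor value
h0 = x0 @ np.linalg.solve(X.T @ X, x0)
se_conf = np.sqrt(sigma2 * h0)             # uncertainty in the fitted mean
se_pred = np.sqrt(sigma2 * (1 + h0))       # adds the irreducible error
print(se_conf < se_pred)  # True: prediction intervals are always wider
```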
Dealing with categorical or qualitative predictors #
For each qualitative predictor, e.g. Region :
Choose a baseline category, e.g. East
For every other category, define a new predictor:
\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise
\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.
The model will be: $ \(Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.\) $
The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\) ) for being from the South compared to the baseline category (East).
The model fit and predictions are independent of the choice of the baseline category.
However, hypothesis tests derived from these variables are affected by the choice.
Solution: To check whether region is important, use an \(F\) test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.
Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but whose coefficients have different interpretations.
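A small numerical check that the fitted values are invariant to the baseline choice, on toy data with a hypothetical three-level Region variable and one numeric predictor:

```python
# Fit the same model with two different baseline categories and verify
# that the fitted values coincide, even though the coefficients differ.
import numpy as np

rng = np.random.default_rng(4)
n = 90
regions = np.array(["East", "South", "West"] * (n // 3))
x = rng.normal(size=n)
effect = {"East": 0.0, "South": 1.5, "West": -0.8}   # invented effects
y = 2 + 0.7 * x + np.array([effect[r] for r in regions]) \
    + rng.normal(scale=0.3, size=n)

def fit_with_baseline(base):
    others = [r for r in ["East", "South", "West"] if r != base]
    dummies = np.column_stack([(regions == r).astype(float) for r in others])
    X = np.column_stack([np.ones(n), x, dummies])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta                                   # fitted values

fits_east = fit_with_baseline("East")
fits_west = fit_with_baseline("West")
print(np.allclose(fits_east, fits_west))  # True: same fit, different coding
```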
So far, we have:
Defined Multiple Linear Regression
Discussed how to test the importance of variables.
Described one approach to choose a subset of variables.
Explained how to code qualitative variables.
Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?
How good is the fit? #
To assess the fit, we focus on the residuals $ \( e = Y  \hat{Y} \) $
The RSS always decreases as we add more variables.
The residual standard error (RSE) corrects this: $ \(\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.\) $
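Plugging in the RSS from the ANOVA table earlier, where the residual degrees of freedom were \(n-p-1 = 492\) (so \(n = 495\) observations with \(p = 2\) predictors):

```python
# RSE from the formula above: unlike the raw RSS, it divides by the
# residual degrees of freedom, penalizing extra variables.
import numpy as np

rss = 11078.78       # residual sum of squares, echoing the earlier table
n, p = 495, 2
rse = np.sqrt(rss / (n - p - 1))
print(round(rse, 3))  # 4.745
```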
Fig. 12 Residuals #
Visualizing the residuals can reveal phenomena that are not accounted for by the model; e.g. synergies or interactions:
Potential issues in linear regression #
Interactions between predictors
Nonlinear relationships
Correlation of error terms
Nonconstant variance of error (heteroskedasticity)
High leverage points
Collinearity
Interactions between predictors #
Linear regression has an additive assumption: $ \(\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon\) $
i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.
We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.
One way to deal with this is to include multiplicative variables in the model:
The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.
R makes it easy to include interaction variables in the model:
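By hand, the interaction variable is just an elementwise product appended to the design matrix; R's formula interface (e.g. lm(sales ~ tv * radio)) builds it automatically. Synthetic advertising data, illustrative only:

```python
# Include an interaction term tv * radio as an extra column and recover
# the (invented) interaction effect used to simulate the data.
import numpy as np

rng = np.random.default_rng(5)
n = 300
tv = rng.uniform(0, 100, n)
radio = rng.uniform(0, 50, n)
sales = 5 + 0.05 * tv + 0.1 * radio + 0.02 * tv * radio \
        + rng.normal(scale=1, size=n)

X = np.column_stack([np.ones(n), tv, radio, tv * radio])  # interaction column
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta[3])  # close to the simulated interaction effect, 0.02
```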
Nonlinearities #
Fig. 13 A nonlinear fit might be better here. #
Example: Auto dataset.
A scatterplot between a predictor and the response may reveal a nonlinear relationship.
Solution: include polynomial terms in the model.
Could use other functions besides polynomials…
Fig. 14 Residuals for Auto data #
In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?
Correlation of error terms #
We assumed that the errors for each sample are independent:
What if this breaks down?
The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…
Example : Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\) .
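This duplication effect is easy to verify numerically on toy data; the ratio of standard errors comes out close to \(\sqrt 2\), up to a small degrees-of-freedom correction:

```python
# Standard error of the slope, computed once on a dataset and once on the
# same dataset duplicated: the duplicated version is ~sqrt(2) smaller.
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1 + 2 * X[:, 1] + rng.normal(size=n)

def slope_se(X, y):
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of beta-hat
    return np.sqrt(cov[1, 1])

se_once = slope_se(X, y)
se_doubled = slope_se(np.vstack([X, X]), np.concatenate([y, y]))
print(se_once / se_doubled)  # close to sqrt(2) ~ 1.414
```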
When could this happen in real life:
Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.
Spatial data: Each sample corresponds to a different location in space.
Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.
Correlated errors #
Simulations of time series with increasing correlations between \(\varepsilon_i\)
Nonconstant variance of error (heteroskedasticity) #
The variance of the error depends on some characteristics of the input features.
To diagnose this, we can plot residuals vs. fitted values:
If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
Outliers #

Outliers are points whose errors under the model are very high.
While they may not affect the fit, they might affect our assessment of model quality.
Possible solutions: #
If we believe an outlier is due to an error in data collection, we can remove it.
An outlier might be evidence of a missing predictor, or the need to specify a more complex model.
High leverage points #
Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .
This can be measured with the leverage statistic or self influence, the \(i\)th diagonal entry of the hat matrix: $ \(h_{ii} = \big(X(X^TX)^{-1}X^T\big)_{ii}.\) $
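A minimal sketch: leverages are the diagonal of the hat matrix \(H = X(X^TX)^{-1}X^T\), and they sum to the number of fitted parameters (here \(p+1 = 3\)), so points with leverage far above the average \((p+1)/n\) deserve a closer look:

```python
# Compute leverages as the diagonal of the hat matrix on toy data.
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
leverage = np.diag(H)
print(round(leverage.sum(), 6))  # 3.0, i.e. p + 1
```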
Studentized residuals #
The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\).
The standard error of \(e_i\) is \(\sigma \sqrt{1-h_{ii}}\).
A studentized residual is \(e_i\) divided by its standard error (with an appropriate estimate of \(\sigma\)).
When the model is correct, it follows a Student t distribution with \(n-p-2\) degrees of freedom.
Collinearity #
Two predictors are collinear if one explains the other well:
Problem: The coefficients become unidentifiable .
Consider the extreme case of using two identical predictors limit: $ \( \begin{aligned} \mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\ &= \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon \end{aligned} \) $
For every \((\beta_0,\beta_1,\beta_2)\), the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\).
If 2 variables are collinear, we can easily diagnose this using their correlation.
A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.
Pairwise correlations may not reveal multicollinear variables.
The Variance Inflation Factor (VIF) measures how predictable a variable is given the other variables, a proxy for how necessary it is: $ \(\text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}.\) $
Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for the multiple linear regression of the predictor \(X_j\) onto the remaining predictors.
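The VIF computation, spelled out on toy data where one predictor is nearly a copy of another:

```python
# VIF for predictor j: regress X_j on the other predictors, compute R^2,
# and take 1 / (1 - R^2). Column 2 is almost a copy of column 0, so its
# VIF is large, while the independent column's VIF stays near 1.
import numpy as np

rng = np.random.default_rng(8)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)                     # independent of the others
x2 = x0 + rng.normal(scale=0.05, size=n)    # nearly collinear with x0
Z = np.column_stack([x0, x1, x2])

def vif(Z, j):
    others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
    resid = Z[:, j] - others @ beta
    r2 = 1 - resid @ resid / np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    return 1 / (1 - r2)

print(vif(Z, 1) < 5 < vif(Z, 2))  # True: x2 is flagged, x1 is not
```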
Lesson 5: Multiple Linear Regression (MLR) Model & Evaluation
Overview of this lesson.
In this lesson, we make our first (and last?!) major jump in the course. We move from the simple linear regression model with one predictor to the multiple linear regression model with two or more predictors. That is, we use the adjective "simple" to denote that our model has only one predictor, and we use the adjective "multiple" to indicate that our model has at least two predictors.
In the multiple regression setting, because of the potentially large number of predictors, it is more efficient to use matrices to define the regression model and the subsequent analyses. This lesson considers some of the more important multiple regression formulas in matrix form. If you're unsure about any of this, it may be a good time to take a look at this Matrix Algebra Review .
The good news is that everything you learned about the simple linear regression model extends — with at most minor modification — to the multiple linear regression model. Think about it — you don't have to forget all of that good stuff you learned! In particular:
 The models have similar "LINE" assumptions. The only real difference is that whereas in simple linear regression we think of the distribution of errors at a fixed value of the single predictor, with multiple linear regression we have to think of the distribution of errors at a fixed set of values for all the predictors. All of the model checking procedures we learned earlier are useful in the multiple linear regression framework, although the process becomes more involved since we now have multiple predictors. We'll explore this issue further in Lesson 6.
 The use and interpretation of r² (which we'll denote R² in the context of multiple linear regression) remains the same. However, with multiple linear regression we can also make use of an "adjusted" R² value, which is useful for model building purposes. We'll explore this measure further in Lesson 11.
 With a minor generalization of the degrees of freedom, we use t tests and t intervals for the regression slope coefficients to assess whether a predictor is significantly linearly related to the response, after controlling for the effects of all the other predictors in the model.
 With a minor generalization of the degrees of freedom, we use confidence intervals for estimating the mean response and prediction intervals for predicting an individual response. We'll explore these further in Lesson 6.
For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:
 a hypothesis test for testing that one slope parameter is 0
 a hypothesis test for testing that all of the slope parameters are 0
 a hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0
In this lesson, we also learn how to perform each of the above three hypothesis tests.
Key Learning Goals for this Lesson:
- Conduct the F test for H0: β1 = ... = βp = 0.
- Conduct the F test for any subset of the slope parameters.
- Conduct the t test (or general linear test) for H0: βj = 0. The t test for a slope parameter tests the significance of the predictor after adjusting for the other predictors in the model (as can be justified by the equivalence of the t test and the corresponding general linear test for one slope).
 5.1  Example on IQ and Physical Characteristics
 5.2  Example on Underground Air Quality
 5.3  The Multiple Linear Regression Model
 5.4  A Matrix Formulation of the Multiple Regression Model
 5.5  Three Types of MLR Parameter Tests
 5.6  The General Linear F-Test
 5.7  MLR Parameter Tests
 5.8  Partial R-squared
 5.9  Further MLR Examples
Copyright © 2018 The Pennsylvania State University Privacy and Legal Statements Contact the Department of Statistics Online Programs
Writing hypothesis for linear multiple regression models
I struggle writing hypothesis because I get very much confused by reference groups in the context of regression models.
For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon).
Say all your friends think you should buy a 6 cylinder car, but before you make up your mind you want to know how 6 cylinder cars perform miles-per-gallon-wise compared to 4 cylinder cars, because you think there might be a difference.
Would this be a fair null hypothesis (since 4 cylinder cars is the reference group)?: There is no difference between 6 cylinder car milespergallon performance and 4 cylinder car milespergallon performance.
Would this be a fair model interpretation?: 6 cylinder vehicles travel fewer miles per gallon (p = 0.010, β = −4.00, CI: −6.95, −1.04) as compared to 4 cylinder vehicles when adjusting for all other predictors, thus rejecting the null hypothesis.
Sorry for troubling, and thanks in advance for any feedback!
 multiple-regression
 linear-model
 interpretation
Yes, you already got the right answer to both of your questions.
 Your null hypothesis is completely fair. You did it the right way. When you have a factor variable as predictor, you omit one of the levels as a reference category (the default is usually the first one, but you can also change that). Then all your other levels’ coefficients are tested for a significant difference compared to the omitted category. Just like you did.
If you would like to compare 6-cylinder cars with 8-cylinder cars, then you would have to change the reference category. In your hypothesis you could just have added at the end (or as a footnote): "when adjusting for weight and gear", but it is fine the way you did it.
 Your model interpretation is correct: it is perfect the way you did it. You could even have said: "the best estimate is that 6 cylinder vehicles travel 4 miles per gallon less than 4 cylinder vehicles (p-value: 0.010; CI: −6.95, −1.04), when adjusting for weight and gear, thus rejecting the null hypothesis".
Let's assume that your hypothesis was related to gears, and you were comparing 4-gear vehicles with 3-gear vehicles. Then your result would be β: 0.65; p-value: 0.67; CI: −2.5, 3.8. You would say: "There is no statistically significant difference between three- and four-gear cars in fuel consumption, when adjusting for weight and number of cylinders, thus failing to reject the null hypothesis".
Stats and R
Multiple linear regression made simple

Contents:
- Introduction
- Another interpretation of the intercept
- Correlation does not imply causation
- Conditions of application
- Visualizations
- Interpretations of coefficients \(\widehat\beta\)
- \(p\)-value associated to the model
- Coefficient of determination \(R^2\)
- Print model’s parameters
- Automatic reporting
- Predictions
- Linear hypothesis tests
- Overall effect of categorical variables
- Interaction
Remember that descriptive statistics is a branch of statistics that allows you to describe your data at hand.
Inferential statistics (with the popular hypothesis tests and confidence intervals) is another branch of statistics that allows you to make inferences, that is, to draw conclusions about a population based on a sample.
The last branch of statistics is about modeling the relationship between two or more variables . 1 The most common statistical tool to describe and evaluate the link between variables is linear regression.
There are two types of linear regression:
 Simple linear regression is a statistical approach that allows you to assess the linear relationship between two quantitative variables. More precisely, it enables the relationship to be quantified and its significance to be evaluated.
 Multiple linear regression is a generalization of simple linear regression, in the sense that this approach makes it possible to evaluate the linear relationships between a response variable (quantitative) and several explanatory variables (quantitative or qualitative ).
In the real world, multiple linear regression is used more frequently than simple linear regression. This is mostly the case because:
 Multiple linear regression allows you to evaluate the relationship between two variables, while controlling for the effect (i.e., removing the effect) of other variables.
 With data collection becoming easier, more variables can be included and taken into account when analyzing data.
Multiple linear regression being such a powerful statistical tool, I would like to present it so that everyone understands it, and perhaps even use it when deemed necessary. However, I cannot afford to write about multiple linear regression without first presenting simple linear regression.
So after a reminder about the principle and the interpretations that can be drawn from a simple linear regression, I will illustrate how to perform multiple linear regression in R. I will also show, in the context of multiple linear regression, how to interpret the output and discuss about its conditions of application. I will then conclude the article by presenting more advanced topics directly linked to linear regression.
Simple linear regression: reminder
Simple linear regression is an asymmetric procedure in which:
 one of the variables is considered the response or the variable to be explained. It is also called the dependent variable, and is represented on the \(y\)-axis
 the other variable is the explanatory variable, also called the independent variable, and is represented on the \(x\)-axis
Simple linear regression makes it possible to evaluate the existence of a linear relationship between two variables and to quantify this link. Note that linearity is a strong assumption in linear regression, in the sense that it tests and quantifies whether the two variables are linearly related.
What makes linear regression a powerful statistical tool is that it allows us to quantify by how much the response/dependent variable varies when the explanatory/independent variable increases by one unit .
This concept is key in linear regression and helps to answer the following questions:
 Is there a link between the amount spent in advertising and the sales during a certain period?
 Are additional years of schooling rewarded, in financial terms, in one's first job?
 Will an increase in tobacco taxes reduce its consumption?
 What is the most likely price of an apartment, depending on the area?
 Does a person’s reaction time to a stimulus depend on gender?
Simple linear regression can be seen as an extension of the analysis of variance (ANOVA) and Student's t-test . ANOVA and the t-test compare groups in terms of a quantitative variable: 2 groups for the t-test and 3 or more groups for ANOVA. 2
For these tests, the independent variable, that is, the grouping variable forming the different groups to compare, must be a qualitative variable. Linear regression is an extension because, in addition to being used to compare groups, it can also be used with quantitative independent variables (which is not possible with the t-test and ANOVA).
In this article, we are interested in assessing whether there is a linear relationship between the distance traveled with a gallon of fuel and the weight of cars. For this example, we use the mtcars dataset (preloaded in R).
The dataset includes fuel consumption and 10 aspects of automotive design and performance for 32 automobiles: 3
 mpg Miles/(US) gallon (with a gallon \(\approx\) 3.79 liters)
 cyl Number of cylinders
 disp Displacement (cu.in.)
 hp Gross horsepower
 drat Rear axle ratio
 wt Weight (1000 lbs, with 1000 lbs \(\approx\) 453.59 kg)
 qsec 1/4 mile time (with 1/4 mile \(\approx\) 402.34 meters)
 vs Engine (0 = V-shaped, 1 = straight)
 am Transmission (0 = automatic, 1 = manual)
 gear Number of forward gears
 carb Number of carburetors
The scatterplot above shows that there seems to be a negative relationship between the distance traveled with a gallon of fuel and the weight of a car . This makes sense, as the heavier the car, the more fuel it consumes and thus the fewer miles it can drive with a gallon.
This is already a good overview of the relationship between the two variables, but a simple linear regression with the miles per gallon as dependent variable and the car’s weight as independent variable goes further. It will tell us by how many miles the distance varies, on average, when the weight varies by one unit (1000 lbs in this case). This is possible thanks to the regression line.
The principle of simple linear regression is to find the line (i.e., determine its equation) which passes as close as possible to the observations , that is, the set of points formed by the pairs \((x_i, y_i)\) .
In the first step, there are many potential lines. Three of them are plotted:
To find the line which passes as close as possible to all the points, we take the square of the vertical distance between each point and each potential line. Note that we take the square of the distances to make sure that a negative gap (i.e., a point below the line) is not compensated by a positive gap (i.e., a point above the line). The line which passes closest to the set of points is the one which minimizes the sum of these squared distances .
The resulting regression line is presented in blue in the following plot, and the dashed gray lines represent the vertical distance between the points and the fitted line. These vertical distances between each observed point and the fitted line determined by the least squares method are called the residuals of the linear regression model and denoted \(\epsilon\) .
By definition, there is no other line with a smaller sum of squared distances between the points and the line. This method is called the least squares method, or OLS for ordinary least squares .
The regression model can be written in the form of the equation:
\[Y = \beta_0 + \beta_1 X + \epsilon\]
 \(Y\) the dependent variable
 \(X\) the independent variable
 \(\beta_0\) the intercept (the mean value of \(Y\) when \(X = 0\) ), also sometimes denoted \(\alpha\)
 \(\beta_1\) the slope (the expected increase in \(Y\) when \(X\) increases by one unit)
 \(\epsilon\) the residuals (the error term of mean 0 which describes the variations of \(Y\) not captured by the model, also referred to as the noise)
When we determine the line which passes closest to all the points (we say that we fit a line to the observed data), we actually estimate the unknown parameters \(\beta_0\) and \(\beta_1\) based on the data at hand. Remember from your geometry classes that to draw a line you only need two parameters: the intercept and the slope.
These estimates (and thus the blue line shown in the previous scatterplot) can be computed by hand with the following formulas:
\[\begin{align} \widehat\beta_1 &= \frac{\sum^n_{i = 1} (x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i = 1}(x_i - \bar{x})^2} \\ &= \frac{\left(\sum^n_{i = 1}x_iy_i\right) - n\bar{x}\bar{y}}{\sum^n_{i = 1}(x_i - \bar{x})^2} \end{align}\]
\[\widehat\beta_0 = \bar{y} - \widehat\beta_1 \bar{x}\]
with \(\bar{x}\) and \(\bar{y}\) denoting the sample mean of \(x\) and \(y\) , respectively.
(If you struggle to compute \(\widehat\beta_0\) and \(\widehat\beta_1\) by hand, see this Shiny app which helps you to easily find these estimates based on your data.)
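As a quick sketch (using the mtcars data introduced above, with wt as \(x\) and mpg as \(y\); variable names are my own), the estimates can be computed directly from these formulas:

```r
# OLS estimates "by hand" for mpg ~ wt from the mtcars data
x <- mtcars$wt
y <- mtcars$mpg

# Slope: covariance-like numerator over the sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the fitted line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```

These values should match the coefficients returned by `lm(mpg ~ wt, data = mtcars)`.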
The intercept \(\widehat\beta_0\) is the mean value of the dependent variable \(Y\) when the independent variable \(X\) takes the value 0 . Its estimate is of no interest when evaluating whether there is a linear relationship between two variables. It is, however, of interest if you want to know what the mean value of \(Y\) could be when \(X = 0\) . 4
The slope \(\widehat\beta_1\) , on the other hand, corresponds to the expected variation of \(Y\) when \(X\) varies by one unit . It tells us two important things:
 The sign of the slope indicates the direction of the line —a positive slope ( \(\widehat\beta_1 > 0\) ) indicates that there is a positive relationship between the two variables of interest (they vary in the same direction), whereas a negative slope ( \(\widehat\beta_1 < 0\) ) means that there is a negative relationship between the two variables (they vary in opposite directions).
 The value of the slope provides information on the speed of evolution of the variable \(Y\) as a function of the variable \(X\) . The larger the slope in absolute value, the larger the expected variation of \(Y\) for each unit of \(X\) . Note, however, that a large value does not necessarily mean that the relationship is statistically significant (more on that in the section about significance of the relationship ).
This is similar to the correlation coefficient , which gives information about the direction and the strength of the relationship between two variables.
To perform a linear regression in R, we use the lm() function (which stands for linear model). The function requires the dependent variable to be specified first, then the independent variable, separated by a tilde ( ~ ).
Applied to our example of a car's weight and fuel consumption, we fit the model and then display its results with the summary() function:
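A minimal sketch of that call (the object name model is my own choice):

```r
# Fit the simple linear regression of miles/gallon on weight
model <- lm(mpg ~ wt, data = mtcars)

# Display the estimates, standard errors, t values and p values
summary(model)
```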
In practice, we usually check the conditions of application before interpreting the coefficients (because if they are not respected, results may be biased).
In this article, however, I present the interpretations before testing the conditions because the point is to show how to interpret the results, and less about finding a valid model.
The results can be summarized as follows (see the column Estimate in the table Coefficients ):
 The intercept \(\widehat\beta_0 =\) 37.29 indicates that, for a hypothetical car weighing 0 lbs, we can expect, on average, a consumption of 37.29 miles/gallon. This interpretation is shown for illustrative purposes, but as a car weighing 0 lbs is impossible, the interpretation has no meaning. In practice, we would therefore refrain from interpreting the intercept in this case. See another interpretation of the intercept when the independent variable is centered around its mean in this section .
 There is a negative relationship between the weight and the distance a car can drive with a gallon (this was expected given the negative trend of the points in the scatterplot shown previously).
 But more importantly, a slope of -5.34 means that, for an increase of one unit in the weight (that is, an increase of 1000 lbs), the number of miles per gallon decreases, on average, by 5.34 units. In other words, for an increase of 1000 lbs, the number of miles/gallon decreases, on average, by 5.34 .
Another useful interpretation of the intercept is when the independent variable is centered around its mean. In this case, the intercept is interpreted as the mean value of \(Y\) for individuals who have a value of \(X\) equal to the mean of \(X\) .
Let’s see it in practice.
We first center the wt variable around its mean and then re-run a simple linear model with this new variable:
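A sketch of this centering step (the names wt_centered and model_centered are my own):

```r
# Center the weight variable around its mean, then refit the model
mtcars$wt_centered <- mtcars$wt - mean(mtcars$wt)

model_centered <- lm(mpg ~ wt_centered, data = mtcars)
summary(model_centered)
```

The intercept of this model equals the mean of mpg, while the slope is unchanged.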
Based on the results, we see that:
 The slope has not changed; the interpretation is the same as without the centering (which makes sense since the regression line has simply been shifted to the right or left).
 More importantly, the intercept is now \(\widehat\beta_0 =\) 20.09, so we can expect, on average, a consumption of 20.09 miles/gallon for a car with an average weight (the mean of weight is 3.22 so 3220 lbs).
This centering is particularly interesting:
 when the continuous independent variable has no meaningful value of 0 (which is the case here as a car with a weight of 0 lbs is not meaningful), or
 when interpreting the intercept is important.
Note that centering does not have to be done around the mean only. The independent variable can also be centered at some value that is actually in the range of the data. The exact value you center on does not matter as long as it is meaningful and within the range of the data (it is not recommended to center on a value outside the range of the data because we cannot be sure about the type of relationship between the two variables outside that range).
For our example, we may find that choosing the lowest value or the highest value of weight is the best option. So it’s up to us to decide the weight at which it’s most meaningful to interpret the intercept.
Significance of the relationship
As mentioned earlier, the value of the slope does not , by itself, make it possible to assess the significance of the linear relationship .
In other words, a slope different from 0 does not necessarily mean it is significantly different from 0, so it does not mean that there is a significant relationship between the two variables in the population. There could be a slope of 10 that is not significant, and a slope of 2 that is significant.
Significance of the relationship also depends on the variability of the slope, which is measured by its standard error and generally noted \(se(\widehat\beta_1)\) .
Without going into too much detail, to assess the significance of the linear relationship, we divide the slope by its standard error. This ratio is the test statistic and follows a Student distribution with \(n - 2\) degrees of freedom: 5
\[T_{n - 2} = \frac{\widehat\beta_1}{se(\widehat\beta_1)}\]
For a two-sided test, the null and alternative hypotheses are: 6
 \(H_0 : \beta_1 = 0\) (there is no (linear) relationship between the two variables)
 \(H_1 : \beta_1 \ne 0\) (there is a (linear) relationship between the two variables)
Roughly speaking, if this ratio is greater than 2 in absolute value then the slope is significantly different from 0, and therefore the relationship between the two variables is significant (and in that case it is positive or negative depending on the sign of the estimate \(\widehat\beta_1\) ).
The standard error and the test statistic are shown in the columns Std. Error and t value of the table Coefficients .
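As a sketch, this ratio can be reproduced by hand from the simple model fitted earlier (the object names are my own):

```r
model <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(model)$coefficients

# Test statistic for the slope: estimate divided by its standard error
t_stat <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]
t_stat  # matches the "t value" column of summary(model)
```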
Fortunately, R gives a more precise and easier way to assess the significance of the relationship. The information is provided in the column Pr(>|t|) of the Coefficients table. This is the p value of the test. As for any statistical test , if the p value is greater than or equal to the significance level (usually \(\alpha = 0.05\) ), we do not reject the null hypothesis; if the p value is lower than the significance level, we reject the null hypothesis.
If we do not reject the null hypothesis, we do not reject the hypothesis of no relationship between the two variables (because we do not reject the hypothesis of a slope of 0). On the contrary, if we reject the null hypothesis of no relationship, we can conclude that there is a significant linear relationship between the two variables.
In our example, the p value = 1.29e-10 < 0.05 so we reject the null hypothesis at the significance level \(\alpha = 5\%\) . We therefore conclude that there is a significant relationship between a car’s weight and its fuel consumption .
Tip: In order to make sure I interpret only parameters that are significant, I tend to first check the significance of the parameters thanks to the p values, and then interpret the estimates accordingly. For completeness, note that the test is also performed on the intercept. The p value being smaller than 0.05, we also conclude that the intercept is significantly different from 0.
Be careful that a significant relationship between two variables does not necessarily mean that there is an influence of one variable on the other or that there is a causal effect between these two variables!
A significant relationship between \(X\) and \(Y\) can appear in several cases:
 \(X\) causes \(Y\)
 \(Y\) causes \(X\)
 a third variable causes both \(X\) and \(Y\)
 a combination of these three reasons
A statistical model alone cannot establish a causal link between two variables. Demonstrating causality between two variables is more complex and requires, among other things, a specific experimental design, the repeatability of the results over time, as well as various samples.
This is the reason you will often read “ Correlation does not imply causation ” and linear regression follows the same principle.
Unfortunately, linear regression cannot be used in all situations.
In addition to the requirement that the dependent variable must be a continuous quantitative variable, simple linear regression requires that the data satisfy the following conditions:
 Linearity: The relationship between the two variables should be linear (at least roughly). For this reason it is always necessary to represent graphically the data with a scatterplot before performing a simple linear regression. 7
 Independence: Observations must be independent. It is the sampling plan and the experimental design that usually provide information on this condition. If the data come from different individuals or experimental units, they are usually independent. On the other hand, if the same individuals are measured at different periods, the data are probably not independent.
 Normality of the residuals: For large sample sizes, confidence intervals and tests on the coefficients are (approximately) valid whether the errors follow a normal distribution or not (a consequence of the central limit theorem, see more in Ernst and Albers ( 2017 ) and Lumley et al. ( 2002 ) )! For small sample sizes, residuals should follow a normal distribution. This condition can be tested visually (via a QQ-plot and/or a histogram ), or more formally (via the Shapiro-Wilk test for instance).
 Homoscedasticity of the residuals: The variance of the errors should be constant. There is a lack of homoscedasticity when the dispersion of the residuals increases with the predicted values (fitted values). This condition can be tested visually (by plotting the standardized residuals vs. the fitted values) or more formally (via the Breusch-Pagan test).
 No influential points: If the data contain outliers , it is essential to identify them so that they do not , on their own, influence the results of the regression. Note that an outlier is not an issue per se: if the point is aligned with the regression line, for example, it does not influence the fitted line. It becomes a problem in the context of linear regression when it substantially influences the estimates (and in particular the slope of the regression line). This can be tackled by identifying outliers (via Cook’s distance 8 or the leverage index 9 for instance), and comparing the results with and without the potential outliers. Do the results remain the same with the two approaches? If yes, outliers are not really an issue in this case. If the results are much different, you can use the Theil-Sen estimator, robust regression or quantile regression, which are all more robust to outliers.
Tip: I remember the first 4 conditions thanks to the acronym “LINE”, for Linearity, Independence, Normality and Equality of variance.
If any of the conditions is not met, the tests and the conclusions could be erroneous, so it is best to avoid using and interpreting the model. In that case, the conditions can sometimes be met by transforming the data (e.g., logarithmic transformation, square or square root, Box-Cox transformation, etc.) or by adding a quadratic term to the model.
If it does not help, it could be worth thinking about removing some variables or adding other variables, or even considering other types of models such as nonlinear models.
Keep in mind that in practice, conditions of application should be verified before drawing any conclusion based on the model. I refrain here from testing the conditions on our data because it will be covered in detail in the context of multiple linear regression (see this section ).
If you are a frequent reader of the blog, you may know that I like to draw (simple but efficient) visualizations to illustrate my statistical analyses. Linear regression is not an exception.
There are numerous ways to visualize the relationship between the two variables of interest, but the easiest one I found so far is via the visreg() function from the package of the same name:
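A minimal sketch of that single line of code, assuming the {visreg} package is installed (the object name model is my own):

```r
# install.packages("visreg")  # run once if the package is not installed
library(visreg)

model <- lm(mpg ~ wt, data = mtcars)
visreg(model)  # scatterplot with the fitted line and its confidence band
```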
I like this approach for its simplicity—only a single line of code.
However, other elements could be displayed on the regression plot (for example the regression equation and the \(R^2\) ). This can easily be done with the stat_regline_equation() and stat_cor() functions from the {ggpubr} package:
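A hedged sketch of such a plot, assuming {ggplot2} and {ggpubr} are installed (label positions are arbitrary choices of mine):

```r
# install.packages(c("ggplot2", "ggpubr"))  # run once if needed
library(ggplot2)
library(ggpubr)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_regline_equation(label.x = 4, label.y = 33) +  # regression equation
  stat_cor(aes(label = after_stat(rr.label)),         # R^2 on the plot
           label.x = 4, label.y = 31)

p
```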
Multiple linear regression
Now that you understand the principle behind simple linear regression and you know how to interpret the results, it is time to discuss multiple linear regression.
We also start with the underlying principle of multiple linear regression, then show how to interpret the results, how to test the conditions of application and finish with more advanced topics.
Multiple linear regression is a generalization of simple linear regression, in the sense that this approach makes it possible to relate one variable with several variables through a linear function in its parameters.
Multiple linear regression is used to assess the relationship between two variables while taking into account the effect of other variables . By taking into account the effect of other variables, we cancel out the effect of these other variables in order to isolate and measure the relationship between the two variables of interest. This point is the main difference with simple linear regression.
To illustrate how to perform a multiple linear regression in R, we use the same dataset as the one used for simple linear regression ( mtcars ). Below is a short preview:
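For instance, the first rows of the dataset can be displayed with:

```r
# First 6 rows of mtcars: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
head(mtcars)
```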
We have seen that there is a significant and negative linear relationship between the distance a car can drive with a gallon and its weight ( \(\widehat\beta_1 =\) -5.34, \(p\) value < 0.001).
However, one may wonder whether other factors could also explain a car’s fuel consumption.
To explore this, we can visualize the relationship between a car’s fuel consumption ( mpg ) together with its weight ( wt ), horsepower ( hp ) and displacement ( disp ) (engine displacement is the combined swept (or displaced) volume of air resulting from the up-and-down movement of pistons in the cylinders; usually, the higher it is, the more powerful the car):
It seems that, in addition to the negative relationship between miles per gallon and weight, there is also:
 a negative relationship between miles/gallon and horsepower (lighter points, indicating more horsepower, tend to be more present in low levels of miles per gallon)
 a negative relationship between miles/gallon and displacement (bigger points, indicating larger values of displacement, tend to be more present in low levels of miles per gallon).
Therefore, we would like to evaluate the relation between the fuel consumption and the weight, but this time by adding information on the horsepower and displacement. By adding this additional information, we are able to capture only the direct relationship between miles/gallon and weight (the indirect effect due to horsepower and displacement is canceled out).
This is the whole point of multiple linear regression! In fact, in multiple linear regression, the estimated relationship between the dependent variable and an explanatory variable is an adjusted relationship, that is, free of the linear effects of the other explanatory variables.
Let’s illustrate this notion of adjustment by adding both horsepower and displacement in our linear regression model:
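A sketch of this extended model (the object name model2 follows the name used later in the article):

```r
# Multiple linear regression: add horsepower and displacement to the model
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(model2)
```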
We can see that now, the relationship between miles/gallon and weight is weaker in terms of slope ( \(\widehat\beta_1 =\) -3.8 now, against \(\widehat\beta_1 =\) -5.34 when only the weight was considered).
The effect of weight on fuel consumption was adjusted according to the effect of horsepower and displacement. This is the remaining effect between miles/gallon and weight after the effects of horsepower and displacement have been taken into account. More detailed interpretations in this section .
Multiple linear regression models are defined by the equation
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon\]
It is similar to the equation for simple linear regression, except that there is more than one independent variable ( \(X_1, X_2, \dots, X_p\) ).
Estimation of the parameters \(\beta_0, \dots, \beta_p\) by the method of least squares is based on the same principle as that of simple linear regression, but applied to \(p\) dimensions. It is thus no longer a question of finding the best line (the one which passes closest to the pairs of points ( \(y_i, x_i\) )), but finding the \(p\) dimensional plane which passes closest to the coordinate points ( \(y_i, x_{i1}, \dots, x_{ip}\) ).
This is done by minimizing the sum of the squares of the deviations of the points on the plane :
The least squares method results in an adjusted estimate of the coefficients. The term adjusted means after taking into account the linear effects of the other independent variables, both on the dependent variable and on the predictor variable in question.
In other words, the coefficient \(\beta_1\) corresponds to the slope of the relationship between \(Y\) and \(X_1\) when the linear effects of the other explanatory variables ( \(X_2, \dots, X_p\) ) have been removed, both at the level of the dependent variable \(Y\) but also at the level of \(X_1\) .
Applied to our model with weight, horsepower and displacement as independent variables, we have:
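The coefficient table can also be extracted directly (a sketch; model2 is the model with weight, horsepower and displacement fitted above):

```r
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Estimate, Std. Error, t value and Pr(>|t|) for each parameter
summary(model2)$coefficients
```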
The table Coefficients gives the estimate for each parameter (column Estimate ), together with the \(p\) value testing the nullity of the parameter (column Pr(>|t|) ).
The hypotheses are the same as for simple linear regression, that is:
 \(H_0 : \beta_j = 0\)
 \(H_1 : \beta_j \ne 0\)
The test of \(\beta_j = 0\) is equivalent to testing the hypothesis: is the dependent variable associated with the independent variable studied, all other things being equal, that is to say, at constant level of the other independent variables.
In other words:
 the test of \(\beta_1 = 0\) corresponds to testing the hypothesis: is fuel consumption associated with a car’s weight, at a constant level of horsepower and displacement
 the test of \(\beta_2 = 0\) corresponds to testing the hypothesis: is fuel consumption associated with horsepower, at a constant level of weight and displacement
 the test of \(\beta_3 = 0\) corresponds to testing the hypothesis: is fuel consumption associated with displacement, at a constant level of weight and horsepower
 (for the sake of completeness: the test of \(\beta_0 = 0\) corresponds to testing the hypothesis: is miles/gallon different from 0 when weight, horsepower and displacement are equal to 0)
As noted earlier, in practice the conditions of application should be checked before interpreting the coefficients; here again, the interpretations come first because the goal is to show how to read the results rather than to find a valid model.
Based on the output of our model, we conclude that:
 There is a significant and negative relationship between miles/gallon and weight, all else being equal . So for an increase of one unit in the weight (that is, an increase of 1000 lbs), the number of miles/gallon decreases, on average, by 3.8, for a constant level of horsepower and displacement ( \(p\) value = 0.001).
 There is a significant and negative relationship between miles/gallon and horsepower, all else being equal. So for an increase of one unit of horsepower, the distance traveled with a gallon decreases, on average, by 0.03 mile, for a constant level of weight and displacement ( \(p\) value = 0.011).
 We do not reject the hypothesis of no relationship between miles/gallon and displacement when weight and horsepower stay constant (because \(p\) value = 0.929 > 0.05).
 (For completeness but it should be interpreted only when it makes sense: for a weight, horsepower and displacement = 0, we can expect that a car has, on average, a fuel consumption of 37.11 miles/gallon ( \(p\) value < 0.001). See a more useful interpretation of the intercept when the independent variables are centered in this section .)
This is how to interpret quantitative independent variables. Interpreting qualitative independent variables is slightly different in the sense that the coefficient quantifies the effect of a level in comparison with the reference level, still all else being equal.
So it compares the different groups (formed by the different levels of the categorical variable) in terms of the dependent variable (this is why linear regression can be seen as an extension to the ttest and ANOVA).
For the illustration, we model the fuel consumption ( mpg ) on the weight ( wt ) and the shape of the engine ( vs ). The variable vs has two levels: Vshaped (the reference level ) and straight engine. 10
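A sketch of this model; since vs is coded 0/1 in mtcars, it is converted to a factor (the object name model3 is my own):

```r
# Model fuel consumption on weight and engine shape
# vs is coded 0 = V-shaped (reference level), 1 = straight
model3 <- lm(mpg ~ wt + factor(vs), data = mtcars)
summary(model3)
```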
 For a Vshaped engine and for an increase of one unit in the weight (that is, an increase of 1000 lbs), the number of miles/gallon decreases, on average, by 4.44 ( \(p\) value < 0.001).
 The distance traveled with a gallon of fuel increases by, on average, 3.15 miles when the engine is straight compared to a Vshaped engine , for a constant weight ( \(p\) value = 0.013).
 (For completeness but it should be interpreted only when it makes sense: for a weight = 0 and a Vshaped engine, we can expect that the car has, on average, a fuel consumption of 33 miles/gallon ( \(p\) value < 0.001). See a more useful interpretation of the intercept when the independent variables are centered in this section .)
As for simple linear regression, multiple linear regression requires some conditions of application for the model to be usable and the results to be interpretable. Conditions for simple linear regression also apply to multiple linear regression, that is:
 Linearity of the relationships between the dependent and independent variables 11
 Independence of the observations
 Normality of the residuals
 Homoscedasticity of the residuals
 No influential points ( outliers )
But there is one more condition for multiple linear regression:
 No multicollinearity: Multicollinearity arises when there is a strong linear correlation between the independent variables , conditional on the other variables in the model. It is important to check for it because it may lead to imprecision or instability of the estimated parameters. It can be assessed by studying the correlation between each pair of independent variables, or even better, by computing the variance inflation factor (VIF). The VIF measures how much the variance of an estimated regression coefficient increases relative to a situation in which the explanatory variables are strictly independent. A high VIF is a sign of multicollinearity (the threshold is generally set at 5 or 10, depending on the domain). The easiest way to reduce the VIF is to remove some correlated independent variables, or possibly to standardize the data.
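As a sketch, the VIFs of our model can be computed with the vif() function from the {car} package (one of several packages offering this; assuming it is installed):

```r
# install.packages("car")  # run once if the package is not installed
library(car)

model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(model2)  # a VIF above 5 or 10 suggests problematic multicollinearity
```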
You will often see that these conditions are verified by running plot(model, which = 1:6) and it is totally correct. However, I recently discovered the check_model() function from the {performance} package which tests these conditions all at the same time (and let’s be honest, in a more elegant way). 12
Applied on our model2 with miles/gallon as dependent variable, and weight, horsepower and displacement as independent variables, we have:
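A sketch of that call, assuming the {performance} package (and its companion {see}, needed for the plots) is installed:

```r
# install.packages(c("performance", "see"))  # run once if needed
library(performance)

model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
check_model(model2)  # diagnostic plots for all conditions at once
```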
In addition to testing all conditions at the same time, it also gives insight on how to interpret the different diagnostic plots and what you should expect (see in the subtitles of each plot).
Based on these diagnostic plots, we see that:
 Homogeneity of variance (middle left plot) is respected
 Multicollinearity (bottom left plot) is not an issue (I tend to use the threshold of 10 for VIF, and all of them are below 10) 13
 There are no influential points (middle right plot)
 Normality of the residuals (bottom right plot) is not perfect, due to 3 points deviating from the reference line, but it still seems acceptable to me. In any case, the number of observations is large enough given the number of parameters 14 and given the small deviation from normality, so tests on the coefficients are (approximately) valid whether or not the errors follow a normal distribution
 Linearity (top right plot) is not perfect so let’s check each independent variable separately:
It seems that the relationship between miles/gallon and horsepower is not linear, which could be the main component of the slight linearity defect of the model.
To improve linearity, the variable could be removed, a transformation could be applied (logarithmic and/or squared for instance) or a quadratic term could be added to the model. 15 If this does not fix the issue of linearity, other types of models could be considered.
If you want to read more about these conditions of applications and how to deal with them, here is a very complete chapter on diagnostics for linear models written by Prof. Dustin Fife.
For the sake of easiness and for illustrative purposes, we assume linearity for the rest of the article.
When the conditions of application are met, we usually say that the model is valid. But not all valid models are good models. The next section deals with model selection.
How to choose a good linear model?
A model which satisfies the conditions of application is the minimum requirement, but you will likely find several models that meet this criterion. So one may wonder how to choose among different models that are all valid.
The three most common tools for selecting a good linear model are:
 the \(p\) value associated to the model,
 the coefficient of determination \(R^2\) and
 the Akaike Information Criterion
The approaches are detailed in the next sections. Note that the first two are applicable to simple and multiple linear regression, whereas the third is only applicable to multiple linear regression.
Before interpreting the estimates of a model, it is good practice to first check the \(p\) value associated with the model. This \(p\) value indicates whether the model is better than a model with only the intercept .
The hypotheses of the test (called the F-test) are:
 \(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)
 \(H_1:\) at least one coefficient \(\beta \ne 0\)
This \(p\) value can be found at the bottom of the summary() output:
The \(p\) value = 8.65e-11. The null hypothesis is rejected, so we conclude that our model is better than a model with only the intercept because at least one coefficient \(\beta\) is significantly different from 0.
If this \(p\) value is > 0.05 for one of your models, it means that none of the variables you selected help explain the dependent variable. In other words, you should discard this model because it cannot do better than simply taking the mean of the dependent variable.
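In base R, this overall \(p\) value can also be recomputed from the F statistic stored in the summary. A sketch, assuming model2 is mpg ~ wt + hp + disp fitted on mtcars (an assumption, consistent with the \(R^2\) reported later):

```r
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
fstat <- summary(model2)$fstatistic   # named vector: value, numdf, dendf
p_overall <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                       lower.tail = FALSE))
p_overall  # ~8.65e-11
```

This is exactly the \(p\) value printed on the last line of summary(model2).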
The coefficient of determination, \(R^2\) , is a measure of the goodness of fit of the model . It measures the proportion of the total variability that is explained by the model, or how well the model fits the data.
\(R^2\) varies between 0 and 1:
 \(R^2 = 0\) : the model explains nothing
 \(R^2 = 1\) : the model explains everything
 \(0 < R^2 < 1\) : the model explains part of the variability
 the higher the \(R^2\) , the better the model explains the dependent variable. As a rule of thumb, an \(R^2 > 0.7\) indicates a good fit of the model 16
Note that in a simple linear regression model, the coefficient of determination is equal to the square of the Pearson correlation coefficient :
\[R^2 = corr(X, Y)^2\]
\(R^2\) is displayed at the bottom of the summary() output or can be extracted with summary(model2)$r.squared .
\(R^2\) for this model is 0.8268, which means that 82.68% of the variability of the distance traveled with a gallon is explained by the weight, horsepower and displacement of the car. The relatively high \(R^2\) means that the weight, horsepower and displacement of a car are good characteristics to explain the distance it can drive with a gallon of fuel.
Note that if you want to compare models with different numbers of independent variables, it is best to refer to the adjusted \(R^2\) (= 0.8083 here).
Indeed, adding variables to the model cannot decrease the \(R^2\), even if the added variables are unrelated to the dependent variable (so the \(R^2\) will artificially increase, or at best stay constant, as variables are added). The adjusted \(R^2\) therefore takes the complexity of the model (the number of variables) into account by penalizing additional variables; it is a compromise between goodness of fit and parsimony.
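Both values can be extracted programmatically. A quick sketch (again assuming model2 is mpg ~ wt + hp + disp on mtcars, which reproduces the values quoted above):

```r
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
r2  <- summary(model2)$r.squared       # ~0.8268
ar2 <- summary(model2)$adj.r.squared   # ~0.8083 (always <= r2)
```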
A parsimonious model (few variables) is usually preferred over a complex model (many variables). There are two ways to obtain a parsimonious model from a model with many independent variables:
 We can iteratively remove the independent variable least significantly related to the dependent variable (i.e., the one with the highest \(p\) value in an analysis of variance table ) until all of them are significantly associated with the response variable, or
 We can select the model based on the Akaike Information Criterion (AIC) . AIC expresses a desire to fit the model with the smallest number of coefficients possible and allows us to compare models. According to this criterion, the best model is the one with the lowest AIC. This criterion is based on a compromise between the quality of the fit and its complexity. We usually start from a global model with many independent variables, and the procedure (referred to as a stepwise algorithm) 17 automatically compares models and then selects the best one according to the AIC.
We show how to do the second option in R. For the illustration, we start with a model with all variables in the dataset as independent variables (do not forget to transform the factor variables first):
( Tip: The formula mpg ~ . is a shortcut to consider all variables present in the dataset as independent variables, except the one that has been specified as the dependent variable ( mpg here)).
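A minimal sketch of this procedure (assuming the mtcars data with vs and am converted to factors; step() is the base-R stepwise function described in the footnote):

```r
dat <- mtcars
dat$vs <- factor(dat$vs)   # transform the factor variables first
dat$am <- factor(dat$am)
model_all  <- lm(mpg ~ ., data = dat)                          # all other variables as predictors
model_step <- step(model_all, direction = "both", trace = 0)   # stepwise AIC selection
formula(model_step)
```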
The model that has been selected according to this criterion is the following:
Be careful when using an automatic procedure: even though the selected model is the best according to the criterion, it is based
 on a single criterion (AIC in this case), and more importantly,
 on a set of mathematical rules, which means that industry knowledge or human expertise is not taken into consideration.
I believe that this kind of automatic procedure for model selection is a good starting point, but I also believe that the final model should always be checked and tested against other models to make sure it makes sense in practice (apply common sense).
Last but not least, do not forget to also verify the conditions of application because the stepwise procedure does not guarantee that they are respected.
There are many ways to visualize results of a linear regression. The easiest ones I am aware of are:
 visreg() illustrates the relationships between the dependent and independent variables in different plots (one for each independent variable unless you specify which relationship you want to illustrate):
 ggcoefstats() illustrates the results in one single plot, with many statistical details:
In this plot:
 when the solid line does not cross the vertical dashed line, the estimate is significantly different from 0 at the 5% significance level (i.e., \(p\) value < 0.05)
 furthermore, a point to the right (left) of the vertical dashed line means that there is a positive (negative) relationship between the two variables
 the more extreme the point, the stronger the relationship
 plot_summs() also illustrates the results, but in a more concise way:
The advantage of this approach is that it is possible to compare coefficients of multiple models simultaneously (particularly interesting when the models are nested):
To go further
Below are some more advanced topics related to linear regression. Feel free to comment at the end of the article if you believe I missed an important one.
Thanks to the model_parameters() function from the {parameters} package, you can print a summary of the model in a nicely formatted way to make the output more readable:
And if you are using R Markdown , you can use the print_html() function to get a compact and yet comprehensive summary table in your HTML file:
| Parameter | Coefficient | SE | 95% CI | t(28) | p |
|---|---|---|---|---|---|
| (Intercept) | 9.62 | 6.96 | (-4.64, 23.87) | 1.38 | 0.178 |
| wt | -3.92 | 0.71 | (-5.37, -2.46) | -5.51 | < .001 |
| qsec | 1.23 | 0.29 | (0.63, 1.82) | 4.25 | < .001 |
| am (Manual) | 2.94 | 1.41 | (0.05, 5.83) | 2.08 | 0.047 |

Model: mpg ~ wt + qsec + am (32 observations)
Residual standard deviation: 2.459 (df = 28)
R2: 0.850; adjusted R2: 0.834
The report() function from the package of the same name automatically produces reports of models according to best-practice guidelines:
Note that the function also works for dataframes, statistical tests and other models.
Linear regression is also very often used for predictive purposes . Confidence and prediction intervals for new data can be computed with the predict() function.
Suppose we want to predict the miles/gallon for a car with a manual transmission, weighing 3,000 lbs and driving a quarter of a mile ( qsec ) in 18 seconds:
Based on our model, it is expected that this car will drive 22.87 miles with a gallon.
The difference between the confidence and prediction interval is that:
 a confidence interval gives the predicted value for the mean of \(Y\) for a new observation, whereas
 a prediction interval gives the predicted value for an individual \(Y\) for a new observation.
The prediction interval is wider than the confidence interval to account for the additional uncertainty due to predicting an individual response , and not the mean, for a given value of \(X\) .
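As a sketch with base R's predict() (here am is left numeric with 1 = manual, which gives the same fitted value as the factor coding used in the article; wt is expressed in 1,000s of lbs):

```r
model <- lm(mpg ~ wt + qsec + am, data = mtcars)   # am: 0 = automatic, 1 = manual
new_car  <- data.frame(wt = 3, qsec = 18, am = 1)  # 3 = 3,000 lbs
conf_int <- predict(model, newdata = new_car, interval = "confidence")
pred_int <- predict(model, newdata = new_car, interval = "prediction")
conf_int[, "fit"]  # ~22.87 miles/gallon
```

The prediction interval returned by the second call is wider than the confidence interval, reflecting the extra uncertainty of predicting an individual response.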
Linear hypothesis tests make it possible to generalize the F-test mentioned in this section , while offering the possibility to perform either tests comparing coefficients, or tests of equality of linear combinations of coefficients.
For example, to test the linear constraint:
 \(H_0: \beta_1 = \beta_2 = 0\)
 \(H_1:\) not \(H_0\)
we use the linearHypothesis() function of the {car} package as follows:
We reject the null hypothesis and we conclude that at least one of \(\beta_1\) and \(\beta_2\) is different from 0 ( \(p\) value = 1.55e-09).
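If the {car} package is not available, the same kind of joint test can be sketched in base R as a partial F-test between nested models (here assuming, for illustration, that \(\beta_1\) and \(\beta_2\) are the wt and qsec coefficients of the mpg ~ wt + qsec + am model):

```r
full    <- lm(mpg ~ wt + qsec + am, data = mtcars)
reduced <- lm(mpg ~ am, data = mtcars)   # model under H0: beta_wt = beta_qsec = 0
anova(reduced, full)                     # partial F-test of the joint constraint
```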
When the independent variables are categorical with \(k\) categories, the regression table provides \(k-1\) \(p\) values:
The variables vs and am have 2 levels, so one coefficient each is displayed in the regression output. The variable cyl has 3 levels (4, 6 and 8), so 2 of them are displayed. The overall effects of vs and am are reported in the Pr(>|t|) column, but not the overall effect of cyl, because there are more than 2 levels for this variable.
To get the \(p\) value of the overall effect of a categorical variable, we need to get an analysis of variance table via the Anova() function from the {car} package: 18
From this analysis of variance table, we conclude that:
 vs is not significantly associated with mpg ( \(p\) value = 0.451)
 am and cyl are significantly associated with mpg ( \(p\) values < 0.05)
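A base-R alternative to car::Anova() is drop1(), which also tests each variable as a whole (equivalent to type-II tests when the model has no interactions). A sketch, assuming the factors have been created on mtcars:

```r
dat <- mtcars
dat$vs  <- factor(dat$vs)
dat$am  <- factor(dat$am)
dat$cyl <- factor(dat$cyl)   # 3 levels: 4, 6, 8
model_cat <- lm(mpg ~ vs + am + cyl, data = dat)
d <- drop1(model_cat, test = "F")   # one overall F-test (and p value) per variable
d
```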
So far we have covered multiple linear regression without any interaction.
There is an interaction effect between factors A and B if the effect of factor A on the response depends on the level taken by factor B .
In R, interaction can be added as follows:
From the output we conclude that there is an interaction between the weight and the transmission ( \(p\) value = 0.00102). This means that the effect of the weight on the distance traveled with a gallon depends on the transmission type .
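A sketch of fitting this interaction model in base R (assuming am is recoded as a labeled factor, as done earlier in the article):

```r
dat <- mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("automatic", "manual"))
model_int <- lm(mpg ~ wt * am, data = dat)   # wt * am expands to wt + am + wt:am
p_int <- summary(model_int)$coefficients["wt:ammanual", "Pr(>|t|)"]
p_int  # ~0.001
```

A significant wt:am coefficient indicates that the slope of mpg on weight differs between transmission types.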
The easiest way to handle interaction is to visualize the relationship for each level of the categorical variable:
We see that the relationship between weight and miles/gallon is stronger (the slope is steeper) for cars with a manual transmission compared to cars with an automatic transmission.
This is a good example to illustrate the point that when studying a relationship between two variables, say \(X\) and \(Y\) , if one also has data for other variables which are potentially associated with both \(X\) and \(Y\) , it is important to include them in the regression and to analyze the relationship conditionally on these variables .
Omitting some variables that should be included in the model may lead to erroneous and misleading conclusions, up to the point that the relationship is completely reversed (a phenomenon referred to as Simpson’s paradox ).
In this article, we started with a reminder of simple linear regression and in particular its principle and how to interpret the results .
This laid the foundations for a better understanding of multiple linear regression . After explaining its principle , we showed how to interpret the output and how to choose a good linear model . We then mentioned a couple of visualizations and finished the article with some more advanced topics .
Thanks for reading.
I hope this article helped you to better understand linear regression and gave you the confidence to do your own linear regressions in R. If you need to model a binary variable instead of a quantitative continuous variable, see how to perform a binary logistic regression in R .
As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.
Some people see regression analysis as a part of inferential statistics. This is true, as a sample is taken to evaluate the link between two or more variables in a population of interest. I tend to distinguish regression from inferential statistics for two simple reasons: (i) regressions are often used to a broader extent (for predictive analyses, among others), and (ii) the main goal of linear regression (see this section ) differs from the objectives of confidence intervals and hypothesis testing well known in the field of inferential statistics. ↩︎
Formally, ANOVA can also be used to compare 2 groups, but in practice we tend to use it for 3 or more groups, leaving the t-test for 2 groups. ↩︎
More information about the dataset can be found by executing ?mtcars . ↩︎
Note that it is best to avoid interpreting the intercept when \(X\) cannot be equal to 0 or when it makes no sense in practice. ↩︎
\(n\) is the number of observations. ↩︎
Other values than 0 are accepted as well. In that case, the test statistic becomes \(T_{n-2} = \frac{\widehat\beta_1 - a}{se(\widehat\beta_1)}\) where \(a\) is the hypothesized slope. ↩︎
Note that linearity can be checked with a scatterplot of the two variables, or via a scatterplot of the residuals and the fitted values. See more about this in this section . ↩︎
An observation is considered as an outlier based on the Cook’s distance if its value is > 1. ↩︎
An observation has a high leverage value (and thus needs to be investigated) if it is greater than \(2p/n\) , where \(p\) is the number of parameters in the model (intercept included) and \(n\) is the number of observations. ↩︎
You can always change the reference level with the relevel() function. See more data manipulation techniques . ↩︎
Note that linearity can also be tested with a scatterplot of the residuals and the fitted values. ↩︎
After installing the {performance} package, you will also need to install the {see} package manually. See how to install an R package if you need more help. ↩︎
I use the threshold of 10 because, as shown by James et al. ( 2013 ) , a value between 5 and 10 indicates a moderate correlation, while VIF values greater than 10 indicate a high and non-tolerable correlation. ↩︎
Austin and Steyerberg ( 2015 ) showed that two subjects per variable tends to permit accurate estimation of regression coefficients in a linear regression model estimated using ordinary least squares. Moreover, the general rule of thumb says that there should be at least 10 observations per variable ( Schmidt and Finan 2018 ) . Our dataset contains 32 observations, above the minimum of 10 subjects per variable. ↩︎
If you apply a logarithmic transformation, see two guides on how to interpret the results: in English and in French . ↩︎
Note that a high \(R^2\) does not guarantee that you selected the best variables or that your model is good. It simply tells that the model fits the data quite well. It is advised to apply common sense when comparing models and not only refer to \(R^2\) (in particular when \(R^2\) are close). ↩︎
There are two main methods: backward and forward. The backward method consists of starting from the model containing all the explanatory variables likely to be relevant, then recursively removing the variable whose removal most reduces the information criterion of the model, until no further reduction is possible. The forward method is the reverse of the backward method, in the sense that we start from the one-variable model with the lowest information criterion and, at each step, add an explanatory variable. By default, the step() function in R combines the backward and forward methods. ↩︎
Not to be confused with the anova() function, which provides results that depend on the order in which the variables appear in the model. ↩︎
Multiple Linear Regression
Summarizing linear relationships in high dimensions
In the last lecture we built our first linear model: an equation of a line drawn through the scatter plot.
\[ \hat{y} = 96.2 + 0.89 x \]
While the idea is simple enough, there is a sea of terminology that floats around this method. A linear model is any model that explains the \(y\) , often called the response variable or dependent variable , as a linear function of the \(x\) , often called the explanatory variable or independent variable . There are many different methods that can be used to decide which line to draw through a scatter plot. The most commonly-used approach is called the method of least squares , a method we’ll look at closely when we turn to prediction. If we think more generally, a linear model fit by least squares is one example of a regression model , which refers to any model (linear or nonlinear) used to explain a numerical response variable.
The reason for all of this jargon isn’t purely to infuriate students of statistics. Linear models are one of the most widely used statistical tools; you can find them in use in diverse fields like biology, business, and political science. Each field tends to adapt the tool and the language around them to their specific needs.
A reality of practicing statistics in these fields, however, is that most data sets are more complex than the example that we saw in the last notes, where there were only two variables. Most phenomena have many different variables that relate to one another in complex ways. We need a more powerful tool to help guide us into these higher dimensions. A good starting point is to expand simple linear regression to include more than one explanatory variable!
To fit a multiple linear regression model using least squares in R, you can use the lm() function, with each additional explanatory variable separated by a + .
Multiple linear regression is powerful because there is no limit to the number of variables that we can include in the model. While Hans Rosling was able to fit 5 variables into a single graphic, what if we had 10 variables? Multiple linear regression allows us to understand high-dimensional linear relationships beyond what’s possible using our visual system.
In today’s notes, we’ll discuss two specific examples where a multiple linear regression model might be applicable:
A scenario involving two numerical variables and one categorical variable
A scenario involving three numerical variables.
Two numerical, one categorical
The Zagat Guide was for many years the authoritative source of restaurant reviews. Their approach was very different from Yelp!. Zagat’s review of a restaurant was compiled by a professional restaurant reviewer who would visit a restaurant and rate it on a 30 point scale across three categories: food, decor, and service. They would also note the average price of a meal and write up a narrative review.
Here’s an example of a review from an Italian restaurant called Marea in New York City.
In addition to learning about the food scores (27), and getting some helpful tips (“bring your bank manager”), we see they’ve also recorded a few more variables on this restaurant: the phone number and website, their opening hours, and the neighborhood (Midtown).
You might ask:
What is the relationship between the food quality and the price of a meal at an Italian restaurant? Are these two variables positively correlated, or is the best Italian meal in New York a simple and inexpensive slice of pizza?
To answer these questions, we need more data. The data frame below contains Zagat reviews from 168 Italian restaurants in Manhattan.
Applying the taxonomy of data, we see that for each restaurant we have recorded the price of an average meal, the food, decor, and service scores (all numerical variables) as well as a note regarding geography (a categorical nominal variable). geo captures whether the restaurant is located on the east side or the west side of Manhattan 1 .
Let’s summarize the relationship between food quality, price, and one categorical variable  geography  using a colored scatter plot.
It looks like if you want a very tasty meal, you’ll have to pay for it. There is a moderately strong, positive, and linear relationship between food quality and price. This plot, however, has a third variable in it: geography. The restaurants from the east and west sides are fairly well mixed, but to my eye the points on the west side might be a tad bit lower on price than the points from the east side. I could numerically summarize the relationship between these three variables by handdrawing two lines, one for each neighborhood.
For a more systematic approach for drawing lines through the center of scatter plots, we need to return to the method of least squares, which is done in R using lm() . In this linear model, we wish to explain the \(y\) variable as a function of two explanatory variables, food and geo , both found in the zagat data frame. We can express that relationship using the formula notation.
It worked . . . or did it? If we extend our reasoning from the last notes, we should write this model as
\[\widehat{price} = -15.97 + 2.87 \times food - 1.45 \times geo\]
What does it mean to put a categorical variable, geo , into a linear model? And how do these three numbers translate into the two lines shown above?
Indicator variables
When working with linear models like the one above, the value of the explanatory variable, \(geowest\) , is multiplied by a slope, -1.45. According to the Taxonomy of Data, arithmetic functions like multiplication are only defined for numerical variables. While that would seem to rule out categorical variables for use as explanatory variables, statisticians have come up with a clever workaround: the indicator variable.
The categorical variable geo can be converted into an indicator variable by shifting the question from “Which side of Manhattan are you on?” to “Are you on the west side of Manhattan?” This is a mutate step.
The new indicator variable geowest is a logical variable, so it has a dual representation as TRUE / FALSE as well as 1/0. Previously, this allowed us to do Boolean algebra. Here, it allows us to include an indicator variable in a linear model.
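To see the mechanics concretely, here is a sketch on a small hypothetical data set (NOT the real Zagat reviews — the numbers are made up for illustration only):

```r
# Hypothetical toy data, just to show how a logical indicator enters lm()
toy <- data.frame(
  price = c(30, 45, 28, 50, 35, 42),
  food  = c(18, 24, 17, 25, 20, 23),
  geo   = c("west", "east", "west", "east", "west", "east")
)
toy$geowest <- toy$geo == "west"   # logical: TRUE/FALSE doubles as 1/0
fit_toy <- lm(price ~ food + geowest, data = toy)
names(coef(fit_toy))   # "(Intercept)" "food" "geowestTRUE"
```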
While you can create indicator variables by hand using mutate , in practice, you will not need to do this. That’s because they are created automatically whenever you put a categorical variable into lm() . Let’s revisit the linear model that we fit above with geowest in the place of geo .
\[\widehat{price} = -15.97 + 2.87 \times food - 1.45 \times geowest\]
To understand the geometry of this model, let’s focus on what the fitted values will be for any restaurant that is on the west side. For those restaurants, the geowest indicator variable will take a value of 1, so if we plug that in and rearrange,
\[\begin{eqnarray} \widehat{price} &= -15.97 + 2.87 \times food - 1.45 \times 1 \\ &= (-15.97 - 1.45) + 2.87 \times food \\ &= -17.42 + 2.87 \times food \end{eqnarray}\]
That is a familiar sight: that is an equation for a line.
Let’s repeat this process for the restaurants on the east side, where the geowest indicator variable will now take a value of 0.
\[\begin{eqnarray} \widehat{price} &= -15.97 + 2.87 \times food - 1.45 \times 0 \\ &= -15.97 + 2.87 \times food \end{eqnarray}\]
That is also the equation for a line.
If you look back and forth between these two equations, you’ll notice that they share the same slope and have different y-intercepts. Geometrically, this means that the output of lm() was describing the equations of two parallel lines :
 one where geowest is 1 (for restaurants on the west side of town)
 one where geowest is 0 (for restaurants on the east side of town).
That means we can use the output of lm() to replace my handdrawn lines with ones that arise from the method of least squares.
Reference levels
One question you still might have: Why did R include the indicator variable for the west side of town as opposed to the one for the east side? The answer lies in the type of variable that geo is recorded as in the zagat dataframe. If you look closely at the initial output, you will see that geo is currently designated chr , which is short for character . geo is indeed a categorical variable with two levels: east and west .
Like in previous settings, R will determine the “order” of levels in a categorical variable registered as a character by way of the alphabet. This means that east will be tagged first and chosen as the reference level : the level of a categorical variable which does not have an indicator variable in the model. If you would like west to be the reference level, then you would need to reorder the levels using factor() inside of a mutate() so that west comes first. This would change the equation that results from then fitting a linear model with lm() , as you can see below!
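Here is a small sketch of this behavior on hypothetical data (again, not the real reviews):

```r
# Hypothetical toy data, to show how the reference level controls which
# indicator lm() creates
toy <- data.frame(
  price = c(30, 45, 28, 50, 35, 42),
  food  = c(18, 24, 17, 25, 20, 23),
  geo   = c("west", "east", "west", "east", "west", "east")
)
fit1 <- lm(price ~ food + geo, data = toy)   # "east" comes first alphabetically
names(coef(fit1))                            # includes "geowest": east is the reference
toy$geo <- factor(toy$geo, levels = c("west", "east"))
fit2 <- lm(price ~ food + geo, data = toy)
names(coef(fit2))                            # now "geoeast": west is the reference
```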
Now our equation looks a little bit different!
\[\widehat{price} = -17.43 + 2.87 \times food + 1.46 \times geoeast\]
In general, if you include a categorical variable with \(k\) levels in a regression model, there will be \(k-1\) indicator variables (and thus, coefficients) associated with it in the model: one for each level of the variable except the reference level 2 . Knowing the reference level also helps us interpret indicator variables that are part of the regression equation; we will see this in a moment. For now, let’s move to our second scenario.
Three numerical
While the standard scatter plot allows us to understand the association between two numerical variables like price and food , to understand the relationship between three numerical variables, we will need to build this scatterplot in 3D 3 .
Take a moment to explore this scatter plot 4 . Can you find the name of the restaurant with very bad decor but pretty good food and a price to match? (It’s Gennaro.) What about the restaurant with equally bad decor but rock-bottom prices, which is surprising given that its food quality is actually somewhat respectable? (It’s Lamarca.)
Instead of depicting the relationship between these three variables graphically, let’s do it numerically by fitting a linear model.
We can write the corresponding equation of the model as
\[ \widehat{price} = -24.50 + 1.64 \times food + 1.88 \times decor \]
To understand the geometry of this model, we can’t use the trick that we did with indicator variables. decor is a numerical variable just like food , so it takes more values than just 0 and 1.
Indeed this linear model is describing a plane .
If you inspect this plane carefully you’ll realize that the tilt of the plane is not quite the same in every dimension. The tilt in the decor dimension is just a little bit steeper than that in the food dimension, a geometric expression of the fact that the coefficient in front of decor, 1.88, is just a bit higher than the coefficient in front of food, 1.64.
Interpreting coefficients
When moving from simple linear regression, with one explanatory variable, to the multiple linear regression, with many, the interpretation of the coefficients becomes trickier but also more insightful.
Mathematically, the coefficient in front of \(food\) , 1.64, can be interpreted in a few different ways:
It is the difference that we would expect to see in the response variable, \(price\) , when two Italian restaurants are separated by a food rating of one and they have the same decor rating.
Controlling for \(decor\) , a one point increase in the food rating is associated with a $1.64 increase in the \(price\) .
Similarly for interpreting \(decor\) : controlling for the quality of the food, a onepoint increase in \(decor\) is associated with a $1.88 increase in the \(price\) .
This conditional interpretation of the coefficients extends to the first setting we looked at, when one variable is numerical and the other is an indicator. Here is that model:
One might interpret \(food\) like this:
 For two restaurants both on the same side of Manhattan, a one point increase in food score is associated with a $2.87 increase in the price of a meal.
As for \(geowest\) :
 For two restaurants with the exact same quality of food, the restaurant on the west side is expected to be $1.45 cheaper than the restaurant on the east side.
We make the comparison to the east side since that level is the reference level in the linear model shown. This is a useful bit of insight: it gives a sense of the premium of being on the east side.
It is also visible in the geometry of the model. When we’re looking at restaurants with the same food quality, we’re looking at a vertical slice of the scatter plot. Here the vertical gray line indicates restaurants where the food quality gets a score of 18. The difference in the expected price of meals on the east side and west side is the vertical distance between the red line and the blue line, which is exactly 1.45. We could draw this vertical line anywhere on the graph and the distance between the red line and the blue line would still be exactly 1.45.
We began this unit on Summarizing Data with graphical and numerical summaries of just a single variable: histograms and bar charts, means and standard deviations. In the last set of notes we introduced our first bivariate numerical summaries: the correlation coefficient, and the linear model. In these notes, we introduced multiple linear regression , a method that can numerically describe the linear relationships between an unlimited number of variables. The range of variable types that can be included in these models is similarly vast. Numerical variables can be included directly, generalizing the geometry of a line into a plane in a higher dimension. Categorical variables can be included using the trick of creating indicator variables : logical variables that take a value of 1 where a particular condition is true. The interpretation of all of the coefficients that result from a multiple regression is challenging but rewarding: it allows us to answer questions about the relationship between two variables after controlling for the values of other variables.
If this felt like a deep dive into a multiple linear regression, don’t worry. Linear models are one of the most commonly used statistical tools, so we’ll be revisiting them throughout the course: investigating their use in making generalizations, causal claims, and predictions.
Fifth Avenue is the wide northsouth street that divides Manhattan into an east side and a west side. ↩︎
This is the case for a model including an intercept term; these models will be our focus this semester and are the most commonly used. ↩︎
While ggplot2 is the best package for static statistical graphics, it does not have any interactive functionality. This plot was made using a system called plotly , which can be used both in R and Python. Read more about how it works at https://plotly.com/r/ . ↩︎
This is a screenshot from an interactive 3D scatter plot. We’ll see the interactive plot in class tomorrow. ↩︎
Multivariable Methods
Multiple Linear Regression Analysis
 Controlling for Confounding With Multiple Linear Regression
 Relative Importance of the Independent Variables
 Evaluating Effect Modification With Multiple Linear Regression
 "Dummy" Variables in Regression Models
 Example of the Use of Dummy Variables
Multiple linear regression analysis is an extension of simple linear regression analysis, used to assess the association between two or more independent variables and a single continuous dependent variable. The multiple linear regression equation is as follows:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$$

where $\hat{Y}$ is the predicted value of the dependent variable, $X_1$ through $X_p$ are $p$ distinct independent variables, $b_0$ is the intercept (the value of $\hat{Y}$ when all of the $X$s are zero), and $b_1$ through $b_p$ are the estimated regression coefficients.
Multiple regression analysis is also used to assess whether confounding exists. Since multiple linear regression analysis allows us to estimate the association between a given independent variable and the outcome holding all other variables constant, it provides a way of adjusting for (or accounting for) potentially confounding variables that have been included in the model.
Suppose we have a risk factor or an exposure variable, which we denote $X_1$ (e.g., $X_1$ = obesity or $X_1$ = treatment), and an outcome or dependent variable which we denote $Y$. We can estimate a simple linear regression equation relating the risk factor (the independent variable) to the dependent variable as follows:

$$\hat{Y} = b_0 + b_1 X_1$$

where $b_1$ is the estimated regression coefficient that quantifies the association between the risk factor and the outcome.
Suppose we now want to assess whether a third variable (e.g., age) is a confounder. We denote the potential confounder $X_2$ and estimate a multiple linear regression equation as follows:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$$

In the multiple linear regression equation, $b_1$ is the estimated regression coefficient that quantifies the association between the risk factor $X_1$ and the outcome, adjusted for $X_2$ ($b_2$ is the estimated regression coefficient that quantifies the association between the potential confounder and the outcome). As noted earlier, some investigators assess confounding by assessing how much the regression coefficient associated with the risk factor (i.e., the measure of association) changes after adjusting for the potential confounder. In this case, we compare $b_1$ from the simple linear regression model to $b_1$ from the multiple linear regression model. As a rule of thumb, if the regression coefficient from the simple linear regression model changes by more than 10%, then $X_2$ is said to be a confounder.
Once a variable is identified as a confounder, we can then use multiple linear regression analysis to estimate the association between the risk factor and the outcome adjusting for that confounder. The test of significance of the regression coefficient associated with the risk factor can be used to assess whether the association between the risk factor and the outcome is statistically significant after accounting for one or more confounding variables. This is also illustrated below.
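One way to see this workflow mechanically is to fit both models on data where a known confounder is at work. The sketch below uses synthetic data (all names and numbers are invented, not from the Framingham example) and ordinary least squares via numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data in which x2 confounds the x1-y association:
# x2 drives both the risk factor x1 and the outcome y.
x2 = rng.normal(50, 10, n)                    # potential confounder (e.g., age)
x1 = 0.3 * x2 + rng.normal(0, 5, n)           # risk factor, correlated with x2
y = 2.0 * x1 + 1.5 * x2 + rng.normal(0, 5, n)

ones = np.ones(n)

# Simple (crude) model: y = b0 + b1*x1
b_simple, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)

# Multiple (adjusted) model: y = b0 + b1*x1 + b2*x2
b_mult, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)

# The crude b1 absorbs part of x2's effect; the adjusted b1 recovers ~2.0
pct_change = abs(b_simple[1] - b_mult[1]) / abs(b_simple[1]) * 100
print(f"crude b1={b_simple[1]:.2f}, adjusted b1={b_mult[1]:.2f}, "
      f"change={pct_change:.0f}%")
```

Because `x2` influences both `x1` and `y`, the crude coefficient on `x1` is inflated well past the true value of 2.0, and the change between the two estimates far exceeds the 10% rule of thumb.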
Example  The Association Between BMI and Systolic Blood Pressure
Suppose we want to assess the association between BMI and systolic blood pressure using data collected in the seventh examination of the Framingham Offspring Study. A total of n=3,539 participants attended the exam, and their mean systolic blood pressure was 127.3 with a standard deviation of 19.0. The mean BMI in the sample was 28.2 with a standard deviation of 5.3.
A simple linear regression analysis reveals the following:
Independent Variable  Regression Coefficient  T  P-value 
Intercept  108.28  62.61  0.0001 
BMI  0.67  11.06  0.0001 
The simple linear regression model is:

$$\hat{Y} = 108.28 + 0.67 \, \text{BMI}$$

where $\hat{Y}$ is the predicted systolic blood pressure.
Suppose we now want to assess whether age (a continuous variable, measured in years), male gender (yes/no), and treatment for hypertension (yes/no) are potential confounders, and if so, appropriately account for these using multiple linear regression analysis. For analytic purposes, treatment for hypertension is coded as 1=yes and 0=no. Gender is coded as 1=male and 0=female. A multiple regression analysis reveals the following:
Independent Variable  Regression Coefficient  T  P-value 
Intercept  68.15  26.33  0.0001 
BMI  0.58  10.30  0.0001 
Age  0.65  20.22  0.0001 
Male gender  0.94  1.58  0.1133 
Treatment for hypertension  6.44  9.74  0.0001 
The multiple regression model is:

$$\hat{Y} = 68.15 + 0.58\,\text{BMI} + 0.65\,\text{Age} + 0.94\,\text{Male} + 6.44\,\text{Treatment}$$
Notice that the association between BMI and systolic blood pressure is smaller (0.58 versus 0.67) after adjustment for age, gender, and treatment for hypertension. BMI remains statistically significantly associated with systolic blood pressure (p=0.0001), but the magnitude of the association is lower after adjustment. The regression coefficient decreases by 13% (relative to the crude estimate: 0.09/0.67 ≈ 13%).
Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between BMI and systolic blood pressure is explained by age, gender and treatment for hypertension.
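The informal rule can be checked directly. A minimal sketch using the coefficients reported above, with the change expressed relative to the crude (unadjusted) coefficient:

```python
# Crude (simple-model) and adjusted (multiple-model) coefficients for BMI,
# taken from the two tables above
crude = 0.67
adjusted = 0.58

# Percent change, expressed relative to the crude estimate
pct_change = abs(crude - adjusted) / crude * 100
exceeds_threshold = pct_change > 10   # the informal 10% screening rule

print(f"{pct_change:.1f}%")   # prints 13.4%
```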
This also suggests a useful way of identifying confounding. Typically, we try to establish the association between a primary risk factor and a given outcome after adjusting for one or more other risk factors. One useful strategy is to use multiple regression models to examine the association between the primary risk factor and the outcome before and after including possible confounding factors. If the inclusion of a possible confounding variable in the model causes the association between the primary risk factor and the outcome to change by 10% or more, then the additional variable is a confounder.
Assessing only the p-values suggests that BMI, age, and treatment for hypertension are equally statistically significant. The magnitude of the t statistics provides a means to judge the relative importance of the independent variables. In this example, age is the most significant independent variable, followed by BMI, treatment for hypertension, and then male gender. In fact, male gender does not reach statistical significance (p=0.1133) in the multiple regression model.
Some investigators argue that regardless of whether an important variable such as gender reaches statistical significance it should be retained in the model. Other investigators only retain variables that are statistically significant.
This is yet another example of the complexity involved in multivariable modeling. The multiple regression model produces an estimate of the association between BMI and systolic blood pressure that accounts for differences in systolic blood pressure due to age, gender and treatment for hypertension.
A one unit increase in BMI is associated with a 0.58 unit increase in systolic blood pressure holding age, gender and treatment for hypertension constant. Each additional year of age is associated with a 0.65 unit increase in systolic blood pressure, holding BMI, gender and treatment for hypertension constant.
Men have higher systolic blood pressures, by approximately 0.94 units, holding BMI, age, and treatment for hypertension constant, and persons on treatment for hypertension have higher systolic blood pressures, by approximately 6.44 units, holding BMI, age, and gender constant. The multiple regression equation can be used to estimate systolic blood pressures as a function of a participant's BMI, age, gender, and treatment for hypertension status. For example, we can estimate the blood pressure of a 50-year-old male with a BMI of 25 who is not on treatment for hypertension as follows:

$$\hat{Y} = 68.15 + 0.58(25) + 0.65(50) + 0.94(1) + 6.44(0) = 116.09$$

We can estimate the blood pressure of a 50-year-old female with a BMI of 25 who is on treatment for hypertension as follows:

$$\hat{Y} = 68.15 + 0.58(25) + 0.65(50) + 0.94(0) + 6.44(1) = 121.59$$
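These two predictions follow directly from the fitted equation. A short sketch, with the coefficients taken from the multiple regression table above (the function name is ours, not the module's):

```python
def predict_sbp(bmi, age, male, on_treatment):
    """Predicted systolic blood pressure from the fitted multiple regression."""
    return 68.15 + 0.58 * bmi + 0.65 * age + 0.94 * male + 6.44 * on_treatment

# 50-year-old male, BMI 25, not on treatment for hypertension
male_pred = predict_sbp(bmi=25, age=50, male=1, on_treatment=0)

# 50-year-old female, BMI 25, on treatment for hypertension
female_pred = predict_sbp(bmi=25, age=50, male=0, on_treatment=1)

print(round(male_pred, 2), round(female_pred, 2))   # prints 116.09 121.59
```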
On page 4 of this module we considered data from a clinical trial designed to evaluate the efficacy of a new drug to increase HDL cholesterol. One hundred patients enrolled in the study and were randomized to receive either the new drug or a placebo. The investigators were at first disappointed to find very little difference in the mean HDL cholesterol levels of treated and untreated subjects.
 Sample Size  Mean HDL  Standard Deviation of HDL 
New Drug  50  40.16  4.46 
Placebo  50  39.21  3.91 
However, when they analyzed the data separately in men and women, they found evidence of an effect in men, but not in women. We noted that when the magnitude of association differs at different levels of another variable (in this case gender), it suggests that effect modification is present.

Women

 Sample Size  Mean HDL  Standard Deviation of HDL 
New Drug  40  38.88  3.97 
Placebo  41  39.24  4.21 

Men

 Sample Size  Mean HDL  Standard Deviation of HDL 
New Drug  10  45.25  1.89 
Placebo  9  39.06  2.22 
Multiple regression analysis can be used to assess effect modification. This is done by estimating a multiple regression equation relating the outcome of interest (Y) to independent variables representing the treatment assignment, sex, and the product of the two (called the treatment-by-sex interaction variable). For the analysis, we let T = the treatment assignment (1=new drug, 0=placebo), M = male gender (1=yes, 0=no), and TM (i.e., T * M) the product of treatment and male gender. In this case, the multiple regression analysis revealed the following:
Independent Variable  Regression Coefficient  T  P-value 
Intercept  39.24  65.89  0.0001 
T (Treatment)  -0.36  -0.43  0.6711 
M (Male Gender)  -0.18  -0.13  0.8991 
TM (Treatment x Male Gender)  6.55  3.37  0.0011 
The multiple regression model is:

$$\hat{Y} = 39.24 - 0.36\,T - 0.18\,M + 6.55\,(T \times M)$$
The details of the test are not shown here, but note in the table above that in this model, the regression coefficient associated with the interaction term, $b_3$, is statistically significant (i.e., $H_0: b_3 = 0$ versus $H_1: b_3 \neq 0$). The fact that this is statistically significant indicates that the association between treatment and outcome differs by sex.
The model shown above can be used to estimate the mean HDL levels for men and women who are assigned to the new medication and to the placebo. In order to use the model to generate these estimates, we must recall the coding scheme (i.e., T = 1 indicates new drug, T=0 indicates placebo, M=1 indicates male sex and M=0 indicates female sex).
The expected or predicted HDL for men (M=1) assigned to the new drug (T=1) can be estimated as follows:

$$\hat{Y} = 39.24 - 0.36(1) - 0.18(1) + 6.55(1 \times 1) = 45.25$$

The expected HDL for men (M=1) assigned to the placebo (T=0) is:

$$\hat{Y} = 39.24 - 0.36(0) - 0.18(1) + 6.55(0 \times 1) = 39.06$$

Similarly, the expected HDL for women (M=0) assigned to the new drug (T=1) is:

$$\hat{Y} = 39.24 - 0.36(1) - 0.18(0) + 6.55(1 \times 0) = 38.88$$

The expected HDL for women (M=0) assigned to the placebo (T=0) is:

$$\hat{Y} = 39.24 - 0.36(0) - 0.18(0) + 6.55(0 \times 0) = 39.24$$
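The four expected values can be reproduced in a few lines. A sketch using $b_0 = 39.24$, $b_1 = -0.36$, $b_2 = -0.18$, $b_3 = 6.55$ (signs chosen so that the predictions reproduce the stratified means):

```python
def predict_hdl(t, m):
    """Expected HDL from the interaction model: b0 + b1*T + b2*M + b3*T*M."""
    return 39.24 - 0.36 * t - 0.18 * m + 6.55 * t * m

men_drug = predict_hdl(t=1, m=1)       # 45.25
men_placebo = predict_hdl(t=0, m=1)    # 39.06
women_drug = predict_hdl(t=1, m=0)     # 38.88
women_placebo = predict_hdl(t=0, m=0)  # 39.24
```

Note how the interaction coefficient only enters the prediction when both T = 1 and M = 1, which is exactly why the treatment effect differs between the sexes.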
Notice that the expected HDL levels for men and women on the new drug and on placebo are identical to the means shown in the table summarizing the stratified analysis. Because there is effect modification, separate simple linear regression models are estimated to assess the treatment effect in men and women:
In men:

 Regression Coefficient  T  P-value 
Intercept  39.08  57.09  0.0001 
T (Treatment)  6.19  6.56  0.0001 

In women:

 Regression Coefficient  T  P-value 
Intercept  39.24  61.36  0.0001 
T (Treatment)  -0.36  -0.40  0.6927 
The regression models are:

In men: $\hat{Y} = 39.08 + 6.19\,T$

In women: $\hat{Y} = 39.24 - 0.36\,T$

In men, the regression coefficient associated with treatment ($b_1 = 6.19$) is statistically significant (details not shown), but in women, the regression coefficient associated with treatment ($b_1 = -0.36$) is not statistically significant (details not shown).
Multiple linear regression analysis is a widely applied technique. In this section we showed how it can be used to assess and account for confounding and to assess effect modification. The techniques we described can be extended to adjust for several confounders simultaneously and to investigate more complex effect modification (e.g., three-way statistical interactions).

There is an important distinction between confounding and effect modification. Confounding is a distortion of an estimated association caused by an unequal distribution of another risk factor. When there is confounding, we would like to account for it (or adjust for it) in order to estimate the association without distortion. In contrast, effect modification is a biological phenomenon in which the magnitude of association differs at different levels of another factor, e.g., a drug that has an effect in men but not in women. In the example presented above, it would be inappropriate to pool the results in men and women. Instead, the goal should be to describe the effect modification and report the different effects separately.
There are many other applications of multiple regression analysis. A popular application is to assess the relationships between several predictor variables simultaneously, and a single, continuous outcome. For example, it may be of interest to determine which predictors, in a relatively large set of candidate predictors, are most important or most strongly associated with an outcome. It is always important in statistical analysis, particularly in the multivariable arena, that statistical modeling is guided by biologically plausible associations.
Independent variables in regression models can be continuous or dichotomous. Regression models can also accommodate categorical independent variables. For example, it might be of interest to assess whether there is a difference in total cholesterol by race/ethnicity. The module on Hypothesis Testing presented analysis of variance as one way of testing for differences in means of a continuous outcome among several comparison groups. Regression analysis can also be used. However, the investigator must create indicator variables to represent the different comparison groups (e.g., different racial/ethnic groups). The set of indicator variables (also called dummy variables) is considered in the multiple regression model simultaneously as a set of independent variables.

For example, suppose that participants indicate which of the following best represents their race/ethnicity: White, Black or African American, American Indian or Alaskan Native, Asian, Native Hawaiian or Pacific Islander, or Other Race. This categorical variable has six response options. To consider race/ethnicity as a predictor in a regression model, we create five indicator variables (one less than the total number of response options) to represent the six different groups.

To create the set of indicators, or set of dummy variables, we first decide on a reference group or category. In this example, the reference group is the racial group against which we will compare the other groups. An indicator variable is created for each of the remaining groups and coded 1 for participants who are in that group (e.g., are of the specific race/ethnicity of interest); all others are coded 0. In the multiple regression model, the regression coefficient associated with each of the dummy variables (representing, in this example, each race/ethnicity group) is interpreted as the expected difference in the mean of the outcome variable for that race/ethnicity as compared to the reference group, holding all other predictors constant.
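This coding scheme can be sketched concretely. The race/ethnicity labels follow the example above; the short participant list is made up for illustration, and White is the reference group (so it gets no indicator of its own):

```python
# Five indicator variables for six response options; White is the reference
categories = ["Black or African American", "American Indian or Alaskan Native",
              "Asian", "Native Hawaiian or Pacific Islander", "Other Race"]

participants = ["White", "Asian", "Black or African American", "White"]

def dummy_code(race):
    """0/1 vector of length 5; the reference group (White) is all zeros."""
    return [1 if race == c else 0 for c in categories]

rows = [dummy_code(r) for r in participants]
# rows[0] -> [0, 0, 0, 0, 0]  (White: reference, all indicators 0)
# rows[1] -> [0, 0, 1, 0, 0]  (Asian indicator set to 1)
```

These five columns would then enter the regression model together, and each coefficient is read as a difference in the mean outcome versus the White reference group.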
The example below uses an investigation of risk factors for low birth weight to illustrate this technique as well as the interpretation of the regression coefficients in the model.

An observational study is conducted to investigate risk factors associated with infant birth weight. The study involves 832 pregnant women. Each woman provides demographic and clinical data and is followed through the outcome of pregnancy. At the time of delivery, the infant's birth weight is measured in grams, as is the gestational age in weeks. Birth weights vary widely, ranging from 404 to 5400 grams. The mean birth weight is 3367.83 grams with a standard deviation of 537.21 grams. Investigators wish to determine whether there are differences in birth weight by infant gender, gestational age, mother's age, and mother's race. In the study sample, 421/832 (50.6%) of the infants are male, and the mean gestational age at birth is 39.49 weeks with a standard deviation of 1.81 weeks (range 22-43 weeks). The mean mother's age is 30.83 years with a standard deviation of 5.76 years (range 17-45 years). Approximately 49% of the mothers are white; 41% are Hispanic; 5% are black; and 5% identify themselves as other race. A multiple regression analysis is performed relating infant gender (coded 1=male, 0=female), gestational age in weeks, mother's age in years, and 3 dummy or indicator variables reflecting mother's race. The results are summarized in the table below.
Independent Variable  Regression Coefficient  T  P-value 
Intercept  -3850.92  -11.56  0.0001 
Male infant  174.79  6.06  0.0001 
Gestational age, weeks  179.89  22.35  0.0001 
Mother's age, years  1.38  0.47  0.6361 
Black race  -138.46  -1.93  0.0535 
Hispanic race  13.07  0.37  0.7103 
Other race  68.67  1.05  0.2918 
Many of the predictor variables are statistically significantly associated with birth weight. Male infants are approximately 175 grams heavier than female infants, adjusting for gestational age, mother's age, and mother's race/ethnicity. Gestational age is highly significant (p=0.0001), with each additional gestational week associated with an increase of 179.89 grams in birth weight, holding infant gender, mother's age, and mother's race/ethnicity constant. Mother's age does not reach statistical significance (p=0.6361). Mother's race is modeled as a set of three dummy or indicator variables. In this analysis, white race is the reference group. Infants born to black mothers have lower birth weights, by approximately 140 grams (as compared to infants born to white mothers), adjusting for gestational age, infant gender, and mother's age. This difference is marginally significant (p=0.0535). There are no statistically significant differences in birth weight in infants born to Hispanic versus white mothers, or to women who identify themselves as other race as compared to white mothers.
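The coefficients can be turned into predictions. A hedged sketch: the profile below is hypothetical; the intercept is taken as negative (-3850.92), which is what the reported mean birth weight implies for a model anchored at gestational age 0, and the Black-race coefficient as -138.46 per the interpretation above. The Hispanic and other-race terms are omitted from this sketch:

```python
def predict_birth_weight(male, gest_weeks, mom_age, black=0):
    """Predicted birth weight (grams) from a subset of the fitted model.
    black is a 0/1 indicator; white mothers are the reference group.
    Hispanic and other-race terms are omitted from this sketch."""
    return (-3850.92 + 174.79 * male + 179.89 * gest_weeks
            + 1.38 * mom_age - 138.46 * black)

# Hypothetical profile: male infant, 40 weeks' gestation, 30-year-old mother
pred_white = predict_birth_weight(male=1, gest_weeks=40, mom_age=30)
pred_black = predict_birth_weight(male=1, gest_weeks=40, mom_age=30, black=1)
diff = pred_white - pred_black   # adjusted Black-white difference, ~138 g
```

The difference between the two predictions is exactly the dummy-variable coefficient, which is the "expected difference versus the reference group, holding all other predictors constant" interpretation in action.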
Null and Alternative hypothesis for multiple linear regression
I have 1 dependent variable and 3 independent variables.
I run multiple regression, and find that the p-value for one of the independent variables is higher than 0.05 (95% is my confidence level).

I take that variable out and run it again. Both remaining independent variables have $p$-values less than 0.05, so I conclude I have my model.
Am I correct in thinking that initially, my null hypothesis is

$$H_0: \beta_1 = \beta_2 = \dots = \beta_{k-1} = 0$$

and that the alternative hypothesis is

$$H_1: \textrm{at least one } \beta \neq 0 \textrm{ whilst } p < 0.05$$
And that after the first regression, I do not reject, as one variable does not meet my confidence level needs...
So I run it again, and then reject the null as all $p$values are significant?
Is what I have written accurate?
Edit: Thanks to Bob Jansen for improving the aesthetics of this post.
2 Answers
The hypothesis $H_0: β_1=β_2=\dots =β_{k−1}=0$ is normally tested by the $F$-test for the regression.
You are carrying out 3 independent tests of your coefficients. (Do you also have a constant in the regression, or is the constant one of your three variables?) If you do three independent tests at a 5% level, you have a probability of over 14% of finding at least one of the coefficients significant at the 5% level even if all coefficients are truly zero (the null hypothesis). This is often ignored, but be careful. Even so, if a coefficient is close to significant I would think about the underlying theory before coming to a decision.
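The "over 14%" figure follows from the complement rule for three independent tests:

```python
alpha = 0.05
k = 3   # number of independent coefficient tests

# Probability of at least one Type I error across the k tests
familywise = 1 - (1 - alpha) ** k
print(f"{familywise:.3f}")   # prints 0.143, i.e., just over 14%
```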
If you add dummies you will have a beta for each dummy
 $\begingroup$ Thanks for your response. I don't have a constant, all of my pvalues are very significant (the least is a dummy variable at 0.039). What would my null hypothesis be? My knowledge is that I'm seeking pvalues because that'd give me my model. I don't understand the technicalities of it and want to learn it :) $\endgroup$ – Harry Commented Jan 7, 2015 at 22:36
 $\begingroup$ I think you meant to say 14% of committing a type one error (probability of 0.14 of finding at least one of the coefficient significant when there true value is actually the null hypothesis value) $\endgroup$ – Kamster Commented Jan 8, 2015 at 0:36
 $\begingroup$ @Kamster Thanks. You are correct and I have amended my answer. $\endgroup$ – user1483 Commented Jan 21, 2015 at 21:26
These are independent variables so the hypothesis applies to each parameter independently.
 $\begingroup$ +1: Yes, you are right  but the rest of it should be fine $\endgroup$ – vonjd Commented Jan 2, 2015 at 21:18
 $\begingroup$ sorry, could you clarify? How do I change the equation so it applies to each parameter independently? And also, what is the effect of adding 3 dummy variables. Is it simply 2 more betas? Or do they require their own symbol $\endgroup$ – Harry Commented Jan 4, 2015 at 0:32
 $\begingroup$ It just means that you have an H_0 and an H_1 for every parameter. $\endgroup$ – vonjd Commented Jan 4, 2015 at 11:33
 $\begingroup$ Ok I see. Do you know the procedure for dummy variables? Are they just additional beta? Or is it more accurate to refer to them as delta? $\endgroup$ – Harry Commented Jan 4, 2015 at 11:43
 $\begingroup$ Maybe I have this wrong but isn't it true if you remain your individual significance levels at 0.05 that the probability of type one error (ie the probability that reject null hypothesis when it is actually true; significance level) will be greater than or equal 0.14 $\endgroup$ – Kamster Commented Jan 8, 2015 at 0:43
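As vonjd's comments say, each parameter gets its own $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$, tested with the coefficient's t statistic. A minimal sketch (the `b` and `se` values are made up for illustration, and a normal approximation stands in for the t distribution, which is reasonable for large samples):

```python
from statistics import NormalDist

def coef_test(b, se):
    """Two-sided test of H0: beta_j = 0 for one coefficient,
    using the normal approximation to the t distribution."""
    z = b / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Illustrative coefficient / standard-error pairs (not from any real fit)
z1, p1 = coef_test(b=0.67, se=0.06)   # strongly significant
z2, p2 = coef_test(b=0.94, se=0.59)   # not significant at the 5% level
```

Each dummy variable added to the model gets its own coefficient and is tested the same way; there is no need for a separate symbol.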