
Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

  • Assumptions of multiple linear regression
  • How to perform a multiple linear regression
  • Interpreting the results
  • Presenting the results
  • Other interesting articles
  • Frequently asked questions about multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

\(y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon\)

  • \(y\) = the predicted value of the dependent variable
  • \(\beta_0\) = the y-intercept (the value of \(y\) when all other parameters are set to 0)
  • \(\beta_1 X_1\) = the regression coefficient (\(\beta_1\)) of the first independent variable (\(X_1\))
  • … = do the same for however many independent variables you are testing
  • \(\beta_n X_n\) = the regression coefficient of the last independent variable
  • \(\epsilon\) = model error (how much variation there is in our estimate of \(y\))

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:
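The code block itself is not shown here; a minimal sketch of the call, assuming the dataset's columns are named heart.disease, biking, and smoking:

```r
# Fit heart disease rate as a function of biking and smoking rates
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
```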

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease, using the linear model function lm().

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:
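For instance, continuing with the model object named in the sketch above:

```r
# Print the coefficient table, residual summary, and fit statistics
summary(heart.disease.lm)
```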

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and have similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

The most important things to note in this output table are the next two rows – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.


If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the “Cite this Scribbr article” button to automatically add the citation to our free Citation Generator.

Bevans, R. (2023, June 22). Multiple Linear Regression | A Quick Guide (Examples). Scribbr. Retrieved August 13, 2024, from https://www.scribbr.com/statistics/multiple-linear-regression/



Multiple linear regression

RStudio: RMarkdown , Quarto

Case studies:

A. Effect of light on meadowfoam flowering

B. Studying the brain sizes of mammals

Specifying the model.

Fitting the model: least squares.

Interpretation of the coefficients.

\(F\) -statistic revisited

Matrix approach to linear regression.

Investigating the design matrix

Case study A:

A data.frame: 6 × 3

   Flowers  Time  Intensity
1     62.3     1        150
2     77.4     1        150
3     55.3     1        300
4     54.2     1        300
5     49.6     1        450
6     61.9     1        450

Researchers manipulate timing and intensity of light to investigate effect on number of flowers.

Case study B:

A data.frame: 6 × 4

                    Brain     Body  Gestation  Litter
Aardvark              9.6     2.20         31     5.0
Acouchis              9.9     0.78         98     1.2
African elephant   4480.0  2800.00        655     1.0
Agoutis              20.3     2.80        104     1.3
Axis deer           219.0    89.00        218     1.0
Badger               53.0     6.00         60     2.2

How are litter size and gestation period associated with brain size in mammals?

A model for the brains data

The figure depicts our model. To generate \(y_i\): first fix \(X=(X_1,\dots,X_p)\), form the mean \(\beta_0 + \sum_j \beta_j X_{j}\), then add an error \(\epsilon\).

A model for brains

Multiple linear regression model

A matrix: 4 × 4

                  Estimate    Std. Error     t value      Pr(>|t|)
(Intercept)   -225.2921328   83.05875218   -2.712443  7.971598e-03
Body             0.9858781    0.09428263   10.456624  2.517636e-17
Gestation        1.8087434    0.35444885    5.102974  1.790007e-06
Litter          27.6486394   17.41429351    1.587698  1.157857e-01

Another model for brains

A matrix: 3 × 4

                Estimate    Std. Error     t value      Pr(>|t|)
(Intercept)   145.028380   45.51865579    3.186131  1.963384e-03
Body            1.301621    0.08014619   16.240583  6.755835e-29
Litter        -29.022067   15.11200472   -1.920464  5.786311e-02

Fitting a multiple linear regression model

Just as in simple linear regression, the model is fit by minimizing the sum of squared errors

\[SSE(\beta) = \sum_{i=1}^n \Big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\Big)^2.\]

The minimizers \(\widehat{\beta} = (\widehat{\beta}_0, \dots, \widehat{\beta}_p)\) are the “least squares estimates”; they are also normally distributed, as in simple linear regression.

Estimating \(\sigma^2\)

As in simple regression, we use

\[\widehat{\sigma}^2 = \frac{SSE}{n-p-1}, \qquad \frac{(n-p-1)\,\widehat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p-1},\]

independent of \(\widehat{\beta}\).

Why \(\chi^2_{n-p-1}\)? Typically, the degrees of freedom in the estimate of \(\sigma^2\) is \(n - \#\text{ of parameters in the regression function}\).

Interpretation of \(\beta_j\) in brains.lm

Take \(\beta_1=\beta_{\tt Body}\) for example. This is the amount by which average Brain weight increases for a one-kg increase in Body, keeping everything else constant.

We refer to this as the effect of Body allowing for, or controlling for, the other variables.

Let’s take the Beaked whale, artificially add a kg to its Body, and compute the predicted Brain weight.
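The code for that step is not reproduced here; a minimal sketch, assuming the fitted model is called brains.lm and the data frame is called brains with species as row names (both names are assumptions):

```r
# Prediction for the Beaked whale at its observed covariates
whale <- brains["Beaked whale", ]
p0 <- predict(brains.lm, newdata = whale)

# Same animal with one extra kg of Body; the difference equals the Body coefficient
whale_plus <- transform(whale, Body = Body + 1)
p1 <- predict(brains.lm, newdata = whale_plus)
p1 - p0   # should match coef(brains.lm)["Body"]
```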

Same example in simpler.lm

To emphasize that the parameters depend on the other variables in the model, let’s redo the calculation with the simpler.lm model.

\(R^2\) for multiple regression

\(R^2\) is now called the multiple correlation coefficient of the model, or the coefficient of multiple determination.

The sums of squares and \(R^2\) are defined analogously to those in simple linear regression.

Computing \(R^2\) by hand
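The original code chunk is omitted; a sketch of the by-hand computation, assuming brains.lm was fit as above:

```r
# R^2 = 1 - SSE/SST, computed directly from the fitted model
Y   <- model.response(model.frame(brains.lm))
SSE <- sum(resid(brains.lm)^2)
SST <- sum((Y - mean(Y))^2)
1 - SSE / SST     # should match summary(brains.lm)$r.squared
```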

Adjusted \(R^2\)

As we add more and more variables to the model – even random ones – \(R^2\) will increase toward 1.

Adjusted \(R^2\) tries to take this into account by replacing sums of squares with mean squares:

\[R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}.\]

Computing \(R^2_a\) by hand
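Again the original code is omitted; a self-contained sketch under the same assumption that brains.lm exists:

```r
# Adjusted R^2 replaces sums of squares with mean squares
Y   <- model.response(model.frame(brains.lm))
SSE <- sum(resid(brains.lm)^2)
SST <- sum((Y - mean(Y))^2)
n   <- length(Y)
p   <- length(coef(brains.lm)) - 1   # number of slopes
1 - (SSE / (n - p - 1)) / (SST / (n - 1))   # should match summary(brains.lm)$adj.r.squared
```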

\(F\)-test in summary(brains.lm)

Full model:

Brain ~ Body + Gestation + Litter

Reduced model:

Brain ~ 1 (intercept only)

Right triangle again

Sides of the triangle: \(df_R-df_F=3\) , \(df_F=92\)

Hypotenuse: \(df_R=95\)

Matrix formulation

\[Y_{n \times 1} = X_{n \times (p+1)}\,\beta_{(p+1) \times 1} + \varepsilon_{n \times 1}\]

\(X\) is called the design matrix of the model.

\(\varepsilon \sim N(0, \sigma^2 I_{n \times n})\) is multivariate normal.

\(SSE\) in matrix form

\[SSE(\beta) = (Y - X\beta)^T(Y - X\beta) = \|Y - X\beta\|^2\]

Design matrix

The design matrix is the \(n \times (p+1)\) matrix with a leading column of 1s (for the intercept) and one column per predictor, with entries \(X_{ij}\).

A matrix: 6 × 4

 (Intercept)     Body  Gestation  Litter
           1     2.20         31     5.0
           1     0.78         98     1.2
           1  2800.00        655     1.0
           1     2.80        104     1.3
           1    89.00        218     1.0
           1     6.00         60     2.2

The matrix X is the same as the one formed by R:

A matrix: 6 × 4

                   (Intercept)     Body  Gestation  Litter
Aardvark                     1     2.20         31     5.0
Acouchis                     1     0.78         98     1.2
African elephant             1  2800.00        655     1.0
Agoutis                      1     2.80        104     1.3
Axis deer                    1    89.00        218     1.0
Badger                       1     6.00         60     2.2

Math aside: least squares solution

Normal equations:

\[X^TX\widehat{\beta} = X^TY\]

Equivalent to (when \(X^TX\) is invertible):

\[\widehat{\beta} = (X^TX)^{-1}X^TY\]

Distribution: \(\widehat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}).\)

Math aside: multivariate normal

To obtain the distribution of \(\hat{\beta}\) we used the following fact about the multivariate normal.

Suppose \(Z \sim N(\mu,\Sigma)\). Then, for any fixed matrix \(A\),

\[AZ \sim N(A\mu, A\Sigma A^T).\]

Math aside: how did we derive the distribution of \(\hat{\beta}\)?

Above, we saw that \(\hat{\beta}\) is equal to a matrix times \(Y\). The matrix form of our model is \(Y \sim N(X\beta, \sigma^2 I)\), so applying the fact above with \(A = (X^TX)^{-1}X^T\) gives \(\widehat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1})\).

Math aside: checking the equation
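The original check is not shown; a short sketch comparing the normal-equations solution against lm(), assuming brains.lm as above:

```r
# Rebuild the design matrix and solve the normal equations directly
X    <- model.matrix(brains.lm)
Y    <- model.response(model.frame(brains.lm))
beta <- solve(t(X) %*% X, t(X) %*% Y)
cbind(beta, coef(brains.lm))   # the two columns should agree
```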

Categorical variables

Recall case study A: the flower experiment.

A matrix: 2 × 4

               Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)    50.05833    3.615510  13.845440  2.433418e-12
factor(Time)2  12.15833    5.113104   2.377877  2.652620e-02

Design matrix with categorical variables

R has used a binary column for factor(Time).

A matrix: 24 × 2

    (Intercept)  factor(Time)2
1             1              0
2             1              0
3             1              0
4             1              0
5             1              0
6             1              0
7             1              0
8             1              0
9             1              0
10            1              0
11            1              0
12            1              0
13            1              1
14            1              1
15            1              1
16            1              1
17            1              1
18            1              1
19            1              1
20            1              1
21            1              1
22            1              1
23            1              1
24            1              1

How categorical variables are encoded

We can change the columns in the design matrix:

A matrix: 24 × 2

    factor(Time)1  factor(Time)2
1               1              0
2               1              0
3               1              0
4               1              0
5               1              0
6               1              0
7               1              0
8               1              0
9               1              0
10              1              0
11              1              0
12              1              0
13              0              1
14              0              1
15              0              1
16              0              1
17              0              1
18              0              1
19              0              1
20              0              1
21              0              1
22              0              1
23              0              1
24              0              1

By default, R discards one of the columns. Why?

A matrix: 24 × 6

    (Intercept)  factor(Intensity)300  factor(Intensity)450  factor(Intensity)600  factor(Intensity)750  factor(Intensity)900
1             1                     0                     0                     0                     0                     0
2             1                     0                     0                     0                     0                     0
3             1                     1                     0                     0                     0                     0
4             1                     1                     0                     0                     0                     0
5             1                     0                     1                     0                     0                     0
6             1                     0                     1                     0                     0                     0
7             1                     0                     0                     1                     0                     0
8             1                     0                     0                     1                     0                     0
9             1                     0                     0                     0                     1                     0
10            1                     0                     0                     0                     1                     0
11            1                     0                     0                     0                     0                     1
12            1                     0                     0                     0                     0                     1
13            1                     0                     0                     0                     0                     0
14            1                     0                     0                     0                     0                     0
15            1                     1                     0                     0                     0                     0
16            1                     1                     0                     0                     0                     0
17            1                     0                     1                     0                     0                     0
18            1                     0                     1                     0                     0                     0
19            1                     0                     0                     1                     0                     0
20            1                     0                     0                     1                     0                     0
21            1                     0                     0                     0                     1                     0
22            1                     0                     0                     0                     1                     0
23            1                     0                     0                     0                     0                     1
24            1                     0                     0                     0                     0                     1
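Design matrices like these are what model.matrix() produces; a sketch, assuming the meadowfoam data frame from case study A is called flowers (the name is an assumption, the column names follow the tables above):

```r
# Treatment coding: Intensity 150 and Time 1 become the reference levels
model.matrix(~ factor(Intensity) + factor(Time), data = flowers)

# Dropping the intercept keeps an indicator for every level of the first factor
model.matrix(~ factor(Time) - 1, data = flowers)
```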

Some additional models

~ Intensity

A matrix: 2 × 4

                Estimate    Std. Error    t value      Pr(>|t|)
(Intercept)  77.38500000   4.161186164  18.596861  6.059011e-15
Intensity    -0.04047143   0.007123293  -5.681562  1.029503e-05

~ Intensity + factor(Time)

A matrix: 3 × 4

                   Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)     71.30583333  3.27377202  21.780940  6.767274e-16
Intensity       -0.04047143  0.00513237  -7.885525  1.036787e-07
factor(Time)2   12.15833333  2.62955696   4.623719  1.463776e-04

~ factor(Intensity) + factor(Time)

A matrix: 7 × 4

                       Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)            67.19583    3.628693  18.517916  1.049173e-12
factor(Intensity)300   -9.12500    4.751074  -1.920618  7.171506e-02
factor(Intensity)450  -13.37500    4.751074  -2.815153  1.191898e-02
factor(Intensity)600  -23.22500    4.751074  -4.888368  1.384982e-04
factor(Intensity)750  -27.75000    4.751074  -5.840784  1.965699e-05
factor(Intensity)900  -29.35000    4.751074  -6.177550  1.012760e-05
factor(Time)2          12.15833    2.743034   4.432440  3.648933e-04

Interactions

Suppose we believe that Flowers varies linearly with Intensity but the slope depends on Time.

We’d need two parameters for Intensity:

A matrix: 4 × 4

                             Estimate   Std. Error     t value      Pr(>|t|)
(Intercept)              71.623333333  4.343304599  16.4905158  4.143572e-13
Intensity                -0.041076190  0.007435051  -5.5246682  2.083392e-05
factor(Time)2            11.523333333  6.142360270   1.8760432  7.532164e-02
Intensity:factor(Time)2   0.001209524  0.010514750   0.1150312  9.095675e-01

What is the regression line when Time==1 ? And Time==2 ?

Different models across groups

Set \(\beta_1=\beta_{\tt Intensity}\), \(\beta_2=\beta_{\tt Time2}\), \(\beta_3=\beta_{\tt Time2:Intensity}\).

In the Time==1 group, a one-unit change in Intensity leads to \(\beta_1\) units of change in Flowers.

In the Time==2 group, a one-unit change in Intensity leads to \(\beta_1 + \beta_3\) units of change in Flowers.

Test \(H_0\): the slope is the same within each group.

Visualizing interaction


Multiple Linear Regression

Summarizing linear relationships in high dimensions

In the last lecture we built our first linear model: an equation of a line drawn through the scatter plot.

\[ \hat{y} = 96.2 - 0.89 x \]

While the idea is simple enough, there is a sea of terminology that floats around this method. A linear model is any model that explains the \(y\) , often called the response variable or dependent variable , as a linear function of the \(x\) , often called the explanatory variable or independent variable . There are many different methods that can be used to decide which line to draw through a scatter plot. The most commonly-used approach is called the method of least squares , a method we’ll look at closely when we turn to prediction. If we think more generally, a linear model fit by least squares is one example of a regression model , which refers to any model (linear or non-linear) used to explain a numerical response variable.

The reason for all of this jargon isn’t purely to infuriate students of statistics. Linear models are one of the most widely used statistical tools; you can find them in use in diverse fields like biology, business, and political science. Each field tends to adapt the tool and the language around them to their specific needs.

A reality of practicing statistics in these fields, however, is that most data sets are more complex than the example we saw in the last notes, where there were only two variables. Most phenomena have many different variables that relate to one another in complex ways. We need a more powerful tool to help guide us into these higher dimensions. A good starting point is to expand simple linear regression to include more than one explanatory variable!

To fit a multiple linear regression model using least squares in R, you can use the lm() function, with each additional explanatory variable separated by a + .

Multiple linear regression is powerful because there is no limit to the number of variables that we can include in the model. While Hans Rosling was able to fit 5 variables into a single graphic, what if we had 10 variables? Multiple linear regression allows us to understand high-dimensional linear relationships beyond what's possible using our visual system.

In today’s notes, we’ll discuss two specific examples where a multiple linear regression model might be applicable:

A scenario involving two numerical variables and one categorical variable

A scenario involving three numerical variables.

Two numerical, one categorical

The Zagat Guide was for many years the authoritative source of restaurant reviews. Their approach was very different from Yelp!. Zagat’s review of a restaurant was compiled by a professional restaurant reviewer who would visit a restaurant and rate it on a 30 point scale across three categories: food, decor, and service. They would also note the average price of a meal and write up a narrative review.

Here’s an example of a review from an Italian restaurant called Marea in New York City.

A picture of a Zagat review of the Italian restaurant Marea in New York City, with scores on food, decor, and service along with quotations from the narrative review.

In addition to learning about the food scores (27), and getting some helpful tips (“bring your bank manager”), we see they’ve also recorded a few more variables on this restaurant: the phone number and website, their opening hours, and the neighborhood (Midtown).

You might ask:

What is the relationship between the food quality and the price of a meal at an Italian restaurant? Are these two variables positively correlated, or is the best Italian meal in New York a simple and inexpensive slice of pizza?

To answer these questions, we need more data. The data frame below contains Zagat reviews from 168 Italian restaurants in Manhattan.

Applying the taxonomy of data, we see that for each restaurant we have recorded the price of an average meal, the food, decor, and service scores (all numerical variables) as well as a note regarding geography (a categorical nominal variable). geo captures whether the restaurant is located on the east side or the west side of Manhattan 1 .

Let’s summarize the relationship between food quality, price, and one categorical variable - geography - using a colored scatter plot.


It looks like if you want a very tasty meal, you’ll have to pay for it. There is a moderately strong, positive, and linear relationship between food quality and price. This plot, however, has a third variable in it: geography. The restaurants from the east and west sides are fairly well mixed, but to my eye the points on the west side might be a tad bit lower on price than the points from the east side. I could numerically summarize the relationship between these three variables by hand-drawing two lines, one for each neighborhood.


For a more systematic approach to drawing lines through the center of scatter plots, we need to return to the method of least squares, which is done in R using lm(). In this linear model, we wish to explain the \(y\) variable as a function of two explanatory variables, food and geo, both found in the zagat data frame. We can express that relationship using the formula notation.
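The chunk itself is not reproduced in these notes; a sketch of the call, assuming the zagat data frame has columns named price, food, and geo:

```r
# Explain price as a function of food score and side of Manhattan
m1 <- lm(price ~ food + geo, data = zagat)
coef(m1)
```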

It worked . . . or did it? If we extend our reasoning from the last notes, we should write this model as

\[\widehat{price} = -15.97 + 2.87 \times food - 1.45 \times geo\]

What does it mean to put a categorical variable, geo, into a linear model? And how do these three numbers translate into the two lines shown above?

Indicator variables

When working with linear models like the one above, the value of the explanatory variable, \(geowest\) , is multiplied by a slope, 1.45. According to the Taxonomy of Data, arithmetic functions like multiplication are only defined for numerical variables. While that would seem to rule out categorical variables for use as explanatory variables, statisticians have come up with a clever work-around: the indicator variable.

The categorical variable geo can be converted into an indicator variable by shifting the question from “Which side of Manhattan are you on?” to “Are you on the west side of Manhattan?” This is a mutate step.
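A minimal sketch of that mutate step (the variable name geowest follows the text; the dplyr pipeline itself is an assumption):

```r
library(dplyr)

# Turn the two-level categorical variable into a TRUE/FALSE (1/0) indicator
zagat <- zagat |>
  mutate(geowest = (geo == "west"))
```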

The new indicator variable geowest is a logical variable, so it has a dual representation as TRUE / FALSE as well as 1/0. Previously, this allowed us to do Boolean algebra. Here, it allows us to include an indicator variable in a linear model.

While you can create indicator variables by hand using mutate , in practice, you will not need to do this. That’s because they are created automatically whenever you put a categorical variable into lm() . Let’s revisit the linear model that we fit above with geowest in the place of geo .

\[\widehat{price} = -15.97 + 2.87 \times food - 1.45 \times geowest\]

To understand the geometry of this model, let’s focus on what the fitted values will be for any restaurant that is on the west side. For those restaurants, the geowest indicator variable will take a value of 1, so if we plug that in and rearrange,

\[\begin{eqnarray} \widehat{price} &= -15.97 + 2.87 \times food - 1.45 \times 1 \\ &= (-15.97 - 1.45) + 2.87 \times food \\ &= -17.42 + 2.87 \times food \end{eqnarray}\]

That is a familiar sight: that is an equation for a line.

Let’s repeat this process for the restaurants on the east side, where the geowest indicator variable will now take a value of 0.

\[\begin{eqnarray} \widehat{price} &= -15.97 + 2.87 \times food - 1.45 \times 0 \\ &= -15.97 + 2.87 \times food \end{eqnarray}\]

That is also the equation for a line.

If you look back and forth between these two equations, you’ll notice that they share the same slope and have different y-intercepts. Geometrically, this means that the output of lm() was describing the equation of two parallel lines :

  • one where geowest is 1 (for restaurants on the west side of town)
  • one where geowest is 0 (for restaurants on the east side of town).

That means we can use the output of lm() to replace my hand-drawn lines with ones that arise from the method of least squares.


Reference levels

One question you still might have: why did R include the indicator variable for the west side of town as opposed to the one for the east side? The answer lies in the type of variable that geo is recorded as in the zagat data frame. If you look closely at the initial output, you will see that geo is currently designated chr, which is short for character. geo is indeed a categorical variable with two levels: east and west.

Like in previous settings, R will determine the “order” of levels in a categorical variable registered as a character by way of the alphabet. This means that east will be tagged first and chosen as the reference level : the level of a categorical variable which does not have an indicator variable in the model. If you would like west to be the reference level, then you would need to reorder the levels using factor() inside of a mutate() so that west comes first. This would change the equation that results from then fitting a linear model with lm() , as you can see below!
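A sketch of that releveling step (again assuming the zagat data frame; the level names follow the text):

```r
library(dplyr)

# Make "west" the reference level, so R builds an indicator for "east" instead
zagat <- zagat |>
  mutate(geo = factor(geo, levels = c("west", "east")))

lm(price ~ food + geo, data = zagat)
```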

Now our equation looks a little bit different!

\[\widehat{price} = -17.43 + 2.87 \times food + 1.46 \times geoeast\]

In general, if you include a categorical variable with \(k\) levels in a regression model, there will be \(k-1\) indicator variables (and thus, coefficients) associated with it in the model: one for each level of the variable except the reference level 2 . Knowing the reference level also helps us interpret indicator variables that are part of the regression equation; we will see this in a moment. For now, let’s move to our second scenario.

Three numerical

While the standard scatter plot allows us to understand the association between two numerical variables like price and food , to understand the relationship between three numerical variables, we will need to build this scatterplot in 3D 3 .


Take a moment to explore this scatter plot 4 . Can you find the name of the restaurant with very bad decor but pretty good food and a price to match? (It’s Gennaro.) What about the restaurant with equally bad decor but rock-bottom prices, which is surprising given that its food quality is actually somewhat respectable? (It’s Lamarca.)

Instead of depicting the relationship between these three variables graphically, let’s do it numerically by fitting a linear model.
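The fitted call is not reproduced here; a sketch under the same assumed zagat column names:

```r
# Two numerical explanatory variables: food and decor
m2 <- lm(price ~ food + decor, data = zagat)
coef(m2)
```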

We can write the corresponding equation of the model as

\[ \widehat{price} = -24.5 + 1.64 \times food + 1.88 \times decor \]

To understand the geometry of this model, we can’t use the trick that we did with indicator variables. decor is a numerical variable just like food , so it takes more values than just 0 and 1.

Indeed this linear model is describing a plane .


If you inspect this plane carefully you’ll realize that the tilt of the plane is not quite the same in every dimension. The tilt in the decor dimension is just a little bit steeper than that in the food dimension, a geometric expression of the fact that the coefficient in front of decor, 1.88, is just a bit higher than the coefficient in front of food, 1.64.

Interpreting coefficients

When moving from simple linear regression, with one explanatory variable, to multiple linear regression, with many, the interpretation of the coefficients becomes trickier but also more insightful.

Mathematically, the coefficient in front of \(food\) , 1.64, can be interpreted a few different ways:

It is the difference that we would expect to see in the response variable, \(price\) , when two Italian restaurants are separated by a food rating of one and they have the same decor rating.

Controlling for \(decor\) , a one point increase in the food rating is associated with a $1.64 increase in the \(price\) .

Similarly for interpreting \(decor\) : controlling for the quality of the food, a one-point increase in \(decor\) is associated with a $1.88 increase in the \(price\) .

This conditional interpretation of the coefficients extends to the first setting we looked at, when one variable is numerical and the other is an indicator. Here is that model:

One might interpret \(food\) like this:

  • For two restaurants both on the same side of Manhattan, a one point increase in food score is associated with a $2.87 increase in the price of a meal.

As for \(geowest\) :

  • For two restaurants with the exact same quality of food, the restaurant on the west side is expected to be $1.45 cheaper than the restaurant on the east side.

We make the comparison to the east side since this level is the reference level according to the linear model shown. This is a useful bit of insight: it gives a sense of what the premium is for being on the east side.

It is also visible in the geometry of the model. When we’re looking at restaurants with the same food quality, we’re looking at a vertical slice of the scatter plot. Here the vertical gray line is indicating restaurants where the food quality gets a score of 18. The difference in expected price of meals on the east side and west side is the vertical distance between the red line and the blue line, which is exactly 1.45. We could draw this vertical line anywhere on the graph and the distance between the red line and the blue will still be exactly 1.45.


We began this unit on Summarizing Data with graphical and numerical summaries of just a single variable: histograms and bar charts, means and standard deviations. In the last set of notes we introduced our first bivariate numerical summaries: the correlation coefficient and the linear model. In these notes, we introduced multiple linear regression, a method that can numerically describe the linear relationships between an unlimited number of variables. The types of variables that can be included in these models are similarly broad. Numerical variables can be included directly, generalizing the geometry of a line into a plane in a higher dimension. Categorical variables can be included using the trick of creating indicator variables: logical variables that take a value of 1 where a particular condition is true. The interpretation of all of the coefficients that result from a multiple regression is challenging but rewarding: it allows us to answer questions about the relationship between two variables after controlling for the values of other variables.

If this felt like a deep dive into a multiple linear regression, don’t worry. Linear models are one of the most commonly used statistical tools, so we’ll be revisiting them throughout the course: investigating their use in making generalizations, causal claims, and predictions.

Fifth Avenue is the wide north-south street that divides Manhattan into an east side and a west side. ↩︎

This is the case for a model including an intercept term; these models will be our focus this semester and are the most commonly used. ↩︎

While ggplot2 is the best package for static statistical graphics, it does not have any interactive functionality. This plot was made using a system called plotly , which can be used both in R and Python. Read more about how it works at https://plotly.com/r/ . ↩︎

This is a screenshot from an interactive 3D scatter plot. We’ll see the interactive plot in class tomorrow. ↩︎

Recall, the degrees of freedom for the t-test is DFE = n – v – 1. There are only 2 explanatory variables left in the model, so the degrees of freedom for the t-tests = 10 – 2 – 1 = 7 .

B. A is not correct because it is possible that a backwards selection process will eliminate all variables. But, remember that we’ll stop eliminating variables once all remaining variables have p-values less than 0.05, which is the case here. Therefore, C is also incorrect.

C. Note that B is not correct – “keeping the number of radios and TV sets the same” is used in the interpretation of the coefficient of newspaper copies and is different than the phrase “after accounting for the effects of the number of radios and number of TV sets in the country.”

False. Whenever all explanatory variables in a model have p-values from the t-test less than 0.05 (or so), we stop the backwards selection process. Such a model would be considered our “final model”.

t6 = (0.0005421 − 0) / 0.0008653 = 0.6265. Some notes: 1) the degrees of freedom for the t-test is DFE = n – v – 1 = 6. As has been mentioned several times, when performing a t-test in regression, the degrees of freedom is ALWAYS DFE. 2) Notice in the output above, the t-statistic for this t-test is given in the row for newspaper copies and under “T” – it is rounded to two decimal places in the output. All the t-statistics (under “T”) in the regression output are calculated by dividing the “Coef” by “SE Coef”.

MSM = 0.16132 and MSE = 0.03477. 0.16132/0.03477 = 4.6396 or 4.64 rounded to two decimal places.

numerator df = DFM = # explanatory variables = 3. denominator df = n – v – 1 = 10 – 3 – 1 = 6. Both are highlighted in red in the output below:

Analysis of Variance
Source          DF       SS       MS     F      P
Regression       3  0.48397  0.16132  4.64  0.053
Residual Error   6  0.20859  0.03477
Total            9  0.69256

There is suggestive, but weak, evidence to indicate that at least one of number of daily newspaper copies, number of radios, and/or number of TV sets helps to explain a country’s literacy rate (p-value = 0.053). Some notes: 1) even though the evidence is weak, we should continue the analysis to find out for sure if there is at least one explanatory variable that is a significant predictor of literacy rate and, if so, which one or ones. Anytime the p-value is less than 0.1 for the F-test, we should continue the analysis. 2) Remember, the conclusion states that there is suggestive evidence that at least one explanatory variable is a significant predictor of literacy rate. It does NOT tell us how many or which one or ones are significant predictors of literacy rate – only that there is at least one that is. 3) If the F-test indicates no evidence to reject the null hypothesis, then there is no need to continue the analysis, as there is no evidence to indicate that any of the explanatory variables are helpful in explaining the response variable. However, if there is even the slightest bit of evidence to reject the null hypothesis from the F-test (i.e., p-value < 0.10), we should continue the analysis. This will involve doing t-tests on each explanatory variable, as we will see below.

First, we must check that we’re not extrapolating: all values of the explanatory variables are within the range of the data collected, so we’re okay. (To illustrate, 200 daily newspapers is between the minimum of 10 daily newspaper copies per 1000 people in Kenya and the maximum of 391 daily newspaper copies per 1000 people in Norway.) Second, make sure you put the right values in for the right x’s – recall that x1 = number of daily newspaper copies, x2 = number of radios, and x3 = number of TV sets (all per 1000 people): ŷ = 0.840. We’d predict about 84% of the residents to be literate in such a country.

B. Since the coefficient is negative, we’d expect the literacy rate to be lower for every additional radio per 1000 people in the population (for countries with the same number of daily newspaper copies and TV sets per 1000 people in the population).

Response variable: literacy rate. Explanatory variables: number of daily newspaper copies, number of radios, and number of TV sets (all per 1000 people in the population of the country).

ŷ = 0.51486 + 0.00054x1 − 0.00035x2 + 0.00199x3, where ŷ = predicted literacy rate, x1 = the number of daily newspaper copies in the country (per 1000 people), x2 = the number of radios in the country (per 1000 people), and x3 = the number of TV sets in the country (per 1000 people). Note where these numbers come from in the output – they are highlighted in red in the output below. It is important that we make sure we get the right coefficient with the right variable!

Predictor           Coef        SE Coef     T      P
Constant            0.51486     0.09368     5.50   0.002
newspaper copies    0.0005421   0.0008653   0.63   0.554
radios             -0.0003535   0.0003285  -1.08   0.323
television sets     0.001988    0.001550    1.28   0.247

The “constant” term. If all the x’s and the residual equal 0, the model would be: y = B0 + B1(0) + B2(0) + … + Bv(0) + 0 = B0.

Recall, the degrees of freedom for any hypothesis test or confidence interval that involves a t-statistic is DFE = n – v – 1, where v = the number of explanatory variables in the model. In our problem, n = 10 and v = 3. Therefore, the degrees of freedom for the t* critical value is 10 – 3 – 1 = 6.

b3 = 0.00199, SE(b3) = 0.00155, and t* = 2.447. Therefore, the lower bound = (0.00199) – (2.447)(0.00155) = −0.00180. The upper bound = (0.00199) + (2.447)(0.00155) = 0.00578. We write the 95% confidence interval for B3 as (−0.00180, 0.00578).

D. The interpretation is a combination of the interpretation of a confidence interval and the interpretation of the coefficient.

Both A and C are correct! The backwards selection process says to remove the variable with the highest p-value from the t-test as long as it’s greater than 0.05 (or so). All three variables have p-values greater than 0.05, and newspaper copies has the highest p-value, so it gets removed first since it is the “least significant” explanatory variable. So, C is correct. A is also correct because the closer a t-statistic is to 0, the higher its p-value. (Think about that – a t-statistic tells us how many standard errors an observation is from the mean. The more standard errors an observation is from the mean, the lower the tail area probability, which means a lower p-value.) It is important to note that the backwards selection process only eliminates one variable at a time! Therefore, E is not correct – again, we never remove more than one variable at a time!!


Introduction to Data Analysis in R

8 Multiple Linear Regression (MLR)

Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable (the response variable) and two or more independent variables (the predictor variables). The independent variables can be continuous OR categorical. Unlike simple linear regression, where we describe the relationship between X and Y (two dimensions) and can simply plot them against each other, we are now working with multiple X’s and Y, which is three-dimensional or more.

Here we are using the pie_crab data set again to develop a multiple linear regression model to predict crab size with additional variables from the data set, latitude , air_temp , and water_temp . Let’s first plot each of our predictor variables’ linear relationship with our response variable, crab size:


A multiple linear regression, at the location of each observation, incorporates each of our three variables’ simple linear relationships with crab size using the following equation:

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)

In this equation, y is our response variable, crab size, while each x represents one of our predictor variables. \(\beta_0\) represents the intercept; we can think of this as the value of y if all of our x’s were zero. Each \(\beta\) is called a partial regression coefficient; this is because we can think of each as the slope in that x’s dimension if all of our other x’s were held constant. Lastly, \(\varepsilon\) is the distance between our observation and what our model predicts for it (i.e., observed − predicted).

8.1 MLR in R

Running a multiple linear regression is very similar to the simple linear regression, but now we specify our multiple predictor variables by adding them together with a + sign (the order of our predictor variables does not matter). Here we are using the pie_crab data set again to develop a multiple linear regression model with additional variables from the data set:
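The original code chunk is not shown here; a minimal sketch, assuming the pie_crab columns are named size, latitude, air_temp, and water_temp:

```r
# Fit crab size as a function of latitude, air temperature, and water temperature
crab_mlr <- lm(size ~ latitude + air_temp + water_temp, data = pie_crab)
summary(crab_mlr)
```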

77.7460 is our line’s intercept (β0)

-1.0587 is the slope in the latitude dimension, or the estimated change in crab size for a unit change in latitude among crabs living with the same air temperature and water temperature conditions.

-2.4041 is the slope in the air temperature dimension, or the estimated change in crab size for a unit change in air temperature among crabs living with the same water temperature and latitude conditions.

0.7563 is the slope in the water temperature dimension, or the estimated change in crab size for a unit change in water temperature among crabs living with the same air temperature and latitude conditions.

\(y = -1.0587\,x_1 - 2.4041\,x_2 + 0.7563\,x_3 + 77.7460\)

In the model’s summary, our p-value is indicated in the Pr(>|t|) column for each variable: because our p-values are well below 0.01, we can deduce that each variable has a significant effect on crab size.

Our multiple R-squared (R²) is the squared Pearson correlation between the observed and the fitted (i.e., predicted) values. We can interpret this as: 42.06% of the variability in crab size is explained by the linear regression on water temperature, air temperature, and latitude. NOTE: R² always increases when an additional predictor is added to a linear model.

8.1.1 Predicting crab size

With this multiple linear equation, we can now predict crab size across different varieties of latitude, air temperature, and water temperature using the base R predict() function:
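A sketch of that prediction step, continuing from the crab_mlr model assumed above (the new predictor values below are made up purely for illustration):

```r
# Hypothetical new sites: latitude and temperature values chosen for illustration only
new_sites <- data.frame(
  latitude   = c(34, 42),
  air_temp   = c(18, 12),
  water_temp = c(22, 15)
)

predict(crab_mlr, newdata = new_sites)
```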

8.1.2 MLR Assumptions

An important aspect when building a multiple linear regression model is to make sure that the following key assumptions are met:

All observations are independent of one another.

There must be a linear relationship between the dependent and the independent variables.

The variance of the residual errors is similar across the value of each independent variable.


This “Residuals vs Fitted” (fitted meaning the predicted values) plot gives an indication of whether there are non-linear patterns. This is a bit subjective, but a good way of verifying that this assumption is met is by ensuring that no clear trend seems to exist. The residuals should also occupy equal space above and below the line, and along the length of the line.

The residual error values are normally distributed.


… also a bit subjective, but so long as the points on the Q-Q plot follow the dotted line, this assumption is fulfilled.

The independent variables are not highly correlated with each other.

Multicollinearity can lead to unreliable coefficient estimates, while adding more variables to the model will always increase the R² value, suggesting a higher proportion of variance explained by the model than is justified.

Normally, we should exclude variables that have a correlation coefficient greater than 0.7 (or less than −0.7). Alas, all of our variables are HIGHLY correlated with each other. Therefore, these predictors should not all be used in our model. Which is also to say… it is a good idea to check your predictor variables for collinearity before developing a model.
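A quick sketch of that collinearity check, again assuming the pie_crab column names used above:

```r
# Pairwise correlations among the candidate predictors
cor(pie_crab[, c("latitude", "air_temp", "water_temp")])
```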

8.2 Exercises

We are interested in developing a multiple linear regression model to predict mean annual stream flow across the Eastern US. For every state, we have a handful of watershed and site characteristic data associated with USGS stream gauging stations.

Download the ‘usgs’ folder on Canvas and store it in a ‘data’ folder in this assignment’s project directory. Here is a list of all of these files:

1. Read in each of the data sets associated with the assignment and combine them into a single data set. (HINT: What does map_dfr() do?) 2.5 pts.

2. Using our combined data set, plot each variable against mean annual stream flow to identify variables that seem to have a linear relationship with stream flow. 5 pts.

3. Develop a multiple linear regression model using any combination of the variables in the data set. What is your R-squared value? Which of your variables (if any) are significant predictors of stream flow? 5 pts.

4. Check to see if your model meets the model assumptions required for MLR. 2.5 pts.

5. Use your model to predict mean annual stream flow for two new sets of predictor data. 2.5 pts.

6. If your model does not meet the model’s assumptions, what are some ways of manipulating the data set so that it might? (HINT: review chapter 6) 2.5 pts.

8.3 Citations

Data Source: Johnson, D. 2019. Fiddler crab body size in salt marshes from Florida to Massachusetts, USA at PIE and VCR LTER and NOAA NERR sites during summer 2016. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/4c27d2e778d3325d3830a5142e3839bb (Accessed 2021-05-27).

Johnson DS, Crowley C, Longmire K, Nelson J, Williams B, Wittyngham S. The fiddler crab, Minuca pugnax, follows Bergmann’s rule. Ecol Evol. 2019;00:1–9. https://doi.org/10.1002/ece3.5883


Contains Solutions and Notes for the Machine Learning Specialization By Stanford University and Deeplearning.ai - Coursera (2022) by Prof. Andrew NG

greyhatguy007/Machine-Learning-Specialization-Coursera


Machine Learning Specialization (Coursera)


Contains Solutions and Notes for the Machine Learning Specialization by Andrew NG on Coursera

Note : If you would like to have a deeper understanding of the concepts by understanding all the math required, have a look at Mathematics for Machine Learning and Data Science

Course 1 : Supervised Machine Learning: Regression and Classification

  • Practice quiz: Regression
  • Practice quiz: Supervised vs unsupervised learning
  • Practice quiz: Train the model with gradient descent
  • Model Representation
  • Cost Function
  • Gradient Descent
  • Practice quiz: Gradient descent in practice
  • Practice quiz: Multiple linear regression
  • Numpy Vectorization
  • Multi Variate Regression
  • Feature Scaling
  • Feature Engineering
  • Sklearn Gradient Descent
  • Sklearn Normal Method
  • Linear Regression
  • Practice quiz: Cost function for logistic regression
  • Practice quiz: Gradient descent for logistic regression
  • Classification
  • Sigmoid Function
  • Decision Boundary
  • Logistic Loss
  • Scikit Learn - Logistic Regression
  • Overfitting
  • Regularization
  • Logistic Regression

Certificate Of Completion

Course 2 : Advanced Learning Algorithms

  • Practice quiz: Neural networks intuition
  • Practice quiz: Neural network model
  • Practice quiz: TensorFlow implementation
  • Practice quiz : Neural Networks Implementation in Numpy
  • Neurons and Layers
  • Coffee Roasting
  • Coffee Roasting Using Numpy
  • Neural Networks for Binary Classification
  • Practice quiz : Neural Networks Training
  • Practice quiz : Activation Functions
  • Practice quiz : Multiclass Classification
  • Practice quiz : Additional Neural Networks Concepts
  • Multiclass Classification
  • Neural Networks For Handwritten Digit Recognition - Multiclass
  • Practice quiz : Advice for Applying Machine Learning
  • Practice quiz : Bias and Variance
  • Practice quiz : Machine Learning Development Process
  • Advice for Applied Machine Learning
  • Practice quiz : Decision Trees
  • Practice quiz : Decision Trees Learning
  • Practice quiz : Decision Trees Ensembles
  • Decision Trees

Certificate of Completion

Course 3 : Unsupervised Learning, Recommenders, Reinforcement Learning

  • Practice quiz : Clustering
  • Practice quiz : Anomaly Detection
  • Anomaly Detection
  • Practice quiz : Collaborative Filtering
  • Practice quiz : Recommender systems implementation
  • Practice quiz : Content-based filtering
  • Collaborative Filtering RecSys
  • RecSys using Neural Networks
  • Practice quiz : Reinforcement learning introduction
  • Practice Quiz : State-action value function
  • Practice Quiz : Continuous state spaces
  • Deep Q-Learning - Lunar Lander

Specialization Certificate

Stargazers over time

Course Review :

This course is a great starting point on the way to becoming a Machine Learning Engineer. Even if you're an expert, many algorithms are covered in depth, such as decision trees, which may help you improve your skills further.

Special thanks to Professor Andrew Ng for structuring and tailoring this course.

A glimpse of what you might be able to accomplish by the end of this specialization:

Write a reinforcement learning algorithm to land the Lunar Lander using Deep Q-Learning

  • The lander was trained to land correctly on the surface, between the flags, after many unsuccessful attempts at learning how to do it.
  • The final landing after training the agent using appropriate parameters :

Write an algorithm for a Movie Recommender System

  • A movie database is collected based on genre.
  • A content-based filtering and collaborative filtering algorithm is trained, and the movie recommender system is implemented.
  • It gives movie recommendations based on the movie genre.

  • And Much More !!

Concluding, this is a course which I would recommend everyone to take. Not just because you learn many new things, but also because the assignments are real-life examples which are exciting to complete.

Happy Learning :))


Multiple Linear Regression With scikit-learn

In this article, let’s learn about multiple linear regression using scikit-learn in the Python programming language.

Regression is a statistical method for determining the relationship between features and an outcome variable or result. In machine learning, it’s utilized as a method for predictive modeling, in which an algorithm is employed to forecast continuous outcomes. Multiple linear regression, often known as multiple regression, is a statistical method that predicts the result of a response variable by combining numerous explanatory variables. Multiple regression is a variant of linear regression (ordinary least squares) in which more than one explanatory variable is used.

Mathematical Intuition:

To improve prediction, more independent factors are combined. The following is the linear relationship between the dependent and independent variables:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

Here, y is the dependent variable.

  • x1, x2, x3, … are independent variables.
  • b0 = intercept of the line.
  • b1, b2, … are coefficients.

A simple linear regression line is of the form y = mx + c. For example, take a simple case with three features: feature 1 = TV, feature 2 = radio, feature 3 = newspaper, and output variable = sales. The independent variables are the features feature 1, feature 2, and feature 3; the dependent variable is sales. The equation for this problem will be:

y = b0 + b1x1 + b2x2 + b3x3

where x1, x2, and x3 are the feature variables.

In this example, we use scikit-learn to perform linear regression. As we have multiple feature variables and a single outcome variable, it’s a Multiple linear regression. Let’s see how to do this step-wise.

Stepwise Implementation

Step 1: import the necessary packages.

The necessary packages such as pandas, NumPy, sklearn, etc… are imported.

Step 2: Import the CSV file:

The CSV file is imported using the pd.read_csv() method. The ‘No’ column is dropped, as an index is already present. The df.head() method is used to retrieve the first five rows of the dataframe. The df.columns attribute returns the names of the columns. The column names starting with ‘X’ are the independent features in our dataset. The column ‘Y house price of unit area’ is the dependent variable column. As the number of independent or exploratory variables is more than one, it is a multilinear regression.


 
   X1 transaction date  X2 house age  ...  X6 longitude  Y house price of unit area
0             2012.917          32.0  ...     121.54024                        37.9
1             2012.917          19.5  ...     121.53951                        42.2
2             2013.583          13.3  ...     121.54391                        47.3
3             2013.500          13.3  ...     121.54391                        54.8
4             2012.833           5.0  ...     121.54245                        43.1

[5 rows x 7 columns]

Index(['X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

Step 3: Create a scatterplot to visualize the data:

A scatterplot is created to visualize the relation between the ‘X4 number of convenience stores’ independent variable and the ‘Y house price of unit area’ dependent feature.


Step 4: Create feature variables: 

To model the data we need to create feature variables, X variable contains independent variables and y variable contains a dependent variable. X and Y feature variables are printed to see the data.

     X1 transaction date  X2 house age  ...  X5 latitude  X6 longitude
0               2012.917          32.0  ...     24.98298     121.54024
1               2012.917          19.5  ...     24.98034     121.53951
2               2013.583          13.3  ...     24.98746     121.54391
3               2013.500          13.3  ...     24.98746     121.54391
4               2012.833           5.0  ...     24.97937     121.54245
..                   ...           ...  ...          ...           ...
409             2013.000          13.7  ...     24.94155     121.50381
410             2012.667           5.6  ...     24.97433     121.54310
411             2013.250          18.8  ...     24.97923     121.53986
412             2013.000           8.1  ...     24.96674     121.54067
413             2013.500           6.5  ...     24.97433     121.54310

[414 rows x 6 columns]

0      37.9
1      42.2
2      47.3
3      54.8
4      43.1
       ...
409    15.4
410    50.0
411    40.6
412    52.5
413    63.9
Name: Y house price of unit area, Length: 414, dtype: float64

Step 5: Split data into train and test sets:

Here, the train_test_split() method is used to create train and test sets; the feature variables are passed to the method. The test size is given as 0.3, which means 30% of the data goes into the test set, and the train set contains 70% of the data. The random state is given for data reproducibility.
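A sketch of this step, assuming X and y were created as above (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
```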

Step 6: Create a linear regression model

A linear regression model is created. The LinearRegression() class is used to create the regression model; the class is imported from the sklearn.linear_model package.

Step 7: Fit the model with training data.

After creating the model, it is fitted to the training data using the fit() method, so that the model learns the relationship between the features and the target from the training set.
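For example:

model.fit(X_train, y_train)   # learn the coefficients from the training set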

Step 8: Make predictions on the test data set.

The model.predict() method is used to make predictions on X_test. Because the test data is unseen by the model, it gives an honest measure of predictive performance.
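For example:

predictions = model.predict(X_test)   # predict on the unseen test set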

Step 9: Evaluate the model with metrics.

The multiple linear regression model is evaluated with the mean_squared_error and mean_absolute_error metrics. Comparing these errors with the mean of the target variable gives a sense of how well the model is predicting. mean_squared_error is the mean of the squared residuals, and mean_absolute_error is the mean of the absolute residuals. The smaller the error, the better the model's performance.
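Using the scikit-learn metric functions, for example:

print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))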

mean absolute error (MAE): the mean of the absolute values of the residuals, MAE = (1/n) Σ |yᵢ − ŷᵢ|.

mean squared error (MSE): the mean of the squared residuals, MSE = (1/n) Σ (yᵢ − ŷᵢ)².

  • yᵢ = actual value
  • ŷᵢ (y hat) = predicted value
  • n = number of observations

If you want a metric that is less sensitive to outliers, MAE is the preferable choice; if you want large errors to be penalized more heavily, MSE (or its square root, RMSE) is the way to go. RMSE is always at least as large as MAE, and the two are equal only when all of the errors have the same magnitude.

Here is the full code, combining the steps above.
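A sketch of the combined script, under the same assumptions as above (the file name and random_state are placeholders):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Step 2: load the data and drop the redundant 'No' index column
df = pd.read_csv('real_estate.csv')
df = df.drop('No', axis=1)
print(df.head())
print(df.columns)

# Step 3: scatterplot of one predictor against the target
plt.scatter(df['X4 number of convenience stores'], df['Y house price of unit area'])
plt.xlabel('X4 number of convenience stores')
plt.ylabel('Y house price of unit area')
plt.show()

# Step 4: feature variables
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area']

# Step 5: train/test split (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Steps 6-8: create, fit, and predict
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Step 9: evaluate
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))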

                     
Output (the dataframe listings repeat those shown above, followed by the evaluation metrics):

mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571


Machine Learning - Multiple Regression

Multiple Regression

Multiple regression is like linear regression , but with more than one independent value, meaning that we try to predict a value based on two or more variables.

Take a look at the data set below; it contains some information about cars.

Car Model Volume Weight CO2
Toyota Aygo 1000 790 99
Mitsubishi Space Star 1200 1160 95
Skoda Citigo 1000 929 95
Fiat 500 900 865 90
Mini Cooper 1500 1140 105
VW Up! 1000 929 105
Skoda Fabia 1400 1109 90
Mercedes A-Class 1500 1365 92
Ford Fiesta 1500 1112 98
Audi A1 1600 1150 99
Hyundai I20 1100 980 99
Suzuki Swift 1300 990 101
Ford Fiesta 1000 1112 99
Honda Civic 1600 1252 94
Hyundai I30 1600 1326 97
Opel Astra 1600 1330 97
BMW 1 1600 1365 99
Mazda 3 2200 1280 104
Skoda Rapid 1600 1119 104
Ford Focus 2000 1328 105
Ford Mondeo 1600 1584 94
Opel Insignia 2000 1428 99
Mercedes C-Class 2100 1365 99
Skoda Octavia 1600 1415 99
Volvo S60 2000 1415 99
Mercedes CLA 1500 1465 102
Audi A4 2000 1490 104
Audi A6 2000 1725 114
Volvo V70 1600 1523 109
BMW 5 2000 1705 114
Mercedes E-Class 2100 1605 115
Volvo XC70 2000 1746 117
Ford B-Max 1600 1235 104
BMW 2 1600 1390 108
Opel Zafira 1600 1405 109
Mercedes SLK 2500 1395 120

We can predict the CO2 emission of a car based on the size of the engine, but with multiple regression we can throw in more variables, like the weight of the car, to make the prediction more accurate.

How Does it Work?

In Python we have modules that will do the work for us. Start by importing the Pandas module.

import pandas

Learn about the Pandas module in our Pandas Tutorial .

The Pandas module allows us to read csv files and return a DataFrame object.

The file is meant for testing purposes only, you can download it here: data.csv

df = pandas.read_csv("data.csv")

Then make a list of the independent values and call this variable X .

Put the dependent values in a variable called y .

X = df[['Weight', 'Volume']]
y = df['CO2']

Tip: It is common to name the list of independent values with an uppercase X, and the list of dependent values with a lowercase y.

We will use some methods from the sklearn module, so we will have to import that module as well:

from sklearn import linear_model

From the sklearn module we will use the LinearRegression() method to create a linear regression object.

This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship:

regr = linear_model.LinearRegression()
regr.fit(X, y)

Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume:

# predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3:
predictedCO2 = regr.predict([[2300, 1300]])

See the whole example in action:
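Assembled from the snippets above, the whole example looks like this:

import pandas
from sklearn import linear_model

df = pandas.read_csv("data.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3:
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)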


We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.


Coefficient

The coefficient is a factor that describes the relationship with an unknown variable.

Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient.

In this case, we can ask for the coefficient value of weight against CO2, and for volume against CO2. The answer(s) we get tells us what would happen if we increase, or decrease, one of the independent values.

Print the coefficient values of the regression object:
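In scikit-learn the fitted coefficients are stored in the regression object's coef_ attribute:

print(regr.coef_)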

Result Explained

The result array represents the coefficient values of weight and volume.

Weight: 0.00755095 Volume: 0.00780526

These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g.

And if the engine size (Volume) increases by 1 cm3, the CO2 emission increases by 0.00780526 g.

I think that is a fair guess, but let's test it!

We have already predicted that if a car with a 1300 cm3 engine weighs 2300 kg, the CO2 emission will be approximately 107 g.

What if we increase the weight by 1000 kg?

Copy the example from before, but change the weight from 2300 to 3300:
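For example:

predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)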

We have predicted that a car with a 1.3 liter engine and a weight of 3300 kg will release approximately 115 grams of CO2 for every kilometer it drives.

Which shows that the coefficient of 0.00755095 is correct:

107.2087328 + (1000 * 0.00755095) = 114.75968


The Analytics Edge (MIT OpenCourseWare, Sloan School of Management, Prof. Dimitris Bertsimas)

2 Linear Regression

2.1 Welcome to Unit 2

  • 2.1.1 Welcome to Unit 2

2.2 The Statistical Sommelier: An Introduction to Linear Regression

  • 2.2.1 Video 1: Predicting the Quality of Wine
  • 2.2.2 Quick Question
  • 2.2.3 Video 2: One-Variable Linear Regression
  • 2.2.4 Quick Question
  • 2.2.5 Video 3: Multiple Linear Regression
  • 2.2.6 Quick Question
  • 2.2.7 Video 4: Linear Regression in R
  • 2.2.8 Quick Question
  • 2.2.9 Video 5: Understanding the Model
  • 2.2.10 Quick Question
  • 2.2.11 Video 6: Correlation and Multicollinearity
  • 2.2.12 Quick Question
  • 2.2.13 Video 7: Making Predictions
  • 2.2.14 Quick Question
  • 2.2.15 Video 8: Comparing the Model to the Experts

2.3 Moneyball: The Power of Sports Analytics

  • 2.3.1 A Quick Introduction to Baseball
  • 2.3.2 Video 1: The Story of Moneyball
  • 2.3.3 Video 2: Making it to the Playoffs
  • 2.3.4 Quick Question
  • 2.3.5 Video 3: Predicting Runs
  • 2.3.6 Quick Question
  • 2.3.7 Video 4: Using the Models to Make Predictions
  • 2.3.8 Quick Question
  • 2.3.9 Video 5: Winning the World Series
  • 2.3.10 Quick Question
  • 2.3.11 Video 6: The Analytics Edge in Sports
  • 2.3.12 Quick Question

2.4 Playing Moneyball in the NBA (Recitation)

  • 2.4.1 Welcome to Recitation 2
  • 2.4.2 Video 1: The Data
  • 2.4.3 Video 2: Playoffs and Wins
  • 2.4.4 Video 3: Points Scored
  • 2.4.5 Video 4: Making Predictions

2.5 Assignment 2

  • 2.5.1 Climate Change
  • 2.5.2 Reading Test Scores
  • 2.5.3 Detecting Flu Epidemics via Search Engine Query Data
  • 2.5.4 State Data


Video 1: Predicting the Quality of Wine

The slides from all videos in this Lecture Sequence can be downloaded here:  Introduction to Linear Regression (PDF - 1.3MB) .


Introduction to Baseball Video

If you are unfamiliar with the game of baseball, please watch this short video clip for a quick introduction to the game. You don’t need to be a baseball expert to understand this lecture, but basic knowledge of the game will be helpful to you.

TruScribe. “Baseball Rules of Engagement.” March 27, 2012. YouTube. This video is from TrueScribeVideos  and is not covered by our Creative Commons license .



Climate Change

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change (CSV)  contains climate data from May 1983 to December 2008. The available variables include:

  • Year : the observation year.
  • Month : the observation month.
  • Temp : the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia .
  • CO2 ,  N2O , CH4 ,  CFC.11 , CFC.12 : atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane  (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division .
  • CO2, N2O and CH4 are expressed in ppmv (parts per million by volume  – i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere)
  • CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume). 
  • Aerosols : the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
  • TSI : the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website .  
  • MEI : multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division .

Problem 1.1 - Creating Our First Model

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set , consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables ( Year and Month should NOT be used in the model). Use the training set to build the model.

Enter the model R2 (the “Multiple R-squared” value):


Explanation

First, read in the data and split it using the subset command:

climate = read.csv("climate_change.csv")

train = subset(climate, Year <= 2006)

test = subset(climate, Year > 2006)

Then, you can create the model using the command:

climatelm = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)

Lastly, look at the model using summary(climatelm). The Multiple R-squared value is 0.7509.


Problem 1.2 - Creating Our First Model

Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05. (Select all that apply.)

If you look at the model we created in the previous problem using summary(climatelm), all of the variables have at least one star except for CH4 and N2O. So MEI, CO2, CFC.11, CFC.12, TSI, and Aerosols are all significant.

Problem 2.1 - Understanding the Model

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative , indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the simplest correct explanation for this contradiction?

 Climate scientists are wrong that N2O and CFC-11 are greenhouse gases - this regression analysis constitutes part of a disproof. 

 There is not enough data, so the regression coefficients being estimated are not accurate. 

 All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set. 

The linear correlation of N2O and CFC.11 with other variables in the data set is quite large. The first explanation does not seem correct, as the warming effect of nitrous oxide and CFC-11 are well documented, and our regression analysis is not enough to disprove it. The second explanation is unlikely, as we have estimated eight coefficients and the intercept from 284 observations.

Problem 2.2 - Understanding the Model

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.

Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.

You can calculate all correlations at once using cor(train) where train is the name of the training data set.

Problem 3 - Simplifying the Model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Enter the coefficient of N2O in this reduced model:

(How does this compare to the coefficient in the previous model with all of the variables?)

Enter the model R2:

We can create this simplified model with the command:

LinReg = lm(Temp ~ MEI + N2O + TSI + Aerosols, data=train)

You can get the coefficient for N2O and the model R-squared by typing summary(LinReg).

We have observed that, for this problem, when we remove many variables the sign of N2O flips. The model has not lost a lot of explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. As discussed in lecture, this type of behavior is typical when building a model where many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.



Multiple Regression Analysis using SPSS Statistics

Introduction.

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.

Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.

This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.


Assumptions.

When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions:

  • Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable . If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
  • Assumption #2: You have two or more independent variables , which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). For examples of continuous and ordinal variables , see the bullet above. Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Again, you can learn more about variables in our article: Types of Variable . If one of your independent variables is dichotomous and considered a moderating variable, you might need to run a Dichotomous moderator analysis .
  • Assumption #3: You should have independence of observations (i.e., independence of residuals ), which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required, in our enhanced multiple regression guide.
  • Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively . Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots and partial regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots to check for linearity. If the relationship displayed in your scatterplots and partial regression plots are not linear, you will have to either run a non-linear regression analysis or "transform" your data, which you can do using SPSS Statistics. In our enhanced multiple regression guide, we show you how to: (a) create scatterplots and partial regression plots to check for linearity when carrying out multiple regression using SPSS Statistics; (b) interpret different scatterplot and partial regression plot results; and (c) transform your data using SPSS Statistics if you do not have linear relationships between your variables.
  • Assumption #5: Your data needs to show homoscedasticity , which is where the variances along the line of best fit remain similar as you move along the line. We explain more about what this means and how to assess the homoscedasticity of your data in our enhanced multiple regression guide. When you analyse your own data, you will need to plot the studentized residuals against the unstandardized predicted values. In our enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity using SPSS Statistics; (b) some of the things you will need to consider when interpreting your data; and (c) possible ways to continue with your analysis if your data fails to meet this assumption.
  • Assumption #6: Your data must not show multicollinearity , which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. Therefore, in our enhanced multiple regression guide, we show you: (a) how to use SPSS Statistics to detect for multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret these correlation coefficients and Tolerance/VIF values so that you can determine whether your data meets or violates this assumption.
  • Assumption #7: There should be no significant outliers , high leverage points or highly influential points . Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your data, you can detect possible outliers, high leverage points and highly influential points. In our enhanced multiple regression guide, we: (a) show you how to detect outliers using "casewise diagnostics" and "studentized deleted residuals", which you can do using SPSS Statistics, and discuss some of the options you have in order to deal with outliers; (b) check for leverage points using SPSS Statistics and discuss what you should do if you have any; and (c) check for influential points in SPSS Statistics using a measure of influence known as Cook's Distance, before presenting some practical approaches in SPSS Statistics to deal with any influential points you might have.
  • Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using: (a) a histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.

In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a multiple regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A health researcher wants to be able to predict "VO2 max", an indicator of fitness and health. Normally, to perform this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off those individuals who are not very active/fit and those individuals who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO2 max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO2 max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower workload cycling test. The researcher's goal is to be able to predict VO2 max based on these four attributes: age, weight, heart rate and gender.

Setup in SPSS Statistics

In SPSS Statistics, we created six variables: (1) VO2 max, which is the maximal aerobic capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight (technically, it is their 'mass'); (4) heart_rate, which is the participant's heart rate; (5) gender, which is the participant's gender; and (6) caseno, which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics.

Test Procedure in SPSS Statistics

The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28 , as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version , SPSS Statistics introduced a new look to their interface called " SPSS Light ", replacing the previous look for versions 26 and earlier versions , which was called " SPSS Standard ". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical .

Menu for a multiple regression analysis in SPSS Statistics

Published with written permission from SPSS Statistics, IBM Corporation.

Note: Don't worry that you're selecting Analyze > Regression > Linear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression. You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.

'Linear Regression' dialogue box for a multiple regression analysis in SPSS Statistics. All variables on the left

Interpreting and Reporting the Output of Multiple Regression Analysis

SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.

However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:

Determining how well the model fits

The first table of interest is the Model Summary table. This table provides the R, R², adjusted R², and the standard error of the estimate, which can be used to determine how well a regression model fits the data:

'Model Summary' table for a multiple regression analysis in SPSS. 'R', 'R Square' & 'Adjusted R Square' highlighted

The " R " column represents the value of R , the multiple correlation coefficient . R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max . A value of 0.760, in this example, indicates a good level of prediction. The " R Square " column represents the R 2 value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO 2 max . However, you also need to be able to interpret " Adjusted R Square " ( adj. R 2 ) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.

Statistical significance

The F-ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F(4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit of the data).

'ANOVA' table for a multiple regression analysis in SPSS Statistics. 'df', 'F' & 'Sig.' highlighted

Estimated model coefficients

The general form of the equation to predict VO2 max from age, weight, heart_rate and gender is:

predicted VO2 max = 87.83 – (0.165 × age) – (0.385 × weight) – (0.118 × heart_rate) + (13.208 × gender)

This is obtained from the Coefficients table, as shown below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 'Unstandardized Coefficients B' highlighted

Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B1, for age is equal to -0.165 (see Coefficients table). This means that for each one year increase in age, there is a decrease in VO2 max of 0.165 ml/min/kg.
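As a worked illustration (the participant's values here are hypothetical, and gender is assumed to be coded as 1 for this calculation): for someone aged 30, weighing 70 kg, with a heart rate of 130, the equation gives predicted VO2 max = 87.83 – (0.165 × 30) – (0.385 × 70) – (0.118 × 130) + (13.208 × 1) ≈ 53.8 ml/min/kg.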

Statistical significance of the independent variables

You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t -value and corresponding p -value are located in the " t " and " Sig. " columns, respectively, as highlighted below:

'Coefficients' table for a multiple regression analysis in SPSS Statistics. 't' & 'Sig.' highlighted

You can see from the "Sig." column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B0, is tested for statistical significance, this is rarely an important or interesting finding.

Putting it all together

You could write up the results as follows:

A multiple regression was run to predict VO2 max from gender, age, weight and heart rate. These variables statistically significantly predicted VO2 max, F(4, 95) = 32.393, p < .0005, R² = .577. All four variables added statistically significantly to the prediction, p < .05.

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.


Linear vs. Multiple Regression: What's the Difference?

Thomas J Catalano is a CFP and Registered Investment Adviser with the state of South Carolina, where he launched his own financial advisory firm in 2018. Thomas' experience gives him expertise in a variety of areas including investments, retirement, insurance, and financial planning.


Linear Regression vs. Multiple Regression: An Overview

Linear regression (also called simple regression) is one of the most common techniques of regression analysis. Multiple regression is a broader class of regression analysis, which encompasses both linear and nonlinear regressions with multiple explanatory variables.

Regression analysis is a statistical method used in finance and investing . Regression analysis pools data together to help people and companies make informed decisions. There are different variables at play in this type of statistical analysis, including a dependent variable—the main variable that you're trying to understand—and an independent variable(s)—factors that may have an impact on the dependent variable.

There are several main reasons people use regression analysis:

  • To predict future economic conditions, trends, or values.
  • To determine the relationship between two or more variables.
  • To understand how one variable changes when another changes.

While there are many different kinds of regression analysis, this article will examine two different types: linear regression and multiple regression.

Key Takeaways

  • Regression analysis is a common statistical method used in finance and investing.
  • Linear regression (also called simple regression) is one of the most common techniques of regression analysis; in linear regression, there are only two variables: the independent variable and the dependent variable.
  • Whereas linear regression only has one independent variable, multiple regression encompasses both linear and nonlinear regressions and incorporates multiple independent variables.
  • Each independent variable in multiple regression has its own coefficient to ensure each variable is weighted appropriately.

Also called simple regression, linear regression establishes the relationship between two variables. Linear regression is graphically depicted using a straight line; the slope defines how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of one variable, when the value of the other is 0.

In linear regression, every dependent value has a single corresponding independent variable that drives its value. For example, in the linear regression formula of y = 3x + 7, there is only one possible outcome of "y" if "x" is defined as 2.
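For instance, with x defined as 2, the formula gives y = 3(2) + 7 = 13; no other value of y is possible for that x.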

If the relationship between two variables does not follow a straight line, nonlinear regression may be used instead. Linear and nonlinear regression are similar in that both track a particular response from a set of variables. As the relationship between the variables becomes more complex, nonlinear models have greater flexibility and capability of depicting the non-constant slope.

For complex connections between data, the relationship might be explained by more than one variable. In this case, an analyst uses multiple regression; multiple regression attempts to explain a dependent variable using more than one independent variable.

There are two main uses for multiple regression analysis. The first is to determine the dependent variable based on multiple independent variables. For example, you may be interested in determining what a crop yield will be based on temperature, rainfall, and other independent variables. The second is to determine how strong the relationship is between each variable. For example, you may be interested in knowing how a crop yield will change if rainfall increases—or the temperature decreases.

Multiple regression assumes there is not a strong relationship among the independent variables themselves. It also assumes there is a correlation between each independent variable and the single dependent variable. Each of these relationships is weighted to ensure more impactful independent variables drive the dependent value by adding a unique regression coefficient to each independent variable.

A company can not only use regression analysis to understand certain situations, like why customer service calls are dropping, but also to make forward-looking predictions, like sales figures in the future.

Linear Regression vs. Multiple Regression Example

Consider an analyst who wishes to establish a relationship between the daily change in a company's stock prices and the daily change in trading volume . Using linear regression, the analyst can attempt to determine the relationship between the two variables:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (y-intercept)

If the stock price increases $0.10 before any trades occur and increases $0.01 for every share sold, the linear regression outcome is:

Daily Change in Stock Price = ($0.01)(Daily Change in Trading Volume) + $0.10
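For instance, on a day when trading volume rises by 50 shares, this model predicts the stock price to rise by ($0.01)(50) + $0.10 = $0.60.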

However, the analyst realizes there are several other factors to consider including the company's P/E ratio, dividends, and prevailing inflation rate. The analyst can perform multiple regression to determine which—and how strongly—each of these variables impacts the stock price:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (Coefficient)(Company's P/E Ratio) + (Coefficient)(Dividend) + (Coefficient)(Inflation Rate)

Is Multiple Linear Regression Better Than Simple Linear Regression?

Multiple linear regression is a more specific calculation than simple linear regression. For straightforward relationships, simple linear regression may easily capture the relationship between the two variables. For more complex relationships requiring more consideration, multiple linear regression is often better.

When Should You Use Multiple Linear Regression?

Multiple linear regression should be used when multiple independent variables determine the outcome of a single dependent variable. This is often the case when forecasting more complex relationships.

How Do You Interpret Multiple Regression?

A multiple regression formula has multiple slopes (one for each variable) and one y-intercept. It is interpreted the same as a simple linear regression formula—except there are multiple variables that all impact the slope of the relationship.

Regression analysis is a statistical method. There are many different types of regression analysis, including linear regression and multiple regression (among others). Linear regression captures the relationship between two variables—for example, the relationship between the daily change in a company's stock prices and the daily change in trading volume. Multiple linear regression is a more specific (and complex) calculation than simple linear regression. It incorporates multiple independent variables. For example, multiple regression could capture how the daily change in a company's stock price is impacted by the company's P/E ratio, dividends, the prevailing inflation rate, and the daily change in trading volume.


