6 Linear Regression - Theory II: Multiple Linear Regression
Maybe explaining the grade a student receives solely based on the hours of invested time does not paint the whole picture. As we have alluded to, there may be other variables that could affect the relationship between hours and grade. If we fail to include these in our model, we may not get an unbiased estimate for our effect of interest. Maybe the actual effect for hours is even stronger, maybe it is weaker, or maybe there is no effect at all. To assess this, we have to move from simple to multiple linear regression.
6.1 Objectives
- Expand the idea of simple linear regression to multiple linear regression
- Interpret different types of independent variables
- Understand measures of uncertainty
6.2 Multiple Linear Regression
A simple linear regression only allows for one independent variable. This is why we need multiple linear regression if we want to introduce additional variables into the model. Luckily, this is easy to understand, as we already know the formula for a simple linear regression:
\[y = \beta_0 + \beta_1*x_1 + \epsilon\]
To turn a simple into a multiple linear regression, we simply add the additional variables and their coefficients to the formula:
\[y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + ... + \beta_k*x_k + \epsilon\]
So to add a second variable and its coefficient we add the term \(+ \beta_2*x_2\), and so on, until we have added all \(k\) independent variables of interest to the model. Everything else works exactly as for the simple model.
6.2.1 Adding additional metric variables
We already expected that the mean of the previous grades could be a strong predictor for future grades. We can understand these as a proxy variable for the general skill level of a student: the higher the skill level, the better - on this grading scale, the lower - the previous grades will have been.
How we can add additional variables in R code will again be a topic for the next session, but let us look at the results of a regression of grade on hours_centered and previous_grades_centered, the latter being centered on the mean previous grade of \(2.935\).
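The R details follow next session, but for orientation, a call along the following lines produces the output below. The object name model_multi is our own choice; grades is the data frame the output refers to.

# Fit a multiple linear regression of grade on both centered predictors.
# Additional variables are joined with `+` on the right-hand side of the formula.
model_multi <- lm(grade ~ hours_centered + previous_grades_centered, data = grades)
summary(model_multi)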
##
## Call:
## lm(formula = grade ~ hours_centered + previous_grades_centered,
## data = grades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44462 -0.30556 0.00622 0.32878 1.31002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.967500 0.038316 77.449 <2e-16 ***
## hours_centered -0.056543 0.006114 -9.248 <2e-16 ***
## previous_grades_centered 0.904079 0.039830 22.699 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5419 on 197 degrees of freedom
## Multiple R-squared: 0.7492, Adjusted R-squared: 0.7467
## F-statistic: 294.3 on 2 and 197 DF, p-value: < 2.2e-16
As we added a new variable, we now see three coefficients. The intercept has not changed. It now indicates the estimated grade for a student who invests the mean amount of hours, \(40.33\), and whose previous grades are exactly \(2.935\), the mean of the variable.
The coefficient for hours_centered got mildly more negative, still telling us that the value of grade gets lower the more hours are invested in writing the paper. This coefficient now gives us the effect while controlling for the effect of previous_grades_centered. This is what multiple linear regression does: it gives us the coefficients for our variables of interest while keeping all other independent variables at specific values. As we have centered the variable for previous grades, the coefficient for hours_centered gives us the effect when the previous grades were exactly at the mean of \(2.935\).
In the same way, the coefficient for previous_grades_centered gives us the effect of previous grades when the invested hours are controlled for, in this case when the invested hours were exactly \(40.33\). The coefficient is rather large and positive. This indicates that a student with a previous grade value that is \(1\) above the mean is estimated to receive a new grade that is about \(0.9\) points above the intercept. This means that the previous grade is a very strong predictor for the new grade.
While plotting in more than two dimensions gets really hard, we can still calculate \(\hat{y}\) for certain values of both independent variables. We already know the predicted grade for a student with mean values on both independent variables, as this is the intercept. To make sure that we are correct, we can calculate it again.
\[b_0 + b_{hours\_centered}*0 + b_{previous\_grades\_centered}*0 = 2.9675\]
In this case we can see that the previous grade actually is a strong predictor, as the mean previous grade and the predicted new grade are substantially the same.
What about a student whose previous grades were \(1\) above the mean, so just below \(4.0\), who decides to invest \(10\) hours more than the mean in writing the new paper?
\[2.9675 - 0.056543 * 10 + 0.904079 * 1 = 3.306149\]
So the good news is that, while previous grades are a strong predictor, putting in more hours still leads to better grades.
What if a really good student decides to rely on their skill and to work less this time?
\[2.9675 - 0.056543 * (-10) + 0.904079 * (-2) = 1.724772\]
While \(1.7\) is still a very good grade, working \(10\) hours less than the mean of students leads to a substantially worse estimate compared to the roughly \(1.0\) received in previous grades.
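Instead of plugging the numbers in by hand, we could let R do the arithmetic with predict(). A minimal sketch, reusing the hypothetical model_multi object from above:

# Predicted grades for the three scenarios discussed above:
# mean values on both variables, +10 hours with +1 previous grade,
# and -10 hours with -2 previous grades.
scenarios <- data.frame(
  hours_centered           = c(0, 10, -10),
  previous_grades_centered = c(0, 1, -2)
)
predict(model_multi, newdata = scenarios)
# Returns approximately 2.9675, 3.3061 and 1.7248.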
6.2.2 Adding dummy variables
Another variable that could be of interest in explaining the received grade is whether a student attended most of the seminar sessions. attendance holds this information in the form of a dummy variable. Dummies can only have two states: “Yes” or “No”, “1” or “0”, or in this case “TRUE” or “FALSE”.
Let us add the variable to our model.
##
## Call:
## lm(formula = grade ~ hours_centered + previous_grades_centered +
## attendance, data = grades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41059 -0.30910 0.01667 0.35607 1.29849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.157411 0.078658 40.141 < 2e-16 ***
## hours_centered -0.053942 0.006088 -8.860 4.85e-16 ***
## previous_grades_centered 0.911802 0.039282 23.212 < 2e-16 ***
## attendanceTRUE -0.248250 0.090246 -2.751 0.0065 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5331 on 196 degrees of freedom
## Multiple R-squared: 0.7586, Adjusted R-squared: 0.7549
## F-statistic: 205.3 on 3 and 196 DF, p-value: < 2.2e-16
This gives us a new line in the R output holding an estimate for attendanceTRUE. What is meant by this? In contrast to the metric variables we have used in our model up to this point, a dummy variable - or binary variable - can only have two states. As we are using a logical variable here, it can only have the value TRUE - here indicating regular attendance - or FALSE. So what the output shows us is the effect of attendance being TRUE compared to being FALSE. If a student did regularly attend the seminar, the estimated grade is \(0.248250\) points lower compared to when they did not.
We can observe what happens in the formula:
\[\hat{y} = b_0 + b_{hours\_centered}*x_{hours\_centered} + \\ b_{previous\_grades\_centered} * x_{previous\_grades\_centered} +\\ b_{attendance} * x_{attendance}\]
If you calculate with TRUE and FALSE in R, the values \(1\) and \(0\) are used respectively. So \(x_{attendance}\) can either have the value \(1\) for regular attendance or \(0\) for not so regular attendance.
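We can verify this coercion directly in the R console:

# In arithmetic, logical values are coerced to numbers: TRUE becomes 1, FALSE becomes 0.
TRUE * 1                     # 1
FALSE * 1                    # 0
as.numeric(c(TRUE, FALSE))   # 1 0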
If a student did regularly attend, the coefficient \(b_{attendance}\) becomes a part of the estimate \(\hat{y}\):
\[\hat{y} = b_0 + b_{hours\_centered}*x_{hours\_centered} +\\ b_{previous\_grades\_centered}*x_{previous\_grades\_centered} +\\ b_{attendance} * 1\]
If a student did not regularly attend, this happens:
\[\hat{y} = b_0 + b_{hours\_centered}*x_{hours\_centered} \\ + b_{previous\_grades\_centered}*x_{previous\_grades\_centered} +\\ b_{attendance} * 0\]
Which shortens to:
\[\hat{y} = b_0 + b_{hours\_centered}*x_{hours\_centered} +\\ b_{previous\_grades\_centered}*x_{previous\_grades\_centered}\]
The coefficient is no longer a part of the estimate. One could say that the coefficient gets switched on or off by the value of the dummy variable.
So while the estimate for a student with mean values for invested hours and previous grades who did not attend is equal to the intercept of \(3.157411\), for a similar student who attended we can calculate the estimate as:
\[3.157411 - 0.053942*0 + 0.911802*0 - 0.248250 * 1 = 3.157411 - 0.248250 \\ = 2.909161\]
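We can mirror this switching with the estimated coefficients in R. A sketch, assuming the current model is stored in an object we call model_att:

cf <- coef(model_att)  # named vector of the estimated coefficients
# Student at the means who did not attend: just the intercept.
cf["(Intercept)"]
# The same student with attendance TRUE: the dummy coefficient is switched on.
cf["(Intercept)"] + cf["attendanceTRUE"]  # approximately 2.909161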
It seems attending class is an easy way to improve one’s grade.
6.2.3 Adding categorical variables
We have one further variable in our simulated data set that could be of interest in explaining what makes a good grade in a seminar paper. contact is a categorical variable, or a factor variable in R terms. It can take three different categories. No contact indicates that the student did not contact the lecturer to discuss a research question or the laid-out plan for the paper. E-Mail means that there was some written contact and at least the basics for the paper were discussed before writing. Lastly, In Person stands for an in-depth discussion with the lecturer, clearing up problems beforehand and thus potentially having a more stringent vision for the paper before writing the first word.
Let us add the variable to our model.
##
## Call:
## lm(formula = grade ~ hours_centered + previous_grades_centered +
## attendance + contact, data = grades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3835 -0.2525 0.0167 0.2678 0.9347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.617949 0.068077 53.145 < 2e-16 ***
## hours_centered -0.050830 0.004433 -11.466 < 2e-16 ***
## previous_grades_centered 0.874123 0.028657 30.503 < 2e-16 ***
## attendanceTRUE -0.324653 0.065781 -4.935 1.72e-06 ***
## contactE-Mail -0.413808 0.069817 -5.927 1.39e-08 ***
## contactIn Person -0.853252 0.063964 -13.340 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3869 on 194 degrees of freedom
## Multiple R-squared: 0.8741, Adjusted R-squared: 0.8709
## F-statistic: 269.4 on 5 and 194 DF, p-value: < 2.2e-16
Wait, we entered three categories into the model and got estimates for only two of them. What happened? R created two dummy variables on the fly. The first discerns between having E-Mail contact and no contact at all, the second between having contact in person and no contact at all. So for categorical variables in regression models, we always compare being in one of the categories to being in the base category. In this case the base category is No contact, but we could also change the base category, depending on what we are interested in comparing. For our example, comparing the effects of having more in-depth contact to having none makes sense.
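Under the hood, R builds these dummy columns through its model matrix, and the base category of a factor can be changed with relevel(). A sketch for illustration (not meant to alter the models above, which keep No contact as the base):

# Inspect the dummy columns R creates on the fly for the factor:
head(model.matrix(~ contact, data = grades))
# Change the base category, e.g. to compare everything against "In Person":
grades$contact <- relevel(grades$contact, ref = "In Person")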
Let us look at our formula again:
\[\hat{y} = b_0 + b_{hours\_centered}*x_{hours\_centered} +\\ b_{previous\_grades\_centered} * x_{previous\_grades\_centered} +\\ b_{attendance} * x_{attendance} +\\ b_{E-Mail} * x_{E-Mail} + b_{In Person} * x_{In Person}\]
Now there are three possibilities. A student can have no contact at all. In this case both dummy variables are equal to \(0\). To make our formula easier to read, we have abbreviated the middle part for now:
\[\hat{y} = b_0 + ... + b_{E-Mail} * 0 + b_{In Person} * 0\]
So in this case, controlling all other independent variables at their default values - the mean for the metric variables and FALSE for attendance - the intercept gives us the estimate for the grade when the student had no contact with the lecturer, as both dummy variables that were created for contact are “switched off”.
The two other possibilities are that a student either had E-Mail contact or an in person discussion:
\[\hat{y} = b_0 + ... + b_{E-Mail} * 1 + b_{In Person} * 0\]
\[\hat{y} = b_0 + ... + b_{E-Mail} * 0 + b_{In Person} * 1\]
In both cases the relevant dummy variable is “switched on” while the other does not factor into the equation.
Looking at the estimates, we can see that having contact with the lecturer before writing has strong negative effects, resulting in better grades. Having E-Mail contact reduces the value of grade by \(0.413808\) points, and having an in-person discussion reduces it by \(0.853252\) points.
So what grade can a student expect on average, if their previous grades were at the mean of \(2.935\), but they decided to put in \(20\) hours more than their peers, regularly attended the seminar, and had an in-depth personal discussion before writing their paper?
\[3.617949 - 0.050830 * 20 + 0.874123 * 0 - 0.324653 * 1 - 0.413808 * 0 - 0.853252 * 1 \\ = 1.423444\]
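The same calculation via predict(), assuming the model from above is stored as model_full:

new_student <- data.frame(
  hours_centered           = 20,
  previous_grades_centered = 0,
  attendance               = TRUE,
  contact                  = "In Person"
)
predict(model_full, newdata = new_student)  # approximately 1.423444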
Putting in the hours, attending and working with your lecturer seems to pay off, at least in our simulated data set.
6.3 Returning to our research question
Our exemplary research question concerned itself with what makes a good grade in a seminar paper. In particular we were interested in the effect of the invested hours, as our main hypothesis was that more hours lead to better grades. What do we know now?
Our analysis points towards a clear effect from hours on grade. This effect was consistently visible in all of our models. But did we correctly identify and estimate the effect of interest? Maybe. The problem is, we actually did not approach the analysis correctly. In a real analysis we should absolutely refrain from simply adding every variable that could be relevant to our model until we are satisfied, or even bunching all available variables into one huge model. It was fine to do this in this introduction to linear regression, to learn how different types of variables can be used in a regression model. But in a real project, we have to invest time to think about which variables to add, because we assume that they have a relevant effect based on theoretical assumptions about the processes we are interested in.
So let us do this now and vow to make this our first step in all future endeavors. While we do not have a clear theoretical basis, we can make clear assumptions about the data generating process and draw these in a DAG.
Our central assumption, and the effect we want to identify and estimate, is the direct effect from hours on grade in the bottom line. The more hours a student invests, the better the grade should be.
The assumed effect of contact is more complex. For one, we assume that more in-depth contact with the lecturer will improve the grade directly. The research question will be more focused, the student will know what is important to a certain lecturer, common mistakes can be avoided if they are cleared up beforehand, and so on. But we will also assume that contact has an effect on hours, in the sense that the hours invested can be used more efficiently if an in-depth discussion has taken place. Instead of wasting time running into problems that could have been avoided, most of the invested time can actually go into constructive work. This makes contact a confounder for grade and hours.
A student’s skill level will also have a direct effect on grade. As we do not have a direct measure of skill in our data, we use previous_grades as a proxy for skill level. attendance is also assumed to have a direct effect on grade, as students who were present in the seminar will not only have learned the seminar’s contents, but will also have a better understanding of what is expected in their seminar papers.
Tapping into the knowledge from session 4, we can now derive implications for our model from the DAG. Let us list all paths from hours to grade:
\[A: Hours \rightarrow Grade\]
\[B: Hours \leftarrow Contact \rightarrow Grade\]
Path A represents our effect of interest. On path B, contact is a confounder for hours and grade. To close this path, we have to control for contact.
As neither skill level - or previous_grades - nor attendance lies on a path from our independent variable of interest to our dependent variable, we should not control for them in our model.
That leaves us with hours and contact to be included in our linear regression, if our goal is to get an unbiased estimate for the effect of invested time on the final grade. So let us do this:
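In code, the DAG-informed model is again a single call (a sketch; the object name is our own):

model_dag <- lm(grade ~ hours_centered + contact, data = grades)
summary(model_dag)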
##
## Call:
## lm(formula = grade ~ hours_centered + contact, data = grades)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85595 -0.74624 -0.02106 0.66648 2.50161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.44352 0.10404 33.098 < 2e-16 ***
## hours_centered -0.04967 0.01052 -4.723 4.43e-06 ***
## contactE-Mail -0.46482 0.16785 -2.769 0.00616 **
## contactIn Person -1.02804 0.15240 -6.746 1.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9305 on 196 degrees of freedom
## Multiple R-squared: 0.2643, Adjusted R-squared: 0.253
## F-statistic: 23.47 on 3 and 196 DF, p-value: 5.072e-13
This is our estimate. Each hour invested beyond the mean of \(40.33\) hours changes the grade by about \(-0.05\) points. This supports our hypothesis, and we can conclude that investing more hours into writing a seminar paper actually is a worthwhile investment.
But remember: this is correct only as long as our DAG is drawn correctly, which is always debatable. Maybe we should assume an effect from skill level on hours: the higher the skill level, the more efficiently the available time can be used. For this example we know the DAG is correct, because we have simulated the data exactly in this way. For real-world applications we never know if our DAG is correct. All we can and have to do is base it on thorough thinking, theoretical work, exploratory data analysis and sound arguments.
All this is true if our goal is to estimate an effect of interest as precisely as possible. But as we have alluded to in the introduction to this session, we could also use modelling with a different goal, i.e. predicting a grade as accurately as possible. For this task, the model which only includes hours and contact will not do the best job. From our DAG we know that attendance and previous_grades should have an effect on grade, as we have also seen in our models. For this task, the full model including all these variables will produce better estimates. We will return to this in a later session, but for now we should remember that we have to know our task, because the task dictates which model is the best to use.
6.4 Addressing the uncertainty
Looking at the coefficient block from the output, we see more than just our estimates. The Std. Error - standard error - is a measure for the uncertainty of our estimates. It tells us, roughly, how much our estimate would be expected to vary on average if we drew new samples and re-estimated the model. The smaller the standard error, the more precise our estimate. The standard error is presented in the units of the estimate, and we can thus compare them. A large standard error for a large estimate is far less problematic than a large standard error for a small estimate.
The estimate and its standard error are the basis for hypothesis testing. What we are testing is the alternative hypothesis \(H_a\) that there actually is an effect of our independent variable on the dependent variable, against the null hypothesis \(H_0\) that there is no effect. To reject the null hypothesis and be confident that we are observing an actual effect, rather than an effect that is just based on random variation in our sample, the estimate has to be far enough away from \(0\) and precise enough, i.e. have a small standard error.
This relationship is captured by the t-statistic - the estimate divided by its standard error - shown as t value in our output. From this, the p-value can be computed, shown as Pr(>|t|) in the output.
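Both values are easy to reproduce by hand. Using the rounded numbers from the hours_centered row of the last output, so the results only match approximately:

estimate  <- -0.04967
std_error <-  0.01052
t_value   <- estimate / std_error        # approximately -4.72
# Two-sided p-value from the t-distribution with the model's
# residual degrees of freedom (196 in this model):
p_value <- 2 * pt(-abs(t_value), df = 196)
p_value                                  # approximately 4.4e-06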
The p-value tells us the probability of observing an association between the independent and the dependent variable as large as or larger than our estimate suggests, if the true association were actually \(0\). If the p-value is small enough, we can reject \(H_0\) and conclude that we observed an actual effect. There are certain agreed-upon cutoffs in statistics; values that meet these cutoffs are considered statistically significant. The most common cutoff in the social sciences is \(0.05\), indicated by one * in the output.
Other common cutoffs are indicated by more asterisks. It is important to note that we cannot say that more * behind an effect mean that this effect has a higher statistical significance. We have to decide on a level of statistical significance that we want to test for before we see the results. If we decide on testing for a significance level of \(5\%\), seeing more than one star then simply tells us that the effect is also statistically significant at the \(5\%\) level - nothing more.
Interpreting p-values correctly and not falling into common pitfalls like the one mentioned above is a topic of its own. We do not have the time to dive into this here, so for now we can agree that p-values below \(0.05\) indicate that we can reject \(H_0\) and thus conclude that we have actually observed an effect.
Still, our interpretation of regression results should not focus solely on p-values or lead us to disregard any effects that did not meet the cutoff. For example, we can have very small p-values for effects that are so small that they are substantially irrelevant. One way to address this is to inspect the actual magnitudes of the effects. On the other hand, we can have p-values larger than \(0.05\) for effects that are still relevant. Maybe the problem is not that there is no effect, but that we were not able to measure the variable in question precisely enough, or that we just did not have enough observations. We cannot go any deeper than this here, but we should remember that the practice of declaring every effect with stars a win - or even every effect with more stars a bigger win - while disregarding everything without them may still be common, but it is not the way to go forward.
In our model, we can see that the effect of interest is statistically significant. We can thus conclude that we have observed an actual effect from hours on grade. Our estimate is large enough and our standard error small enough to reach this conclusion.
6.5 Moving on
We have attained an estimate for our effect of interest which supports our hypothesis that investing more hours into writing a paper leads to better grades. So can we put a bow on the question and move on to finally figuring out what is going on in our NBA data? Almost, but not yet. We still do not know if our model actually works as intended. Linear regression, like every other modelling technique, has some underlying statistical assumptions that we have to meet for the model to accurately estimate an effect. In the next session we will get to know these assumptions and how we can test for them.