9 Linear Regression - Exercise

9.1 Linear Regression - Exercise

In this exercise, you will again use the Boston Housing data set to explore the relationship between housing prices and various features of the houses and their surroundings.

For an overview of the data set, see session 3.

Prepare by creating a new R Markdown file. Write all solutions in this file, using code chunks for the code and text (including headers) to structure the file and answer the questions. Do not forget to save the file using the naming format your_name_exercise_2.Rmd. You will have to turn in this file to the lecturers through Moodle. Write a first code chunk in which you load all necessary packages and import the data. For a reminder, see tasks 1 & 2 from the first exercise.
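As a rough orientation, a first chunk could look like the sketch below. It assumes the data is taken from the MASS package, which ships a data frame called Boston containing medv and dis. The object name boston is only illustrative, and the import used in the first exercise may differ, in which case that import should replace these lines.

```r
# Setup chunk sketch.
# Assumption: the Boston data comes from the MASS package; swap in the
# import from the first exercise if the course uses a different source.
library(MASS)

# Illustrative object name, not prescribed by the exercise
boston <- MASS::Boston

# Quick look at the two variables used in task 1
str(boston[, c("medv", "dis")])
summary(boston[, c("medv", "dis")])
```

In an R Markdown file, this code would sit inside an R code chunk near the top of the document so that all later chunks can use the data.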

  • Task 1: We start out by analysing the relationship between the distance to employment centers and the median price of houses. To explore this relationship, fit a simple linear regression model with medv as the dependent variable and dis as the independent variable. Interpret the estimated coefficients and the R-squared value.

  • Task 2: Conduct regression diagnostics on the model from task 1. Identify any potential problems, such as non-linearity, heteroscedasticity, outliers, or multicollinearity, and suggest potential solutions.

  • Task 3: We now explore which other variables may be relevant in estimating the effect of distance to employment centers on the price of a house. Examine the variables present in the data set. You can see them listed in the first exercise. Which variables do you suspect to be relevant? Draw a directed acyclic graph (DAG) using dagitty() to represent the relationships among the variables you deem relevant.

  • Task 4: Use the DAG you constructed above to decide on further variables to include in the model. Briefly describe your decision. Now fit a multiple linear regression model based on this. Interpret the estimated coefficients and the R-squared value. What changed compared to the first model?

  • Task 5: Rerun the regression diagnostics for the second model. What changed? Have some problems been solved? Have new problems emerged? What steps could be undertaken to solve remaining problems?

  • Bonus Task: Implement the changes you suggested in task 5 and rerun the model as well as the diagnostics. Did it help?
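The mechanics behind the tasks above can be sketched as follows. This is a minimal outline of the function calls only, not a model solution. It assumes the data comes from MASS::Boston, and the DAG edges and the choice of age as an additional covariate are hypothetical placeholders that only demonstrate the syntax. Your own DAG should reflect your own reasoning about the variables.

```r
# Assumption: Boston data from the MASS package, as a stand-in for the
# course import; the object name boston is illustrative.
library(MASS)
boston <- MASS::Boston

# Task 1: simple linear regression of median house value on distance
m1 <- lm(medv ~ dis, data = boston)
summary(m1)  # coefficients, standard errors, R-squared

# Task 2: the four default diagnostic plots for an lm object
# (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
par(mfrow = c(2, 2))
plot(m1)
par(mfrow = c(1, 1))

# Task 3: DAG syntax with placeholder edges (hypothetical, not a claim
# about the true causal structure). Skipped if dagitty is not installed.
if (requireNamespace("dagitty", quietly = TRUE)) {
  dag <- dagitty::dagitty("dag {
    dis -> medv
    age -> dis
    age -> medv
  }")
  # Covariate sets that identify the effect of dis on medv under this DAG
  print(dagitty::adjustmentSets(dag, exposure = "dis", outcome = "medv"))
}

# Task 4: multiple regression including the placeholder covariate
m2 <- lm(medv ~ dis + age, data = boston)
summary(m2)

# Compare the R-squared values of the two models
c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)

# Task 5: rerun the same diagnostics for the second model
par(mfrow = c(2, 2))
plot(m2)
par(mfrow = c(1, 1))
```

The same pattern of fitting, summarising, and plotting applies to any revised model in the bonus task. Only the formula or the data passed to lm() changes.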