12 Prediction & Machine Learning - Exercise
12.1 Prediction & Machine Learning - Exercise
In this exercise, you will again use the Boston Housing data set. This time the goal is to make accurate predictions for the median value of houses based on various features of the houses and their surroundings.
For an overview of the data set, see session 3.
Prepare by creating a new R Markdown file. Write all solutions in this file, using code chunks as well as text to structure the file (headers) and answer the questions. Do not forget to save the file in the format your_name_exercise_3.Rmd. You will have to turn in this file to the lecturers through Moodle. Write a first code chunk in which you load all necessary packages and import the data. For a reminder, see tasks 1 & 2 from the first exercise.
Task 1: Rerun your best linear regression model from exercise 2 with medv as the outcome variable.
Task 2: Use augment() from the broom package to add the predicted value for each observation.
Task 3: Calculate the \(R^2\) and RMSE values for your model. What do these values tell you about the quality of your predictions?
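As a rough orientation, tasks 1 to 3 could be approached along the following lines. This is only a sketch under assumptions: the data is taken from the MASS package as a stand-in for the file imported in the course, and the formula medv ~ lstat + rm is a placeholder, not the model you built in exercise 2. The object names fit, boston_aug and rmse_manual are illustrative.

```r
# Sketch for tasks 1 to 3; data source and model formula are assumptions.
library(broom)      # augment()
library(yardstick)  # rsq(), rmse()

# Assumption: the MASS copy of the data stands in for the course file.
data("Boston", package = "MASS")

# Placeholder model; substitute your best formula from exercise 2.
fit <- lm(medv ~ lstat + rm, data = Boston)

# augment() appends .fitted (the predicted value) and .resid to each row.
boston_aug <- augment(fit, data = Boston)

# R^2 and RMSE computed with yardstick on observed vs. predicted values.
rsq(boston_aug, truth = medv, estimate = .fitted)
rmse(boston_aug, truth = medv, estimate = .fitted)

# RMSE by hand: square root of the mean squared prediction error.
rmse_manual <- sqrt(mean((boston_aug$medv - boston_aug$.fitted)^2))
```

Because the RMSE is expressed in the units of medv, it can be read as a typical size of the prediction error. Note that both metrics are computed here on the same data the model was fitted to, so they are likely to be optimistic.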
Task 4: Prepare the data set for machine learning with tidymodels by dividing the data into a training and a test set. Then set up folds for a k-fold cross-validation.
Task 5: Think about additional variables that could be useful for making accurate predictions of the median value of a house. Define a recipe() containing your expanded model formula.
Task 6: Prepare a random forest algorithm using the default values for all hyperparameters, i.e. just using rand_forest() without any arguments. Set the engine to ranger and the mode to "regression". Combine the recipe and the prepared algorithm into a workflow object.
Task 7: Run the model using fit_resamples(). Inspect the performance metrics. Compare them to the metrics from task 3. Can the new model make better predictions?
Bonus task: Tune the min_n hyperparameter for your random forest model. Inspect the results and select a sensible value. Use this for your last fit. What do the performance metrics tell you? Were you able to improve on the results from task 7? Did you overfit?
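The tidymodels steps in tasks 4 to 7 and the bonus task might be wired together roughly as follows. This is a minimal sketch, not a reference solution. It makes several assumptions: the data comes from the MASS package, the seed, the 80/20 split and the 10 folds are arbitrary choices, the predictors in the recipe are illustrative, and the candidate values for min_n are picked by hand. Using select_best() is only one way to choose a value; inspecting the full table of metrics is equally valid.

```r
# Sketch for tasks 4 to 7 and the bonus task; all concrete values are assumptions.
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick

set.seed(123)                     # arbitrary seed for a reproducible split
data("Boston", package = "MASS")  # assumption: stand-in for the course file

# Task 4: training/test split and folds for k-fold cross-validation.
boston_split <- initial_split(Boston, prop = 0.8)
boston_train <- training(boston_split)
folds <- vfold_cv(boston_train, v = 10)

# Task 5: recipe with an illustrative, expanded model formula.
boston_rec <- recipe(medv ~ lstat + rm + ptratio + crim + nox,
                     data = boston_train)

# Task 6: random forest with default hyperparameters, combined into a workflow.
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_wf <- workflow() %>%
  add_recipe(boston_rec) %>%
  add_model(rf_spec)

# Task 7: fit on the resamples and collect RMSE and R^2 averaged over folds.
rf_res <- fit_resamples(rf_wf, resamples = folds)
rf_metrics <- collect_metrics(rf_res)

# Bonus: mark min_n for tuning and try a small hand-picked grid.
rf_tune_spec <- rand_forest(min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_tune_wf <- update_model(rf_wf, rf_tune_spec)
tune_res <- tune_grid(rf_tune_wf, resamples = folds,
                      grid = data.frame(min_n = c(2, 5, 10, 20, 40)))
tune_metrics <- collect_metrics(tune_res)

# Pick the value with the lowest cross-validated RMSE and fit one last time:
# last_fit() trains on the training set and evaluates on the test set.
best_min_n <- select_best(tune_res, metric = "rmse")
final_wf <- finalize_workflow(rf_tune_wf, best_min_n)
final_fit <- last_fit(final_wf, split = boston_split)
final_metrics <- collect_metrics(final_fit)
```

Comparing final_metrics (test set) with tune_metrics (cross-validation) gives a hint about overfitting: a test RMSE that is clearly worse than the cross-validated RMSE suggests the model does not generalise as well as the resampling results implied.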