3 Exploratory Data Analysis - Exercise

This week we will apply the concepts from last week in an exercise where you will conduct an exploratory data analysis on a different data set. To document your work on these exercises, you will use R Markdown.

3.1 What is R Markdown?

R Markdown allows you to combine easily formatted written text with R code that is executed when the document is knitted, i.e. compiled into the chosen output format. This allows us to describe our research, analyse our data, display results as tables or plots and interpret them, all in one file. In this way we can not only create reports on seminar exercises but also write websites (like the one you are looking at right now), seminar papers and articles, or create presentations.

It is also a great notebook for projects you are working on. More often than not, our work on a specific analysis will span multiple days, weeks or even months and it is often hard to remember what we were thinking the last time we worked on our code.

“I am sure I had my reasons for writing this piece of code, but I can not for the life of me remember any of them…”

— Anonymous Coder 2023

If we use R Markdown to document our work we can add text that explains our reasons, thoughts, ideas and plans at that very moment and pick up our work from there the next time we open the file.

R Markdown can produce output in different file formats, including html, docx, pptx and pdf. Note that you need a LaTeX installation to knit to pdf. LaTeX is a typesetting system used for producing high-quality pdf documents. For simple pdf reports or presentations (side note: if you bring a pptx to a talk, something will most probably go wrong or stop working…) you do not really need to know how LaTeX works, you just need an installed distribution. For these purposes the package tinytex gives you all you need and is easy to install from within R. This site explains how to install it. You can also get an overview of all possible output formats here.
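As a rough sketch, installing a LaTeX distribution through tinytex usually takes two steps in the R console. This assumes a working internet connection and write access to your user directory; the download can take a few minutes.

```r
# One-time setup: run this in the R console, not inside a document you knit.
install.packages("tinytex")   # installs the R package tinytex
tinytex::install_tinytex()    # downloads and installs the TinyTeX LaTeX distribution

# Afterwards you can check whether the TinyTeX distribution is the one in use
tinytex::is_tinytex()
```

You may have to restart RStudio before knitting to pdf works.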

3.2 Creating an R Markdown file

Before you can create and compile R Markdown documents, you first have to install the package by writing install.packages("rmarkdown") in your console.

Creating a new R Markdown file is straightforward. In RStudio you can click on File > New File > R Markdown.... In the new window you can set up some basic information on the document, which will be displayed in the output, and choose your desired format. You can write R Markdown files in any text editor; just make sure that the file is saved with the extension .Rmd. We still recommend using RStudio because it gives you some convenient options that a plain text editor will not, e.g. displaying a preview of your document and easy knitting of the final file.
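Knitting does not strictly require the RStudio button: the rmarkdown package provides the function render() for the same purpose. The following is a minimal sketch, assuming that rmarkdown is installed and that pandoc is available (RStudio ships with a copy). The file content and the temporary file name are made up for illustration.

```r
library(rmarkdown)

# Three backticks, built programmatically so they can be written into the file
fence <- strrep("`", 3)

# Write a tiny R Markdown document to a temporary file (illustrative content)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "Render test"',
  "output: html_document",
  "---",
  "",
  "Some text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_file)

# Knit the file; render() returns the path of the output file
out_file <- render(rmd_file, quiet = TRUE)
file.exists(out_file)
```

For your own documents, you would pass the path of your .Rmd file to render() instead of the temporary file.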

3.3 Writing in R Markdown

3.3.1 Document components

If you followed the steps above, a new R Markdown file has been created. It consists of two main parts:

  • A YAML header, surrounded by three dashes ---, where options for the document can be set. The good news is that you do not have to change anything here until you become more proficient with R Markdown. For now it is enough that all the options you set when creating the new file (title, author, date and output format) are present and will be included in your output file.

  • A body that contains the actual content of your document. Text is written directly in the body at the location where it is to be displayed in the output. We can use the simple Markdown syntax for formatting, which relies on a set of symbols, some of which we will explain below. We can also include R code in so-called chunks, specifying whether the code and/or its results should be displayed or whether the code should “just” be executed in the background. The code chunks are executed when we compile the final document, and everything that we want to include in the output, e.g. tables, plots or code examples, is displayed where it occurs.
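To make the two parts concrete, here is a minimal sketch of what a complete .Rmd file could look like. The title, author and date are placeholders, and cars is a small data set that ships with R.

````markdown
---
title: "Exercise 1"
author: "Your Name"
date: "2023-10-01"
output: html_document
---

# A section header

Text written here appears in the output as regular text.

```{r}
# R code in a chunk is executed when the document is knitted
summary(cars)
```
````

Everything between the two --- lines is the YAML header; everything after it is the body.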

3.3.2 Formatting

Here are some of the more common formatting elements you will need when starting out using R Markdown:

3.3.2.1 Headers

To include sections in a document we use # followed by the header we want to be displayed. We can define levels for sections by using multiple # in this way:

# Section 1
## Section 1.1
## Section 1.2
### Section 1.2.1
### Section 1.2.2
## Section 1.3
# Section 2

3.3.2.2 Text

We write the text between the section headers at the place where it should be displayed in the final document. We can insert single line breaks at any point, but these will not be rendered in the output. To start a new paragraph we have to include a blank line between two blocks of text. To emphasize certain words or phrases, we can wrap them in * for italics or ** for bold face.

Consider this Markdown code:

This is the first paragraph.
This still is the first paragraph.

Here begins the second paragraph.
It includes emphasis, by using *italics* and also **bold face** words.

It is rendered as:

This is the first paragraph. This still is the first paragraph.

Here begins the second paragraph. It includes emphasis, by using italics and also bold face words.

3.3.2.3 Lists

Unordered lists or bullet points can be inserted by adding a -, * or + at the beginning of a line. To create levels, we have to indent lines using tab stops.

* Level 1
* Level 1
  * Level 2
    * Level 3
  * Level 2
    * Level 3
* Level 1

The above will be rendered as:

  • Level 1
  • Level 1
    • Level 2
      • Level 3
    • Level 2
      • Level 3
  • Level 1

We can also create ordered lists by using numbers followed by a . instead of -, * or +.

1.  Bulletpoint 1
2.  Bulletpoint 2
3.  Bulletpoint 3
    1. Bulletpoint 1
    2. Bulletpoint 2
    3. Bulletpoint 3

3.3.3 Code chunks

Code chunks have to be started and ended with three backticks ```. After the first set of backticks we also have to include {r} to let R Markdown know that we want to run the code as R code. The code that is written after this and up to the second set of backticks will be executed when knitting the file.

You can see some examples of this in the newly created R Markdown file if you followed the steps above.

We can also run the code in a chunk at any time before knitting by clicking the green arrow in the upper right corner of the chunk. To execute an individual line of code, place the keyboard cursor in that line and press Ctrl + Enter (Cmd + Return on macOS).

3.3.3.1 Chunk options

We can change the way code chunks are handled when knitting by adding one or multiple chunk options between the curly brackets like this: {r option=value}. If we want to use multiple options they have to be written like this: {r option1=value1, option2=value2}.

There are many options available but most are not needed when starting out. The ones that may be of interest to you are:

  • {r echo=FALSE}: This prevents the code from being displayed in the output, while the results are still included. This is useful if you want to show the results of a computation or a plot but do not want the document to be cluttered with the underlying code.
  • {r include=FALSE}: This prevents the code as well as the output from being displayed. The code is still run in the background.
  • {r eval=FALSE}: This prevents the code from being run but displays it. This can be useful if you want to show code examples for illustrative purposes.
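As an illustration, the three options could be used in the body of an .Rmd file as sketched below. The code inside the chunks is only an example and uses the cars data set that ships with R.

````markdown
```{r echo=FALSE}
# The code is hidden, the plot is shown
plot(cars)
```

```{r include=FALSE}
# Neither code nor output is shown, but mean_speed can be used in later chunks
mean_speed <- mean(cars$speed)
```

```{r eval=FALSE}
# The code is shown but not run
install.packages("mlbench")
```
````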

3.4 Further resources

3.5 EDA - Exercise

In this and all following exercises, we will work with the Boston Housing Data. The data set contains information on 506 census tracts in the US city of Boston from the 1970 census. It contains information on the median value of owner-occupied homes and additional statistics on houses and socio-demographics for each tract.

This is an overview of all variables included in the data set:

1. crim - per capita crime rate by town
2. zn - proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus - proportion of non-retail business acres per town.
4. chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. nox - nitric oxides concentration (parts per 10 million)
6. rm - average number of rooms per dwelling
7. age - proportion of owner-occupied units built prior to 1940
8. dis - weighted distances to five Boston employment centres
9. rad - index of accessibility to radial highways
10. tax - full-value property-tax rate per $10,000
11. ptratio - pupil-teacher ratio by town
12. b - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. lstat - % of lower status of the population
14. medv - Median value of owner-occupied homes in $1000's

For more information on the data set and its implementation in mlbench please follow this link.

  1. Prepare your R Markdown File: Create a new R Markdown file setting the output format to html or pdf (if you have a LaTeX distribution installed) and setting the title, author and date to appropriate values. Remove the sample text and code chunks taking care to not remove the YAML header. Write all solutions in this file using code chunks as well as text to structure the file (headers) and answer the questions. Do not forget to save the file in the format your_name_exercise_1.Rmd. You will have to turn in this file to the lecturers through Moodle.

  2. Import the dataset: Import the data set by using the code below. The package mlbench conveniently includes the data set. You have to install the package beforehand by writing: install.packages("mlbench"). You should also load the tidyverse package to ease the further analysis. Note that this code also has to be included in your R Markdown file.

library(mlbench)
library(tidyverse)

data(BostonHousing)

  3. Inspect the dataset: Use the proper functions to inspect the structure and contents of the data set. How many categorical variables are there? How many numerical variables are there? Are there any missing values?

  4. Summarize categorical variables: Create a frequency table for the categorical variable chas (Charles River dummy variable).

  5. Summarize numerical variables: Generate summary statistics for the numerical variables crim (per capita crime rate by town), rm (average number of rooms per dwelling) and dis (weighted distances to five Boston employment centres).

  6. Create new variables: Create a new variable distance_group that groups houses by their proximity to employment centres as indicated in the variable dis. Create the three categories Near, Medium and Far based on the value of dis. Choose appropriate cutoff values based on the summary statistics you computed above. Briefly describe your decision.

  7. Compare distributions: Compare the distribution of ptratio across the groups of the newly created variable distance_group using boxplots. Briefly interpret the results.

  8. Visualize relationships: Create scatter plots with an added regression line to explore the relationship between housing prices (medv) and the numeric variables rm, age, dis and lstat. Briefly interpret the plots.

  9. Correlation analysis: Calculate correlation coefficients between housing prices (medv), the average number of rooms per dwelling (rm) and the crime rate (crim). Display the correlations as a matrix or as a plot. Interpret the result.
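If you are unsure which functions fit the tasks above, the following sketch shows the same kinds of steps on mtcars, a different data set that ships with R, so the exercise itself stays open. It uses base R only. The tercile cutoffs and the group labels are just one possible choice, and you are free to use the tidyverse equivalents instead.

```r
# Illustration on the built-in mtcars data, NOT on BostonHousing
data(mtcars)

# Structure of the data and number of missing values
str(mtcars)
n_missing <- sum(is.na(mtcars))

# Frequency table for a categorical (0/1) variable
am_counts <- table(mtcars$am)

# Summary statistics for numeric variables
summary(mtcars[, c("mpg", "wt")])

# Group a numeric variable into three ordered categories.
# Here the cutoffs are the terciles of wt; other cutoffs may be more sensible.
cutoffs <- quantile(mtcars$wt, probs = c(1 / 3, 2 / 3))
mtcars$wt_group <- cut(
  mtcars$wt,
  breaks = c(-Inf, cutoffs, Inf),
  labels = c("Light", "Medium", "Heavy")
)

# Boxplots of one variable by group
boxplot(mpg ~ wt_group, data = mtcars)

# Scatter plot with a linear regression line
plot(mpg ~ wt, data = mtcars)
abline(lm(mpg ~ wt, data = mtcars))

# Correlation matrix for a few numeric variables
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
cor_mat
```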