2 Week 2: Getting started
2.1 Installing R and loading/exploring your dataset
Objective of this session:
Get you started with R
Load your first dataset in R
Explain some basic terminology and concepts
Explain how to structure any data analysis project
Learn how to run commands and save scripts
R commands covered in this session:
library()
install.packages()
read_excel()
read_xlsx()
read_csv()
commenting
head()
glimpse()
str()
summary()
table()
tabyl()
load()
save()
search()
setwd()
pipe operator (%>%)
2.2 Install R
It is likely that R is already installed on your computer if you are using one at the university. However, I still recommend installing R on your personal or work laptop for future use.
You need two things: R, the programming language (base R), and RStudio, the integrated development environment in which you write and run R code. Don’t worry about the distinction for now. It just means you have to install both R and RStudio on your computer.
To install R / RStudio, watch this video or follow these steps described here. Make sure you choose the right version depending on your laptop (Windows vs. Mac; 32-bit vs. 64-bit).
2.3 First steps
When everything is installed, run RStudio. It should look like this:
Now, click “File,” followed by “New File” and then “R script.” The R script is your most important piece here. Rather than conducting data analysis through a “clickable” user interface (as Stata and SPSS offer), R requires “code” to run the analysis step-by-step. The R script can save your code, so do not forget to save the script as often as possible. It is basically the documentation of what you tell R to do.
Scripts have several advantages over other “clickable” interfaces: 1) They are “replicable,” i.e. others can re-do what you did. 2) If your analysis requires several steps, you don’t have to do all the steps manually over and over again. Once the script is complete, you can re-run it as many times as you like. 3) It is easier to adapt and adjust your analysis later on.
When your script is open, R should roughly look like this:
There are 4 different parts to the R user interface:
Script
We will use the script to write code. This is the only window that we will actively change and work in. You “run” or execute a line of code with Ctrl + Enter or by using the Run button in the toolbar right above your script. This runs the line your cursor is on; to run several lines at once, select them first and then run them. Have a look here for more info on that!
Environment
On the upper-right hand corner, you can see the “environment.” When we later save different datasets and variables, they will be listed here. Don’t worry about the content of it for now.
Plots
On the bottom right-hand corner, you have different sheets or tabs.
“Plots” will display any figures and plots that we produce. When you run code in the script part that produces a visual, it will be displayed here in “plots.”
Another relevant sheet/tab here is “files.” If you create an R project (more on that later), then the “files” tab will show you the files currently saved in the working folder on your computer.
You also have a “help” sheet (which is open in the visual above). You can either search for functions/commands using the search field or run help(command) or ?command in the script or console. The help window will open a page explaining the command or function of interest.
Console
The console shows which lines of code were executed and the output/results from your analysis (except plots, which appear in the “Plots” tab on the bottom right). This is also where warning or error messages may appear, allowing you to figure out what went wrong and/or copy and paste them into Google. Don’t get intimidated by the red font; the messages can be quite informative! You will see very soon. Basically, don’t touch the console. You can type commands there and run them, but they are not stored in your script, and there is rarely a reason to run a command only once. Stick to the script.
For review, here is a video on introducing the R interface to new users.
Tip: The look of RStudio can be adjusted in the settings, so the user interface might look a bit different when you watch tutorials. However, all RStudio interfaces have the same number and types of windows: console, plots, environment and script. Cool kids sometimes change the background color or the color of code. You can do that under Tools -> Global Options -> Appearance.
2.4 Loading & saving
Alright, let’s get some data into R. This process is also called “loading” data, “importing” data or “reading” data. Basically they are all the same. Before we can do that, we need to introduce two very important commands: install.packages() and library().
Before you can do (most) things in R, you need to first install and load packages that come with a library of handy functions/operations. Imagine R is your pen. Packages are your ink. You first need to get your ink (i.e. install the package) and then put the ink in the pen (load the package) before you can start writing (execute a command). Base R (raw R without the packages) might provide you with some standard ink but other packages can provide more valuable and interesting ink. We are going to apply this to our first problem: How do we get data into R?
For that objective, we will require our first package: readxl. This package has been developed and graciously shared by Hadley Wickham and Jennifer Bryan. They have used base R to build functions that make it very easy for us to import Excel data into R.
The dataset that we will be using in this class is data on university students, so basically you. Hopefully, this will make things easier to relate to. The data is made up, so no data privacy concerns here. We are providing the data in the form of an Excel file (.xlsx) here.
Most datasets are available in some kind of spreadsheet format, often .csv, .xls or .xlsx. Sometimes you may have data from other programs (Stata = .dta; SPSS = .sav).
To load an xlsx file, we need the readxl package.
Type install.packages("tidyverse") or install.packages("readxl"). tidyverse is a collection of very useful packages. When you install it, readxl will also be installed. However, you could also just install readxl individually.
To execute a command you can press “ctrl + enter” or click on the “run” button on the top center.
Then type
# load package
library(tidyverse)
Tip: You can use the “#” sign to write a comment in the script. It is a good habit to note down the steps and your thought process, because it will make it much easier to remember what you intended with each step. Feel free to comment excessively, as if you were writing a (data) journal.
Now, the command is ready to be executed:
students <- readxl::read_excel("data/students.xlsx")
To import a .csv file, you do not need to install a package. The function read.csv() already comes with base R:
students <- read.csv("data/students.csv")
When you import data into R, you need to give the dataset a name. That name is the label of the object. Anything can be an object. An object is basically a container where something is stored. It could be data, a list of things, a number or variables. Here, we have created an object with the label “students.” <- is the assignment operator, by which we assign the dataset to the variable “students.” R then knows that every time you use the name, you are referring to the previously loaded dataset. This calls for caution: unless you want to overwrite it, do not use the same name twice! You also need to tell the function where the Excel file is located. In my case, the file is saved in the directory “data.” You would need to change this and include the path to the folder where you saved the file.
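A minimal sketch of how assignment and overwriting behave. The object name and values are made up for illustration and are not part of the course data.

```r
# assign the number 5 to an object called my_number (hypothetical name)
my_number <- 5
my_number              # typing the name prints the stored value

# using the same name again overwrites the old content without any warning
my_number <- "five"
class(my_number)       # the object now holds text, not a number
```

After the second assignment the original value 5 is gone, which is why re-using a name deserves some care.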
There is a more convenient way to specify the location of files on your computer using setwd() (i.e. “set working directory”). You define it once and then R knows where your files are. Then, you only need to put the name of the files in the relevant command.
setwd("your path here")
students <- readxl::read_excel("yourfilename.xlsx")
In our command line (code), we are telling R to read the excel file and then store it in the object “students.” R usually automatically recognizes what you want to store in an object. In this case, we want to store a dataset. R calls datasets data frames.
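The following sketch shows how the working directory changes where R looks for files. It uses a temporary folder and a made-up file name (demo.csv) so that it runs on any computer; with your own data you would point setwd() to your project folder instead.

```r
old_wd   <- getwd()                    # remember where we started
demo_dir <- tempdir()                  # a temporary folder that always exists

# write a small made-up file into that folder
write.csv(data.frame(id = 1:3), file.path(demo_dir, "demo.csv"), row.names = FALSE)

setwd(demo_dir)                        # from now on, R looks in this folder
found <- file.exists("demo.csv")       # the file name alone is enough
demo  <- read.csv("demo.csv")

setwd(old_wd)                          # switch back to the original folder
```

getwd() tells you at any time which folder R is currently using.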
You have successfully loaded a data frame into R. Imagine you’ve done some manipulations to the data frame. It is time to save your new output as “.RData” as follows:
save(students, file = "path/to/folder/students.RData")
Here, you first say which loaded object you would like to save and then specify the file name in “….” You can load an “.RData” file using load(). Note that you do not need to specify a name here: load("data/students.RData") will load the “students.RData” file from the “data” folder and add it to your environment under its given name “students” automatically.
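To see the round trip in action, here is a small sketch with made-up data. The object name quiz_scores and the temporary file location are assumptions chosen only so that the example runs anywhere.

```r
quiz_scores <- data.frame(name = c("A", "B"), points = c(10, 12))  # made-up data
rdata_file  <- file.path(tempdir(), "quiz_scores.RData")

save(quiz_scores, file = rdata_file)   # write the object to disk
rm(quiz_scores)                        # remove it from the environment

loaded_names <- load(rdata_file)       # restores the object under its original name
loaded_names                           # the names of the restored objects
quiz_scores                            # the data frame is back
```

Because load() brings back the original name, it silently replaces any object of the same name that is already in your environment.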
Alternatively, you can also save it as a “.xlsx” or “.csv” file:
# install once, then load the package that provides write.xlsx()
install.packages("openxlsx")
library(openxlsx)
write.xlsx(students, file = "path/to/folder/students.xlsx")
# no need for a package here as this function is built in, i.e. already included
write.csv(students, "path/to/folder/students.csv")
Important: For some sadistic reason, R does not accept single backward slashes (“\”) in file paths; use forward slashes (“/”) instead (doubled backward slashes, “\\”, also work). When you copy a location from your folder directory, the slashes are often backward. So make sure you change them. Otherwise, R won’t read the location correctly and cannot find your data.
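If you do not want to change the slashes by hand, R can do it for you. A minimal sketch, assuming a hypothetical Windows path; note that inside R text every backslash has to be typed twice.

```r
# a path as copied from a Windows folder (hypothetical location)
windows_path <- "C:\\Users\\me\\data\\students.xlsx"

# replace every backslash with a forward slash
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)
r_path
```

The result can then be passed to functions such as read_excel() or setwd().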
Tip: type in library() to see all packages that are currently installed. Type in search() to see packages currently loaded. You only have to install packages once, but you need to load them once in every new session.
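The difference between “installed” and “loaded” can be checked from code. This is only a sketch: it uses the tools package, which ships with R, as a stand-in for loading, and it does not install anything itself.

```r
# requireNamespace() returns TRUE if a package is installed, without attaching it
has_readxl <- requireNamespace("readxl", quietly = TRUE)

# remind yourself to install only when the package is missing
if (!has_readxl) {
  message("readxl is not installed yet; run install.packages(\"readxl\") once")
}

# loading has to be repeated in every session; 'tools' ships with R
library(tools)
"package:tools" %in% search()   # TRUE once the package is loaded
```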
Additional resources on importing data:
For information on importing Stata, SPSS or csv files, please click here.
2.5 Explore your dataset
The most important step before running any analysis or producing fancy visuals is to understand what kind of data you are dealing with. R will help you run the analysis and make your graphs look pretty, but it cannot help you decide what you should and should not be doing with your data. Exploring your data is a key first step.
Tip: In the meantime, save your script again! Look for “Save as” in the “File” menu, give it a name and save it in your working folder. Make sure you regularly save your script. A useful shortcut is “ctrl + s.” You absolutely do not want to start from the beginning in case your computer crashes or freezes. Trust me.
So, let’s see what we are dealing with:
head() gives you a sneak preview of what the first 6 rows of your dataset look like. str() gives you the “structure” of the dataset. A “tibble” is the tidyverse’s version of a data frame: the data is organized in structured, clear rows and columns. “[1,000 × 23]” means that the dataset contains 1000 rows and 23 columns. Commonly, in social sciences, rows are referred to as “observations” and columns as “variables.” In our case, there are 1000 observations of 23 variables.
# look at the first 6 rows of the dataframe
head(students)
## # A tibble: 6 × 23
## faculty course age cob gpa_2010 gpa_2011 gpa_2012 gpa_2013 gpa_2014
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Business accounti… 26.3 Spain 3.1 2.8 2.8 2.3 0.5
## 2 Business accounti… 28.8 Netherl… 2.7 1.2 3.8 1.8 3
## 3 Business accounti… 23.9 Netherl… 1.7 2.5 3.8 2.9 2.4
## 4 Business accounti… 27.4 Spain 1.2 2.1 1.9 1.9 2.3
## 5 Business accounti… 29.3 Germany 3 2.3 1.6 2.6 3.5
## 6 Business accounti… 31.3 Italy 2.4 1 1.7 2.4 1.8
## # … with 14 more variables: gpa_2015 <chr>, gpa_2016 <chr>, gpa_2017 <chr>,
## # gpa_2018 <chr>, gpa_2019 <chr>, gpa_2020 <chr>, job <chr>, lifesat <chr>,
## # like <chr>, relationship <chr>, sex <chr>, term <chr>, university <chr>,
## # workingclass <chr>
# explore structure dataset
str(students)
## tibble [1,000 × 23] (S3: tbl_df/tbl/data.frame)
## $ faculty : chr [1:1000] "Business" "Business" "Business" "Business" ...
## $ course : chr [1:1000] "accounting" "accounting" "accounting" "accounting" ...
## $ age : num [1:1000] 26.3 28.8 23.9 27.4 29.3 ...
## $ cob : chr [1:1000] "Spain" "Netherlands" "Netherlands" "Spain" ...
## $ gpa_2010 : chr [1:1000] "3.1" "2.7" "1.7" "1.2" ...
## $ gpa_2011 : chr [1:1000] "2.8" "1.2" "2.5" "2.1" ...
## $ gpa_2012 : chr [1:1000] "2.8" "3.8" "3.8" "1.9" ...
## $ gpa_2013 : chr [1:1000] "2.3" "1.8" "2.9" "1.9" ...
## $ gpa_2014 : chr [1:1000] "0.5" "3" "2.4" "2.3" ...
## $ gpa_2015 : chr [1:1000] "3.9" "3" "2.4" "4.1" ...
## $ gpa_2016 : chr [1:1000] "3" "1.8" "1.6" "1" ...
## $ gpa_2017 : chr [1:1000] "1.1" "1" "3.7" "0.8" ...
## $ gpa_2018 : chr [1:1000] "2.6" "4" "2.8" "2.6" ...
## $ gpa_2019 : chr [1:1000] "3.1" "2.5" "4.1" "3.4" ...
## $ gpa_2020 : chr [1:1000] "3.9" "3" "5.7" "3.4" ...
## $ job : chr [1:1000] "no" "no" "no" "yes" ...
## $ lifesat : chr [1:1000] "68.4722360275967" "60.7386549043799" "67.2921321180378" "73.1391944810778" ...
## $ like : chr [1:1000] "3" "3" "4" "4" ...
## $ relationship: chr [1:1000] "In a relationship" "Single" "In a relationship" "Single" ...
## $ sex : chr [1:1000] "Male" "Female" "Female" "Male" ...
## $ term : chr [1:1000] "7" "4" "5" "10" ...
## $ university : chr [1:1000] "Berlin" "Berlin" "Berlin" "Berlin" ...
## $ workingclass: chr [1:1000] "yes" "yes" "no" "yes" ...
Tip: With your mouse, go to the environment panel (upper-right) and click on the “students” object. It pops up and you can browse through it. This is often a good idea to get a first feel for the data, but only recommended if your dataset is relatively small.
From the second row of the str() output onwards, the “variable” or “column” names are listed underneath each other: faculty, course, age, cob (short for country of birth), gpa_2010 to gpa_2020, job, lifesat (life satisfaction), like (how much students like their class), relationship (relationship status), sex, term, university and workingclass.
After the column name, we see the data “type” – which is of very high importance – and the first 4-5 values in the dataset.
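If you want to practise these commands before your own data is loaded, R ships with small example datasets. The sketch below uses the built-in iris data frame, which is not part of the course data.

```r
head(iris, 3)   # first 3 rows instead of the default 6
dim(iris)       # number of rows and columns
nrow(iris)      # number of observations
ncol(iris)      # number of variables
str(iris)       # structure: one line per column with its type and first values
```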
2.5.1 Data types
Data can occur in many forms, but for this lesson, we are going to focus on two specific data types: numeric and character. Numeric variables represent numbers with which you can perform mathematical operations. Character variables represent strings or sequences of characters, including letters, symbols and numbers. The data type defines which operations are possible with a variable, i.e. the meaning of the data. For example, we can take the mean of numeric variables, but we cannot take the mean of character variables.
Think of data types as different cutlery in your drawer. Stay with me here for a second, I won’t give up this metaphor just yet: If you want to eat a soup, you need a spoon. A fork won’t help you. If you need to cut something, you need a knife and you will have a hard time using a spoon to cut a piece of meat. It is the same with data types in your dataset. You need to understand the type of variable before you use it for something. This will prevent a lot of frustration later on because R gives you an error or warning message every time you try to use a fork to eat a soup (i.e. you want to, for example, get the average of a character variable).
There are the following types in R:
character: "a", "swc", "6_21", "This is also a string of characters, symbols, etc.!"
numeric: 2, 15.5 (can be whole or decimal numbers)
integer: 2L (the L tells R to store this as an integer)
logical: TRUE, FALSE
complex: 1+4i (complex numbers with real and imaginary parts)
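You can ask R directly which type a value has. A short sketch using class() on one example of each type from the list above:

```r
class("swc")     # "character"
class(15.5)      # "numeric"
class(2L)        # "integer"
class(TRUE)      # "logical"
class(1 + 4i)    # "complex"

# numbers wrapped in quotes are text, so R does not treat them as numbers
class("6")       # "character"
is.numeric("6")  # FALSE
```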
You can also use glimpse() to get a little bit more info about the range of values:
glimpse(students)
## Rows: 1,000
## Columns: 23
## $ faculty <chr> "Business", "Business", "Business", "Business", "Business…
## $ course <chr> "accounting", "accounting", "accounting", "accounting", "…
## $ age <dbl> 26.30697, 28.77746, 23.86429, 27.43746, 29.29478, 31.3362…
## $ cob <chr> "Spain", "Netherlands", "Netherlands", "Spain", "Germany"…
## $ gpa_2010 <chr> "3.1", "2.7", "1.7", "1.2", "3", "2.4", "3.9", "2", "0.4"…
## $ gpa_2011 <chr> "2.8", "1.2", "2.5", "2.1", "2.3", "1", "2", "0.5", "2", …
## $ gpa_2012 <chr> "2.8", "3.8", "3.8", "1.9", "1.6", "1.7", "2.3", "2.3", "…
## $ gpa_2013 <chr> "2.3", "1.8", "2.9", "1.9", "2.6", "2.4", "2.5", "1.4", "…
## $ gpa_2014 <chr> "0.5", "3", "2.4", "2.3", "3.5", "1.8", "3.4", "3.1", "4.…
## $ gpa_2015 <chr> "3.9", "3", "2.4", "4.1", "5.2", "2.6", "2.7", "3.8", "2.…
## $ gpa_2016 <chr> "3", "1.8", "1.6", "1", "1.7", "1.4", "0.1", "2.2", "1.9"…
## $ gpa_2017 <chr> "1.1", "1", "3.7", "0.8", "0.1", "1.6", "0.9", "1.3", "1.…
## $ gpa_2018 <chr> "2.6", "4", "2.8", "2.6", "2.3", "4.5", "2.3", "4.7", "3.…
## $ gpa_2019 <chr> "3.1", "2.5", "4.1", "3.4", "3.6", "3.6", "2.8", "3", "1.…
## $ gpa_2020 <chr> "3.9", "3", "5.7", "3.4", "6", "4.4", "5.7", "3.9", "4", …
## $ job <chr> "no", "no", "no", "yes", "yes", "no", "no", "no", "yes", …
## $ lifesat <chr> "68.4722360275967", "60.7386549043799", "67.2921321180378…
## $ like <chr> "3", "3", "4", "4", "3", "3", "2", "3", "5", "5", "5", "3…
## $ relationship <chr> "In a relationship", "Single", "In a relationship", "Sing…
## $ sex <chr> "Male", "Female", "Female", "Male", "Male", "Male", "Fema…
## $ term <chr> "7", "4", "5", "10", "14", "13", "9", "0", "12", "3", "9"…
## $ university <chr> "Berlin", "Berlin", "Berlin", "Berlin", "Berlin", "Berlin…
## $ workingclass <chr> "yes", "yes", "no", "yes", "yes", "no", "no", "no", "yes"…
Compared with str(), glimpse() shows you more of the first values of each column on the right-hand side, which gives you a better idea of the values each variable takes.
When you are not sure what an object in R is (and you don’t want to see the whole dataset using str()), type class(). For example, let’s try it with our dataset “students”:
class(students)
## [1] "tbl_df" "tbl" "data.frame"
This tells you that the object students is of the type data frame.
Let’s try it on a column:
class(students$age)
## [1] "numeric"
This tells you that the object age in the dataframe students is numeric. The $ sign extracts items from a list based on their names. In our case, we extracted the column based on its name from the dataframe students. R allows you to load many datasets at the same time. This is why you always need to specify which dataset you are referring to.
class(students$cob)
## [1] "character"
Now we tried it with cob. The object is of the character type, meaning it contains text and probably refers to the country of birth of every student in our data.
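The $ extraction used above works on any data frame. A minimal sketch with a tiny made-up data frame (the name pets and its values are purely illustrative):

```r
# a tiny made-up data frame with two columns
pets <- data.frame(name = c("Rex", "Mia", "Bo"),
                   age  = c(3, 7, 1),
                   stringsAsFactors = FALSE)

pets$age             # extract the column 'age' as a vector
class(pets$age)      # "numeric"
class(pets$name)     # "character"
mean(pets$age)       # works because the column is numeric
```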
The summary() command will get you a “summary” of the whole dataset:
summary(students)
## faculty course age cob
## Length:1000 Length:1000 Min. :10.00 Length:1000
## Class :character Class :character 1st Qu.:23.67 Class :character
## Mode :character Mode :character Median :25.85 Mode :character
## Mean :25.92
## 3rd Qu.:27.98
## Max. :80.00
## NA's :9
## gpa_2010 gpa_2011 gpa_2012 gpa_2013
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## gpa_2014 gpa_2015 gpa_2016 gpa_2017
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## gpa_2018 gpa_2019 gpa_2020 job
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## lifesat like relationship sex
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## term university workingclass
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
As we can see, this only gives useful statistics for numeric variables. R gives you the mean, median and so forth. It does not work on the other columns/variables because some of them are still (erroneously) defined as character objects. Let’s change that quickly. Run this (don’t worry about it for now):
students <- students %>% mutate(like = as.factor(like),
                                term = as.numeric(term))
Let’s try the summary command again:
summary(students)
## faculty course age cob
## Length:1000 Length:1000 Min. :10.00 Length:1000
## Class :character Class :character 1st Qu.:23.67 Class :character
## Mode :character Mode :character Median :25.85 Mode :character
## Mean :25.92
## 3rd Qu.:27.98
## Max. :80.00
## NA's :9
## gpa_2010 gpa_2011 gpa_2012 gpa_2013
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## gpa_2014 gpa_2015 gpa_2016 gpa_2017
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## gpa_2018 gpa_2019 gpa_2020 job
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## lifesat like relationship sex
## Length:1000 1 : 90 Length:1000 Length:1000
## Class :character 2 :199 Class :character Class :character
## Mode :character 3 :263 Mode :character Mode :character
## 4 :248
## 5 :144
## 6 : 47
## NA's: 9
## term university workingclass
## Min. : 0.000 Length:1000 Length:1000
## 1st Qu.: 3.000 Class :character Class :character
## Median : 7.000 Mode :character Mode :character
## Mean : 7.026
## 3rd Qu.:11.000
## Max. :14.000
## NA's :9
We now see that two more variables are summarised properly: term and like. term is the number of terms that the student has been enrolled at university. For this numeric variable, summary() gives you the mean and median in addition to other statistics. like is the measure for how fond students are of their class. This variable is a factor variable, so summary() reports the number of observations per category. It is very important to understand factor variables. Factors are used to represent categorical data.
Another distinction of data types available is the one between continuous and categorical data. Categorical variables contain a finite number of categories or distinct groups. They might or might not have a logical order. An example of an ordered categorical variable (ordinal) can be a Likert scale with five levels: “Strongly Disagree,” “Disagree,” “Neutral,” “Agree,” “Strongly Agree.” An example of an unordered categorical variable (nominal) can be ice cream types or weekdays. Continuous variables are numeric variables that have an infinite number of values between any two values. Examples of continuous variables are length, temperature, weight, and speed.
Factors are an important class for statistical analysis and for plotting. Factors can be created from integers or characters. Internally, R stores each category as an integer code and attaches a label (called a level) to it.
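A small sketch of how a factor is built and what it stores. The answers and labels below are made up and only loosely modelled on a Likert scale.

```r
# made-up answers on a 1-3 scale, stored as numbers
answers <- c(3, 1, 2, 3, 3)

# turn them into an ordered factor with readable labels
answers_f <- factor(answers,
                    levels  = c(1, 2, 3),
                    labels  = c("Disagree", "Neutral", "Agree"),
                    ordered = TRUE)

levels(answers_f)       # the categories, in order
table(answers_f)        # counts per category
as.integer(answers_f)   # the integer codes stored behind the labels
```

Setting ordered = TRUE tells R that the categories have a logical order (ordinal); leaving it out gives an unordered (nominal) factor.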
Detour: In our example, the values 1-6 in our like variable represent different levels of liking the class, from 1 = “I hate it” to 6 = “I love it”, a typical Likert variable. If the cells contained “I hate it” written in characters, then it would be more obvious that we are dealing with a character variable. Because the world isn’t fair, the character levels were converted into numbers 1-6 before the dataset was created and shared with us. Now, R is in a weird position. It has numerical values in the like variable but thinks that it is text. This often happens when we import data from Excel files: variables do not always end up with the type that we think they should have. We will learn later how to convert types.
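As a preview of type conversion, here is a minimal sketch with made-up values. It uses the base R functions as.numeric() and as.factor(), the same functions used in the mutate() call earlier.

```r
# numbers that arrived as text, as often happens after an Excel import
term_chr <- c("7", "4", "5", NA)
class(term_chr)                    # "character"

# convert to numeric so that calculations become possible
term_num <- as.numeric(term_chr)
mean(term_num, na.rm = TRUE)       # na.rm = TRUE ignores the missing value

# convert to a factor when the numbers stand for categories
like_fct <- as.factor(c("3", "3", "4"))
levels(like_fct)                   # the distinct categories
```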
If you have a large dataset, maybe you only want to look at a few variables:
# let's do it for specific vars
summary(students$like)
## 1 2 3 4 5 6 NA's
## 90 199 263 248 144 47 9
summary(students$term)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 7.000 7.026 11.000 14.000 9
What if you want to look at specific non-numeric categorical variables (i.e. string/character)? You can use table() to get a feel for those. table() returns a one-way frequency table, i.e. the count of observations for each category of the variable (see next section). As a refresher: It is very important to understand the type of variable that you are dealing with. The most important ones are:
Categorical: e.g. Education, sex, religion
Nominal: e.g. religion (no clear order)
Ordinal: e.g. school grades (A, B, C)
Numerical (continuous): e.g. age, income, satisfaction etc.
Tip: Think about questions in common surveys that you have come across in the past. Which types should they be in our data frame?
2.5.2 Data structures
The data type defines which operations are possible with a value, i.e. the meaning of the data. Data structures, on the other hand, define the way data (of one or multiple types) is stored and the operations that are possible with it.
Now on to data structures. For this class, you will only need to know 2: vectors and data frames. Still, this is useful to know! To reiterate, data structures define the way data is stored and the operations that are possible with the data type(s). Vectors combine data of one type (indicated by the colors in the picture above) and are one-dimensional. You can create a vector using the function c().
# let's create vectors
vector1 <- c(1, 2, 3)
vector2 <- c("random", "words", "here")
vector3 <- c(FALSE, TRUE, TRUE, FALSE)
vector4 <- c(1.6, 4.5, 0.55, -10)
Do you know what data types the vectors above contain?
Maybe you just want to take the average (mean) of the first and last vector. Or the first two letters of the second vector. For these relatively simple tasks, vectors are perfect.
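These simple tasks can be sketched as follows. The vectors are re-created here under different (hypothetical) names so that the example runs on its own.

```r
nums  <- c(1, 2, 3)
words <- c("random", "words", "here")
decs  <- c(1.6, 4.5, 0.55, -10)

mean(nums)              # average of the first vector
mean(decs)              # average of the last vector
substr(words, 1, 2)     # first two letters of each word
length(decs)            # number of elements in the vector
```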
By contrast, data frames can combine several data types and are two-dimensional: instead of only one row or column, we can now store information in multiple rows and columns of the same length. To create a data frame, we can use the easy-to-remember and handy data.frame() function.
# let's create a dataframe
dataframe <- data.frame(vector1, vector2, vector3, vector4)
This will not work. We will receive an error message. Try it out yourselves: copy the code into your console and run it. Every column in a data frame must have the same number of rows. It is not possible to combine columns with 3 and 4 items in one data frame. Let’s correct that, shall we?
# let's create a dataframe
dataframe <- data.frame(vector1, vector2,
                        vector3 = c(FALSE, TRUE, TRUE),
                        vector4 = c(1.6, 0.55, -10))
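If you would like to look at the error from unequal column lengths without stopping your script, you can capture it. A sketch with made-up vectors, using base R's tryCatch():

```r
short_col <- c(1, 2, 3)
long_col  <- c(TRUE, FALSE, TRUE, FALSE)

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  data.frame(short_col, long_col),
  error = function(e) conditionMessage(e)
)
result   # the error message that R reports

# with equally long columns the data frame is created as expected
ok <- data.frame(short_col, long_col = c(TRUE, FALSE, TRUE))
dim(ok)  # rows and columns
```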
Data frames are useful for storing complex information: information that is not one-dimensional. An example would be an address book, containing character data (name, address, city name, etc.), numeric data (phone number, postal code, coordinates, etc.) and logical data (wants to receive ads: FALSE/TRUE, etc.).
Matrices are also two-dimensional but can only contain one data type. Finally, lists can consist of vectors of various lengths and data types.
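For completeness, a brief sketch of these two structures with made-up values. You will not need them in this class, so treat it as background only.

```r
# a matrix: two-dimensional, but only one data type
m <- matrix(1:6, nrow = 2)
dim(m)              # 2 rows, 3 columns

# mixing types in a matrix turns everything into text
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1])  # "character"

# a list: elements may differ in length and type
l <- list(numbers = 1:3, words = c("a", "b"), flag = TRUE)
lengths(l)          # 3, 2 and 1 elements
```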
2.5.3 Look at variables
So let’s try it.
These functions will get you the frequency table for the variable “country of birth.” The first command excludes missing information (labelled as NA), the second one includes it. Missing values are values that ... well ... are missing. Maybe the student did not provide that information when they enrolled, so we simply have no information about the country of birth for some of them. To include the missing values, add the argument useNA = "ifany" to the table() function. There are other things you can add. Type in help(table) to find out more.
# look at specific variables
table(students$cob)
##
## Austria France Germany Italy Netherlands Spain
## 91 85 153 265 54 268
## UK
## 75
table(students$cob, useNA = "ifany")
##
## Austria France Germany Italy Netherlands Spain
## 91 85 153 265 54 268
## UK <NA>
## 75 9
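The effect of useNA is easy to check on a handful of made-up values, where you can count the result by hand:

```r
cob_demo <- c("Spain", "Italy", NA, "Spain", "UK")   # made-up values

counts_no_na   <- table(cob_demo)                    # NA is dropped by default
counts_with_na <- table(cob_demo, useNA = "ifany")   # NA gets its own column

sum(counts_no_na)     # 4 values counted
sum(counts_with_na)   # 5 values counted, including the missing one
```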
Let’s try another package to get more info on univariate (just one variable) and bivariate (2 variables) descriptive statistics. First, we need to install the janitor package with install.packages("janitor"). Then let’s call the tabyl() function. In addition to the frequency, this returns the proportion: next to how many students were born in each country, you see the share of students born there (e.g. 0.265, i.e. 26.5%, for Italy). In other words, the function returns the absolute and relative frequency distribution.
# let's try another package
library("janitor")
tabyl(students$cob)
## students$cob n percent valid_percent
## Austria 91 0.091 0.09182644
## France 85 0.085 0.08577195
## Germany 153 0.153 0.15438951
## Italy 265 0.265 0.26740666
## Netherlands 54 0.054 0.05449041
## Spain 268 0.268 0.27043391
## UK 75 0.075 0.07568113
## <NA> 9 0.009 NA
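The difference between percent and valid_percent in the output above is only a change of denominator: valid_percent leaves out the rows with missing values. A small sketch that re-computes the Italy row by hand:

```r
n_italy   <- 265
n_total   <- 1000          # all rows, including the 9 with a missing cob
n_missing <- 9

percent       <- n_italy / n_total                 # share of all students
valid_percent <- n_italy / (n_total - n_missing)   # share of students with a known cob

round(100 * percent, 1)         # 26.5
round(100 * valid_percent, 1)   # 26.7
```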
Let’s try to look at two variables using a crosstab(ulation). A crosstab displays the relationship between two or more variables by returning the number of observations with each combination of possible values of the two variables in each cell of the table.
Here, we get to see how many men and women are in each faculty. The first example includes NA values and the second one excludes them.
# cross-tabular (2 variables)
students %>% tabyl(sex, faculty)
## sex Business Economics Political Science Sociology NA_
## Female 131 74 161 78 3
## Male 205 149 103 80 7
## <NA> 3 2 0 4 0
# do not show missings
students %>% tabyl(sex, faculty, show_na = FALSE)
## sex Business Economics Political Science Sociology
## Female 131 74 161 78
## Male 205 149 103 80
There is one new weird sign here: %>%. This thing is called a “pipe” or the pipe operator. You will love it because it makes your life easier. You can read it as telling R: “first do this and then do that.” So, first you tell R to use the object “students” and then (%>%) you tell R to take that object and perform the tabyl() operation with that object. Pipes make it easier to comprehend the order of steps in operations and make reading code more intuitive in some cases.
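The “first do this, then do that” idea can be tried without installing anything. The sketch below uses R's built-in pipe |> instead of %>%; this assumes R version 4.1 or later, and for simple chains like this one the two pipes behave the same way. The values are made up.

```r
ages <- c(26.3, 28.8, 23.9, NA)   # made-up values

# without a pipe: read from the inside out
nested <- round(mean(ages, na.rm = TRUE), 1)

# with a pipe: read from left to right, one step after the other
piped <- ages |> mean(na.rm = TRUE) |> round(1)

identical(nested, piped)   # TRUE: both spellings give the same result
```

The piped version does exactly the same work; it only changes the order in which you read the steps.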
Now, let’s add the percentages. First use the help()
function to see
how to do that.
# as percentages
help(tabyl)
?tabyl
students %>% tabyl(sex, faculty, show_na = FALSE) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
## sex Business Economics Political Science Sociology
## Female 29.5% (131) 16.7% (74) 36.3% (161) 17.6% (78)
## Male 38.2% (205) 27.7% (149) 19.2% (103) 14.9% (80)
See how the piping works here. You can pipe as many times as you wish. Each step takes the result of the previous one and adds something to it. adorn_pct_formatting(digits = 1) tells R to report the percentages with one decimal place.
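To check where the row percentages come from, you can re-compute them from the counts in the crosstab. This sketch uses base R's prop.table() rather than janitor, with the counts copied from the output above:

```r
counts <- matrix(c(131,  74, 161, 78,
                   205, 149, 103, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex     = c("Female", "Male"),
                                 faculty = c("Business", "Economics",
                                             "Political Science", "Sociology")))

# margin = 1 divides every cell by its row total
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

For example, 131 of the 444 women study Business, which gives 29.5%, the same value that adorn_percentages("row") reports.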
Tip: Before it is your turn to get coding, I would like to remind you of two very important tips:
Run help(command) and a window will pop up that explains how each command works.
Even more important: Google until your fingers bleed. There are millions of R users out there. They all have started where you are now. Most questions have been answered in online fora or addressed through a tutorial. Google: “How to ….in R” or “R table” or “R import data” etc.
2.6 Data used in the course
The data used in the course consists of 4 randomly generated datasets.
The students dataset contains demographic and school-related information on imaginary students, such as
- faculty: Business, Economics, Political Science, Sociology
- course: accounting, migration, statistics, leadership, marketing, math, etc.
- age: range from 10 to 80 (There are no limits!)
- cob (country of birth): Netherlands, Germany, UK, etc.
- gpa_2010 - gpa_2020: Grade Point Average (average mark) from 0 to 6
- job: yes for job, no for no job
- lifesat: general life satisfaction from 0 to 100
- like: perceived satisfaction with course from 1 to 6
- relationship: In a relationship or Single
- sex: Male or Female
- term: number of terms enrolled, from 0 to 14
- university: Berlin
- workingclass: yes for working class background, no for no working class background
The students2 dataset differs from the students dataset only in its formatting in Excel.
The faculty dataset contains aggregated data per faculty:
- faculty: Business, Economics, Political Science, Sociology
- students: number of students
- profs: number of profs
- salary: amount of salary
- costs: amount of costs
The course dataset contains information on the individual courses.
- course: accounting, employment, EU, leadership, theory, etc.
- students: number of students
- profexp: years of experience by professor
- profsex: Male or Female professor
- deliverable: exam, paper or presentation
- timing (of course): 10 am, noon, 2 pm, etc.
2.7 Exercises I (based on class data)
Download R and RStudio and get a feeling for the interface. It's not that scary!
Open a new R script and load the student data as we just did in class. A new object called students now appears in the environment panel (upper right); click it if you wish to have a closer look at the data!
Explore the data set by running the commands head(students), str(students), glimpse(students) and summary(students) in your R script. You will use these a lot in the future, so have a closer look at the different outputs in the console (lower left). Remember to save your script!
Now focus on the variables age, job and relationship. Find out more about their type! If the variable is numeric, have a look at its summary; if not, create a table to get to know it better. Hint: be mindful of missing observations!
Finally, create a cross-tabulation of the variables job and relationship using tabyl() from the janitor package. Do not show the missing observations!
Re-create the table using percentages with two digits and numbers of observations in brackets. Hint: Look at the command we used in class and add each piped line step by step to understand what each line does!
2.8 Exercises II (based on your own data)
Think about a topic or question you are interested in (Hobbies? Geek interest? Political agenda?). Then think about some data that could be relevant for the subject. See if you find data online that is publicly accessible. Download it, import it into R and explore it. Get an overview of what types of variables and observations are in there. Save your script and submit it on Moodle. In the coming weeks, think hard about a dataset that could really interest you for the duration of this course. Working on it alongside the course will make things easier and keep you motivated. Each week you will get short exercises to complete using your own dataset.
If you cannot think of any data, here are some suggestions you could explore:
Free but requiring registration
Eurobarometer (public opinion survey)
World Values Survey (public opinion survey)
Pew Research (politics, journalism, public opinion)
Publicly accessible
On a side note, sometimes R does not recognize a variable, for example when its name starts with a number or includes spaces. In those cases, simply use backticks around the variable name (e.g. `2018 GDP`) when referring to it.
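A minimal sketch of the backtick trick, using a made-up data frame. The argument check.names = FALSE is needed here only so that data.frame() keeps the unusual column name instead of repairing it.

```r
# check.names = FALSE keeps the unusual column name as it is
gdp <- data.frame(country    = c("A", "B"),
                  `2018 GDP` = c(1.5, 2.5),
                  check.names = FALSE)

names(gdp)             # the column names, including the one with a space
gdp$`2018 GDP`         # backticks make the name usable
mean(gdp$`2018 GDP`)   # 2
```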