11 Week 11: Shortcuts
Objective of this session:
Installing packages faster
Iteration: loops
Simple functions
R commands covered in this session:
Pacman
across()
contains()
starts_with()
function(x)
apply()
11.1 Intro
In this session, I will introduce you to a couple of tricks to make your analysis faster and less tedious. The tools that will help us to write more efficient code are also key concepts in general programming, such as loops and functions. Wrapping your head around the basic idea behind them will help you understand better what programmers do and whether your journey into data science shall continue (or not).
11.2 Application
11.2.1 Installing packages quicker
So far, we have always installed packages individually and loaded them in each session whenever we needed them. Last week, during our session on Rmarkdown, I recommended installing and loading packages in one compact script at the very beginning of your analysis. When your analysis gets longer and you require a lot of packages, installing and loading them can take a long time. It is also not very efficient, because you repeat the same commands over and over, including installing packages that are already installed.
There is a shortcut using the pacman package. First, install it. You may notice that the installation of “pacman” here looks a bit different from our normal installation process. A look at the documentation of the first function, require() (run ?require in the console or a script), tells us that require() is similar to library(). However, library() will return an error message if it fails to load a package, whereas require() will return FALSE.
Short tangent: The exclamation mark in R has two main uses: 1) “does not equal”, as in variable1 != "X", which you might read as “variable1 does not equal the character string X”. This is useful for filtering. 2) Negation: !TRUE becomes FALSE and !FALSE becomes TRUE. Therefore, if we can successfully load the package pacman, require() returns TRUE, which the exclamation mark negates into FALSE. In that case, the condition is FALSE, i.e. not fulfilled, and the installation will not be run. If loading fails, require() returns FALSE, the negated condition is TRUE and R installs the package. In other words, require() first tries to load the package and, if this does not work because the package is not installed yet, we ask R to install it. This sort of conditional installation is not unique to the “pacman” package but just very fitting for the topic at hand.
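To make the tangent concrete, here is a minimal base R sketch of both uses of the exclamation mark and of what require() returns when loading fails. The package name "notARealPackage123" and the object names are made up for illustration, and the last part assumes that checking several names at once with requireNamespace() is a reasonable stand-in for the one-package check described above.

```r
# 1) "does not equal" versus 2) negation
grade <- "B"
not_an_a <- grade != "A"   # TRUE, because "B" is not "A"
negated  <- !TRUE          # FALSE

# require() returns FALSE (plus a warning) instead of stopping with an error.
# "notARealPackage123" is a made-up package name used only for illustration.
loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
run_install <- !loaded     # TRUE, so install.packages() inside if () would run

# The same check over several names at once, without loading anything:
wanted <- c("stats", "utils", "notARealPackage123")
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]
missing_pkgs               # the names that would still need to be installed
```

Nothing is installed in this sketch; it only shows which branch of the condition would be taken.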
Then, using the “pacman” package, list all the packages you need in the p_load() function. I listed all packages that we have used throughout this course so far. The call below fits in a few lines; installing and loading the 15 packages individually would take 30 lines of code. p_load() does something very similar to what we have seen above: for all the packages listed in the brackets, it checks whether they are already installed, installs them if needed and loads them.
# clear all
rm(list = ls())
# install and load pacman
if (!require("pacman")) install.packages("pacman")
# load all packages (and install them if they are not yet installed)
pacman::p_load("tidyverse", "dplyr", "stringr",
               "hrbrthemes", "gcookbook",
               "remotes", "ggridges", "viridis",
               "flextable", "stargazer",
               "janitor", "gtsummary", "writexl",
               "knitr", "rmarkdown")
11.2.2 Iteration/ loops
In programming and data science, loops are a very important concept. Data scientists are lazy, so they hate to repeat things in their code. They always want to get to their goal with the least lines of code possible while not losing sight of what is going on (and still being able to explain what is happening – remember our discussion on reproducibility?). Loops are immensely helpful for that.
Loops are an approach to handling iteration, which basically means “repeating things.” Whenever you find yourself copying and pasting some code more than two or three times, it is probably more efficient to think about a loop.
As with anything in R, there are many different ways to write loops, including lapply(), for loops, while loops, the purrr package and others. I strongly recommend exploring those approaches further. Here, however, I will briefly introduce easy built-in tidyverse functions that allow you to loop across columns with the recent across() function.
Tip: Functions such as summarise() or mutate() come in very handy when managing your data and handling multiple columns at once. They also exist with suffixes such as _at, _all and _if to specify which columns commands are applied to. These suffixed functions are known as “scoped variants”, and across() is now considered their successor.
11.2.2.1 Across()
The across() function is a helpful built-in function in tidyverse. We will use it to apply the same operation to multiple columns in our dataset. It is especially powerful when we use it together with mutate() and summarise() (not their suffixed versions!). This speeds up cleaning and analysis and will shorten your code overall.
Let’s go back to our “students” dataframe. First, we load it and then select only the GPA grade variables. Remember, we can do that by specifying the column range as numbers. In this case, we can also do it using the helper function contains(). contains() only selects columns whose names match a character sequence. In our case, we want to select only variables that contain “gpa” in the variable name.
We can also select all variables except the grade variables by using the “!” operator explained and used above.
# load data
students <- readxl::read_excel("data/students.xlsx")
## select only certain columns (i.e. grades)
grades <- students %>% select(5:15)
str(grades)
## tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
## $ gpa_2010: chr [1:1000] "3.1" "2.7" "1.7" "1.2" ...
## $ gpa_2011: chr [1:1000] "2.8" "1.2" "2.5" "2.1" ...
## $ gpa_2012: chr [1:1000] "2.8" "3.8" "3.8" "1.9" ...
## $ gpa_2013: chr [1:1000] "2.3" "1.8" "2.9" "1.9" ...
## $ gpa_2014: chr [1:1000] "0.5" "3" "2.4" "2.3" ...
## $ gpa_2015: chr [1:1000] "3.9" "3" "2.4" "4.1" ...
## $ gpa_2016: chr [1:1000] "3" "1.8" "1.6" "1" ...
## $ gpa_2017: chr [1:1000] "1.1" "1" "3.7" "0.8" ...
## $ gpa_2018: chr [1:1000] "2.6" "4" "2.8" "2.6" ...
## $ gpa_2019: chr [1:1000] "3.1" "2.5" "4.1" "3.4" ...
## $ gpa_2020: chr [1:1000] "3.9" "3" "5.7" "3.4" ...
# use shortcut
grades <- students %>% select(contains("gpa"))
nongrades <- students %>% select(!contains("gpa"))
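The difference between the selection helpers is easiest to see on a tiny table. The sketch below uses a made-up tibble (the column names, including best_gpa, are invented for illustration) to show which columns contains(), starts_with() and the negated !contains() pick up.

```r
library(dplyr)

# Toy data with made-up column names
toy <- tibble(
  name     = c("Ana", "Ben"),
  gpa_2010 = c("3.1", "2.7"),
  gpa_2011 = c("2.8", "1.2"),
  best_gpa = c("2.8", "1.2")
)

# "gpa" anywhere in the name: gpa_2010, gpa_2011 and best_gpa
anywhere <- toy %>% select(contains("gpa")) %>% names()

# "gpa" at the start of the name: gpa_2010 and gpa_2011 only
at_start <- toy %>% select(starts_with("gpa")) %>% names()

# everything that does NOT contain "gpa": name
the_rest <- toy %>% select(!contains("gpa")) %>% names()
```

In the students data every grade column starts with “gpa”, so both helpers select the same columns there; they only differ when the string appears elsewhere in a name.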
Alright. We now have a dataframe “grades” that contains only the grades per year for all students. To do anything with the grade information, we first need to convert all the columns to numeric format. We could do this one by one, but it would require a lot of lines of code. Instead, we will use a built-in loop using across() in combination with starts_with(), where() or contains() (all are useful helper functions).
In the across() function (type ?across), we have two inputs. First, we define which columns we want to do things with; second, we input a function that tells R what to do with the columns.
The dataframes grades1-4 all contain the same information, but we used different ways to get there to show you which options exist. You can see that in all four versions, the second input of across() is, unsurprisingly, as.numeric, since that is what we want to do with the columns: turn the data into numeric data to do calculations. The first argument of the across() function varies. The first version searches all columns of the object grades for columns with character data. The second version looks for the string “gpa” at the beginning of the variable name, the third one for the string “gpa” somewhere in the variable name and, finally, the last one takes columns 1 to 11.
## convert many columns at once
# convert to numeric
grades1 <- grades %>%
  mutate(across(where(is.character), as.numeric))
str(grades1)
## tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
## $ gpa_2010: num [1:1000] 3.1 2.7 1.7 1.2 3 2.4 3.9 2 0.4 4.3 ...
## $ gpa_2011: num [1:1000] 2.8 1.2 2.5 2.1 2.3 1 2 0.5 2 1.6 ...
## $ gpa_2012: num [1:1000] 2.8 3.8 3.8 1.9 1.6 1.7 2.3 2.3 4.3 3.6 ...
## $ gpa_2013: num [1:1000] 2.3 1.8 2.9 1.9 2.6 2.4 2.5 1.4 2.7 2.4 ...
## $ gpa_2014: num [1:1000] 0.5 3 2.4 2.3 3.5 1.8 3.4 3.1 4.1 1.9 ...
## $ gpa_2015: num [1:1000] 3.9 3 2.4 4.1 5.2 2.6 2.7 3.8 2.1 1.9 ...
## $ gpa_2016: num [1:1000] 3 1.8 1.6 1 1.7 1.4 0.1 2.2 1.9 3.2 ...
## $ gpa_2017: num [1:1000] 1.1 1 3.7 0.8 0.1 1.6 0.9 1.3 1.9 1.6 ...
## $ gpa_2018: num [1:1000] 2.6 4 2.8 2.6 2.3 4.5 2.3 4.7 3.8 4.1 ...
## $ gpa_2019: num [1:1000] 3.1 2.5 4.1 3.4 3.6 3.6 2.8 3 1.9 3.3 ...
## $ gpa_2020: num [1:1000] 3.9 3 5.7 3.4 6 4.4 5.7 3.9 4 2.7 ...
# same but different
grades2 <- grades %>%
  mutate(across(starts_with("gpa"), as.numeric))
grades3 <- grades %>%
  mutate(across(contains("gpa"), as.numeric))
grades4 <- grades %>% mutate(across(1:11, as.numeric))
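To check the claim that the different column selections lead to the same result, the following sketch repeats the conversion on a made-up two-column tibble and compares the outcomes. The data and object names are invented for illustration.

```r
library(dplyr)

# Toy data: grades stored as text, as in the students file
toy <- tibble(gpa_2010 = c("3.1", "2.7"), gpa_2011 = c("2.8", "1.2"))

by_type   <- toy %>% mutate(across(where(is.character), as.numeric))
by_prefix <- toy %>% mutate(across(starts_with("gpa"), as.numeric))
by_range  <- toy %>% mutate(across(1:2, as.numeric))

# All three routes should give the same numeric columns
same_1 <- isTRUE(all.equal(by_type, by_prefix))
same_2 <- isTRUE(all.equal(by_type, by_range))
```

Which selector to prefer is a matter of robustness: selecting by name or type keeps working if columns are reordered, whereas a numeric range such as 1:2 silently picks different columns.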
Imagine you don’t want to convert all variables with a specific name or within a specific column range, but only very specific variables that do not appear in the same order. You can do this using across() and specifying the variable names in a vector c(), separated by commas.
# only selected variables
grades5 <- students %>% mutate(across(c(gpa_2010, gpa_2011, gpa_2012), as.numeric))
str(grades5)
## tibble [1,000 × 23] (S3: tbl_df/tbl/data.frame)
## $ faculty : chr [1:1000] "Business" "Business" "Business" "Business" ...
## $ course : chr [1:1000] "accounting" "accounting" "accounting" "accounting" ...
## $ age : num [1:1000] 26.3 28.8 23.9 27.4 29.3 ...
## $ cob : chr [1:1000] "Spain" "Netherlands" "Netherlands" "Spain" ...
## $ gpa_2010 : num [1:1000] 3.1 2.7 1.7 1.2 3 2.4 3.9 2 0.4 4.3 ...
## $ gpa_2011 : num [1:1000] 2.8 1.2 2.5 2.1 2.3 1 2 0.5 2 1.6 ...
## $ gpa_2012 : num [1:1000] 2.8 3.8 3.8 1.9 1.6 1.7 2.3 2.3 4.3 3.6 ...
## $ gpa_2013 : chr [1:1000] "2.3" "1.8" "2.9" "1.9" ...
## $ gpa_2014 : chr [1:1000] "0.5" "3" "2.4" "2.3" ...
## $ gpa_2015 : chr [1:1000] "3.9" "3" "2.4" "4.1" ...
## $ gpa_2016 : chr [1:1000] "3" "1.8" "1.6" "1" ...
## $ gpa_2017 : chr [1:1000] "1.1" "1" "3.7" "0.8" ...
## $ gpa_2018 : chr [1:1000] "2.6" "4" "2.8" "2.6" ...
## $ gpa_2019 : chr [1:1000] "3.1" "2.5" "4.1" "3.4" ...
## $ gpa_2020 : chr [1:1000] "3.9" "3" "5.7" "3.4" ...
## $ job : chr [1:1000] "no" "no" "no" "yes" ...
## $ lifesat : chr [1:1000] "68.4722360275967" "60.7386549043799" "67.2921321180378" "73.1391944810778" ...
## $ like : chr [1:1000] "3" "3" "4" "4" ...
## $ relationship: chr [1:1000] "In a relationship" "Single" "In a relationship" "Single" ...
## $ sex : chr [1:1000] "Male" "Female" "Female" "Male" ...
## $ term : chr [1:1000] "7" "4" "5" "10" ...
## $ university : chr [1:1000] "Berlin" "Berlin" "Berlin" "Berlin" ...
## $ workingclass: chr [1:1000] "yes" "yes" "no" "yes" ...
Now, let’s get the average grade for all years. Again, rather than doing it column by column, we will use one line to apply the mean function to all columns by using summarise() in combination with across().
## get mean for all grade variables
mean_grades <- grades1 %>% summarise(across(1:11, mean, na.rm=TRUE))
That was quick and easy. Now let’s get the mean by group, e.g. sex. Luckily, the across() function works well with the group_by() function that you are already very familiar with. Read this as: from the students dataframe, we select all variables containing the string “gpa” and the “sex” variable, we then convert all “gpa” containing variables to numeric variables, then group the data by sex to compute the mean value of all “gpa” containing variables for students based on sex while removing missing observations.
## get mean for all grade variables by group
mean_grades <- students %>%
  select(contains("gpa"), sex) %>%
  mutate(across(contains("gpa"), as.numeric)) %>%
  group_by(sex) %>%
  summarise(across(contains("gpa"), mean, na.rm=TRUE))
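The grouped summary can be traced by hand on a small example. The sketch below uses made-up data and writes the na.rm argument inside a small anonymous function; this is the form recommended since dplyr 1.1.0, where passing extra arguments such as na.rm directly through across() triggers a deprecation warning.

```r
library(dplyr)

# Toy data with one missing grade
toy <- tibble(
  sex      = c("Female", "Female", "Male", "Male"),
  gpa_2010 = c(2.0, 3.0, 1.0, NA),
  gpa_2011 = c(1.0, 2.0, 4.0, 2.0)
)

toy_means <- toy %>%
  group_by(sex) %>%
  # the anonymous function receives one column (per group) at a time
  summarise(across(contains("gpa"), function(x) mean(x, na.rm = TRUE)))

toy_means
```

The result has one row per group: for the female students the means are 2.5 and 1.5, for the male students 1 and 3 (the missing 2010 grade is ignored).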
Remember that we solved the same problem using another method in an earlier session? We also wanted to get the average grade of students in different faculties. Rather than using a loop, we reshaped or restructured the data from wide to long format and then applied the mean() function to a single column (which then contained all grades stacked by year).
The approach you use depends on the specific task at hand and, to some degree, on your personal preferences.
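For comparison, the reshaping route can be sketched as follows. This assumes tidyr's pivot_longer() for the wide-to-long step, which may differ from the exact function used in the earlier session; the data and the names toy_long, year and gpa are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data in wide format: one column per year
toy <- tibble(
  faculty  = c("Business", "Business", "Law"),
  gpa_2010 = c(2.0, 3.0, 1.0),
  gpa_2011 = c(1.0, 2.0, 4.0)
)

# Stack all gpa columns into a single column "gpa"
toy_long <- toy %>%
  pivot_longer(cols = contains("gpa"), names_to = "year", values_to = "gpa")

# One mean per faculty, pooled over all years
faculty_means <- toy_long %>%
  group_by(faculty) %>%
  summarise(mean_gpa = mean(gpa, na.rm = TRUE))
```

Note that this answers a slightly different question than the across() version: it yields one pooled mean per group rather than one mean per group and year.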
11.2.2.2 Apply()
We have already covered several average GPA scores we might be interested in with across(). Now, just to inform you of a different and similarly versatile function, let’s have a look at the apply() function family. apply() can apply (haha) functions to a dataframe; its variants such as lapply() or tapply() require a more specific input. What is interesting for us now is the ability of apply() to apply a function to rows as well as columns. Say, for example, that you want to find out each student’s average GPA over the years: what would that look like? We create a new dataframe “av_grades” based on one of our previously created “grades” dataframes where the gpa scores are already coded numerically. We then create a new variable using mutate() and apply(). The first input argument specifies the relevant data, the second the direction: 1 means applying the function to each row, 2 to each column. Finally, we have the function we are interested in (mean) and the optional argument taking care of missing observations.
# let's have a look at the apply documentation
?apply
# create a new variable av_gpa using apply
av_grades <- grades1 %>%
  mutate(av_gpa = apply(grades1, 1, mean, na.rm=TRUE))
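The meaning of the second argument is easiest to see on a very small table. This base R sketch uses made-up numbers to contrast direction 1 (rows) with direction 2 (columns).

```r
# Toy data: three students, two years, one missing grade
scores <- data.frame(gpa_2010 = c(1, 2, NA), gpa_2011 = c(3, 4, 5))

# 1 = apply the function to each row: one average per student
row_means <- apply(scores, 1, mean, na.rm = TRUE)

# 2 = apply the function to each column: one average per year
col_means <- apply(scores, 2, mean, na.rm = TRUE)

row_means   # 2 3 5
col_means   # gpa_2010 = 1.5, gpa_2011 = 4
```

For plain means, base R also offers the shortcuts rowMeans() and colMeans(); apply() is the more general tool because it accepts any function.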
Extra: sapply(dataframe, class) returns the class of every column, and sapply(dataframe, unique) returns the unique values of every column.
11.2.2.3 “FOR” loop
After having gotten to know and cherish the across() and apply() functions, you might wonder whether you can do all that manually as well. The answer, of course, is yes, but you would be well advised to simply use a pre-existing function whenever one is available. Still, let’s have a quick look into loops so that you can get a general understanding of the concept. This way, if you ever need a loop, your Google queries will be a bit more proficient to start with.
The most general loop type is a for loop. Here, we can iterate over a vector and apply a function to every element in it (sounds familiar, doesn’t it?). The general structure consists of for, followed by i in and the vector to iterate over in parentheses (), and a body enclosed in two braces {}.
Read this as “for every element i in a given vector (i is commonly used but you can basically name it as you wish), apply whatever happens in between the {}.” Now, let’s consider our students data again: Imagine that for whatever reason, you want to shift the grade range by 1, meaning that you add 1 to all GPA values.
# first loop
for(col in 1:ncol(grades2)) {
  grades2[ , col] <- grades2[ , col] + 1
}
Let’s see what happened here: this for loop iterates over every column number col from 1 to the number of columns of the dataframe grades2. Note the syntax 1:ncol(), which means starting from 1 up to the number of columns in the specified dataframe. This allows us to index the dataframe to specify exactly which cells we want to apply something to. Now, moving on to the body (the part in between {}), we replace the values in the columns of grades2 by their original value plus 1. The [ , col] indexing part is new: square brackets allow us to access cells in a dataframe by specifying their coordinates [row, column]. Here, we leave the row space empty because the operation applies to all rows, and only indicate the column we are interested in using the col variable from the loop initialization. What happens now is basically that col takes on every value between 1 and the number of columns in grades2. This means that in the first round of the loop, col is equal to 1, so that R processes grades2[ , 1] <- grades2[ , 1] + 1, thereby adding 1 to each value in the first column of grades2 (“gpa_2010”). This is repeated for every column, so that in the end all values in grades2 are greater by 1.
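The same loop can be tried out on a small made-up dataframe without touching the course data. The sketch below uses seq_len(ncol(...)) instead of 1:ncol(...); both give the same column numbers here, but seq_len() also behaves sensibly for a dataframe with zero columns, where 1:0 would count backwards.

```r
# Toy data standing in for grades2
toy <- data.frame(gpa_2010 = c(1.0, 2.5), gpa_2011 = c(3.0, 0.5))
shifted <- toy

# col takes the values 1 and 2, one per round of the loop
for (col in seq_len(ncol(shifted))) {
  shifted[, col] <- shifted[, col] + 1
}

shifted

# Arithmetic on a numeric dataframe is vectorised, so no loop is strictly needed
shortcut <- toy + 1
```

That the one-line shortcut exists illustrates the earlier advice: a loop is worth understanding, but a built-in vectorised operation is usually shorter and less error-prone.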
Obviously, loops can get much more complicated than this, so take this as a simple introduction to the general idea of loops. Besides for loops, there are also while and repeat loops and other ways to introduce conditions. Have a look here or here.
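As a small illustration of a loop that runs until a condition stops holding, here is a minimal while sketch in base R. The variable names and the threshold of 100 are made up for illustration.

```r
value <- 1
steps <- 0

# Keep doubling as long as the condition in parentheses is TRUE
while (value <= 100) {
  value <- value * 2
  steps <- steps + 1
}

steps   # 7 doublings were needed
value   # 128, the first value above 100
```

Unlike a for loop, the number of rounds is not fixed in advance, so the body must eventually make the condition FALSE; otherwise the loop never ends.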
11.2.3 User-defined Functions
You already know functions. You have been using them a lot already. mean() is a function, summary() is a function. A function in R is just a name for a pre-defined operation that is performed on an object. The mean() function can be applied to a vector of numbers and will then give you the arithmetic mean. While calculating the mean may seem rather straightforward, functions are generally suitable shortcuts for operations we want to perform.
A function always requires some sort of “input,” called arguments, between its brackets, then it will “process” the input of the argument to return a specific “output.” In R, you can write your own customized functions. For more advanced users, this can speed up the work or actually make it easier to get somewhere where conventional commands won’t get you easily.
Tip: Before spending a lot of time writing your own function, check online whether somebody has already done it. Just google “How to …in R.”
The structure of a function is quite straightforward. You define the name of the function. Then you can include arguments as input into the function – although they are not necessarily required in all cases. If your function takes two arguments as input as seen below, make sure to also always specify two input arguments when applying the function (or set some default values for the argument when building the function). Lastly, you have the body of the function where you put the operation (a formula, an algorithm, an equation…etc.).
my_fun <- function(arg1, arg2) {
  body
}
Here is a simple example of a new function called “triple.” All it does is multiply an input number by 3 and return the tripled number as output. When we apply triple() to the object a (which is 5), we see the output 15 in the console. If we call object a again, we can see that a is still 5 and has not been overwritten by this function.
# Function
triple <- function(x) {
  x <- 3*x
  x
}
# Test
a <- 5
triple(a)
## [1] 15
a
## [1] 5
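The remark about default values for arguments can be illustrated with a second small function. The function name raise and its arguments are made up for this sketch.

```r
# power has a default value of 2, so it can be left out when calling
raise <- function(x, power = 2) {
  x^power
}

squared <- raise(3)                 # uses the default: 3^2 = 9
cubed   <- raise(2, 3)              # overrides the default: 2^3 = 8
named   <- raise(power = 3, x = 2)  # named arguments can come in any order
```

Without the default, calling raise(3) would stop with an error about the missing argument power, which is the situation the text warns about.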
Above we used a type of tidyverse loop to get the mean of multiple columns containing information on grades. Imagine we want to transform the grades, so that the best scoring student has a score close to zero and the worst performing student has a score close to 1. We are basically re-scaling the grades to scores between 0-1 (this is also called min-max normalization).
The formula is: \(\frac{x - \min(x)}{\max(x) - \min(x)}\), so the student’s grade minus the lowest observed grade, divided by the maximum observed grade minus the minimum grade. If we wanted to create a function in R to perform that calculation, it would look like this. The new function called “min_max_norm” is a function of x (any input).
# write min-max formula
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
When you run this, a new function appears in your environment (see upper right corner).
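Before using the function on real data, it helps to check it on a vector where the result can be worked out by hand. The sketch below repeats the definition so that it runs on its own; the example vectors and the NA-tolerant variant min_max_norm_na() are made up for illustration and are not part of the course code.

```r
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# min = 1, max = 5, so e.g. 2 becomes (2 - 1) / (5 - 1) = 0.25
checked <- min_max_norm(c(1, 2, 3, 5))   # 0.00 0.25 0.50 1.00

# A single missing value makes min() and max() return NA,
# so every element of the result is NA
with_na <- min_max_norm(c(1, NA, 5))

# Hypothetical variant that ignores missing values when finding min and max
min_max_norm_na <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) / (hi - lo)
}
tolerant <- min_max_norm_na(c(1, NA, 5))  # 0 NA 1
```

The second call shows why missing values need to be handled, either by removing them before normalizing or by a variant like the one sketched here.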
Now, let’s apply it to a column in our dataframe. R applies our own calculation to every element of the column. Notice that there is no argument x written in the brackets here. Instead, the pipe %>% passes the result of the previous step, the selected variable gpa_2010, to the function as its input.
# apply to one column
grades5 <- grades1 %>% select(gpa_2010) %>%
  filter(!is.na(gpa_2010)) %>%
  min_max_norm()
Let’s try to combine this with the previous example of a loop across columns. We want the min-max transformation for all grade variables, not just 2010.
## combine with loop
grades6 <- grades1 %>%
  drop_na() %>%
  mutate(across(contains("gpa"), min_max_norm))
11.3 Exercises I (based on class data)
Your successful work for the university president got around and now two departments have asked you to work for them. Congratulations! After feeling a bit overwhelmed at first, you remember that most of the things you had to do in your first job are quite repetitive. With some clever shortcutting you will save time and will easily be able to help the other two departments with their R problems for a couple of hours a week each. To convince the heads of department that you can work for both while still having enough time to study and relax, they ask you to put together a quick demonstration of your shortcutting skills. First, you want to load the students data and the tidyverse and janitor packages as usual, but to show off, you use the pacman package to load them!
From previous work for the university president, you remember that all variables in the students dataframe are initially coded as character variables. In order to demonstrate that you can not only convert variables to numeric variables, you decide to convert all character variables besides grades, term, life satisfaction and age into factor variables. Use a shortcut!
While going through the materials you were given, you suddenly realize that someone messed with the term count. The rest of the dataframe is up to date but the term variable still shows the term students were in a year ago. Write a function to correct that error and apply it to the students dataset! Hint: Remember to check whether you need to convert it first!
You are very happy with your progress and decide to take a lunch break. While standing in line, someone asks you how you're doing. Of course, this immediately reminds you of the life satisfaction variable and you recognize your chance to change it in your show-off report to the departments! You decide to change its range using the normalization function min_max_norm() you encountered in class.
Motivated by your success, you decide to go one step further and change the age variable. Instead of normalizing it, you decide to standardize it and write your own function! Look here for a short definition of standardization.
11.4 Exercises II (based on your own data)
Identify ways to implement shortcuts in the dataset you have been working on:
Convert at least two variables to their correct data type using one of the across() shortcuts introduced this week.
Normalize at least one variable using the min_max_norm() function we created.
Explore your data and come up with a helpful function. For example, your dataset may have a lot of variables with country names and you would like to update them because their official title has changed. Here, a custom function applied to multiple columns would make sense! Hint: remember the case_when() function we encountered in week 3?
11.5 More helpful resources and online tutorials:
On loops and iteration in R: 21 Iteration | R for Data Science (had.co.nz)
Examples of for loops: R Loop Through Data Frame Columns & Rows (4 Examples) | for & while (statisticsglobe.com)
Introduction to string manipulation via the package stringr