11 Week 11: Shortcuts

Objective of this session:

Installing packages faster
Iteration: loops
Simple functions

R commands covered in this session:

p_load() (pacman package)
across()
contains()
starts_with()
function(x)
apply()

11.1 Intro

In this session, I will introduce you to a couple of tricks to make your analysis faster and less tedious. The tools that will help us to write more efficient code are also key concepts in general programming, such as loops and functions. Wrapping your head around the basic idea behind them will help you understand better what programmers do and whether your journey into data science shall continue (or not).

11.2 Application

11.2.1 Installing packages quicker

So far, we have always installed packages individually and loaded them in each session whenever we needed them. Last week, during our session on R Markdown, I recommended installing and loading packages in one compact script at the very beginning of your analysis. When your analysis gets longer and you require a lot of packages, installing and loading them one by one can take a long time. It is also not very efficient, because you repeat the same commands over and over, including reinstalling packages that are already on your computer.

There is a shortcut using the pacman package. First, install it. You may notice that the installation of “pacman” in the code below looks a bit different from our normal installation process. A look at the documentation of the function require() (run ?require in the console or script) tells us that require() is similar to library(). However, library() throws an error if it fails to load a package, whereas require() returns FALSE (with a warning).

Short tangent: The exclamation mark in R has two main uses: 1) As part of the operator !=, it means “does not equal”, as in variable1 != "X", which you can read as “variable1 does not equal the character string X”. This is useful for filtering. 2) On its own, it negates a logical value: !TRUE becomes FALSE and !FALSE becomes TRUE. Therefore, if require() successfully loads the package pacman, it returns TRUE, which the exclamation mark turns into FALSE. In that case, the condition is not fulfilled and install.packages() is not run. If require() fails to load the package, usually because it is not installed yet, it returns FALSE, the negation turns this into TRUE, and R installs the package. This sort of conditional installation is not unique to the “pacman” package but just very fitting for the topic at hand.
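To make the two uses of the exclamation mark concrete, here is a minimal sketch in base R. The package name "notARealPackage123" is made up purely for illustration, on the assumption that no such package is installed on your computer.

```r
# "Does not equal": compares each element with "X"
variable1 <- c("X", "Y", "X")
not_x <- variable1 != "X"   # FALSE TRUE FALSE

# Negation flips a logical value
negated <- !TRUE            # FALSE

# require() returns FALSE when a package cannot be loaded.
# "notARealPackage123" is a hypothetical name used only for illustration.
loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

# Negating FALSE gives TRUE, so an if() with this condition would
# go on to run install.packages()
needs_install <- !loaded
```

Here needs_install plays the role of the condition inside if (...): only when it is TRUE would the installation step run.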

Then, by using the “pacman” package, list all the packages you need in the p_load() function. I used all packages that we used throughout this course so far. The code below needs only a single p_load() call. If you installed and loaded each of the 15 packages individually, you would require 30 lines of code. p_load() does something very similar to what we have seen above: for all the packages listed in the brackets, it checks whether they are already installed, installs them if needed and loads them.

# clear all
rm(list = ls())

# install and load pacman
if (!require("pacman")) install.packages("pacman")

# load all packages (and install them if they are not yet installed)
pacman::p_load("tidyverse", "dplyr", "stringr",
               "hrbrthemes", "gcookbook",
               "remotes", "ggridges","viridis", 
               "flextable", "stargazer",
               "janitor", "gtsummary", "writexl", 
               "knitr", "rmarkdown")
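To get a feeling for what p_load() saves you, here is a rough base R sketch of the same check-install-load idea. The helper load_or_install() is hypothetical and written only for illustration; it is not how pacman is implemented internally. The demonstration uses "stats" and "utils", which ship with R, so nothing is actually installed.

```r
# Hypothetical helper: for each package name, install it if it is
# missing, then attach it. A sketch of the idea behind p_load() only.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE if the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE tells library() that pkg holds a name as text
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Both packages come with R, so the install step is skipped
load_or_install(c("stats", "utils"))
```

Notice that the helper already contains a loop and a user-defined function, the two concepts covered in the rest of this session.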

11.2.2 Iteration/ loops

In programming and data science, loops are a very important concept. Data scientists are lazy, so they hate to repeat things in their code. They always want to get to their goal with the least lines of code possible while not losing sight of what is going on (and still being able to explain what is happening – remember our discussion on reproducibility?). Loops are immensely helpful for that.

Loops are an approach to handle iteration, which basically means “repeating things.” Whenever you find yourself copying and pasting some code more than two or three times, it is probably more efficient to think about a loop.

As with anything in R, there are many different ways to iterate, including lapply(), for loops, while loops, the purrr package and others. I strongly recommend exploring those approaches further.

Here, however, I will briefly introduce easy built-in tidyverse functions that allow you to conduct loops across columns with the recent across() function.

Tip: Functions such as summarise() or mutate() come in very handy when managing your data and handling multiple columns at once. They have suffixed versions ending in _at, _all and _if (also known as “scoped variants”) that specify which columns a command is applied to. across() supersedes these scoped variants.
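As a small illustration of this tip, the sketch below converts the character columns of a made-up dataframe once with the older scoped variant mutate_if() and once with across(). The dataframe toy and its column names are invented for this example.

```r
library(dplyr)

# Made-up example data: two grade columns stored as text
toy <- tibble(
  id    = 1:3,
  gpa_a = c("1.0", "2.5", "3.0"),
  gpa_b = c("2.0", "1.5", "4.0")
)

# Older scoped variant: apply as.numeric to every character column
old_style <- toy %>% mutate_if(is.character, as.numeric)

# Same task with across() inside a plain mutate()
new_style <- toy %>% mutate(across(where(is.character), as.numeric))

# Compare the two results
same_result <- isTRUE(all.equal(old_style, new_style))
```

Both versions should give the same result, so there is little reason to learn the suffixed versions for new code.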

11.2.2.1 across()

The across() function comes with the dplyr package, which is part of the tidyverse. We will use it to apply the same operation to multiple columns in our dataset. It is especially powerful when we use it together with mutate() and summarise() (not their suffixed versions!). This speeds up cleaning and analysis and will shorten your code overall.

Let’s go back to our “students” dataframe. First, we load it and then only select the GPA grade variables. Remember, we can do that by specifying the column range as numbers. In this case, we can also do it using the helper function contains(). contains() selects only those columns whose names contain a given character sequence. In our case, we want to select only variables that contain “gpa” in the variable name.

We can also select all variables except the grade variables by using the “!” operator explained and used above.

# load data

students <- readxl::read_excel("data/students.xlsx")

## select only certain columns (i.e. grades)
grades <- students %>% select(5:15)
str(grades)

## tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
##  $ gpa_2010: chr [1:1000] "3.1" "2.7" "1.7" "1.2" ...
##  $ gpa_2011: chr [1:1000] "2.8" "1.2" "2.5" "2.1" ...
##  $ gpa_2012: chr [1:1000] "2.8" "3.8" "3.8" "1.9" ...
##  $ gpa_2013: chr [1:1000] "2.3" "1.8" "2.9" "1.9" ...
##  $ gpa_2014: chr [1:1000] "0.5" "3" "2.4" "2.3" ...
##  $ gpa_2015: chr [1:1000] "3.9" "3" "2.4" "4.1" ...
##  $ gpa_2016: chr [1:1000] "3" "1.8" "1.6" "1" ...
##  $ gpa_2017: chr [1:1000] "1.1" "1" "3.7" "0.8" ...
##  $ gpa_2018: chr [1:1000] "2.6" "4" "2.8" "2.6" ...
##  $ gpa_2019: chr [1:1000] "3.1" "2.5" "4.1" "3.4" ...
##  $ gpa_2020: chr [1:1000] "3.9" "3" "5.7" "3.4" ...

# use shortcut
grades <- students %>% select(contains("gpa"))
nongrades <- students %>% select(!contains("gpa"))

Alright. We now have a dataframe “grades” that contains only the grades per year for all students. To do anything with the grade information, we first need to convert all the columns to numeric format. We could do this one by one, but it would require a lot of lines of code. Instead, we will use a built-in loop using across() in combination with starts_with(), where() or contains() (all are useful helper functions).

In the across function (type ?across), we have two inputs. First, we define which columns we want to do things with, second, we input a function that tells R what to do with the columns.

The dataframes grades1 to grades4 all contain the same information, but we used different ways to get there to show you which options exist. You can see that in all four versions, the second input of across() is, unsurprisingly, as.numeric, since that is what we want to do with the columns: turn the data into numeric data to do calculations. The first argument of the across() function varies. The first version searches all columns of the object grades for columns with character data. The second version looks for the string “gpa” at the beginning of the variable name, the third one for the string “gpa” anywhere in the variable name and, finally, the last one takes columns 1 to 11.

## convert many columns at once
# convert to numeric

grades1 <- grades %>% 
  mutate(across(where(is.character), as.numeric))
str(grades1)

## tibble [1,000 × 11] (S3: tbl_df/tbl/data.frame)
##  $ gpa_2010: num [1:1000] 3.1 2.7 1.7 1.2 3 2.4 3.9 2 0.4 4.3 ...
##  $ gpa_2011: num [1:1000] 2.8 1.2 2.5 2.1 2.3 1 2 0.5 2 1.6 ...
##  $ gpa_2012: num [1:1000] 2.8 3.8 3.8 1.9 1.6 1.7 2.3 2.3 4.3 3.6 ...
##  $ gpa_2013: num [1:1000] 2.3 1.8 2.9 1.9 2.6 2.4 2.5 1.4 2.7 2.4 ...
##  $ gpa_2014: num [1:1000] 0.5 3 2.4 2.3 3.5 1.8 3.4 3.1 4.1 1.9 ...
##  $ gpa_2015: num [1:1000] 3.9 3 2.4 4.1 5.2 2.6 2.7 3.8 2.1 1.9 ...
##  $ gpa_2016: num [1:1000] 3 1.8 1.6 1 1.7 1.4 0.1 2.2 1.9 3.2 ...
##  $ gpa_2017: num [1:1000] 1.1 1 3.7 0.8 0.1 1.6 0.9 1.3 1.9 1.6 ...
##  $ gpa_2018: num [1:1000] 2.6 4 2.8 2.6 2.3 4.5 2.3 4.7 3.8 4.1 ...
##  $ gpa_2019: num [1:1000] 3.1 2.5 4.1 3.4 3.6 3.6 2.8 3 1.9 3.3 ...
##  $ gpa_2020: num [1:1000] 3.9 3 5.7 3.4 6 4.4 5.7 3.9 4 2.7 ...

# same but different
grades2 <- grades %>% 
  mutate(across(starts_with("gpa"), as.numeric))

grades3 <- grades %>% 
  mutate(across(contains("gpa"), as.numeric))

grades4 <- grades %>% mutate(across(1:11, as.numeric))
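In the grades dataframe all four selections pick the same columns, so the difference between the helpers is easy to miss. The following sketch uses a made-up one-row dataframe (toy, with invented column names) in which the helpers no longer agree.

```r
library(dplyr)

# Made-up data: "gpa" appears at the start of one name and at the end of another
toy <- tibble(gpa_2010 = "3.1", final_gpa = "2.7", name = "Ana")

# Only names that begin with "gpa"
starts <- toy %>% select(starts_with("gpa")) %>% names()

# Names with "gpa" anywhere
has <- toy %>% select(contains("gpa")) %>% names()

# All character columns, regardless of their name
chars <- toy %>% select(where(is.character)) %>% names()
```

starts holds only gpa_2010, has holds gpa_2010 and final_gpa, and chars holds all three columns, including name, which as.numeric() would turn into NA. Choosing the most specific helper protects you from converting columns by accident.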

Imagine you don’t want to convert all variables with a specific name or within a specific column range, but only very specific variables that do not appear in the same order. You can do this using across() and specifying the variable names in a vector c() separated by commas.

# only selected variables
grades5 <- students %>% mutate(across(c(gpa_2010, gpa_2011,gpa_2012), as.numeric))
str(grades5)

## tibble [1,000 × 23] (S3: tbl_df/tbl/data.frame)
##  $ faculty     : chr [1:1000] "Business" "Business" "Business" "Business" ...
##  $ course      : chr [1:1000] "accounting" "accounting" "accounting" "accounting" ...
##  $ age         : num [1:1000] 26.3 28.8 23.9 27.4 29.3 ...
##  $ cob         : chr [1:1000] "Spain" "Netherlands" "Netherlands" "Spain" ...
##  $ gpa_2010    : num [1:1000] 3.1 2.7 1.7 1.2 3 2.4 3.9 2 0.4 4.3 ...
##  $ gpa_2011    : num [1:1000] 2.8 1.2 2.5 2.1 2.3 1 2 0.5 2 1.6 ...
##  $ gpa_2012    : num [1:1000] 2.8 3.8 3.8 1.9 1.6 1.7 2.3 2.3 4.3 3.6 ...
##  $ gpa_2013    : chr [1:1000] "2.3" "1.8" "2.9" "1.9" ...
##  $ gpa_2014    : chr [1:1000] "0.5" "3" "2.4" "2.3" ...
##  $ gpa_2015    : chr [1:1000] "3.9" "3" "2.4" "4.1" ...
##  $ gpa_2016    : chr [1:1000] "3" "1.8" "1.6" "1" ...
##  $ gpa_2017    : chr [1:1000] "1.1" "1" "3.7" "0.8" ...
##  $ gpa_2018    : chr [1:1000] "2.6" "4" "2.8" "2.6" ...
##  $ gpa_2019    : chr [1:1000] "3.1" "2.5" "4.1" "3.4" ...
##  $ gpa_2020    : chr [1:1000] "3.9" "3" "5.7" "3.4" ...
##  $ job         : chr [1:1000] "no" "no" "no" "yes" ...
##  $ lifesat     : chr [1:1000] "68.4722360275967" "60.7386549043799" "67.2921321180378" "73.1391944810778" ...
##  $ like        : chr [1:1000] "3" "3" "4" "4" ...
##  $ relationship: chr [1:1000] "In a relationship" "Single" "In a relationship" "Single" ...
##  $ sex         : chr [1:1000] "Male" "Female" "Female" "Male" ...
##  $ term        : chr [1:1000] "7" "4" "5" "10" ...
##  $ university  : chr [1:1000] "Berlin" "Berlin" "Berlin" "Berlin" ...
##  $ workingclass: chr [1:1000] "yes" "yes" "no" "yes" ...

Now, let’s get the average grade for all years. Again, rather than doing it column by column, we will use one line to apply the mean function to all columns by using summarise() in combination with across().

## get mean for all grade variables
# wrap mean() in a small function so that na.rm = TRUE is passed on to it
mean_grades <- grades1 %>% summarise(across(1:11, function(x) mean(x, na.rm = TRUE)))

That was quick and easy. Now let’s get the mean by group, e.g. sex. Luckily, the across() function works well with the group_by() function that you are already very familiar with. Read this as: From the students dataframe, we select all variables containing the string “gpa” and the “sex” variable, we then convert all “gpa” containing variables to numeric variables, then group the data by sex to compute the mean value of all “gpa” containing variables for students based on sex while removing missing observations.

## get mean for all grade variables by group
mean_grades <- students %>%
  select(contains("gpa"), sex) %>%
  mutate(across(contains("gpa"), as.numeric)) %>% 
  group_by(sex) %>%
  summarise(across(contains("gpa"), function(x) mean(x, na.rm = TRUE)))
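If you do not have the students data at hand, the same pipeline can be tried on a small made-up dataframe. The values below are invented so that the group means are easy to check by hand.

```r
library(dplyr)

# Made-up data: grades stored as text, one missing value
toy <- tibble(
  sex      = c("Female", "Female", "Male", "Male"),
  gpa_2010 = c("1.0", "3.0", "2.0", NA),
  gpa_2011 = c("2.0", "4.0", "1.0", "3.0")
)

toy_means <- toy %>%
  mutate(across(contains("gpa"), as.numeric)) %>%   # text to numbers
  group_by(sex) %>%                                 # one group per sex
  summarise(across(contains("gpa"),
                   function(x) mean(x, na.rm = TRUE)))  # mean per group, ignoring NA
```

toy_means has one row per group: the Female means are 2 (2010) and 3 (2011), and the Male means are 2 and 2, because the missing 2010 value is ignored.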

Remember that we have solved the same problem using another method in an earlier session? We also wanted to get the average grade of students in different faculties. Rather than using a loop, we reshaped or restructured the data from wide to long format and then applied the mean() function to a single column (which then contained all grades stacked by year).

The approach you use depends on the specific task at hand and, to some degree, on your personal preferences.
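As a reminder of the reshaping alternative mentioned above, here is a minimal sketch on made-up data using pivot_longer() from the tidyr package. The dataframe toy and the new column names year and gpa are chosen for this illustration only.

```r
library(dplyr)
library(tidyr)

# Made-up data in wide format: one column per year
toy <- tibble(
  faculty  = c("Business", "Business", "Economics"),
  gpa_2010 = c(1, 3, 2),
  gpa_2011 = c(2, 4, 3)
)

long_means <- toy %>%
  # stack all gpa columns into one column "gpa", keep the year in "year"
  pivot_longer(cols = starts_with("gpa"),
               names_to = "year", values_to = "gpa") %>%
  group_by(faculty) %>%
  summarise(mean_gpa = mean(gpa, na.rm = TRUE))
```

Note that this gives one average per faculty over all years, whereas the across() version gives one average per faculty and year. Which of the two you need decides which approach is more convenient.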

11.2.2.2 apply()

We have already covered multiple average GPA scores we might be interested in with across(). Now, just to inform you of a different and similarly versatile function, let’s have a look at the apply() function family. apply() can apply (haha) functions to a dataframe; its variants such as lapply() or tapply() require a more specific input. Interesting for us now is the ability of apply() to apply a function to rows as well as columns. So, say for example that you want to find out each student’s average GPA over the years: what would that look like? We create a new dataframe “av_grades” based on one of our previously created “grades” dataframes where the GPA scores are already coded numerically. We then create a new variable using mutate() and apply(). The first input argument specifies the relevant data, the second the direction: 1 means applying the function to each row, 2 to each column. Finally, we have the function we are interested in (mean) and the optional argument taking care of missing observations.

# let's have a look at the apply documentation
?apply

# create a new variable av_gpa using apply 
av_grades <- grades1 %>% 
  mutate(av_gpa = apply(grades1, 1, mean, na.rm=TRUE))
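The role of the second argument is easiest to see on a tiny example. The sketch below uses a made-up dataframe with three students and two years, so that you can check the row and column means by hand.

```r
# Made-up data: three students (rows), two years (columns), one missing value
toy <- data.frame(
  gpa_2010 = c(1, 3, NA),
  gpa_2011 = c(2, 4, 5)
)

# Direction 1: one mean per row, i.e. per student
row_means <- apply(toy, 1, mean, na.rm = TRUE)

# Direction 2: one mean per column, i.e. per year
col_means <- apply(toy, 2, mean, na.rm = TRUE)
```

row_means contains 1.5, 3.5 and 5 (the third student has only one valid grade), while col_means contains 2 for gpa_2010 and 11/3 for gpa_2011.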

Extra: sapply(dataframe, class) returns the class of every column, and sapply(dataframe, unique) returns the unique values of every column.
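A quick sketch of these two calls on a made-up dataframe (the names and values are invented for illustration):

```r
# Made-up data with one text column and one numeric column
toy <- data.frame(
  name = c("Ana", "Ben", "Ana"),
  age  = c(21, 25, 30),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_classes <- sapply(toy, class)

# Unique values per column; a list here, because the columns
# have different numbers of unique values
col_uniques <- sapply(toy, unique)
```

sapply() tries to simplify its result. If every column happened to have the same number of unique values, you would get a matrix instead of a list, so lapply(toy, unique) is the safer choice when you always want a list.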

11.2.2.3 “for” loop

After having gotten to know and cherish the across() and apply() functions, you might wonder whether you can do all that manually as well. The answer, of course, is yes, but you would be well advised to simply use a pre-existing function whenever available. Still, let’s have a quick look into loops so that you can get a general understanding of the concept. This way, if you are ever in need of a loop, your Google queries will be a bit more proficient to start with.

The most general loop type is a for loop. Here, we can iterate over a vector and apply a function to every element in it (sounds familiar, doesn’t it?). The general structure consists of for, followed by “i in” and a vector in parentheses (), and a body in braces {}.

Read this as “for every element i in a given vector (i is commonly used but you can basically name it as you wish), apply whatever happens in between the {}.” Now, let’s consider our students data again: Imagine that for whatever reason, you want to shift the grade range by 1, meaning that you add 1 to all GPA values.

# first loop 
for(col in 1:ncol(grades2)) {        
  grades2[ , col] <- grades2[ , col] + 1 
}

Let’s see what happened here: This for loop iterates over every column number col of the dataframe grades2. Note here the syntax 1:ncol(), which means starting from 1 up to the number of columns in the specified dataframe. This allows us to index the dataframe to specify exactly to which cells we want to apply something. Now, moving on to the body (the part in between {}), we replace the values in the columns of grades2 by their original value plus 1. The [ , col] indexing part is new: square brackets allow us to access cells in a dataframe by specifying their coordinates [row, column]. Here, we leave the row space empty because the operation applies to all rows, and only indicate the column we are interested in using the col variable from the loop initialization. What happens now is basically that col takes on every value between 1 and the number of columns in grades2. This means that in the first round of the loop, col is equal to 1, so that R processes grades2[ , 1] <- grades2[ , 1] + 1, thereby adding 1 to each value in the first column of grades2 (“gpa_2010”). This is repeated for every column, leaving all values in grades2 greater by 1.
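To watch the loop work round by round, you can run the same pattern on a small made-up dataframe and let R report which column it is processing. The dataframe toy and the message are additions for illustration.

```r
# Made-up data: two students, two years
toy <- data.frame(
  gpa_2010 = c(1.0, 2.5),
  gpa_2011 = c(3.0, 1.5)
)

shifted <- toy   # work on a copy so that toy stays unchanged

for (col in 1:ncol(shifted)) {
  # report the current round of the loop
  message("Round ", col, ": adding 1 to column ", names(shifted)[col])
  shifted[, col] <- shifted[, col] + 1
}

# For comparison: R can also add 1 to all cells at once
vectorised <- toy + 1
```

shifted and vectorised contain the same values. For a simple operation like this one, the one-line version is preferable; the loop mainly shows what happens behind the scenes.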

Obviously, loops can get much more complicated than this, so take this as a simple introduction to the general idea of loops. Besides for loops, there are also while and repeat loops and other ways to introduce conditions. The resources listed in section 11.5 contain further examples.
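For comparison, here is a minimal sketch of a while loop. Unlike a for loop, it does not run over a fixed vector but keeps going as long as a condition is TRUE. The variable names are chosen for this example only.

```r
count <- 0   # how many rounds the loop has run
total <- 0   # running sum 1 + 2 + 3 + ...

# Keep adding the next whole number until the sum reaches at least 10
while (total < 10) {
  count <- count + 1
  total <- total + count
}
```

The running sum goes 1, 3, 6, 10, so the loop stops after four rounds with count equal to 4 and total equal to 10. Always make sure the condition can become FALSE at some point, otherwise the loop never ends.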

11.2.3 User-defined Functions

You already know functions. You have been using them a lot already: mean() is a function, summary() is a function. A function in R is just a name for a pre-defined operation that is performed on an object. The mean() function can be applied to a vector of numbers and will then give you the arithmetic mean. While calculating the mean may seem rather straightforward, functions are generally suitable shortcuts for operations we want to perform repeatedly.

A function always requires some sort of “input,” called arguments, between its brackets, then it will “process” the input of the argument to return a specific “output.” In R, you can write your own customized functions. For more advanced users, this can speed up the work or actually make it easier to get somewhere where conventional commands won’t get you easily.

Tip: Before spending a lot of time writing your own function, check online whether somebody has already done it. Just google “How to …in R.”

The structure of a function is quite straightforward. You define the name of the function. Then you can include arguments as input into the function – although they are not necessarily required in all cases. If your function takes two arguments as input as seen below, make sure to also always specify two input arguments when applying the function (or set some default values for the argument when building the function). Lastly, you have the body of the function where you put the operation (a formula, an algorithm, an equation…etc.).

my_fun <- function(arg1, arg2) { 
  body 
}

Here is a simple example of a new function called “triple.” All it does is multiply an input number by 3 and return the tripled number as output. When we apply triple() to the object a (which is 5), we see the output 15 in the console. If we call object a again, we can see that a is still 5 and has not been overwritten by this function.

# Function
triple <- function(x) { 
  x <- 3*x 
  x 
} 

# Test
a <- 5 
triple(a)

## [1] 15

# a itself has not been changed
a

## [1] 5
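Building on the earlier remark about default values for arguments, here is a short sketch of a function with two arguments, one of which has a default. The function add_points() is made up for this illustration.

```r
# Hypothetical function: adds a bonus to a grade.
# bonus has a default value, so it can be left out when calling the function.
add_points <- function(x, bonus = 1) {
  x + bonus
}

with_default <- add_points(2)               # uses bonus = 1
with_custom  <- add_points(2, bonus = 0.5)  # overrides the default
```

with_default is 3 and with_custom is 2.5. Without the default, calling add_points(2) would fail because R would not know which value to use for bonus.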

Above we used a type of tidyverse loop to get the mean of multiple columns containing information on grades. Imagine we want to transform the grades so that the lowest observed grade becomes 0 and the highest observed grade becomes 1. We are basically re-scaling the grades to scores between 0 and 1 (this is also called min-max normalization).

The formula is: \(\frac{x - \min(x)}{\max(x) - \min(x)}\), so the student’s grade minus the lowest observed grade, divided by the maximum observed grade minus the minimum grade. If we wanted to create a function in R to perform that calculation, it would look like this. The new function called “min_max_norm” is a function of x (any numeric input).

# write min-max formula
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

When you run this, a new function pops up in your environment (see upper right corner).
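Before applying the function to real data, it helps to try it on a few made-up numbers where the result can be checked by hand. The sketch repeats the function definition so that it runs on its own.

```r
# Same definition as above, repeated so the example is self-contained
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up grades: minimum is 1, maximum is 5
scaled <- min_max_norm(c(1, 2, 3, 5))

# With a missing value, min() and max() return NA,
# so every element of the result is NA as well
with_na <- min_max_norm(c(1, NA, 5))
```

scaled contains 0, 0.25, 0.5 and 1. The second call shows why missing values have to be removed before using this version of the function.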

Now, let’s apply it to a column in our dataframe. R applies our own calculation to every element of the column. Notice that we do not write an argument x between the brackets of min_max_norm(): the pipe %>% passes the result of the previous steps (the selected gpa_2010 column without missing values) to the function as its input.

# apply to one column
grades5 <- grades1 %>% select(gpa_2010) %>%
  filter(!is.na(gpa_2010)) %>%
  min_max_norm()

Let’s try to combine this with the previous example of a loop across columns. We want the min-max transformation for all grade variables, not just 2010.

## combine with loop
grades6 <- grades1 %>%
  drop_na() %>%
  mutate(across(contains("gpa"), min_max_norm))
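The same combination can be checked on a small made-up dataframe. The sketch repeats the function definition and loads tidyr for drop_na() so that it runs on its own; the data are invented.

```r
library(dplyr)
library(tidyr)

# Same definition as above, repeated so the example is self-contained
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up data with one missing grade
toy <- tibble(
  gpa_2010 = c(1, 2, NA, 3),
  gpa_2011 = c(2, 4, 3, 6)
)

toy_norm <- toy %>%
  drop_na() %>%                                    # removes the whole third row
  mutate(across(contains("gpa"), min_max_norm))    # rescale each column separately
```

After drop_na(), three rows remain, and both columns become 0, 0.5 and 1. Keep in mind that drop_na() removes every row with a missing value in any column, so a student with a single missing grade disappears from all years.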

11.3 Exercises I (based on class data)

Your successful work for the university president got around and now two departments have asked you to work for them. Congratulations! After feeling a bit overwhelmed at first, you remember that most of the things you had to do in your first job are quite repetitive. With some clever shortcutting you will save time and will easily be able to help the other two departments with their R problems for a couple of hours a week each. To be convinced that you can work for both and still have enough time to study and relax, the heads of department ask you to put together a quick demonstration of your shortcutting skills.

1. First, you want to load the students data and the tidyverse and janitor packages as usual, but for the sake of showing off, you use the pacman package to load the packages!
2. From previous work for the university president, you remember that all variables in the students dataframe are initially coded as character variables. To demonstrate that you can do more than convert variables to numeric, you decide to convert all character variables besides grades, term, life satisfaction and age into factor variables. Use a shortcut!
3. While going through the materials you were given, you suddenly realize that someone messed with the term count. The rest of the dataframe is up to date, but the term variable still shows the term students were in a year ago. Write a function to correct that error and apply it to the students dataset! Hint: Remember to check whether you need to convert it first!
4. You are very happy with your progress and decide to take a lunch break. While standing in line, someone asks you how you're doing. Of course, this immediately reminds you of the life satisfaction variable and you recognize your chance to change it in your show-off report to the departments! You decide to change its range using the normalization function min_max_norm() you encountered in class.
5. Motivated by your success, you decide to go one step further and change the age variable. Instead of normalizing it, you decide to standardize it and write your own function! Hint: standardizing a variable means subtracting its mean and dividing by its standard deviation.

11.4 Exercises II (based on your own data)

Identify ways to implement shortcuts in the dataset you have been working on:
1. Convert at least two variables to their correct data type using one of the across() shortcuts introduced this week.
2. Normalize at least one variable using the min_max_norm() function we created.
3. Explore your data and come up with a helpful function. For example, your dataset may have a lot of variables with country names and you would like to update them because their official title has changed. Here, a custom function applied to multiple columns would make sense! Hint: remember the case_when() function we encountered in week 3?

11.5 More helpful resources and online tutorials:

On loops and iteration in R: 21 Iteration | R for Data Science (had.co.nz)
Examples of for loops: R Loop Through Data Frame Columns & Rows (4 Examples) | for & while (statisticsglobe.com)
Introduction to string manipulation via the package stringr