Introduction

Before we get into the meat of things, a bit about the organization of this document: I start with basic R stuff and proceed through different processes and libraries. Feel free to skip around, but if you don’t understand something syntactically, it might be wise to review earlier sections. (Or if you’re really stuck, reach out!)

There are a few cheat sheets that I’d recommend, depending on what you’re after. They don’t necessarily show the same methods or formats as found in this document, but that’s okay! There are many ways to do the same thing in R—what’s important is that you understand how it works and that you get the result you’re after. The links to the cheat sheets are:

Help files are very useful, and can be accessed in RStudio’s Help pane or by typing ?function_name into the console. (That is, if you’d like to know more about mean(), you’d use ?mean.) Other than that, Googling is your best friend—try to be as specific as possible, and if you’re getting any error messages, definitely include those.

The main takeaway from this should ideally be that if you can think of something you’d like to do in R, you can—remember that. I hope that this document is helpful, and happy coding!

Importing data

Obviously, you can import data in different ways. The most common method will likely be reading in an Excel file or csv, which is pretty simple to load in as a data frame. You can also load data (including things that you’ve saved) from .RData files for the whole environment, or from .rds files for particular variables/data frames. There are also, of course, ways to load in data that’s stored in .dta (Stata) format. Other libraries will let you load in other types of files, but these are the main ones that you’ll need to know to get started. See the code below for different examples of this.

zillow <- read.csv("C:/Users/knigh/Downloads/zillow_dataset.csv")
head(zillow[,3:8])
##     RegionName RegionType StateName X2000.01.31 X2000.02.29 X2000.03.31
## 1   California      state        NA   187783.93   188416.66    189269.8
## 2        Texas      state        NA   107911.13   107968.44    107995.5
## 3      Florida      state        NA   104513.44   104740.17    105014.7
## 4     New York      state        NA   129469.17   129930.91    130374.9
## 5 Pennsylvania      state        NA    92347.12    92544.77     92730.5
## 6     Illinois      state        NA   114873.73   114966.85    115167.0
library(readxl) #easiest for Excel files
excel <- read_excel('C:/Users/knigh/Downloads/Financial Sample.xlsx')
head(excel)
## # A tibble: 6 × 16
##   Segment    Country Product  `Discount Band` `Units Sold` `Manufacturing Price`
##   <chr>      <chr>   <chr>    <chr>                  <dbl>                 <dbl>
## 1 Government Canada  Carrete… None                   1618.                     3
## 2 Government Germany Carrete… None                   1321                      3
## 3 Midmarket  France  Carrete… None                   2178                      3
## 4 Midmarket  Germany Carrete… None                    888                      3
## 5 Midmarket  Mexico  Carrete… None                   2470                      3
## 6 Government Germany Carrete… None                   1513                      3
## # ℹ 10 more variables: `Sale Price` <dbl>, `Gross Sales` <dbl>,
## #   Discounts <dbl>, Sales <dbl>, COGS <dbl>, Profit <dbl>, Date <dttm>,
## #   `Month Number` <dbl>, `Month Name` <chr>, Year <chr>
library(haven) #for Stata files
gdp <- read_dta('http://www.principlesofeconometrics.com/stata/gdp.dta') #works for local files, too
head(gdp)
## # A tibble: 6 × 2
##     usa   aus
##   <dbl> <dbl>
## 1  38.3  38.2
## 2  38.4  38.8
## 3  38.7  38.8
## 4  38.3  38.9
## 5  39.4  39.6
## 6  39.6  39.6

There are different ways that R can store and access files on your computer. You can set and access your current working directory, along with what files are in it:

setwd('C:/Users/knigh/OneDrive/Documents/RStudio/')
getwd()
## [1] "C:/Users/knigh/OneDrive/Documents/RStudio"
list.files()
## [1] "823.RData"                        "E-Scooter_Trips_-_2019_Pilot.csv"
## [3] "image.RData"                      "image_real.RData"                
## [5] "rbackups"                         "replication_ec823"               
## [7] "RLIS_Portland"                    "x.rds"                           
## [9] "y.rds"

get a list of what variables you’ve created:

ls()
## [1] "excel"  "gdp"    "zillow"

save and load an image of all variables:

save.image('image.RData') #store all variables
rm(list=ls()) #deletes all variables in memory
ls()
## character(0)
load('image.RData') 
ls()
## [1] "excel"  "gdp"    "zillow"

and save/load single variables as .rds files:

x <- 1:10
saveRDS(x, 'x.rds')
y <- readRDS('x.rds')
x == y
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

For file paths on your computer, especially if you’re using Windows, keep in mind that you can’t use single backslashes, since the backslash is an escape character in R strings. Use forward slashes (or doubled backslashes, \\) instead. So rather than accessing your C:\ drive, it’d be your “C:/” (or “C:\\”) drive.
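As a quick illustration (using the same hypothetical Windows folder as the examples above), these all describe the same location:

```r
# A single backslash won't work: "C:\Users" is an escape-sequence error.
path1 <- "C:/Users/knigh/Downloads"    # forward slashes
path2 <- "C:\\Users\\knigh\\Downloads" # doubled (escaped) backslashes
# path2 is the same path, just stored with backslashes in place of slashes:
identical(gsub("\\\\", "/", path2), path1) # TRUE
# file.path() builds a forward-slash path from pieces:
path3 <- file.path("C:", "Users", "knigh", "Downloads")
identical(path3, path1) # TRUE
```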

Base data manipulation

As an example, we’re going to load the “mtcars” dataset, which comes built into R, to see a few useful (base R) functions and ways to access data.

data('mtcars') #loading from a library. Not typical
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

head() displays the first few rows of a data frame (generally 6, though you can specify this):

head(mtcars, 15)
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230           22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C          17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE         16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

If you’d like to see the full dataset, you can use View(), as below.

View(mtcars) #doesn't work in a markdown document, unfortunately

summary() lets us look at some information for every (numeric) column in the dataset, which might be helpful to get a handle on what’s all going on in it.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

If you want to access particular elements of a data frame, you can do that with brackets, specifying [row, column]. If you don’t specify a row/column, all values of that row/column will be included (that is, df[,] just returns all of df).

mtcars[10:20,] #rows 10-20
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
mtcars[,1] #first column
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Alternatively, you can specify a column by name either with the dollar sign ($) or by feeding in the name as a string, as you can see below. Aside from the single-bracket form (which returns a one-column data frame rather than a vector), these are all equivalent ways to extract a column.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars['mpg'] #still a data frame
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4
mtcars[['mpg']] #vector
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

You can also assign new columns as below:

mtcars$mpg2 <- mtcars$mpg + 5
head(mtcars, 8) #to demonstrate
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb mpg2
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 26.0
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 26.0
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 27.8
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 26.4
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 23.7
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 23.1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 19.3
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 29.4

This all works, but a lot of the time it’s not the easiest or most legible way to make changes. I and many others prefer packages that are part of the “tidyverse”, which give us a few new functions and a different syntax for data manipulation. That said, the basic ways are still helpful to know. Everything works in base R; it just might be more confusing to implement or read.

For the rest of this document, we’ll be using tidyverse conventions. There are other options for similar results, the most prominent of which is data.table, but the tidyverse is very legible. If you’re working with very large data, it might be worthwhile to explore other options, but this is an edge case.

Tidyverse data manipulation

The main difference is in the existence and use of the pipe operator, %>%. It allows us to take the result of one operation and “pipe” it into another function, which is great for legibility. Without a pipe operator, applying multiple functions means either using multiple lines and intermediate steps, or nesting the functions, which then have to be read from the inside out to be understood. For instance, below are two ways to take a random sample of 25 integers from 1–1000 and find their mean, but one is much clearer to read than the other:

set.seed(1)
mean(sample(1:1000, 25))
## [1] 551.92
library(tidyverse)
set.seed(1)
1:1000 %>% sample(25) %>% mean()
## [1] 551.92

(The set.seed() function allows us to get the same random numbers every time, which is good for reproducibility.)

Note that, specifically, the pipe operator inserts the preceding result as the first argument of a function. If you need things in a different order, you can use the period (.) to tell the function where to place the piped object:

set.seed(2)
1:10 %>% sample(., 4) #unnecessary, but helps illustrate the point
## [1] 5 6 9 1
set.seed(2)
4 %>% sample(1:10, .)
## [1] 5 6 9 1

This is not only an asset to regular function use; it also lets us work with data frames in a much smoother way, because each of the tidyverse’s data-manipulation functions takes a data frame as its first argument and returns one. glimpse() lets us look at data frames and the types of each column, though sideways:

mtcars <- mtcars %>% mutate(rowname = rownames(.)) %>%
  as_tibble() %>% select(rowname, where(is.numeric)) #better printing
mtcars %>% glimpse() #note the sideways display!
## Rows: 32
## Columns: 13
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg     <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl     <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp    <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp      <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat    <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt      <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec    <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am      <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear    <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb    <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2    <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…

mutate() allows us to change/add columns as we go:

mtcars %>% mutate(test=1, test2=FALSE, test3=ifelse(wt > 3, 1, 0)) %>% glimpse()
## Rows: 32
## Columns: 16
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg     <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl     <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp    <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp      <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat    <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt      <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec    <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am      <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear    <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb    <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2    <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ test    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ test2   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ test3   <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,…

and filter() lets us choose a particular subset of the data.

mtcars %>% filter(cyl < 6)
## # A tibble: 11 × 13
##    rowname       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
##  2 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
##  3 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
##  4 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
##  5 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
##  6 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
##  7 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
##  8 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
##  9 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
## 10 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 11 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
## # ℹ 1 more variable: mpg2 <dbl>

Note that none of these actually change the underlying dataset, because we haven’t stored any of these alterations:

mtcars %>% head(10)
## # A tibble: 10 × 13
##    rowname       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ℹ 1 more variable: mpg2 <dbl>
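To actually keep an alteration, assign the piped result back to a variable. A minimal sketch with filter() (the row counts follow from the outputs shown earlier):

```r
library(dplyr)
data(mtcars)
# Piping alone doesn't modify mtcars; store the result to keep it:
four_cyl <- mtcars %>% filter(cyl == 4)
nrow(mtcars)   # still 32
nrow(four_cyl) # 11
```

To overwrite the original instead, assign to the same name, e.g. mtcars <- mtcars %>% filter(cyl == 4).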

We can also use pull() to extract a single column as a vector, or use the dot operator to do the same.

mtcars %>% .$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars %>% pull(mpg)
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

select() lets us pick particular columns, and we can do “-col_name” to drop a column.

mtcars %>% select(mpg, am) #select only certain columns
## # A tibble: 32 × 2
##      mpg    am
##    <dbl> <dbl>
##  1  21       1
##  2  21       1
##  3  22.8     1
##  4  21.4     0
##  5  18.7     0
##  6  18.1     0
##  7  14.3     0
##  8  24.4     0
##  9  22.8     0
## 10  19.2     0
## # ℹ 22 more rows
mtcars %>% select(-mpg) #drop column
## # A tibble: 32 × 12
##    rowname       cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  mpg2
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4       6  160    110  3.9   2.62  16.5     0     1     4     4  26  
##  2 Mazda RX4 …     6  160    110  3.9   2.88  17.0     0     1     4     4  26  
##  3 Datsun 710      4  108     93  3.85  2.32  18.6     1     1     4     1  27.8
##  4 Hornet 4 D…     6  258    110  3.08  3.22  19.4     1     0     3     1  26.4
##  5 Hornet Spo…     8  360    175  3.15  3.44  17.0     0     0     3     2  23.7
##  6 Valiant         6  225    105  2.76  3.46  20.2     1     0     3     1  23.1
##  7 Duster 360      8  360    245  3.21  3.57  15.8     0     0     3     4  19.3
##  8 Merc 240D       4  147.    62  3.69  3.19  20       1     0     4     2  29.4
##  9 Merc 230        4  141.    95  3.92  3.15  22.9     1     0     4     2  27.8
## 10 Merc 280        6  168.   123  3.92  3.44  18.3     1     0     4     4  24.2
## # ℹ 22 more rows

Here’s a bit more with mutate(); if we’d like to perform mathematical operations with any variables, we can access them by name directly, as below.

mtcars %>% mutate(hp_wt = hp / wt) %>% glimpse()
## Rows: 32
## Columns: 14
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg     <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl     <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp    <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp      <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat    <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt      <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec    <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am      <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear    <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb    <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2    <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ hp_wt   <dbl> 41.98473, 38.26087, 40.08621, 34.21462, 50.87209, 30.34682, 68…
mtcars %>% mutate(hp_wt = hp / wt) %>% pull(hp_wt)
##  [1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745 19.43574
##  [9] 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905 39.04762 39.63864
## [17] 43.03087 30.00000 32.19814 35.42234 39.35091 42.61364 43.66812 63.80208
## [25] 45.51365 34.10853 42.52336 74.68605 83.28076 63.17690 93.83754 39.20863

Now, we’d like to tidy up the makes and models of the cars we’re working with here. I’m going to use regular expressions to do this, which might look a bit complicated, but all I’m effectively doing is splitting the string by the first space. (Some cars have multiple words as the model.)

#preparing makes and models
makes <- mtcars %>% pull(rowname) %>% gsub('([^ ]+) (.+)', '\\1', .)
models <- mtcars %>% pull(rowname) %>% gsub('([^ ]+) (.+)', '\\2', .)
makes #to see
##  [1] "Mazda"    "Mazda"    "Datsun"   "Hornet"   "Hornet"   "Valiant" 
##  [7] "Duster"   "Merc"     "Merc"     "Merc"     "Merc"     "Merc"    
## [13] "Merc"     "Merc"     "Cadillac" "Lincoln"  "Chrysler" "Fiat"    
## [19] "Honda"    "Toyota"   "Toyota"   "Dodge"    "AMC"      "Camaro"  
## [25] "Pontiac"  "Fiat"     "Porsche"  "Lotus"    "Ford"     "Ferrari" 
## [31] "Maserati" "Volvo"
models
##  [1] "RX4"         "RX4 Wag"     "710"         "4 Drive"     "Sportabout" 
##  [6] "Valiant"     "360"         "240D"        "230"         "280"        
## [11] "280C"        "450SE"       "450SL"       "450SLC"      "Fleetwood"  
## [16] "Continental" "Imperial"    "128"         "Civic"       "Corolla"    
## [21] "Corona"      "Challenger"  "Javelin"     "Z28"         "Firebird"   
## [26] "X1-9"        "914-2"       "Europa"      "Pantera L"   "Dino"       
## [31] "Bora"        "142E"

Now we can store this and use the new columns to filter our dataset!

mtcars <- mtcars %>% mutate(make = makes, model = models)
mtcars %>% glimpse()
## Rows: 32
## Columns: 15
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg     <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl     <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp    <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp      <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat    <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt      <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec    <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am      <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear    <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb    <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2    <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ make    <chr> "Mazda", "Mazda", "Datsun", "Hornet", "Hornet", "Valiant", "Du…
## $ model   <chr> "RX4", "RX4 Wag", "710", "4 Drive", "Sportabout", "Valiant", "…
mtcars %>% filter(make=='Hornet')
## # A tibble: 2 × 15
##   rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 2 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% filter(make=='Hornet', cyl > 6)
## # A tibble: 1 × 15
##   rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% filter(make=='Hornet' | cyl > 6)
## # A tibble: 15 × 15
##    rowname       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  2 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  3 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  4 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
##  5 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
##  6 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
##  7 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
##  8 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
##  9 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
## 10 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
## 11 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
## 12 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
## 13 Pontiac Fi…  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
## 14 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
## 15 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>

For filter(), note that & works for “and” and | works for “or”. Also, you need double equals signs (==) for equality tests.

Regex

This is very much something that you’ll need to practice more than I can teach you, but I’d advise learning regex if you have to work with strings—it makes life easier. R’s regular expressions work generally like any others, though there are a couple of differences worth acknowledging. If you’d like to learn regular expressions, I recommend the website regexr.com to test things out; it has a built-in cheatsheet, which is nice. R works similarly, but capture groups (things to identify and keep) are accessed with \\1 rather than $1, and some characters require additional backslashes compared to the JavaScript flavor that regexr uses. The main functions you might be interested in are below, though the section above contains a more complicated example of regular expressions if you’re curious.

grepl(pattern='[a-z]', 'test') #is there a lowercase letter in "test"?
## [1] TRUE
gsub(pattern='[a-z]', replacement='', 'LaTeX') #replace all instances of a lowercase letter with nothing
## [1] "LTX"

In general, square brackets denote a set of possible character matches, curly braces specify the number of repetitions, and parentheses denote capture groups. Periods are wildcards, * matches 0 or more instances, + matches 1 or more instances, and ? matches 0 or 1 instances. This is very much the kind of thing that’s hard to learn until you really need it, but it’s helpful.
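To make those pieces concrete, here are a few illustrative patterns (the patterns themselves are just examples) run against some of the car names from earlier:

```r
# [A-Z] is a character set; {2} means exactly two repetitions
grepl('^[A-Z]{2}', 'RX4 Wag')         # TRUE: starts with two capitals
# . is a wildcard, and ? makes it optional (0 or 1 instances)
grepl('^Merc 450S.?$', 'Merc 450SL')  # TRUE: the optional character is "L"
grepl('^Merc 450S.?$', 'Merc 450SLC') # FALSE: two characters after the S
# parentheses capture; \\1 and \\2 refer back to the captured pieces
gsub('([0-9]+)([A-Z]*)', 'digits=\\1 rest=\\2', '450SLC')
# [1] "digits=450 rest=SLC"
```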

Summarizing, reshaping, joins

None of what we’ve done thus far has really changed the shape of our data very much—we’ve added columns and changed things a little bit, but what if the data’s in the wrong form altogether? One easy example would be data that’s far too wide. The Zillow dataset that we imported earlier has 51 rows (one for each state, plus DC) and way too many columns, since many of the columns are monthly values. We can use tools like pivot_longer():

zillow2 <- zillow %>% pivot_longer(starts_with('X')) 
zillow2 %>% head(10)
## # A tibble: 10 × 7
##    RegionID SizeRank RegionName RegionType StateName name          value
##       <int>    <int> <chr>      <chr>      <lgl>     <chr>         <dbl>
##  1        9        0 California state      NA        X2000.01.31 187784.
##  2        9        0 California state      NA        X2000.02.29 188417.
##  3        9        0 California state      NA        X2000.03.31 189270.
##  4        9        0 California state      NA        X2000.04.30 191121.
##  5        9        0 California state      NA        X2000.05.31 193271.
##  6        9        0 California state      NA        X2000.06.30 195590.
##  7        9        0 California state      NA        X2000.07.31 198040.
##  8        9        0 California state      NA        X2000.08.31 200598.
##  9        9        0 California state      NA        X2000.09.30 203164.
## 10        9        0 California state      NA        X2000.10.31 205597.
zillow2 <- zillow2 %>% 
  mutate(year = name %>% gsub('^X([0-9]{4}).([0-9]{2}).([0-9]{2})', '\\1', .) %>% as.numeric(),
         month = name %>% gsub('^X([0-9]{4}).([0-9]{2}).([0-9]{2})', '\\2', .) %>% as.numeric()) %>%
    mutate(pd = 1:n(), .by=RegionName)
zillow2 %>% filter(year==2008, month==8)
## # A tibble: 51 × 10
##    RegionID SizeRank RegionName    RegionType StateName name   value  year month
##       <int>    <int> <chr>         <chr>      <lgl>     <chr>  <dbl> <dbl> <dbl>
##  1        9        0 California    state      NA        X200… 3.65e5  2008     8
##  2       54        1 Texas         state      NA        X200… 1.38e5  2008     8
##  3       14        2 Florida       state      NA        X200… 1.93e5  2008     8
##  4       43        3 New York      state      NA        X200… 2.38e5  2008     8
##  5       47        4 Pennsylvania  state      NA        X200… 1.57e5  2008     8
##  6       21        5 Illinois      state      NA        X200… 1.77e5  2008     8
##  7       44        6 Ohio          state      NA        X200… 1.21e5  2008     8
##  8       16        7 Georgia       state      NA        X200… 1.54e5  2008     8
##  9       36        8 North Caroli… state      NA        X200… 1.64e5  2008     8
## 10       30        9 Michigan      state      NA        X200… 1.15e5  2008     8
## # ℹ 41 more rows
## # ℹ 1 more variable: pd <int>

There’s a bit going on here, so let me explain. pivot_longer() takes all the columns we’ve selected (those that start with “X”, all the monthly values) and makes them into two columns: “name” and “value”. “name” is the original column name, “value” is the value. We then use regular expressions to extract the month and year information, because the formatting of all names is XYYYY.MM.DD, where DD is just the last day of the month. We then add a new variable, pd, which helps us turn year/month info into time-period info. The .by argument means that the mutated values are calculated by RegionName (state, actually), rather than counting linearly through the whole data frame. If it wasn’t already in order, we’d need to arrange() the data before the counting.

In practice, commands like these are built up over time. This is one reason that not immediately overwriting your data is an asset—you can look at the result of any intermediate step and make sure everything worked before moving on to the next one. The second part of this, though, is that once it's built, I can run one command to get the exact result that I want!

A quick aside—if we’re working with dates and times frequently, we might want to use the lubridate package (cheat sheet here), which makes operations simpler:

library(lubridate)
zillow3 <- zillow2 %>% mutate(time = gsub('X', '', name) %>% ymd())
zillow3 %>% glimpse()
## Rows: 14,331
## Columns: 11
## $ RegionID   <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
## $ SizeRank   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ RegionName <chr> "California", "California", "California", "California", "Ca…
## $ RegionType <chr> "state", "state", "state", "state", "state", "state", "stat…
## $ StateName  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ name       <chr> "X2000.01.31", "X2000.02.29", "X2000.03.31", "X2000.04.30",…
## $ value      <dbl> 187783.9, 188416.7, 189269.8, 191120.9, 193270.9, 195590.1,…
## $ year       <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
## $ month      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
## $ pd         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ time       <date> 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-30, 2000-05-31…
zillow3$time[2] - zillow3$time[1] #end of jan to end of feb
## Time difference of 29 days
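With lubridate loaded, you can also skip the regex work entirely—accessor functions pull the pieces straight out of a date. A small sketch (note the %m+% operator, which adds months without falling off the end of shorter months):

```r
library(lubridate)
d <- ymd('2008-08-31')
year(d)                 #2008
month(d)                #8
floor_date(d, 'month')  #first day of the month: 2008-08-01
d %m+% months(1)        #2008-09-30, since September has no 31st
```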

There are other things we might want to do, though. We’re going to return to the mtcars dataset here; say we want to know how many models from each manufacturer are included in the dataset. We can use group_by() to…well, group the data by some variable(s) and derive some results.

mtcars %>% glimpse()
## Rows: 32
## Columns: 15
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg     <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl     <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp    <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp      <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat    <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt      <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec    <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am      <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear    <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb    <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2    <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ make    <chr> "Mazda", "Mazda", "Datsun", "Hornet", "Hornet", "Valiant", "Du…
## $ model   <chr> "RX4", "RX4 Wag", "710", "4 Drive", "Sportabout", "Valiant", "…
mtcars %>% group_by(make) %>% summarize(N=n())
## # A tibble: 22 × 2
##    make         N
##    <chr>    <int>
##  1 AMC          1
##  2 Cadillac     1
##  3 Camaro       1
##  4 Chrysler     1
##  5 Datsun       1
##  6 Dodge        1
##  7 Duster       1
##  8 Ferrari      1
##  9 Fiat         2
## 10 Ford         1
## # ℹ 12 more rows

If we want to sort things, we can use arrange():

mtcars %>% group_by(make) %>% summarize(N=n()) %>% arrange(N) #lowest first
## # A tibble: 22 × 2
##    make         N
##    <chr>    <int>
##  1 AMC          1
##  2 Cadillac     1
##  3 Camaro       1
##  4 Chrysler     1
##  5 Datsun       1
##  6 Dodge        1
##  7 Duster       1
##  8 Ferrari      1
##  9 Ford         1
## 10 Honda        1
## # ℹ 12 more rows
mtcars %>% group_by(make) %>% summarize(N=n()) %>% arrange(-N) #highest first
## # A tibble: 22 × 2
##    make         N
##    <chr>    <int>
##  1 Merc         7
##  2 Fiat         2
##  3 Hornet       2
##  4 Mazda        2
##  5 Toyota       2
##  6 AMC          1
##  7 Cadillac     1
##  8 Camaro       1
##  9 Chrysler     1
## 10 Datsun       1
## # ℹ 12 more rows

And we can also group by multiple variables, giving us information about more particular subsets. Be careful: it groups by exact values, not bins, so if you'd like to group by ranges (say, having 1-3 cylinders), you'd need to make another variable specifying as much.

mtcars %>% group_by(cyl, am) %>% summarize(N=n(), mean_hp = mean(hp))
## # A tibble: 6 × 4
## # Groups:   cyl [3]
##     cyl    am     N mean_hp
##   <dbl> <dbl> <int>   <dbl>
## 1     4     0     3    84.7
## 2     4     1     8    81.9
## 3     6     0     4   115. 
## 4     6     1     3   132. 
## 5     8     0    12   194. 
## 6     8     1     2   300.

Any function that gives you information about the makeup of your data—mean, standard deviation, max, min, etc.—works here. This output tells us, for example, that there are three 4-cylinder cars with automatic transmissions (am == 0) in the dataset. Neat!
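If you do want ranges rather than exact values, cut() is one way to build that extra bin variable first; the breaks and labels below are just for illustration:

```r
library(dplyr)
#cut() assigns each value to an interval: here (0,5] and (5,8]
binned <- mtcars %>%
  mutate(cyl_bin = cut(cyl, breaks = c(0, 5, 8),
                       labels = c('up to 5 cylinders', '6 to 8 cylinders'))) %>%
  group_by(cyl_bin) %>%
  summarize(N = n(), mean_hp = mean(hp))
binned
```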

We can also do a bit more than that: if we'd like to summarize many variables, we can use across() to select them and state what we'd like to do with them. Here we supply several functions at once, naming them so the output columns are named the way we want, and we can even tweak arguments: the ~fn(.x) syntax is needed for adjustments like na.rm=TRUE (which makes mean() drop missing values), where .x stands in for the column being summarized.

mtcars %>% group_by(cyl) %>% 
  summarize(across(where(is.numeric), 
                   c(mn=~mean(.x, na.rm=TRUE), sd=sd)))
## # A tibble: 3 × 23
##     cyl mpg_mn mpg_sd disp_mn disp_sd hp_mn hp_sd drat_mn drat_sd wt_mn wt_sd
##   <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>
## 1     4   26.7   4.51    105.    26.9  82.6  20.9    4.07   0.365  2.29 0.570
## 2     6   19.7   1.45    183.    41.6 122.   24.3    3.59   0.476  3.12 0.356
## 3     8   15.1   2.56    353.    67.8 209.   51.0    3.23   0.372  4.00 0.759
## # ℹ 12 more variables: qsec_mn <dbl>, qsec_sd <dbl>, vs_mn <dbl>, vs_sd <dbl>,
## #   am_mn <dbl>, am_sd <dbl>, gear_mn <dbl>, gear_sd <dbl>, carb_mn <dbl>,
## #   carb_sd <dbl>, mpg2_mn <dbl>, mpg2_sd <dbl>

If we’d like to, say, keep all the values in the original dataset but add some mean value, we can do that one of a few ways. We can mutate() with the .by argument, group_by() before we mutate, or even join a summarized data frame. (Joining is linking two datasets based on one or more common variables, which is pretty useful.) The group_by() version is also ungrouped at the end, since occasionally keeping the groups in a data frame leads to undesired results. Summarizing peels back one level of grouping for each summarize() call; in practice, I don’t think this comes up much, but if you’re having any weird issues, see if that matters.

The examples below should all be identical, but they illustrate different ways to get the same result: one with mutate() and .by, one with group_by(), and one with joins (a left join, to be specific).

mtcars %>% mutate(hp_mean_bycyl = mean(hp), .by=cyl) %>% 
  select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
##    rowname             mpg   cyl hp_mean_bycyl
##    <chr>             <dbl> <dbl>         <dbl>
##  1 Mazda RX4          21       6         122. 
##  2 Mazda RX4 Wag      21       6         122. 
##  3 Datsun 710         22.8     4          82.6
##  4 Hornet 4 Drive     21.4     6         122. 
##  5 Hornet Sportabout  18.7     8         209. 
##  6 Valiant            18.1     6         122. 
##  7 Duster 360         14.3     8         209. 
##  8 Merc 240D          24.4     4          82.6
##  9 Merc 230           22.8     4          82.6
## 10 Merc 280           19.2     6         122. 
## # ℹ 22 more rows
mtcars %>% group_by(cyl) %>% mutate(hp_mean_bycyl = mean(hp)) %>% ungroup() %>% 
  select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
##    rowname             mpg   cyl hp_mean_bycyl
##    <chr>             <dbl> <dbl>         <dbl>
##  1 Mazda RX4          21       6         122. 
##  2 Mazda RX4 Wag      21       6         122. 
##  3 Datsun 710         22.8     4          82.6
##  4 Hornet 4 Drive     21.4     6         122. 
##  5 Hornet Sportabout  18.7     8         209. 
##  6 Valiant            18.1     6         122. 
##  7 Duster 360         14.3     8         209. 
##  8 Merc 240D          24.4     4          82.6
##  9 Merc 230           22.8     4          82.6
## 10 Merc 280           19.2     6         122. 
## # ℹ 22 more rows
mtcars %>% 
  left_join(
    mtcars %>% group_by(cyl) %>% summarize(hp_mean_bycyl = mean(hp))
  ) %>% 
  select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
##    rowname             mpg   cyl hp_mean_bycyl
##    <chr>             <dbl> <dbl>         <dbl>
##  1 Mazda RX4          21       6         122. 
##  2 Mazda RX4 Wag      21       6         122. 
##  3 Datsun 710         22.8     4          82.6
##  4 Hornet 4 Drive     21.4     6         122. 
##  5 Hornet Sportabout  18.7     8         209. 
##  6 Valiant            18.1     6         122. 
##  7 Duster 360         14.3     8         209. 
##  8 Merc 240D          24.4     4          82.6
##  9 Merc 230           22.8     4          82.6
## 10 Merc 280           19.2     6         122. 
## # ℹ 22 more rows

If you so choose, you can also use the table() command for tallies. I think it’s harder to read, but it’s up to you.

table(mtcars$am, mtcars$gear)
##    
##      3  4  5
##   0 15  4  0
##   1  0  8  5

Missing values

Sometimes there are missing values in datasets, which show up as NA. The easy, loaded datasets don’t really have this issue, but in practice this will be relevant. Look at the zillow dataset from earlier:

zillow2 %>% group_by(RegionName) %>% 
  summarize(prop_missing = sum(is.na(value)) / n(),
            mean_val = mean(value),
            mean_val_nomissing = mean(value, na.rm=TRUE))
## # A tibble: 51 × 4
##    RegionName           prop_missing mean_val mean_val_nomissing
##    <chr>                       <dbl>    <dbl>              <dbl>
##  1 Alabama                   0.00356      NA             119037.
##  2 Alaska                    0.00712      NA             237470.
##  3 Arizona                   0.00356      NA             217589.
##  4 Arkansas                  0        108971.            108971.
##  5 California                0        414039.            414039.
##  6 Colorado                  0        290685.            290685.
##  7 Connecticut               0        261084.            261084.
##  8 Delaware                  0        234044.            234044.
##  9 District of Columbia      0        400532.            400532.
## 10 Florida                   0        194654.            194654.
## # ℹ 41 more rows

You can see that some states have a few of their values missing, while others have none. But as soon as even one value is missing, many functions return NA—compare the mean_val column with the one computed with na.rm=TRUE. You might wish to replace missing values with zeroes in some circumstances, or ignore them altogether, but make sure you understand what you're getting back.
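A few common ways to handle those NAs, sketched on a tiny made-up data frame (whether zero, the mean, or dropping rows is appropriate depends entirely on your setting):

```r
library(dplyr)
library(tidyr)
df <- data.frame(x = c(1, NA, 3))
df %>% filter(!is.na(x))                               #drop rows with missing x
df %>% mutate(x = replace_na(x, 0))                    #replace NAs with zero
df %>% mutate(x = coalesce(x, mean(x, na.rm = TRUE)))  #or fill with the mean
```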

Graphing with ggplot (and gridExtra)

So we’ve done a lot of data manipulation, but what about graphing? R has base plotting functions, but ggplot is much better. You build up graphs by adding pieces: a plot first, then points, regression lines, connecting lines, labels, etc. You can give each individual piece different data if you’d like, too. The aes() portion of the plot/piece tells ggplot what to do with things—what to put on which axis, but also if things need separate regression lines, or different colors, sizes, shapes, or the like. If you want everything to be the same size/color/shape, you can specify that outside of aes(); sometimes, as below, this needs to be specified in the actual point/line/etc. rather than in the ggplot() call itself.

Note the slight alterations between the code for each graph that follows:

ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()

ggplot(data=mtcars, aes(x=wt, y=mpg, col=am)) + geom_point()

ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point(col='red') #all same color

ggplot(data=mtcars, aes(x=wt, y=mpg, col=am)) + geom_point() + geom_smooth() #very wiggly by default

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') #linear

You can also split the data out in such a way that different subsets are graphed side by side using facet_wrap(), which takes up to two variables.

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') + facet_wrap(~am)

ggplot(data=mtcars, aes(x=wt, y=mpg, col=(am==1))) + geom_point() + geom_smooth(method='lm') + facet_wrap(am~cyl)

If you’d like to specify particular x- and y-axis limits, you can do so using two different options (though one is a worse idea than the other). The first, good, option is coord_cartesian(): it will simply zoom in the plot to the limits you specify. This is nice because if you are, say, adding a regression line, it will stay the same as you zoom in:

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') #reference

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + 
  geom_smooth(method='lm') + coord_cartesian(xlim = c(2, 5), ylim = c(10, 30))

Sure, the axis labels changed a little (something you can also customize), but the lines and points are all the same! This is notably not true for the other "option", the lims family (xlim() and ylim() here, though there are equivalents for color, etc. too): these drop any data outside the limits before statistics like the regression line are computed.

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + 
  geom_smooth(method='lm') + xlim(2, 5) + ylim(10, 30)

There’s arguably a use case for cutting things off (RD bandwiths?) but I’d really encourage you to stick to coord_cartesian(). It’ll make life simpler.

One of the nice things about ggplot is its customizability. Using one of the simpler graphs above, I’ll show how to edit (the labels of) each portion of the graph using labs(), as well as the option to manually specify colors:

ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') + 
  labs(    #labels for everything
    title='MPG by Weight', 
    subtitle='Using the mtcars dataset', 
    x='Weight', 
    y='Miles per Gallon (mpg)', 
    col='Transmission' #if this is fill/shape/size, that argument goes here. just like in aes()
    ) + #next, manually setting color values and labels
  scale_color_manual(values=c('purple', 'blue'), labels=c('Automatic', 'Manual'))
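Once a plot looks the way you want, ggsave() writes it to disk; the filename and dimensions below are just examples:

```r
library(ggplot2)
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()
#width/height are in inches by default; dpi controls resolution
ggsave('mpg_by_weight.png', plot = p, width = 6, height = 4, dpi = 300)
```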

If desired, you can also combine multiple graphs using the gridExtra package. Below, I also use the syntax to make an empty list of length 2, just in case you need that. Note the double brackets to properly access the elements of said list.

library(gridExtra)
graphlist <- vector('list', 2)
graphlist[[1]] <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point(col='red')
graphlist[[2]] <- ggplot(data=mtcars, aes(x=wt, y=cyl)) + geom_point(col='blue')
grid.arrange(grobs=graphlist, ncol=2)

As this is just an arranging tool, note that these are fully two different graphs—the y-axes aren't even on the same scale! Use this with caution, but know that you can do it.

Functions, loops, if/else

If you’re wanting something to happen lots of times, there’s plenty of options. If you need it to work on different things, you can write your own functions. I’ve included a simple one of my own below—hopefully you can figure out what it does.

sparse <- function(df, every) {
  df[seq(1, nrow(df), by = every), ] #keep every `every`-th row, starting from the first
}
mtcars %>% sparse(2)
## # A tibble: 16 × 15
##    rowname       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6 160     110  3.9   2.62  16.5     0     1     4     4
##  2 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
##  3 Hornet Spo…  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
##  4 Duster 360   14.3     8 360     245  3.21  3.57  15.8     0     0     3     4
##  5 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
##  6 Merc 280C    17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4
##  7 Merc 450SL   17.3     8 276.    180  3.07  3.73  17.6     0     0     3     3
##  8 Cadillac F…  10.4     8 472     205  2.93  5.25  18.0     0     0     3     4
##  9 Chrysler I…  14.7     8 440     230  3.23  5.34  17.4     0     0     3     4
## 10 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
## 11 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
## 12 AMC Javelin  15.2     8 304     150  3.15  3.44  17.3     0     0     3     2
## 13 Pontiac Fi…  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
## 14 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
## 15 Ford Pante…  15.8     8 351     264  4.22  3.17  14.5     0     1     5     4
## 16 Maserati B…  15       8 301     335  3.54  3.57  14.6     0     1     5     8
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% sparse(16)
## # A tibble: 2 × 15
##   rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Chrysler Im…  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>

You can also use for loops and while loops. They do what they sound like: for every value x in some vector y, do whatever's in the braces. Or, while this condition is true, do what's in the braces. If the body is a single line, the braces are technically optional, but you should use them in practice.

for(i in 1:10) print(i) #prints 1-10
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

paste0() concatenates strings; it's here to make the printed output a bit more readable!

for(nm in names(mtcars)) {
  paste0('Average value of ', nm, ': ', mean(mtcars[[nm]])) %>% print()
  paste0('Standard deviation of ', nm, ': ', sd(mtcars[[nm]])) %>% print()
  print('')
}
## [1] "Average value of rowname: NA"
## [1] "Standard deviation of rowname: NA"
## [1] ""
## [1] "Average value of mpg: 20.090625"
## [1] "Standard deviation of mpg: 6.0269480520891"
## [1] ""
## [1] "Average value of cyl: 6.1875"
## [1] "Standard deviation of cyl: 1.78592164694654"
## [1] ""
## [1] "Average value of disp: 230.721875"
## [1] "Standard deviation of disp: 123.938693831382"
## [1] ""
## [1] "Average value of hp: 146.6875"
## [1] "Standard deviation of hp: 68.5628684893206"
## [1] ""
## [1] "Average value of drat: 3.5965625"
## [1] "Standard deviation of drat: 0.534678736070971"
## [1] ""
## [1] "Average value of wt: 3.21725"
## [1] "Standard deviation of wt: 0.978457442989697"
## [1] ""
## [1] "Average value of qsec: 17.84875"
## [1] "Standard deviation of qsec: 1.78694323609684"
## [1] ""
## [1] "Average value of vs: 0.4375"
## [1] "Standard deviation of vs: 0.504016128774185"
## [1] ""
## [1] "Average value of am: 0.40625"
## [1] "Standard deviation of am: 0.498990917235846"
## [1] ""
## [1] "Average value of gear: 3.6875"
## [1] "Standard deviation of gear: 0.737804065256947"
## [1] ""
## [1] "Average value of carb: 2.8125"
## [1] "Standard deviation of carb: 1.61519997763185"
## [1] ""
## [1] "Average value of mpg2: 25.090625"
## [1] "Standard deviation of mpg2: 6.0269480520891"
## [1] ""
## [1] "Average value of make: NA"
## [1] "Standard deviation of make: NA"
## [1] ""
## [1] "Average value of model: NA"
## [1] "Standard deviation of model: NA"
## [1] ""

And a while loop:

i <- 1
while(i < 100) {
  i <- i + 1
}
paste0('Value: ', i)
## [1] "Value: 100"

There’s also a family of functions that end in “apply” that do a version of for loops, but formatted a little differently. For lapply(), the example we’ll see here, you feed it the vector of values and the function to apply to those values. It returns a list of values, which is convenient to get a list of the same length you needed. You can also use it to make a data frame of sorts, as I show below:

byname <- function(name) { #function to use
  c(name=name, mean=mean(mtcars[[name]]), sd=sd(mtcars[[name]]))
}
lapply(names(mtcars), byname) %>% bind_rows()
## # A tibble: 15 × 3
##    name    mean       sd               
##    <chr>   <chr>      <chr>            
##  1 rowname <NA>       <NA>             
##  2 mpg     20.090625  6.0269480520891  
##  3 cyl     6.1875     1.78592164694654 
##  4 disp    230.721875 123.938693831382 
##  5 hp      146.6875   68.5628684893206 
##  6 drat    3.5965625  0.534678736070971
##  7 wt      3.21725    0.978457442989697
##  8 qsec    17.84875   1.78694323609684 
##  9 vs      0.4375     0.504016128774185
## 10 am      0.40625    0.498990917235846
## 11 gear    3.6875     0.737804065256947
## 12 carb    2.8125     1.61519997763185 
## 13 mpg2    25.090625  6.0269480520891  
## 14 make    <NA>       <NA>             
## 15 model   <NA>       <NA>
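A close cousin, sapply(), simplifies the list lapply() would return into a named vector when it can—handy for quick numeric summaries (restricted to numeric columns here so no NAs appear):

```r
#sapply() applies the function to each column and simplifies the result
avgs <- sapply(mtcars[, c('mpg', 'hp', 'wt')], mean)
avgs  #named vector of column means
```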

There’s also if/else statements. The else is optional, but again, it’s descriptive: if this condition is true, do this; otherwise, do that. (Or, by default, do nothing.) Equality statements require double equals signs (==), so be careful of that.

i <- 1
if(i == 1) {
  print('true')
} else {
  print("false")
}
## [1] "true"
if(i != 1) {
  print('true')
} else {
  print("false")
}
## [1] "false"

That’s a simple demonstration, but if/else statements are powerful. You can also use the ifelse() function on a vector of conditions, which is pretty handy when using mutate(). You can put whatever you want in the true/false condition returns, as long as they’re the same length:

ifelse(c(T,F,F,T,F), 'a', 'b')
## [1] "a" "b" "b" "a" "b"
mtcars %>% mutate(test = ifelse(cyl > 4, 'Many cylinders', 'Few cylinders')) %>% 
  select(make, model, test) %>% head(10)
## # A tibble: 10 × 3
##    make    model      test          
##    <chr>   <chr>      <chr>         
##  1 Mazda   RX4        Many cylinders
##  2 Mazda   RX4 Wag    Many cylinders
##  3 Datsun  710        Few cylinders 
##  4 Hornet  4 Drive    Many cylinders
##  5 Hornet  Sportabout Many cylinders
##  6 Valiant Valiant    Many cylinders
##  7 Duster  360        Many cylinders
##  8 Merc    240D       Few cylinders 
##  9 Merc    230        Few cylinders 
## 10 Merc    280        Many cylinders

If you’re expecting to have lost of potential cases, you might want to look into switch case statements. In practice, I haven’t needed these much, but here’s a vectorized (mutating) example:

mtcars %>% mutate(test = case_when(cyl==4 ~ 'one', cyl==6 ~ 'two', cyl==8 ~ 'three')) %>% 
  select(make, model, test) %>% head(8)
## # A tibble: 8 × 3
##   make    model      test 
##   <chr>   <chr>      <chr>
## 1 Mazda   RX4        two  
## 2 Mazda   RX4 Wag    two  
## 3 Datsun  710        one  
## 4 Hornet  4 Drive    two  
## 5 Hornet  Sportabout three
## 6 Valiant Valiant    two  
## 7 Duster  360        three
## 8 Merc    240D       one

It goes down the list checking the conditions, and uses the first one that matches.
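That first-match behavior lets you write overlapping conditions from most to least specific, with a catch-all at the end. A small sketch (the cutoffs are arbitrary):

```r
library(dplyr)
x <- c(1, 4, 7, 10)
lab <- case_when(
  x >= 7 ~ 'high',    #checked first
  x >= 4 ~ 'medium',  #only reached if x < 7
  TRUE   ~ 'low'      #catch-all for everything left
)
lab  #"low" "medium" "high" "high"
```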

Regressions

In this section, we’ll mostly be using the wage1 and wagepan datasets from Wooldridge’s book. The latter is panel data, which is nice for what we need.

library(wooldridge)
data('wage1')
wage1 %>% glimpse()
## Rows: 526
## Columns: 24
## $ wage     <dbl> 3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18,…
## $ educ     <int> 11, 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12, 12, 16…
## $ exper    <int> 2, 22, 2, 44, 7, 9, 15, 5, 26, 22, 8, 3, 15, 18, 31, 14, 10, …
## $ tenure   <int> 0, 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0, 10, 0, …
## $ nonwhite <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ female   <int> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ married  <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0…
## $ numdep   <int> 2, 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 0, 3, 0, 0…
## $ smsa     <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ northcen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ south    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ west     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ construc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ndurman  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trcommpu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trade    <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ services <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ profserv <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1…
## $ profocc  <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
## $ clerocc  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ servocc  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ lwage    <dbl> 1.1314021, 1.1755733, 1.0986123, 1.7917595, 1.6677068, 2.1690…
## $ expersq  <int> 4, 484, 4, 1936, 49, 81, 225, 25, 676, 484, 64, 9, 225, 324, …
## $ tenursq  <int> 0, 4, 0, 784, 4, 64, 49, 9, 16, 441, 4, 0, 0, 9, 225, 0, 0, 1…
lm(lwage ~ educ + tenure + nonwhite * female, data=wage1) %>% summary()
## 
## Call:
## lm(formula = lwage ~ educ + tenure + nonwhite * female, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97440 -0.25289 -0.03765  0.24589  1.30187 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.629265   0.093351   6.741 4.20e-11 ***
## educ             0.081508   0.006696  12.172  < 2e-16 ***
## tenure           0.021627   0.002592   8.344 6.53e-16 ***
## nonwhite         0.018282   0.082907   0.221    0.826    
## female          -0.289465   0.039587  -7.312 9.99e-13 ***
## nonwhite:female -0.074720   0.121158  -0.617    0.538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4194 on 520 degrees of freedom
## Multiple R-squared:  0.3834, Adjusted R-squared:  0.3774 
## F-statistic: 64.66 on 5 and 520 DF,  p-value: < 2.2e-16

Besides a bit of a glimpse of the dataset, notice that we use the lm() function to do regressions in base R. It…works just fine. Note that the format of the formula is y ~ x1 + x2 + … + x_n—no quotes or anything. We also need to call summary() on the regression object that lm() returns. Getting an object back is nice because we can use functions like predict() or resid() with these objects to get fitted values and residuals. We can also access the coefficient values from them directly if desired:

reg <- lm(lwage ~ educ + tenure + nonwhite * female, data=wage1)
reg
## 
## Call:
## lm(formula = lwage ~ educ + tenure + nonwhite * female, data = wage1)
## 
## Coefficients:
##     (Intercept)             educ           tenure         nonwhite  
##         0.62927          0.08151          0.02163          0.01828  
##          female  nonwhite:female  
##        -0.28947         -0.07472
names(reg) #what does the reg object hold?
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
reg$coefficients
##     (Intercept)            educ          tenure        nonwhite          female 
##      0.62926514      0.08150781      0.02162687      0.01828224     -0.28946534 
## nonwhite:female 
##     -0.07472044
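And predict() and resid() in action—sketched on mtcars here so it runs without the wooldridge data, with made-up new-data values:

```r
reg2 <- lm(mpg ~ wt + hp, data = mtcars)
head(resid(reg2))                                      #one residual per observation
predict(reg2, newdata = data.frame(wt = 3, hp = 150))  #predicted mpg for a hypothetical car
```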

Note the way I put the interaction term into these regressions: with an asterisk. By default, a * b gives you a + b + a:b, which can be good or bad. If you want only the interaction term, you can write a:b, or wrap the product in the I() function:

lm(lwage ~ educ + tenure + I(nonwhite * female), data=wage1) %>% summary()
## 
## Call:
## lm(formula = lwage ~ educ + tenure + I(nonwhite * female), data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.11486 -0.29738 -0.02029  0.29223  1.43921 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.421056   0.091606   4.596  5.4e-06 ***
## educ                  0.086118   0.006965  12.365  < 2e-16 ***
## tenure                0.025512   0.002672   9.549  < 2e-16 ***
## I(nonwhite * female) -0.208138   0.090516  -2.299   0.0219 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.441 on 522 degrees of freedom
## Multiple R-squared:  0.3155, Adjusted R-squared:  0.3115 
## F-statistic: 80.18 on 3 and 522 DF,  p-value: < 2.2e-16

This goes for squaring terms as well. However, you can apply many functions to variables directly:

lm(log(wage) ~ I(educ ^ 2) + (tenure > 10) + I(nonwhite * female), data=wage1) %>% summary()
## 
## Call:
## lm(formula = log(wage) ~ I(educ^2) + (tenure > 10) + I(nonwhite * 
##     female), data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.11845 -0.30492 -0.04298  0.28711  1.51216 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.9293157  0.0541775  17.153  < 2e-16 ***
## I(educ^2)             0.0038490  0.0002944  13.076  < 2e-16 ***
## tenure > 10TRUE       0.4180373  0.0536736   7.789 3.68e-14 ***
## I(nonwhite * female) -0.1878309  0.0916589  -2.049   0.0409 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4458 on 522 degrees of freedom
## Multiple R-squared:  0.3006, Adjusted R-squared:  0.2965 
## F-statistic: 74.77 on 3 and 522 DF,  p-value: < 2.2e-16

This may or may not be a good idea, but it sure is convenient!

The main library you most likely want to use—in part due to its speed—is the fixest library. feols() is a function that incorporates robust inference, fixed effects (in the form of dummies, same as the “within” regression), and instrumental variables.

library(fixest)
feols(lwage ~ educ + tenure + nonwhite * female, data=wage1, vcov='hetero') #heteroskedastic standard errors
## OLS estimation, Dep. Var.: lwage
## Observations: 526 
## Standard-errors: Heteroskedasticity-robust 
##                  Estimate Std. Error   t value   Pr(>|t|)    
## (Intercept)      0.629265   0.102899  6.115379 1.8941e-09 ***
## educ             0.081508   0.007434 10.964225  < 2.2e-16 ***
## tenure           0.021627   0.003151  6.862529 1.9312e-11 ***
## nonwhite         0.018282   0.090491  0.202035 8.3997e-01    
## female          -0.289465   0.040041 -7.229249 1.7446e-12 ***
## nonwhite:female -0.074720   0.124115 -0.602027 5.4742e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.416998   Adj. R2: 0.37744

It also includes a function i(), which is similar to I() but plays nicely with factor variables and their interactions. You can also tell it which value(s) to treat as the reference level if desired.

feols(lwage ~ educ + tenure + i(nonwhite * female), data=wage1, se='hetero')
## OLS estimation, Dep. Var.: lwage
## Observations: 526 
## Standard-errors: Heteroskedasticity-robust 
##                       Estimate Std. Error  t value   Pr(>|t|)    
## (Intercept)           0.421056   0.093695  4.49390 8.6193e-06 ***
## educ                  0.086118   0.007210 11.94468  < 2.2e-16 ***
## tenure                0.025512   0.003297  7.73899 5.2219e-14 ***
## nonwhite * female::1 -0.208138   0.084455 -2.46448 1.4042e-02 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.439362   Adj. R2: 0.31152
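As a quick sketch of the reference-level argument mentioned above (my choice of numdep, the number-of-dependents variable from wage1, as the factor here is arbitrary and just for illustration):

```r
library(fixest)
# i() turns numdep into a set of dummies; ref = 0 makes "zero dependents"
# the omitted reference category
feols(lwage ~ educ + tenure + i(numdep, ref = 0), data = wage1)
```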

And now IV:

feols(lwage ~ married + nonwhite * female | tenure ~ educ + exper + expersq, data=wage1, se='hetero') #iv regression
## TSLS estimation, Dep. Var.: lwage, Endo.: tenure, Instr.: educ, exper, expersq
## Second stage: Dep. Var.: lwage
## Observations: 526 
## Standard-errors: Heteroskedasticity-robust 
##                  Estimate Std. Error   t value   Pr(>|t|)    
## (Intercept)      1.582330   0.047558 33.271717  < 2.2e-16 ***
## fit_tenure       0.018989   0.005979  3.175973 1.5820e-03 ** 
## married          0.172805   0.047277  3.655146 2.8315e-04 ***
## nonwhite        -0.097020   0.105598 -0.918766 3.5864e-01    
## female          -0.323067   0.043769 -7.381200 6.2605e-13 ***
## nonwhite:female  0.076129   0.137878  0.552143 5.8109e-01    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.465384   Adj. R2: 0.224581
## F-test (1st stage), tenure: stat =  50.6     , p < 2.2e-16 , on 3 and 518 DoF.
##                 Wu-Hausman: stat =   0.178328, p = 0.672989, on 1 and 519 DoF.
##                     Sargan: stat = 140.3     , p < 2.2e-16 , on 2 DoF.

Another great benefit is that you can feed feols() a vector of dependent variables and it'll run all the regressions together. etable() is a handy function for arranging these multiple regressions side by side. It'll also produce results in LaTeX format if desired; check the documentation, or use esttex() instead.

feols(c(lwage, wage) ~ married + nonwhite * female | tenure ~ educ + exper + expersq, data=wage1, se='hetero') %>% etable()
##                                   ..1                ..2
## Dependent Var.:                 lwage               wage
##                                                         
## Constant            1.582*** (0.0476)  5.631*** (0.2738)
## tenure              0.0190** (0.0060) 0.1455*** (0.0389)
## married            0.1728*** (0.0473)  0.8743** (0.2966)
## nonwhite             -0.0970 (0.1056)   -0.6940 (0.7401)
## female            -0.3231*** (0.0438) -2.009*** (0.2903)
## nonwhite x female     0.0761 (0.1379)    0.5121 (0.8635)
## _________________ ___________________ __________________
## S.E. type         Heteroskedast.-rob. Heteroskedas.-rob.
## Observations                      526                526
## R2                            0.23197            0.21205
## Adj. R2                       0.22458            0.20447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bootstrapping

There are two options for bootstrapping: one uses the lmtest and sandwich libraries, and is a bit more direct. The formatting might be unfamiliar, but we effectively take our regression object and put it through a coefficient test, specifying the form of the variance-covariance matrix (using that same regression):

library(lmtest)
library(sandwich)
#bootstrapping
feols(hp ~ wt + gear + am, mtcars) %>% coeftest(vcov = vcovBS(., R=50))
## 
## z test of coefficients:
## 
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -173.428    106.216 -1.6328    0.1025    
## wt            67.228     14.555  4.6190 3.856e-06 ***
## gear          25.017     25.315  0.9882    0.3230    
## am            28.499     35.279  0.8078    0.4192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The second option, which offers more customization, uses the boot library and requires you to write a function for it to run repeatedly on resampled data:

#more customization
library(boot)
boot(mtcars, function(data, indices) lm(hp ~ wt + gear + am, data[indices,]) %>% coefficients() %>% return(), R=50)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = mtcars, statistic = function(data, indices) lm(hp ~ 
##     wt + gear + am, data[indices, ]) %>% coefficients() %>% return(), 
##     R = 50)
## 
## 
## Bootstrap Statistics :
##       original    bias    std. error
## t1* -173.42840 12.314506   112.31344
## t2*   67.22785 -1.656678    17.96803
## t3*   25.01673 -1.727126    24.18994
## t4*   28.49948 -6.232931    35.47859

For panel data, there are also options. feols() works pretty cleanly if you’re content with the “within” transformation, but if you’d like to do first differencing or another variety of fixed-effect model, the plm library is also an option. Your data will need to be transformed into a pdata.frame, and you’ll use the plm() function for regressions. Note that you’ll need to use coeftest() and a vcov specification (for the variance-covariance matrix) in order to get clustered results or the like.

library(plm)
data('wagepan')
wagepan2 <- wagepan %>% pdata.frame(index=c('nr', 'year')) #unit and year
plm(lwage ~ educ + union + married + exper + expersq + factor(year), 
    data=wagepan2, model='within') %>%
  coeftest(vcov=vcovHC(., type='HC1', cluster='group'))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value  Pr(>|t|)    
## union             0.08000186  0.02272222  3.5209 0.0004352 ***
## married           0.04668036  0.02098454  2.2245 0.0261725 *  
## exper             0.13214642  0.01199701 11.0149 < 2.2e-16 ***
## expersq          -0.00518550  0.00080949 -6.4058 1.678e-10 ***
## factor(year)1981  0.01904479  0.02270581  0.8388 0.4016552    
## factor(year)1982 -0.01132198  0.02119722 -0.5341 0.5932857    
## factor(year)1983 -0.04199552  0.02048986 -2.0496 0.0404743 *  
## factor(year)1984 -0.03847088  0.02115271 -1.8187 0.0690326 .  
## factor(year)1985 -0.04324982  0.01757881 -2.4603 0.0139247 *  
## factor(year)1986 -0.02738194  0.01620317 -1.6899 0.0911266 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
feols(lwage ~ educ + union + married + exper + expersq | year + nr, 
      data=wagepan, cluster=~nr)
## OLS estimation, Dep. Var.: lwage
## Observations: 4,360 
## Fixed-effects: year: 8,  nr: 545
## Standard-errors: Clustered (nr) 
##          Estimate Std. Error  t value   Pr(>|t|)    
## union    0.080002   0.022743  3.51763 4.7182e-04 ***
## married  0.046680   0.021004  2.22247 2.6662e-02 *  
## expersq -0.005185   0.000810 -6.39996 3.3575e-10 ***
## ... 2 variables were removed because of collinearity (educ and exper)
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.327891     Adj. R2: 0.565718
##                  Within R2: 0.021568

Note that the two regressions above produce the same estimates, because they're mathematically the same model. But we could instead use the first-differencing estimator:

plm(lwage ~ educ + union + married + exper + expersq + factor(year), 
    data=wagepan2, model='fd') %>%
  coeftest(vcov=vcovHC(., type='HC1', cluster='group'))

That pretty much covers the major use case for plm.

Probit/logit/etc. regressions are also possible, whether in base R (with glm()) or in fixest, with feglm(). Check out the documentation to get the phrasing right depending on what kind of nonlinearity you're trying to match. Do keep an eye on the output, too: the probit below fails to converge, likely because am is perfectly separated by wt and gear in this small dataset.

feglm(am ~ wt + gear, mtcars, family=binomial(link='probit'))
## GLM estimation, family = binomial(link = "probit"), Dep. Var.: am
## Observations: 32 
## Standard-errors: IID 
##              Estimate Std. Error   t value Pr(>|t|) 
## (Intercept)   8.39583    39787.2  0.000211  0.99983 
## wt          -43.19819    14282.9 -0.003024  0.99759 
## gear         30.43512    12331.5  0.002468  0.99803 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Log-Likelihood: -2.966e-9   Adj. Pseudo R2: 0.907471
##            BIC: 10.4          Squared Cor.: 1       
## # Evaluations: lhs: 25 -- no convergence: Maximum number of iterations reached
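For comparison, a sketch of the same model in base R (no fixest required); expect the same separation warnings here:

```r
# base-R probit: same formula and link, via glm() with a binomial family
glm(am ~ wt + gear, data = mtcars, family = binomial(link = 'probit')) %>%
  summary()
```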

Parallel Processing

If we want to do parallel processing, the easiest way is to parallelize a for loop. We do this by using the doParallel and foreach libraries, the second of which is loaded automatically when we load doParallel. We set things up similarly to a for loop, but rather than storing the data somewhere within the loop, foreach() takes the value of the last expression in each iteration as that iteration's result. You can see here that the syntax is a bit different. Also note that unless you use the .combine argument, you'll get a list back by default, so make sure you know exactly what your loop is returning.

library(doParallel)
detectCores()
## [1] 16
registerDoParallel(detectCores()-1) #telling R how many cores to use
foreach_ex <- foreach(i=1:10, .combine=rbind) %dopar% { rnorm(5, i, i) }
foreach_ex %>% data.frame() %>% glimpse()
## Rows: 10
## Columns: 5
## $ X1 <dbl> 2.0135517, -0.2543631, 6.7964948, 0.3661041, -0.5406532, 10.0590846…
## $ X2 <dbl> 0.6297509, -0.1296349, -1.1369261, 3.8595767, 9.2997136, 7.9394745,…
## $ X3 <dbl> 1.7931296, 0.8987631, 3.2743410, 13.7863044, 6.3722779, 17.3893519,…
## $ X4 <dbl> 0.9663244, 1.3404502, 5.8537475, 6.4429921, 15.0212213, 6.2158285, …
## $ X5 <dbl> 0.5027726, 1.5569020, -1.0815818, 6.6441962, 9.4803718, 9.6832167, …

A note on core selection: on your own computer, I'd recommend using all but one of your cores as the maximum value, and only that many if you have enough RAM to copy everything over that many times. If you have 2GB of objects to copy into each core, 16 cores, and 16GB of RAM, 7 might be about as much as you can do. (Leave at least one core out so your computer can continue running graphical processes and not, well, crash.) This advice doesn't apply on the HPCC, where you should specify the number of cores you've requisitioned; I learned the hard way that detectCores() will give you everything on the machine, not just what you requested.
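On a cluster, one sketch of reading the requisitioned core count from the scheduler rather than hardcoding it (this assumes a SLURM-based system; the SLURM_CPUS_PER_TASK variable name will differ under other schedulers):

```r
library(doParallel)
# use the scheduler's allocation if present; otherwise fall back to
# one fewer than the machine's cores, per the advice above
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK",
                                 unset = parallel::detectCores() - 1))
registerDoParallel(n_cores)
```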

If you want to parallelize random processes, however, this isn’t replicable:

set.seed(99)
a <- foreach(i=1:10, .combine=rbind) %dopar% { runif(1) }
set.seed(99)
b <- foreach(i=1:10, .combine=rbind) %dopar% { runif(1) }
a == b
##            [,1]
## result.1  FALSE
## result.2  FALSE
## result.3  FALSE
## result.4  FALSE
## result.5  FALSE
## result.6  FALSE
## result.7  FALSE
## result.8  FALSE
## result.9  FALSE
## result.10 FALSE

We set our seed, but it didn't quite work: a and b are not equal. Thankfully, there's an easy fix. Load another library and replace %dopar% with %dorng%:

library(doRNG)
set.seed(99)
a1 <- foreach(i=1:10, .combine=rbind) %dorng% { runif(1) }
set.seed(99)
b1 <- foreach(i=1:10, .combine=rbind) %dorng% { runif(1) }
a1 == b1
##           [,1]
## result.1  TRUE
## result.2  TRUE
## result.3  TRUE
## result.4  TRUE
## result.5  TRUE
## result.6  TRUE
## result.7  TRUE
## result.8  TRUE
## result.9  TRUE
## result.10 TRUE

None of these examples are necessarily “good” reasons to try to parallelize processes, but if you have a for loop where each iteration takes quite a while, that’d be a good reason. Likewise, if that long process involves some randomness—say, you’re trying to generate a large number of random counterfactuals for a policy change—make sure you do it in a replicable way. It’ll save you a lot of headaches.

Webscraping

There are other tools that might be better for the job (namely, Python, especially if you need to scrape consistently), but R can do it. We'll start with a basic example and a website designed for scraping (or repeated queries): an API. Specifically, an API for a website that supplies cat facts.

library(rvest)
cats <- read_html('https://catfact.ninja/fact')
cats
## {html_document}
## <html>
## [1] <body><p>{"fact":"The Pilgrims were the first to introduce cats to North  ...
cats %>% html_text()
## [1] "{\"fact\":\"The Pilgrims were the first to introduce cats to North America.\",\"length\":63}"

html_text() is good for converting a page straight into text. The result isn't always this comprehensible (and even this is messy looking), so it's best to refer back to the site you're scraping, as here. The cat fact might not be the same, but you get an idea of the formatting.

What we’d like to do is remove the quotes and the curly braces, and split things out by piece. We have a “fact” field and a “length” field—this sounds like a job for regular expressions!

cats %>% html_text() %>% 
  gsub('"|\\{|\\}', '', .) %>% #double quotes or curly braces. note the double backslash to escape
  str_split(':|(\\.,)') %>% unlist() #split on colon or on end of fact -> length field.
## [1] "fact"                                                          
## [2] "The Pilgrims were the first to introduce cats to North America"
## [3] "length"                                                        
## [4] "63"

Now, this is all well and good, but most API queries are going to be a little more complicated. Let’s say that we want to see the first five facts with lengths under 30 characters.

cats_query <- read_html('https://catfact.ninja/facts?max_length=30&limit=5')

Notice that we can simply edit the URL to get what we want. This is very good! If you have to do much by hand, or there isn't a pattern…that's not a good sign. Now, the result is a bit messy:

cats_query %>% html_text()
## [1] "{\"current_page\":1,\"data\":[{\"fact\":\"Cats have 3 eyelids.\",\"length\":20},{\"fact\":\"Cats walk on their toes.\",\"length\":24},{\"fact\":\"Most cats adore sardines.\",\"length\":25},{\"fact\":\"Cats dislike citrus scent.\",\"length\":26},{\"fact\":\"Cats have supersonic hearing\",\"length\":28}],\"first_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=1\",\"from\":1,\"last_page\":2,\"last_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"links\":[{\"url\":null,\"label\":\"Previous\",\"active\":false},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=1\",\"label\":\"1\",\"active\":true},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"label\":\"2\",\"active\":false},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"label\":\"Next\",\"active\":false}],\"next_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"path\":\"https:\\/\\/catfact.ninja\\/facts\",\"per_page\":\"5\",\"prev_page_url\":null,\"to\":5,\"total\":7}"

If we’re thinking about the path of least resistance, we don’t really need the length of the fact, do we? We can recalculate that. It’d be easier to just extract the facts. Which, again, regular expressions will help, this time in the form of str_extract_all(), which extracts all matches to a given pattern:

cats_query %>% html_text() %>% 
    gsub('"|\\{|\\}', '', .) %>% 
  str_extract_all('fact:[A-z 0-9\\.]+') %>% 
  unlist() %>% gsub('(fact:)|"', '', .)
## [1] "Cats have 3 eyelids."         "Cats walk on their toes."    
## [3] "Most cats adore sardines."    "Cats dislike citrus scent."  
## [5] "Cats have supersonic hearing"
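All of this regex work is one route; since the API actually returns JSON, another sketch is to parse the response directly with the jsonlite package (assuming you have it installed), which skips the string manipulation entirely:

```r
library(jsonlite)
# fromJSON() accepts a URL and parses the JSON response;
# the "data" element holds the facts as a data frame
cats_df <- fromJSON('https://catfact.ninja/facts?max_length=30&limit=5')
cats_df$data$fact
```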

What’s great about this is that it’s (theoretically) scalable because of the URL structure! Let’s say we now want the first 25 facts shorter than 200 characters:

read_html('https://catfact.ninja/facts?max_length=200&limit=25') %>%
  html_text() %>% 
    gsub('"|\\{|\\}', '', .) %>% 
  str_extract_all('fact:[A-z \\-0-9\'%\\.]+') %>% 
  unlist() %>% gsub('(fact:)|"', '', .)
##  [1] "Cats have 3 eyelids."                              
##  [2] "Cats walk on their toes."                          
##  [3] "Most cats adore sardines."                         
##  [4] "Cats dislike citrus scent."                        
##  [5] "Cats have supersonic hearing"                      
##  [6] "Female cats are polyestrous"                       
##  [7] "A form of AIDS exists in cats."                    
##  [8] "Female felines are \\\\superfecund"                
##  [9] "Milk can give some cats diarrhea."                 
## [10] "A group of cats is called a clowder."              
## [11] "A group of cats is called a \\u201cclowder.\\u201d"
## [12] "Cat's urine glows under a black light."            
## [13] "Cats can be right-pawed or left-pawed."            
## [14] "A tiger's stripes are like fingerprints"           
## [15] "70% of your cat's life is spent asleep."           
## [16] "Cats have the largest eyes of any mammal."         
## [17] "A cat cannot see directly under its nose."         
## [18] "A female cat is called a queen or a molly."        
## [19] "There are approximately 100 breeds of cat."        
## [20] "Approximately 24 cat skins can make a coat."       
## [21] "A domestic cat can run at speeds of 30 mph."       
## [22] "Cats take between 20-40 breaths per minute."       
## [23] "The cat's tail is used to maintain balance."       
## [24] "A happy cat holds her tail high and steady."       
## [25] "A cats field of vision is about 185 degrees."

You may notice that I lied a bit: I did have to make some tweaks, which is why it’s good to double-check what you’re doing against the web page. Above, I had a regular expression that would match letters, spaces, and numbers. But some facts contain quotes (the weird-looking \\u201 stuff, actually), dashes, apostrophes, and one percent sign; moreover, some don’t actually end in a period. Trial and error is unfortunately going to be part of the process in webscraping.

If you need to scrape more complex sites, there’s a way to do that, but it’s messier, and you’ll probably want to specify just one part of the website to look at. For our first example of this, let’s look at Wikipedia:

library(rvest)
page <- read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')
page %>% html_element('.wikitable') %>% html_table()
## # A tibble: 195 × 4
##    `Common and formal names`       Membership within th…¹ Sovereignty dispute[…²
##    <chr>                           <chr>                  <chr>                 
##  1 Afghanistan                     A UN member state      A None                
##  2 Albania – Republic of Albania   A UN member state      A None                
##  3 Algeria – People's Democratic … A UN member state      A None                
##  4 Andorra – Principality of Ando… A UN member state      A None                
##  5 Angola – Republic of Angola     A UN member state      A None                
##  6 Antigua and Barbuda             A UN member state      A None                
##  7 Argentina – Argentine Republic… A UN member state      A None                
##  8 Armenia – Republic of Armenia   A UN member state      Not recognised by Pak…
##  9 Australia – Commonwealth of Au… A UN member state      A None                
## 10 Austria – Republic of Austria   A UN member state      A None                
## # ℹ 185 more rows
## # ℹ abbreviated names: ¹​`Membership within the UN System[c]`,
## #   ²​`Sovereignty dispute[d]`
## # ℹ 1 more variable:
## #   `Further information on status and recognition of sovereignty[f]` <chr>
page %>% html_element('.wikitable') %>% html_table() %>% 
  filter(`Sovereignty dispute[d]` != 'A None')
## # A tibble: 7 × 4
##   `Common and formal names`        Membership within th…¹ Sovereignty dispute[…²
##   <chr>                            <chr>                  <chr>                 
## 1 Armenia – Republic of Armenia    A UN member state      Not recognised by Pak…
## 2 China – People's Republic of Ch… A UN member state      Partially unrecognise…
## 3 Cyprus – Republic of Cyprus      A UN member state      Not recognised by Tur…
## 4 Israel – State of Israel         A UN member state      Partially unrecognised
## 5 North Korea – Democratic People… A UN member state      BClaimed by Afghanist…
## 6 Palestine – State of Palestine   A UN General Assembly… Partially unrecognise…
## 7 South Korea – Republic of Korea  A UN member state      BClaimed by Afghanist…
## # ℹ abbreviated names: ¹​`Membership within the UN System[c]`,
## #   ²​`Sovereignty dispute[d]`
## # ℹ 1 more variable:
## #   `Further information on status and recognition of sovereignty[f]` <chr>

The html_element() function is grabbing elements by CSS selector, which works great for Wikipedia but isn’t really generalizable. Still, try it; depending on your use case, it may be all you need.

More broadly, you can use XPaths, which you’ll probably want an external tool like SelectorGadget (a Chrome plugin) to help you figure out. As another example, we’ll use the IMDb page for the TV show “Severance” to pull a table of the cast members.

Frequently, as below, the data won’t be quite in the format you want. Note the regular expressions I’m using not only to fix the formatting (so there aren’t newlines and empty whitespace everywhere), but also to split up the part-played and appearance portions of the table. I’d guess that any oddities here are due to blank areas in the table on the page itself, as some actors have been confirmed for the next season but their parts remain unknown.

imdb <- 'https://www.imdb.com/title/tt11280740/fullcredits?ref_=tt_cl_sm' %>% read_html()
imdb %>% 
  html_elements(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "cast_list", " " ))]//td') %>% 
  html_text() %>%
  gsub('(\\n)|( {2,})', '', .) %>% .[. != '' & . != '...'] %>% 
  data.frame(actor=.[seq(1, length(.), 2)], part=.[seq(2, length(.), 2)]) %>% select(-`.`) %>%
  mutate(episodes = gsub('^([^1-9]+)', '', part), part = gsub('[1-9][0-9]? ep.+', '', part)) %>% 
  unique() %>% head() %>% knitr::kable()
actor            part        episodes
Adam Scott       Mark        19 episodes, 2022-2024
Zach Cherry      Dylan       19 episodes, 2022-2024
Britt Lower      Helly       19 episodes, 2022-2024
Tramell Tillman  Milchick    19 episodes, 2022-2024
Jen Tullock      Devon       19 episodes, 2022-2024
Dichen Lachman   Ms. Casey   19 episodes, 2022-2024

Note the use of backticks (`) in both to select a column that doesn’t adhere to R’s naming guidelines. Webscraping is complex; if you can use an API, it’s preferable, since (often) things will appear in a static, simple, text-only format. That lets you avoid the mess we see above.

If you take anything away from this demonstration, it should be this: if you can use an API for something, please do. It’ll make your life so much easier.

RMarkdown

There are a few things you should know about RMarkdown when you’re planning on using it. In general, math works like LaTeX, and LaTeX-style equation formatting should render correctly. For instance, you can insert an equation:

\[\begin{equation*} y = m\cdot x + b \end{equation*}\]

I’d advise you to, generally speaking, start with HTML documents, because they’re a bit easier to work with: you can scroll in code chunks and, as in this document, collapse them if desired. Formatting generally carries over no matter which output type you choose. To generate a document (that is, to make it not a .Rmd file anymore), you “knit” the document, which you can do either in the viewer or by hitting Ctrl+Shift+K. To insert an R code block, hit Ctrl+Alt+I. (I assume there are close equivalents if you’re on OSX.) These blocks can be named, so that during the knitting process you can see where code is running, and where the point of failure is if something breaks. You don’t technically have to name them, but if you do, make sure there are no duplicate names.
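For reference, a named R chunk looks like the sketch below (the chunk name and options here are made up for illustration):

````markdown
```{r summary-stats, echo=TRUE, warning=FALSE}
summary(wage1$wage)
```
````

The part after the comma is a comma-separated list of chunk options; echo controls whether the code itself is shown in the output.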

On the R side of things, though, it’s important to note that RMarkdown documents do not pull from the global environment. That means that if you’ve loaded some kind of data in your regular R session and try to use it in RMarkdown, when it knits the document, it won’t be able to find and use the data, so it’ll fail. You also have a few quality-of-life things that you can use in R to make your results look better (that is, not like ASCII tables made in notepad). The main one of these is the knitr package, whose kable() function makes some decent-looking tables, as you can see below (or above, in the IMDB stuff):

wage1 %>% .[1:10,1:17] %>% knitr::kable()
 wage educ exper tenure nonwhite female married numdep smsa northcen south west construc ndurman trcommpu trade services
 3.10   11     2      0        0      1       0      2    1        0     0    1        0       0        0     0        0
 3.24   12    22      2        0      1       1      3    1        0     0    1        0       0        0     0        1
 3.00   11     2      0        0      0       0      2    0        0     0    1        0       0        0     1        0
 6.00    8    44     28        0      0       1      0    1        0     0    1        0       0        0     0        0
 5.30   12     7      2        0      0       1      1    0        0     0    1        0       0        0     0        0
 8.75   16     9      8        0      0       1      0    1        0     0    1        0       0        0     0        0
11.25   18    15      7        0      0       0      0    1        0     0    1        0       0        0     1        0
 5.00   12     5      3        0      1       0      0    1        0     0    1        0       0        0     0        0
 3.60   12    26      4        0      1       0      2    1        0     0    1        0       0        0     1        0
18.18   17    22     21        0      0       1      0    1        0     0    1        0       0        0     0        0

Generally speaking, though, RMarkdown works like a cross between a LaTeX document and a regular R script, just a bit more linear. The main thing is making sure that the code you’ve put into the file is everything it needs, given the environment concerns above. It’s nice to be able to show off exactly what you did!