Before we get into the meat of things, a bit about the organization of this document: I start with basic R stuff and proceed through different processes and libraries. Feel free to skip around, but if you don’t understand something syntactically, it might be wise to review earlier sections. (Or if you’re really stuck, reach out!)
There are a few cheat sheets that I’d recommend, depending on what you’re after. They don’t necessarily show the same methods or formats as found in this document, but that’s okay! There are many ways to do the same thing in R—what’s important is that you understand how it works and that you get the result you’re after.
Help files are very useful, and can be accessed in RStudio’s Help pane or by typing ?function_name into the console. (That is, if you’d like to know more about mean(), you’d use ?mean.) Other than that, Googling is your best friend—try to be as specific as possible, and if you’re getting any error messages, definitely include those.
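As a quick sketch, the help system can also be used programmatically: help() is the function form of ?, and apropos() searches the objects in loaded packages by name.

```r
# help() is equivalent to the ? shorthand; both pull up the same documentation
h <- help("mean")        # same as ?mean
# apropos() searches loaded packages for object names matching a pattern
hits <- apropos("mean")
```

Here hits will include "mean" itself along with relatives such as weighted.mean().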
The main takeaway from this should ideally be that if you can think of something you’d like to do in R, you can—remember that. I hope that this document is helpful, and happy coding!
Obviously, you can import data in different ways. The most common method will likely be reading in an Excel file or csv, which is pretty simple to load in as a data frame. You can also load data (including things that you’ve saved) from .RData files for the whole environment, or from .rds files for particular variables/data frames. There are also, of course, ways to load in data that’s stored in .dta (Stata) format. There are other libraries that will let you load in other types of files, but these are the main ones that you’ll need to know to get started. See the code below for different examples of this.
zillow <- read.csv("C:/Users/knigh/Downloads/zillow_dataset.csv")
head(zillow[,3:8])
## RegionName RegionType StateName X2000.01.31 X2000.02.29 X2000.03.31
## 1 California state NA 187783.93 188416.66 189269.8
## 2 Texas state NA 107911.13 107968.44 107995.5
## 3 Florida state NA 104513.44 104740.17 105014.7
## 4 New York state NA 129469.17 129930.91 130374.9
## 5 Pennsylvania state NA 92347.12 92544.77 92730.5
## 6 Illinois state NA 114873.73 114966.85 115167.0
library(readxl) #easiest for Excel files
excel <- read_excel('C:/Users/knigh/Downloads/Financial Sample.xlsx')
head(excel)
## # A tibble: 6 × 16
## Segment Country Product `Discount Band` `Units Sold` `Manufacturing Price`
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Government Canada Carrete… None 1618. 3
## 2 Government Germany Carrete… None 1321 3
## 3 Midmarket France Carrete… None 2178 3
## 4 Midmarket Germany Carrete… None 888 3
## 5 Midmarket Mexico Carrete… None 2470 3
## 6 Government Germany Carrete… None 1513 3
## # ℹ 10 more variables: `Sale Price` <dbl>, `Gross Sales` <dbl>,
## # Discounts <dbl>, Sales <dbl>, COGS <dbl>, Profit <dbl>, Date <dttm>,
## # `Month Number` <dbl>, `Month Name` <chr>, Year <chr>
library(haven) #for Stata files
gdp <- read_dta('http://www.principlesofeconometrics.com/stata/gdp.dta') #works for local files, too
head(gdp)
## # A tibble: 6 × 2
## usa aus
## <dbl> <dbl>
## 1 38.3 38.2
## 2 38.4 38.8
## 3 38.7 38.8
## 4 38.3 38.9
## 5 39.4 39.6
## 6 39.6 39.6
There are different ways that R can store and access files on your computer. You can set and access your current working directory, along with what files are in it:
setwd('C:/Users/knigh/OneDrive/Documents/RStudio/')
getwd()
## [1] "C:/Users/knigh/OneDrive/Documents/RStudio"
list.files()
## [1] "823.RData" "E-Scooter_Trips_-_2019_Pilot.csv"
## [3] "image.RData" "image_real.RData"
## [5] "rbackups" "replication_ec823"
## [7] "RLIS_Portland" "x.rds"
## [9] "y.rds"
get a list of what variables you’ve created:
ls()
## [1] "excel" "gdp" "zillow"
save and load an image of all variables:
save.image('image.RData') #store all variables
rm(list=ls()) #deletes all variables in memory
ls()
## character(0)
load('image.RData')
ls()
## [1] "excel" "gdp" "zillow"
and save/load single variables as .rds files:
x <- 1:10
saveRDS(x, 'x.rds')
y <- readRDS('x.rds')
x == y
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
For file paths on your computer, especially if you’re using Windows, keep in mind that a single backslash won’t work, because R treats the backslash as an escape character; use forward slashes (or doubled backslashes, \\). So rather than accessing your C:\ drive, it’d be your “C:/” drive.
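If you’d rather not worry about separators at all, file.path() builds paths with forward slashes on every platform. A small sketch with a made-up path:

```r
# file.path() joins path components with forward slashes on every platform
p <- file.path("C:", "Users", "knigh", "Downloads")
p
## [1] "C:/Users/knigh/Downloads"
```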
As an example, we’re going to load the “mtcars” dataset from the built-in datasets package, to see a few useful (base R) functions and ways to access data.
data('mtcars') #loading from a library. Not typical
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head() displays the first few rows of a data frame
(generally 6, though you can specify this):
head(mtcars, 15)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
If you’d like to see the full dataset, you can use
View(), as below.
View(mtcars) #doesn't work in a markdown document, unfortunately
summary() lets us look at some information for every
(numeric) column in the dataset, which might be helpful to get a handle
on what’s all going on in it.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
If you want to access particular elements of a data frame, you can do that with brackets, specifying [row, column]. If you don’t specify a row/column, all values of that row/column will be included (that is, df[,] just returns all of df).
mtcars[10:20,] #rows 10-20
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
mtcars[,1] #first column
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
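Brackets also accept logical vectors in the row position, which gives us base R filtering. A quick sketch using mtcars:

```r
# a logical vector in the row position keeps only the TRUE rows;
# the empty column position means "all columns"
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)
## [1] 11
```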
Alternatively, you can specify a column by name, either with the dollar sign ($) or by feeding in the name as a string, as you can see below. Aside from the single-bracket approach, which returns a one-column data frame, these are all equivalent ways to extract a column as a vector.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars['mpg'] #still a data frame
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
mtcars[['mpg']] #vector
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
You can also assign new columns as below:
mtcars$mpg2 <- mtcars$mpg + 5
head(mtcars, 8) #to demonstrate
## mpg cyl disp hp drat wt qsec vs am gear carb mpg2
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 26.0
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 26.0
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 27.8
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 26.4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 23.7
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 23.1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 19.3
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 29.4
This all works, but a lot of the time it’s not the easiest or most legible way to make changes. I and many others prefer packages that are part of the “tidyverse”, which gives us a few new functions and a different syntax for data manipulation. That said, the basic approaches are still helpful to know. Everything works in base R; it just might be more confusing to implement or read.
For the rest of this document, we’ll be using tidyverse conventions. There are other options for similar results, the most prominent of which is data.table, but the tidyverse is very legible. If you’re working with very large data, it might be worthwhile to explore other options, but this is an edge case.
The main difference is in the existence and use of the pipe operator,
%>%. It allows us to take the result of one operation
and “pipe” it into another function, which is great for legibility.
Without a pipe operator, to apply multiple functions, we’d either need
to use multiple lines and intermediary steps, or to nest the functions
such that to understand what’s going on, we’d have to read the code from
the outside in. For instance, there are two ways to take a random sample
of 25 integers from 1-1000 and find their mean below, but one is much
clearer to read than the other:
set.seed(1)
mean(sample(1:1000, 25))
## [1] 551.92
library(tidyverse)
set.seed(1)
1:1000 %>% sample(25) %>% mean()
## [1] 551.92
(The set.seed() function allows us to get the same
random numbers every time, which is good for reproducibility.)
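To see the reproducibility in action: resetting the seed makes the same “random” draw repeat exactly.

```r
set.seed(42)
a <- sample(1:1000, 5)
set.seed(42)           # reset to the same seed...
b <- sample(1:1000, 5) # ...and the draw repeats exactly
identical(a, b)
## [1] TRUE
```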
Note that specifically, the pipe operator inserts the preceding result as the first argument of a function. If you need things in a different order, you can use the period (.) to tell the function where to place the piped object:
set.seed(2)
1:10 %>% sample(., 4) #unnecessary, but helps illustrate the point
## [1] 5 6 9 1
set.seed(2)
4 %>% sample(1:10, .)
## [1] 5 6 9 1
This is not only an asset for regular function use; it also lets us work with data frames much more smoothly, because each of the tidyverse’s data-manipulation functions takes a data frame as its first argument. glimpse() lets us look at data frames and the types of each column, though sideways:
mtcars <- mtcars %>% mutate(rowname = rownames(.)) %>%
as_tibble() %>% select(rowname, where(is.numeric)) #better printing
mtcars %>% glimpse() #note the sideways display!
## Rows: 32
## Columns: 13
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2 <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
mutate() allows us to change/add columns as we go:
mtcars %>% mutate(test=1, test2=FALSE, test3=ifelse(wt > 3, 1, 0)) %>% glimpse()
## Rows: 32
## Columns: 16
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2 <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ test <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ test2 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ test3 <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,…
and filter() lets us choose a particular subset of the
data.
mtcars %>% filter(cyl < 6)
## # A tibble: 11 × 13
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
## 5 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
## 6 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
## 7 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
## 9 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 10 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
## 11 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## # ℹ 1 more variable: mpg2 <dbl>
Note that none of these actually change the underlying dataset, because we haven’t stored any of these alterations:
mtcars %>% head(10)
## # A tibble: 10 × 13
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ℹ 1 more variable: mpg2 <dbl>
We can also use pull() to extract a column as a vector, or use the dot operator to do the same.
mtcars %>% .$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars %>% pull(mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
select() lets us pick particular columns, and we can do
“-col_name” to drop a column.
mtcars %>% select(mpg, am) #select only certain columns
## # A tibble: 32 × 2
## mpg am
## <dbl> <dbl>
## 1 21 1
## 2 21 1
## 3 22.8 1
## 4 21.4 0
## 5 18.7 0
## 6 18.1 0
## 7 14.3 0
## 8 24.4 0
## 9 22.8 0
## 10 19.2 0
## # ℹ 22 more rows
mtcars %>% select(-mpg) #drop column
## # A tibble: 32 × 12
## rowname cyl disp hp drat wt qsec vs am gear carb mpg2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 6 160 110 3.9 2.62 16.5 0 1 4 4 26
## 2 Mazda RX4 … 6 160 110 3.9 2.88 17.0 0 1 4 4 26
## 3 Datsun 710 4 108 93 3.85 2.32 18.6 1 1 4 1 27.8
## 4 Hornet 4 D… 6 258 110 3.08 3.22 19.4 1 0 3 1 26.4
## 5 Hornet Spo… 8 360 175 3.15 3.44 17.0 0 0 3 2 23.7
## 6 Valiant 6 225 105 2.76 3.46 20.2 1 0 3 1 23.1
## 7 Duster 360 8 360 245 3.21 3.57 15.8 0 0 3 4 19.3
## 8 Merc 240D 4 147. 62 3.69 3.19 20 1 0 4 2 29.4
## 9 Merc 230 4 141. 95 3.92 3.15 22.9 1 0 4 2 27.8
## 10 Merc 280 6 168. 123 3.92 3.44 18.3 1 0 4 4 24.2
## # ℹ 22 more rows
Here’s a bit more with mutate(); if we’d like to perform
mathematical operations with any variables, we can access them by name
directly, as below.
mtcars %>% mutate(hp_wt = hp / wt) %>% glimpse()
## Rows: 32
## Columns: 14
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2 <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ hp_wt <dbl> 41.98473, 38.26087, 40.08621, 34.21462, 50.87209, 30.34682, 68…
mtcars %>% mutate(hp_wt = hp / wt) %>% pull(hp_wt)
## [1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745 19.43574
## [9] 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905 39.04762 39.63864
## [17] 43.03087 30.00000 32.19814 35.42234 39.35091 42.61364 43.66812 63.80208
## [25] 45.51365 34.10853 42.52336 74.68605 83.28076 63.17690 93.83754 39.20863
Now, we’d like to tidy up the makes and models of the cars we’re working with here. I’m going to use regular expressions to do this, which might look a bit complicated, but all I’m effectively doing is splitting the string by the first space. (Some cars have multiple words as the model.)
#preparing makes and models
makes <- mtcars %>% pull(rowname) %>% gsub('([^ ]+) (.+)', '\\1', .)
models <- mtcars %>% pull(rowname) %>% gsub('([^ ]+) (.+)', '\\2', .)
makes #to see
## [1] "Mazda" "Mazda" "Datsun" "Hornet" "Hornet" "Valiant"
## [7] "Duster" "Merc" "Merc" "Merc" "Merc" "Merc"
## [13] "Merc" "Merc" "Cadillac" "Lincoln" "Chrysler" "Fiat"
## [19] "Honda" "Toyota" "Toyota" "Dodge" "AMC" "Camaro"
## [25] "Pontiac" "Fiat" "Porsche" "Lotus" "Ford" "Ferrari"
## [31] "Maserati" "Volvo"
models
## [1] "RX4" "RX4 Wag" "710" "4 Drive" "Sportabout"
## [6] "Valiant" "360" "240D" "230" "280"
## [11] "280C" "450SE" "450SL" "450SLC" "Fleetwood"
## [16] "Continental" "Imperial" "128" "Civic" "Corolla"
## [21] "Corona" "Challenger" "Javelin" "Z28" "Firebird"
## [26] "X1-9" "914-2" "Europa" "Pantera L" "Dino"
## [31] "Bora" "142E"
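If the regular expression feels opaque, strsplit() gets the make a different way: split each name on spaces and take the first piece. (This assumes no make contains a space, which holds here.)

```r
names_vec <- c("Mazda RX4", "Hornet 4 Drive", "Valiant")
# strsplit() returns a list of pieces; `[` with index 1 grabs the first piece of each
first_words <- sapply(strsplit(names_vec, " "), `[`, 1)
# first_words is "Mazda", "Hornet", "Valiant"
```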
Now we can store this and use the new columns to filter our dataset!
mtcars <- mtcars %>% mutate(make = makes, model = models)
mtcars %>% glimpse()
## Rows: 32
## Columns: 15
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2 <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ make <chr> "Mazda", "Mazda", "Datsun", "Hornet", "Hornet", "Valiant", "Du…
## $ model <chr> "RX4", "RX4 Wag", "710", "4 Drive", "Sportabout", "Valiant", "…
mtcars %>% filter(make=='Hornet')
## # A tibble: 2 × 15
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 2 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% filter(make=='Hornet', cyl > 6)
## # A tibble: 1 × 15
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% filter(make=='Hornet' | cyl > 6)
## # A tibble: 15 × 15
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 2 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 3 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 4 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
## 5 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
## 6 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
## 7 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 8 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 9 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 10 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
## 11 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
## 12 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 13 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
## 14 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 15 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
For filter(), note that & works for “and” and | works for “or” (separating conditions with commas, as above, also means “and”). Also, you need double equals signs (==) for equality tests.
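The same operators work in base R subsetting, if you ever need the bracket version. A sketch using the original copy of mtcars (so our modified version is undisturbed):

```r
cars0 <- datasets::mtcars  # the untouched built-in copy
# & is "and", | is "or", == tests equality
manual_fours <- cars0[cars0$cyl == 4 & cars0$am == 1, ]
nrow(manual_fours)
## [1] 8
```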
This is very much something that you’ll need to practice more than I can teach you, but I’d advise learning regular expressions (regex) if you have to work with strings—it makes life easier. R’s regular expressions work generally like any others, though there are a couple of differences worth acknowledging: capture groups (things to identify and keep) are accessed with \\1 rather than $1, and some characters require additional backslashes as compared to the JavaScript flavor. To learn and test regular expressions, I recommend the website regexr.com; it has a built-in cheatsheet, which is nice. The main functions you might be interested in are below, though the section above contains a more complicated example if you’re curious.
grepl(pattern='[a-z]', 'test') #is there a lowercase letter in "test"?
## [1] TRUE
gsub(pattern='[a-z]', replacement='', 'LaTeX') #replace all instances of a lowercase letter with nothing
## [1] "LTX"
In general, square brackets denote a set of possible character matches, curly braces help specify number of repetitions, and parentheses help you denote capture groups. Periods are wildcards, * is 0 or more instances, + is 1 or more instances, and ? matches 0 or 1 instances. This is very much the kind of thing that it’s hard to learn until you really need it, but it’s helpful.
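A few tiny demonstrations of those metacharacters, in case seeing them helps:

```r
grepl("^ab?c$", c("ac", "abc", "abbc"))   # ? makes the "b" optional
## [1]  TRUE  TRUE FALSE
grepl("[0-9]{4}", "built in 2024")        # {4} means exactly four digits
## [1] TRUE
gsub("(\\w+)@(\\w+)", "\\2", "user@host") # two capture groups; keep the second
## [1] "host"
```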
None of what we’ve done thus far has really changed the shape of our
data very much—we’ve added columns and changed things a little bit, but
what if the data’s in the wrong form altogether? One easy example would
be data that’s far too wide. The Zillow dataset that we imported earlier
has 50 rows (one for each state) and way too many columns, since many of
the columns are monthly values. We can use tools like
pivot_longer():
zillow2 <- zillow %>% pivot_longer(starts_with('X'))
zillow2 %>% head(10)
## # A tibble: 10 × 7
## RegionID SizeRank RegionName RegionType StateName name value
## <int> <int> <chr> <chr> <lgl> <chr> <dbl>
## 1 9 0 California state NA X2000.01.31 187784.
## 2 9 0 California state NA X2000.02.29 188417.
## 3 9 0 California state NA X2000.03.31 189270.
## 4 9 0 California state NA X2000.04.30 191121.
## 5 9 0 California state NA X2000.05.31 193271.
## 6 9 0 California state NA X2000.06.30 195590.
## 7 9 0 California state NA X2000.07.31 198040.
## 8 9 0 California state NA X2000.08.31 200598.
## 9 9 0 California state NA X2000.09.30 203164.
## 10 9 0 California state NA X2000.10.31 205597.
zillow2 <- zillow2 %>%
mutate(year = name %>% gsub('^X([0-9]{4}).([0-9]{2}).([0-9]{2})', '\\1', .) %>% as.numeric(),
month = name %>% gsub('^X([0-9]{4}).([0-9]{2}).([0-9]{2})', '\\2', .) %>% as.numeric()) %>%
mutate(pd = 1:n(), .by=RegionName)
zillow2 %>% filter(year==2008, month==8)
## # A tibble: 51 × 10
## RegionID SizeRank RegionName RegionType StateName name value year month
## <int> <int> <chr> <chr> <lgl> <chr> <dbl> <dbl> <dbl>
## 1 9 0 California state NA X200… 3.65e5 2008 8
## 2 54 1 Texas state NA X200… 1.38e5 2008 8
## 3 14 2 Florida state NA X200… 1.93e5 2008 8
## 4 43 3 New York state NA X200… 2.38e5 2008 8
## 5 47 4 Pennsylvania state NA X200… 1.57e5 2008 8
## 6 21 5 Illinois state NA X200… 1.77e5 2008 8
## 7 44 6 Ohio state NA X200… 1.21e5 2008 8
## 8 16 7 Georgia state NA X200… 1.54e5 2008 8
## 9 36 8 North Caroli… state NA X200… 1.64e5 2008 8
## 10 30 9 Michigan state NA X200… 1.15e5 2008 8
## # ℹ 41 more rows
## # ℹ 1 more variable: pd <int>
There’s a bit going on here, so let me explain.
pivot_longer() takes all the columns we’ve selected (those
that start with “X”, all the monthly values) and makes them into two
columns: “name” and “value”. “name” is the original column name, “value”
is the value. We then use regular expressions to extract the month and
year information, because the formatting of all names is XYYYY.MM.DD,
where DD is just the last day of the month. We then add a new variable,
pd, which helps us turn year/month info into time-period info. The
.by argument means that the mutated values are calculated
by RegionName (state, actually), rather than counting linearly through
the whole data frame. If it wasn’t already in order, we’d need to
arrange() the data before the counting.
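To make the reshape concrete, here is a toy version by hand in base R (with made-up values in the spirit of the Zillow data); pivot_longer() is doing this stacking, plus the bookkeeping, for us:

```r
wide <- data.frame(state = c("California", "Texas"),
                   X2000.01.31 = c(187784, 107911),
                   X2000.02.29 = c(188417, 107968))
# each selected column becomes rows of (name, value), with states repeating per column
long <- data.frame(state = rep(wide$state, times = 2),
                   name  = rep(c("X2000.01.31", "X2000.02.29"), each = 2),
                   value = c(wide$X2000.01.31, wide$X2000.02.29))
nrow(long)  # 2 states x 2 months = 4 rows
## [1] 4
```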
In practice, commands like these are built up over time. This is why I think that not immediately changing values is an asset—you can look at what the result is for any intermediate step in the process and make sure everything functioned right before moving on to the next step. The second part of this, though, is that I can just run one command to get the exact result that I want!
A quick aside—if we’re working with dates and times frequently, we might want to use the lubridate package, which makes operations simpler:
library(lubridate)
zillow3 <- zillow2 %>% mutate(time = gsub('X', '', name) %>% ymd())
zillow3 %>% glimpse()
## Rows: 14,331
## Columns: 11
## $ RegionID <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
## $ SizeRank <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ RegionName <chr> "California", "California", "California", "California", "Ca…
## $ RegionType <chr> "state", "state", "state", "state", "state", "state", "stat…
## $ StateName <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ name <chr> "X2000.01.31", "X2000.02.29", "X2000.03.31", "X2000.04.30",…
## $ value <dbl> 187783.9, 188416.7, 189269.8, 191120.9, 193270.9, 195590.1,…
## $ year <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
## $ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
## $ pd <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ time <date> 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-30, 2000-05-31…
zillow3$time[2] - zillow3$time[1] #end of jan to end of feb
## Time difference of 29 days
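For comparison, base R’s Date class covers the simple cases too, if you’d rather not load anything:

```r
d1 <- as.Date("2000-01-31")
d2 <- as.Date("2000-02-29")
d2 - d1             # subtraction gives a difftime, just like above
## Time difference of 29 days
format(d2, "%Y")    # pieces of a date can be pulled out with format()
## [1] "2000"
```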
There are other things we might want to do, though. We’re going to
return to the mtcars dataset here; say we want to know how many models
from each manufacturer are included in the dataset. We can use
group_by() to…well, group the data by some variable(s) and
derive some results.
mtcars %>% glimpse()
## Rows: 32
## Columns: 15
## $ rowname <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", …
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, …
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,…
## $ mpg2 <dbl> 26.0, 26.0, 27.8, 26.4, 23.7, 23.1, 19.3, 29.4, 27.8, 24.2, 22…
## $ make <chr> "Mazda", "Mazda", "Datsun", "Hornet", "Hornet", "Valiant", "Du…
## $ model <chr> "RX4", "RX4 Wag", "710", "4 Drive", "Sportabout", "Valiant", "…
mtcars %>% group_by(make) %>% summarize(N=n())
## # A tibble: 22 × 2
## make N
## <chr> <int>
## 1 AMC 1
## 2 Cadillac 1
## 3 Camaro 1
## 4 Chrysler 1
## 5 Datsun 1
## 6 Dodge 1
## 7 Duster 1
## 8 Ferrari 1
## 9 Fiat 2
## 10 Ford 1
## # ℹ 12 more rows
If we want to sort things, we can use arrange():
mtcars %>% group_by(make) %>% summarize(N=n()) %>% arrange(N) #lowest first
## # A tibble: 22 × 2
## make N
## <chr> <int>
## 1 AMC 1
## 2 Cadillac 1
## 3 Camaro 1
## 4 Chrysler 1
## 5 Datsun 1
## 6 Dodge 1
## 7 Duster 1
## 8 Ferrari 1
## 9 Ford 1
## 10 Honda 1
## # ℹ 12 more rows
mtcars %>% group_by(make) %>% summarize(N=n()) %>% arrange(-N) #highest first
## # A tibble: 22 × 2
## make N
## <chr> <int>
## 1 Merc 7
## 2 Fiat 2
## 3 Hornet 2
## 4 Mazda 2
## 5 Toyota 2
## 6 AMC 1
## 7 Cadillac 1
## 8 Camaro 1
## 9 Chrysler 1
## 10 Datsun 1
## # ℹ 12 more rows
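A slightly more readable alternative to `arrange(-N)` is dplyr's `desc()` helper. This sketch reuses the `make` column created earlier; note that `desc()` also works on character columns, which `-N` does not:

```r
# desc() is dplyr's explicit descending-order helper; this gives
# the same ordering as arrange(-N)
mtcars %>% group_by(make) %>% summarize(N = n()) %>% arrange(desc(N))
```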
And we can also group by multiple variables, giving us information about more particular subsets. Be careful: it groups by each distinct value, not by a bin, so if you'd like to group by ranges (say, having 1-3 cylinders), you'd need to make another variable specifying as much.
mtcars %>% group_by(cyl, am) %>% summarize(N=n(), mean_hp = mean(hp))
## # A tibble: 6 × 4
## # Groups: cyl [3]
## cyl am N mean_hp
## <dbl> <dbl> <int> <dbl>
## 1 4 0 3 84.7
## 2 4 1 8 81.9
## 3 6 0 4 115.
## 4 6 1 3 132.
## 5 8 0 12 194.
## 6 8 1 2 300.
Any function that would give you information about the makeup of your data—mean, standard deviation, max, min, etc.—works here. This lets us know that there are three 4-cylinder cars with automatic transmissions (am = 0) in the dataset. Neat!
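To follow up on the binning caveat above: a minimal sketch (the breaks and labels here are arbitrary choices of mine) using `cut()` to build the extra variable and then grouping by it:

```r
# cut() converts a numeric variable into a factor of ranges; here
# cylinder counts are split into two arbitrary bins before grouping
mtcars %>%
  mutate(cyl_bin = cut(cyl, breaks = c(0, 5, 8), labels = c('small', 'large'))) %>%
  group_by(cyl_bin, am) %>%
  summarize(N = n())
```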
We can also do a bit more than that: if we'd like to summarize many
variables, we can use across() to select them and state
what we'd like to do with them. We pass a named vector of functions to
apply several at once (the names become part of the new column names),
and we can even tweak the functions themselves. Note that the mean here
removes missing values; the ~fn(.x) lambda syntax is
necessary for tweaks like that, where .x stands for the
variable being summarized.
mtcars %>% group_by(cyl) %>%
summarize(across(where(is.numeric),
c(mn=~mean(.x, na.rm=TRUE), sd=sd)))
## # A tibble: 3 × 23
## cyl mpg_mn mpg_sd disp_mn disp_sd hp_mn hp_sd drat_mn drat_sd wt_mn wt_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 4.51 105. 26.9 82.6 20.9 4.07 0.365 2.29 0.570
## 2 6 19.7 1.45 183. 41.6 122. 24.3 3.59 0.476 3.12 0.356
## 3 8 15.1 2.56 353. 67.8 209. 51.0 3.23 0.372 4.00 0.759
## # ℹ 12 more variables: qsec_mn <dbl>, qsec_sd <dbl>, vs_mn <dbl>, vs_sd <dbl>,
## # am_mn <dbl>, am_sd <dbl>, gear_mn <dbl>, gear_sd <dbl>, carb_mn <dbl>,
## # carb_sd <dbl>, mpg2_mn <dbl>, mpg2_sd <dbl>
If we’d like to, say, keep all the values in the original dataset but
add some mean value, we can do that one of a few ways. We can
mutate() with the .by argument,
group_by() before we mutate, or even join a summarized data
frame. (Joining is linking two datasets based on one or more common
variables, which is pretty useful.) The group_by() version
is also ungrouped at the end, since occasionally keeping the groups in a
data frame leads to undesired results. Summarizing peels back one level
of grouping for each summarize() call; in practice, I don’t
think this comes up much, but if you’re having any weird issues, see if
that matters.
The below should all be identical, though they illustrate different
ways to get the same result: one with mutate() and
.by, one with group_by(), and one with a join
(a left join, to be specific).
mtcars %>% mutate(hp_mean_bycyl = mean(hp), .by=cyl) %>%
select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
## rowname mpg cyl hp_mean_bycyl
## <chr> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 122.
## 2 Mazda RX4 Wag 21 6 122.
## 3 Datsun 710 22.8 4 82.6
## 4 Hornet 4 Drive 21.4 6 122.
## 5 Hornet Sportabout 18.7 8 209.
## 6 Valiant 18.1 6 122.
## 7 Duster 360 14.3 8 209.
## 8 Merc 240D 24.4 4 82.6
## 9 Merc 230 22.8 4 82.6
## 10 Merc 280 19.2 6 122.
## # ℹ 22 more rows
mtcars %>% group_by(cyl) %>% mutate(hp_mean_bycyl = mean(hp)) %>% ungroup() %>%
select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
## rowname mpg cyl hp_mean_bycyl
## <chr> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 122.
## 2 Mazda RX4 Wag 21 6 122.
## 3 Datsun 710 22.8 4 82.6
## 4 Hornet 4 Drive 21.4 6 122.
## 5 Hornet Sportabout 18.7 8 209.
## 6 Valiant 18.1 6 122.
## 7 Duster 360 14.3 8 209.
## 8 Merc 240D 24.4 4 82.6
## 9 Merc 230 22.8 4 82.6
## 10 Merc 280 19.2 6 122.
## # ℹ 22 more rows
mtcars %>%
left_join(
mtcars %>% group_by(cyl) %>% summarize(hp_mean_bycyl = mean(hp))
) %>%
select(rowname, mpg, cyl, hp_mean_bycyl) #too many variables for good display
## # A tibble: 32 × 4
## rowname mpg cyl hp_mean_bycyl
## <chr> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 122.
## 2 Mazda RX4 Wag 21 6 122.
## 3 Datsun 710 22.8 4 82.6
## 4 Hornet 4 Drive 21.4 6 122.
## 5 Hornet Sportabout 18.7 8 209.
## 6 Valiant 18.1 6 122.
## 7 Duster 360 14.3 8 209.
## 8 Merc 240D 24.4 4 82.6
## 9 Merc 230 22.8 4 82.6
## 10 Merc 280 19.2 6 122.
## # ℹ 22 more rows
If you so choose, you can also use the table() command
for tallies. I think it’s harder to read, but it’s up to you.
table(mtcars$am, mtcars$gear)
##
## 3 4 5
## 0 15 4 0
## 1 0 8 5
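If shares are more useful than counts, you can wrap table() in prop.table(); margin = 1 normalizes within rows:

```r
# prop.table() turns a table of counts into proportions; with
# margin = 1 each row sums to one (omit it for overall shares)
prop.table(table(mtcars$am, mtcars$gear), margin = 1)
```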
Sometimes there are missing values in datasets, which show up as
NA. The easy, loaded datasets don’t really have this issue,
but in practice this will be relevant. Look at the zillow dataset from
earlier:
zillow2 %>% group_by(RegionName) %>%
summarize(prop_missing = sum(is.na(value)) / n(),
mean_val = mean(value),
mean_val_nomissing = mean(value, na.rm=TRUE))
## # A tibble: 51 × 4
## RegionName prop_missing mean_val mean_val_nomissing
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 0.00356 NA 119037.
## 2 Alaska 0.00712 NA 237470.
## 3 Arizona 0.00356 NA 217589.
## 4 Arkansas 0 108971. 108971.
## 5 California 0 414039. 414039.
## 6 Colorado 0 290685. 290685.
## 7 Connecticut 0 261084. 261084.
## 8 Delaware 0 234044. 234044.
## 9 District of Columbia 0 400532. 400532.
## 10 Florida 0 194654. 194654.
## # ℹ 41 more rows
You can see that some states have a few values missing, while others have none at all. But as soon as any values are missing, many functions return NA, which you can see by comparing the mean_val column to the one computed with na.rm=TRUE. You might wish to replace those values with zeroes in some circumstances, or ignore them altogether, but make sure you understand what you're getting back.
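As one sketch of the replace-with-zero approach (on a toy vector rather than the zillow data), dplyr's coalesce() fills in missing values elementwise:

```r
# coalesce() returns the first non-missing value at each position,
# so coalesce(x, 0) replaces every NA with 0
x <- c(4, NA, 7, NA)
coalesce(x, 0)        # 4 0 7 0
mean(x, na.rm = TRUE) # ignores the NAs instead of zero-filling them
```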
So we’ve done a lot of data manipulation, but what about graphing? R
has base plotting functions, but ggplot is much better. You build up
graphs by adding pieces: a plot first, then points, regression lines,
connecting lines, labels, etc. You can give each individual piece
different data if you’d like, too. The aes() portion of the
plot/piece tells ggplot what to do with things—what to put on which
axis, but also if things need separate regression lines, or different
colors, sizes, shapes, or the like. If you want everything to be the
same size/color/shape, you can specify that outside of
aes(); sometimes, as below, this needs to be specified in
the actual point/line/etc. rather than in the ggplot() call
itself.
Note the slight alterations between the code for each graph that follows:
ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()
ggplot(data=mtcars, aes(x=wt, y=mpg, col=am)) + geom_point()
ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point(col='red') #all same color
ggplot(data=mtcars, aes(x=wt, y=mpg, col=am)) + geom_point() + geom_smooth() #very wiggly by default
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') #linear
You can also split the data out in such a way that different subsets
are graphed side by side using facet_wrap(), which takes up
to two variables.
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') + facet_wrap(~am)
ggplot(data=mtcars, aes(x=wt, y=mpg, col=(am==1))) + geom_point() + geom_smooth(method='lm') + facet_wrap(am~cyl)
If you’d like to specify particular x- and y-axis limits, you can do
so using two different options (though one is a worse idea than the
other). The first, good, option is coord_cartesian(): it
will simply zoom in the plot to the limits you specify. This is nice
because if you are, say, adding a regression line, it will stay the same
as you zoom in:
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') #reference
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() +
geom_smooth(method='lm') + coord_cartesian(xlim = c(2, 5), ylim = c(10, 30))
Sure, the axis labels changed a little (something you can also
customize), but the lines and points are all the same! This is notably
not true for the other “option”, the lims() family (which
also covers colors and other scales, but we'll look at xlim()
and ylim() here). These drop any data outside the limits
before anything else is computed:
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() +
geom_smooth(method='lm') + xlim(2, 5) + ylim(10, 30)
There’s arguably a use case for cutting things off (RD bandwidths?)
but I’d really encourage you to stick to coord_cartesian().
It’ll make life simpler.
One of the nice things about ggplot is its customizability. Using one
of the simpler graphs above, I’ll show how to edit (the labels of) each
portion of the graph using labs(), as well as the option to
manually specify colors:
ggplot(data=mtcars, aes(x=wt, y=mpg, group=am, col=(am==1))) + geom_point() + geom_smooth(method='lm') +
labs( #labels for everything
title='MPG by Weight',
subtitle='Using the mtcars dataset',
x='Weight',
y='Miles per Gallon (mpg)',
col='Transmission' #if this is fill/shape/size, that argument goes here. just like in aes()
) + #next, manually setting color values and labels
scale_color_manual(values=c('purple', 'blue'), labels=c('Automatic', 'Manual'))
If desired, you can also combine multiple graphs using the gridExtra package. Below, I also use the syntax to make an empty list of length 2, just in case you need that. Note the double brackets to properly access the elements of said list.
library(gridExtra)
graphlist <- vector('list', 2)
graphlist[[1]] <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point(col='red')
graphlist[[2]] <- ggplot(data=mtcars, aes(x=wt, y=cyl)) + geom_point(col='blue')
grid.arrange(grobs=graphlist, ncol=2)
As this is just an arranging tool, note that these are fully two different graphs—the y axes (and their scales) aren’t even the same. Use this with caution, but know you can do it.
If you want something to happen lots of times, there are plenty of options. If you need it to work on different things, you can write your own functions. I’ve included a simple one of my own below—hopefully you can figure out what it does.
sparse <- function(df, every) {
df[seq(1, nrow(df), by = every), ]
}
mtcars %>% sparse(2)
## # A tibble: 16 × 15
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 3 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 5 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 6 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## 7 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
## 8 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 9 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 10 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
## 11 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 12 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
## 13 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
## 14 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 15 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 16 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
mtcars %>% sparse(16)
## # A tibble: 2 × 15
## rowname mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Chrysler Im… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## # ℹ 3 more variables: mpg2 <dbl>, make <chr>, model <chr>
You can also use for loops and while loops. They do what they sound like: for every value x in some list y, do whatever’s in the braces. Or: while this condition is true, do what’s in the braces. If the body is a single line, the braces are technically optional, but you should use them in practice.
for(i in 1:10) print(i) #prints 1-10
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
paste0() concatenates strings; it’s used here to make
the formatting of the output a bit nicer!
for(nm in names(mtcars)) {
paste0('Average value of ', nm, ': ', mean(mtcars[[nm]])) %>% print()
paste0('Standard deviation of ', nm, ': ', sd(mtcars[[nm]])) %>% print()
print('')
}
## [1] "Average value of rowname: NA"
## [1] "Standard deviation of rowname: NA"
## [1] ""
## [1] "Average value of mpg: 20.090625"
## [1] "Standard deviation of mpg: 6.0269480520891"
## [1] ""
## [1] "Average value of cyl: 6.1875"
## [1] "Standard deviation of cyl: 1.78592164694654"
## [1] ""
## [1] "Average value of disp: 230.721875"
## [1] "Standard deviation of disp: 123.938693831382"
## [1] ""
## [1] "Average value of hp: 146.6875"
## [1] "Standard deviation of hp: 68.5628684893206"
## [1] ""
## [1] "Average value of drat: 3.5965625"
## [1] "Standard deviation of drat: 0.534678736070971"
## [1] ""
## [1] "Average value of wt: 3.21725"
## [1] "Standard deviation of wt: 0.978457442989697"
## [1] ""
## [1] "Average value of qsec: 17.84875"
## [1] "Standard deviation of qsec: 1.78694323609684"
## [1] ""
## [1] "Average value of vs: 0.4375"
## [1] "Standard deviation of vs: 0.504016128774185"
## [1] ""
## [1] "Average value of am: 0.40625"
## [1] "Standard deviation of am: 0.498990917235846"
## [1] ""
## [1] "Average value of gear: 3.6875"
## [1] "Standard deviation of gear: 0.737804065256947"
## [1] ""
## [1] "Average value of carb: 2.8125"
## [1] "Standard deviation of carb: 1.61519997763185"
## [1] ""
## [1] "Average value of mpg2: 25.090625"
## [1] "Standard deviation of mpg2: 6.0269480520891"
## [1] ""
## [1] "Average value of make: NA"
## [1] "Standard deviation of make: NA"
## [1] ""
## [1] "Average value of model: NA"
## [1] "Standard deviation of model: NA"
## [1] ""
And a while loop:
i <- 1
while(i < 100) {
i <- i + 1
}
paste0('Value: ', i)
## [1] "Value: 100"
There’s also a family of functions that end in “apply” that do a
version of for loops, but formatted a little differently. For
lapply(), the example we’ll see here, you feed it a
vector of values and the function to apply to each of those values. It
returns a list of the same length as the input, which is often exactly
what you need. You can also use it to make a data frame of sorts, as I
show below:
byname <- function(name) { #function to use
c(name=name, mean=mean(mtcars[[name]]), sd=sd(mtcars[[name]]))
}
lapply(names(mtcars), byname) %>% bind_rows()
## # A tibble: 15 × 3
## name mean sd
## <chr> <chr> <chr>
## 1 rowname <NA> <NA>
## 2 mpg 20.090625 6.0269480520891
## 3 cyl 6.1875 1.78592164694654
## 4 disp 230.721875 123.938693831382
## 5 hp 146.6875 68.5628684893206
## 6 drat 3.5965625 0.534678736070971
## 7 wt 3.21725 0.978457442989697
## 8 qsec 17.84875 1.78694323609684
## 9 vs 0.4375 0.504016128774185
## 10 am 0.40625 0.498990917235846
## 11 gear 3.6875 0.737804065256947
## 12 carb 2.8125 1.61519997763185
## 13 mpg2 25.090625 6.0269480520891
## 14 make <NA> <NA>
## 15 model <NA> <NA>
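A close relative is sapply(), which tries to simplify the result into a vector or matrix rather than a list, which is handy when each call returns a single number:

```r
# sapply() simplifies lapply()'s list output; each column mean
# becomes one element of a named numeric vector
sapply(mtcars[, c('mpg', 'hp', 'wt')], mean)
```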
There’s also if/else statements. The else is optional, but again, it’s descriptive: if this condition is true, do this; otherwise, do that. (Or, by default, do nothing.) Equality statements require double equals signs (==), so be careful of that.
i <- 1
if(i == 1) {
print('true')
} else {
print("false")
}
## [1] "true"
if(i != 1) {
print('true')
} else {
print("false")
}
## [1] "false"
That’s a simple demonstration, but if/else statements are powerful.
You can also use the ifelse() function on a vector of
conditions, which is pretty handy when using mutate(). You
can put whatever you want in the true/false returns, as long as each is
either length 1 or the same length as the condition vector:
ifelse(c(T,F,F,T,F), 'a', 'b')
## [1] "a" "b" "b" "a" "b"
mtcars %>% mutate(test = ifelse(cyl > 4, 'Many cylinders', 'Few cylinders')) %>%
select(make, model, test) %>% head(10)
## # A tibble: 10 × 3
## make model test
## <chr> <chr> <chr>
## 1 Mazda RX4 Many cylinders
## 2 Mazda RX4 Wag Many cylinders
## 3 Datsun 710 Few cylinders
## 4 Hornet 4 Drive Many cylinders
## 5 Hornet Sportabout Many cylinders
## 6 Valiant Valiant Many cylinders
## 7 Duster 360 Many cylinders
## 8 Merc 240D Few cylinders
## 9 Merc 230 Few cylinders
## 10 Merc 280 Many cylinders
If you’re expecting to have lots of potential cases, you might want to look into switch-style statements; dplyr’s vectorized version is case_when(). In practice, I haven’t needed these much, but here’s a mutating example:
mtcars %>% mutate(test = case_when(cyl==4 ~ 'one', cyl==6 ~ 'two', cyl==8 ~ 'three')) %>%
select(make, model, test) %>% head(8)
## # A tibble: 8 × 3
## make model test
## <chr> <chr> <chr>
## 1 Mazda RX4 two
## 2 Mazda RX4 Wag two
## 3 Datsun 710 one
## 4 Hornet 4 Drive two
## 5 Hornet Sportabout three
## 6 Valiant Valiant two
## 7 Duster 360 three
## 8 Merc 240D one
It goes down the list checking the conditions and uses the first one that matches.
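One caution: rows matching none of the conditions come back as NA. A catch-all final condition (TRUE matches everything remaining) acts as an else branch:

```r
# the final TRUE ~ 'other' catches any rows the earlier
# conditions missed, instead of leaving them NA
mtcars %>%
  mutate(test = case_when(cyl == 4 ~ 'one', cyl == 6 ~ 'two', TRUE ~ 'other')) %>%
  count(test)
```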
In this section, we’ll mostly be using the wage1 and wagepan datasets from Wooldridge’s book. The latter is panel data, which is nice for what we need.
library(wooldridge)
data('wage1')
wage1 %>% glimpse()
## Rows: 526
## Columns: 24
## $ wage <dbl> 3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18,…
## $ educ <int> 11, 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12, 12, 16…
## $ exper <int> 2, 22, 2, 44, 7, 9, 15, 5, 26, 22, 8, 3, 15, 18, 31, 14, 10, …
## $ tenure <int> 0, 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0, 10, 0, …
## $ nonwhite <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ female <int> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ married <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0…
## $ numdep <int> 2, 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 0, 3, 0, 0…
## $ smsa <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ northcen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ south <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ west <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ construc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ndurman <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trcommpu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trade <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ services <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ profserv <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1…
## $ profocc <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
## $ clerocc <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ servocc <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ lwage <dbl> 1.1314021, 1.1755733, 1.0986123, 1.7917595, 1.6677068, 2.1690…
## $ expersq <int> 4, 484, 4, 1936, 49, 81, 225, 25, 676, 484, 64, 9, 225, 324, …
## $ tenursq <int> 0, 4, 0, 784, 4, 64, 49, 9, 16, 441, 4, 0, 0, 9, 225, 0, 0, 1…
lm(lwage ~ educ + tenure + nonwhite * female, data=wage1) %>% summary()
##
## Call:
## lm(formula = lwage ~ educ + tenure + nonwhite * female, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97440 -0.25289 -0.03765 0.24589 1.30187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.629265 0.093351 6.741 4.20e-11 ***
## educ 0.081508 0.006696 12.172 < 2e-16 ***
## tenure 0.021627 0.002592 8.344 6.53e-16 ***
## nonwhite 0.018282 0.082907 0.221 0.826
## female -0.289465 0.039587 -7.312 9.99e-13 ***
## nonwhite:female -0.074720 0.121158 -0.617 0.538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4194 on 520 degrees of freedom
## Multiple R-squared: 0.3834, Adjusted R-squared: 0.3774
## F-statistic: 64.66 on 5 and 520 DF, p-value: < 2.2e-16
Besides a bit of a glimpse of the dataset, notice that we use the
lm() function to do regressions in base R. It…works just
fine. Note that the format of the formula that we insert is y ~ x1 + x2
+ … + x_n. These aren’t in quotes or anything. We also need to get a
summary of the regression object that we’ve created in
lm(). Getting an object back is nice because we can use
functions like predict() or resid() with these
objects to get fitted values and residuals. We can also more directly
access the coefficient values from them if desired:
reg <- lm(lwage ~ educ + tenure + nonwhite * female, data=wage1)
reg
##
## Call:
## lm(formula = lwage ~ educ + tenure + nonwhite * female, data = wage1)
##
## Coefficients:
## (Intercept) educ tenure nonwhite
## 0.62927 0.08151 0.02163 0.01828
## female nonwhite:female
## -0.28947 -0.07472
names(reg) #what does the reg object hold?
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
reg$coefficients
## (Intercept) educ tenure nonwhite female
## 0.62926514 0.08150781 0.02162687 0.01828224 -0.28946534
## nonwhite:female
## -0.07472044
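To make the predict()/resid() point concrete: both take the stored regression object, and by construction the fitted values plus the residuals reproduce the dependent variable exactly:

```r
# fitted values and residuals from the saved lm object; their sum
# equals the observed lwage by construction
head(predict(reg))
head(resid(reg))
all.equal(predict(reg) + resid(reg), wage1$lwage, check.attributes = FALSE)
```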
Note the way I put in the interaction term in these regressions: with
an asterisk. By default, a * b gives you a + b + a:b, which can be good
or bad. If you need just the interaction term—that is, only a*b—then
you can use a:b or, as below, the I() function:
lm(lwage ~ educ + tenure + I(nonwhite * female), data=wage1) %>% summary()
##
## Call:
## lm(formula = lwage ~ educ + tenure + I(nonwhite * female), data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11486 -0.29738 -0.02029 0.29223 1.43921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.421056 0.091606 4.596 5.4e-06 ***
## educ 0.086118 0.006965 12.365 < 2e-16 ***
## tenure 0.025512 0.002672 9.549 < 2e-16 ***
## I(nonwhite * female) -0.208138 0.090516 -2.299 0.0219 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.441 on 522 degrees of freedom
## Multiple R-squared: 0.3155, Adjusted R-squared: 0.3115
## F-statistic: 80.18 on 3 and 522 DF, p-value: < 2.2e-16
This goes for squaring terms as well. However, you can apply many functions to variables directly:
lm(log(wage) ~ I(educ ^ 2) + (tenure > 10) + I(nonwhite * female), data=wage1) %>% summary()
##
## Call:
## lm(formula = log(wage) ~ I(educ^2) + (tenure > 10) + I(nonwhite *
## female), data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11845 -0.30492 -0.04298 0.28711 1.51216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9293157 0.0541775 17.153 < 2e-16 ***
## I(educ^2) 0.0038490 0.0002944 13.076 < 2e-16 ***
## tenure > 10TRUE 0.4180373 0.0536736 7.789 3.68e-14 ***
## I(nonwhite * female) -0.1878309 0.0916589 -2.049 0.0409 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4458 on 522 degrees of freedom
## Multiple R-squared: 0.3006, Adjusted R-squared: 0.2965
## F-statistic: 74.77 on 3 and 522 DF, p-value: < 2.2e-16
This may or may not be a good idea, but it sure is convenient!
The main library you most likely want to use—in part due to its
speed—is the fixest library. feols() is a function that
incorporates robust inference, fixed effects (in the form of dummies,
same as the “within” regression), and instrumental variables.
library(fixest)
feols(lwage ~ educ + tenure + nonwhite * female, data=wage1, vcov='hetero') #heteroskedastic standard errors
## OLS estimation, Dep. Var.: lwage
## Observations: 526
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.629265 0.102899 6.115379 1.8941e-09 ***
## educ 0.081508 0.007434 10.964225 < 2.2e-16 ***
## tenure 0.021627 0.003151 6.862529 1.9312e-11 ***
## nonwhite 0.018282 0.090491 0.202035 8.3997e-01
## female -0.289465 0.040041 -7.229249 1.7446e-12 ***
## nonwhite:female -0.074720 0.124115 -0.602027 5.4742e-01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.416998 Adj. R2: 0.37744
It also includes a function i(), which is similar to
I(), but plays nice with interactions of factor variables.
You can also tell it what value(s) should be the reference point if
desired.
feols(lwage ~ educ + tenure + i(nonwhite * female), data=wage1, se='hetero')
## OLS estimation, Dep. Var.: lwage
## Observations: 526
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.421056 0.093695 4.49390 8.6193e-06 ***
## educ 0.086118 0.007210 11.94468 < 2.2e-16 ***
## tenure 0.025512 0.003297 7.73899 5.2219e-14 ***
## nonwhite * female::1 -0.208138 0.084455 -2.46448 1.4042e-02 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.439362 Adj. R2: 0.31152
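As a sketch of the reference-level option (I'm using numdep, the number of dependents, purely as a convenient categorical-ish variable here), ref= sets which level gets dropped:

```r
# i(numdep, ref = 0) creates a dummy for each number of dependents,
# omitting zero dependents as the reference category
feols(lwage ~ educ + tenure + i(numdep, ref = 0), data = wage1, se = 'hetero')
```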
And now IV:
feols(lwage ~ married + nonwhite * female | tenure ~ educ + exper + expersq, data=wage1, se='hetero') #iv regression
## TSLS estimation, Dep. Var.: lwage, Endo.: tenure, Instr.: educ, exper, expersq
## Second stage: Dep. Var.: lwage
## Observations: 526
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.582330 0.047558 33.271717 < 2.2e-16 ***
## fit_tenure 0.018989 0.005979 3.175973 1.5820e-03 **
## married 0.172805 0.047277 3.655146 2.8315e-04 ***
## nonwhite -0.097020 0.105598 -0.918766 3.5864e-01
## female -0.323067 0.043769 -7.381200 6.2605e-13 ***
## nonwhite:female 0.076129 0.137878 0.552143 5.8109e-01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.465384 Adj. R2: 0.224581
## F-test (1st stage), tenure: stat = 50.6 , p < 2.2e-16 , on 3 and 518 DoF.
## Wu-Hausman: stat = 0.178328, p = 0.672989, on 1 and 519 DoF.
## Sargan: stat = 140.3 , p < 2.2e-16 , on 2 DoF.
Another great benefit is that you can feed feols() a
vector of dependent variables and it’ll run all the regressions
together. etable() is a handy function to arrange these
multiple regressions side by side. It can also produce results in LaTeX
format if desired; check the documentation, or use its tex
argument (the older esttex() does the same).
feols(c(lwage, wage) ~ married + nonwhite * female | tenure ~ educ + exper + expersq, data=wage1, se='hetero') %>% etable()
## ..1 ..2
## Dependent Var.: lwage wage
##
## Constant 1.582*** (0.0476) 5.631*** (0.2738)
## tenure 0.0190** (0.0060) 0.1455*** (0.0389)
## married 0.1728*** (0.0473) 0.8743** (0.2966)
## nonwhite -0.0970 (0.1056) -0.6940 (0.7401)
## female -0.3231*** (0.0438) -2.009*** (0.2903)
## nonwhite x female 0.0761 (0.1379) 0.5121 (0.8635)
## _________________ ___________________ __________________
## S.E. type Heteroskedast.-rob. Heteroskedas.-rob.
## Observations 526 526
## R2 0.23197 0.21205
## Adj. R2 0.22458 0.20447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are two options for bootstrapping: one uses the
lmtest and sandwich libraries, and is a bit
more direct. The formatting might be unfamiliar, but we effectively take
our regression object and put it through a coefficient test, specifying
the form of the variance-covariance matrix (using that same
regression):
library(lmtest)
library(sandwich)
#bootstrapping
feols(hp ~ wt + gear + am, mtcars) %>% coeftest(vcov = vcovBS(., R=50))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -173.428 106.216 -1.6328 0.1025
## wt 67.228 14.555 4.6190 3.856e-06 ***
## gear 25.017 25.315 0.9882 0.3230
## am 28.499 35.279 0.8078 0.4192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second option uses the boot library, and requires
you to write a function for it to run repeatedly. Both options are seen
below.
#more customization
library(boot)
boot(mtcars, function(data, indices) lm(hp ~ wt + gear + am, data[indices,]) %>% coefficients() %>% return(), R=50)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = mtcars, statistic = function(data, indices) lm(hp ~
## wt + gear + am, data[indices, ]) %>% coefficients() %>% return(),
## R = 50)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -173.42840 12.314506 112.31344
## t2* 67.22785 -1.656678 17.96803
## t3* 25.01673 -1.727126 24.18994
## t4* 28.49948 -6.232931 35.47859
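If you want confidence intervals from that boot object, boot.ci() computes several flavors; index picks the coefficient (here the second, wt). Note that 50 replications is quite few for intervals, so this sketch bumps R up:

```r
# percentile bootstrap CI for the wt coefficient; index = 2 because
# the intercept is the first element of the coefficient vector
boot_out <- boot(mtcars,
                 function(data, indices) coefficients(lm(hp ~ wt + gear + am, data[indices, ])),
                 R = 1000)
boot.ci(boot_out, type = 'perc', index = 2)
```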
For panel data, there are also options. feols() works
pretty cleanly if you’re content with the “within” transformation, but
if you’d like to do first differencing or another variety of
fixed-effect model, the plm library is also an option. Your
data will need to be transformed into a pdata.frame, and you’ll use the
plm() function for regressions. Note that you’ll need to
use coeftest() and a vcov specification (for the
variance-covariance matrix) in order to get clustered results or the
like.
library(plm)
data('wagepan')
wagepan2 <- wagepan %>% pdata.frame(index=c('nr', 'year')) #unit and year
plm(lwage ~ educ + union + married + exper + expersq + factor(year),
data=wagepan2, model='within') %>%
coeftest(vcov=vcovHC(., type='HC1', cluster='group'))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## union 0.08000186 0.02272222 3.5209 0.0004352 ***
## married 0.04668036 0.02098454 2.2245 0.0261725 *
## exper 0.13214642 0.01199701 11.0149 < 2.2e-16 ***
## expersq -0.00518550 0.00080949 -6.4058 1.678e-10 ***
## factor(year)1981 0.01904479 0.02270581 0.8388 0.4016552
## factor(year)1982 -0.01132198 0.02119722 -0.5341 0.5932857
## factor(year)1983 -0.04199552 0.02048986 -2.0496 0.0404743 *
## factor(year)1984 -0.03847088 0.02115271 -1.8187 0.0690326 .
## factor(year)1985 -0.04324982 0.01757881 -2.4603 0.0139247 *
## factor(year)1986 -0.02738194 0.01620317 -1.6899 0.0911266 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
feols(lwage ~ educ + union + married + exper + expersq | year + nr,
data=wagepan, cluster=~nr)
## OLS estimation, Dep. Var.: lwage
## Observations: 4,360
## Fixed-effects: year: 8, nr: 545
## Standard-errors: Clustered (nr)
## Estimate Std. Error t value Pr(>|t|)
## union 0.080002 0.022743 3.51763 4.7182e-04 ***
## married 0.046680 0.021004 2.22247 2.6662e-02 *
## expersq -0.005185 0.000810 -6.39996 3.3575e-10 ***
## ... 2 variables were removed because of collinearity (educ and exper)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.327891 Adj. R2: 0.565718
## Within R2: 0.021568
Note that the two above regressions have the same output, because they’re mathematically the same. If you’d like first differencing instead, you only need to swap the model argument. Note that time-invariant variables (like educ) and variables that change by the same amount every period (like exper) are differenced away, so drop them from the formula:
plm(lwage ~ union + married + expersq + factor(year),
data=wagepan2, model='fd') %>%
coeftest(vcov=vcovHC(., type='HC1', cluster='group'))
That pretty much covers the major use case for plm.
Probit/logit/etc. regressions are also possible, whether in base R (with glm()) or in fixest, with feglm(). Check the documentation to get the family and link arguments right for whatever kind of nonlinearity you’re trying to fit.
feglm(am ~ wt + gear, mtcars, family=binomial(link='probit'))
## GLM estimation, family = binomial(link = "probit"), Dep. Var.: am
## Observations: 32
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.39583 39787.2 0.000211 0.99983
## wt -43.19819 14282.9 -0.003024 0.99759
## gear 30.43512 12331.5 0.002468 0.99803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Log-Likelihood: -2.966e-9 Adj. Pseudo R2: 0.907471
## BIC: 10.4 Squared Cor.: 1
## # Evaluations: lhs: 25 -- no convergence: Maximum number of iterations reached
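For comparison, here’s a minimal sketch of the same probit in base R with glm(). One caveat: am in mtcars is perfectly separable from these predictors, so glm() will warn that fitted probabilities of 0 or 1 occurred, which mirrors the non-convergence message from feglm() above.

```r
# Same probit model in base R; mtcars ships with R, so this runs as-is.
# Expect a separation warning: the model fits the data "too well".
probit_fit <- glm(am ~ wt + gear, data = mtcars,
                  family = binomial(link = 'probit'))
summary(probit_fit)

# Swapping the link gives a logit instead:
logit_fit <- glm(am ~ wt + gear, data = mtcars,
                 family = binomial(link = 'logit'))
```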
If we want to do parallel processing, the easiest way is to parallelize a for loop. We do this with the doParallel and foreach libraries, the second of which loads automatically when we load doParallel. We set things up much like a for loop, but rather than storing results somewhere inside the loop, foreach() takes the last expression evaluated in each iteration as that iteration’s result. You can see here that the syntax is a bit different. Also note that unless you use the .combine argument, you’ll get a list back by default, so make sure you know exactly what your loop is returning.
library(doParallel)
detectCores()
## [1] 16
registerDoParallel(detectCores()-1) #telling R how many cores to use
foreach_ex <- foreach(i=1:10, .combine=rbind) %dopar% { rnorm(5, i, i) }
foreach_ex %>% data.frame() %>% glimpse()
## Rows: 10
## Columns: 5
## $ X1 <dbl> 2.0135517, -0.2543631, 6.7964948, 0.3661041, -0.5406532, 10.0590846…
## $ X2 <dbl> 0.6297509, -0.1296349, -1.1369261, 3.8595767, 9.2997136, 7.9394745,…
## $ X3 <dbl> 1.7931296, 0.8987631, 3.2743410, 13.7863044, 6.3722779, 17.3893519,…
## $ X4 <dbl> 0.9663244, 1.3404502, 5.8537475, 6.4429921, 15.0212213, 6.2158285, …
## $ X5 <dbl> 0.5027726, 1.5569020, -1.0815818, 6.6441962, 9.4803718, 9.6832167, …
A note on core selection: on your own computer, I’d recommend using at most all but one of your cores, and you’ll need enough RAM to copy everything over that many times. If you have 2GB of objects to copy into each worker, 16 cores, and 16GB of RAM, 7 workers might be about as many as you can manage. (Leave at least one core free so your computer can keep running graphical processes and not, well, crash.) This advice doesn’t apply on the HPCC, where you should specify the number of cores you’ve requisitioned; I learned the hard way that detectCores() reports everything on the machine, not just what you requested.
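On the cluster side, one way to respect your allocation is to read the core count from the scheduler’s environment instead of detectCores(). This is a sketch assuming a SLURM scheduler, where SLURM_CPUS_PER_TASK holds your requested cores; check your own cluster’s documentation for the right variable name.

```r
library(doParallel)

# Read the allocated core count from the scheduler (SLURM here), falling
# back to 1 if the variable isn't set (e.g., when running locally).
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
registerDoParallel(n_cores)
```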
If you want to parallelize random processes, however, there’s a catch: the naive approach isn’t replicable.
set.seed(99)
a <- foreach(i=1:10, .combine=rbind) %dopar% { runif(1) }
set.seed(99)
b <- foreach(i=1:10, .combine=rbind) %dopar% { runif(1) }
a == b
## [,1]
## result.1 FALSE
## result.2 FALSE
## result.3 FALSE
## result.4 FALSE
## result.5 FALSE
## result.6 FALSE
## result.7 FALSE
## result.8 FALSE
## result.9 FALSE
## result.10 FALSE
We set our seed, but it didn’t work: a and b are not equal, because set.seed() in the main session doesn’t reach the parallel workers. Thankfully, there’s an easy fix: load another library and replace %dopar% with %dorng%:
library(doRNG)
set.seed(99)
a1 <- foreach(i=1:10, .combine=rbind) %dorng% { runif(1) }
set.seed(99)
b1 <- foreach(i=1:10, .combine=rbind) %dorng% { runif(1) }
a1 == b1
## [,1]
## result.1 TRUE
## result.2 TRUE
## result.3 TRUE
## result.4 TRUE
## result.5 TRUE
## result.6 TRUE
## result.7 TRUE
## result.8 TRUE
## result.9 TRUE
## result.10 TRUE
None of these examples are necessarily “good” reasons to try to parallelize processes, but if you have a for loop where each iteration takes quite a while, that’d be a good reason. Likewise, if that long process involves some randomness—say, you’re trying to generate a large number of random counterfactuals for a policy change—make sure you do it in a replicable way. It’ll save you a lot of headaches.
There are other tools that might be better for the job (namely Python, especially if you need to scrape consistently), but R can do it. We’ll start with a basic example and a service designed for scraping (or, rather, repeated queries): an API. Specifically, an API that supplies cat facts.
library(rvest)
cats <- read_html('https://catfact.ninja/fact')
cats
## {html_document}
## <html>
## [1] <body><p>{"fact":"The Pilgrims were the first to introduce cats to North ...
cats %>% html_text()
## [1] "{\"fact\":\"The Pilgrims were the first to introduce cats to North America.\",\"length\":63}"
html_text() is good for simply converting a page into text. It’s not always this comprehensible (and even this is messy looking). It’s best to refer back to the site you’re using as you go. The cat fact you get might not be the same, but you get an idea of the formatting.
What we’d like to do is remove the quotes and the curly braces, and split things out by piece. We have a “fact” field and a “length” field—this sounds like a job for regular expressions!
cats %>% html_text() %>%
gsub('"|\\{|\\}', '', .) %>% #double quotes or curly braces. note the double backslash to escape
str_split(':|(\\.,)') %>% unlist() #split on colon or on end of fact -> length field.
## [1] "fact"
## [2] "The Pilgrims were the first to introduce cats to North America"
## [3] "length"
## [4] "63"
Now, this is all well and good, but most API queries are going to be a little more complicated. Let’s say that we want to see the first five facts with lengths under 30 characters.
cats_query <- read_html('https://catfact.ninja/facts?max_length=30&limit=5')
Notice that we can simply edit the URL to get what we want. This is a very good sign; if you have to do much by hand, or there isn’t a pattern in the URLs, the scrape will be far more painful. Now, the result is a bit messy:
cats_query %>% html_text()
## [1] "{\"current_page\":1,\"data\":[{\"fact\":\"Cats have 3 eyelids.\",\"length\":20},{\"fact\":\"Cats walk on their toes.\",\"length\":24},{\"fact\":\"Most cats adore sardines.\",\"length\":25},{\"fact\":\"Cats dislike citrus scent.\",\"length\":26},{\"fact\":\"Cats have supersonic hearing\",\"length\":28}],\"first_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=1\",\"from\":1,\"last_page\":2,\"last_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"links\":[{\"url\":null,\"label\":\"Previous\",\"active\":false},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=1\",\"label\":\"1\",\"active\":true},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"label\":\"2\",\"active\":false},{\"url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"label\":\"Next\",\"active\":false}],\"next_page_url\":\"https:\\/\\/catfact.ninja\\/facts?page=2\",\"path\":\"https:\\/\\/catfact.ninja\\/facts\",\"per_page\":\"5\",\"prev_page_url\":null,\"to\":5,\"total\":7}"
If we’re thinking about the path of least resistance, we don’t really need the length of each fact, do we? We can recalculate that; it’d be easier to just extract the facts. Again, regular expressions will help, this time in the form of str_extract_all(), which extracts all matches to a given pattern:
cats_query %>% html_text() %>%
gsub('"|\\{|\\}', '', .) %>%
str_extract_all('fact:[A-z 0-9\\.]+') %>%
unlist() %>% gsub('(fact:)|"', '', .)
## [1] "Cats have 3 eyelids." "Cats walk on their toes."
## [3] "Most cats adore sardines." "Cats dislike citrus scent."
## [5] "Cats have supersonic hearing"
What’s great about this is that it’s (theoretically) scalable, thanks to the URL parameters! Let’s say we now want the first 25 facts shorter than 200 characters:
read_html('https://catfact.ninja/facts?max_length=200&limit=25') %>%
html_text() %>%
gsub('"|\\{|\\}', '', .) %>%
str_extract_all('fact:[A-z \\-0-9\'%\\.]+') %>%
unlist() %>% gsub('(fact:)|"', '', .)
## [1] "Cats have 3 eyelids."
## [2] "Cats walk on their toes."
## [3] "Most cats adore sardines."
## [4] "Cats dislike citrus scent."
## [5] "Cats have supersonic hearing"
## [6] "Female cats are polyestrous"
## [7] "A form of AIDS exists in cats."
## [8] "Female felines are \\\\superfecund"
## [9] "Milk can give some cats diarrhea."
## [10] "A group of cats is called a clowder."
## [11] "A group of cats is called a \\u201cclowder.\\u201d"
## [12] "Cat's urine glows under a black light."
## [13] "Cats can be right-pawed or left-pawed."
## [14] "A tiger's stripes are like fingerprints"
## [15] "70% of your cat's life is spent asleep."
## [16] "Cats have the largest eyes of any mammal."
## [17] "A cat cannot see directly under its nose."
## [18] "A female cat is called a queen or a molly."
## [19] "There are approximately 100 breeds of cat."
## [20] "Approximately 24 cat skins can make a coat."
## [21] "A domestic cat can run at speeds of 30 mph."
## [22] "Cats take between 20-40 breaths per minute."
## [23] "The cat's tail is used to maintain balance."
## [24] "A happy cat holds her tail high and steady."
## [25] "A cats field of vision is about 185 degrees."
You may notice that I lied a bit. I did have to make some tweaks, which is why it’s good to verify what you’re doing against the web page itself. For instance, above, I had a regex that matched letters, spaces, and numbers. Some facts contain quotes (the weird-looking \\u201 stuff, actually), dashes, apostrophes, and one percent sign; moreover, some don’t actually end in a period. Trial and error is, unfortunately, part of the process in webscraping.
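Since this API actually returns JSON, there’s also a way to sidestep the regex fiddling entirely: parse the response as JSON. A sketch, assuming you have the jsonlite package installed:

```r
library(jsonlite)

# fromJSON() can read straight from a URL and returns nested R objects;
# the facts arrive as a data frame under $data, no regex required.
res <- fromJSON('https://catfact.ninja/facts?max_length=200&limit=25')
res$data$fact    # character vector of the facts themselves
res$data$length  # their lengths, already computed by the API
```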
If you need to scrape more complex sites, there’s a way to do that, but it’s messier, and you’ll probably want to specify just one part of the website to look at. For our first example of this, let’s look at Wikipedia:
library(rvest)
page <- read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')
page %>% html_element('.wikitable') %>% html_table()
## # A tibble: 195 × 4
## `Common and formal names` Membership within th…¹ Sovereignty dispute[…²
## <chr> <chr> <chr>
## 1 Afghanistan A UN member state A None
## 2 Albania – Republic of Albania A UN member state A None
## 3 Algeria – People's Democratic … A UN member state A None
## 4 Andorra – Principality of Ando… A UN member state A None
## 5 Angola – Republic of Angola A UN member state A None
## 6 Antigua and Barbuda A UN member state A None
## 7 Argentina – Argentine Republic… A UN member state A None
## 8 Armenia – Republic of Armenia A UN member state Not recognised by Pak…
## 9 Australia – Commonwealth of Au… A UN member state A None
## 10 Austria – Republic of Austria A UN member state A None
## # ℹ 185 more rows
## # ℹ abbreviated names: ¹`Membership within the UN System[c]`,
## # ²`Sovereignty dispute[d]`
## # ℹ 1 more variable:
## # `Further information on status and recognition of sovereignty[f]` <chr>
page %>% html_element('.wikitable') %>% html_table() %>%
filter(`Sovereignty dispute[d]` != 'A None')
## # A tibble: 7 × 4
## `Common and formal names` Membership within th…¹ Sovereignty dispute[…²
## <chr> <chr> <chr>
## 1 Armenia – Republic of Armenia A UN member state Not recognised by Pak…
## 2 China – People's Republic of Ch… A UN member state Partially unrecognise…
## 3 Cyprus – Republic of Cyprus A UN member state Not recognised by Tur…
## 4 Israel – State of Israel A UN member state Partially unrecognised
## 5 North Korea – Democratic People… A UN member state BClaimed by Afghanist…
## 6 Palestine – State of Palestine A UN General Assembly… Partially unrecognise…
## 7 South Korea – Republic of Korea A UN member state BClaimed by Afghanist…
## # ℹ abbreviated names: ¹`Membership within the UN System[c]`,
## # ²`Sovereignty dispute[d]`
## # ℹ 1 more variable:
## # `Further information on status and recognition of sovereignty[f]` <chr>
The html_element() function is grabbing by CSS selector, which works great for Wikipedia but isn’t always generalizable; whether it works will depend on your use case.
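For reference, html_element()/html_elements() accept any CSS selector, not just class names. A few common patterns, run against the Wikipedia page loaded above (the specific selectors, like Wikipedia’s #firstHeading page-title id, are illustrative; inspect the page you’re scraping to find your own):

```r
# Select by tag name:
page %>% html_elements('h2') %>% html_text2()

# Select by id with '#' (here, the element holding the page title):
page %>% html_element('#firstHeading') %>% html_text2()

# Combine tag and class:
page %>% html_elements('table.wikitable') %>% length()
```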
More broadly, you can use XPaths, which you’ll probably want an external tool like SelectorGadget (a Chrome plugin) to help you figure out. For another example, we do that with the IMDb page for the TV show “Severance”, pulling a table of the cast members.
Frequently, as below, the data won’t be quite in the format you want. Note the regular expressions I’m using, not only to fix the formatting (so there aren’t newlines and empty white space everywhere) but also to split apart the part-played and appearance portions of the table. I’d guess the irregularities here come from blank areas in the table on the page itself, since some actors have been confirmed for the next season but their parts remain unknown.
imdb <- 'https://www.imdb.com/title/tt11280740/fullcredits?ref_=tt_cl_sm' %>% read_html()
imdb %>%
html_elements(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "cast_list", " " ))]//td') %>%
html_text() %>%
gsub('(\\n)|( {2,})', '', .) %>% .[. != '' & . != '...'] %>%
data.frame(actor=.[seq(1, length(.), 2)], part=.[seq(2, length(.), 2)]) %>% select(-`.`) %>%
mutate(episodes = gsub('^([^1-9]+)', '', part), part = gsub('[1-9][0-9]? ep.+', '', part)) %>%
unique() %>% head() %>% knitr::kable()
| actor | part | episodes |
|---|---|---|
| Adam Scott | Mark | 19 episodes, 2022-2024 |
| Zach Cherry | Dylan | 19 episodes, 2022-2024 |
| Britt Lower | Helly | 19 episodes, 2022-2024 |
| Tramell Tillman | Milchick | 19 episodes, 2022-2024 |
| Jen Tullock | Devon | 19 episodes, 2022-2024 |
| Dichen Lachman | Ms. Casey | 19 episodes, 2022-2024 |
Note the use of backticks (`) in both examples to select a column whose name doesn’t follow R’s naming rules. Webscraping is complex; if you can use an API instead, that’s preferable, since (often) things will appear in a static, simple, text-only format. That lets you avoid the mess we see above.
If you take anything away from this demonstration, it should be this: if you can use an API for something, please do. It’ll make your life so much easier.
There are a few things you should know about RMarkdown when you’re planning on using it. In general, math works like LaTeX, and LaTeX formatting should render correctly. For instance, you can insert an equation:
\[\begin{equation*} y = m\cdot x + b \end{equation*}\]
I’d generally advise starting with HTML documents, because they’re a bit easier to work with: you can scroll within code chunks and, as in this document, collapse them if desired. Formatting generally carries over no matter which output type you choose. To generate a document (that is, to make it not a .Rmd file anymore), you “knit” it, either from the viewer or by hitting Ctrl+Shift+K. To insert an R code chunk, hit Ctrl+Alt+I. (I assume there are close equivalents on macOS.) Chunks can be named, so during knitting you can see where code is running, and where the point of failure is if something breaks; you don’t technically have to name them, but if you do, make sure there are no duplicate names.
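One related quality-of-life tip: rather than setting options chunk by chunk, you can set document-wide defaults in a setup chunk near the top of the .Rmd with knitr::opts_chunk$set(). The options below (echo, message, warning) are standard knitr chunk options; adjust to taste.

```r
# Typical setup-chunk contents: show code in the knitted output, but
# hide package startup messages and warnings.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```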
On the R side of things, though, it’s important to note that RMarkdown documents do not pull from the global environment. If you’ve loaded some kind of data in your regular R session and try to use it in RMarkdown, the knit will fail because the document can’t find the data. You also have a few quality-of-life tools to make your results look better (that is, not like ASCII tables made in Notepad). The main one is the knitr package, whose kable() function makes some decent-looking tables, as you can see below (or above, in the IMDb example):
wage1 %>% .[1:10,1:17] %>% knitr::kable()
| wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | south | west | construc | ndurman | trcommpu | trade | services |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.10 | 11 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3.24 | 12 | 22 | 2 | 0 | 1 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3.00 | 11 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 6.00 | 8 | 44 | 28 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5.30 | 12 | 7 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8.75 | 16 | 9 | 8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 11.25 | 18 | 15 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5.00 | 12 | 5 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3.60 | 12 | 26 | 4 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 18.18 | 17 | 22 | 21 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
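kable() also takes a few arguments worth knowing; digits and caption are the ones I reach for most (see ?knitr::kable for the full list). A quick sketch using the built-in mtcars data:

```r
# digits rounds numeric columns; caption labels the table in the
# knitted output. mtcars ships with R, so this is self-contained.
knitr::kable(mtcars[1:5, 1:4], digits = 1, caption = 'A few rows of mtcars')
```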
Generally speaking, though, RMarkdown works like a cross between a somewhat more linear LaTeX document and a regular R script. The main thing is making sure that the code you’ve put into the file is everything it needs, given the environment concerns above. It’s nice to be able to show off exactly what you did!