Chapter 15 Speeding up R
library(microbenchmark) # for measuring how long stuff takes
library(doMC) # do multi-core stuff
library(foreach) # parallelizable for loops
library(ggplot2)
library(dplyr)
library(faraway) # some examples
library(boot)
library(caret)
library(glmnet)
Eventually if you have large enough data sets, an R user eventually writes code that is slow to execute and needs to be sped up. This chapter tries to lay out common problems and bad habits and shows how to correct them. However, the correctness and maintainability of code should take precendence over speed. Too often, misguided attempts to obtain efficient code results in an unmaintainable mess that is no faster that the initial code.
Hadley Wickham has a book aimed at advanced R user that describes many of the finer details about R. One section in the book describes his process for building fast, maintainable software projects and if you have the time, I highly suggest reading the on-line version, Advanced R.
First we need some way of measuring how long our code took to run. For this we will use the package microbenchmark
. The idea is that we want to evaluate two or three expressions that solve a problem.
x <- runif(1000)
microbenchmark(
sqrt(x), # First expression to compare
x^(0.5) # second expression to compare
) %>% print(digits=3)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sqrt(x) 2.59 2.69 2.92 2.77 2.89 9.33 100 a
## x^(0.5) 30.37 30.73 31.59 30.87 31.00 52.08 100 b
What microbenchmark
does is run the two expressions a number of times and then produces the 5-number summary of those times. By running it multiple times, we account for the randomness associated with a operating system that is also running at the same time.
15.1 Faster for loops?
Often we need to perform some simple action repeatedly. It is natural to write a for
loop to do the action and we wish to speed the up. In this first case, we will consider having to do the action millions of times and each chunk of computation within the for
takes very little time.
Consider frame of 4 columns, and for each of \(n\) rows, we wish to know which column has the largest value.
make.data <- function(n){
data <- cbind(
rnorm(n, mean=5, sd=2),
rpois(n, lambda = 5),
rgamma(n, shape = 2, scale = 3),
rexp(n, rate = 1/5))
data <- data.frame(data)
return(data)
}
data <- make.data(100)
The way that you might first think about solving this problem is to write a for loop and, for each row, figure it out.
f1 <- function( input ){
output <- NULL
for( i in 1:nrow(input) ){
output[i] <- which.max( input[i,] )
}
return(output)
}
We might consider that there are two ways to return a value from a function (using the return
function and just printing it). In fact, I’ve always heard that using the return
statment is a touch slower.
f2.noReturn <- function( input ){
output <- NULL
for( i in 1:nrow(input) ){
output[i] <- which.max( input[i,] )
}
output
}
data <- make.data(100)
microbenchmark(
f1(data),
f2.noReturn(data)
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(data) 3.3 3.43 3.70 3.56 3.71 8.23 100 a
## f2.noReturn(data) 3.3 3.41 3.84 3.54 3.84 9.20 100 a
In fact, it looks like it is a touch slower, but not massively compared to the run-to-run variability. I prefer to use the return
statement for readability, but if we agree have the last line of code in the function be whatever needs to be returned, readability isn’t strongly effected.
We next consider whether it would be faster to allocate the output vector once we figure out the number of rows needed, or just build it on the fly?
f3.AllocOutput <- function( input ){
n <- nrow(input)
output <- rep(NULL, n)
for( i in 1:nrow(input) ){
output[i] <- which.max( input[i,] )
}
return(output)
}
microbenchmark(
f1(data),
f3.AllocOutput(data)
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f1(data) 3.32 3.54 3.97 3.80 4.14 8.15 100 a
## f3.AllocOutput(data) 3.29 3.50 4.00 3.77 4.09 9.07 100 a
If anything, allocating the size of output first was slower. So given this, we shouldn’t feel to bad being lazy and using output <- NULL
to initiallize things.
15.2 Vectorizing loops
In general, for
loops in R are very slow and we want to avoid them as much as possible. The apply
family of functions can be quite helpful for applying a function to each row or column of a matrix or data.frame or to each element of a list.
To test this, instead of a for
loop, we will use apply
.
f4.apply <- function( input ){
output <- apply(input, 1, which.max)
return(output)
}
microbenchmark(
f1(data),
f4.apply(data)
) %>% print(digits=3)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f1(data) 3291 3538 3922 3742 4086 6578 100 b
## f4.apply(data) 264 305 375 337 385 2272 100 a
This is the type of speed up that matters. We have a 10-fold speed up in execution time and particularly the maximum time has dropped impressively.
Unfortunately, I have always found the apply
functions a little cumbersome and I prefer to use dplyr
instead strictly for readability.
f5.dplyr <- function( input ){
output <- input %>%
mutate( max.col=which.max( c(X1, X2, X3, X4) ) )
return(output$max.col)
}
microbenchmark(
f4.apply(data),
f5.dplyr(data)
) %>% print(digits=3)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f4.apply(data) 265 291 328 316 350 575 100 a
## f5.dplyr(data) 1772 1865 2142 2021 2200 5981 100 b
Unfortunately dplyr
is a lot slower than apply
in this case. I wonder if the dynamics would change with a larger n
?
data <- make.data(10000)
microbenchmark(
f4.apply(data),
f5.dplyr(data)
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f4.apply(data) 23.3 26.20 33.2 28.96 33.29 238.9 100 b
## f5.dplyr(data) 2.2 2.43 3.0 2.67 3.02 19.6 100 a
data <- make.data(100000)
microbenchmark(
f4.apply(data),
f5.dplyr(data)
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f4.apply(data) 288.1 374.29 493.32 463.37 579.22 1067 100 b
## f5.dplyr(data) 3.4 4.52 8.54 4.97 6.23 170 100 a
What just happened? The package dplyr
is designed to work well for large data sets, and utilizes a modified structure, called a tibble
, which provides massive benefits for large tables, but at the small scale, the overhead of converting the data.frame
to a tibble
overwhelms any speed up. But because the small sample case is already fast enough to not be noticable, we don’t really care about the small n
case.
15.3 Parallel Processing
Most modern computers have multiple computing cores, and can run muliple processes at the same time. Sometimes this means that you can run multiple programs and switch back and forth easily without lag, but we are now interested in using as many cores as possible to get our statistical calculations completed by using muliple processing cores at the same time. This is referred to as running the process “in parallel” and there are many tasks in modern statistical computing that are “embarrasingly easily parallelized”. In particular bootstrapping and cross validation techniques are extremely easy to implement in a parallel fashion.
However, running commands in parallel incurs some overhead cost in set up computation, as well as all the message passing from core to core. For example, to have 5 cores all perform an analysis on a set of data, all 5 cores must have access to the data, and not overwrite any of it. So parallelizing code only makes sense if the individual steps that we pass to each core is of sufficient size that the overhead incurred is substantially less than the time to run the job.
We should think of executing code in parallel as having three major steps: 1. Tell R that there are multiple computing cores available and to set up a useable cluster to which we can pass jobs to. 2. Decide what ‘computational chunk’ should be sent to each core and distribute all necessary data, libraries, etc to each core. 3. Combine the results of each core back into a unified object.
15.4 Parallelizing for loops
There are a number of packages that allow you to tell R how many cores you have access to. One of the easiest ways to parallelize a for loop is using a package called foreach
. The registration of multiple cores is actually pretty easy.
doMC::registerDoMC(cores = 2) # my laptop only has two cores.
We will consider an example that is common in modern statistics. We will examine parallel computing utilizing a bootstrap example where we create bootstrap samples for calculating confidence intervals for regression coefficients.
ggplot(trees, aes(x=Girth, y=Volume)) + geom_point() + geom_smooth(method='lm')
model <- lm( Volume ~ Girth, data=trees)
This is how we would do this previously.
# f is a formula
# df is the input data frame
# M is the number of bootstrap iterations
boot.for <- function( f, df, M=999){
output <- list()
for( i in 1:100 ){
# Do stuff
model.star <- lm( f, data=df %>% sample_frac(1, replace=TRUE) )
output[[i]] <- model.star$coefficients
}
# use rbind to put the list of results together into a data.frame
output <- sapply(output, rbind) %>% t() %>% data.frame()
return(output)
}
We will first ask about how to do the same thing using the function foreach
# f is a formula
# df is the input data frame
# M is the number of bootstrap iterations
boot.foreach <- function(f, df, M=999){
output <- foreach( i=1:100 ) %dopar% {
# Do stuff
model.star <- lm( f, data=df %>% sample_frac(1, replace=TRUE) )
model.star$coefficients
}
# use rbind to put the list of results together into a data.frame
output <- sapply(output, rbind) %>% t() %>% data.frame()
return(output)
}
Not much has changed in our code. Lets see which is faster.
microbenchmark(
boot.for( Volume~Girth, trees ),
boot.foreach( Volume~Girth, trees )
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval
## boot.for(Volume ~ Girth, trees) 233 253 319 337 360 596 100
## boot.foreach(Volume ~ Girth, trees) 358 413 450 430 453 1230 100
## cld
## a
## b
In this case, the overhead associated with splitting the job across two cores, copying the data over, and then combining the results back together was more than we saved by using both cores. If the nugget of computation within each pass of the for
loop was larger, then it would pay to use both cores.
# massiveTrees has 31000 observations
massiveTrees <- NULL
for( i in 1:1000 ){
massiveTrees <- rbind(massiveTrees, trees)
}
microbenchmark(
boot.for( Volume~Girth, massiveTrees ) ,
boot.foreach( Volume~Girth, massiveTrees )
) %>% print(digits=3)
## Unit: seconds
## expr min lq mean median uq
## boot.for(Volume ~ Girth, massiveTrees) 3.04 3.61 3.82 3.72 3.85
## boot.foreach(Volume ~ Girth, massiveTrees) 1.65 1.98 2.32 2.37 2.54
## max neval cld
## 9.55 100 b
## 3.50 100 a
Because we often generate a bunch of results that we want to see as a data.frame, the foreach
function includes and option to do it for us.
output <- foreach( i=1:100, .combine=data.frame ) %dopar% {
# Do stuff
model.star <- lm( Volume ~ Girth, data= trees %>% sample_frac(1, replace=TRUE) )
model$coefficients
}
It is important to recognize that the data.frame trees
was utilized inside the foreach
loop. So when we called the foreach
loop and distributed the workload across the cores, it was smart enough to distribute the data to each core. However, if there were functions that we utilized inside the foor loop that came from a packege, we need to tell each core to load the function.
output <- foreach( i=1:1000, .combine=data.frame, .packages='dplyr' ) %dopar% {
# Do stuff
model.star <- lm( Volume ~ Girth, data= trees %>% sample_frac(1, replace=TRUE) )
model.star$coefficients
}
15.5 Parallel Aware Functions
There are many packages that address problems that are “embarassingly easily parallelized” and they will happily work with multiple cores. Methods that rely on resampling certainly fit into this category.
15.5.1 boot::boot
Bootstrapping relys on resampling the dataset and calculating test statistics from each resample. In R, the most common way to do this is using the package boot
and we just need to tell the boot
function, to use the multiple cores available. (Note, we have to have registered the cores first!)
model <- lm( Volume ~ Girth, data=trees)
my.fun <- function(df, index){
model.star <- lm( Volume ~ Girth, data= trees[index,] )
model.star$coefficients
}
microbenchmark(
serial = boot::boot( trees, my.fun, R=1000 ),
parallel = boot::boot( trees, my.fun, R=1000,
parallel='multicore', ncpus=2 )
) %>% print(digits=3)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## serial 679 737 848 839 907 1658 100 b
## parallel 685 721 787 751 801 1258 100 a
In this case, we had a bit of a spead up, but not a factor of 2. This is due to the overhead of splitting the job across both cores.
15.5.2 caret::train
The statistical learning package caret
also handles all the work to do cross validation in a parallel computing environment. The functions in caret
have an option allowParallel
which by default is TRUE, which controls if we should use all the cores. Assuming we have already registered the number of cores, then by default caret
will use them all.
library(faraway)
library(caret)
ctrl.serial <- trainControl( method='repeatedcv', number=5, repeats=4,
preProcOptions = c('center','scale'),
allowParallel = FALSE)
ctrl.parallel <- trainControl( method='repeatedcv', number=5, repeats=4,
preProcOptions = c('center','scale'),
allowParallel = TRUE)
grid <- data.frame(
alpha = 1, # 1 => Lasso Regression
lambda = exp(seq(-6, 1, length=50)))
microbenchmark(
model <- train( lpsa ~ ., data=prostate, method='glmnet',
trControl=ctrl.serial, tuneGrid=grid,
lambda = grid$lambda ),
model <- train( lpsa ~ ., data=prostate, method='glmnet',
trControl=ctrl.parallel, tuneGrid=grid,
lambda = grid$lambda )
) %>% print(digits=3)
## Unit: seconds
## expr
## model <- train(lpsa ~ ., data = prostate, method = "glmnet", trControl = ctrl.serial, tuneGrid = grid, lambda = grid$lambda)
## model <- train(lpsa ~ ., data = prostate, method = "glmnet", trControl = ctrl.parallel, tuneGrid = grid, lambda = grid$lambda)
## min lq mean median uq max neval cld
## 1.17 1.20 1.23 1.21 1.23 1.49 100 a
## 1.33 1.35 1.39 1.36 1.40 2.03 100 b
Again, we saw only moderate gains by using both cores, however it didn’t really cost us anything. Because the caret
package by default allows parallel processing, it doesn’t hurt to just load the doMC
package and register the number of cores. Even in just the two core case, it is a good habit to get into so that when you port your code to a huge computer with many cores, the only thing to change is how many cores you have access to.