Saturday, April 1, 2017

AWS, R, RStudio, Parallel processing

In this post I will share my experiences with using an AWS spot instance, started from an RStudio AMI, for heavy parallel processing of R scripts.

First we start an RStudio AMI:



Start a spot instance:




Request:

Search AMI:


Search for RStudio:


Select the needed compute power:



It's recommended to look at the pricing history:



In this case, we select a General Purpose 16-CPU machine, which has a fair price per hour. Since we only use the machine for some heavy processing, we will decommission it in a few hours.



Leave the other settings as is.

Select next, and be sure to create a new key pair in case you do not have the original key pair.
It's also good to add a new security group which has port 80 open:



Open the ports:


Then click 'Launch instance', and then click on the capacity link:

Look up the public IP:

You should now be able to log on:

Log on credentials can be found here:
http://www.louisaslett.com/RStudio_AMI/

Now that the server is up and running we can start using R packages to allow parallel processing:
  
install.packages("doParallel",dependencies=TRUE)
install.packages("doMC",dependencies=TRUE)

We load the packages:
  
library(doParallel)
library(foreach)
library(doMC)

To initialize, we:
- set the number of cores,
- initialize a cluster,
- export our functions, variables and datasets to the cluster workers:
  
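# Prepare parallel processing
# Calculate the number of cores, leaving one free for the system
no_cores <- detectCores() - 1
# Initiate a cluster (used by parLapply and clusterExport)
cl <- makeCluster(no_cores)
# Register the doMC backend (used by foreach with %dopar%)
registerDoMC(no_cores)
# Export our own functions to the cluster workers
clusterExport(cl=cl, varlist=c("splitter", "create_ngram_table"))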

There are a number of options for parallel processing. Roughly speaking, you can choose between parLapply and foreach. When you would like to split your dataset into multiple subsets and have each worker perform an operation on one subset, this can easily be done with foreach, as in the example at the end of this post. In both cases a list is returned, while we usually would like a single full dataset in return. To achieve this, I used do.call("rbind", result).
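The parLapply route can be sketched as follows. This is a minimal illustration, not the code used for this post: the helper square_plus and the variable offset are hypothetical stand-ins for your own functions and data, and the cluster is fixed at 2 workers so that the sketch runs on a small machine.

```r
library(parallel)

# Hypothetical helper and variable, standing in for your own code
square_plus <- function(x, offset) x^2 + offset
offset <- 10

# Small fixed number of workers (assumption; use detectCores() - 1 on a big machine)
cl <- makeCluster(2)

# Copy the helper and the variable it needs to every worker
clusterExport(cl = cl, varlist = c("square_plus", "offset"), envir = environment())

# Each worker handles part of the input; parLapply returns a list
res <- parLapply(cl, 1:4, function(i) square_plus(i, offset))

# Release the workers when done
stopCluster(cl)

# Flatten the list into a plain vector
res_vec <- unlist(res)
```

Without the clusterExport call the workers would not know square_plus or offset, since each worker of a cluster created with makeCluster starts as a fresh R session.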

Many more examples can be found at: http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/
  
result = foreach(j=seq(1,nl, by=(nl/parts))) %dopar% {
  # process the subset of the dataset that starts at row j here
}
do.call("rbind", result)
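A complete, self-contained version of this pattern could look like the sketch below. The data frame df, the column names and the per-chunk summary are hypothetical and only serve to show the mechanics; the sketch uses the doParallel backend with 2 workers and assumes that parts divides the number of rows evenly.

```r
library(foreach)
library(doParallel)

# Hypothetical dataset, standing in for your own data
df <- data.frame(id = 1:12, value = (1:12) * 2)
nl <- nrow(df)       # number of rows
parts <- 4           # number of chunks (assumed to divide nl evenly)
chunk <- nl / parts  # rows per chunk

# Start 2 workers and register them as the foreach backend
cl <- makeCluster(2)
registerDoParallel(cl)

# j is the first row of each chunk; every iteration returns a one-row data frame
result <- foreach(j = seq(1, nl, by = chunk)) %dopar% {
  rows <- df[j:(j + chunk - 1), ]
  data.frame(start = j, total = sum(rows$value))
}

stopCluster(cl)

# foreach returns a list; bind the pieces into one data frame
combined <- do.call("rbind", result)
```

As an alternative to the do.call step, foreach also accepts a .combine argument (for example .combine = rbind) that merges the pieces as they come back.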