Using AWS, RightScale and RStudio Server, and optimising with data.table, I got the equivalent of 8 times the performance of my local desktop for about five hours for 14 USD, and got my report out.
Conclusion:
Although not entirely painless, it wasn't hard to get multiple RStudio instances up and running, and to get the data uploaded, processed and returned.
Given that I only run large queries occasionally, it's a lot more cost-effective to use AWS than to buy the performance locally. This is true even for home use: I would compromise on the performance of my own machine, knowing I could get it from AWS on the rare occasions I needed it for large tasks.
----------------
Tools used: AWS, RightScale, R, data.table, RStudio Server
I had plans to use one of the MapReduce frameworks, but as the deadline got closer I resorted to creating some c1.xlarge instances, each of which ran about twice as fast as my machine on a test data set.
RightScale made it trivial to start up Ubuntu instances and to connect to them via PuTTY or MindTerm.
MindTerm has a secure file copy tool, which made it easy to upload the datasets to each machine and download the results.
I worked off an Ubuntu 10.04 amd64 instance on the East coast,
specifically the RightScale instance template with RStudio, which applies some settings for you.
I used the instructions on the R project website to install a recent version of R.
First I added an R mirror to the apt sources by editing the file
sudo vi /etc/apt/sources.list
and adding the line
deb http://software.rc.fas.harvard.edu/mirrors/R/bin/linux/ubuntu lucid/
sudo apt-get update
sudo apt-get install r-base
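When the same setup has to be repeated on several instances, the manual edit can be scripted. The following is a minimal sketch, not the procedure from the R project site: the function name add_r_mirror is made up for illustration, and it assumes the mirror line shown above. It appends the line only if it is not already present, so running it twice does no harm.

```shell
#!/bin/sh
# Hypothetical helper: append the R mirror line to an apt sources file
# unless an identical line is already there.
add_r_mirror() {
    sources_file="$1"
    mirror_line="deb http://software.rc.fas.harvard.edu/mirrors/R/bin/linux/ubuntu lucid/"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$mirror_line" "$sources_file" 2>/dev/null; then
        echo "$mirror_line" >> "$sources_file"
    fi
}

# On the instance (as root) this would be followed by the same two commands:
#   add_r_mirror /etc/apt/sources.list
#   apt-get update && apt-get install -y r-base
```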
One of the instances I created had problems. I don't know whether the data set was corrupted, or I messed up the installation, or something else went wrong, but its run didn't finish. I reran that dataset on two other instances and both ran it fine. Performance also seemed to differ slightly between instances, so it's well worth checking an instance once it has been created.
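One cheap way to rule out a corrupted upload is to compare checksums before starting a long run. This is only a sketch under assumptions: verify_upload is a hypothetical helper, sha256sum is assumed to be available on the instance (it is part of GNU coreutils on Ubuntu), and the expected checksum is assumed to have been computed on the original copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA-256 checksum with an expected value.
# Returns 0 on a match and 1 on a mismatch.
verify_upload() {
    file="$1"
    expected="$2"
    # sha256sum prints "<checksum>  <filename>"; keep only the checksum
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Example (the file name is a placeholder):
#   verify_upload dataset.csv "<checksum of the original copy>"
```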
I then installed RStudio Server and made it listen on port 80 to get past my firm's firewall
(again, there is good help on their site).
wget https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-amd64.deb
sudo dpkg -i rstudio-server-0.94.92-amd64.deb
Then I set the port by editing the config file
sudo vi /etc/rstudio/rserver.conf
and adding the line
www-port=80
sudo rstudio-server restart
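The port change can also be made without opening an editor, which helps when setting up several machines. A minimal sketch, assuming the www-port setting shown above is the only line that needs to change; set_www_port is a hypothetical helper name. It removes any existing www-port line before adding the new one, so the setting never appears twice.

```shell
#!/bin/sh
# Hypothetical helper: set www-port in an rserver.conf style file,
# replacing an existing www-port line if there is one.
set_www_port() {
    conf_file="$1"
    port="$2"
    tmp_file=$(mktemp)
    # Keep every existing line except an old www-port setting
    # (grep exits non-zero when nothing is left, which is fine here)
    grep -v '^www-port=' "$conf_file" > "$tmp_file" 2>/dev/null || true
    echo "www-port=$port" >> "$tmp_file"
    # Overwrite in place so the file keeps its owner and permissions
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# On the instance (as root):
#   set_www_port /etc/rstudio/rserver.conf 80
#   rstudio-server restart
```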
RightScale then showed the instance's IP address. I had already entered the username and password when initialising the template, so at this point I had a single machine running.
Repeating this another three times gave me four machines and around 8 times the performance of my desktop.
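As a back-of-envelope check on the cost, the figures in this post can be turned into a price per instance-hour. The sketch below assumes all four instances ran for the full five hours; the post only says "about five hours", so treat the result as approximate.

```shell
#!/bin/sh
# Figures from this post; the assumption is that all four instances
# ran for the whole five hours.
instances=4
hours=5
total_usd=14

instance_hours=$((instances * hours))
# Shell arithmetic is integer-only, so use awk for the division
usd_per_instance_hour=$(awk -v t="$total_usd" -v h="$instance_hours" 'BEGIN { printf "%.2f", t / h }')

echo "$instance_hours instance-hours at about $usd_per_instance_hour USD each"
```

With each instance about twice as fast as the desktop, those 20 instance-hours correspond to roughly 40 hours of desktop computing.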
I used doSNOW, which was simple and let me use all the cores on both Ubuntu and Windows. I had had my script running for several days without finishing, and discovered that my initial estimate of the run time, based on a small sample set, was hopelessly wrong.
I then reassembled the data (the reduce part of mapreduce) from the result files on my local machine with a Perl script.
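The Perl script itself is not shown here. As a stand-in, and only as a sketch, the simplest form of that reduce step can be written in shell. It assumes each instance produced a CSV result file with the same header row; the name merge_results and the file names in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: concatenate CSV result files, keeping the header
# row from the first file only.
merge_results() {
    # FNR is the line number within the current file, NR the overall line number.
    # Skip line 1 of every file except the very first line read.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example:
#   merge_results results_1.csv results_2.csv > all_results.csv
```

If the result files need more than concatenation (for example sums per key), this step would need real aggregation logic rather than a plain merge.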
The code below is a rough equivalent of a parallel map. Remember to export any required libraries to the slave instances of R, e.g. with the .packages argument of foreach.
Sample R code
---
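library(doSNOW)       # also attaches foreach and snow
library(data.table)
cl <- makeCluster(8)  # number of cores
registerDoSNOW(cl)
# Quick serial benchmark for comparing machines: each call inverts 1000 random 10x10 matrices
check <- function(n) { for (i in 1:1000) { sme <- matrix(rnorm(100), 10, 10); solve(sme) } }
times <- 100
system.time(for (j in 1:times) x <- check(j))
# Parallel map: rawData (a list) and LongRunningFunction stand for your own data and function
system.time(calc.results <- foreach(j = seq_along(rawData), .packages = c("data.table"), .inorder = FALSE, .errorhandling = "remove") %dopar% LongRunningFunction(rawData[[j]]))
stopCluster(cl)       # shut down the workers when finished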