Thursday, 15 September 2011

Simple benefits of Amazon Web Services and low tech map reduce

I had to run a classically parallel calculation over several hundred megabytes of data, with a long-running Newtonian search on subsets of the source data. My local machine has four 2.66 GHz early Xeon cores, but version three of my optimised query was still running several days in.

Using AWS, RightScale and RStudio Server, and optimising with data.table, I was able to get the equivalent of 8 times the performance of my local desktop for about five hours for USD 14, and get my report out.

Conclusion:
Although not entirely painless, it wasn't hard to get multiple RStudio instances up and running, and to get the data uploaded, calculated and returned.
Given I only run large queries occasionally, it's a lot more cost effective to use AWS than to buy the performance locally. And this is true even for home use - I would compromise on performance knowing I could get it from AWS on the rare occasions I needed it for large tasks.
----------------
Tools used: AWS, RightScale, R, data.table, RStudio Server

I had plans to use one of the MapReduce frameworks, but as the deadline got closer I resorted to creating some c1.xlarge instances, each of which performed about twice as fast as my machine on a test data set.

RightScale enabled me to start up Ubuntu instances trivially and connect via PuTTY or MindTerm.
MindTerm has a secure file copy tool, which made it trivial to upload the datasets to each machine and download the results.
I worked off an Ubuntu 10.04 amd64 instance on the East coast,
specifically the RightScale template with RStudio, which takes care of some of the settings.

I used the instructions on the R project website to install a recent version of R.
First I added an R mirror to the apt sources by editing the file

sudo vi /etc/apt/sources.list

and adding the line

deb http://software.rc.fas.harvard.edu/mirrors/R/bin/linux/ubuntu lucid/


sudo apt-get update
sudo apt-get install r-base

One of the instances I created had problems - I don't know whether the data set was corrupted or I messed up the installation or what, but it didn't finish. I reran the dataset on two other instances and they ran it fine. Also the performance across instances seemed slightly different - so it's well worth checking an instance once created.
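
A quick way to check a fresh instance is to time the same fixed workload on each one and compare. The following is only a sketch of that idea, not what I actually ran: the function names benchmark_instance and flag_slow, the workload size and the 1.5x tolerance are all assumptions made for illustration.

```r
# Hypothetical smoke test: time a fixed CPU-bound workload on a fresh instance.
benchmark_instance <- function(reps = 200, size = 10) {
  set.seed(1)  # identical workload on every machine
  elapsed <- system.time(
    for (i in seq_len(reps)) solve(matrix(rnorm(size * size), size, size))
  )[["elapsed"]]
  elapsed
}

# Flag instances that are much slower than the fastest one.
# times: named numeric vector of elapsed seconds, one entry per instance.
# tolerance: assumed cut-off, here 1.5 times the fastest time.
flag_slow <- function(times, tolerance = 1.5) {
  names(times)[times > tolerance * min(times)]
}
```

Run benchmark_instance() on each machine, collect the timings by hand into a named vector, and pass it to flag_slow() to see which instances are worth replacing before loading real data onto them.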


I then installed RStudio Server and made it run on port 80 to get past my firm's firewalls
(again, there is good help on their site).

wget https://s3.amazonaws.com/rstudio-server/rstudio-server-0.94.92-amd64.deb
sudo dpkg -i rstudio-server-0.94.92-amd64.deb


To change the port, edit the server configuration file

sudo vi /etc/rstudio/rserver.conf

add the line

www-port=80

and restart the server:

sudo rstudio-server restart

RightScale then showed the IP address. I had entered the username and password when initializing the template, so at this point I had a single machine running.
Repeating this another three times gave me around 8 times the performance of my desktop.


I used doSNOW, which was simple and worked for me to use all the cores on both Ubuntu and Windows. I had had my script running for several days without finishing, and discovered that my initial estimate of the run time, extrapolated from a small sample set, was hopelessly wrong.
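
One way to guard against a bad estimate is to time several samples of increasing size and check how the run time scales before extrapolating. This is a sketch under the assumption that run time follows a rough power law in the input size; the function names scaling_exponent and predict_seconds are made up for illustration, and I did not do this at the time.

```r
# Estimate how run time grows with input size from a few timed samples.
# sizes: number of rows (or subsets) in each sample; seconds: measured elapsed time.
scaling_exponent <- function(sizes, seconds) {
  fit <- lm(log(seconds) ~ log(sizes))  # straight line on a log-log scale
  unname(coef(fit)[2])                  # slope: about 1 is linear, about 2 is quadratic
}

# Extrapolate to the full data set from the largest timed sample.
predict_seconds <- function(sizes, seconds, full_size) {
  k <- scaling_exponent(sizes, seconds)
  i <- which.max(sizes)
  seconds[i] * (full_size / sizes[i])^k
}
```

If the exponent comes out well above 1, a linear extrapolation from a small sample will badly underestimate the full run.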



I then reassembled the data (the reduce part of MapReduce) from the result files on my local machine with a Perl script.
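
The same reduce step could also be done in R rather than Perl. The sketch below is not my Perl script: it assumes each instance wrote its results as a CSV file with identical columns into one local directory, and the function name combine_results is hypothetical.

```r
# Hypothetical reduce step: stack the per-instance result files into one data frame.
# dir: local directory holding the downloaded result files (assumed to be CSV).
combine_results <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  # Read every file and bind the rows together; columns must match across files.
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}
```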


The foreach call in the sample below is a rough equivalent of a parallel map. Remember to export any required libraries to the slave instances of R, e.g. with the .packages argument.




Sample R code
---

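library(doSNOW)      # also attaches foreach and snow
library(data.table)

cl <- makeCluster(8) # number of cores
registerDoSNOW(cl)

# Small CPU benchmark: invert 1000 random 10x10 matrices per call
check <- function(n) { for (i in 1:1000) { sme <- matrix(rnorm(100), 10, 10); solve(sme) } }
times <- 100
system.time(for (j in 1:times) x <- check(j))

# Parallel map: rawData (a list of subsets) and LongRunningFunction are defined elsewhere
system.time(calc.results <- foreach(j = seq_along(rawData), .packages = c("data.table"), .inorder = FALSE, .errorhandling = "remove") %dopar% LongRunningFunction(rawData[[j]]))

stopCluster(cl)      # shut the workers down when finished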
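
For anyone who wants to try the pattern without my data, here is a self-contained sketch of the same parallel map. It swaps in the parallel package that ships with R in place of doSNOW, and slow_square and raw_data are hypothetical stand-ins for LongRunningFunction and rawData. It also shows the explicit way of getting packages and helper functions onto the workers.

```r
library(parallel)  # ships with R; used here instead of doSNOW

# Hypothetical stand-ins for LongRunningFunction and rawData
slow_square <- function(x) { Sys.sleep(0.01); x^2 }
raw_data <- as.list(1:8)

cl <- makeCluster(2)                 # two local worker processes
clusterEvalQ(cl, library(stats))     # load any required packages on each worker
clusterExport(cl, "slow_square")     # copy helper functions to each worker
results <- parLapply(cl, raw_data, function(x) slow_square(x))
stopCluster(cl)                      # always shut the workers down
```

With foreach the .packages argument does the package loading for you, which is why the sample above needs no clusterEvalQ call.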