RHIPE - R and Hadoop Integrated Processing Environment

RHIPE (phonetic spelling: hree-pay' 1) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google's MapReduce. Using RHIPE, it is possible to code map-reduce algorithms in R, e.g.

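# Map: split each input line on runs of spaces and emit (word, TRUE) per word
m <- expression({
  for (x in map.values) {
    y <- strsplit(x, " +")[[1]]
    for (w in y) rhcollect(w, TRUE)
  }
})
# Reduce: count the values collected for each key (word)
r <- expression(
  pre = {
    count <- 0
  },
  reduce = {
    count <- count + sum(unlist(reduce.values))
  },
  post = {
    rhcollect(reduce.key, as.integer(count))
  }
)
# Text input, sequence file output, no combiner
z <- rhmr(map=m, reduce=r, comb=FALSE, inout=c("text","sequence"), ifolder="/tmp/50mil", ofolder="/tmp/tof")
rhex(z)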

Or simply load Rhipe and type

rhwordcount(infolder,outfolder)

where infolder is the input file (or folder of files) of words and outfolder is the destination directory.
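
To make the logic of the map and reduce expressions above concrete, here is a minimal sketch that reproduces the same word count locally in base R, without Hadoop or RHIPE. The input lines and the variable names (lines, emitted, counts) are made up for illustration; this is not how RHIPE runs the job, only what the job computes.

```r
# Local, Hadoop-free sketch of the word count logic (base R only).
# The three input lines below are hypothetical sample data.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Map step: split each line on runs of spaces and emit one
# (word, TRUE) pair per word, as the map expression does with rhcollect.
words <- unlist(strsplit(lines, " +"))
emitted <- data.frame(key = words, value = TRUE, stringsAsFactors = FALSE)

# Shuffle and reduce step: group the emitted values by key and sum them,
# mirroring count <- count + sum(unlist(reduce.values)).
counts <- vapply(
  split(emitted$value, emitted$key),
  function(v) as.integer(sum(v)),
  integer(1)
)

print(counts)
```

With this sample input, "the" is counted three times, "fox" twice, and every other word once.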


1 More Information

For more information about what RHIPE is and is not, read the FAQ. Please note that RHIPE does not work on Mac OS X Snow Leopard.

2 rprotobuf

An R package to serialize R objects using Google's Protocol Buffers. See this page: http://ml.stat.purdue.edu/rhipe/doc/html/ProtoBuffers.html

3 Download

Source

The source code is hosted on GitHub: http://github.com/saptarshiguha/RHIPE/

To check out the current version, install git and run

git clone git://github.com/saptarshiguha/RHIPE.git

To download a specific version X, e.g. 0.45

git clone git://github.com/saptarshiguha/RHIPE.git
cd RHIPE
git checkout 0.45

The current version is always on the master branch.

Versions

Read the documentation for installation. Current is the latest version.

Version   Download
0.51      Rhipe_0.51.tar.gz
0.5       Rhipe_0.5.tar.gz
0.44      rhipe.0.44.tgz

4 EC2

Download the scripts from here and untar them. This creates a folder called ec2 with a bin folder inside it. Modify ec2/bin/hadoop-ec2-env.sh.template.

**NOTE** This works for 32-bit instance types only (m1.small, c1.medium). Hopefully, 64-bit will work soon.

5 Documentation

The documentation can be found here. A PDF version can be found here.

6 News

6.1 Mon Oct 12 11:18:31 EDT 2009

  • Removed the dependency on rJava. Getting it to work with Hadoop classpaths caused too much grief. The actual RHIPE program remains unchanged, but the client handler (R package) is a bit slower(?)

6.2 Sun Sep 27 22:01:33 EDT 2009

  • Names are only read for VECSXP (list objects), because of a strange bug.

6.3 Tue Sep 8 15:35:24 EDT 2009

  • Moved to Hadoop 0.20
  • Uses protobuf for serialization, fewer R types allowed
  • Does not depend on Rserve, single R package to install

6.4 Fri Aug 7 2009, Version 0.45

  • Web site revamped. Beginning with the current version, the entire manual is in PDF or can be accessed at the documentation link.
  • Source code is available on Git; go to the download page for instructions.
  • Stopped seeding via the secure random generator, so the user will have to seed it to avoid correlated streams. On RHEL Linux, when running rhlapply on 145K+ tasks, /dev/random would block.
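
One way to give each task its own uncorrelated random stream is sketched below in plain R. This is an assumption for illustration, not RHIPE's own mechanism: it uses the L'Ecuyer-CMRG generator and nextRNGStream from the parallel package that ships with recent versions of R, and the names n_tasks, seeds and run_task are hypothetical. How the per-task seed is passed to rhlapply is not shown here.

```r
# Sketch: one independent random stream per task (assumes a recent R
# with the bundled 'parallel' package; not part of RHIPE itself).
RNGkind("L'Ecuyer-CMRG")
set.seed(2009)

n_tasks <- 4  # hypothetical number of tasks
seeds <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  seeds[[i]] <- s
  s <- parallel::nextRNGStream(s)  # advance to the next stream
}

# Hypothetical task body: install the task's seed, then draw numbers.
run_task <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(3)
}

draws <- lapply(seeds, run_task)
```

Each element of seeds would be shipped to one task, so that rerunning a task with the same seed reproduces its draws while different tasks use different streams.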

7 Contact

sguha -AT- purdue -DOT- edu

Footnotes:

1 This is Greek for a moment in time. See here for pronunciation: Greek Lexicon

Author: Saptarshi Guha <sguha@purdue.edu>

Date: 2009-10-22 00:48:52 EDT
