RHIPE - R and Hadoop Integrated Processing Environment
RHIPE (phonetic spelling: hree-pay' 1) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google's MapReduce. Using RHIPE, it is possible to code MapReduce algorithms in R, e.g.
# Map: split each input line into words and emit (word, TRUE)
m <- expression({
  for (x in map.values) {
    y <- strsplit(x, " +")[[1]]
    for (w in y) rhcollect(w, TRUE)
  }
})
# Reduce: count the values collected for each word
r <- expression(
  pre = {
    count <- 0
  },
  reduce = {
    count <- count + sum(unlist(reduce.values))
  },
  post = {
    rhcollect(reduce.key, as.integer(count))
  }
)
z <- rhmr(map=m, reduce=r, comb=FALSE, inout=c("text","sequence"), ifolder="/tmp/50mil", ofolder="/tmp/tof")
rhex(z)
Or simply load Rhipe and type
rhwordcount(infolder,outfolder)
where infolder is the input file (or folder of files) of words and outfolder is
the destination directory.
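To see what the map, pre, reduce and post expressions of the word count job compute, the following is a minimal sketch that imitates the same flow locally in plain R. It does not use RHIPE or Hadoop; `simulate_wordcount` is a hypothetical helper introduced only for illustration, and the grouping step is only an approximation of what Hadoop does between map and reduce.

```r
# Local, Hadoop-free imitation of the word count job (illustration only).
# simulate_wordcount() is a hypothetical helper, not part of RHIPE.
simulate_wordcount <- function(lines) {
  # "Map" phase: record one (word, TRUE) pair per word,
  # standing in for rhcollect(w, TRUE).
  keys <- character(0)
  values <- logical(0)
  for (x in lines) {
    y <- strsplit(x, " +")[[1]]
    # Assumption: empty strings (from leading spaces) are skipped here.
    for (w in y[nzchar(y)]) {
      keys <- c(keys, w)
      values <- c(values, TRUE)
    }
  }
  # "Shuffle": group the emitted values by key.
  grouped <- split(values, keys)
  # "Reduce" phase: pre sets count to 0, reduce adds the values,
  # post returns the integer count for the key.
  vapply(grouped, function(reduce.values) {
    count <- 0
    count <- count + sum(unlist(reduce.values))
    as.integer(count)
  }, integer(1))
}

counts <- simulate_wordcount(c("the quick brown fox", "the lazy dog  the end"))
print(counts)
```

The result is a named integer vector with one entry per distinct word, which corresponds to the (key, count) pairs the real job would write to the output folder.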
1 More Information
For more information about what RHIPE is and is not, read the FAQ. Please note that RHIPE does not work on Mac OS X Snow Leopard.
2 rprotobuf
An R package to serialize R objects using Google's Protocol Buffers. See this page: http://ml.stat.purdue.edu/rhipe/doc/html/ProtoBuffers.html
3 Download
Source
The source code is hosted on GitHub: http://github.com/saptarshiguha/RHIPE/
To check out the current version, install git and run
git clone git://github.com/saptarshiguha/RHIPE.git
To download version X, e.g. 0.45, run
git clone git://github.com/saptarshiguha/RHIPE.git
cd RHIPE
git checkout 0.45
The current version is always on the master branch.
Versions
Read the documentation for installation instructions. Current is the latest version.
| Version | Download |
|---|---|
| 0.51 | Rhipe_0.51.tar.gz |
| 0.5 | Rhipe_0.5.tar.gz |
| 0.44 | rhipe.0.44.tgz |
4 EC2
Download the scripts from here and untar them. This creates a folder called ec2 that contains a bin folder. Modify ec2/bin/hadoop-ec2-env.sh.template
**NOTE** For 32-bit instance types only (m1.small, c1.medium). Hopefully, 64-bit instances will work soon.
6 News
6.1 Mon Oct 12 11:18:31 EDT 2009
- Removed the dependency on rJava. Getting it to work with Hadoop classpaths caused too much grief. The actual RHIPE program remains unchanged, but the client handler (R package) is a bit slower(?)
6.2 Sun Sep 27 22:01:33 EDT 2009
- Names are only read for VECSXP (list objects), because of a strange bug.
6.3 Tue Sep 8 15:35:24 EDT 2009
- Moved to Hadoop 0.20
- Uses protobuf for serialization, fewer R types allowed
- Does not depend on Rserve, single R package to install
6.4 Fri Aug 7 2009, Version 0.45
- Web site revamped. Beginning with the current version, the entire manual is in PDF or can be accessed at the documentation link.
- Source code is available on Git; go to the download page for instructions.
- Stopped seeding via secure random generator, so the user will have to seed it to avoid correlated streams. On RHEL Linux, when running rhlapply on 145K+ tasks, /dev/random would block.
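Since seeding is left to the user, one possible way to avoid correlated streams is to give every task its own L'Ecuyer-CMRG stream using R's parallel package. This is a sketch under assumptions, not RHIPE's own mechanism: `make_task_seeds` and `draw_for_task` are hypothetical helpers, and how a seed is delivered to each task is left open.

```r
library(parallel)  # ships with R (2.14 and later); provides nextRNGStream()

# make_task_seeds() is a hypothetical helper, not a RHIPE function.
# It returns one independent L'Ecuyer-CMRG seed per task.
# Note: it switches the session's RNG kind as a side effect.
make_task_seeds <- function(n_tasks, seed = 1234) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  seeds <- vector("list", n_tasks)
  s <- .Random.seed
  for (i in seq_len(n_tasks)) {
    seeds[[i]] <- s
    s <- nextRNGStream(s)  # jump ahead to the next independent stream
  }
  seeds
}

# draw_for_task() is also hypothetical: it installs the task's seed
# and then draws n uniform random numbers from that stream.
draw_for_task <- function(seed, n = 2) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(n)
}

seeds <- make_task_seeds(3)
draws <- lapply(seeds, draw_for_task)
print(draws)
```

Each task draws from a separate stream, and re-installing the same seed reproduces the same draws, which also makes individual tasks repeatable.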
7 Contact
sguha -AT- purdue -DOT- edu
Footnotes:
1 This is Greek for "a moment in time". See here for pronunciation: Greek Lexicon
Date: 2009-10-22 00:48:52 EDT