Saptarshi Guha
2009-Jan-24 18:08 UTC
[R] R and Hadoop Integrated Processing Environment - RHIPE
Hello,

We have created an interface between R and Hadoop so that the user can, after a fashion, interact with very large datasets using the Map Reduce programming model. We also use IBM's TSpaces to implement shared memory that can be accessed from R (somewhat like networkspaces). RHIPE uses Rserve to execute R code.

Some of the functions implemented are:

mrlapply - run lapply across a Hadoop cluster
mrsubsetf - subset a file according to an R function
mtapplyf - run a tapply on a file
mrmapreduce - run a map reduce algorithm on a file or group of files; the user provides a mapper and a reducer

There are also some shared memory operations such as mrread, mrtake, and mrput.

Currently, it is at a proof-of-concept stage and much work is required before it is production ready. However, for the adventurous, it is possible to use it to process large data.

For more information and examples, please visit this page: http://www.stat.purdue.edu/~sguha/rhipe

If anyone would like to contribute to this project, please email me directly; any help is welcome.

Regards
Saptarshi Guha