Avram Aelony
2008-Aug-21 18:32 UTC
[R] Large data sets with R (binding to hadoop available?)
Dear R community,

I find R fantastic and use R whenever I can for my data analytic needs. Certain data sets, however, are so large that other tools seem to be needed to pre-process the data so that it can be brought into R for further analysis.

Questions I have for the many expert contributors on this list are:

1. How do others handle situations of large data sets (gigabytes, terabytes) for analysis in R?

2. Are there existing ways, or plans to devise ways, to use the R language to interact with Hadoop or Pig? The Hadoop project by Apache has been successful at processing data on a large scale using the map-reduce algorithm. A sister project uses an emerging language called "Pig Latin" or simply "Pig" for using the Hadoop framework in a manner reminiscent of the look and feel of R. Is there an opportunity here to create a conceptual bridge, since these projects are also open source? Does it already exist?

Thanks in advance for your comments.

-Avram

---------------------------
Information about Hadoop:
http://wiki.apache.org/hadoop/
http://en.wikipedia.org/wiki/Hadoop

"Apache Hadoop is a free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers."

---------------------------
Information about Pig:
http://incubator.apache.org/pig/

"Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

* Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
* Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
* Extensibility. Users can create their own functions to do special-purpose processing."

---------------------------
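P.S. To make the map-reduce idea concrete in R terms, here is a toy, single-machine word-count sketch. It is purely illustrative (plain R, not Hadoop or Pig code), and the sample documents are made up:

  docs <- c("r is great", "hadoop is great", "pig latin")

  # "Map" step: emit per-document word counts.
  mapped <- lapply(strsplit(docs, " "), table)

  # "Reduce" step: merge the per-document counts into one overall count.
  reduced <- Reduce(function(a, b) {
    words <- union(names(a), names(b))
    sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
  }, mapped)

  reduced

Hadoop does essentially this, but with the map and reduce steps distributed across a cluster.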
Gabor Grothendieck
2008-Aug-21 18:40 UTC
[R] Large data sets with R (binding to hadoop available?)
The RSQLite package can read files into an SQLite database without the data going through R, and the sqldf package provides a front end that makes this particularly easy to use: basically you need only a couple of lines of code. Other databases have similar facilities. See: http://sqldf.googlecode.com
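For concreteness, a minimal sketch of the sqldf route (the file name and column are hypothetical, and the read.csv.sql helper may only be available in more recent sqldf versions):

  library(sqldf)

  # Hypothetical file and column names.  The file is loaded into a
  # temporary SQLite database outside of R, the query runs there, and
  # only the result comes back as a data frame.
  big_subset <- read.csv.sql("large.csv",
                             sql = "select * from file where amount > 1000",
                             header = TRUE, sep = ",")

Only the rows satisfying the WHERE clause ever reach R's memory.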
Roland Rau
2008-Aug-21 19:03 UTC
[R] Large data sets with R (binding to hadoop available?)
Hi,

Avram Aelony wrote:
> 1. How do others handle situations of large data sets (gigabytes,
> terabytes) for analysis in R ?
>
> 2. Are there existing ways or plans to devise ways to use the R language to
> interact with Hadoop or PIG ?

I usually try to store the data in an SQLite database and interface with it via functions from the RSQLite (and DBI) packages. No idea about question no. 2, though.

Hope this helps,
Roland

P.S. When I am sure that I only need a certain subset of a large data set, I still prefer to do some pre-processing in awk (gawk).

2. P.S. The sizes of my data sets are in the gigabyte range (not the terabyte range). This might be important if your data sets are *really large* and you want to use SQLite: http://www.sqlite.org/whentouse.html
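For what it is worth, a minimal sketch of that workflow (the database, table, and column names are made up, and it assumes the data were already imported into SQLite, e.g. with the sqlite3 command-line tool):

  library(DBI)
  library(RSQLite)

  # Hypothetical database and table; only the rows and columns selected
  # by the query are pulled into R as a data frame.
  con <- dbConnect(SQLite(), dbname = "mydata.sqlite")
  subset_df <- dbGetQuery(con,
      "SELECT year, region, value FROM measurements WHERE year >= 2000")
  dbDisconnect(con)

The heavy filtering and aggregation happen inside SQLite, so R only ever sees the subset needed for the analysis.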