Sean Zhang
2009-Mar-15 00:46 UTC
[R] What is the best package for large data cleaning (not statistical analysis)?
Dear R helpers:

I am a newbie to R and have a question about cleaning large data frames in R.

So far, I have been using SAS for data cleaning because my data sets are relatively large (multiple files, each of which can be as large as 5-10 GB). I am not a fan of SAS at all and am eager to move my data cleaning tasks into R completely.

It seems to me there are three options: SQL, ff, or filehash. I do not want to learn SQL, so my question is really about ff and filehash.

Specifically:

(1) For merging two large data frames, which is better, ff or filehash?
(2) For reshaping a large data frame (say, from long to wide or the opposite), which is better, ff or filehash?

If you can provide examples, that would be even better.

Many thanks in advance.

-Sean
jim holtman
2009-Mar-15 02:13 UTC
[R] What is the best package for large data cleaning (not statistical analysis)?
Exactly what type of cleaning do you want to do? Can you read in the data a block at a time (e.g., 1M records), clean it up, and then write it back out? You could write the results back as a text file or store them using 'filehash'.

I have used that technique to segment a year's worth of data, probably 3GB of text, into monthly data frames of about 70MB each, which I stored using filehash. I then read these back in for processing where I could summarize by month.

So it all depends on what you want to do. You could read in the chunks, clean them, and then reshape them into data frames that you could process later. You will probably still have the problem that all the data won't fit in memory at once. One thing that helped: since the data frames were stored as binary objects in filehash, it was pretty fast to retrieve them, pick out the data I needed from each month, and build a subset that would fit in memory.

So it all depends...

-- 
Jim Holtman
Cincinnati, OH

What is the problem that you are trying to solve?
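The block-at-a-time approach described in the reply can be sketched roughly as below. This is only an illustration under stated assumptions, not the code Jim actually used: the input file, its `date` and `value` columns, the tiny chunk size, and the "drop missing values" cleaning step are all made up for the example. To keep the sketch runnable with base R alone, each monthly data frame is stored with saveRDS()/readRDS(); with filehash installed, dbInsert() and dbFetch() on a database opened with dbInit() would take the place of those two calls.

```r
# Hypothetical input: a small CSV standing in for a multi-GB text file.
infile <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    date  = as.character(seq(as.Date("2008-01-01"), by = "day", length.out = 90)),
    value = c(NA, seq_len(89))
  ),
  infile, row.names = FALSE
)

# One stored object per month (assumption: .rds files stand in for filehash).
outdir <- tempfile("monthly_")
dir.create(outdir)

chunk_size <- 25L                      # would be around 1e6 rows for real data
con <- file(infile, open = "r")
col_names <- strsplit(gsub('"', "", readLines(con, n = 1)), ",")[[1]]

repeat {
  # Reading from an already-open connection continues where the last read stopped.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = col_names, stringsAsFactors = FALSE),
    error = function(e) NULL           # read.csv errors once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break

  # Placeholder cleaning step: drop rows with a missing value.
  chunk <- chunk[!is.na(chunk$value), , drop = FALSE]

  # Split the cleaned chunk by month and append to that month's stored object.
  months <- substr(chunk$date, 1, 7)
  for (m in unique(months)) {
    piece <- chunk[months == m, , drop = FALSE]
    path  <- file.path(outdir, paste0(m, ".rds"))
    if (file.exists(path)) piece <- rbind(readRDS(path), piece)
    saveRDS(piece, path)
  }
}
close(con)

# Later: pull only what is needed from each month into one in-memory subset.
wanted <- do.call(rbind, lapply(list.files(outdir, full.names = TRUE), function(p) {
  d <- readRDS(p)
  d[d$value > 80, , drop = FALSE]
}))
```

Only one chunk plus one month's data frame is in memory at a time during the first pass, which is the point of the technique; the final subset is assumed to be small enough to fit in memory.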