thr3ads.net - R help - [R] "Best" way to merge 300+ .5MB dataframes? [Aug 2014]

Grant Rettke

2014-Aug-10 18:51 UTC

[R] "Best" way to merge 300+ .5MB dataframes?

Good afternoon,

Today I was working on a practice problem. It was simple, and perhaps
even realistic. It looked like this:

• Get a list of all the data files in a directory
• Load each file into a dataframe
• Merge them into a single data frame

Because all of the columns were the same, the simplest solution in my
mind was to `Reduce' the vector of dataframes with a call to
`merge'. That worked fine, I got what was expected. That is key
actually. It is literally a one-liner, and there will never be index
or scoping errors with it.
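
For readers following along, the one-liner described above might look like the sketch below. The original code was not posted, so the three small data frames and the `all = TRUE` argument are assumptions made for illustration only.

```r
# Hypothetical stand-ins for the data frames loaded from the files;
# every data frame has the same columns, as in the problem description.
dfs <- list(
  data.frame(a = 1, b = 2),
  data.frame(a = 3, b = 4),
  data.frame(a = 5, b = 6)
)

# Fold the list pairwise with merge(). all = TRUE (an assumption) keeps
# rows that have no match in the other data frame.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)
print(merged)
```

Note that each step of the fold joins on all shared columns, so the result depends on the arguments given to `merge`.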

Now with that in mind, what is the idiomatic way? Do people usually do
something else because it is /faster/ (by some definition)?

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
gcr at wisdomandwonder.com | http://www.wisdomandwonder.com/
“Wisdom begins in wonder.” --Socrates
((λ (x) (x x)) (λ (x) (x x)))
“Life has become immeasurably better since I have been forced to stop
taking it seriously.” --Thompson

Jeff Newmiller

2014-Aug-10 21:22 UTC

Just load the data frames into a list and give that list to rbind. It is way
more efficient to be able to identify how big the final data frame is going to
have to be at the beginning and preallocate the result memory than to
incrementally allocate larger and larger data frames along the way using Reduce.
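
A minimal sketch of the difference, assuming the data frames are already held in a list: the 300 small data frames below are made up for illustration, and `do.call` is one common way to hand a whole list to `rbind` in a single call.

```r
# Hypothetical input: 300 small data frames with identical columns.
dfs <- lapply(1:300, function(i) {
  data.frame(id = i, x = seq_len(5), y = i * seq_len(5))
})

# One call: rbind sees every data frame at once.
stacked <- do.call(rbind, dfs)

# Incremental: the result is rebuilt 299 times, growing at each step.
stacked_incremental <- Reduce(rbind, dfs)

# Both approaches stack the same number of rows.
nrow(stacked) == nrow(stacked_incremental)
```

Wrapping each approach in `system.time()` is a simple way to compare them on real data.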
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

David Winsemius

2014-Aug-10 22:24 UTC

On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:
> Good afternoon,
> 
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
> 
> • Get a list of all the data files in a directory
> • Load each file into a dataframe
> • Merge them into a single data frame

Something along these lines:

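# "\\.csv$" matches only names ending in ".csv" (a bare "." matches any character)
all_data <- do.call( rbind,
                 lapply( list.files(path=getwd(), pattern="\\.csv$"),
                         read.csv) )

Possibly (this keeps a named list of data frames rather than stacking them):

# simplify=FALSE stops sapply from collapsing the data frames into a matrix
all_data <- sapply( list.files(path=getwd(), pattern="\\.csv$"),
                         read.csv, simplify=FALSE)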

Untested since no reproducible example was offered. This skips the task of
individually assigning names to the input dataframes. There are quite a few
variations on this in the Archives. You should learn to search them. Rseek.org
or MarkMail are effective for me.

http://www.rseek.org/

http://markmail.org/search/?q=list%3Aorg.r-project.r-help
> 
> Because all of the columns were the same, the simplest solution in my
> mind was to `Reduce' the vector of dataframes with a call to
> `merge'. That worked fine, I got what was expected. That is key
> actually. It is literally a one-liner, and there will never be index
> or scoping errors with it.

You might have forced `merge` to work with the correct choice of arguments, but
it would have silently eliminated duplicate rows. It seems unlikely to me that it
would be efficient for the purpose of just stacking dataframe values.

> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) )
[1] a b
<0 rows> (or 0-length row.names)
> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) , all=TRUE)
  a b
1 1 2
2 3 4
> merge( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2
> rbind( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2
2 1 2

> Now with that in mind, what is the idiomatic way? Do people usually do
> something else because it is /faster/ (by some definition)?
> 
> Kind regards,
> 
-- 

David Winsemius
Alameda, CA, USA

John McKown

2014-Aug-10 23:50 UTC

On Sun, Aug 10, 2014 at 1:51 PM, Grant Rettke <gcr at wisdomandwonder.com> wrote:
>
> Good afternoon,
>
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
>
> • Get a list of all the data files in a directory

OK, I assume this results in a vector of file names in a variable,
like you'd get from list.files();
>
> • Load each file into a dataframe

Why? Do you need them in separate data frames?
>
> • Merge them into a single data frame

The meat of the question. If you don't need the files in separate data
frames, and the files do _NOT_ have headers, then I would just load
them all into a single frame. I use Linux and so my solution may not
work on Windows. Something like:

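list_of_files <- list.files(pattern=".*data$") # list of data files
#
# command to list contents of all files to stdout
# (the names are collapsed so that pipe() receives one command string):
command <- pipe(paste('cat', paste(shQuote(list_of_files), collapse=' ')))
read.table(command, header=FALSE)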

I would guess that Windows has something equivalent to cat, is it
"type"? I have a vague memory of that.

The above will work with header=TRUE, but the headers in the second
and subsequent files are taken as data. And if you have row.names in
the data, such as write.csv() does, then this is really not for you.
Well, at least it would not be as simple. There are ways around it
using a more intelligent "copy" program than "cat". Such as AWK. If
you need an AWK example, I can fake one up. It would strip the headers
from the 2nd and subsequent files and remove the first column
"row.names" values. Not really all that difficult, but "fiddly".
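
One way the AWK idea above could be sketched from R is shown below. This is only an illustration under stated assumptions: `awk` is available on the PATH, the files are comma-separated as produced by write.csv(), and the row names contain no commas. The demo files and the names `demo_dir`, `awk_prog` and `combined` are hypothetical.

```r
# Hypothetical demo files written with write.csv(), which adds a
# leading row.names column and a header line to each file.
demo_dir <- tempfile("awkdemo")
dir.create(demo_dir)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(demo_dir, "one.csv"))
write.csv(data.frame(a = 5:6, b = 7:8), file.path(demo_dir, "two.csv"))
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# AWK program: skip the header line of every file except the first,
# then drop the first comma-separated field (the row names) and print.
awk_prog <- 'FNR == 1 && NR != 1 { next } { sub(/^[^,]*,/, ""); print }'

# Build a single shell command and read its output through a pipe.
cmd <- paste("awk", shQuote(awk_prog),
             paste(shQuote(files), collapse = " "))
combined <- read.csv(pipe(cmd))
print(combined)
```

The header of the first file is kept, so read.csv() can still pick up the column names.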
>
> Because all of the columns were the same, the simplest solution in my
> mind was to `Reduce' the vector of dataframes with a call to
> `merge'. That worked fine, I got what was expected. That is key
> actually. It is literally a one-liner, and there will never be index
> or scoping errors with it.
>
> Now with that in mind, what is the idiomatic way? Do people usually do
> something else because it is /faster/ (by some definition)?
>
> Kind regards,
>
>

-- 
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
