Good afternoon,

Today I was working on a practice problem. It was simple, and perhaps even realistic. It looked like this:

* Get a list of all the data files in a directory
* Load each file into a dataframe
* Merge them into a single data frame

Because all of the columns were the same, the simplest solution in my mind was to `Reduce' the vector of dataframes with a call to `merge'. That worked fine, I got what was expected. That is key actually. It is literally a one-liner, and there will never be index or scoping errors with it.

Now with that in mind, what is the idiomatic way? Do people usually do something else because it is /faster/ (by some definition)?

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
gcr at wisdomandwonder.com | http://www.wisdomandwonder.com/
"Wisdom begins in wonder." --Socrates
((λ (x) (x x)) (λ (x) (x x)))
"Life has become immeasurably better since I have been forced to stop taking it seriously." --Thompson
Just load the data frames into a list and give that list to rbind (via do.call). It is far more efficient to determine up front how big the final data frame has to be and preallocate the result memory than to incrementally allocate larger and larger data frames along the way using Reduce.

---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On August 10, 2014 11:51:22 AM PDT, Grant Rettke <gcr at wisdomandwonder.com> wrote:
> [...]
> Now with that in mind, what is the idiomatic way? Do people usually do
> something else because it is /faster/ (by some definition)?
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
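The advice in the reply above can be sketched end to end as follows. This is only an illustration under stated assumptions: the temporary directory, the file names `part1.csv` to `part3.csv`, and the columns `id` and `value` are all made up so that the example runs on its own.

```r
# Hypothetical setup: write three small CSV files with identical columns
# into a temporary directory so the example is self-contained.
data_dir <- tempfile("csvdemo")
dir.create(data_dir)
for (i in 1:3) {
  write.csv(data.frame(id = (2 * i - 1):(2 * i), value = c(i, i * 10)),
            file.path(data_dir, sprintf("part%d.csv", i)),
            row.names = FALSE)
}

# Step 1: list the data files (full paths, anchored regular expression).
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: load each file into its own data frame, collected in a list.
frames <- lapply(files, read.csv)

# Step 3: hand the whole list to rbind in a single call.
combined <- do.call(rbind, frames)

# For comparison: the pairwise fold produces the same rows, but builds a
# larger intermediate data frame at every step.
combined_pairwise <- Reduce(rbind, frames)
```

Both forms stay one-liners once `frames` exists; the difference is that `do.call` passes every data frame to a single `rbind` call, while `Reduce` calls `rbind` once per pair.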
On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:

> Good afternoon,
>
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
>
> * Get a list of all the data files in a directory
> * Load each file into a dataframe
> * Merge them into a single data frame

Something along these lines:

    all <- do.call( rbind,
                    lapply( list.files(path=getwd(), pattern="\\.csv$"),
                            read.csv) )

Possibly:

    all <- sapply( list.files(path=getwd(), pattern="\\.csv$"), read.csv)

Untested since no reproducible example was offered. This skips the task of individually assigning names to the input dataframes. There are quite a few variations on this in the Archives. You should learn to search them. Rseek.org or MarkMail are effective for me.

http://www.rseek.org/
http://markmail.org/search/?q=list%3Aorg.r-project.r-help

> Because all of the columns were the same, the simplest solution in my
> mind was to `Reduce' the vector of dataframes with a call to
> `merge'. That worked fine, I got what was expected. That is key
> actually. It is literally a one-liner, and there will never be index
> or scoping errors with it.

You might have forced `merge` to work with the correct choice of arguments, but it would have silently eliminated duplicate rows. It seems unlikely to me that it would be efficient for the purpose of just stacking dataframe values.

    > merge( data.frame(a=1, b=2), data.frame(a=3, b=4) )
    [1] a b
    <0 rows> (or 0-length row.names)
    > merge( data.frame(a=1, b=2), data.frame(a=3, b=4) , all=TRUE)
      a b
    1 1 2
    2 3 4
    > merge( data.frame(a=1, b=2), data.frame(a=1, b=2) )
      a b
    1 1 2
    > rbind( data.frame(a=1, b=2), data.frame(a=1, b=2) )
      a b
    1 1 2
    2 1 2

-- 
David Winsemius
Alameda, CA, USA
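The same behaviour shows up when a whole list of data frames is folded, which is the situation in the original question. The following is a small sketch with made-up data (the `frames` list and its values are hypothetical), comparing stacking against `Reduce` with `merge`:

```r
# Three hypothetical data frames with identical columns; the second
# repeats a row (a = 2, b = 20) that already appears in the first.
frames <- list(
  data.frame(a = c(1, 2), b = c(10, 20)),
  data.frame(a = c(2, 3), b = c(20, 30)),
  data.frame(a = 4, b = 40)
)

# Stacking keeps every input row, duplicates included.
stacked <- do.call(rbind, frames)

# Folding with merge's defaults is an inner join on all shared columns,
# so only rows present in every data frame survive.
inner <- Reduce(merge, frames)

# Folding with all = TRUE keeps unmatched rows, but the repeated
# (2, 20) row is collapsed into one.
outer <- Reduce(function(x, y) merge(x, y, all = TRUE), frames)

nrow(stacked)  # 5
nrow(inner)    # 0
nrow(outer)    # 4
```

So a `Reduce`/`merge` one-liner only reproduces plain stacking when no row occurs in more than one file and `all = TRUE` is passed.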
On Sun, Aug 10, 2014 at 1:51 PM, Grant Rettke <gcr at wisdomandwonder.com> wrote:

> Good afternoon,
>
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
>
> * Get a list of all the data files in a directory

OK, I assume this results in a vector of file names in a variable, like you'd get from list.files().

> * Load each file into a dataframe

Why? Do you need them in separate data frames?

> * Merge them into a single data frame

The meat of the question. If you don't need the files in separate data frames, and the files do _NOT_ have headers, then I would just load them all into a single frame. I use Linux, so my solution may not work on Windows. Something like:

    list_of_files <- list.files(pattern=".*data$")  # list of data files
    #
    # command to list contents of all files to stdout
    # (a single command string: cat file1 file2 ...):
    command <- pipe(paste(c('cat', shQuote(list_of_files)), collapse=' '))
    read.table(command, header=FALSE)

I would guess that Windows has something equivalent to cat; is it "type"? I have a vague memory of that.

The above will work with header=TRUE, but the headers in the second and subsequent files are taken as data. And if you have row.names in the data, such as write.csv() produces, then this is really not for you. Well, at least it would not be as simple. There are ways around it using a more intelligent "copy" program than "cat", such as AWK. If you need an AWK example, I can fake one up. It would strip the headers from the 2nd and subsequent files and remove the first column of "row.names" values. Not really all that difficult, but "fiddly".

-- 
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
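The header-stripping idea mentioned above can also be done without an external tool. What follows is not the AWK version offered in the message but a rough equivalent in plain R, assuming every file has exactly one header line and no row.names column; the directory, the file names `one.data` and `two.data`, and the columns `x` and `y` are invented for the example.

```r
# Hypothetical setup: two CSV files, each written with a header row.
data_dir <- tempfile("hdrdemo")
dir.create(data_dir)
write.csv(data.frame(x = 1:2, y = c("a", "b")),
          file.path(data_dir, "one.data"), row.names = FALSE)
write.csv(data.frame(x = 3:4, y = c("c", "d")),
          file.path(data_dir, "two.data"), row.names = FALSE)

files <- list.files(data_dir, pattern = "\\.data$", full.names = TRUE)

# Read the raw lines of each file, keeping the header only from the first.
all_lines <- unlist(lapply(seq_along(files), function(i) {
  lines <- readLines(files[i])
  if (i == 1) lines else lines[-1]
}))

# Parse the concatenated text once, as if it were a single file.
combined <- read.csv(text = all_lines)
```

This keeps the "concatenate first, parse once" shape of the `cat` approach while staying portable across Linux and Windows.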