richard_stahlhut at urmc.rochester.edu
2009-Aug-12 16:25 UTC
[Rd] 10x slower merge in mac 2.9.1 vs. 2.9.0 (PR#13890)
Full_Name: Rick Stahlhut
Version: 2.9.1
OS: os x 10.5.7
Submission from: (NULL) (128.151.71.23)

I upgraded to 2.9.1 today from 2.9.0. I work with large CDC (Centers for Disease Control) datasets and frequently start with a series of 23 large-ish merges to create the final dataset I work on. I do this each time because (a) R is fast, so why not? and (b) the datasets occasionally get updated by CDC, and it's easier to swap in new files that way.

One such merge is of two data.frames with 10 variables and 10,000 rows each. The command in question is:

    temp = merge(demo.2, ph, by="seqn", all.x=TRUE)

In 2.9.0, this command took 3.3 seconds.
In 2.9.1, it took 35.8 seconds.

I have reverted to 2.9.0.

Additional packages loaded are:

    library(Hmisc)
    library(alr3)
    library(epicalc)
    library(ggplot2)
    library(lattice)
    library(reshape)
    library(survey)
    library(car)

Thanks very much for all the effort. R is wonderful.
Simon Urbanek
2009-Aug-13 15:52 UTC
[Rd] 10x slower merge in mac 2.9.1 vs. 2.9.0 (PR#13890)
Rick,

I'm sorry, but I cannot reproduce it. You didn't supply sessionInfo() or the actual data, so all I can do is guess, but according to your description this test case shows no difference:

    set.seed(1)
    n=10000
    d1=data.frame(seqn=as.integer(runif(n)*n),a=rnorm(n),b=rnorm(n),c=rnorm(n),
                  d=rnorm(n),e=rnorm(n),f=rnorm(n),g=rnorm(n),h=rnorm(n),i=rnorm(n))
    d2=data.frame(seqn=as.integer(runif(n)*n),a=rnorm(n),b=rnorm(n),c=rnorm(n),
                  d=rnorm(n),e=rnorm(n),f=rnorm(n),g=rnorm(n),h=rnorm(n),i=rnorm(n))
    system.time(merge(d1,d2,by="seqn",all.x=TRUE))

R 2.9.1:

    > system.time(merge(d1,d2,by="seqn",all.x=TRUE))
       user  system elapsed
      0.150   0.067   0.217

R 2.9.0:

    > system.time(merge(d1,d2,by="seqn",all.x=TRUE))
       user  system elapsed
      0.148   0.068   0.216

To substantiate your claim, please provide a reproducible example as well as sessionInfo() [and details on how you run it - GUI, CLI, ...], but I suspect the difference may be in your data, not R.

Thanks,
Simon
adrian_d at eskimo.com
2009-Aug-13 16:05 UTC
[Rd] 10x slower merge in mac 2.9.1 vs. 2.9.0 (PR#13890)
This issue has been reported before:

http://thread.gmane.org/gmane.comp.lang.r.devel/20945/focus=20959

It happens when data frames contain character strings.

Thanks,
Adrian
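Adrian's note that the slowdown appears when data frames contain character strings suggests a variant of Simon's test case that adds a character column. The following is only a sketch under that assumption: the helper make_frame and the column name label are hypothetical, the real CDC data are not available here, and nothing in this thread confirms that this particular frame triggers the regression. Running it under both 2.9.0 and 2.9.1 and comparing the elapsed times would be one way to check.

```r
# Sketch: Simon's numeric test case reduced to a few columns, plus one
# character column. make_frame and "label" are hypothetical names.
set.seed(1)
n <- 10000

make_frame <- function(n) {
  data.frame(
    seqn  = sample(n),                          # unique integer merge key
    a     = rnorm(n),
    label = paste("id", sample(n), sep = ""),   # character column
    stringsAsFactors = FALSE                    # keep label as character, not factor
  )
}

d1 <- make_frame(n)
d2 <- make_frame(n)

# Time the same kind of left merge used in the original report.
timing <- system.time(
  merged <- merge(d1, d2, by = "seqn", all.x = TRUE)
)
print(timing)

# Record the environment alongside the timing, as Simon requested.
print(sessionInfo())
```

Posting the printed timing together with the sessionInfo() output from each R version gives the kind of reproducible report asked for earlier in the thread.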