Hi netters,

First of all, thanks a lot for all the prompt replies to my earlier question about "merging" data frames in R. That operation is the equivalent of a "join" in MySQL.

Now I have another question. Suppose I have a data frame X with lots of columns/variables: Name, Age, Group, Type, Salary. I want to compute group means of the salaries:

    aggregate(X$Salary, by = list(X$Group, X$Age, X$Type), FUN = mean)

When Group and Type have many levels, it takes R forever to finish the aggregation, and gc() shows that the memory usage is large as well.

However, MySQL finishes a similar job in seconds:

    select Group, Age, Type, avg(Salary) from X group by Group, Age, Type

Is MySQL simply superior at this kind of operation, or is my R command not efficient enough? Why does R need so much memory to do the aggregation?

Thanks again!

Zhihua Li
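A slightly expanded version of that call may be worth noting (a sketch, assuming the data frame X described above): naming the elements of the by list gives the result readable column names instead of Group.1, Group.2, and so on.

    # named list entries become the grouping column names in the output
    Xagg <- aggregate(X$Salary,
                      by = list(Group = X$Group, Age = X$Age, Type = X$Type),
                      FUN = mean)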
Huali, if I were you, I would create a view on the MySQL server to aggregate the data first, and then use R to pull the data through that view. This is not only applicable to R but a general guideline in similar situations. In my understanding and experience, R handles data manipulation reasonably well; still, we should always use the right tool for the job.

On Jan 26, 2008 6:45 PM, zhihuali <lzhtom at hotmail.com> wrote:
> [quoted original message omitted]

--
WenSui Liu
Statistical Project Manager
ChoicePoint Precision Marketing
(http://spaces.msn.com/statcompute/blog)
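A minimal sketch of that server-side approach, assuming a MySQL table X with the columns described above; the connection details and the view name salary_summary are placeholders:

    library(DBI)
    library(RMySQL)

    # connect to the MySQL server (credentials are hypothetical)
    con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                     user = "user", password = "pass")

    # let the server do the aggregation; `Group` needs backticks
    # because GROUP is a reserved word in MySQL
    dbSendQuery(con, "
        CREATE OR REPLACE VIEW salary_summary AS
        SELECT `Group`, Age, Type, AVG(Salary) AS avg_salary
        FROM X
        GROUP BY `Group`, Age, Type")

    # pull only the (much smaller) aggregated result into R
    Xagg <- dbGetQuery(con, "SELECT * FROM salary_summary")
    dbDisconnect(con)

This way only the aggregated rows cross the wire, so R never has to hold the raw table in memory.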
Gabor Grothendieck
2008-Jan-27 00:07 UTC
[R] Comparison of aggregate in R and group by in mysql
How does it compare if you read the data into R and then do your aggregate with sqldf?

    library(sqldf)

    # example using the builtin data set CO2
    CO2agg <- sqldf("select Plant, Type, Treatment, avg(conc) from CO2
                     group by Plant, Type, Treatment")

    # or using your data; "Group" is quoted because GROUP is a
    # reserved word in SQL
    Xagg <- sqldf('select "Group", Age, Type, avg(Salary) from X
                   group by "Group", Age, Type')

On Jan 26, 2008 6:45 PM, zhihuali <lzhtom at hotmail.com> wrote:
> [quoted original message omitted]
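If memory rather than speed is the bottleneck, sqldf can also be pointed at a file-backed SQLite database instead of its default in-memory one via the dbname argument; a sketch, with avgSalary as a made-up output name:

    # a temporary on-disk database keeps the working table out of RAM
    Xagg <- sqldf('select "Group", Age, Type, avg(Salary) as avgSalary
                   from X group by "Group", Age, Type',
                  dbname = tempfile())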
How large is your data frame? How much memory do you have on your system? Are you paging? Here is a test I ran with a data frame of 1,000,000 rows, and it seems to be fast:

    > n <- 1000000
    > x <- data.frame(A=sample(LETTERS, n, TRUE), B=sample(letters[1:4], n, TRUE),
    +                 C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
    > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
       user  system elapsed
       2.65    0.34    3.00

On Jan 26, 2008 6:45 PM, zhihuali <lzhtom at hotmail.com> wrote:
> [quoted original message omitted]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
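If the grouping factors really do have a huge number of levels, one workaround is to collapse them into a single key first, so that only level combinations that actually occur in the data are kept; a sketch using the same simulated data, relying on base R's interaction() with drop = TRUE:

    # one factor per observed (A, B, C) combination; drop = TRUE discards
    # the combinations that never occur, avoiding a full cross-product
    key <- interaction(x$A, x$B, x$C, drop = TRUE)

    # one pass over the data: mean of 'data' per observed combination
    x.agg2 <- tapply(x$data, key, mean)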