Dimitri Liakhovitski
2010-Mar-30 15:04 UTC
[R] Code is too slow: mean-centering variables in a data frame by subgroup
Dear R-ers,

I have a large data frame (several thousands of rows and about 2.5 thousand columns). One variable ("group") is a grouping variable with over 30 levels. And I have a lot of NAs.
For each variable, I need to divide each value by variable mean - by subgroup. I have the code but it's way too slow - takes me about 1.5 hours.
Below is a data example and my code that is too slow. Is there a different, faster way of doing the same thing?
Thanks a lot for your advice!

Dimitri

# Building an example frame - with groups and a lot of NAs:
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
  i.for.NA<-sample(1:100,60)
  frame[[i]][i.for.NA]<-NA
}
frame

### Code that does what's needed but is too slow:
Start<-Sys.time()
frame <- do.call(cbind, lapply(names.used, function(x){
  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
}))
Finish<-Sys.time()
print(Finish-Start) # Takes too long

--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com
Dgnn
2010-Mar-30 16:02 UTC
[R] Code is too slow: mean-centering variables in a data frame by subgroup
I posted a similar problem last week (but with an uninformative subject header). See if this helps:
http://n4.nabble.com/a-vectorized-solution-to-some-simple-dataframe-math-td1692810.html#a1710410
Dimitri Liakhovitski
2010-Mar-30 16:07 UTC
[R] Code is too slow: mean-centering variables in a data frame by subgroup
I wrote different code, but it takes twice as long as my original code. :(
However, I thought I should share it as well, because the second part of the code is fast; it's the first part that's slow. Maybe there is a way to fix the first part...
Thank you!

group.var<-"group"
subgroups<-levels(frame[[group.var]])
system.time({
means.no.zeros<-list()
for(i in 1:length(subgroups)){ # SLOW part of the code
  row.of.means<-as.data.frame(t(colMeans(frame[frame[[group.var]] %in% subgroups[i],names.used],na.rm=T)))
  nr.of.rows<-(dim(frame[frame[[group.var]] %in% subgroups[i],])[1])
  means.no.zeros[[i]]<-as.data.frame(matrix(nrow=nr.of.rows,ncol=length(names.used)))
  means.no.zeros[[i]]<-row.of.means
  for(z in 1:nr.of.rows){ #z<-1
    means.no.zeros[[i]][z,] = row.of.means
  }
}
means.no.zeros<-do.call(rbind,means.no.zeros)
})

system.time({ #FAST part of the code
frame[names.used]<-frame[names.used]/means.no.zeros
})

--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com
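The slow part of this second version only has to produce a table with one row of subgroup means per observation. A minimal sketch of one way to build that table without the inner row loop follows. It assumes the grouping column is a factor (in R 4.0 and later, data.frame() no longer creates factors by default, so the example converts explicitly), it uses a three-column version of the example frame, and the names group.means and result are illustrative rather than taken from the thread.

```r
# Cut-down version of the example frame, with group as an explicit factor.
set.seed(1234)
frame <- data.frame(group = factor(rep(paste("group", 1:10), 10)),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100))
frame <- frame[order(frame$group), ]
names.used <- names(frame)[-1]
set.seed(1234)
for (i in names.used) frame[[i]][sample(1:100, 60)] <- NA

# One row of column means per subgroup (groups x variables matrix).
group.means <- t(sapply(split(frame[names.used], frame$group),
                        colMeans, na.rm = TRUE))

# Expand to one row per observation by indexing with the factor codes.
means.no.zeros <- group.means[as.integer(frame$group), , drop = FALSE]

# Same division as the "FAST part", done on a matrix and written back.
result <- frame
result[names.used] <- as.matrix(frame[names.used]) / means.no.zeros
```

The division itself is unchanged from the fast part of the post; only the construction of means.no.zeros differs.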
Charles C. Berry
2010-Mar-30 16:24 UTC
[R] Code is too slow: mean-centering variables in a data frame by subgroup
Use model.matrix and crossprod to do this in a vectorized fashion:

mat <- as.matrix(frame[,-1])
mm <- model.matrix(~0+group,frame)           # one indicator column per group
col.grp.N <- crossprod( !is.na(mat), mm )    # non-NA counts, variables x groups
mat[is.na(mat)] <- 0.0
col.grp.sum <- crossprod( mat, mm )          # sums ignoring NAs, variables x groups
mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
is.na(mat) <- is.na(frame[,-1])              # put the original NAs back

mat is now a matrix whose columns each correspond to the columns in 'frame' as you have it after do.call(...)

Are you sure you want to divide the values by their (possibly negative) means??

HTH,

Chuck

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
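The model.matrix/crossprod suggestion can be tried on a cut-down version of the example frame. What follows is a sketch rather than Chuck's exact code: it keeps the NAs in mat and zero-fills a copy instead, so the final step of restoring the NAs is not needed. The names mat0, n, s, grp.mean and scaled are illustrative, and the sketch assumes group is a factor so that indexing by as.integer(frame$group) selects the matching row of means.

```r
# Cut-down version of the example frame, with group as an explicit factor.
set.seed(1234)
frame <- data.frame(group = factor(rep(paste("group", 1:10), 10)),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100))
frame <- frame[order(frame$group), ]
set.seed(1234)
for (i in names(frame)[-1]) frame[[i]][sample(1:100, 60)] <- NA

mat <- as.matrix(frame[, -1])
mm  <- model.matrix(~ 0 + group, frame)  # rows x groups indicator matrix

n <- crossprod(!is.na(mat), mm)          # variables x groups: non-NA counts
mat0 <- mat
mat0[is.na(mat0)] <- 0                   # zero-filled copy used only for sums
s <- crossprod(mat0, mm)                 # variables x groups: sums ignoring NAs

grp.mean <- t(s / n)                     # groups x variables: subgroup means
scaled <- mat / grp.mean[as.integer(frame$group), , drop = FALSE]
```

Because the two crossprod calls replace the per-group loop with matrix products, the cost should grow mainly with the size of the matrix rather than with the number of by() calls, though actual timings on the full data were not reported in the thread.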
Bert Gunter
2010-Mar-30 17:52 UTC
[R] Code is too slow: mean-centering variables in a data frame by subgroup
?scale

Bert Gunter
Genentech Nonclinical Biostatistics
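scale() on its own works on whole columns, so using it for this problem needs a per-subgroup step. The post does not spell that step out, so the following is only a rough sketch on a small made-up matrix (x, g and scaled are hypothetical names): within each subgroup, scale() is called with center = FALSE and the subgroup's column means as the scale argument, which divides each column by its mean.

```r
# Small made-up example: 10 rows, 4 variables, 2 subgroups, some NAs.
set.seed(1)
x <- matrix(rnorm(40), nrow = 10,
            dimnames = list(NULL, c("a", "b", "c", "d")))
x[sample(length(x), 12)] <- NA
g <- factor(rep(c("g1", "g2"), each = 5))

scaled <- x
for (lev in levels(g)) {
  rows  <- g == lev
  block <- x[rows, , drop = FALSE]
  # center = FALSE: no subtraction; scale = column means: divide by the mean.
  scaled[rows, ] <- scale(block, center = FALSE,
                          scale = colMeans(block, na.rm = TRUE))
}
```

For literal mean-centering (subtracting the subgroup mean rather than dividing by it), the same loop could pass center = colMeans(block, na.rm = TRUE) and scale = FALSE.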