thr3ads.net - R help - [R] Code is too slow: mean-centering variables in a data frame by subgroup [Mar 2010]


Dimitri Liakhovitski

2010-Mar-30 15:04 UTC

[R] Code is too slow: mean-centering variables in a data frame by subgroup

Dear R-ers,

I have a large data frame (several thousand rows and about 2,500
columns). One variable ("group") is a grouping variable with
over 30 levels, and there are a lot of NAs.
For each variable, I need to divide each value by that variable's mean
within its subgroup. I have code that does this, but it's way too slow:
it takes about 1.5 hours.
Below is a data example and my code that is too slow. Is there a
different, faster way of doing the same thing?
Thanks a lot for your advice!

Dimitri


# Building an example frame - with groups and a lot of NAs:
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),
                  a=rnorm(100),b=rnorm(100),c=rnorm(100),d=rnorm(100),
                  e=rnorm(100),f=rnorm(100),g=rnorm(100),
                  stringsAsFactors=TRUE)  # keep 'group' a factor (not the default since R 4.0)
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]  # every column except 'group'
set.seed(1234)
for(i in names.used){
       i.for.NA<-sample(1:100,60)  # 60 of the 100 rows become NA in each column
       frame[[i]][i.for.NA]<-NA
}
frame

### Code that does what's needed but is too slow:
Start<-Sys.time()
# For every column, by() splits the whole frame by group and divides by the group mean
frame <- do.call(cbind, lapply(names.used, function(x){
  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=TRUE)))
}))
Finish<-Sys.time()
print(Finish-Start) # Takes too long

-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com

Dgnn

2010-Mar-30 16:02 UTC


I posted a similar problem last week (but with an uninformative subject
header). See if this helps:
http://n4.nabble.com/a-vectorized-solution-to-some-simple-dataframe-math-td1692810.html#a1710410


Dimitri Liakhovitski

2010-Mar-30 16:07 UTC


I wrote different code - but it takes twice as long as my original code. :(
However, I thought I should share it as well - because the second part
of the code is fast - it's the first part that's slow. Maybe there is
a way to fix the first part...
Thank you!


group.var<-"group"
# unique() keeps the subgroups in row order and works for factor or character columns
subgroups<-unique(as.character(frame[[group.var]]))

system.time({
means.no.zeros<-list()
for(i in 1:length(subgroups)){  # SLOW part of the code
  in.group<-frame[[group.var]] %in% subgroups[i]
  row.of.means<-as.data.frame(t(colMeans(frame[in.group,names.used],na.rm=TRUE)))
  nr.of.rows<-sum(in.group)
  means.no.zeros[[i]]<-row.of.means
  for(z in 1:nr.of.rows){  # repeat the row of means once per row of the subgroup
    means.no.zeros[[i]][z,] = row.of.means
  }
 }
means.no.zeros<-do.call(rbind,means.no.zeros)
})

system.time({    #FAST part of the code
frame[names.used]<-frame[names.used]/means.no.zeros
})
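
One possible way to avoid the row-by-row fill in the slow part, offered only as a sketch: compute one row of means per subgroup, then expand it to one row per observation by indexing with the group label. The names below (toy, vars, grp, group.means, means.expanded, result) are made up for illustration and are not from the original post, and the toy data stands in for the real frame.

```r
# Hypothetical toy data standing in for the real frame.
toy <- data.frame(group = rep(c("g1", "g2"), each = 3),
                  a = c(1, 2, NA, 4, NA, 8),
                  b = c(NA, 2, 4, 10, 20, 30))
vars <- c("a", "b")
grp <- as.character(toy$group)

# One row of NA-aware column means per subgroup (groups x variables matrix).
# Assumes at least two variables; with one, sapply() returns a plain vector.
group.means <- t(sapply(split(toy[vars], grp), colMeans, na.rm = TRUE))

# Expand to one row per observation by looking up each row's group label.
means.expanded <- group.means[grp, , drop = FALSE]

# Divide each value by its subgroup mean; NAs stay NA.
result <- as.matrix(toy[vars]) / means.expanded
print(result)
```

Because the expansion is a single indexing step, it does not depend on the rows being sorted by group.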



Charles C. Berry

2010-Mar-30 16:24 UTC


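Use model.matrix and crossprod to do this in a vectorized fashion:

mat <- as.matrix(frame[,-1])
mm <- model.matrix(~0+group,frame)        # one 0/1 indicator column per group
col.grp.N <- crossprod( !is.na(mat), mm ) # non-missing counts per column and group
mat[is.na(mat)] <- 0.0
col.grp.sum <- crossprod( mat, mm )       # sums per column and group
# as.integer(as.factor(...)) picks each row's group mean whether 'group' is a factor or character
mat <- mat / ( t(col.grp.sum/col.grp.N)[ as.integer(as.factor(frame$group)),] )
is.na(mat) <- is.na(frame[,-1])           # put the original NAs back
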
mat is now a matrix whose columns each correspond to the columns in 
'frame' as you have it after do.call(...)
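
To see why this works, here is a small sketch on made-up data (toy, mat0, n, s and grp.mean are illustrative names, not part of the reply above). Multiplying the data by the 0/1 group indicator matrix adds up the values within each group, and doing the same with the non-missing flags counts them, so their ratio is the NA-aware group mean.

```r
# Hypothetical toy data: two groups, one variable with a missing value.
toy <- data.frame(group = factor(c("g1", "g1", "g2", "g2")),
                  x = c(1, 3, 10, NA))

mat <- as.matrix(toy[, -1, drop = FALSE])
mm <- model.matrix(~ 0 + group, toy)   # 4 x 2 matrix of group indicators

n <- crossprod(!is.na(mat), mm)        # non-missing count per group
mat0 <- mat
mat0[is.na(mat0)] <- 0                 # zeros do not change the sums
s <- crossprod(mat0, mm)               # sum per group

grp.mean <- s / n                      # same as mean(..., na.rm = TRUE) per group
print(grp.mean)
print(tapply(toy$x, toy$group, mean, na.rm = TRUE))
```

The two printed results should agree; the crossprod version simply gets all columns and all groups in one matrix product instead of looping.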


Are you sure you want to divide the values by their (possibly negative) 
means??

HTH,

Chuck


Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

Bert Gunter

2010-Mar-30 17:52 UTC


?scale

Bert Gunter
Genentech Nonclinical Biostatistics
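
The pointer to ?scale is not spelled out in the reply, so what follows is only one possible reading of it, sketched on made-up data. scale() with center = FALSE and a numeric 'scale' argument divides each column by the supplied value, so passing the NA-aware column means of one subgroup gives the division the original post asks for. The names toy, vars, scaled, rows and block are illustrative, and the loop over groups is an assumption about how the hint might be applied.

```r
# Hypothetical toy data standing in for the real frame.
toy <- data.frame(group = rep(c("g1", "g2"), each = 3),
                  a = c(1, 2, NA, 4, NA, 8),
                  b = c(NA, 2, 4, 10, 20, 30))
vars <- c("a", "b")

scaled <- toy
for (g in unique(toy$group)) {
  rows <- toy$group == g
  block <- as.matrix(toy[rows, vars])
  # center = FALSE: no subtraction; scale = means: divide each column by its group mean
  scaled[rows, vars] <- scale(block, center = FALSE,
                              scale = colMeans(block, na.rm = TRUE))
}
print(scaled)
```

This loops over groups only (about 30 in the original problem) rather than over every column and group, and each call handles all columns of a subgroup at once.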
 
 

