set.seed(123)
N = 30000
K = 400
theData = matrix(rnorm(N*K), ncol=K)
theData = as.data.frame(theData)
theData = cbind(indicator = sample(0:1, N, replace = TRUE), theData)
> system.time(results <- colMeans(subset(theData, indicator == 1)))
   user  system elapsed
  2.309   1.319   3.853
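
Note that subset() keeps the indicator column, so the first element of results above is just the mean of the indicator itself (always 1). Below is a minimal sketch, not a definitive implementation, of a variant that drops that column and checks the answer against an explicit loop in the spirit of the original meanprob(). The helper name mean_flagged and the small sizes n and k are my own assumptions, chosen only so the check runs quickly.

```r
# Hypothetical small example; n and k are illustrative, not the real sizes
set.seed(123)
n <- 1000
k <- 5
probs <- matrix(runif(n * k), ncol = k)
indicator <- sample(0:1, n, replace = TRUE)
m <- cbind(indicator, probs)

# Vectorised version: keep the flagged rows and drop the indicator column
# (mean_flagged is a hypothetical helper name)
mean_flagged <- function(x) {
  keep <- x[, 1] != 0
  colMeans(x[keep, -1, drop = FALSE])
}

fast <- mean_flagged(m)

# Reference computation: one column at a time, as in the original loop
slow <- numeric(k)
for (i in seq_len(k)) {
  slow[i] <- sum(m[m[, 1] != 0, i + 1]) / sum(m[, 1] != 0)
}

# Compare the two results up to floating-point tolerance
all.equal(unname(fast), slow)
```

The same helper should also accept a data.frame such as theData, since both the row/column indexing and colMeans() work on data frames with all-numeric columns.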
b
On Jul 20, 2007, at 6:17 PM, Diogo Alagador wrote:
> Hi all,
>
> I'm handling massive data.frames and matrices in R (30000 x 400).
> In the 1st column, say, I have 0s and 1s indicating rows that
> matter; other columns have probability values.
> One simple task I would like to do is to get the column mean
> values for the flagged rows (the ones with 1).
> As a very fresh "programmer" I have built a simple function in R,
> which is surely not very efficient! It works well for
> ordinary-sized matrices, but it does not go so well with huge ones.
>
> meanprob<-function(Robj){
> NLINE<-dim(Robj)[1];
> NCOLUMN<-dim(Robj)[2];
> mprob<-c(rep(0,(NCOLUMN-1)));
> for (i in 2:NCOLUMN){
> sumprob<-0;
> pa<-0;
> for (j in 1:NLINE){
> if(Robj[j,1]!=0){
> pa<-pa+1;
> sumprob<-Robj[j,i]+sumprob;
> }
> }
> mprob[i-1]<-sumprob/pa;
> }
> return(mprob);
> }
>
>
> So I "only" see 3 ways to get through the problem:
>
> - to reformulate the function to gain efficiency;
> - to establish a C-routine (for example), where loops are more
> "speedy", and then interfacing with R;
> - to find some function/package that already does that.
>
> Can anybody illuminate my way here?
>
> Many thanks,
>
> Diogo Andre' Alagador