Hello All,

I am new to R. I am trying to process a huge data set: a matrix with four columns, say x1, x2, x3, x4, and n rows. I want to aggregate the matrix by x1 and compute statistics on columns x2, x3, and x4. I tried the aggregate function, but it gave me a memory allocation error (which did not surprise me), so I ended up writing a for loop over the values of x1 and subsetting the matrix for each value. However, I have a hunch that there should be a less expensive way of doing this. Any ideas or tips to optimize this processing logic would be greatly appreciated.

Manoj
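(For reference, the loop-and-subset approach described above might look roughly like the sketch below. The object name `m`, the column names, and the use of colMeans as the per-group statistic are illustrative assumptions, not the poster's actual code.)

    ## Loop over the distinct values of x1 and summarise x2:x4 within each group.
    ## Assumes a numeric matrix `m` with named columns x1, x2, x3, x4.
    groups <- unique(m[, "x1"])
    res <- vector("list", length(groups))
    names(res) <- groups
    for (g in groups) {
        sub <- m[m[, "x1"] == g, c("x2", "x3", "x4"), drop = FALSE]
        res[[as.character(g)]] <- colMeans(sub)  # colMeans stands in for the real statistic
    }
    res <- do.call(rbind, res)  # one row of summaries per x1 group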
Manoj - Hachibushu Capital wrote:
> I am new to R. I am trying to process a huge data set: a matrix with
> four columns, say x1, x2, x3, x4, and n rows. I want to aggregate the
> matrix by x1 and compute statistics on columns x2, x3, and x4.

Someone will probably give you a way to do this directly in R, but if your data set is truly huge, at least one option is to keep the data in a PostgreSQL database and define a custom aggregate using PL/R. For a simple example, see:

http://www.joeconway.com/plr/doc/plr-aggregate-funcs.html

HTH,
Joe
Loops are time consuming in R. Try one of the apply functions for vectorized calculations, like "apply", "lapply", "sapply", or "tapply". Also see the help for "split".

In a message dated 10/19/03 5:25:51 PM Pacific Daylight Time, Wanzare@HCJP.com writes:

> I am trying to process a huge data set: a matrix with four columns,
> say x1, x2, x3, x4, and n rows. I want to aggregate the matrix by x1
> and compute statistics on columns x2, x3, and x4. I tried the
> aggregate function, but it gave me a memory allocation error, so I
> ended up writing a for loop over the values of x1 and subsetting the
> matrix for each value. However, I have a hunch that there should be a
> less expensive way of doing this.
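(A minimal sketch of that suggestion, under the same illustrative assumptions as above: a matrix `m` with named columns x1..x4, and a mean as the statistic.)

    ## tapply applies a function to one column within groups defined by another
    means.x2 <- tapply(m[, "x2"], m[, "x1"], mean)

    ## split() groups the row indices by x1; sapply then summarises the
    ## whole sub-matrix for each group in one pass
    idx <- split(1:nrow(m), m[, "x1"])
    res <- t(sapply(idx, function(i)
        colMeans(m[i, c("x2", "x3", "x4"), drop = FALSE])))

tapply is limited to one column at a time, so the split/sapply variant is the more direct replacement for a loop that summarises all three columns.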
> From: TyagiAnupam at aol.com [mailto:TyagiAnupam at aol.com]
>
> Loops are time consuming in R. Try one of the apply functions for
> vectorized calculations, like "apply", "lapply", "sapply" or "tapply".
> Also see help for "split".

Have you actually compared a for loop with apply, in terms of timing? Have you looked at the R code for apply()? It contains:

<...>
    if (length(d.call) < 2) {
        if (length(dn.call))
            dimnames(newX) <- c(dn.call, list(NULL))
        for (i in 1:d2) ans[[i]] <- FUN(newX[, i], ...)
    }
    else for (i in 1:d2) ans[[i]] <- FUN(array(newX[, i], d.call,
        dn.call), ...)
<...>

Notice the for loop there! While what you said about apply and for loops may be true for (older versions of) S-PLUS, it is not true for R.

lapply() does do its looping at the C level. sapply() and tapply() use lapply(), so they can be faster than a for loop at the R level.

Andy
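(The point is easy to check with a quick timing comparison along the following lines; the matrix size and the statistic are arbitrary choices for the sketch, not figures from the thread.)

    x <- matrix(rnorm(1e6), ncol = 100)

    ## explicit for loop at the R level
    system.time({
        s1 <- numeric(ncol(x))
        for (i in 1:ncol(x)) s1[i] <- sum(x[, i])
    })

    ## apply() also loops at the R level internally, so it is roughly comparable
    system.time(s2 <- apply(x, 2, sum))

    ## a genuinely vectorized alternative, usually much faster than either
    system.time(s3 <- colSums(x))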
In a message dated 10/20/03 5:11:25 AM Pacific Daylight Time, andy_liaw@merck.com writes:

> Have you actually compared a for loop with apply, in terms of timing?
> Have you looked at the R code for apply()? It contains: [...]
>
> Notice the for loop there! While what you said about apply and for
> loops may be true for (older versions of) S-PLUS, it is not true for R.
>
> lapply() does do its looping at the C level. sapply() and tapply() use
> lapply(), so they can be faster than a for loop at the R level.

I have not done the comparison. Thanks a lot for pointing this out.

Anupam.