Hello All,

I am new to R. I am trying to process a huge data set: a matrix with four columns, say x1, x2, x3, x4, and n rows. I want to aggregate the matrix by x1 and compute statistics on columns x2, x3, and x4. I tried the aggregate function, but it gave me a memory allocation error (which did not surprise me), so I ended up writing a for loop over the values of x1 and subsetting the matrix for each value. However, I have a hunch that there should be a less expensive way of doing this. Any ideas or tips to optimize this processing logic would be greatly appreciated.

Manoj
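(For reference, the loop-and-subset approach described above might look roughly like the sketch below. The object name `m`, the column names, and the use of colMeans as the per-group statistic are illustrative assumptions, not the poster's actual code.)

    ## Loop over the distinct values of x1 and summarise x2:x4 within each group.
    ## Assumes a numeric matrix `m` with named columns x1, x2, x3, x4.
    groups <- unique(m[, "x1"])
    res <- vector("list", length(groups))
    names(res) <- groups
    for (g in groups) {
        sub <- m[m[, "x1"] == g, c("x2", "x3", "x4"), drop = FALSE]
        res[[as.character(g)]] <- colMeans(sub)  # colMeans stands in for the real statistic
    }
    res <- do.call(rbind, res)  # one row of summaries per x1 group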
Manoj - Hachibushu Capital wrote:
> I am new to R. I am trying to process a huge data set: a matrix with
> four columns, say x1, x2, x3, x4, and n rows. I want to aggregate the
> matrix by x1 and compute statistics on columns x2, x3, and x4.

Someone will probably give you a way to do this directly in R, but if your data set is truly huge, at least one option is to keep the data in a PostgreSQL database and define a custom aggregate using PL/R. For a simple example, see:

http://www.joeconway.com/plr/doc/plr-aggregate-funcs.html

HTH,
Joe
Loops are time consuming in R. Try one of the apply functions for vectorized calculations, like "apply", "lapply", "sapply", or "tapply". Also see the help for "split".

In a message dated 10/19/03 5:25:51 PM Pacific Daylight Time, Wanzare@HCJP.com writes:

> I am trying to process a huge data set: a matrix with four columns,
> say x1, x2, x3, x4, and n rows. I want to aggregate the matrix by x1
> and compute statistics on columns x2, x3, and x4. I tried the
> aggregate function, but it gave me a memory allocation error, so I
> ended up writing a for loop over the values of x1 and subsetting the
> matrix for each value. However, I have a hunch that there should be a
> less expensive way of doing this.
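(A minimal sketch of that suggestion, under the same illustrative assumptions as above: a matrix `m` with named columns x1..x4, and a mean as the statistic.)

    ## tapply applies a function to one column within groups defined by another
    means.x2 <- tapply(m[, "x2"], m[, "x1"], mean)

    ## split() groups the row indices by x1; sapply then summarises the
    ## whole sub-matrix for each group in one pass
    idx <- split(1:nrow(m), m[, "x1"])
    res <- t(sapply(idx, function(i)
        colMeans(m[i, c("x2", "x3", "x4"), drop = FALSE])))

tapply is limited to one column at a time, so the split/sapply variant is the more direct replacement for a loop that summarises all three columns.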
> From: TyagiAnupam at aol.com [mailto:TyagiAnupam at aol.com]
>
> Loops are time consuming in R. Try one of the apply functions for
> vectorized calculations, like "apply", "lapply", "sapply" or "tapply".
> Also see help for "split".

Have you actually compared a for loop with apply, in terms of timing? Have you looked at the R code for apply()? It contains:

<...>
    if (length(d.call) < 2) {
        if (length(dn.call))
            dimnames(newX) <- c(dn.call, list(NULL))
        for (i in 1:d2) ans[[i]] <- FUN(newX[, i], ...)
    }
    else for (i in 1:d2) ans[[i]] <- FUN(array(newX[, i], d.call,
        dn.call), ...)
<...>

Notice the for loop there! While what you said about apply and for loops may be true for (older versions of) S-PLUS, it is not true for R.

lapply() does do its looping at the C level. sapply() and tapply() use lapply(), so they can be faster than a for loop at the R level.

Andy
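(The point is easy to check with a quick timing comparison along the following lines; the matrix size and the statistic are arbitrary choices for the sketch, not figures from the thread.)

    x <- matrix(rnorm(1e6), ncol = 100)

    ## explicit for loop at the R level
    system.time({
        s1 <- numeric(ncol(x))
        for (i in 1:ncol(x)) s1[i] <- sum(x[, i])
    })

    ## apply() also loops at the R level internally, so it is roughly comparable
    system.time(s2 <- apply(x, 2, sum))

    ## a genuinely vectorized alternative, usually much faster than either
    system.time(s3 <- colSums(x))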
In a message dated 10/20/03 5:11:25 AM Pacific Daylight Time, andy_liaw@merck.com writes:

> Have you actually compared a for loop with apply, in terms of timing?
> Have you looked at the R code for apply()? It contains: [...]
>
> Notice the for loop there! While what you said about apply and for
> loops may be true for (older versions of) S-PLUS, it is not true for R.
>
> lapply() does do its looping at the C level. sapply() and tapply() use
> lapply(), so they can be faster than a for loop at the R level.

I have not done the comparison. Thanks a lot for pointing this out.

Anupam.