Itay Furman
2005-Apr-05 22:59 UTC
[R] How to do aggregate operations with non-scalar functions
Hi,

I have a data set, the structure of which is something like this:

> a <- rep(c("a", "b"), c(6,6))
> x <- rep(c("x", "y", "z"), c(4,4,4))
> df <- data.frame(a=a, x=x, r=rnorm(12))

The true data set has >1 million rows. The factors "a" and "x" have
about 70 levels each; combined together they subset 'df' into ~900
data frames. For each such subset I'd like to compute various
statistics including quantiles, but I can't find an efficient way of
doing this. aggregate() gives me the desired structure - namely, one
row per subset - but I can use it only to compute a single quantile.

> aggregate(df[,"r"], list(a=a, x=x), quantile, probs=0.25)
  a x          x
1 a x  0.1693188
2 a y  0.1566322
3 b y -0.2677410
4 b z -0.6505710

With by() I could compute several quantiles per subset at each shot,
but the structure of the output is not convenient for further
analysis and visualization.

> by(df[,"r"], list(a=a, x=x), quantile, probs=c(0, 0.25))
a: a
x: x
        0%        25%
-0.7727268  0.1693188
----------------------------------------------------------
a: b
x: x
NULL
----------------------------------------------------------

[snip]

I would like to end up with a data frame like this:

  a x         0%        25%
1 a x -0.7727268  0.1693188
2 a y -0.3410671  0.1566322
3 b y -0.2914710 -0.2677410
4 b z -0.8502875 -0.6505710

I checked sweep() and apply() and didn't see how to harness them for
that purpose.

So, is there a simple way to convert the object returned by by() into
a data.frame? Or, is there a better way to go with this? Finally, if
I should roll my own coercion function: any tips?

Thank you very much in advance,
Itay

----------------------------------------------------------------
itayf at u.washington.edu  /  +1 (206) 543 9040  /  U of Washington
Rich FitzJohn
2005-Apr-06 00:07 UTC
[R] How to do aggregate operations with non-scalar functions
Hi Itay,

Not sure if by() can do it directly, but this does it from first
principles, using lapply() and tapply() (which aggregate() uses
internally). It would be reasonably straightforward to wrap this up
in a function.

a <- rep(c("a", "b"), c(6,6))
x <- rep(c("x", "y", "z"), c(4,4,4))
df <- data.frame(a=a, x=x, r=rnorm(12))

## Probabilities for quantile
p <- c(.25, .5, .75)

## This does the hard work of calculating the statistics over your
## combinations, and over the values in `p'
y <- lapply(p, function(y) tapply(df$r, list(a=a, x=x), quantile, probs=y))

## Then, we need to work out what combinations of a & x are possible:
## these are the header columns. aggregate() does this in a much more
## complicated way, which may handle more difficult cases than this
## (e.g. if there are lots of missing data points, or something).
vars <- expand.grid(dimnames(y[[1]]))

## Finish up by converting `y' into a true data.frame, and omitting
## all the cases where all the values in `y' are NA: these are
## combinations of a and x that we did not encounter.
y <- as.data.frame(lapply(y, as.vector))
names(y) <- paste(p, "%", sep="")
i <- colSums(apply(y, 1, is.na)) != ncol(y)
y <- cbind(vars, y)[i,]

Cheers,
Rich

On Apr 6, 2005 10:59 AM, Itay Furman <itayf at u.washington.edu> wrote:
> [quoted original message snipped]

--
Rich FitzJohn
rich.fitzjohn <at> gmail.com | http://homepages.paradise.net.nz/richa183
You are in a maze of twisty little functions, all alike
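A minimal sketch of what such a wrapper might look like, re-using the
steps above; the function name quantileBy and its argument names are
chosen here for illustration and are not from the original post:

quantileBy <- function(r, by, probs = c(.25, .5, .75)) {
    ## one array of quantiles per probability, over the factor combinations
    y <- lapply(probs, function(p) tapply(r, by, quantile, probs = p))
    ## all combinations of the grouping factors, in the same order that
    ## as.vector() unrolls each array
    vars <- expand.grid(dimnames(y[[1]]))
    y <- as.data.frame(lapply(y, as.vector))
    names(y) <- paste(probs, "%", sep = "")
    ## drop combinations of the factors that never occur (all quantiles NA)
    keep <- rowSums(is.na(y)) != ncol(y)
    cbind(vars, y)[keep, ]
}

## e.g.  quantileBy(df$r, list(a = df$a, x = df$x), probs = c(0, .25))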
Gabor Grothendieck
2005-Apr-06 02:15 UTC
[R] How to do aggregate operations with non-scalar functions
On Apr 5, 2005 6:59 PM, Itay Furman <itayf at u.washington.edu> wrote:
> [quoted original message snipped]

One can use

  do.call("rbind", by(df, list(a = a, x = x), f))

where f is the appropriate function. In this case f can be described
in terms of df.quantile, which is like quantile() except that it
returns a one-row data frame:

  df.quantile <- function(x, p) as.data.frame(t(data.matrix(quantile(x, p))))
  f <- function(df, p = c(0.25, 0.5)) cbind(df[1, 1:2], df.quantile(df[, "r"], p))
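Putting those pieces together into a single runnable sketch, with the
probabilities from the original question (0 and 0.25); the argument
name d (rather than df, to avoid shadowing the data frame) and the
result name res are choices made here for illustration:

  df.quantile <- function(x, p) as.data.frame(t(data.matrix(quantile(x, p))))
  f <- function(d, p = c(0, 0.25)) cbind(d[1, c("a", "x")], df.quantile(d[, "r"], p))
  res <- do.call("rbind", by(df, list(a = df$a, x = df$x), f))
  ## res has one row per non-empty (a, x) combination, with columns
  ## a, x, "0%" and "25%"; empty combinations come back from by() as
  ## NULL and are simply dropped by rbind().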
Itay Furman
2005-Apr-07 05:38 UTC
[R] How to do aggregate operations with non-scalar functions
On Wed, 6 Apr 2005 james.holtman at convergys.com wrote:

> Here is a method that I use in this situation. I work with the indices
> of the rows so that copies are not made and it is fast.
>
> Result <- lapply(split(seq(nrow(df)), df$a), function(.a){  # partition on the first variable
>     lapply(split(.a, df$z[.a]), function(.z){  # partition on the second variable -- notice the subsetting
>         c(quantile(df$r[.z]), ...anything else you want to compute)
>     })
> })
> Result <- do.call('rbind', Result)  # create a matrix - now you have your results
>
> Jim

Jim,

Thank you for your reply. For some reason, when I try your proposed
solution I get:

  Error in sort(unique.default(x), na.last = TRUE) : `x' must be atomic

Eventually, I used the solution proposed by Gabor G in this thread.
One advantage of his solution, I believe, is that it is easier to
scale up; for example, in the case where you have 3 factors that
together subset the data frame.

Regards,
Itay

----------------------------------------------------------------
itayf at u.washington.edu  /  +1 (206) 543 9040  /  U of Washington
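To illustrate the three-factor case, a purely hypothetical sketch of
how Gabor's approach might scale up; the third factor g is invented
here (it is not part of the example data), and df.quantile is the
helper from Gabor's post:

  ## add an invented third factor to the toy data, for illustration only
  df$g <- rep(c("g1", "g2"), 6)
  f3 <- function(d, p = c(0, 0.25))
      cbind(d[1, c("a", "x", "g")], df.quantile(d[, "r"], p))
  do.call("rbind", by(df, list(a = df$a, x = df$x, g = df$g), f3))
  ## one row per non-empty (a, x, g) combination; empty combinations
  ## come back as NULL and are dropped by rbind()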