thr3ads.net - R devel - [Rd] Potential improvements of ave? [Mar 2021]

SOEIRO Thomas

2021-Mar-12 22:59 UTC

[Rd] Potential improvements of ave?

Dear all,

I have two questions/suggestions about ave, but I am not sure whether they
are suitable for bug reports.



1) I ran into performance issues with ave in a case where I did not expect any.
The following code runs as expected:

set.seed(1)

df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

df1$diff <- ave(df1$val,
                df1$id1,
                df1$id2,
                df1$id3,
                FUN = function(i) c(diff(i), 0))

head(df1[order(df1$id1,
               df1$id2,
               df1$id3), ])

But when the data.frame is scaled up by a factor of 1e4, ave fails (Error:
cannot allocate vector of size 1110.0 Gb):

df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
                  id2 = sample(1:3, 5e2 * 1e4, TRUE),
                  id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
                  val = sample(1:300, 5e2 * 1e4, TRUE))

df2$diff <- ave(df2$val,
                df2$id1,
                df2$id2,
                df2$id3,
                FUN = function(i) c(diff(i), 0))

This use case does not seem extreme to me (e.g. aggregate and similar functions
work fine on this data.frame).
So my question is: is this expected/intended/reasonable? In other words, does
ave need to be optimized?



2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid
warnings in case of unused levels
(https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
Is it relevant/possible to expose the drop argument explicitly?
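
As an illustration of the kind of warning at issue, here is a minimal sketch. It assumes FUN = max as an example of a function that warns on an empty group (the 2011 message may have used a different case), and ave_drop is a hypothetical name, not a base R function:

```r
x <- c(1, 5, 2, 8)
# Factor with one level that never occurs in the data
g <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "unused"))

# With the unused level kept, split() yields an empty group and max() warns on it.
warned <- tryCatch({ lapply(split(x, g), max); FALSE },
                   warning = function(w) TRUE)

# Hypothetical variant of ave() (not part of base R) that drops unused levels.
ave_drop <- function(x, ..., FUN = mean) {
    g <- interaction(..., drop = TRUE)
    split(x, g) <- lapply(split(x, g), FUN)
    x
}

res <- ave_drop(x, g, FUN = max)  # no empty group is passed to FUN
```

Dropping unused levels before splitting means FUN is only ever called on groups that actually occur in the data.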



Thanks,

Thomas

SOEIRO Thomas

2021-Mar-13 23:05 UTC

[Rd] Potential improvements of ave?

The bottleneck of ave is the call to interaction (i.e. not the call to
split/lapply).
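
To make the cost of interaction concrete: by default it creates a level for every combination of the input levels, whether or not that combination occurs. The sketch below uses small made-up vectors (the names id1 to id3 only mirror the example above) and then does the same count for the large case as plain arithmetic:

```r
# Toy grouping vectors; the values are made up for illustration.
id1 <- c(1, 1, 2, 3)
id2 <- c("a", "a", "b", "b")
id3 <- c(10, 20, 10, 30)

g_full <- interaction(id1, id2, id3)               # every combination of levels
g_used <- interaction(id1, id2, id3, drop = TRUE)  # observed combinations only

n_full <- nlevels(g_full)  # 3 * 2 * 3 = 18 levels
n_used <- nlevels(g_used)  # 4 levels

# Upper bound for the large example, without building the factor:
# 1e6 values of id1, 3 of id2, 5e4 of id3.
n_potential <- 1e6 * 3 * 5e4  # 1.5e11 potential levels
```

At 8 bytes per level, about 1.5e11 levels is on the order of a terabyte, which is the same order of magnitude as the reported allocation error, even though at most 5e6 groups can occur in the data.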

Therefore, the following code runs as expected (though I may be missing something):

ave2 <- function (x, ..., FUN = mean)
{
    if(missing(...))
        x[] <- FUN(x)
    else {
        # g <- interaction(...)
        # A separator keeps distinct combinations from pasting to the same key.
        g <- paste(..., sep = ".")
        split(x, g) <- lapply(split(x, g), FUN)
    }
    x
}

df2$diff <- ave2(df2$val,
                 df2$id1,
                 df2$id2,
                 df2$id3,
                 FUN = function(i) c(diff(i), 0))



Of course I can also simply solve my current issue with:

df2$id123 <- paste(df2$id1,
                   df2$id2,
                   df2$id3,
                   sep = ".")

df2$diff <- ave(df2$val,
                df2$id123,
                FUN = function(i) c(diff(i), 0))
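
One caveat when grouping on a pasted key: without a separator, distinct combinations of ids can produce the same string and so be merged into one group. A small sketch with made-up values:

```r
# Two distinct (id1, id2, id3) combinations, made up for illustration.
id1 <- c(1, 11)
id2 <- c(1, 2)
id3 <- c(23, 3)

key_nosep <- paste0(id1, id2, id3)            # "1123" "1123": groups merged
key_sep   <- paste(id1, id2, id3, sep = ".")  # "1.1.23" "11.2.3": groups kept apart

collides <- anyDuplicated(key_nosep) > 0  # TRUE
distinct <- anyDuplicated(key_sep) == 0   # TRUE
```

The separator "." is safe here because the ids are integers; for character ids that may themselves contain ".", a separator that cannot occur in the data would be needed.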



In addition, ave2 also avoids warnings in case of unused levels (see point 2 in
my previous message).

Abby Spurdle

2021-Mar-15 09:22 UTC

[Rd] Potential improvements of ave?

Hi Thomas,

These are some great suggestions.
But I can't help but feel there's a much bigger problem here.

Intuitively, the ave function could (or should) sort the data.
Then the indexing step becomes almost trivial, in terms of both time
and space complexity.
And the ave function is not the only example of where a problem
becomes much simpler, if the data is sorted.

Historically, I've never found base R functions user-friendly for
aggregation purposes, or for sorting.
(At least, not by comparison to SQL).

But that's not the main problem.
It would seem preferable to sort the data, only once.
(Rather than sorting it repeatedly, or not at all).

Perhaps, objects such as vectors and data.frame(s) could have a
boolean attribute, to indicate if they're sorted.
Or functions such as ave could have a sorted argument.
In either case, if true, the function assumes the data is sorted and
applies a more efficient algorithm.
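
A rough sketch of the sort-based idea, for illustration only: ave_sorted is a hypothetical helper, not a base R function or a proposed patch, and it assumes a non-empty x and grouping vectors without NA. It sorts once on all keys, starts a new group wherever any key changes between consecutive rows, and then restores the original order:

```r
# Hypothetical helper (not base R): group by sorting, then cut at key changes.
ave_sorted <- function(x, ..., FUN = mean) {
    keys <- unname(list(...))
    o <- do.call(order, keys)  # one sort on all keys; ties keep their original order
    sorted_keys <- lapply(keys, function(k) k[o])
    n <- length(x)
    # TRUE where any key differs from the previous row, i.e. a new group starts
    changed <- Reduce(`|`, lapply(sorted_keys, function(k) k[-1L] != k[-n]))
    run_id <- cumsum(c(TRUE, changed))
    xs <- x[o]
    split(xs, run_id) <- lapply(split(xs, run_id), FUN)
    x[o] <- xs                 # undo the sort
    x
}

res <- ave_sorted(c(10, 20, 30, 40, 50),
                  c(2, 1, 2, 1, 2),
                  c("a", "a", "a", "a", "b"),
                  FUN = function(i) c(diff(i), 0))
# res is 20 20 0 0 0
```

The group index here is a run counter over the sorted rows, so no levels are created for combinations that do not occur. If the data were known to be sorted already, the call to order could be skipped.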

B.
