thr3ads.net - R devel - [Rd] Potential improvements of ave? [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Gabriel Becker

2021-Mar-15 22:08 UTC

[Rd] Potential improvements of ave?

Abby,

Vectors do have an internal mechanism for knowing that they are sorted via
ALTREP (it was one of 2 core motivating features for 'smart vectors' the
other being knowledge about presence of NAs).

Currently I don't think we expose it at the R level, though it is part of
the official C API. I don't know of any plans for this to change, but I
suppose it could. Plus for functions in R itself, we could even use it
without exposing it more widely. A number of functions, including sort
itself, already do this in fact, but more could. I'd be interested in
hearing which functions you think would particularly benefit from this.

~G

On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr>
wrote:
> Hi Abby,
>
> Thank you for your positive feedback.
>
> I agree for your general comment about sorting.
>
> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
>
> Thanks,
>
> Thomas
> ________________________________________
> De : Abby Spurdle <spurdle.a at gmail.com>
> Envoy? : lundi 15 mars 2021 10:22
> ? : SOEIRO Thomas
> Cc : r-devel at r-project.org
> Objet : Re: [Rd] Potential improvements of ave?
>
> EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at
ap-hm.fr>
> wrote:
> >
> > Dear all,
> >
> > I have two questions/suggestions about ave, but I am not sure if
it's
> relevant for bug reports.
> >
> >
> >
> > 1) I have performance issues with ave in a case where I didn't
expect
> it. The following code runs as expected:
> >
> > set.seed(1)
> >
> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >                   id2 = sample(1:3, 5e2, TRUE),
> >                   id3 = sample(1:5, 5e2, TRUE),
> >                   val = sample(1:300, 5e2, TRUE))
> >
> > df1$diff <- ave(df1$val,
> >                 df1$id1,
> >                 df1$id2,
> >                 df1$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > head(df1[order(df1$id1,
> >                df1$id2,
> >                df1$id3), ])
> >
> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> allocate vector of size 1110.0 Gb):
> >
> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >
> > df2$diff <- ave(df2$val,
> >                 df2$id1,
> >                 df2$id2,
> >                 df2$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > This use case does not seem extreme to me (e.g. aggregate et al work
> perfectly on this data.frame).
> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
> need to be optimized?
> >
> >
> >
> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed
to
> avoid warnings in case of unused levels (
>
https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
> ).
> > Is it relevant/possible to expose the drop argument explicitly?
> >
> >
> >
> > Thanks,
> >
> > Thomas
> > ______________________________________________
> > R-devel at r-project.org mailing list
> >
>
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
	[[alternative HTML version deleted]]

Martin Maechler

2021-Mar-16 09:50 UTC

head link

[Rd] Potential improvements of ave?

>>>>> Gabriel Becker 
>>>>>     on Mon, 15 Mar 2021 15:08:44 -0700 writes:
    > Abby,
    > Vectors do have an internal mechanism for knowing that they are sorted
via
    > ALTREP (it was one of 2 core motivating features for 'smart
vectors' the
    > other being knowledge about presence of NAs).

    > Currently I don't think we expose it at the R level, though it is
part of
    > the official C API. I don't know of any plans for this to change,
but I
    > suppose it could. Plus for functions in R itself, we could even use it
    > without exposing it more widely. A number of functions, including sort
    > itself, already do this in fact, but more could. I'd be interested
in
    > hearing which functions you think would particularly benefit from this.

Thank you Gabe.

    > ~G

I vaguely remember (from Luke's docs/presentation on ALTREP)
that there are some "missing parts" here.
One of them the not-existing R level functionality, another may be
the C code below R's  is.unsorted()  ... maybe  is.unsorted()
could get a new argument and or be re-written, moving  the NA
handling also to C and have that happen *after* the C code looks
if it's an ALTREP object and if that "knows it's sorted".

Martin


    > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO at
ap-hm.fr>
    > wrote:

    >> Hi Abby,
    >> 
    >> Thank you for your positive feedback.
    >> 
    >> I agree for your general comment about sorting.
    >> 
    >> For ave specifically, ordering may not help because the output must
    >> maintain the order of the input (as ave returns only x and not the
entiere
    >> data.frame).
    >> 
    >> Thanks,
    >> 
    >> Thomas
    >> ________________________________________
    >> De : Abby Spurdle <spurdle.a at gmail.com>
    >> Envoy? : lundi 15 mars 2021 10:22
    >> ? : SOEIRO Thomas
    >> Cc : r-devel at r-project.org
    >> Objet : Re: [Rd] Potential improvements of ave?
    >> 
    >> EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS
    >> 
    >> Hi Thomas,
    >> 
    >> These are some great suggestions.
    >> But I can't help but feel there's a much bigger problem
here.
    >> 
    >> Intuitively, the ave function could (or should) sort the data.
    >> Then the indexing step becomes almost trivial, in terms of both
time
    >> and space complexity.
    >> And the ave function is not the only example of where a problem
    >> becomes much simpler, if the data is sorted.
    >> 
    >> Historically, I've never found base R functions user-friendly
for
    >> aggregation purposes, or for sorting.
    >> (At least, not by comparison to SQL).
    >> 
    >> But that's not the main problem.
    >> It would seem preferable to sort the data, only once.
    >> (Rather than sorting it repeatedly, or not at all).
    >> 
    >> Perhaps, objects such as vectors and data.frame(s) could have a
    >> boolean attribute, to indicate if they're sorted.
    >> Or functions such as ave could have a sorted argument.
    >> In either case, if true, the function assumes the data is sorted
and
    >> applies a more efficient algorithm.
    >> 
    >> 
    >> B.
    >> 
    >> 
    >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at
ap-hm.fr>
    >> wrote:
    >> >
    >> > Dear all,
    >> >
    >> > I have two questions/suggestions about ave, but I am not sure
if it's
    >> relevant for bug reports.
    >> >
    >> >
    >> >
    >> > 1) I have performance issues with ave in a case where I
didn't expect
    >> it. The following code runs as expected:
    >> >
    >> > set.seed(1)
    >> >
    >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
    >> >                   id2 = sample(1:3, 5e2, TRUE),
    >> >                   id3 = sample(1:5, 5e2, TRUE),
    >> >                   val = sample(1:300, 5e2, TRUE))
    >> >
    >> > df1$diff <- ave(df1$val,
    >> >                 df1$id1,
    >> >                 df1$id2,
    >> >                 df1$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > head(df1[order(df1$id1,
    >> >                df1$id2,
    >> >                df1$id3), ])
    >> >
    >> > But when expanding the data.frame (* 1e4), ave fails (Error:
cannot
    >> allocate vector of size 1110.0 Gb):
    >> >
    >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4,
TRUE),
    >> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
    >> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
    >> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
    >> >
    >> > df2$diff <- ave(df2$val,
    >> >                 df2$id1,
    >> >                 df2$id2,
    >> >                 df2$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > This use case does not seem extreme to me (e.g. aggregate et
al work
    >> perfectly on this data.frame).
    >> > So my question is: Is this expected/intended/reasonable? i.e.
Does ave
    >> need to be optimized?
    >> >
    >> >
    >> >
    >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is
needed to
    >> avoid warnings in case of unused levels (
    >>
https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
    >> ).
    >> > Is it relevant/possible to expose the drop argument
explicitly?
    >> >
    >> >
    >> >
    >> > Thanks,
    >> >
    >> > Thomas
    >> > ______________________________________________
    >> > R-devel at r-project.org mailing list
    >> >
    >>
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
    >> 
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> 

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Abby Spurdle

2021-Mar-16 21:53 UTC

head link

[Rd] Potential improvements of ave?

There are some relatively obvious examples:
unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split

Also, many timeseries, graphics and spline functions are dependent on the order.

In the case of data.frame(s), a boolean flag would probably need to be
extended to allow for multiple column sorting, and
ascending/descending options.

On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker at gmail.com>
wrote:>
> Abby,
>
> Vectors do have an internal mechanism for knowing that they are sorted via
ALTREP (it was one of 2 core motivating features for 'smart vectors' the
other being knowledge about presence of NAs).
>
> Currently I don't think we expose it at the R level, though it is part
of the official C API. I don't know of any plans for this to change, but I
suppose it could. Plus for functions in R itself, we could even use it without
exposing it more widely. A number of functions, including sort itself, already
do this in fact, but more could. I'd be interested in hearing which
functions you think would particularly benefit from this.
>
> ~G
>
> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO at
ap-hm.fr> wrote:
>>
>> Hi Abby,
>>
>> Thank you for your positive feedback.
>>
>> I agree for your general comment about sorting.
>>
>> For ave specifically, ordering may not help because the output must
maintain the order of the input (as ave returns only x and not the entiere
data.frame).
>>
>> Thanks,
>>
>> Thomas
>> ________________________________________
>> De : Abby Spurdle <spurdle.a at gmail.com>
>> Envoy? : lundi 15 mars 2021 10:22
>> ? : SOEIRO Thomas
>> Cc : r-devel at r-project.org
>> Objet : Re: [Rd] Potential improvements of ave?
>>
>> EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS
>>
>> Hi Thomas,
>>
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>>
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>>
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>>
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>>
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>>
>>
>> B.
>>
>>
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at
ap-hm.fr> wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if
it's relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't
expect it. The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> >                   id2 = sample(1:3, 5e2, TRUE),
>> >                   id3 = sample(1:5, 5e2, TRUE),
>> >                   val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> >                 df1$id1,
>> >                 df1$id2,
>> >                 df1$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> >                df1$id2,
>> >                df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error:
cannot allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> >                 df2$id1,
>> >                 df2$id2,
>> >                 df2$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al
work perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does
ave need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is
needed to avoid warnings in case of unused levels
(https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
).
>> > Is it relevant/possible to expose the drop argument explicitly?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Thomas
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> >
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

R devel - Mar 2021 - Potential improvements of ave?

[Rd] Potential improvements of ave?

[Rd] Potential improvements of ave?

[Rd] Potential improvements of ave?