Since the original report raised several proposals, I submitted a bug report on
R Bugzilla trying to summarize the discussion:
https://bugs.r-project.org/show_bug.cgi?id=18223
(Maybe I should have ask before if it is really appropriate to do so. Please let
me no if not.)
> Hi Abby,
>
> I actually have a patch submitted that does this for unique/duplicated
> (only numeric cases I think) but it is, as patches from external
> contributors go, quite sizable which means it requires a correspondingly
> large amount of an R-core member's time and energy to vet and consider.
It
> is in the queue, and so, I expect (/hope, provided I didn't make a
mistake)
> it will be incorporated at some point. (
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993)
>
> You are correct that the speedups are quite significant for calling
> unique/duplicated on large vectors that know they are sorted: Speedup on my
> machine for a fairly sizable vector (length 1e7) ranges from about ~10x in
> the densely duplicated case up to ~60-70x in the sparsely duplicated case
> for duplicated(). For unique() it seems to range from ~10x in the densely
> duplicated case to ~15 in the spare case.
>
> I had thought that min and max already did this, but looking now, they
> don't seem to by default, thought ALTREP classes themselves do have an
> option of setting a min/max method, which would be hit. That does seem like
> low-hanging fruit, I agree, though in many cases the slow down from a
> single pass over the data to get a min probably isn't earthshattering.
>
> The others do seem like they could benefit as well.
>
> Best,
> ~G
>
> On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle <spurdle.a using
gmail.com> wrote:
>
> > There are some relatively obvious examples:
> > unique, which.min/which.max/etc, range/min/max, quantile,
aggregate/split
> >
> > Also, many timeseries, graphics and spline functions are dependent on
the
> > order.
> >
> > In the case of data.frame(s), a boolean flag would probably need to be
> > extended to allow for multiple column sorting, and
> > ascending/descending options.
> >
> > On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker using
gmail.com>
> > wrote:
> > >
> > > Abby,
> > >
> > > Vectors do have an internal mechanism for knowing that they are
sorted
> > via ALTREP (it was one of 2 core motivating features for 'smart
vectors'
> > the other being knowledge about presence of NAs).
> > >
> > > Currently I don't think we expose it at the R level, though
it is part
> > of the official C API. I don't know of any plans for this to
change, but I
> > suppose it could. Plus for functions in R itself, we could even use it
> > without exposing it more widely. A number of functions, including sort
> > itself, already do this in fact, but more could. I'd be interested
in
> > hearing which functions you think would particularly benefit from
this.
> > >
> > > ~G
> > >
> > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO
using ap-hm.fr>
> > wrote:
> > > >
> > > > Hi Abby,
> > > >
> > > > Thank you for your positive feedback.
> > > >
> > > > I agree for your general comment about sorting.
> > > >
> > > > For ave specifically, ordering may not help because the
output must
> > maintain the order of the input (as ave returns only x and not the
entiere
> > data.frame).
> > > >
> > > > Thanks,
> > > >
> > > > Thomas
> > > > ________________________________________
> > > > De : Abby Spurdle <spurdle.a using gmail.com>
> > > > Envoy? : lundi 15 mars 2021 10:22
> > > > ? : SOEIRO Thomas
> > > > Cc : r-devel using r-project.org
> > > > Objet : Re: [Rd] Potential improvements of ave?
> > > >
> > > > EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS
> > > >
> > > > Hi Thomas,
> > > >
> > > > These are some great suggestions.
> > > > But I can't help but feel there's a much bigger
problem here.
> > > >
> > > > Intuitively, the ave function could (or should) sort the
data.
> > > > Then the indexing step becomes almost trivial, in terms of
both time
> > > > and space complexity.
> > > > And the ave function is not the only example of where a
problem
> > > > becomes much simpler, if the data is sorted.
> > > >
> > > > Historically, I've never found base R functions
user-friendly for
> > > > aggregation purposes, or for sorting.
> > > > (At least, not by comparison to SQL).
> > > >
> > > > But that's not the main problem.
> > > > It would seem preferable to sort the data, only once.
> > > > (Rather than sorting it repeatedly, or not at all).
> > > >
> > > > Perhaps, objects such as vectors and data.frame(s) could
have a
> > > > boolean attribute, to indicate if they're sorted.
> > > > Or functions such as ave could have a sorted argument.
> > > > In either case, if true, the function assumes the data is
sorted and
> > > > applies a more efficient algorithm.
> > > >
> > > >
> > > > B.
> > > >
> > > >
> > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas
<Thomas.SOEIRO using ap-hm.fr>
> > wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > > I have two questions/suggestions about ave, but I am
not sure if it's
> > relevant for bug reports.
> > > > >
> > > > >
> > > > >
> > > > > 1) I have performance issues with ave in a case where I
didn't expect
> > it. The following code runs as expected:
> > > > >
> > > > > set.seed(1)
> > > > >
> > > > > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> > > > > id2 = sample(1:3, 5e2, TRUE),
> > > > > id3 = sample(1:5, 5e2, TRUE),
> > > > > val = sample(1:300, 5e2, TRUE))
> > > > >
> > > > > df1$diff <- ave(df1$val,
> > > > > df1$id1,
> > > > > df1$id2,
> > > > > df1$id3,
> > > > > FUN = function(i) c(diff(i), 0))
> > > > >
> > > > > head(df1[order(df1$id1,
> > > > > df1$id2,
> > > > > df1$id3), ])
> > > > >
> > > > > But when expanding the data.frame (* 1e4), ave fails
(Error: cannot
> > allocate vector of size 1110.0 Gb):
> > > > >
> > > > > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 *
1e4, TRUE),
> > > > > id2 = sample(1:3, 5e2 * 1e4, TRUE),
> > > > > id3 = sample(1:(5 * 1e4), 5e2 * 1e4,
TRUE),
> > > > > val = sample(1:300, 5e2 * 1e4, TRUE))
> > > > >
> > > > > df2$diff <- ave(df2$val,
> > > > > df2$id1,
> > > > > df2$id2,
> > > > > df2$id3,
> > > > > FUN = function(i) c(diff(i), 0))
> > > > >
> > > > > This use case does not seem extreme to me (e.g.
aggregate et al work
> > perfectly on this data.frame).
> > > > > So my question is: Is this
expected/intended/reasonable? i.e. Does
> > ave need to be optimized?
> > > > >
> > > > >
> > > > >
> > > > > 2) Gabor Grothendieck pointed out in 2011 that drop =
TRUE is needed
> > to avoid warnings in case of unused levels
(https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> > > > > Is it relevant/possible to expose the drop argument
explicitly?
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Thomas