There are some relatively obvious examples: unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split Also, many timeseries, graphics and spline functions are dependent on the order. In the case of data.frame(s), a boolean flag would probably need to be extended to allow for multiple column sorting, and ascending/descending options. On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker at gmail.com> wrote:> > Abby, > > Vectors do have an internal mechanism for knowing that they are sorted via ALTREP (it was one of 2 core motivating features for 'smart vectors' the other being knowledge about presence of NAs). > > Currently I don't think we expose it at the R level, though it is part of the official C API. I don't know of any plans for this to change, but I suppose it could. Plus for functions in R itself, we could even use it without exposing it more widely. A number of functions, including sort itself, already do this in fact, but more could. I'd be interested in hearing which functions you think would particularly benefit from this. > > ~G > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr> wrote: >> >> Hi Abby, >> >> Thank you for your positive feedback. >> >> I agree for your general comment about sorting. >> >> For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame). >> >> Thanks, >> >> Thomas >> ________________________________________ >> De : Abby Spurdle <spurdle.a at gmail.com> >> Envoy? : lundi 15 mars 2021 10:22 >> ? : SOEIRO Thomas >> Cc : r-devel at r-project.org >> Objet : Re: [Rd] Potential improvements of ave? >> >> EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS >> >> Hi Thomas, >> >> These are some great suggestions. >> But I can't help but feel there's a much bigger problem here. >> >> Intuitively, the ave function could (or should) sort the data. >> Then the indexing step becomes almost trivial, in terms of both time >> and space complexity. >> And the ave function is not the only example of where a problem >> becomes much simpler, if the data is sorted. >> >> Historically, I've never found base R functions user-friendly for >> aggregation purposes, or for sorting. >> (At least, not by comparison to SQL). >> >> But that's not the main problem. >> It would seem preferable to sort the data, only once. >> (Rather than sorting it repeatedly, or not at all). >> >> Perhaps, objects such as vectors and data.frame(s) could have a >> boolean attribute, to indicate if they're sorted. >> Or functions such as ave could have a sorted argument. >> In either case, if true, the function assumes the data is sorted and >> applies a more efficient algorithm. >> >> >> B. >> >> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr> wrote: >> > >> > Dear all, >> > >> > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. >> > >> > >> > >> > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: >> > >> > set.seed(1) >> > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> > id2 = sample(1:3, 5e2, TRUE), >> > id3 = sample(1:5, 5e2, TRUE), >> > val = sample(1:300, 5e2, TRUE)) >> > >> > df1$diff <- ave(df1$val, >> > df1$id1, >> > df1$id2, >> > df1$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > head(df1[order(df1$id1, >> > df1$id2, >> > df1$id3), ]) >> > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): >> > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> > val = sample(1:300, 5e2 * 1e4, TRUE)) >> > >> > df2$diff <- ave(df2$val, >> > df2$id1, >> > df2$id2, >> > df2$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? >> > >> > >> > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ). >> > Is it relevant/possible to expose the drop argument explicitly? >> > >> > >> > >> > Thanks, >> > >> > Thomas >> > ______________________________________________ >> > R-devel at r-project.org mailing list >> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel
Dear all, Thank you for your consideration on this topic. I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame). However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?) Thanks, Thomas> ________________________________________ > De : Abby Spurdle <spurdle.a at gmail.com> > Envoy? : lundi 15 mars 2021 10:22 > ? : SOEIRO Thomas > Cc : r-devel at r-project.org > Objet : Re: [Rd] Potential improvements of ave? > > Hi Thomas, > > These are some great suggestions. > But I can't help but feel there's a much bigger problem here. > > Intuitively, the ave function could (or should) sort the data. > Then the indexing step becomes almost trivial, in terms of both time > and space complexity. > And the ave function is not the only example of where a problem > becomes much simpler, if the data is sorted. > > Historically, I've never found base R functions user-friendly for > aggregation purposes, or for sorting. > (At least, not by comparison to SQL). > > But that's not the main problem. > It would seem preferable to sort the data, only once. > (Rather than sorting it repeatedly, or not at all). > > Perhaps, objects such as vectors and data.frame(s) could have a > boolean attribute, to indicate if they're sorted. > Or functions such as ave could have a sorted argument. > In either case, if true, the function assumes the data is sorted and > applies a more efficient algorithm. > > > B. > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr> wrote: >> >> Dear all, >> >> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. >> >> >> >> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: >> >> set.seed(1) >> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> id2 = sample(1:3, 5e2, TRUE), >> id3 = sample(1:5, 5e2, TRUE), >> val = sample(1:300, 5e2, TRUE)) >> >> df1$diff <- ave(df1$val, >> df1$id1, >> df1$id2, >> df1$id3, >> FUN = function(i) c(diff(i), 0)) >> >> head(df1[order(df1$id1, >> df1$id2, >> df1$id3), ]) >> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): >> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> id2 = sample(1:3, 5e2 * 1e4, TRUE), >> id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> val = sample(1:300, 5e2 * 1e4, TRUE)) >> >> df2$diff <- ave(df2$val, >> df2$id1, >> df2$id2, >> df2$id3, >> FUN = function(i) c(diff(i), 0)) >> >> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? >> >> >> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). >> Is it relevant/possible to expose the drop argument explicitly? >> >> >> >> Thanks, >> >> Thomas
Hi Abby, I actually have a patch submitted that does this for unique/duplicated (only numeric cases I think) but it is, as patches from external contributors go, quite sizable which means it requires a correspondingly large amount of an R-core member's time and energy to vet and consider. It is in the queue, and so, I expect (/hope, provided I didn't make a mistake) it will be incorporated at some point. ( https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993) You are correct that the speedups are quite significant for calling unique/duplicated on large vectors that know they are sorted: Speedup on my machine for a fairly sizable vector (length 1e7) ranges from about ~10x in the densely duplicated case up to ~60-70x in the sparsely duplicated case for duplicated(). For unique() it seems to range from ~10x in the densely duplicated case to ~15 in the spare case. I had thought that min and max already did this, but looking now, they don't seem to by default, thought ALTREP classes themselves do have an option of setting a min/max method, which would be hit. That does seem like low-hanging fruit, I agree, though in many cases the slow down from a single pass over the data to get a min probably isn't earthshattering. The others do seem like they could benefit as well. Best, ~G On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle <spurdle.a at gmail.com> wrote:> There are some relatively obvious examples: > unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split > > Also, many timeseries, graphics and spline functions are dependent on the > order. > > In the case of data.frame(s), a boolean flag would probably need to be > extended to allow for multiple column sorting, and > ascending/descending options. > > On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker at gmail.com> > wrote: > > > > Abby, > > > > Vectors do have an internal mechanism for knowing that they are sorted > via ALTREP (it was one of 2 core motivating features for 'smart vectors' > the other being knowledge about presence of NAs). > > > > Currently I don't think we expose it at the R level, though it is part > of the official C API. I don't know of any plans for this to change, but I > suppose it could. Plus for functions in R itself, we could even use it > without exposing it more widely. A number of functions, including sort > itself, already do this in fact, but more could. I'd be interested in > hearing which functions you think would particularly benefit from this. > > > > ~G > > > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr> > wrote: > >> > >> Hi Abby, > >> > >> Thank you for your positive feedback. > >> > >> I agree for your general comment about sorting. > >> > >> For ave specifically, ordering may not help because the output must > maintain the order of the input (as ave returns only x and not the entiere > data.frame). > >> > >> Thanks, > >> > >> Thomas > >> ________________________________________ > >> De : Abby Spurdle <spurdle.a at gmail.com> > >> Envoy? : lundi 15 mars 2021 10:22 > >> ? : SOEIRO Thomas > >> Cc : r-devel at r-project.org > >> Objet : Re: [Rd] Potential improvements of ave? > >> > >> EMAIL EXTERNE - TRAITER AVEC PR?CAUTION LIENS ET FICHIERS > >> > >> Hi Thomas, > >> > >> These are some great suggestions. > >> But I can't help but feel there's a much bigger problem here. > >> > >> Intuitively, the ave function could (or should) sort the data. > >> Then the indexing step becomes almost trivial, in terms of both time > >> and space complexity. > >> And the ave function is not the only example of where a problem > >> becomes much simpler, if the data is sorted. > >> > >> Historically, I've never found base R functions user-friendly for > >> aggregation purposes, or for sorting. > >> (At least, not by comparison to SQL). > >> > >> But that's not the main problem. > >> It would seem preferable to sort the data, only once. > >> (Rather than sorting it repeatedly, or not at all). > >> > >> Perhaps, objects such as vectors and data.frame(s) could have a > >> boolean attribute, to indicate if they're sorted. > >> Or functions such as ave could have a sorted argument. > >> In either case, if true, the function assumes the data is sorted and > >> applies a more efficient algorithm. > >> > >> > >> B. > >> > >> > >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO at ap-hm.fr> > wrote: > >> > > >> > Dear all, > >> > > >> > I have two questions/suggestions about ave, but I am not sure if it's > relevant for bug reports. > >> > > >> > > >> > > >> > 1) I have performance issues with ave in a case where I didn't expect > it. The following code runs as expected: > >> > > >> > set.seed(1) > >> > > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > >> > id2 = sample(1:3, 5e2, TRUE), > >> > id3 = sample(1:5, 5e2, TRUE), > >> > val = sample(1:300, 5e2, TRUE)) > >> > > >> > df1$diff <- ave(df1$val, > >> > df1$id1, > >> > df1$id2, > >> > df1$id3, > >> > FUN = function(i) c(diff(i), 0)) > >> > > >> > head(df1[order(df1$id1, > >> > df1$id2, > >> > df1$id3), ]) > >> > > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot > allocate vector of size 1110.0 Gb): > >> > > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), > >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > >> > val = sample(1:300, 5e2 * 1e4, TRUE)) > >> > > >> > df2$diff <- ave(df2$val, > >> > df2$id1, > >> > df2$id2, > >> > df2$id3, > >> > FUN = function(i) c(diff(i), 0)) > >> > > >> > This use case does not seem extreme to me (e.g. aggregate et al work > perfectly on this data.frame). > >> > So my question is: Is this expected/intended/reasonable? i.e. Does > ave need to be optimized? > >> > > >> > > >> > > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed > to avoid warnings in case of unused levels ( > https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ > ). > >> > Is it relevant/possible to expose the drop argument explicitly? > >> > > >> > > >> > > >> > Thanks, > >> > > >> > Thomas > >> > ______________________________________________ > >> > R-devel at r-project.org mailing list > >> > > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ > >> > >> ______________________________________________ > >> R-devel at r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]