Tim Taylor
2022-Feb-03 11:30 UTC
[Rd] model.weights and model.offset: request for adjustment
> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> >>>>> Ben Bolker
> >>>>>     on Tue, 1 Feb 2022 21:21:46 -0500 writes:
>
> > The model.weights() and model.offset() functions from the 'stats'
> > package index possibly-missing elements of a data frame via $, e.g.
>
> > x$"(offset)"
> > x$"(weights)"
>
> > This returns NULL without comment when x is a data frame:
>
> > x <- data.frame(a=1)
> > x$"(offset)" ## NULL
> > x$"(weights)" ## NULL
>
> > However, when x is a tibble we get a warning as well:
>
> > x <- tibble::as_tibble(x)
> > x$"(offset)"
> > ## NULL
> > ## Warning message:
> > ## Unknown or uninitialised column: `(offset)`.
>
> > I know it's not R-core's responsibility to manage forward
> > compatibility with tibbles, but in this case [[-indexing would seem to
> > be better practice in any case.
>
> Yes, I would agree: we should use [[ instead of $ here
> in order to force exact matching, just on principle.
>
> Importantly, because also mf[["(weights)"]]
> will return NULL without a warning for a model/data frame, and
> it seems it does so also for tibbles.
>
> > Might a patch be accepted ... ?
>
> That would not be necessary.
>
> There's one remaining problem however:
> `$` access is clearly faster than `[[` for small data frames
> (because `$` is a primitive function doing everything in C,
> whereas `[[` calls the R level data frame method).
>
> Faster in both cases, i.e., when there *is* a column and when there
> is none (and NULL is returned), e.g., for the first case
>
> > system.time(for(i in 1:20000) df[["a"]])
>    user  system elapsed
>   0.064   0.000   0.065
> > system.time(for(i in 1:20000) df$a)
>    user  system elapsed
>   0.009   0.000   0.009
>
> So that's probably been the reason why `$` has been preferred?

Would .subset2(df, "a") be preferable?

R> df <- mtcars
R> system.time(for(i in 1:20000) df[["hp"]])
   user  system elapsed
  0.078   0.000   0.078
R> system.time(for(i in 1:20000) df$hp)
   user  system elapsed
  0.011   0.000   0.010
R> system.time(for(i in 1:20000) .subset2(df,"hp"))
   user  system elapsed
  0.004   0.000   0.004

Tim

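The three access forms compared above differ in matching behaviour as well as speed. The following is a minimal sketch in base R; the data frame `df` and its column name `alpha` are made up for illustration and do not come from the thread.

```r
# Illustrative data frame (hypothetical; not from the thread).
df <- data.frame(alpha = 1:3)

# `$` on a data frame partially matches column names,
# so the prefix "al" finds the column "alpha".
partial_dollar <- df$al

# `[[` and .subset2() match exactly by default,
# so the same prefix finds nothing and NULL is returned.
exact_bracket <- df[["al"]]
exact_subset2 <- .subset2(df, "al")

# A column that is absent altogether yields NULL, silently,
# for a plain data frame with either `$` or .subset2().
missing_dollar  <- df$"(weights)"
missing_subset2 <- .subset2(df, "(weights)")
```

In this sketch `.subset2()` gives the exact matching of `[[` while skipping the R-level data frame method, which is consistent with the timings above.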
Martin Maechler
2022-Feb-03 14:21 UTC
[Rd] model.weights and model.offset: request for adjustment
>>>>> tim taylor
>>>>>     on Thu, 3 Feb 2022 11:30:17 +0000 (GMT) writes:

>> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>
>> >>>>> Ben Bolker
>> >>>>>     on Tue, 1 Feb 2022 21:21:46 -0500 writes:
>>
>> > The model.weights() and model.offset() functions from the 'stats'
>> > package index possibly-missing elements of a data frame via $, e.g.
>>
>> > x$"(offset)"
>> > x$"(weights)"
>>
>> > This returns NULL without comment when x is a data frame:
>>
>> > x <- data.frame(a=1)
>> > x$"(offset)" ## NULL
>> > x$"(weights)" ## NULL
>>
>> > However, when x is a tibble we get a warning as well:
>>
>> > x <- tibble::as_tibble(x)
>> > x$"(offset)"
>> > ## NULL
>> > ## Warning message:
>> > ## Unknown or uninitialised column: `(offset)`.
>>
>> > I know it's not R-core's responsibility to manage forward
>> > compatibility with tibbles, but in this case [[-indexing would seem to
>> > be better practice in any case.
>>
>> Yes, I would agree: we should use [[ instead of $ here
>> in order to force exact matching, just on principle.
>>
>> Importantly, because also mf[["(weights)"]]
>> will return NULL without a warning for a model/data frame, and
>> it seems it does so also for tibbles.
>>
>> > Might a patch be accepted ... ?
>>
>> That would not be necessary.
>>
>> There's one remaining problem however:
>> `$` access is clearly faster than `[[` for small data frames
>> (because `$` is a primitive function doing everything in C,
>> whereas `[[` calls the R level data frame method).
>>
>> Faster in both cases, i.e., when there *is* a column and when there
>> is none (and NULL is returned), e.g., for the first case
>>
>> > system.time(for(i in 1:20000) df[["a"]])
>>    user  system elapsed
>>   0.064   0.000   0.065
>> > system.time(for(i in 1:20000) df$a)
>>    user  system elapsed
>>   0.009   0.000   0.009
>>
>> So that's probably been the reason why `$` has been preferred?

> Would .subset2(df, "a") be preferable?

> R> df <- mtcars
> R> system.time(for(i in 1:20000) df[["hp"]])
>    user  system elapsed
>   0.078   0.000   0.078
> R> system.time(for(i in 1:20000) df$hp)
>    user  system elapsed
>   0.011   0.000   0.010
> R> system.time(for(i in 1:20000) .subset2(df,"hp"))
>    user  system elapsed
>   0.004   0.000   0.004

> Tim

Yes, I think that's a very good idea -- notably, as interestingly
it seems to work with tibbles very well, too.

Martin
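
For concreteness, here is a rough sketch of how the suggestion could look when applied to a model frame. The helper name `get_weights` is hypothetical and is not the code in 'stats'; it only assumes that `model.frame()` stores weights in a column named `"(weights)"`, as described at the start of the thread.

```r
# Hypothetical accessor (an assumption for illustration, not the
# 'stats' implementation): exact-match lookup of the weights column
# via .subset2(), avoiding both `$` and the `[[` data frame method.
get_weights <- function(mf) .subset2(mf, "(weights)")

# Model frame with weights: the values end up in the "(weights)" column.
mf_w <- model.frame(mpg ~ hp, data = mtcars, weights = wt)
w <- get_weights(mf_w)

# Model frame without weights: the column is absent, so NULL comes back
# without any warning.
mf_0 <- model.frame(mpg ~ hp, data = mtcars)
w0 <- get_weights(mf_0)
```

Whether the same one-line pattern is what ends up in model.weights() and model.offset() is for R-core to decide; the sketch only shows that the lookup behaves as discussed for plain model frames.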