Vadim Ogranovich
2005-May-08 20:09 UTC
[Rd] Light-weight data.frame class: was: how to add method to .Primitive function
Hi,

Encouraged by a tip from Simon Urbanek, I tried to use the S3 machinery to write a faster version of the data.frame class. This quickly hits a snag: "[.default"(x, i) for some reason cares about the dimensionality of x.

A full transcript of my R session is at the end. It includes the motivation for writing the class and the problems I have encountered.

As a result I see three issues here:

* Why doesn't "[.default"(x, i) work when dim(x) has length 2? After all, a single subscript into a vector works regardless of whether it's a matrix or not. Is there an alternative way to access "[.default"?
* Why does unclass() make a deep copy? This is a facet of the global over-conservatism of R with respect to copying.
* Is it possible to add some sort of copy profiling to R? Something like copyProfiling(TRUE), which should cause R to log the size of each copied object (just raw sizes, without any attempt to identify the object). This feature should at least help assess the magnitude of the problem.

Thanks,
Vadim

Now the transcript itself:

> # the motivation: subscripting a data.frame is *much* (almost 20 times) slower than subscripting a list
> # compare
> n = 1e6
> i = seq(n)
>
> x = data.frame(a=seq(n), b=seq(n))
> system.time(x[i,], gcFirst=TRUE)
[1] 1.01 0.14 1.14 0.00 0.00
>
> x = list(a=seq(n), b=seq(n))
> system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
[1] 0.06 0.00 0.06 0.00 0.00
>
>
> # the solution: define methods for the light-weight data.frame class
> lwdf = function(...) structure(list(...), class = "lwdf")
>
> # dim
> dim.lwdf = function(x) c(length(x[[1]]), length(x))
>
> # for pretty printing we define print.lwdf via a conversion to data.frame
> # as.data.frame.lwdf
> as.data.frame.lwdf = function(x) structure(unclass(x), class="data.frame", row.names=as.character(seq(nrow(x))))
>
> # print
> print.lwdf = function(x) print.data.frame(as.data.frame.lwdf(x))
>
> # now the real stuff
>
> # "["
> # the naive "[.lwdf" = function (x, i, j) lapply(x[j], function(col) col[i])
> # won't work because evaluation of x[j] calls "[.lwdf" again and not "[.default"
> # so we switch by the number of arguments
> "[.lwdf" = function (x, i, j) {
+   if (nargs() == 2)
+     NextMethod("[", x, i)
+   else
+     structure(lapply(x[j], function(col) col[i]), class = "lwdf")
+ }
>
> x = lwdf(a=seq(3), b=letters[seq(3)], c=as.factor(letters[seq(3)]))
> i = c(1,3); j = c(1,3)
>
> # unfortunately, for some reason "[.default" cares about the dimensionality of its argument
> x[i,j]
Error in "[.default"(x, j) : incorrect number of dimensions
>
>
> # we could use unclass to get it right
> "[.lwdf" = function (x, i, j) {
+   structure(lapply(unclass(x)[j], function(col) col[i]), class = "lwdf")
+ }
>
> x[i,j]
  a c
1 1 a
2 3 c
>
> # *but* unclass creates a deep copy of its argument, as indirectly evidenced by the following timing
> x = lwdf(a=seq(1e6)); system.time(unclass(x))
[1] 0.01 0.00 0.01 0.00 0.00
> x = lwdf(a=seq(1e8)); system.time(unclass(x))
[1] 0.44 0.39 0.82 0.00 0.00
> version
         _
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
system   x86_64, linux-gnu
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
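On the copy-profiling question, base R's tracemem() is one existing tool that may cover part of what is asked for. The following is only a sketch, assuming an R build with memory profiling enabled (checked through capabilities("profmem")); it makes no claim about what a given R version will report for unclass().

```r
# Constructor from the post above.
lwdf <- function(...) structure(list(...), class = "lwdf")

x <- lwdf(a = seq_len(1e6))

if (capabilities("profmem")) {
  # Mark the list so that R reports whenever it is duplicated.
  invisible(tracemem(x))
  # Any "tracemem[...]" line printed during this call signals a duplication
  # of the marked object.
  y <- unclass(x)
  untracemem(x)
} else {
  # Without memory profiling we can still run the call, just not trace it.
  y <- unclass(x)
}
```

Note that tracemem() reports duplications of the marked object only; it does not log sizes, so it is not a full substitute for the proposed copyProfiling(TRUE).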
Simon Urbanek
2005-May-09 05:03 UTC
[Rd] Light-weight data.frame class: was: how to add method to .Primitive function
Vadim,

On May 8, 2005, at 2:09 PM, Vadim Ogranovich wrote:

>> # the naive "[.lwdf" = function (x, i, j) lapply(x[j], function(col) col[i])

Umm... what about this:

"[.lwdf" = function(x, i, j) { r <- lapply(lapply(j, function(a) x[[a]]), function(x) x[i]); names(r) <- names(x)[j]; r }

The subsetting operates on vectors, so it's not a problem. Don't ask me about the speed, though ;). And btw: you could access "[

What I meant with my cautious remarks are the following issues. You were talking about building a df alternative (s/df/data.frame/g in this e-mail).

The first issue is that by redefining "[" and friends you make your new class incompatible with the behavior of lists, so you won't be able to use it where lists are required (even though is.list says TRUE). This may break code where you'd like your class to act as a list.

On the other hand, your class is not a df either, and I suspect that it's far from trivial to make it even closely compatible with a df in terms of its behavior. Moreover, any function that checks for a df won't treat your class as such, because it simply is not a df (is.data.frame() is FALSE, for starters). So in the end, you would have to modify every function in R that uses df to recognize your new class.

Alternatively, if you make your class a subclass of df (there we get into some trouble with S3), you could replace the back-end, but then you will have to support every df feature, including row.names. You could try it, but I'm somewhat skeptical... but your mileage may vary ...

Cheers,
Simon
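The suggestion above can be tried out as follows. This is a minimal sketch: the lwdf constructor comes from the original post, and restoring the class on the result is an addition made here for illustration (the version in the message returns a plain list).

```r
# Constructor from the original post.
lwdf <- function(...) structure(list(...), class = "lwdf")

# Column extraction goes through "[[", which has no lwdf method, so the
# recursive call into "[.lwdf" is avoided.
"[.lwdf" <- function(x, i, j) {
  r <- lapply(lapply(j, function(a) x[[a]]), function(col) col[i])
  names(r) <- names(x)[j]
  # Assumption for this sketch: hand back an lwdf rather than a bare list.
  structure(r, class = "lwdf")
}

x <- lwdf(a = seq(3), b = letters[seq(3)], c = as.factor(letters[seq(3)]))
y <- x[c(1, 3), c(1, 3)]
```

Here y holds rows 1 and 3 of columns a and c. The sketch requires both i and j; the forms x[i, ] and x[, j] are not handled.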
Gabor Grothendieck
2005-May-10 06:46 UTC
[Rd] Light-weight data.frame class: was: how to add method to .Primitive function
"[.default" is implemented in R as .subset. See ?.subset and note that it begins with a dot. e.g. for the case where i and j are not missing: "[.lwdf" <- function(x, i, j) lapply(.subset(x,j), "[", i) On 5/8/05, Vadim Ogranovich <vograno@evafunds.com> wrote:> Hi, > > Encouraged by a tip from Simon Urbanek I tried to use the S3 machinery > to write a faster version of the data.frame class. > This quickly hits a snag: the "[.default"(x, i) for some reason cares > about the dimensionality of x. > In the end there is a full transcript of my R session. It includes the > motivation for writing the class and the problems I have encountered. > > As a result I see three issues here: > * why "[.default"(x, i) doesn't work if dim(x) is 2? After all a single > subscript into a vector works regardless of whether it's a matrix or > not. Is there an alternative way to access "[.default"? > * why does unclass() make deep copy? This is a facet of the global > over-conservatism of R with respect to copying. > * is it possible to add some sort copy profiling to R? Something like > copyProfiling(TRUE), which should cause R to log sizes of each copied > object (just raw sizes w/o any attempt to identify the object). This > feature should at least help assess the magnitude of the problem. > > Thanks, > Vadim > > Now the transcript itself: > > # the motivation: subscription of a data.frame is *much* (almost 20 > times) slower than that of a list > > # compare > > n = 1e6 > > i = seq(n) > > > > x = data.frame(a=seq(n), b=seq(n)) > > system.time(x[i,], gcFirst=TRUE) > [1] 1.01 0.14 1.14 0.00 0.00 > > > > x = list(a=seq(n), b=seq(n)) > > system.time(lapply(x, function(col) col[i]), gcFirst=TRUE) > [1] 0.06 0.00 0.06 0.00 0.00 > > > > > > # the solution: define methods for the light-weight data.frame class > > lwdf = function(...) 
structure(list(...), class = "lwdf") > > > > # dim > > dim.lwdf = function(x) c(length(x[[1]]), length(x)) > > > > # for pretty printing we define print.lwdf via a conversion to > data.frame > > # as.data.frame.lwdf > > as.data.frame.lwdf = function(x) structure(unclass(x), > class="data.frame", row.names=as.character(seq(nrow(x)))) > > > > # print > > print.lwdf = function(x) print.data.frame(as.data.frame.lwdf(x)) > > > > # now the real stuff > > > > # "[" > > # the naive "[.lwdf" = function (x, i, j) lapply(x[j], function(col) > col[i]) > > # won't work because evaluation of x[j] calls "[.lwdf" again and not > "[.default" > > # so we switch by the number of arguments > > "[.lwdf" = function (x, i, j) { > + if (nargs() == 2) > + NextMethod("[", x, i) > + else > + structure(lapply(x[j], function(col) col[i]), class = "lwdf") > + } > > > > x = lwdf(a=seq(3), b=letters[seq(3)], c=as.factor(letters[seq(3)])) > > i = c(1,3); j = c(1,3) > > > > # unfortunately, for some reasons "[.default" cares about > dimensionality of its argument > > x[i,j] > Error in "[.default"(x, j) : incorrect number of dimensions > > > > > > # we could use unclass to get it right > > "[.lwdf" = function (x, i, j) { > + structure(lapply(unclass(x)[j], function(col) col[i]), class > "lwdf") > + } > > > > x[i,j] > a c > 1 1 a > 2 3 c > > > > # *but* unclass creates a deep copy of its argument as indirectly > evidenced by the following timing > > x = lwdf(a=seq(1e6)); system.time(unclass(x)) > [1] 0.01 0.00 0.01 0.00 0.00 > > x = lwdf(a=seq(1e8)); system.time(unclass(x)) > [1] 0.44 0.39 0.82 0.00 0.00 > > > version > _ > platform x86_64-unknown-linux-gnu > arch x86_64 > os linux-gnu > system x86_64, linux-gnu > status > major 2 > minor 0.1 > year 2004 > month 11 > day 15 > language R > > ______________________________________________ > R-devel@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >
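Building on the .subset one-liner, a method that also copes with a missing i or j might look like the sketch below. The handling of missing arguments and the restored class are assumptions added for illustration; they are not part of the message.

```r
# Constructor from the original post.
lwdf <- function(...) structure(list(...), class = "lwdf")

"[.lwdf" <- function(x, i, j) {
  # x[i, ]: no column index given, so keep every column.
  if (missing(j)) j <- seq_along(x)
  # .subset() does not dispatch, so "[.lwdf" is not re-entered here.
  cols <- .subset(x, j)
  # x[, j]: no row index given, so leave the columns untouched.
  if (!missing(i)) cols <- lapply(cols, "[", i)
  structure(cols, class = "lwdf")
}

x <- lwdf(a = 1:3, b = letters[1:3])
rows13 <- x[c(1, 3), "a"]   # rows 1 and 3 of column a
row2   <- x[2, ]            # row 2 of every column
colb   <- x[, 2]            # all rows of column b
```

One limitation of this sketch: the single-index form x[k] binds k to i and is therefore treated as a row index, which differs from data.frame, where x[k] selects columns.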
Vadim Ogranovich
2005-May-10 20:28 UTC
[Rd] Light-weight data.frame class: was: how to add method to .Primitive function
Thanks again! BTW, how did you find the code for "[.default"? I tried:

> get("[.default")
Error in get(x, envir, mode, inherits) : variable "[.default" was not found
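The failing get() call is consistent with "[" being a primitive whose default behavior is handled internally rather than by an R-level function. A short sketch for probing this from the prompt follows; the example object is hypothetical and only base R functions are used.

```r
# "[" is a primitive, so its default behavior is not an R closure.
print(is.primitive(`[`))

# Whether an R-level "[.default" exists; the get() error in the thread
# suggests it does not, unless some loaded package defines one.
print(exists("[.default"))

# The R-level S3 methods that do exist for "[" can be listed.
print(methods("["))

# .subset() gives non-dispatching access to the same subsetting code.
x <- structure(list(a = 1:3, b = 4:6), class = "lwdf")
plain <- .subset(x, "a")   # a plain named list, class attribute dropped
```

See ?.subset for the documented behavior of the non-dispatching versions.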