Ivan Krylov
2023-Dec-16 09:48 UTC
[Rd] Partial matching performance in data frame rownames using [
On Wed, 13 Dec 2023 09:04:18 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> wrote:> Still, I feel that default partial matching cripples the functionality > of data.frame for larger tables.Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R ==================================================================--- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd ==================================================================--- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe. -- Best regards, Ivan
Toby Hocking
2023-Dec-19 20:57 UTC
[Rd] Partial matching performance in data frame rownames using [
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add un-necessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby On Sat, Dec 16, 2023 at 2:49?AM Ivan Krylov <krylov.r00t at gmail.com> wrote:> > On Wed, 13 Dec 2023 09:04:18 +0100 > Hilmar Berger via R-devel <r-devel at r-project.org> wrote: > > > Still, I feel that default partial matching cripples the functionality > > of data.frame for larger tables. > > Changing the default now would require a long deprecation cycle to give > everyone who uses `[.data.frame` and relies on partial matching > (whether they know it or not) enough time to adjust. > > Still, adding an argument feels like a small change: edit > https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and > add a condition before calling pmatch(). Adjust the warning() for named > arguments. Don't forget to document the new argument in the man page at > https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd > > Index: src/library/base/R/dataframe.R > ==================================================================> --- src/library/base/R/dataframe.R (revision 85664) > +++ src/library/base/R/dataframe.R (working copy) > @@ -591,14 +591,14 @@ > ### These are a little less general than S > > `[.data.frame` <- > - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) > + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) > { > mdrop <- missing(drop) > Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified > has.j <- !missing(j) > - if(!all(names(sys.call()) %in% c("", "drop")) > + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) > && !isS4(x)) # at least don't warn for callNextMethod! > - warning("named arguments other than 'drop' are discouraged") > + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") > > if(Narg < 3L) { # list-like indexing or matrix indexing > if(!mdrop) warning("'drop' argument will be ignored") > @@ -679,7 +679,11 @@ > ## for consistency with [, <length-1>] > if(is.character(i)) { > rows <- attr(xx, "row.names") > - i <- pmatch(i, rows, duplicates.ok = TRUE) > + i <- if (pmatch.rows) { > + pmatch(i, rows, duplicates.ok = TRUE) > + } else { > + match(i, rows) > + } > } > ## need to figure which col was selected: > ## cannot use .subset2 directly as that may > @@ -699,7 +703,11 @@ > # as this can be expensive. > if(is.character(i)) { > rows <- attr(xx, "row.names") > - i <- pmatch(i, rows, duplicates.ok = TRUE) > + i <- if (pmatch.rows) { > + pmatch(i, rows, duplicates.ok = TRUE) > + } else { > + match(i, rows) > + } > } > for(j in seq_along(x)) { > xj <- xx[[ sxx[j] ]] > Index: src/library/base/man/Extract.data.frame.Rd > ==================================================================> --- src/library/base/man/Extract.data.frame.Rd (revision 85664) > +++ src/library/base/man/Extract.data.frame.Rd (working copy) > @@ -15,7 +15,7 @@ > Extract or replace subsets of data frames. > } > \usage{ > -\method{[}{data.frame}(x, i, j, drop = ) > +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) > \method{[}{data.frame}(x, i, j) <- value > \method{[[}{data.frame}(x, ..., exact = TRUE) > \method{[[}{data.frame}(x, i, j) <- value > @@ -45,6 +45,9 @@ > column is selected.} > > \item{exact}{logical: see \code{\link{[}}, and applies to column names.} > + > + \item{pmatch.rows}{logical: whether to perform partial matching on > + row names in case \code{i} is a character vector.} > } > \details{ > Data frames can be indexed in several modes. When \code{[} and > > > system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) > # user system elapsed > # 0.478 0.004 0.482 > > Unfortunately, that would be only the beginning. The prose in the whole > ?`[.data.frame` would have to be updated; the new behaviour would have > to be tested in tests/**.R. There may be very good reasons why named > arguments to `[` other than drop= are discouraged for data.frames. I'm > afraid I lack the whole-project view to consider whether such an > addition would be safe. > > -- > Best regards, > Ivan > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Toby Hocking
2023-Dec-19 20:57 UTC
[Rd] Partial matching performance in data frame rownames using [
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add un-necessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby On Sat, Dec 16, 2023 at 2:49?AM Ivan Krylov <krylov.r00t at gmail.com> wrote:> > On Wed, 13 Dec 2023 09:04:18 +0100 > Hilmar Berger via R-devel <r-devel at r-project.org> wrote: > > > Still, I feel that default partial matching cripples the functionality > > of data.frame for larger tables. > > Changing the default now would require a long deprecation cycle to give > everyone who uses `[.data.frame` and relies on partial matching > (whether they know it or not) enough time to adjust. > > Still, adding an argument feels like a small change: edit > https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and > add a condition before calling pmatch(). Adjust the warning() for named > arguments. Don't forget to document the new argument in the man page at > https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd > > Index: src/library/base/R/dataframe.R > ==================================================================> --- src/library/base/R/dataframe.R (revision 85664) > +++ src/library/base/R/dataframe.R (working copy) > @@ -591,14 +591,14 @@ > ### These are a little less general than S > > `[.data.frame` <- > - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) > + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) > { > mdrop <- missing(drop) > Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified > has.j <- !missing(j) > - if(!all(names(sys.call()) %in% c("", "drop")) > + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) > && !isS4(x)) # at least don't warn for callNextMethod! > - warning("named arguments other than 'drop' are discouraged") > + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") > > if(Narg < 3L) { # list-like indexing or matrix indexing > if(!mdrop) warning("'drop' argument will be ignored") > @@ -679,7 +679,11 @@ > ## for consistency with [, <length-1>] > if(is.character(i)) { > rows <- attr(xx, "row.names") > - i <- pmatch(i, rows, duplicates.ok = TRUE) > + i <- if (pmatch.rows) { > + pmatch(i, rows, duplicates.ok = TRUE) > + } else { > + match(i, rows) > + } > } > ## need to figure which col was selected: > ## cannot use .subset2 directly as that may > @@ -699,7 +703,11 @@ > # as this can be expensive. > if(is.character(i)) { > rows <- attr(xx, "row.names") > - i <- pmatch(i, rows, duplicates.ok = TRUE) > + i <- if (pmatch.rows) { > + pmatch(i, rows, duplicates.ok = TRUE) > + } else { > + match(i, rows) > + } > } > for(j in seq_along(x)) { > xj <- xx[[ sxx[j] ]] > Index: src/library/base/man/Extract.data.frame.Rd > ==================================================================> --- src/library/base/man/Extract.data.frame.Rd (revision 85664) > +++ src/library/base/man/Extract.data.frame.Rd (working copy) > @@ -15,7 +15,7 @@ > Extract or replace subsets of data frames. > } > \usage{ > -\method{[}{data.frame}(x, i, j, drop = ) > +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) > \method{[}{data.frame}(x, i, j) <- value > \method{[[}{data.frame}(x, ..., exact = TRUE) > \method{[[}{data.frame}(x, i, j) <- value > @@ -45,6 +45,9 @@ > column is selected.} > > \item{exact}{logical: see \code{\link{[}}, and applies to column names.} > + > + \item{pmatch.rows}{logical: whether to perform partial matching on > + row names in case \code{i} is a character vector.} > } > \details{ > Data frames can be indexed in several modes. When \code{[} and > > > system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) > # user system elapsed > # 0.478 0.004 0.482 > > Unfortunately, that would be only the beginning. The prose in the whole > ?`[.data.frame` would have to be updated; the new behaviour would have > to be tested in tests/**.R. There may be very good reasons why named > arguments to `[` other than drop= are discouraged for data.frames. I'm > afraid I lack the whole-project view to consider whether such an > addition would be safe. > > -- > Best regards, > Ivan > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel