Hilmar Berger
2023-Dec-13 08:04 UTC
[Rd] Partial matching performance in data frame rownames using [
Dear Ivan, thanks a lot, that is helpful. Still, I feel that default partial matching cripples the functionality of data.frame for larger tables. Thanks again and best regards Hilmar On 12.12.23 13:55, Ivan Krylov wrote:> ? Mon, 11 Dec 2023 21:11:48 +0100 > Hilmar Berger via R-devel <r-devel at r-project.org> ?????: > >> What was unexpected is that in this case was that [.data.frame was >> hanging for a long time (I waited about 10 minutes and then restarted >> R). Also, this cannot be interrupted in interactive mode. > That's unfortunate. If an operation takes a long time, it ought to be > interruptible. Here's a patch that passes make check-devel: > > --- src/main/unique.c (revision 85667) > +++ src/main/unique.c (working copy) > @@ -1631,6 +1631,7 @@ > } > } > > + unsigned int ic = 9999; > if(nexact < n_input) { > /* Second pass, partial matching */ > for (R_xlen_t i = 0; i < n_input; i++) { > @@ -1642,6 +1643,10 @@ > mtch = 0; > mtch_count = 0; > for (int j = 0; j < n_target; j++) { > + if (!--ic) { > + R_CheckUserInterrupt(); > + ic = 9999; > + } > if (no_dups && used[j]) continue; > if (strncmp(ss, tar[j], temp) == 0) { > mtch = j + 1; >
Ivan Krylov
2023-Dec-16 09:48 UTC
[Rd] Partial matching performance in data frame rownames using [
On Wed, 13 Dec 2023 09:04:18 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> wrote:> Still, I feel that default partial matching cripples the functionality > of data.frame for larger tables.Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R ==================================================================--- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd ==================================================================--- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe. -- Best regards, Ivan