thr3ads.net - R devel - [Rd] Partial matching performance in data frame rownames using [ [Dec 2023]

If this information is useful, please help other people find it:
Share via:

Hilmar Berger

2023-Dec-13 08:04 UTC

[Rd] Partial matching performance in data frame rownames using [

Dear Ivan,

thanks a lot, that is helpful.

Still, I feel that default partial matching cripples the functionality
of data.frame for larger tables.

Thanks again and best regards

Hilmar

On 12.12.23 13:55, Ivan Krylov wrote:> ? Mon, 11 Dec 2023 21:11:48 +0100
> Hilmar Berger via R-devel <r-devel at r-project.org> ?????:
>
>> What was unexpected is that in this case was that [.data.frame was
>> hanging for a long time (I waited about 10 minutes and then restarted
>> R). Also, this cannot be interrupted in interactive mode.
> That's unfortunate. If an operation takes a long time, it ought to be
> interruptible. Here's a patch that passes make check-devel:
>
> --- src/main/unique.c	(revision 85667)
> +++ src/main/unique.c	(working copy)
> @@ -1631,6 +1631,7 @@
>   	}
>       }
>
> +    unsigned int ic = 9999;
>       if(nexact < n_input) {
>   	/* Second pass, partial matching */
>   	for (R_xlen_t i = 0; i < n_input; i++) {
> @@ -1642,6 +1643,10 @@
>   	    mtch = 0;
>   	    mtch_count = 0;
>   	    for (int j = 0; j < n_target; j++) {
> +		if (!--ic) {
> +		    R_CheckUserInterrupt();
> +		    ic = 9999;
> +		}
>   		if (no_dups && used[j]) continue;
>   		if (strncmp(ss, tar[j], temp) == 0) {
>   		    mtch = j + 1;
>

Ivan Krylov

2023-Dec-16 09:48 UTC

head link

[Rd] Partial matching performance in data frame rownames using [

On Wed, 13 Dec 2023 09:04:18 +0100
Hilmar Berger via R-devel <r-devel at r-project.org> wrote:
> Still, I feel that default partial matching cripples the functionality
> of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give
everyone who uses `[.data.frame` and relies on partial matching
(whether they know it or not) enough time to adjust.

Still, adding an argument feels like a small change: edit
https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and
add a condition before calling pmatch(). Adjust the warning() for named
arguments. Don't forget to document the new argument in the man page at
https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd

Index: src/library/base/R/dataframe.R
==================================================================---
src/library/base/R/dataframe.R	(revision 85664)
+++ src/library/base/R/dataframe.R	(working copy)
@@ -591,14 +591,14 @@
 ###  These are a little less general than S
 
 `[.data.frame` <-
-    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1)
+    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1,
pmatch.rows = TRUE)
 {
     mdrop <- missing(drop)
     Narg <- nargs() - !mdrop  # number of arg from x,i,j that were specified
     has.j <- !missing(j)
-    if(!all(names(sys.call()) %in% c("", "drop"))
+    if(!all(names(sys.call()) %in% c("", "drop",
"pmatch.rows"))
        && !isS4(x)) # at least don't warn for callNextMethod!
-        warning("named arguments other than 'drop' are
discouraged")
+        warning("named arguments other than 'drop',
'pmatch.rows' are discouraged")
 
     if(Narg < 3L) {  # list-like indexing or matrix indexing
         if(!mdrop) warning("'drop' argument will be ignored")
@@ -679,7 +679,11 @@
             ## for consistency with [, <length-1>]
             if(is.character(i)) {
                 rows <- attr(xx, "row.names")
-                i <- pmatch(i, rows, duplicates.ok = TRUE)
+                i <- if (pmatch.rows) {
+                    pmatch(i, rows, duplicates.ok = TRUE)
+                } else {
+                    match(i, rows)
+                }
             }
             ## need to figure which col was selected:
             ## cannot use .subset2 directly as that may
@@ -699,7 +703,11 @@
                  # as this can be expensive.
     if(is.character(i)) {
         rows <- attr(xx, "row.names")
-        i <- pmatch(i, rows, duplicates.ok = TRUE)
+        i <- if (pmatch.rows) {
+            pmatch(i, rows, duplicates.ok = TRUE)
+        } else {
+            match(i, rows)
+        }
     }
     for(j in seq_along(x)) {
         xj <- xx[[ sxx[j] ]]
Index: src/library/base/man/Extract.data.frame.Rd
==================================================================---
src/library/base/man/Extract.data.frame.Rd	(revision 85664)
+++ src/library/base/man/Extract.data.frame.Rd	(working copy)
@@ -15,7 +15,7 @@
   Extract or replace subsets of data frames.
 }
 \usage{
-\method{[}{data.frame}(x, i, j, drop = )
+\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE)
 \method{[}{data.frame}(x, i, j) <- value
 \method{[[}{data.frame}(x, ..., exact = TRUE)
 \method{[[}{data.frame}(x, i, j) <- value
@@ -45,6 +45,9 @@
     column is selected.}
 
    \item{exact}{logical: see \code{\link{[}}, and applies to column names.}
+
+   \item{pmatch.rows}{logical: whether to perform partial matching on
+     row names in case \code{i} is a character vector.}
 }
 \details{
   Data frames can be indexed in several modes.  When \code{[} and


system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]})
#    user  system elapsed 
#   0.478   0.004   0.482 

Unfortunately, that would be only the beginning. The prose in the whole
?`[.data.frame` would have to be updated; the new behaviour would have
to be tested in tests/**.R. There may be very good reasons why named
arguments to `[` other than drop= are discouraged for data.frames. I'm
afraid I lack the whole-project view to consider whether such an
addition would be safe.

-- 
Best regards,
Ivan

Maybe Matching Threads

Search for more apparently analagous threads

R devel - Dec 2023 - Partial matching performance in data frame rownames using [

[Rd] Partial matching performance in data frame rownames using [

[Rd] Partial matching performance in data frame rownames using [

Maybe Matching Threads