Hilmar Berger
2023-Dec-11 20:11 UTC
[Rd] Partial matching performance in data frame rownames using [
Dear all,

I have seen that others have discussed the partial matching behaviour of data.frame[idx,] in the past, in particular with respect to unexpected result sets. I am aware that one can work around this by using match() or by switching to tibble/data.table or similar altogether.

I have a different issue with partial matching: its performance on large data frames, or more specifically, with large queries matched against the row names. I came across a case where I wanted to extract data from a large table (approx. 1M rows) using an index of which only about 50% matched the row names, i.e. about 50% row name hits and 50% misses.

What was unexpected was that in this case [.data.frame was hanging for a long time (I waited about 10 minutes and then restarted R). Also, this cannot be interrupted in interactive mode.

ids <- paste0("cg", sprintf("%06d", 0:(1e6 - 1)))
d1 <- data.frame(row.names = ids, v = 1:(1e6))

# all queries are exact hits: fast
q1 <- sample(ids, 1e6, replace = FALSE)
system.time({r <- d1[q1, , drop = FALSE]})
#   user  system elapsed
#  0.464   0.000   0.465

# 50% misses: this hangs for a long time, I stopped R after 10 minutes
q2 <- c(q1[1:5e5], gsub("cg", "ct", q1[(5e5 + 1):1e6]))
system.time({r <- d1[q2, , drop = FALSE]})

# same here
q3 <- c(q1[1:5e5], rep("FOO", 5e5))
system.time({r <- d1[q3, , drop = FALSE]})

It seems that the penalty of partially matching the non-hits across the whole row name vector is no longer negligible with large tables and queries, compared to small and medium tables. I checked, and pmatch(q2, rownames(d1)) is equally slow.

Is there a chance to
a) document this in the help page ("with large indexes/tables use match()"), or even better
b) add an exact flag to [.data.frame?

Thanks a lot!

Best regards
Hilmar
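For readers who need the match() workaround mentioned above, here is a minimal sketch. The helper name exact_rows is hypothetical and not part of base R; it assumes that misses should come back as NA rows, which mirrors what character indexing returns for names that are not found. The table is scaled down so the example runs quickly.

```r
# Hypothetical helper (not part of base R): exact row-name lookup.
# match() only does exact matching, and indexing with the resulting
# integer vector avoids the partial matching applied to character indices.
exact_rows <- function(df, keys) {
  idx <- match(keys, rownames(df))  # NA for keys that are not row names
  df[idx, , drop = FALSE]
}

# Scaled-down version of the example above
ids <- paste0("cg", sprintf("%04d", 0:9999))
d1 <- data.frame(row.names = ids, v = seq_along(ids))

q <- c(ids[1:5], "FOO", "ct0001")   # five hits, two misses
r <- exact_rows(d1, q)
```

If misses should be dropped rather than returned as NA rows, filter the index first, e.g. with idx[!is.na(idx)].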
Ivan Krylov
2023-Dec-12 12:55 UTC
[Rd] Partial matching performance in data frame rownames using [
On Mon, 11 Dec 2023 21:11:48 +0100, Hilmar Berger via R-devel <r-devel at r-project.org> wrote:

> What was unexpected is that in this case was that [.data.frame was
> hanging for a long time (I waited about 10 minutes and then restarted
> R). Also, this cannot be interrupted in interactive mode.

That's unfortunate. If an operation takes a long time, it ought to be interruptible. Here's a patch that passes make check-devel:

--- src/main/unique.c (revision 85667)
+++ src/main/unique.c (working copy)
@@ -1631,6 +1631,7 @@
         }
     }
 
+    unsigned int ic = 9999;
     if(nexact < n_input) {
         /* Second pass, partial matching */
         for (R_xlen_t i = 0; i < n_input; i++) {
@@ -1642,6 +1643,10 @@
             mtch = 0;
             mtch_count = 0;
             for (int j = 0; j < n_target; j++) {
+                if (!--ic) {
+                    R_CheckUserInterrupt();
+                    ic = 9999;
+                }
                 if (no_dups && used[j]) continue;
                 if (strncmp(ss, tar[j], temp) == 0) {
                     mtch = j + 1;

-- 
Best regards,
Ivan
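The inner loop in the patch runs over every target for each input that reaches the second pass, so the work grows roughly as (number of misses) x (number of targets). A scaled-down sketch of how one might observe this from R is given below; the sizes are arbitrary assumptions chosen so that it finishes in a reasonable time, and actual timings will depend on the machine.

```r
set.seed(1)
ids    <- paste0("cg", sprintf("%06d", 0:(2e4 - 1)))
hits   <- sample(ids, 5e3)                  # all exact matches
misses <- sub("cg", "ct", sample(ids, 5e3)) # no exact or partial matches

# Exact hits do not reach the partial-matching pass
t_hits <- system.time(pm_hits <- pmatch(hits, ids, duplicates.ok = TRUE))

# Every miss is compared against all 2e4 targets in the second pass
t_miss <- system.time(pm_miss <- pmatch(misses, ids, duplicates.ok = TRUE))

# match() does exact matching only and has no such second pass
t_exact <- system.time(m_miss <- match(misses, ids))

print(rbind(pmatch_hits = t_hits, pmatch_misses = t_miss, match_misses = t_exact)[, 1:3])
```

Scaling both vectors up to the sizes in the original report multiplies the number of comparisons in the miss case by several orders of magnitude, which is consistent with the hang described there.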