Hilmar Berger
2024-Jul-12 15:35 UTC
[Rd] xftrm is more than 100x slower for AsIs than for character vectors
Good evening, I recently have observed slow merges when combining multiple data frames derived from DataFrame and base::data.frame. I observed that the index column of intermediate tables was of class <AsIs> (automatically converted from character). The problem occurred mainly when using the sorted = T option in base::merge. This can be traced to xtfrm.AsIs being more than 100 times slower than the comparable function for character vectors. x = paste0("A_", 1:1e5) system.time({o <- xtfrm(x)}) #? user? system elapsed #? 0.325?? 0.005?? 0.332 x <- I(x) system.time({o <- xtfrm(x)}) # this calls xtfrm.AsIs # user? system elapsed # 26.153?? 0.016? 26.388 This can be finally traced to base::rank() (called from xtfrm.default), where I found that "NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), ....) will have the desired result if there is a xtfrm method. Otherwise, rank will make use of ==, >, is.na and extraction methods for classed objects, possibly rather slowly. " This *sounds* like the existence of xtfrm.AsIs should already be able to accelerate the ranking, but this does not seem to work. xtfrm.AsIs does not do anything for my case of class(x) == "AsIs" and just delegates to xtfrm.default. As a quick solution (and if there is no other fix), could we possibly add a note to the help page of I() that sorting/ordering/ranking of AsIs columns will be rather slow? Thanks a lot! Best regards Hilmar > sessionInfo() R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 20.04.6 LTS Matrix products: default BLAS:?? /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;? LAPACK version 3.9.0 locale: ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ?[3] LC_TIME=de_DE.UTF-8??????? LC_COLLATE=en_US.UTF-8 ?[5] LC_MONETARY=de_DE.UTF-8??? LC_MESSAGES=en_US.UTF-8 ?[7] LC_PAPER=de_DE.UTF-8?????? LC_NAME=C ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C time zone: Europe/Berlin tzcode source: system (glibc) attached base packages: [1] stats???? graphics? grDevices utils???? datasets? methods base loaded via a namespace (and not attached): [1] compiler_4.4.1
Ivan Krylov
2024-Jul-14 09:24 UTC
[Rd] xftrm is more than 100x slower for AsIs than for character vectors
? Fri, 12 Jul 2024 17:35:19 +0200 Hilmar Berger via R-devel <r-devel at r-project.org> ?????:> This can be finally traced to base::rank() (called from > xtfrm.default), where I found that > > "NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), ....) > will have the desired result if there is a xtfrm method. Otherwise, > rank will make use of ==, >, is.na and extraction methods for classed > objects, possibly rather slowly. "The problem is indeed that the vector reaches base::rank in both cases, but since it has a class, the function has to construct and evaluate a call to .gt every time it wants to compare two elements. xtfrm.AsIs even tries to remove the 'AsIs' class before continuing the method dispatch process:>> if (length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L]It doesn't work in the (very contrived) case when 'AsIs' is not the first class and it doesn't remove 'AsIs' as the only class (making static int equal(...) take the slower branch). What's going to break if we allow removing the class attribute altogether? This seems to speed up xtfrm(I(x)) and survive LC_ALL=C.UTF-8 make check-devel: Index: src/library/base/R/sort.R ==================================================================--- src/library/base/R/sort.R (revision 86895) +++ src/library/base/R/sort.R (working copy) @@ -297,7 +297,8 @@ xtfrm.AsIs <- function(x) { - if(length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L] + cl <- oldClass(x) + oldClass(x) <- cl[cl != 'AsIs'] NextMethod("xtfrm") } -- Best regards, Ivan
Reasonably Related Threads
- xftrm is more than 100x slower for AsIs than for character vectors
- xftrm is more than 100x slower for AsIs than for character vectors
- I() in merge (was: Re: xftrm is more than 100x slower for AsIs than for character vectors)
- order() fails on a chr object of class "AsIs" with "\265" in it
- Change to I() in R 4.1