Sebastian Martin Krantz
2022-Feb-26 22:12 UTC
[Rd] Enhancements in base R: Some Suggestions from the {collapse} and {kit} Packages
Dear R Core and Developers,

I have been asked by a user to contribute to base R. I was hesitant, because I think you have better things to do than adding or optimizing C code, and because the objective of my package {collapse} (to vectorize grouped statistical operations in R) is for the most part beyond the scope of base R. There are, however, some functions and algorithms used in {collapse} and in the {kit} package by Morgan Jacob (with variants in {data.table} as well) that could benefit base R, so I'll give you my 5 cents about those here, in the hope that they could be useful at some point.

1. Factor generation in R could be faster, utilizing order(..., method = "radix") (for numeric data) and kit::charToFact. The basic idea for numeric data is to use the fast radix-based ordering already present in base R, and then do a run-length-type grouping of the vector:

    fast_num_fact <- function(x) {
      names(x) <- NULL
      # Radix ordering with group information attached as attributes
      radixord_core <- function(...) .Internal(radixsort(TRUE, FALSE, TRUE, TRUE, ...))
      o <- radixord_core(x)
      ends <- attr(o, "ends")
      # Run-length-type group id along the ordering, skipping missing values
      f <- collapse::groupid(x, o, na.skip = TRUE, check.o = FALSE)
      attributes(f) <- NULL
      # Levels are the last element of each run of equal values
      attr(f, "levels") <- if(is.character(x)) x[o[ends]] else as.character(x[o[ends]])
      class(f) <- "factor"
      f
    }

This function will also be faster than a hash table for character data that is approximately sorted. The new hash-table-based implementation kit::charToFact is, however, faster than either match() or radix ordering for character data, and could easily be ported into base R.
Code: https://github.com/2005m/kit/blob/6ee20af14228df3a69cbf594cb6e116a838b5407/src/psort.c

2. Unique values in R could be computed significantly faster using collapse::group(), which utilizes a hash function first developed in {kit} in a clever way to achieve very fast first-appearance-order grouping for vectors or lists of vectors / data frames.
For code examples, see collapse::funique() for data frames, or collapse::qF(..., sort = FALSE), which generates factors with levels in first-appearance order.
Code: https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c

3. split() could become significantly faster, using collapse::gsplit(). gsplit() is {collapse}'s version of split() utilizing grouping objects (created with collapse::GRP, which uses the algorithms just outlined in a more direct way), but it also works with factors. Rudimentary benchmarks show that lapply(gsplit(x, f), FUN, ...) is comparable in speed to {data.table} applying basic R functions across groups (without internal vectorization / GForce), and could benefit much of base R.
Code: https://github.com/SebKrantz/collapse/blob/master/src/small_helper.c (might go to a separate file in the future)

4. Data frame subsetting could become a lot faster: various faster implementations are available in {data.table}, {collapse} (same as {data.table}, but without parallelism and without overallocation of columns) and {kit}.

5. There are many smaller functions in both packages that are useful and could be ported more or less directly to base R. These include mathematical operations by reference for vectors / matrices / data frames (collapse::setop and %+=%, %-=%, %*=%, %/=%), multiple assignment (collapse::massign, %=%), additional parallel statistics functions (kit::pmean, psum, pprod, pany, pall), fast ifelse (kit::iif), etc.
See https://sebkrantz.github.io/collapse/reference/index.html#-memory-efficient-programming
Code: https://github.com/SebKrantz/collapse/blob/master/src/small_helper.c and https://github.com/2005m/kit/blob/master/src/psum.c and https://github.com/2005m/kit/blob/master/src/iif.c

Those were my 5 cents based on what I have seen and done so far. If they are useful for base R development as well, I am glad. Otherwise, keep up the great work you are doing, and we (and many others) will continue to develop the {fastverse}.
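As a footnote to points 1 and 2, the semantics of the two grouping ideas can be written in plain base R. The following is only a minimal sketch under stated assumptions: the function names sorted_factor_sketch and first_appearance_sketch are hypothetical, it is not the {collapse} or {kit} implementation, and it makes no claim to match the speed of the C code. It shows radix ordering followed by a run-length-type grouping, and first-appearance-order grouping via match() on unique().

```r
# Point 1 (sketch): sorted factor via radix ordering + run-length grouping.
# Hypothetical helper, intended for numeric or integer input.
sorted_factor_sketch <- function(x) {
  o  <- order(x, method = "radix", na.last = NA)  # ordering with NAs dropped
  xs <- x[o]
  n  <- length(xs)
  # TRUE wherever a new run of equal values starts in the sorted vector
  starts <- if (n > 0L) c(TRUE, xs[-1L] != xs[-n]) else logical(0)
  f <- rep(NA_integer_, length(x))
  f[o] <- cumsum(starts)                          # run index = level code
  attr(f, "levels") <- as.character(xs[starts])
  class(f) <- "factor"
  f
}

# Point 2 (sketch): group ids in order of first appearance.
# Hypothetical helper; returns the ids together with the unique values.
first_appearance_sketch <- function(x) {
  u <- unique(x)
  list(id = match(x, u), values = u)
}

x <- c(3, 1, 2, 1, NA, 3)
identical(sorted_factor_sketch(x), factor(x))

g <- first_appearance_sketch(c("b", "a", "b", "c", "a"))
g$id
g$values
```

The first helper never hashes the data: it relies only on the ordering, which is why the approach suits numeric data and nearly sorted input. The second helper does the opposite and keeps the order in which values are first seen.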
Best regards,
Sebastian Krantz