Sebastian Martin Krantz
2020-May-31 22:33 UTC
[R-pkgs] collapse package: Advanced and Fast Data Transformation in R
Dear R users,

with some delay I would like to make you aware of the recent CRAN release of *collapse* (https://CRAN.R-project.org/package=collapse), a large new C/C++ based package for advanced and high-performance general purpose data transformation in R.

*collapse* has 2 main objectives:

1. To facilitate complex data transformation and exploration tasks in R. *(In particular grouped and weighted statistical computations, advanced aggregation of mixed-type data, advanced transformations of time-series and panel-data, and the manipulation of lists)*

2. To help make R code fast, flexible, parsimonious and programmer friendly. *(Providing order of magnitude performance improvements via extensive use of C/C++ and highly optimized R code, broad object orientation and infrastructure for grouped programming)*

*collapse*'s main innovation to service these objectives is the introduction of a comprehensive set of fast generic functions and transformation operators, with methods for all standard R objects written in C++.

Currently *collapse* provides 13 fast statistical functions (`fmean`, `fmedian`, `fmode`, `fsum`, `fprod`, `fsd`, `fvar`, `fmin`, `fmax`, `ffirst`, `flast`, `fNobs` and `fNdistinct`) supporting grouped and weighted computations on vectors, matrices and data.frames, and 8 specialized vector-valued functions and associated transformation operators (`fscale/STD`, `fbetween/B`, `fwithin/W`, `fHDbetween/HDB`, `fHDwithin/HDW`, `flag/L/F`, `fdiff/D/Dlog` and `fgrowth/G`) particularly useful for the transformation of time-series and panel-data. Furthermore the function `collap` painlessly handles complex aggregations of mixed-type data, and the function `qsu` computes fast (panel-) summary statistics.

Together with these functions, *collapse* also attempts to formalize and speed up C++ based grouped programming in R: The function `GRP` creates grouping objects which can be passed to the `g` argument of the above functions.
This eliminates all time spent on grouping when performing several computations over the same groups! The `TRA` function also exists for grouped replacing and sweeping out of any computed statistics.

To round things off, *collapse* provides full sets of functions for very fast manipulation of data.frames, fast ordering, fast factor generation, fast conversions between common data objects, and for recursive list processing (such as the function `unlist2d` which creates a tidy data.frame from a nested list of heterogeneous data objects).

To enhance compatibility with existing frameworks, *collapse* functions provide methods for *dplyr* grouped tibbles and *plm* classes for panel-data (pseries and pdata.frame). *data.table*s are also supported by all functions. These methods allow for easy integration of *collapse*'s fast functions into workflows built on these packages. The default methods for transformation functions like `fscale` or `flag` can also handle most time-series classes. In general, attributes are preserved as much as possible in all *collapse* computations.

Regarding performance: *collapse* seems to be the fastest R package for a good share of the functionality it offers. Sizable performance gains can be realized over packages like *dplyr* or *data.table* for various grouped computations. The emphasis is on C++, and the R code employed is carefully micro-optimized, so a *collapse* script typically evaluates significantly faster than, say, a *dplyr* script doing the same thing. Some benchmarks are in the vignettes.

*collapse* also realizes an innovative approach to documentation. Installing the package and calling `help("collapse-documentation")` brings up a full, hierarchically structured documentation. The introductory vignette also introduces all main features in a systematic way.
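As a rough sketch of the grouped workflow described above: the example below is not taken from the package vignettes, and the built-in `mtcars` dataset, the chosen columns and the variable names are assumptions made purely for illustration. It builds one grouping object with `GRP`, reuses it across several fast statistical functions, and then sweeps out group means in three ways that should agree.

```r
library(collapse)

# Illustrative data only: mtcars ships with base R.
x <- mtcars$mpg

# Create the grouping object once (grouping by number of cylinders) ...
g <- GRP(mtcars, ~ cyl)

# ... and reuse it, so grouping is not recomputed for each statistic.
grp_means  <- fmean(x, g)                  # grouped mean
grp_sds    <- fsd(x, g)                    # grouped standard deviation
grp_wmeans <- fmean(x, g, w = mtcars$wt)   # grouped mean weighted by car weight

# Sweeping out the group means: explicitly with TRA, ...
centered_tra <- TRA(x, grp_means, "-", g)
# ... via the TRA argument of the statistical function, ...
centered_arg <- fmean(x, g, TRA = "-")
# ... or with the dedicated within-transformation.
centered_fw  <- fwithin(x, g)
```

Passing a plain vector or factor to `g` also works for a one-off computation; the `GRP` object pays off when several statistics are computed over the same groups.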
At this point, *collapse* 1.2.1 is already quite a mature package with a stable user API, passing repeated checks of R and C++ code and more than 5600 unit tests on all supported operating systems. The package will continue to receive active maintenance and development.

I hope that the availability of *collapse* will lead not only to faster data science, but especially to faster and richer development of complex statistical techniques. I welcome initiatives of like-minded developers willing to speed up grouped programming in R via C++, and encourage the use of the *collapse* API for such endeavors.

For any issues, contributions, comments or suggestions, use GitHub or send me an e-mail.

Best regards,
Sebastian