Sklyar, Oleg (London)
2009-Mar-27 15:56 UTC
[Rd] improving performance of slicing on matrices and their S4 derivatives
Dear list,

It is a known issue that accessing slots of S4 objects, and in particular accessing .Data slots, is slow in R. However, what surprises me are two things demonstrated in the code below (runnable with 'inline'; my times are in the comments):

- Copying data out of a large (1e7 x 3) .Data slot into a matrix can easily be made 3-4 times faster than accessing the .Data slot itself, which I believe grabs a reference (and as copying can be avoided, the acceleration should be even more dramatic). It is surprising that this memory-inefficient operation is faster than such a simple thing as getting a reference!

- Getting a column, or columns, from an atomic R matrix, or from an S4 object derived from it, can be up to 10 times faster than standard slicing with the [-operator (yes, it is less generic, but with such a performance gain we definitely use it).

My point is: should the [-operators for atomic objects and @.Data not be redesigned? The code here is just an example for double storage mode and without any checks. Adding checks, colnames etc. does not lead to performance degradation.

I originally thought that the dispatch looking up a particular [ implementation for an object was the issue, but this is not the case: redefining [ or $ as S4 methods (!) to use the mcol below for an S4 object shows the same performance gains as the direct use of mcol/mcols!

Any comments welcome!
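To illustrate why a single-column read can be so cheap, here is a minimal standalone C sketch, independent of the R API. It relies on one fact: R stores matrices in column-major order, so a whole column is one contiguous run of nrow values and can be copied with a single memcpy. The function names `copy_column` and `gather_elements` are hypothetical and do not come from the post or from R. The second function only shows the general shape of an element-by-element gather that a routine accepting arbitrary index vectors has to perform. Whether that difference accounts for the timings reported in this post is an assumption, not something the post establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy column `col` (0-based) of a column-major
 * nrow x ncol matrix of doubles into `out`, which must hold nrow values.
 * Returns 0 on success, -1 if the column index is out of range. */
int copy_column(const double *m, size_t nrow, size_t ncol,
                size_t col, double *out)
{
    assert(m != NULL && out != NULL);
    if (col >= ncol)
        return -1;
    /* In column-major layout the column starts at offset col * nrow
     * and occupies nrow consecutive doubles. */
    memcpy(out, m + col * nrow, nrow * sizeof(double));
    return 0;
}

/* Hypothetical comparison: gather the rows listed in `rows` (0-based,
 * nidx entries) from column `col`, one element at a time, checking
 * every row index. Returns 0 on success, -1 on any bad index. */
int gather_elements(const double *m, size_t nrow, size_t ncol,
                    const size_t *rows, size_t nidx, size_t col,
                    double *out)
{
    assert(m != NULL && rows != NULL && out != NULL);
    if (col >= ncol)
        return -1;
    for (size_t k = 0; k < nidx; k++) {
        if (rows[k] >= nrow)
            return -1;
        out[k] = m[rows[k] + col * nrow];
    }
    return 0;
}
```

For a 3 x 2 matrix stored as {1, 2, 3, 4, 5, 6}, `copy_column` with `col = 1` fills `out` with the second column, 4, 5, 6. Unlike the mcol body in the post, this sketch takes sizes as `size_t` and checks the column index, which avoids the `int` overflow that `i*nrow` could hit on very large matrices.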
## --- code ----------------------------------------------------
## available from CRAN, needs compilers installed
library(inline)

## get 1 column of a matrix to use instead of [-operator
## (same performance gains if index is a character or on multiple columns or
## when getting multiple columns as matrix and assigning the names from input)
body = "/* test for column extraction: no checks here for code simplicity */
  int nrow = Rf_nrows(m);
  int i = INTEGER(index)[0] - 1;
  SEXP res;
  PROTECT(res = allocVector(REALSXP, nrow));
  memcpy(REAL(res), &(REAL(m)[i*nrow]), nrow*sizeof(double));
  UNPROTECT(1);
  return res;"
mcol = cfunction(signature(m="matrix", index="integer"), body=body,
                 includes="#include <string.h>")

## get A COPY of the @.Data slot from an object derived from numeric/matrix
body = "/* test performance of getting A COPY of @.Data, keeping dimnames */
  int nrow = Rf_nrows(m);
  int ncol = Rf_ncols(m);
  SEXP res, dim;
  PROTECT(res = allocVector(REALSXP, nrow*ncol));
  PROTECT(dim = allocVector(INTSXP, 2));
  INTEGER(dim)[0] = nrow;
  INTEGER(dim)[1] = ncol;
  SET_DIM(res, dim);
  if (GET_DIMNAMES(m) != R_NilValue)
    SET_DIMNAMES(res, Rf_duplicate(GET_DIMNAMES(m)));
  if (ncol>0 && nrow>0)
    memcpy(REAL(res), REAL(m), nrow*ncol*sizeof(double));
  UNPROTECT(2);
  return res;"
mcols = cfunction(signature(m="matrix"), body=body,
                  includes="#include <string.h>")

## --- tests ---------------------------------------------------
m = matrix(runif(3e7), nc=3)
setClass("MyClass", representation("matrix", comment="character"))
dat = new("MyClass", m, comment="test object")

mean(sapply(1:20, function(i) system.time(dat@.Data)[1] ))
## output: [1] 0.2526
mean(sapply(1:20, function(i) system.time(mcols(dat))[1] ))
## output: [1] 0.08215

mean(sapply(1:50, function(i) system.time(m[,2])[1] ))
## output: [1] 0.1222
mean(sapply(1:50, function(i) system.time(mcol(m,2L))[1] ))
## output: [1] 0.02596

mean(sapply(1:50, function(i) system.time(dat[,2])[1] ))
## output: [1] 0.1269
mean(sapply(1:50, function(i) system.time(mcol(dat,2L))[1] ))
## output: [1] 0.02584

---
> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-02 r47821)
x86_64-unknown-linux-gnu

locale:
C

attached base packages:
[1] stats graphics utils datasets grDevices methods base

other attached packages:
[1] inline_0.3.3

Dr Oleg Sklyar
Research Technologist
AHL / Man Investments Ltd
+44 (0)20 7144 3107
osklyar at maninvestments.com