Generating a model matrix with very large numbers of columns overflows the stack and/or runs very slowly, due to the implementation of TrimRepeats(). This patch modifies it to use Rf_duplicated() to find the duplicates. This makes the running time linear in the number of columns and eliminates the recursive function calls. Thanks -------------- next part -------------- A non-text attachment was scrubbed... Name: stats_model.patch Type: text/x-patch Size: 2182 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20160226/b54f1cc2/attachment.bin>
>>>>> Karl Millar via R-devel <r-devel at r-project.org> >>>>> on Fri, 26 Feb 2016 15:58:20 -0800 writes:> Generating a model matrix with very large numbers of > columns overflows the stack and/or runs very slowly, due > to the implementation of TrimRepeats(). > This patch modifies it to use Rf_duplicated() to find the > duplicates. This makes the running time linear in the > number of columns and eliminates the recursive function > calls. Thank you, Karl. I've committed this (very slightly modified) to R-devel, (also after looking for a an example that runs on a non-huge computer and shows the difference) : nF <- 11 ; set.seed(1) lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF]) str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels)) ## 'data.frame': 128 obs. of 11 variables: ## $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ... ## $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ... ## $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ... ## $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ... ## $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ... ## $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ... ## $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ... ## $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ... ## $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ... ## $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ... ## $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... ## ## [1] 139968 system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert"))) ## user system elapsed ## 0.255 0.033 0.287 --- *with* the patch on my desktop (16 GB) ## 1.489 0.031 1.522 --- for R-patched (i.e. w/o the patch)> dim(mff)[1] 128 139968> object.size(mff)154791504 bytes --- BTW: These example would gain tremendously if I finally got around to provide model.matrix(........, sparse = TRUE) which would then produce a Matrix-package sparse matrix. Even for this somewhat small case, a sparse matrix is a factor of 13.5 x smaller :> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); as.vector( s1/s2 )[1] 13.47043 I'm happy to collaborate with you on adding such a (C level) interface to sparse matrices for this case. Martin Maechler
Thanks. Couldn't you implement model.matrix(..., sparse = TRUE) with a small amount of R code similar to MatrixModels::model.Matrix ? On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:>>>>>> Karl Millar via R-devel <r-devel at r-project.org> >>>>>> on Fri, 26 Feb 2016 15:58:20 -0800 writes: > > > Generating a model matrix with very large numbers of > > columns overflows the stack and/or runs very slowly, due > > to the implementation of TrimRepeats(). > > > This patch modifies it to use Rf_duplicated() to find the > > duplicates. This makes the running time linear in the > > number of columns and eliminates the recursive function > > calls. > > Thank you, Karl. > I've committed this (very slightly modified) to R-devel, > > (also after looking for a an example that runs on a non-huge > computer and shows the difference) : > > nF <- 11 ; set.seed(1) > lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF]) > str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels)) > ## 'data.frame': 128 obs. of 11 variables: > ## $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ... > ## $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ... > ## $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ... > ## $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ... > ## $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ... > ## $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ... > ## $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ... > ## $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ... > ## $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ... > ## $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ... > ## $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ... > ## > ## [1] 139968 > > system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert"))) > ## user system elapsed > ## 0.255 0.033 0.287 --- *with* the patch on my desktop (16 GB) > ## 1.489 0.031 1.522 --- for R-patched (i.e. w/o the patch) > >> dim(mff) > [1] 128 139968 >> object.size(mff) > 154791504 bytes > > --- > > BTW: These example would gain tremendously if I finally got > around to provide > > model.matrix(........, sparse = TRUE) > > which would then produce a Matrix-package sparse matrix. > > Even for this somewhat small case, a sparse matrix is a factor > of 13.5 x smaller : > >> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); as.vector( s1/s2 ) > [1] 13.47043 > > I'm happy to collaborate with you on adding such a (C level) > interface to sparse matrices for this case. > > Martin Maechler