I am using rpart to build a model for later predictions. To preserve the model across restarts and share it across nodes, I have been using "save" to persist the result of rpart to a file and "load" it later. But the saved file was becoming unusually large (even in binary, compressed mode), and the size was proportional to the amount of data used to create the model.

After tinkering a bit, I figured out that most of the size comes from the rpart$functions component. If I set it to NULL, the size drops dramatically. This can be seen with the following lines of R code; the difference is small here, but it becomes much more pronounced with large datasets.

   library(rpart)
   fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
   save(fit, file = "fit1.sav")
   fit$functions <- NULL
   save(fit, file = "fit2.sav")

What is the reason behind this? The functions themselves seem small, so where is all the bulk coming from?

Thanks,
Tan
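P.S. A rough way to see where the bytes go, in case it helps anyone reproduce this. This is only a sketch: the exact names inside fit$functions depend on the rpart version and splitting method, but in current versions it is a plain list of closures.

   library(rpart)
   fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

   # approximate serialized footprint of each component, in bytes;
   # fit$functions dominates even though the functions themselves are tiny
   sort(sapply(fit, function(x) length(serialize(x, NULL))), decreasing = TRUE)

   # the stored functions carry their defining environments along with them
   lapply(fit$functions, environment)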
Prof Brian Ripley
2009-Feb-03 11:56 UTC
[R] Large file size while persisting rpart model to disk
On Tue, 3 Feb 2009, tan wrote:

> [...]
> What is the reason behind this? The functions themselves seem small, so
> where is all the bulk coming from?

Their environments.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Dear Prof. Ripley,

Thanks for the quick reply. I do notice an <environment: ...> in the print output. I assume it is used to keep copies of the initial data used for the model.

- Is it safe to assume that setting it to NULL would not affect any other functionality, apart from the use of those particular functions?
- Is there a better/recommended way of reducing the size?

Thanks,
Tan

On Feb 3, 4:56 pm, Prof Brian Ripley <rip... at stats.ox.ac.uk> wrote:
> On Tue, 3 Feb 2009, tan wrote:
> > [...]
> > What is the reason behind this? The functions themselves seem small, so
> > where is all the bulk coming from?
>
> Their environments.
Terry Therneau
2009-Feb-04 14:31 UTC
[R] Large file size while persisting rpart model to disk
In R, functions remember their entire calling chain. The good thing about this is that they can find variables further up in the nested context, i.e.,

   myfun <- function(x) { x + y }

will look for 'y' in the function that called myfun, then in the function that called that function, ... on up, and then through the search() list. This makes life easier for certain things such as minimizers.

The bad thing is that to make this work R has to remember all of the variables that were available up the entire chain, and 99-100% of them aren't necessary. (Because of constructs like get(varname), a parser can't read the code and decide what might be needed.)

This is an issue with embedded functions. I recently noticed an extreme case of it in the pspline routine and made changes to fix it. The short version:

   pspline <- function(x, ...other args) {
       # some computations to define an X matrix, which can be large
       # define a print function
       ...
       return(X, printfun, other stuff)
   }

It's even worse in the frailty functions, where X can be VERY large. The print function's environment wanted to 'remember' all of the temporary work that went into defining X, plus X itself, and so would be huge.

My solution was to add the line

   environment(printfun) <- new.env(parent = baseenv())

which marks the function as not needing anything from the local environment, only the base R definitions. This would probably be a good addition to rpart, but I need to look closer.

My first cut was to use emptyenv(), but that wasn't so smart. It leaves everything undefined, like "+" for instance. :-)

   Terry Therneau
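P.S. A minimal self-contained illustration of the effect (not the actual pspline code; the names make_big, make_small and printfun are made up for the example):

   make_big <- function(n = 1e6) {
       X <- matrix(rnorm(n), ncol = 10)   # large temporary, silently captured
       printfun <- function(coef) cat("coefficients:", coef, "\n")
       list(result = colMeans(X), printfun = printfun)
   }

   make_small <- function(n = 1e6) {
       X <- matrix(rnorm(n), ncol = 10)
       printfun <- function(coef) cat("coefficients:", coef, "\n")
       # reset the environment so X is no longer reachable from printfun
       environment(printfun) <- new.env(parent = globalenv())
       list(result = colMeans(X), printfun = printfun)
   }

   # serialized size of just the returned function, in bytes
   length(serialize(make_big()$printfun, NULL))    # huge: drags X along
   length(serialize(make_small()$printfun, NULL))  # tiny: only the function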
Terry Therneau
2009-Feb-04 15:05 UTC
[R] Large file size while persisting rpart model to disk
Brian R makes good points. I made a mistake in the prior post: it should have said new.env(parent = globalenv()) for pspline. You want the saved function to pay attention to the search() path. This is what is actually in the code; I was guilty of mistyping.

If the print function uses a non-exported function from the enclosing package, then we need to be more careful. This is the case for rpart.
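P.S. Applied to the original poster's problem, a sketch of what this might look like (untested, and assuming fit$functions is a plain list of closures as in current rpart versions; whether those functions really need anything beyond the rpart namespace depends on the method, which is the part I still need to check):

   library(rpart)
   fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

   # give each stored function an empty environment whose parent is the
   # rpart namespace, so the fitting data is no longer serialized with it
   fit$functions <- lapply(fit$functions, function(f) {
       environment(f) <- new.env(parent = asNamespace("rpart"))
       f
   })
   save(fit, file = "fit_small.sav")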
Terry Therneau
2009-Feb-04 19:27 UTC
[R] Large file size while persisting rpart model to disk
Lots of interesting comments while I was off in meetings. (Some days I wonder why they pay me - with so many meetings I certainly don't accomplish any work.) Some responses:

1. To Brian: I think there is another issue outside of save(). Use the frailty.gamma function as a thought example. It's about 3 pages long with lots and lots of temporary variables and computations, at the end of which it returns an X matrix of data and a stack of attributes. One of these is a print function. Some of the temp objects can be really large, large enough that memory recovery may be important. Does not the reference to these in an environment prevent R from reclaiming that memory during the session?

2. Duncan: You objected to my phrase

      myfun <- function(x) { x + y }

   will look for 'y' in the function that called myfun, then in the function that called that function, ... on up, and then through the search() list. This makes life easier for certain things such as minimizers.

   I was writing for ordinary mortals reading code. The distinction you raise between the code and the "current instance of memory objects when the code was being executed" is opaque to many. At least it's tricky for me.

3. On removing variables: I don't like that idea, and think it is much, much clearer to explicitly refer to what you do want than to remove what you don't. I never liked the m$x <- m$y <- m$whozit....... <- NULL construct for that reason, which was once found in most of the modeling functions.

4. Luke: I've read your code suggestion thrice now, and I understand what you are doing less on each pass.

Now, two questions for the pros:

a. I like Brian's suggestion of using asNamespace('survival'), other than the help page that explicitly states that I should never ever call said function. If I don't use any non-exported-from-the-package functions, it seems that globalenv() is the clearest construct, however. How do I know what gets saved and what doesn't? We don't want all the survival functions to be saved on disk with my object, like local variables would be.

b. Is there any difference or preference between

      environment(printfun) <- asNamespace('survival')
      environment(printfun) <- new.env(parent = asNamespace('survival'))

   Terry T.
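P.S. A small experiment that should show what actually lands on disk in either case of (b). This is only an untested sketch, and my reading (which the pros can correct) is that namespace and package environments are serialized as references rather than by content, so neither choice should drag the survival functions onto disk:

   library(survival)

   f <- function() coxph    # 'coxph' is looked up through f's environment

   environment(f) <- asNamespace("survival")
   length(serialize(f, NULL))   # expected: small, the namespace is a reference

   environment(f) <- new.env(parent = asNamespace("survival"))
   length(serialize(f, NULL))   # expected: also small, the extra environment
                                # is empty and its parent is again a reference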