Woolner, Keith
2008-Jul-11 15:50 UTC
[R] More compact form of lm object that can be used for prediction?
Hi everyone,

Is there a way to take an lm() model and strip it down to a minimal form (or convert it to another type of object) that can still be used to predict the dependent variable?

Background: I have a series of 6 lm() models, each fit on the same data frame of approximately 500,000 rows. I eventually want to predict all 6 dependent variables on a new data frame, which also has several hundred thousand rows. That is, I need to create:

lm.1 <- lm(y1 ~ x1 + f + g, data = my.500k.rows)
lm.2 <- lm(y2 ~ x2 + f + g, data = my.500k.rows)
...
lm.6 <- lm(y6 ~ x6 + f + g, data = my.500k.rows)

and then predict y1 ... y6 for another large data set:

predict(lm.1, newdata = another.500k.rows)
predict(lm.2, newdata = another.500k.rows)
...
predict(lm.6, newdata = another.500k.rows)

Because of the size of the input data frame, the individual model objects are quite large. Through some probably ill-advised tinkering, I've found that I can strip some of the larger components out of the models (e.g. residuals, effects, fitted.values) and still use predict() to generate predicted values on new data. However, the "qr" component cannot be removed, and using qr=FALSE in the call to lm() makes predict() on the resulting model object return all zeroes. The qr matrix consumes an amount of memory proportional to the input data, so on my system (Windows XP, running R 2.7.1 with /3GB and --max-mem-size=2899M enabled) storing 6 such models simultaneously is impossible: I get "unable to allocate vector of size <N>" errors during processing.

If I'm not mistaken, one could in principle generate the predicted values using just the coefficients (and perhaps a couple of other pieces of metadata, such as the model formula and factor levels), but I'm not sure whether there's a straightforward way to do so in R.

The example below shows what I am trying to do conceptually, with just one model instead of six.

#########################

pop.size <- 500  # actual data size closer to 500,000

f.fac <- as.factor(c("A", "B", "C", "D"))
g.fac <- as.factor(c("W", "X", "Y", "Z"))

my.data <- data.frame(
    f  = f.fac[sample(1:4, pop.size, replace = TRUE)]
  , g  = g.fac[sample(1:4, pop.size, replace = TRUE)]
  , x1 = runif(pop.size, 0, 1)
  , x2 = runif(pop.size, 0, 1)
)

my.data$y1 <- my.data$x1 * rnorm(pop.size, -5, 5)
my.data$y2 <- my.data$x2 * rnorm(pop.size, -5, 5)

# Create model - tried using qr=FALSE, but it made prediction later on fail
lm.1 <- lm(y1 ~ x1 + f + g, data = my.data)

# Show the sizes of the different components of the model object
object.size(lm.1)
do.call("rbind", lapply(names(lm.1), function(x) {
    list(name = x, size = object.size(lm.1[[x]]))
}))

# Create new data we want predictions for
my.predict <- data.frame(
    f  = f.fac[sample(1:4, pop.size, replace = TRUE)]
  , g  = g.fac[sample(1:4, pop.size, replace = TRUE)]
  , x1 = runif(pop.size, 0, 1)
  , x2 = runif(pop.size, 0, 1)
)

# Predict using standard R functionality. This works, but because the
# model objects are so large, I can't hold all of them in memory.
predictions <- predict(lm.1, newdata = my.predict)

# Pretend we have a magic function that creates a minimally-sized model
# object from the coefficients that can still be used to predict values,
# but takes up far less memory than a standard lm object.
lm.compact.1 <- compactify(lm.1)

# Goal: be able to generate the same predictions as with the standard
# methods, but with a much more compact model object.
predictions <- predict(lm.compact.1, newdata = my.predict)

#########################

One of my initial thoughts was to somehow automatically create a function from the model specification and coefficients that could then be used to generate predicted values:

# Internally, this would create a function that conceptually looks like:
# function(x1, f, g) { coef.x1 * x1 + coef.f * f + coef.g * g }
lm.1.func <- model.to.function(coef(lm.1), formula(lm.1))
predictions <- lm.1.func(my.predict$x1, my.predict$f, my.predict$g)
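To make that concrete, here is a rough, untested sketch of the kind of thing I mean (model.to.function is a made-up name; this version takes the fitted lm object rather than coefficients and a formula, returns a closure that predicts from a data frame, and ignores complications like rank-deficient fits and missing values):

model.to.function <- function(fit) {
    tt    <- delete.response(terms(fit))  # formula info, minus the response
    beta  <- coef(fit)
    xlev  <- fit$xlevels                  # factor levels seen at fit time
    contr <- fit$contrasts
    rm(fit)  # don't capture the full lm object in the closure
    function(newdata) {
        mf <- model.frame(tt, newdata, xlev = xlev)
        X  <- model.matrix(tt, mf, contrasts.arg = contr)
        drop(X %*% beta)
    }
}

lm.1.func   <- model.to.function(lm.1)
predictions <- lm.1.func(my.predict)

The closure would only hold on to the terms, coefficients, factor levels, and contrasts, so it should be tiny compared with the full lm object, but I haven't verified it beyond toy cases.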
However, I'm open to any approach that allows the predicted values to be generated while consuming significantly less memory. I've searched the list archives and seen references to the "biglm" package, but that appears to be intended for input data that is larger than the system's memory can hold, rather than for keeping the resulting model object's size to a minimum.

Thanks in advance for any guidance.

Keith
Marc Schwartz
2008-Jul-11 16:14 UTC
[R] More compact form of lm object that can be used for prediction?
on 07/11/2008 10:50 AM Woolner, Keith wrote:

> Hi everyone,
>
> Is there a way to take an lm() model and strip it to a minimal form
> (or convert it to another type of object) that can still be used to
> predict the dependent variable?

<snip>

Depending upon how much memory you need to conserve and what else you may need to do with the model object:

1. lm(YourFormula, data = YourData, model = FALSE)

   'model = FALSE' will result in the model frame not being retained.

2. lm(YourFormula, data = YourData, model = FALSE, x = FALSE)

   'x = FALSE' will result in the model matrix not being retained.

See ?lm for more information.

HTH,

Marc Schwartz
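As a quick illustration, reusing 'my.data' and 'my.predict' from your example (exact sizes will vary with your data):

lm.full <- lm(y1 ~ x1 + f + g, data = my.data)
lm.slim <- lm(y1 ~ x1 + f + g, data = my.data, model = FALSE)

object.size(lm.full)  # larger: includes the model frame
object.size(lm.slim)  # smaller: model frame not retained

# Predictions on new data are unaffected:
all.equal(predict(lm.full, newdata = my.predict),
          predict(lm.slim, newdata = my.predict))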