Daniel Yarlett
2010-Aug-24 19:16 UTC
[R] Time and space considerations in using predict.glm.
Hello,

I am using R to train a logistic regression model and save the resulting
model to disk. I then reload these saved objects and use predict.glm on
them to make predictions about single-row data frames generated in real
time from requests arriving at an HTTP server. The following code
demonstrates the sort of R calls I have in mind:

> cases <- 2000000
> data <- data.frame(x1=runif(cases), x2=runif(cases),
+                    y=sample(0:1, cases, replace=TRUE))
> lr1 <- glm(y ~ x1*x2, family=binomial, data=data)
> new_data <- data.frame(x1=0, x2=0)
> out <- predict(lr1, type="response", newdata=new_data)

The first thing I am noticing is that the models I am storing are very
large because I am using large data sets, and by default the models seem
to store residuals, fitted values, and so on.

> object.size(lr1)
1056071320 bytes

Access to all this information is not necessary for my application -- all
I really need is access to model$coefficients in order to make my
predictions -- so I am wondering if there is some way to prevent this
information from being stored in the glm objects when they are created
(or of removing it after the models have been trained)? I have discovered
the model=FALSE, x=FALSE, y=FALSE switches to glm(), and these seem to
help somewhat, but perhaps there is some other way of recording only the
coefficients of the model and other minimal details?

> lr2 <- glm(y ~ x1*x2, family=binomial, data=data,
+            model=FALSE, x=FALSE, y=FALSE)
> object.size(lr2)
888065968 bytes

Secondly, on data sets of the scale I am using, predict.glm seems to be
taking a very long time to make its predictions.

> print(system.time(predict(lr1, type="response", newdata=new_data)))
   user  system elapsed
  0.136   0.040   0.175
> print(system.time(predict(lr2, type="response", newdata=new_data)))
   user  system elapsed
  0.109   0.013   0.121

This may be an issue of swap time, and so it could potentially be solved
by addressing my first question above. However, given that I am
essentially asking R to compute

1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))

I can't see any reason why this request should take longer than a
hundredth or a thousandth of a second, say. Obviously R is providing a
much greater level of functionality than I require in this particular
instance, so my overall question is: what is the best way for me to
reduce the size of the data I have to store in my GLM models, and to
increase the speed at which I can use R to generate predictions of this
sort (i.e. for novel x1,x2 pairs)?

I could obviously write a custom function / class which only stores the
model coefficients and computes predictions from these using the equation
above, but before I go down this route I wanted to get some advice from
the R community about whether there might be a better way to address this
problem and/or whether I have missed something obvious (to others). I
also want to avoid writing custom code if possible, because that
obviously means sacrificing the great generality and power of R, which
could clearly be useful in my application down the line.

Many thanks in advance for your assistance,

Dan.
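As a minimal sketch of the slimming Dan is asking about (the slim_glm
helper below is hypothetical; exactly which components predict() can do
without is an assumption to verify on your own model, not something the
documentation guarantees):

# Hypothetical helper (not from the thread): drop the per-observation
# components from a fitted glm before saving it to disk. Assumption:
# predict() on new data needs terms, coefficients, rank, qr$pivot,
# xlevels, contrasts and the family, but none of the pieces removed
# below. Note that summary() and predict(..., se.fit=TRUE) will no
# longer work once qr$qr is gone.
slim_glm <- function(fit) {
  fit$residuals         <- NULL   # working residuals, one per case
  fit$fitted.values     <- NULL   # fitted values, one per case
  fit$linear.predictors <- NULL   # linear predictors, one per case
  fit$weights           <- NULL   # IWLS weights, one per case
  fit$prior.weights     <- NULL   # prior weights, one per case
  fit$effects           <- NULL   # effects vector, one per case
  fit$data              <- NULL   # the training data frame itself
  fit$qr$qr             <- NULL   # QR factor of the model matrix
  fit
}

lr3 <- slim_glm(lr2)
object.size(lr3)                                     # far smaller
predict(lr3, type = "response", newdata = new_data)  # confirm it matches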
David Winsemius
2010-Aug-24 20:31 UTC
[R] Time and space considerations in using predict.glm.
On Aug 24, 2010, at 3:16 PM, Daniel Yarlett wrote:

> Hello,
>
> I am using R to train a logistic regression model and save the
> resulting model to disk. I am then subsequently reloading these saved
> objects, and using predict.glm on them in order to make predictions
> about single-row data frames that are generated in real-time from
> requests arriving at an HTTP server. The following code demonstrates
> the sort of R calls that I have in mind:
>
>> cases <- 2000000
>> data <- data.frame(x1=runif(cases), x2=runif(cases),
>+                    y=sample(0:1, cases, replace=TRUE))
>> lr1 <- glm(y ~ x1*x2, family=binomial, data=data)
>> new_data <- data.frame(x1=0, x2=0)
>> out <- predict(lr1, type="response", newdata=new_data)
>
> The first thing I am noticing is that the models that I am storing
> are very large because I am using large data-sets, and the models
> seem to store residuals, fitted values and so on, by default.
>
>> object.size(lr1)
> 1056071320 bytes
>
> Access to all this information is not necessary for my application --
> all I really need is access to model$coefficients in order to make my
> predictions -- so I am wondering if there is some way to prevent this
> information from getting stored in the glm objects when they are
> created (or of removing it after the models have been trained)? I
> have discovered the model=FALSE, x=FALSE, y=FALSE switches to glm()
> and these seem to help somewhat, but perhaps there is some other way
> of only recording the coefficients of the model and other minimal
> details?

Perhaps instead:

> lr2 <- coef(glm(y ~ x1*x2, family=binomial, data=data,
+                 model=FALSE, x=FALSE, y=FALSE))
> object.size(lr2)

will be much smaller.

> Secondly, on data-sets of the scale I am using, predict.glm seems to
> be taking a very long time to make its predictions.
>
>> print(system.time(predict(lr1, type="response", newdata=new_data)))
>    user  system elapsed
>   0.136   0.040   0.175
>> print(system.time(predict(lr2, type="response", newdata=new_data)))
>    user  system elapsed
>   0.109   0.013   0.121
>
> This may be an issue of swap-time, and so it could potentially be
> solved by addressing my first question above. However, given that I
> am essentially asking R to compute
>
> 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))
>
> I can't see any reason why this request should be taking longer than
> a hundredth or a thousandth of a second, say.

You could try crossprod() with a coefficient vector and a matrix holding
the new case:

1 / (1 + exp(-crossprod(new_data, lr2)))

For example:

> cases <- 2000
> data <- data.frame(x1=runif(cases), x2=runif(cases),
+                    y=sample(0:1, cases, replace=TRUE))
> lr1 <- coef(glm(y ~ x1*x2, family=binomial, data=data))
> # took me a while to figure out that I needed an interaction entry:
> new_data <- matrix(c(1, x1=0, x2=0, x1x2=0), nrow=4)
> out <- 1 / (1 + exp(-crossprod(new_data, lr1)))
> out
          [,1]
[1,] 0.5107252
> lr1
(Intercept)          x1          x2       x1:x2
 0.04290728 -0.16826991 -0.03561711  0.06229122
> object.size(lr1)
456 bytes

> Obviously R is providing a much greater level of functionality than I
> am requiring in this particular instance, so my overall question is
> what is the best way for me to reduce the size of the data I have to
> store in my GLM models, and to increase the speed at which I can use
> R to generate predictions of this sort (i.e. for novel x1,x2 pairs)?
>
> I could obviously write a custom function / class which only stores
> the model coefficients and computes predictions based on these using
> the equation above, but before I go down this route I wanted to get
> some advice from the R community about whether there might be a
> better way to address this problem and/or whether I have missed
> something obvious (to others). I also want to avoid writing custom
> code if possible because that obviously means sacrificing the great
> generality and power of R which could clearly be useful in my
> application down the line.
>
> Many thanks in advance for your assistance,
>
> Dan.

David Winsemius, MD
West Hartford, CT
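A hedged generalization of the crossprod() recipe above (a sketch, not
from the thread): letting model.matrix() expand the formula's right-hand
side builds the intercept and interaction columns automatically, so the
hand-assembled 4 x 1 matrix is not needed and the same code works for any
formula. The coefficient values below are copied from the fit shown
above.

# Sketch: score a new case from stored coefficients only, using
# model.matrix() to build the design row. In practice you would keep
# just coef(fit) on disk, e.g. via saveRDS(coef(fit), "coefs.rds").
lr_coefs <- c("(Intercept)" = 0.04290728, x1 = -0.16826991,
              x2 = -0.03561711, "x1:x2" = 0.06229122)

new_case <- data.frame(x1 = 0, x2 = 0)

X   <- model.matrix(~ x1 * x2, new_case)  # 1 x 4: intercept, x1, x2, x1:x2
eta <- drop(X %*% lr_coefs)               # linear predictor
plogis(eta)                               # 1/(1 + exp(-eta)) = 0.5107252

This keeps the stored object down to a handful of numbers while still
handling the interaction term correctly for any new x1,x2 pair.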