Dear all,

I am using the biglm package to fit a few GLMs to a large dataset (3 million rows, 6 columns). While trying to fit a Poisson GLM I noticed that the coefficient estimates were very different from what I obtained when estimating the model on a smaller dataset using glm(). I wrote a very basic toy example to compare the results of bigglm() against a glm() call. Consider the following code:

> require(biglm)
> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
> summary(m1)

<snipped output for this email>
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.30305    0.00141    1629   <2e-16 ***
ttment2      0.40429    0.00183     221   <2e-16 ***

Null deviance: 151889  on 99999  degrees of freedom
Residual deviance: 101848  on 99998  degrees of freedom
AIC: 533152

> summary(m1big)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"))
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.651 2.650 2.653 0.001 0
ttment2     4.346 4.344 4.348 0.001 0

> m1big$deviance
[1] 287158986

Notice that the coefficients and deviance are quite different in the model estimated using bigglm(). If I change the chunksize to seq(1000,10000,1000) the estimates remain the same.

Can someone help me understand what is causing these differences?

Here is my version info:

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          8.1
year           2008
month          12
day            22
svn rev        47281
language       R
version.string R version 2.8.1 (2008-12-22)

Many thanks in advance for your help,

Francisco

--
Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com
This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct. Your model hasn't converged yet. You can get the right answer either by running longer:

> summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"), chunksize = 100000, maxit = 20)
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or supplying starting values:

> summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"), chunksize = 100000, start = c(2, 0))
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

The bug is that you weren't told about the lack of convergence. There is a flag in the object, but it is only set after the model has converged, so it is not there when convergence fails.

> m1big$converged
NULL
> m1big_longer$converged
[1] TRUE
> m1big_started$converged
[1] TRUE

For the next version I will make sure there is a clear warning when the model hasn't converged. The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations. That strategy obviously breaks down when you don't notice the failure. :(

-thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:
> [quoted original message snipped]

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
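Thomas's advice can be collected into a small defensive pattern. The sketch below assumes the biglm package behavior described in this thread: the `converged` element is absent (NULL) until the fit converges, and `bigglm()` accepts `maxit` and `start` arguments.

```r
library(biglm)

# Reproduce the toy data from the thread
set.seed(12345)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))

# Fit with an explicit iteration cap; the default maxit is small by design
m1big <- bigglm(y ~ ttment, data = dat,
                family = poisson(link = "log"), maxit = 20)

# The converged flag is only set once the model has converged,
# so test it with isTRUE() rather than trusting it to exist
if (!isTRUE(m1big$converged)) {
  warning("bigglm did not converge; increase maxit or supply start values")
}

coef(m1big)  # should now agree with glm() to several decimals
```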
Dear Francisco,

I was able to duplicate the problem that you reported, and in addition discovered that the problem seems to be peculiar to the poisson family. lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat). Another example, with the binomial family:

> set.seed(12345)
> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
> dat$y0 <- ifelse(dat$y > 12, 1, 0)
> m1b <- glm(y0~ttment, data=dat, family=binomial)
> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
> coef(m1b)
(Intercept)     ttment2
   -1.33508     2.34263
> coef(m1bigb)
(Intercept)     ttment2
   -1.33508     2.34263
> deviance(m1b)
[1] 109244
> deviance(m1bigb)
[1] 109244

I'm copying this message to Thomas Lumley, who's the author and maintainer of the biglm package.

Regards,
John

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Francisco J. Zagmutt
> Sent: March-16-09 10:26 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] bigglm() results different from glm()
>
> [quoted original message snipped]
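John's side-by-side comparison can also be done numerically rather than by eye. The sketch below is an assumption-light consistency check using only base R plus the biglm package; `coef()` and `deviance()` work on bigglm fits, as John's output above shows.

```r
library(biglm)

set.seed(12345)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))
dat$y0 <- ifelse(dat$y > 12, 1, 0)

# Fit the same binomial model both ways
m1b    <- glm(y0 ~ ttment, data = dat, family = binomial)
m1bigb <- bigglm(y0 ~ ttment, data = dat, family = binomial())

# Compare the two fits numerically instead of scanning printed output;
# unname() drops coefficient names so only the values are compared
coefs_agree <- isTRUE(all.equal(unname(coef(m1b)), unname(coef(m1bigb)),
                                tolerance = 1e-4))
dev_agree   <- isTRUE(all.equal(deviance(m1b), deviance(m1bigb),
                                tolerance = 1e-4))
```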