Hi Sir,

Thanks for making the package available to us. I am facing a few problems; perhaps you can give some hints:

Problem 1:
The model summary and residual deviance matched (in the mail below), but I didn't understand why the AIC is still different.

> AIC(m1)
[1] 532965
> AIC(m1big_longer)
[1] 101442.9

Problem 2:
The chunksize argument is there in bigglm but not in biglm; consequently, update.biglm is there, but not update.bigglm.
Is my observation correct? If yes, why is this difference?

Regards
Utkarsh


From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Tue, 17 Mar 2009 00:50:20 -0700 (PDT)

This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct.

Your model hasn't converged yet. You can get the right answer either by running longer:

> summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
    chunksize = 100000, maxit = 20)
Sample size = 100000
             Coef (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or by supplying starting values:

> summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
    chunksize = 100000, start = c(2, 0))
Sample size = 100000
             Coef (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

The bug is that you weren't told about the lack of convergence. There is a flag in the object, but it is only set after the model has converged, so it is not there when convergence fails.

> m1big$converged
NULL
> m1big_longer$converged
[1] TRUE
> m1big_started$converged
[1] TRUE

For the next version I will make sure there is a clear warning when the model hasn't converged. The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations. This strategy obviously breaks down when you don't notice the failure. :(

-thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:

> Dear all,
>
> I am using the biglm package to fit a few GLMs to a large dataset (3 million
> rows, 6 columns). While trying to fit a Poisson GLM I noticed that the
> coefficient estimates were very different from what I obtained when
> estimating the model on a smaller dataset using glm(), so I wrote a very
> basic toy example to compare the results of bigglm() against a glm() call.
> Consider the following code:
>
>> require(biglm)
>> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
>> dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)), ttment = gl(2, 50000))
>> m1 <- glm(y ~ ttment, data = dat, family = poisson(link = "log"))
>> m1big <- bigglm(y ~ ttment, data = dat, family = poisson(link = "log"))
>> summary(m1)
>
> <snipped output for this email>
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  2.30305    0.00141    1629   <2e-16 ***
> ttment2      0.40429    0.00183     221   <2e-16 ***
>
> Null deviance:     151889 on 99999 degrees of freedom
> Residual deviance: 101848 on 99998 degrees of freedom
> AIC: 533152
>
>> summary(m1big)
> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> poisson(link = "log"))
> Sample size = 100000
>              Coef (95%    CI)    SE p
> (Intercept) 2.651 2.650 2.653 0.001 0
> ttment2     4.346 4.344 4.348 0.001 0
>
>> m1big$deviance
> [1] 287158986
>
> Notice that the coefficients and deviance are quite different in the model
> estimated using bigglm(). If I change the chunksize to seq(1000, 10000, 1000)
> the estimates remain the same.
>
> Can someone help me understand what is causing these differences?
>
> Here is my version info:
>
>> version
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> major          2
> minor          8.1
> year           2008
> month          12
> day            22
> svn rev        47281
> language       R
> version.string R version 2.8.1 (2008-12-22)
>
> Many thanks in advance for your help,
>
> Francisco
>
> --
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com

Thomas Lumley            Assoc. Professor, Biostatistics
tlumley_at_u.washington.edu    University of Washington, Seattle
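A minimal sketch collecting the two remedies shown above, plus a convergence check, assuming the simulated dat from the original post (all argument names appear in the thread; isTRUE() matters because the converged flag is NULL rather than FALSE when the fit fails):

library(biglm)

## Remedy 1: allow more IRLS iterations than the small default.
m1big_longer <- bigglm(y ~ ttment, data = dat,
                       family = poisson(link = "log"),
                       chunksize = 100000, maxit = 20)

## Remedy 2: supply starting values near the solution.
m1big_started <- bigglm(y ~ ttment, data = dat,
                        family = poisson(link = "log"),
                        chunksize = 100000, start = c(2, 0))

## Guard with isTRUE(): the flag is absent (NULL), not FALSE,
## when convergence fails.
if (!isTRUE(m1big_longer$converged)) warning("bigglm did not converge")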
I don't know why the AIC is different, but remember that there are multiple definitions of AIC (generally differing in the constant added), so it may just be a difference in the constant; or it could be that you have not fit the whole dataset (based on your other question).

For an lm model, biglm only needs to make a single pass through the data. This was the first function written for the package, and the update mechanism was an easy way to write it (and it still works well). The bigglm function came later, and models other than Gaussian require multiple passes through the data. So instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart at the beginning of the dataset.

Also note that bigglm usually does only a few passes through the data. Usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of utkarshsinghal
> Sent: Friday, July 03, 2009 7:26 AM
> To: tlumley_at_u.washington.edu; r help
> Subject: Re: [R] bigglm() results different from glm()
>
> <quoted message snipped; see above>
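A minimal sketch of such a restartable data function, following the reset protocol described in the biglm documentation: bigglm() calls it with reset = TRUE to rewind before each pass, then repeatedly with reset = FALSE to get chunks, expecting NULL when the data are exhausted. The name make_chunks and its chunking of an in-memory data frame are illustrative; in practice the chunks would come from a file or database.

make_chunks <- function(df, chunksize = 10000) {
  pos <- 1
  function(reset = FALSE) {
    if (reset) {                       # rewind for the next pass
      pos <<- 1
      return(invisible(NULL))
    }
    if (pos > nrow(df)) return(NULL)   # signal end of this pass
    chunk <- df[pos:min(pos + chunksize - 1, nrow(df)), , drop = FALSE]
    pos <<- pos + chunksize
    chunk
  }
}

m1big_fun <- bigglm(y ~ ttment, data = make_chunks(dat, 10000),
                    family = poisson(link = "log"), maxit = 20)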
Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh

Thomas Lumley wrote:
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>> Problem-1:
>> The model summary and residual deviance matched (in the mail below),
>> but I didn't understand why the AIC is still different.
>>
>>> AIC(m1)
>> [1] 532965
>>
>>> AIC(m1big_longer)
>> [1] 101442.9
>
> That's because AIC.default uses the unnormalized log-likelihood and
> AIC.biglm uses the deviance. Only differences in AIC between models
> are meaningful, not individual values.
>
>> Problem-2:
>> The chunksize argument is there in bigglm but not in biglm; consequently,
>> update.biglm is there, but not update.bigglm.
>> Is my observation correct? If yes, why is this difference?
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over. You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
> -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlumley at u.washington.edu    University of Washington, Seattle
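Two short sketches to make these points concrete. First, since AIC.default and AIC.biglm differ in their constants, only compare AIC values computed by the same method; the intercept-only model m0big here is hypothetical:

## Compare two bigglm fits to each other; never compare a bigglm
## AIC to a glm AIC, since the two methods differ by a constant.
m0big <- bigglm(y ~ 1, data = dat, family = poisson(link = "log"),
                chunksize = 100000, maxit = 20)
AIC(m1big_longer) - AIC(m0big)   # meaningful: same AIC method
## AIC(m1big_longer) - AIC(m1)   # NOT meaningful across methods

Second, a sketch of the single-pass update mechanism that exists for biglm but cannot exist for bigglm (a Gaussian fit to these Poisson counts, purely to show the mechanics; the chunk boundaries are illustrative):

## Fit on the first chunk, then fold in the remaining chunks with
## update(); each row of data is visited exactly once.
fit <- biglm(y ~ ttment, data = dat[1:25000, ])
for (i in 1:3) {
  rows <- (25000 * i + 1):(25000 * (i + 1))
  fit <- update(fit, dat[rows, ])   # add the next chunk to the fit
}
summary(fit)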