Dear all,

I am using the biglm package to fit a few GLMs to a large dataset (3 million rows, 6 columns). While trying to fit a Poisson GLM I noticed that the coefficient estimates were very different from what I obtained when estimating the model on a smaller dataset using glm(). I wrote a very basic toy example to compare the results of bigglm() against a glm() call. Consider the following code:

> require(biglm)
> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
> summary(m1)

<snipped output for this email>
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.30305    0.00141    1629   <2e-16 ***
ttment2      0.40429    0.00183     221   <2e-16 ***

Null deviance: 151889  on 99999  degrees of freedom
Residual deviance: 101848  on 99998  degrees of freedom
AIC: 533152

> summary(m1big)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"))
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.651 2.650 2.653 0.001 0
ttment2     4.346 4.344 4.348 0.001 0

> m1big$deviance
[1] 287158986

Notice that the coefficients and deviance are quite different in the model estimated using bigglm(). If I change the chunksize to seq(1000,10000,1000) the estimates remain the same.

Can someone help me understand what is causing these differences?

Here is my version info:

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          8.1
year           2008
month          12
day            22
svn rev        47281
language       R
version.string R version 2.8.1 (2008-12-22)

Many thanks in advance for your help,

Francisco

--
Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com
This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct. Your model hasn't converged yet. You can get the right answer either by running longer:

> summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"), chunksize = 100000, maxit = 20)
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or supplying starting values:

> summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"), chunksize = 100000, start = c(2, 0))
Sample size =  100000
             Coef  (95%    CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

The bug is that you weren't told about the lack of convergence. There is a flag in the object, but it is only set after the model has converged, so it is not there when convergence fails.

> m1big$converged
NULL
> m1big_longer$converged
[1] TRUE
> m1big_started$converged
[1] TRUE

For the next version I will make sure there is a clear warning when the model hasn't converged. The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations. That strategy obviously breaks down when you don't notice the failure. :(

-thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:
> [quoted original message snipped]

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
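Thomas's advice can be collected into a small defensive pattern. The sketch below assumes the biglm package behavior described in this thread: the `converged` element is absent (NULL) until the fit converges, and `bigglm()` accepts `maxit` and `start` arguments.

```r
library(biglm)

# Reproduce the toy data from the thread
set.seed(12345)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))

# Fit with an explicit iteration cap; the default maxit is small by design
m1big <- bigglm(y ~ ttment, data = dat,
                family = poisson(link = "log"), maxit = 20)

# The converged flag is only set once the model has converged,
# so test it with isTRUE() rather than trusting it to exist
if (!isTRUE(m1big$converged)) {
  warning("bigglm did not converge; increase maxit or supply start values")
}

coef(m1big)  # should now agree with glm() to several decimals
```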
Dear Francisco,

I was able to duplicate the problem that you reported, and in addition discovered that the problem seems to be peculiar to the poisson family. lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat). Another example, with the binomial family:

> set.seed(12345)
> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
> dat$y0 <- ifelse(dat$y > 12, 1, 0)
> m1b <- glm(y0~ttment, data=dat, family=binomial)
> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
> coef(m1b)
(Intercept)     ttment2
   -1.33508     2.34263
> coef(m1bigb)
(Intercept)     ttment2
   -1.33508     2.34263
> deviance(m1b)
[1] 109244
> deviance(m1bigb)
[1] 109244

I'm copying this message to Thomas Lumley, who's the author and maintainer of the biglm package.

Regards,
John

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Francisco J. Zagmutt
> Sent: March-16-09 10:26 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] bigglm() results different from glm()
>
> [quoted original message snipped]
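John's side-by-side comparison can also be done numerically rather than by eye. The sketch below is an assumption-light consistency check using only base R plus the biglm package; `coef()` and `deviance()` work on bigglm fits, as John's output above shows.

```r
library(biglm)

set.seed(12345)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))
dat$y0 <- ifelse(dat$y > 12, 1, 0)

# Fit the same binomial model both ways
m1b    <- glm(y0 ~ ttment, data = dat, family = binomial)
m1bigb <- bigglm(y0 ~ ttment, data = dat, family = binomial())

# Compare the two fits numerically instead of scanning printed output;
# unname() drops coefficient names so only the values are compared
coefs_agree <- isTRUE(all.equal(unname(coef(m1b)), unname(coef(m1bigb)),
                                tolerance = 1e-4))
dev_agree   <- isTRUE(all.equal(deviance(m1b), deviance(m1bigb),
                                tolerance = 1e-4))
```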