Paulo Barata
2010-Jul-03 03:33 UTC
[R] logistic regression - glm() - example in Dalgaard's book ISwR
Dear R-list members,

I would like to pose a question about the use and results of the glm() function for logistic regression calculations.

The question is based on an example provided on p. 229 of P. Dalgaard, Introductory Statistics with R, 2nd edition, Springer, 2008. By means of this example, I was trying to practice the different ways of entering data in glm().

In his book, Dalgaard provides data from a case-based study about hypertension, summarized in the form of a table. He shows two ways of entering the response (dependent) variable data in glm(): (1) as a matrix of successes/failures (diseased/healthy); (2) as the proportion of people diseased for each combination of the independent variables' categories.

I tried to enter the response variable in glm() in a third way: by reconstructing, from the table, the original data in a case-based format, that is, a data frame in which each row shows the data for one person. In this situation, the response variable is coded as a numeric 0/1 vector, 0 = failure, 1 = success, and it is entered in glm() as such.

The program below presents the calculations for each of the three ways of entering data; the first and second methods were copied directly from Dalgaard's book.

The three methods produce the same results with regard to the estimated coefficients, when the output is seen with five decimals (although regression 3 seems to have produced slightly different std. errors).

My main question is: why are the residual deviance, its degrees of freedom, and the AIC produced by regression 3 completely different from those produced by regressions 1 and 2 (which seem to produce equal results)? It seems that the degrees of freedom in regressions 1 and 2 are based on the size (number of rows) of table d (see the output of the program below), but this table is just a way of summarizing the data. The degrees of freedom in regressions 1 and 2 are not based on the actual number of cases (people) examined, which is n = 433.

I understand that no matter how the data are entered in glm(), we are always analyzing the same data, namely those presented in table format on Dalgaard's page 229 (these data are in data frame d in the program below). So I would expect the three ways of entering data in glm() to produce the same results.

Secondarily, why are the std. errors in regression 3 slightly different from those calculated in regressions 1 and 2?

I am using R version 2.11.1 on Windows XP.

Thank you very much.

Paulo Barata

##== begin ================================================

## data in: P. Dalgaard, Introductory Statistics with R,
## 2nd edition, Springer, 2008
## logistic regression - example in Dalgaard's Section 13.2,
## page 229

rm(list=ls())

## data provided on Dalgaard's page 229:
no.yes <- c("No","Yes")
smoking <- gl(2,1,8,no.yes)
obesity <- gl(2,2,8,no.yes)
snoring <- gl(2,4,8,no.yes)
n.tot <- c(60,17,8,2,187,85,51,23)
n.hyp <- c(5,2,1,0,35,13,15,8)

d <- data.frame(smoking,obesity,snoring,n.tot,n.hyp)
## d is the data to be analyzed, in table format
## d is the first table on Dalgaard's page 229
## n.tot = total number of cases
## n.hyp = people with hypertension
d

## regression 1: Dalgaard's page 230
## response variable entered in glm() as a matrix of
## successes/failures
hyp.tbl <- cbind(n.hyp, n.tot - n.hyp)
regression1 <- glm(hyp.tbl ~ smoking + obesity + snoring,
                   family = binomial("logit"))

## regression 2: Dalgaard's page 230
## response variable entered in glm() as proportions
prop.hyp <- n.hyp / n.tot
regression2 <- glm(prop.hyp ~ smoking + obesity + snoring,
                   weights = n.tot, family = binomial("logit"))

## regression 3 (well below): data entered in glm()
## by means of 'reconstructed' variables.
## Variables with names beginning with 'r' are
## 'reconstructed' from data in data frame d.
## The objective is to reconstruct the original
## data from which the table on Dalgaard's page 229
## has been produced.

rsmoking <- c(rep('No',d[1,4]),rep('Yes',d[2,4]),
              rep('No',d[3,4]),rep('Yes',d[4,4]),
              rep('No',d[5,4]),rep('Yes',d[6,4]),
              rep('No',d[7,4]),rep('Yes',d[8,4]))
rsmoking <- factor(rsmoking)
length(rsmoking)  # just a check

robesity <- c(rep('No', d[1,4]),rep('No', d[2,4]),
              rep('Yes',d[3,4]),rep('Yes',d[4,4]),
              rep('No', d[5,4]),rep('No', d[6,4]),
              rep('Yes',d[7,4]),rep('Yes',d[8,4]))
robesity <- factor(robesity)
length(robesity)  # just a check

rsnoring <- c(rep('No', d[1,4]),rep('No', d[2,4]),
              rep('No', d[3,4]),rep('No', d[4,4]),
              rep('Yes',d[5,4]),rep('Yes',d[6,4]),
              rep('Yes',d[7,4]),rep('Yes',d[8,4]))
rsnoring <- factor(rsnoring)
length(rsnoring)  # just a check

rhyp <- c(rep(1,d[1,5]),rep(0,d[1,4]-d[1,5]),
          rep(1,d[2,5]),rep(0,d[2,4]-d[2,5]),
          rep(1,d[3,5]),rep(0,d[3,4]-d[3,5]),
          rep(1,d[4,5]),rep(0,d[4,4]-d[4,5]),
          rep(1,d[5,5]),rep(0,d[5,4]-d[5,5]),
          rep(1,d[6,5]),rep(0,d[6,4]-d[6,5]),
          rep(1,d[7,5]),rep(0,d[7,4]-d[7,5]),
          rep(1,d[8,5]),rep(0,d[8,4]-d[8,5]))
length(rhyp)  # just a check

sum(n.tot)  # just a check

## data.frame(rsmoking,robesity,rsnoring,rhyp) provides
## the data in a case-based format - each row presents
## data for one case (one person), with the response
## variable coded 0/1 for No/Yes.
## The first five cases (people in the study):
data.frame(rsmoking,robesity,rsnoring,rhyp)[1:5,]

## regression 3: response variable entered in glm()
## as a numeric 0/1 vector
regression3 <- glm(rhyp ~ rsmoking + robesity + rsnoring,
                   family = binomial("logit"))

summary(regression1)
summary(regression2)
summary(regression3)

## note the different residual deviance, degrees of freedom
## and AIC in regression 3 compared with regressions 1 and 2
##== end ==================================================

----------------------------------------------------------
Paulo Barata
Fundacao Oswaldo Cruz - Oswaldo Cruz Foundation
Rua Leopoldo Bulhoes 1480 - 8A
21041-210  Rio de Janeiro - RJ
Brazil

E-mail: pbarata at infolink.com.br
Alternative e-mail: paulo.barata at ensp.fiocruz.br
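P.S. A more compact way to build the case-based data used in regression 3 (a sketch only - equivalent to the block of rep() calls above, not taken from Dalgaard's book):

## repeat row i of table d n.tot[i] times: one row per person
expanded <- d[rep(1:nrow(d), d$n.tot), c("smoking", "obesity", "snoring")]
## within each covariate pattern: n.hyp ones, then (n.tot - n.hyp) zeros
expanded$hyp <- unlist(mapply(function(yes, tot) rep(1:0, c(yes, tot - yes)),
                              d$n.hyp, d$n.tot))
regression3b <- glm(hyp ~ smoking + obesity + snoring,
                    family = binomial("logit"), data = expanded)
## regression3b should reproduce regression 3 exactly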
Juliet Hannah
2010-Jul-03 13:48 UTC
[R] logistic regression - glm() - example in Dalgaard's book ISwR
You may find both of Alan Agresti's books on categorical data analysis useful. Try googling both books and then searching for the word "grouped" within each one. Agresti refers to the difference you describe as grouped versus ungrouped data. The likelihoods differ, and all summaries based on the likelihood will also differ.

On Fri, Jul 2, 2010 at 11:33 PM, Paulo Barata <pbarata at infolink.com.br> wrote:
> My main question is: Why are the residual deviance, its
> degrees of freedom and the AIC produced by regression 3
> completely different when compared to those produced by
> regressions 1 and 2 (which seem to produce equal results)?
> [rest of the original message trimmed]
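A small check of this point in R (a sketch of mine, assuming the objects from your program are still in the workspace): the two likelihoods - and hence the deviances and AICs - differ, but likelihood-ratio comparisons of nested models agree.

## LR test for dropping smoking, grouped (table) data
grouped.full <- glm(cbind(n.hyp, n.tot - n.hyp) ~ smoking + obesity + snoring,
                    family = binomial)
anova(update(grouped.full, . ~ . - smoking), grouped.full, test = "Chisq")

## the same LR test with the ungrouped (case-based) data
ungrouped.full <- glm(rhyp ~ rsmoking + robesity + rsnoring,
                      family = binomial)
anova(update(ungrouped.full, . ~ . - rsmoking), ungrouped.full, test = "Chisq")

## the residual deviances differ between the two framings,
## but the deviance difference on 1 df is identical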
David Winsemius
2010-Jul-03 14:00 UTC
[R] logistic regression - glm() - example in Dalgaard's book ISwR
On Jul 2, 2010, at 11:33 PM, Paulo Barata wrote:

> My main question is: Why are the residual deviance, its
> degrees of freedom and the AIC produced by regression 3
> completely different when compared to those produced by
> regressions 1 and 2 (which seem to produce equal results)?
> [...]
> The degrees of freedom in regressions 1 and 2 are not based
> on the actual number of cases (people) examined, which is n=433.

I first encountered this phenomenon 25 years ago when using GLIM. The answer from my statistical betters was that the deviance is actually only established up to a constant, and that it is only differences in deviance that can be properly interpreted. The same situation exists with indefinite integrals in calculus.

> I understand that no matter the way of entering the data
> in glm(), we are always analyzing the same data [...]
> So I understand that the three ways of entering data
> in glm() should produce the same results.

The differences between equivalent nested models should remain the same (up to numerical accuracy):

  411.42  on 432 degrees of freedom
- 398.92  on 429
-----------------
   12.50  on   3

  14.1259 on   7 degrees of freedom
-  1.6184 on   4
-----------------
  12.5075 on   3
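You can verify this directly (a quick sketch, using the fitted objects from your own script): the constant drops out of any difference, so null minus residual deviance agrees between the two fits even though the deviances themselves do not.

regression1$null.deviance - regression1$deviance  # 14.1259 - 1.6184 = 12.5075
regression3$null.deviance - regression3$deviance  # 411.42 - 398.92; the
                                                  # unrounded value is also 12.5075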
> Secondarily, why are the std.errors in regression 3 slightly
> different from those calculated in regressions 1 and 2?

You mean the differences 4 places to the right of the decimal???
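To see just how small they are, compare the standard-error columns of the two summaries side by side (again a sketch using the fitted objects from your script):

round(summary(regression1)$coefficients[, "Std. Error"], 5)
round(summary(regression3)$coefficients[, "Std. Error"], 5)

## as far as I can tell, for the logit link the information matrices of the
## grouped and ungrouped fits are algebraically identical at the MLE, so the
## tiny discrepancies are numerical (where IRLS stopped), not a different model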
> rm(list=ls())

Personally, I am rather annoyed when people post this particular line without commenting it out. It is basically saying that your code is terribly much more important than whatever else might be in a user's workspace.

David Winsemius, MD
West Hartford, CT