Dear R gurus,

I am looking for a way to fit a predictive model for a contingency table of counts. I found that glm(family=poisson) is very good for figuring out which of several alternative models I should select. But once I select a model it is hard to present and interpret, especially when it has interactions, because everything is expressed "relative to the reference cell". This makes it confusing for the audience.

I found that loglin() gives output that may be much easier to interpret as far as coefficient estimates are concerned, because they are laid out in a nice table and are provided for all levels of all the factors. But I need to be able to explain what the coefficients really mean. For that, I need to understand how they are used in a formula to compute a fitted value.

If loglin() has fitted a model (see example below), what formula would it use to compute the predicted count for, say, the cell with S=H, E=H, P=No in a sample that has a total of 4991 observations? In other words, how did it arrive at the number 270.01843 in the upper left-hand corner of $fit?

I see that loglin() computes exactly the same predictions (fitted values) as

  glm(counts ~ S + E + P + S:E + S:P + E:P, data=wisconsin, family=poisson)

(see below), but it gives different values for the parameter estimates. So I figure the formula it uses to compute the fitted values is not the same as the one used in Poisson regression.

If there is a better way to fit this type of model and present an easy-to-understand, easy-to-interpret coefficient summary, please let me know.

Just in case, I provided the original data at the very bottom.

YZK

#################### use loglin() ###################################

loglin.3 = loglin(wisconsin.table,
                  margin = list(c(1,2), c(1,3), c(2,3)), fit=T, param=T)

> loglin.3
$lrt
[1] 1.575469

$pearson
[1] 1.572796

$df
[1] 3

$margin
$margin[[1]]
[1] "S" "E"

$margin[[2]]
[1] "S" "P"

$margin[[3]]
[1] "E" "P"

$fit
, , P = No

    E
S            H         L
  H  270.01843 148.98226
  L  228.85782 753.14127
  LM 331.04036 625.95942
  UM 373.08339 420.91704

, , P = Yes

    E
S            H         L
  H  795.97572  30.02330
  L  137.14648  30.85410
  LM 301.96657  39.03387
  UM 467.91123  36.08873

$param
$param$`(Intercept)`
[1] 5.275394

$param$S
         H          L         LM         UM
-0.1044289 -0.1734756  0.1286741  0.1492304
# I think this says that we had a lot of S=LM and S=UM kids in our
# sample and relatively few S=L kids

$param$E
        H         L
 0.501462 -0.501462
# I think this says that more kids had E=H than E=L
# sum(wisconsin$counts[wisconsin$E=="L"])   [1] 2085
# sum(wisconsin$counts[wisconsin$E=="H"])   [1] 2906

$param$P
       No        Yes
0.5827855 -0.5827855

$param$S.E
    E
S             H          L
  H   0.4666025 -0.4666025   # kids in S=H were more likely to get E=H than E=L
  L  -0.4263050  0.4263050   # kids in S=L were more likely to get E=L than E=H
  LM -0.1492516  0.1492516
  UM  0.1089541 -0.1089541

$param$S.P
    P
S             No         Yes
  H  -0.45259177  0.45259177
  L   0.34397315 -0.34397315
  LM  0.13390947 -0.13390947
  UM -0.02529085  0.02529085

$param$E.P
   P
E          No       Yes
  H -0.670733  0.670733   # kids with E=H were more likely to have P=Yes
  L  0.670733 -0.670733   # than kids with E=L

############### use glm() ##########################################

> summary(glm2)

Call:
glm(formula = counts ~ S + E + P + S:E + S:P + E:P, family = poisson,
    data = wisconsin)

Deviance Residuals:
       1         2         3         4         5         6         7         8
-0.15119   0.27320   0.04135  -0.05691  -0.04446   0.04719   0.32807  -0.24539
       9        10        11        12        13        14        15        16
 0.73044  -0.35578  -0.16639   0.05952   0.15116  -0.04217  -0.75147   0.14245

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.59850    0.05886  95.116  < 2e-16 ***
SL          -0.16542    0.08573  -1.930  0.05366 .
SLM          0.20372    0.07841   2.598  0.00937 **
SUM          0.32331    0.07664   4.219 2.46e-05 ***
EL          -0.59471    0.09234  -6.441 1.19e-10 ***
PYes         1.08107    0.06731  16.060  < 2e-16 ***
SL:EL        1.78588    0.11444  15.606  < 2e-16 ***
SLM:EL       1.23178    0.10987  11.211  < 2e-16 ***
SUM:EL       0.71532    0.11136   6.424 1.33e-10 ***
SL:PYes     -1.59311    0.11527 -13.820  < 2e-16 ***
SLM:PYes    -1.17298    0.09803 -11.965  < 2e-16 ***
SUM:PYes    -0.85460    0.09259  -9.230  < 2e-16 ***
EL:PYes     -2.68292    0.09867 -27.191  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3211.0014  on 15  degrees of freedom
Residual deviance:    1.5755  on  3  degrees of freedom
AIC: 141.39

################ Original data ############################

# Data from Wisconsin that classifies 4991 high school seniors according to
# socio-economic status S (low, lower middle, upper middle, high),
# the degree of parental encouragement they receive E (low, high),
# and whether or not they have plans to attend college P (no, yes).

S = c("L", "L", "LM", "LM", "UM", "UM", "H", "H")
S = c(S, S)
E = rep(c("L", "H"), 8)
P = c(rep("No", 8), rep("Yes", 8))
counts = c(749, 233, 627, 330, 420, 374, 153, 266,
           35, 133, 38, 303, 37, 467, 26, 800)
wisconsin = data.frame(S, E, P, counts)

> wisconsin
    S E   P counts
1   L L  No    749
2   L H  No    233
3  LM L  No    627
4  LM H  No    330
5  UM L  No    420
6  UM H  No    374
7   H L  No    153
8   H H  No    266
9   L L Yes     35
10  L H Yes    133
11 LM L Yes     38
12 LM H Yes    303
13 UM L Yes     37
14 UM H Yes    467
15  H L Yes     26
16  H H Yes    800
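A minimal sketch of the computation being asked about. The parameter tables above sum to zero across each margin, which is consistent with loglin() reporting sum-to-zero ("effects") coding; under that assumption, a fitted log-count is simply

  log(fit[i,j,k]) = (Intercept) + S[i] + E[j] + P[k] + S.E[i,j] + S.P[i,k] + E.P[j,k]

and the 270.01843 cell can be rebuilt from loglin.3$param:

p <- loglin.3$param
## Add up the intercept and this cell's entry in each parameter table:
log.mu <- p$`(Intercept)` +   #  5.2753940
  p$S["H"] +                  # -0.1044289
  p$E["H"] +                  #  0.5014620
  p$P["No"] +                 #  0.5827855
  p$S.E["H", "H"] +           #  0.4666025
  p$S.P["H", "No"] +          # -0.4525918
  p$E.P["H", "No"]            # -0.6707330
exp(unname(log.mu))           # ~ 270.0184, the upper-left cell of $fit

The sum is about 5.59849, which is (up to rounding) the glm() intercept 5.59850 in the summary above: S=H, E=H, P=No happens to be glm()'s reference cell, so the two fits describe the same fitted surface and differ only in how the coefficients are coded.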
On Sep 15, 2011, at 4:33 PM, Yana Kane-Esrig wrote:

> Dear R gurus,
>
> I am looking for a way to fit a predictive model for a contingency
> table which has counts. I found that glm( family=poisson) is very
> good for figuring out which of several alternative models I should
> select. But once I select a model it is hard to present and
> interpret it, especially when it has interactions, because
> everything is done "relative to reference cell". This makes it
> confusing for the audience.

?predict.glm

The default for "type" is not the only option. Please re-read the help
page and all will be revealed. That is the level of answer appropriate
for what appears to be most probably homework. With that in mind, let
me suggest you refer to your textbook's index under the term
"parametrization" for several of the questions below.

<SNIP rest of quoted message>
David Winsemius, MD
West Hartford, CT
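A minimal sketch of both hints above, assuming the glm2 object from the original post. predict.glm() and contr.sum are standard R; whether coef() then matches loglin()'s tables entry for entry is left as the suggested textbook exercise:

## Fitted counts on the response scale -- the same numbers as loglin.3$fit:
predict(glm2, type = "response")
fitted(glm2)                       # equivalent for a log-link Poisson fit

## Refit with sum-to-zero ("effects") contrasts instead of the default
## treatment contrasts. The fitted values are unchanged; only the coding
## of the coefficients differs (each factor's last level is omitted and
## equals minus the sum of the levels that are shown).
glm2.sum <- glm(counts ~ S + E + P + S:E + S:P + E:P,
                family = poisson, data = wisconsin,
                contrasts = list(S = "contr.sum", E = "contr.sum",
                                 P = "contr.sum"))
coef(glm2.sum)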
Hi Yana,

Trying to interpret associations in complex loglinear models from tables
of parameter estimates is like trying to extract sunlight from a
cucumber. You have to squeeze very hard, and then are usually unhappy
with the quality of the sunlight.

Instead, you can visualize the associations (fitted counts) or the
residuals from the model (pattern of lack of fit) with mosaic and
related displays from the vcd package. See the vcdExtra package for a
tutorial vignette, but as a first step, try

library(vcdExtra)

wiscmod0 <- glm(counts ~ S + E + P, data=wisconsin, family=poisson)
mosaic(wiscmod0, ~ S + E + P)

wiscmod1 <- glm(counts ~ S + E + P + S:E + S:P + E:P, data=wisconsin,
                family=poisson)
mosaic(wiscmod1, ~ S + E + P)

See ?mosaic for further options, and
vignette("vcd-tutorial", package="vcdExtra").

hth,
-Michael

On 9/15/2011 4:33 PM, Yana Kane-Esrig wrote:

> Dear R gurus,
>
> I am looking for a way to fit a predictive model for a contingency
> table which has counts. [...]

<SNIP rest of quoted message>
--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA
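As a cross-check of the original poster's observation that loglin() and glm() agree on the fitted values, a small sketch; the xtabs() call is an assumption, since the post never shows how wisconsin.table was built:

## Assumed construction of the 3-way table passed to loglin():
wisconsin.table <- xtabs(counts ~ S + E + P, data = wisconsin)

loglin.3 <- loglin(wisconsin.table,
                   margin = list(c(1, 2), c(1, 3), c(2, 3)),
                   fit = TRUE, param = TRUE)
glm2 <- glm(counts ~ S + E + P + S:E + S:P + E:P,
            data = wisconsin, family = poisson)

## Lay glm's fitted values out in the same S x E x P array and compare.
## loglin() stops at an IPF convergence tolerance, so allow a little
## numeric slack rather than demanding exact equality.
fit.glm <- xtabs(fitted(glm2) ~ S + E + P, data = wisconsin)
all.equal(c(loglin.3$fit), c(fit.glm), tolerance = 1e-4)   # TRUE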
On 17/09/11 01:19, Michael Friendly wrote:

<SNIP>
> Trying to interpret associations in complex loglinear models from
> tables of parameter estimates is like trying to extract sunlight from
> a cucumber. You have to squeeze very hard, and then are usually
> unhappy with the quality of the sunlight.
<SNIP>

Fortune!!!

cheers,

Rolf Turner
On 17-Sep-11 01:20:53, Rolf Turner wrote:
> On 17/09/11 01:19, Michael Friendly wrote:
>
> <SNIP>
>> Trying to interpret associations in complex loglinear models
>> from tables of parameter estimates is like trying to extract
>> sunlight from a cucumber. You have to squeeze very hard, and
>> then are usually unhappy with the quality of the sunlight.
> <SNIP>
>
> Fortune!!!
> cheers,
> Rolf Turner

Seconded!

That reminds me of a riddle once posed by Manfred Gordon (noted for his
contributions to stochastic theories of polymerisation):

Question: How do you unscramble a scrambled egg?
Answer:   Feed it to a hen.

Perhaps that could inspire a similar approach to the "sunlight from
cucumber" problem (passing over technological practicalities such as
anaerobic digestion, biogas, and methane-fuelled electricity generators).

More generally, I think it suggests a possible approach to the
unscrambling of nicely cooked summary statistics handed to you on a
plate ...

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 17-Sep-11  Time: 10:30:38
------------------------------ XFMail ------------------------------