R1.2.0 with Linux RH5.2

I do not believe that the problems below are new to 1.2, but I only
cover this sort of thing once a year in my course, and some of that
happened to be last Friday, so it was too late to report for 1.2. I see
that one or two things that I was going to report have been corrected.
I like the fact that interactions now show : instead of .

Here is some output with comments inserted.

R : Copyright 2000, The R Development Core Team
Version 1.2.0 (2000-12-15)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.
Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.

Notice the contrasts set below:

> options(contrasts=c("contr.treatment","contr.poly"))
> data <- read.table("tmp.dat",skip=6,head=T)
> data
   Freq Reg66 Reg71
1   118    CC    CC
2    14    LY    CC
3     8    WM    CC
4    12    GL    CC
5    12    CC    LY
6  2127    LY    LY
7    69    WM    LY
8   110    GL    LY
9     7    CC    WM
10   86    LY    WM
11 2548    WM    WM
12   88    GL    WM
13   23    CC    GL
14  130    LY    GL
15  107    WM    GL
16 7712    GL    GL
> attach(data)
> wt <- codes(Reg66)!=codes(Reg71)
> print(z <- glm(Freq~Reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245
    Reg71WM
     2.8877

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      40740
Residual Deviance: 19880        AIC: 20000

> print(z2 <- glm(Freq~Reg66+Reg71,family=poisson,weight=wt))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson, weights = wt)

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
     0.4615       2.1242       2.0096       1.7236       2.4555       2.0847
    Reg71WM
     1.9140

Degrees of Freedom: 11 Total (i.e. Null);  5 Residual
Null Deviance:      486.2
Residual Deviance: 4.367        AIC: 82.7

> z2$fit
          1           2           3           4           5           6
 118.100000   11.836037    8.891541   13.272421   12.758205 2127.100000
          7           8           9          10          11          12
  71.505458  106.736339   10.756661   80.252100 2548.100000   89.991240
         13          14          15          16
  18.485138  137.911862  103.603001 7712.100000

First problem: with diagonal values weighted out in a transition matrix
(mover-stayer model), glm wrongly estimates the diagonals to be the
observed values plus 0.1. This was correct in 90.1 a year ago but was
already wrong in 1.0.1. I do not have any versions in between installed.

> # correct values from GLIM4
> #  unit   observed     fitted   residual
> #   (1)        118      1.586      0.000
> #    2          14     11.836      0.629
> #    3           8      8.892     -0.299
> #    4          12     13.272     -0.349
> #    5          12     12.758     -0.212
> #   (6)       2127     95.185      0.000
> #    7          69     71.505     -0.296
> #    8         110    106.736      0.316
> #    9           7     10.757     -1.145
> #   10          86     80.252      0.642
> #  (11)       2548     60.287      0.000
> #   12          88     89.991     -0.210
> #   13          23     18.485      1.050
> #   14         130    137.912     -0.674
> #   15         107    103.603      0.334
> #  (16)       7712    154.648      0.000
> # and from R90.1 (wrong in R1.0.1)
> #> z2$fit
> #         1          2          3          4          5          6          7
> #  1.586454  11.836037   8.891541  13.272421  12.758205  95.184992  71.505458
> #         8          9         10         11         12         13         14
> #106.736339  10.756661  80.252100  60.287479  89.991240  18.485138 137.911862
> #        15         16
> #103.603001 154.648406
>

Second problem: the man page for factors states that the levels are
sorted, but gl does not do this. It keeps them in the order of the
labels. This is inconsistent with everything else, confusing for
students, and liable to induce errors in data analysis. In the example
below, it is not too important because the baseline category is the
same in the two cases, but that will not generally be the case.

As an addition comment, I would strongly recommend that, if labels can
be coerced to numeric, they be ordered, not sorted.
This is because the order is very confusing if there are more than 9
levels: for example, the sorted 10 comes near the beginning and not at
the end.

> reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> z

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245
    Reg71WM
     2.8877

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      40740
Residual Deviance: 19880        AIC: 20000

> print(za <- glm(Freq~reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson)

Coefficients:
(Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY
     0.6134       2.6900       2.8376       3.9022       3.9597       2.7245
    Reg71WM
     2.8877

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      40740
Residual Deviance: 19880        AIC: 20000

Third problem: if mean value contrasts are used, the level names are
not attached to the variable name, but simply the number of the
category. This rightly drives students mad and is simply
uninterpretable if the levels have been sorted. It takes a great deal
of care to find out to what the numbers refer.

> options(contrasts=c("contr.sum","contr.poly"))
>
> z

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)

Coefficients:
(Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
     0.6134       3.9022       2.6900       2.8376       3.9597       2.7245
    Reg71WM
     2.8877

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      40740
Residual Deviance: 19880        AIC: 20000

> print(zb <- glm(Freq~Reg66+Reg71,family=poisson))

Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)

Coefficients:
(Intercept)       Reg661       Reg662       Reg663       Reg711       Reg712
     5.3638      -2.3575       1.5448       0.3325      -2.3930       1.5667
     Reg713
     0.3315

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      40740
Residual Deviance: 19880        AIC: 20000
>

Additional comment: I recommend that the aic functions for the poisson
and binomial families have the explicit calculation of the log density
replaced by the corresponding d function with the log option. This will
avoid the production of NAs in certain cases of zeroes in tables.

  Jim

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
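[To make the second problem concrete, here is a minimal sketch, not
part of the original report: the default levels produced by factor()
are the sorted labels, while gl() keeps them in the order supplied.]

```r
# factor() sorts the labels to form the default levels;
# gl() keeps them in the order given.
labs <- c("CC", "LY", "WM", "GL")
f <- factor(rep(labs, 4))          # levels sorted: CC GL LY WM
g <- gl(4, 1, 16, labels = labs)   # levels as given: CC LY WM GL
levels(f)
levels(g)
```

[So the baseline (first) level differs between the two constructions,
which is the trap described above.]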
[We do have a TooMuchAtOnce category, but it is easier to cope with
short separate reports with informative subject lines for, e.g., the
BUGS list.]

On Mon, 18 Dec 2000 james.lindsey@luc.ac.be wrote:

> R1.2.0 with Linux RH5.2
>
> I do not believe that the problems below are new to 1.2 but I only
> cover this sort of thing once a year in my course and some of that
> happened to be last Friday so too late to report for 1.2. I see that
> one or two things that I was going to report have been corrected.
> I like the fact that interactions now show : instead of .
>
> Here is some output with comments inserted.

[...]

> > wt <- codes(Reg66)!=codes(Reg71)

Why not just wt <- Reg66 != Reg71 ?

[...]

> First problem: with diagonal values weighted out in a transition matrix
> (mover-stayer model), glm wrongly estimates the diagonals to be the
> observed plus 0.1. This was correct in 90.1 a year ago but already
> wrong in 1.0.1. I don't have any versions in between installed.

I think this is the same as

    [Rd] glm gives incorrect results for zero-weight cases (PR#780)

in which case it is already fixed in R-patched. That does look to have
solved this one too. Note that glm has *not* `estimated the diagonals'
at all. These are predictions (the cases are not in the model), and
predict.glm did get them right.

predict(z, data)
      1       2       3       4       5       6       7       8
0.61337 3.30334 3.45098 4.51559 3.33786 6.02783 6.17548 7.24009
      9      10      11      12      13      14      15      16
3.50109 6.19106 6.33871 7.40332 4.57309 7.26306 7.41071 8.47532

[...]

> Second problem: the man page for factors states that the levels are
> sorted, but gl does not do this. It keeps them in the order of the

There is a help page for factor(), not for factors, and I don't believe
that does state so. It says that *factor* sorts them by default, which
is true. Where is the `man page for factors'?

> labels. This is inconsistent with everything else, confusing for
> students, and liable to induce errors in data analysis. In the example
> below, it is not too important because the baseline category is the
> same in the two cases but that will not generally be the case.
>
> As an addition comment, I would strongly recommend that, if labels can
> be coerced to numeric, they be ordered, not sorted. This is because
> the order is very confusing if there are more than 9: for example, the
> sorted 10 comes near the beginning and not at the end.

You can do that, of course, by specifying the order of the levels.
However, we teach students that the details of the coding of linear
models are details, and that they should regard coefficients as
secondary and predictions as primary. aov() has the right idea in
suppressing the individual coefficients. If you want to interpret
coefficients you need to understand codings. Period.

I don't see the value of changing the default for factor: it would not
be backwards-compatible, and some people have thought thought what the
default would be and accepted it.

> > reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> > z
>
> Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)
>
> Coefficients:
> (Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
>      0.6134       3.9022       2.6900       2.8376       3.9597       2.7245
>     Reg71WM
>      2.8877
>
> Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> Null Deviance:      40740
> Residual Deviance: 19880        AIC: 20000
> > print(za <- glm(Freq~reg66+Reg71,family=poisson))
>
> Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson)
>
> Coefficients:
> (Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY
>      0.6134       2.6900       2.8376       3.9022       3.9597       2.7245
>     Reg71WM
>      2.8877
>
> Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> Null Deviance:      40740
> Residual Deviance: 19880        AIC: 20000

> Third problem: if mean value contrasts are used, the level names are
> not attached to the variable name, but simply the number of the
> category. This rightly drives students mad and is simply
> uninterpretable if the levels have been sorted. It takes a great deal
> of extreme care to try to find out to what the numbers refer.

Are we supposed to guess that `mean value contrasts' means contr.sum?
In which case the numbers refer to the number of the *contrast*: they
do not refer to levels, nor are single levels relevant. As in

> contr.sum(levels(Reg66))
   [,1] [,2] [,3]
CC    1    0    0
GL    0    1    0
LY    0    0    1
WM   -1   -1   -1
> contr.treatment(levels(Reg66))
   GL LY WM
CC  0  0  0
GL  1  0  0
LY  0  1  0
WM  0  0  1

Now, what should the column labels be in the first case? And in what
sense are these `mean value contrasts'? Do you really want labels like
"CCvWM"? They could get very cumbersome.

(One could argue that contr.treatment is already wrong, but as they
are not even contrasts ....)

[...]

> Additional comment: I recommend that the aic functions for poisson and
> binomial have the explicit calculation of the log density replaced by
> the corresponding d function with the log option. This will avoid the
> production of NAs in certain cases of zeroes in tables.

Good idea, but could you supply examples so we can put them in the
regression tests?

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
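[The "coefficients as secondary, predictions as primary" point can be
illustrated with a small sketch of my own, with toy data not from the
thread: the same one-factor Poisson fit under treatment and sum
contrasts gives different coefficients but identical fitted values.]

```r
# Toy one-factor Poisson fit: coefficients depend on the coding,
# fitted values do not.
y <- c(10, 12, 9, 30, 28, 31)
g <- gl(2, 3, labels = c("a", "b"))
fit.tr  <- glm(y ~ g, family = poisson,
               contrasts = list(g = "contr.treatment"))
fit.sum <- glm(y ~ g, family = poisson,
               contrasts = list(g = "contr.sum"))
coef(fit.tr)                                 # (Intercept), gb
coef(fit.sum)                                # (Intercept), g1
all.equal(fitted(fit.tr), fitted(fit.sum))   # TRUE
```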
> [We do have a TooMuchAtOnce category, but it is easier to cope with
> short separate reports with informative subject lines for, e.g. the
> BUGS list.]
>
> On Mon, 18 Dec 2000 james.lindsey@luc.ac.be wrote:
>
> > R1.2.0 with Linux RH5.2
> >
> > I do not believe that the problems below are new to 1.2 but I only
> > cover this sort of thing once a year in my course and some of that
> > happened to be last Friday so too late to report for 1.2. I see that
> > one or two things that I was going to report have been corrected.
> > I like the fact that interactions now show : instead of .
> >
> > Here is some output with comments inserted.
>
> [...]
>
> > > wt <- codes(Reg66)!=codes(Reg71)
>
> Why not just wt <- Reg66 != Reg71 ?

Simple: I do not trust R factor levels (among others, for reasons
outlined below). For any serious work involving factor variables, I
use GLIM4.

> [...]
>
> > First problem: with diagonal values weighted out in a transition matrix
> > (mover-stayer model), glm wrongly estimates the diagonals to be the
> > observed plus 0.1. This was correct in 90.1 a year ago but already
> > wrong in 1.0.1. I don't have any versions in between installed.
>
> I think this is the same as
>
>     [Rd] glm gives incorrect results for zero-weight cases (PR#780)

Good. That patch was posted after I submitted my bug report.

> in which case it is already fixed in R-patched. That does look to have
> solved this one too. Note that glm has *not* `estimated the diagonals'
> at all. These are predictions (the cases are not in the model), and
> predict.glm did get them right.

It *has* estimated the diagonals under the fitted mover-stayer model
(that is, if it were doing it right). They are estimates of certain
parameters in this model. (Another way to fit the same model is to use
a 4-level factor variable on the diagonal instead of weighting it out,
but that does not supply the desired estimates.) R incorrectly calls
them fitted values in its objects and printouts.
GLIM4 clearly prints them out in a distinct way so that there is no
confusion.

> predict(z, data)
>       1       2       3       4       5       6       7       8
> 0.61337 3.30334 3.45098 4.51559 3.33786 6.02783 6.17548 7.24009
>       9      10      11      12      13      14      15      16
> 3.50109 6.19106 6.33871 7.40332 4.57309 7.26306 7.41071 8.47532
>
> [...]
>
> > Second problem: the man page for factors states that the levels are
> > sorted, but gl does not do this. It keeps them in the order of the
>
> There is a help page for factor(), not for factors, and I don't believe
> that does state so. It says that *factor* sorts them by default, which is
> true. Where is the `man page for factors'?

Sorry. Typo. Here is what the docs say:

    The levels of a factor are by default sorted, but the sort order
    may well depend on the locale at the time of creation, and should
    not be assumed to be ASCII.

Because this is the only documentation on factor variables and it does
not mention the factor function explicitly, I take it to be a general
statement.

> > labels. This is inconsistent with everything else, confusing for
> > students, and liable to induce errors in data analysis. In the example
> > below, it is not too important because the baseline category is the
> > same in the two cases but that will not generally be the case.
> >
> > As an addition comment, I would strongly recommend that, if labels can
> > be coerced to numeric, they be ordered, not sorted. This is because
> > the order is very confusing if there are more than 9: for example, the
> > sorted 10 comes near the beginning and not at the end.
>
> You can do that, of course, by specifying the order of the levels.
> However, we teach students that the details of the coding of linear models
> are details, and that they should regard coefficients as secondary and
> predictions as primary. aov() has the right idea in suppressing the
> individual coefficients. If you want to interpret coefficients you need to
> understand codings. Period.

I do not see what aov has to do with it when discussing glm's. It is
for lm and, as far as I know, produces the old-fashioned material for
tests that more and more scientific journals are rightly refusing.

In, say, a standard high-dimensional contingency table (or a factorial
design), the first thing that scientists and industry people want to
know is which levels of the factor variables yield scientifically
important effects. Only then do they want measures of statistical
significance. What are required are first the appropriate coefficients
and then the corresponding intervals.

The idea about predictions may be theoretically fine, at least in
simple models, but it is practically useless in realistically complex
cases involving multidimensional contingency tables (or multifactorial
designs, from which factor variables take their name, for that
matter), where one wants to extract the relative effects of the
various factor levels and interactions. The predictions are virtually
useless for disentangling such effects.

I teach that all the different constraint sets are equivalent because
they give the same predictions, so one should choose the set that is
most appropriate or most easily interpretable for the problem at hand.
If one level of a factor is of particular interest, use baseline
constraints (so-called treatment contrasts); if the levels are rather
symmetric, use mean value constraints (so-called sum contrasts). In
other cases, special contrasts need to be set up. Contrasts are a
property of a variable, not of a design, so it would be nice to have
an automatic way of setting them for each factor variable rather than
a global contrasts option. But I know of no software that allows this.

R makes it extremely difficult to understand codings. They are very
obscure and easily lead to errors of interpretation. Everything seems
to be done to discourage their use (as opposed simply to looking at
tests) when their correct interpretation is essential.
That is my whole point.

> I don't see the value of changing the default for factor: it would not be
> backwards-compatible, and some people have thought thought what the default
                                             ^^^^^^^^^^^^^^^

Does this mean that they thought twice about it, or is it a typo like
my "factors" above?

> would be and accepted it.
>
> > > reg66 <- gl(4,1,16,labels=c("CC","LY","WM","GL"))
> > > z
> >
> > Call:  glm(formula = Freq ~ Reg66 + Reg71, family = poisson)
> >
> > Coefficients:
> > (Intercept)      Reg66GL      Reg66LY      Reg66WM      Reg71GL      Reg71LY
> >      0.6134       3.9022       2.6900       2.8376       3.9597       2.7245
> >     Reg71WM
> >      2.8877
> >
> > Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> > Null Deviance:      40740
> > Residual Deviance: 19880        AIC: 20000
> > > print(za <- glm(Freq~reg66+Reg71,family=poisson))
> >
> > Call:  glm(formula = Freq ~ reg66 + Reg71, family = poisson)
> >
> > Coefficients:
> > (Intercept)      reg66LY      reg66WM      reg66GL      Reg71GL      Reg71LY
> >      0.6134       2.6900       2.8376       3.9022       3.9597       2.7245
> >     Reg71WM
> >      2.8877
> >
> > Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
> > Null Deviance:      40740
> > Residual Deviance: 19880        AIC: 20000
>
> > Third problem: if mean value contrasts are used, the level names are
> > not attached to the variable name, but simply the number of the
> > category. This rightly drives students mad and is simply
> > uninterpretable if the levels have been sorted. It takes a great deal
> > of extreme care to try to find out to what the numbers refer.
>
> Are we supposed to guess that `mean value contrasts' means contr.sum? In
> which case the numbers refer to the number of the *contrast*: they do not
> refer to levels, nor are single levels relevant. As in

I stated above, in a part that you cut, that one should pay attention
to the contrasts options that I set. "Mean value constraints" is
standard terminology in the fields where I have worked, for the simple
reason that the coefficients give differences from the mean value.
"Conventional constraints" is also commonly used but is less informative. What does the "number of the *contrast*" mean and how do you easily find out what it is after the levels have been sorted around? In a log linear model (or complex factorial design), they do refer to differences ("contrasts") of each level from the mean value. Single levels are at least as relevant here as for any other contrast set.> > > contr.sum(levels(Reg66)) > [,1] [,2] [,3] > CC 1 0 0 > GL 0 1 0 > LY 0 0 1 > WM -1 -1 -1 > > contr.treatment(levels(Reg66)) > GL LY WM > CC 0 0 0 > GL 1 0 0 > LY 0 1 0 > WM 0 0 1 > > Now, what should the column labels be in the first case? And in what sense > are these `mean value contrasts'?They should be CC GL LY. The coefficient for WM is minus the sum of the other three. The coefficients will be differences from the mean value (on the log scale for log linear models) and are symmetric in all factor levels (in the sense that no particular category is chosen as special). Some statistical software packages very nicely print out the coefficients for all levels instead of arbitrarily dropping the last one.> > Do you really want labels like "CCvWM"? They could get very cumbersome. >You have lost me here. The answer is, of course, No. What a suggestion! It does not mean that at all. It compares CC to the mean not to WM. To me, that looks like the appropriate label for baseline constraints with the last category as baseline (what R calls treatment but is more usually placebo!).> (One could argue that contr.treatment is already wrong, but as they > are not even contrasts ....) > > [...] > > > Additional comment: I recommend that the aic functions for poisson and > > binomial have the explicit calculation of the log density replaced by > > the corresponding d function with the log option. This will avoid the > > production of NAs in certain cases of zeroes in tables. 
> Good idea, but could you supply examples so we can put them in the
> regression tests?

Here is a simple and unrealistic (as no one would ever fit the model
to these data) example that illustrates what can happen in more
complex tables:

y <- matrix(c(2,0,7,3,0,9), ncol=2)
x <- gl(3,1)
glm(y~x, family=binomial)

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
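[A hedged illustration of my own, not from the thread, of how the
d-function route avoids the problem: when a binomial cell is fitted at
probability exactly 1 (or 0), the written-out log density produces
0 * log(0), which is NaN in R, while dbinom(..., log = TRUE) handles
the boundary correctly.]

```r
# Explicit binomial log-density vs dbinom() at a boundary probability.
y <- 3; n <- 3; p <- 1
lchoose(n, y) + y * log(p) + (n - y) * log(1 - p)   # 0 * -Inf => NaN
dbinom(y, n, p, log = TRUE)                         # 0
```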
[Moved back to R-devel. That's where we discuss bug reports.]

On Wed, 3 Jan 2001, Jim Lindsey wrote:

> > > Additional comment: I recommend that the aic functions for poisson and
> > > binomial have the explicit calculation of the log density replaced by
> > > the corresponding d function with the log option. This will avoid the
> > > production of NAs in certain cases of zeroes in tables.
> >
> > Good idea, but could you supply examples so we can put them in the
> > regression tests?
>
> Here is a simple and unrealistic (as no one would ever fit the model
> to these data) example that illustrates what can happen in more
> complex tables:
>
> y <- matrix(c(2,0,7,3,0,9), ncol=2)
> x <- gl(3,1)
> glm(y~x, family=binomial)

That's not one. It fails because wt/n = NaN, and produces NaN, not NA.
Now, could you please supply an example of what you reported, so we
can put it in the regression tests? The BUGS section in the FAQ does
ask for one, with good reason.

I've done as you suggested, and *not* fixed the actual bug in this
example.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
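[For completeness, a Poisson analogue of the zero-table case, again my
own sketch rather than part of the thread: a zero count with a fitted
mean of zero makes the explicit log-density formula NaN, while
dpois(..., log = TRUE) returns 0.]

```r
# Explicit Poisson log-density vs dpois() when y = 0 and mu = 0.
y <- 0; mu <- 0
y * log(mu) - mu - lgamma(y + 1)   # 0 * -Inf => NaN
dpois(y, mu, log = TRUE)           # 0
```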