Peter Dunn
2006-Aug-31 02:18 UTC
[R] NaN when using dffits, stemming from lm.influence call
Hi all I'm getting a NaN returned on using dffits, as explained below. To me, there seems no obvious (or non-obvious reason for that matter) reason why a NaN appears. Before I start digging further, can anyone see why dffits might be failing? Is there a problem with the data? Consider: # Load data dep <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat", header=TRUE) attach(dep) dep # Fit Poisson glm dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children) + factor(Depression):factor(SLE), family=poisson( link=log) ) # Compute dffits dffits( dep.glm2 ) This produces the output: 1 2 3 4 5 6 1.4207746 -0.1513808 NaN 0.9079142 -0.1032664 -1.0860289 7 8 0.4853797 3.8560863 NaN exists for Observation 3. I cannot understand why: there's nothing grossly unusual or bad about it. I look a bit closer, and it falls over in lm.influence when computing the deletion statistic sigma: > lm.influence(dep.glm2)$sigma 1 2 3 4 5 6 7 8 0.914829 2.134279 NaN 2.186707 2.224885 1.934539 2.225115 1.957111 The rest of the results from lm.influence are OK; for example: > lm.influence(dep.glm2)$wt.res 1 2 3 4 5 6 2.62840627 -0.88476903 -1.09492912 0.20247856 -0.23114458 -0.95123387 7 8 0.07521515 0.30208051 Use of debug( lm.influence ) indicates the NaN appears in this line of lm.influence: res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef), model$qr$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef) matrix(0, n, k) else double(0), sigma = double(n), tol = 10 * .Machine$double.eps, DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma", "wt.res")] I don't particularly wish to dig around in the Fortran if someone else can look at it and see my problem easily. But if I must... The appearance of the NaN seems odd, since (as I understand it) lm.influence(dep.glm2)$sigma computes sigma when each observation is removed in turn. So if I remove Observation 3 and try fitting the model, there are no problems or complaints: dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children) + factor(Depression):factor(SLE), family=poisson( link=log), subset=(-3) ) This produces: > dep.glm3 Call: glm(formula = Counts ~ factor(Depression) + factor(SLE) + factor(Children) + factor(Depression):factor(SLE), family = poisson(link = log), subset = (-3)) Coefficients: (Intercept) factor(Depression)1 5.4389 -4.1392 factor(SLE)1 factor(Children)1 -0.6503 -2.4036 factor(Depression)1:factor(SLE)1 3.9513 Degrees of Freedom: 6 Total (i.e. Null); 2 Residual Null Deviance: 695.9 Residual Deviance: 0.8535 AIC: 41.25 No problems, errors, or any signs of potential problems. In the changes to R 2.3.0 (in the NEWS file, eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS) I find this: o Influence measures such as rstandard() and cooks.distance() could return infinite values rather than NaN for a case which was fitted exactly. Similarly, plot.lm() could fail on such examples. plot.lm(which = 5) had to be modified to only plot cases with hat < 1. (PR#8367) lm.influence() was incorrectly reporting 'coefficients' and 'sigma' as NaN for cases with hat = 1, and on some platforms not detecting hat = 1 correctly. The last sentence identifies NaN being reported for sigma, as I find with my data. But my data do not have hat = 1, but the hat diagonals are large. The troublesome Observation 3 does not have the largest hat value in the data though: > hatvalues(dep.glm2) 1 2 3 4 5 6 7 8 0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408 0.9607654 And besides, I am using the most recent version of R (see below). BTW, the NaNs appear in the previous version of R also. So I'm flummoxed. As always, help appreciated. P. > version _ platform i386-pc-linux-gnu arch i386 os linux-gnu system i386, linux-gnu status major 2 minor 3.1 year 2006 month 06 day 01 svn rev 38247 language R version.string Version 2.3.1 (2006-06-01) -- Dr Peter Dunn | Email: dunn <at> usq.edu.au Faculty of Sciences, University of Southern Queensland and the Australian Centre for Sustainable Catchments CRICOS: QLD 00244B | NSW 02225M | VIC 02387D | WA 02521C
Prof Brian Ripley
2006-Aug-31 12:51 UTC
[R] NaN when using dffits, stemming from lm.influence call
When applying dffits to a GLM, you are presumably intending to apply it to the working regression. However, that is not what lm.influence has long been set up to do (it uses the deviance residuals). In your example the influence is high and so the approximations of applying the standard formulae to the working regression are unrealistic, hence the NaN (it is predicting a negative variance on deleting that case, which is a failure of the approximation used). If you look at the reference (Williams) you will find discussion of the approxmations and indeed what is possibly a better approximation in his 'likelihood residuals'. On Thu, 31 Aug 2006, Peter Dunn wrote:> Hi all > > I'm getting a NaN returned on using dffits, as explained > below. To me, there seems no obvious (or non-obvious reason > for that matter) reason why a NaN appears. > > Before I start digging further, can anyone see why dffits > might be failing? Is there a problem with the data? > > > Consider: > > # Load data > dep <- > read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat", > header=TRUE) > attach(dep) > dep > > # Fit Poisson glm > dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children) > + factor(Depression):factor(SLE), > family=poisson( link=log) ) > > # Compute dffits > dffits( dep.glm2 ) > > > This produces the output: > 1 2 3 4 5 6 1.4207746 > -0.1513808 NaN 0.9079142 -0.1032664 -1.0860289 > 7 8 > 0.4853797 3.8560863 > > NaN exists for Observation 3. I cannot understand why: there's > nothing grossly unusual or bad about it. I look a bit closer, > and it falls over in lm.influence when computing the deletion > statistic sigma: > > > > > lm.influence(dep.glm2)$sigma > 1 2 3 4 5 6 7 8 > 0.914829 2.134279 NaN 2.186707 2.224885 1.934539 2.225115 1.957111 > > The rest of the results from lm.influence are OK; for example: > > > lm.influence(dep.glm2)$wt.res > 1 2 3 4 5 6 > 2.62840627 -0.88476903 -1.09492912 0.20247856 -0.23114458 -0.95123387 > 7 8 > 0.07521515 0.30208051 > > > Use of debug( lm.influence ) indicates the NaN appears in this line > of lm.influence: > > > > res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef), > model$qr$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef) > matrix(0, > n, k) else double(0), sigma = double(n), tol = 10 * > .Machine$double.eps, > DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma", > "wt.res")] > > > I don't particularly wish to dig around in the Fortran if someone > else can look at it and see my problem easily. But if I must... > > > > The appearance of the NaN seems odd, since (as I understand it) > lm.influence(dep.glm2)$sigma computes sigma when each observation > is removed in turn. So if I remove Observation 3 and try fitting the > model, there are no problems or complaints: > > dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) + > factor(Children) + factor(Depression):factor(SLE), > family=poisson( link=log), subset=(-3) ) > > > This produces: > > > > dep.glm3 > > Call: glm(formula = Counts ~ factor(Depression) + factor(SLE) + > factor(Children) + factor(Depression):factor(SLE), family = poisson(link > = log), subset = (-3)) > > Coefficients: > (Intercept) factor(Depression)1 > 5.4389 -4.1392 > factor(SLE)1 factor(Children)1 > -0.6503 -2.4036 > factor(Depression)1:factor(SLE)1 > 3.9513 > > Degrees of Freedom: 6 Total (i.e. Null); 2 Residual > Null Deviance: 695.9 > Residual Deviance: 0.8535 AIC: 41.25 > > > No problems, errors, or any signs of potential problems. > > > In the changes to R 2.3.0 (in the NEWS file, > eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS) > I find this: > > o Influence measures such as rstandard() and cooks.distance() > could return infinite values rather than NaN for a case which > was fitted exactly. Similarly, plot.lm() could fail on such > examples. plot.lm(which = 5) had to be modified to only plot > cases with hat < 1. (PR#8367) > > lm.influence() was incorrectly reporting 'coefficients' and > 'sigma' as NaN for cases with hat = 1, and on some platforms > not detecting hat = 1 correctly. > > The last sentence identifies NaN being reported for sigma, as I > find with my data. But my data do not have hat = 1, but the hat > diagonals are large. The troublesome Observation 3 does not have the > largest hat value in the data though: > > > hatvalues(dep.glm2) > 1 2 3 4 5 6 7 > 8 > 0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408 > 0.9607654 > > And besides, I am using the most recent version of R (see below). BTW, > the NaNs appear in the previous version of R also. > > So I'm flummoxed. > > As always, help appreciated. > > P. > > > > > version > _ > platform i386-pc-linux-gnu > arch i386 > os linux-gnu > system i386, linux-gnu > status > major 2 > minor 3.1 > year 2006 > month 06 > day 01 > svn rev 38247 > language R > version.string Version 2.3.1 (2006-06-01) > > >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595