Has anyone programmed condition indexes in R?

I know that there is a function for variance inflation factors available in the car package; however, Belsley (1991), Conditioning Diagnostics (Wiley), notes that there are several weaknesses of VIFs: e.g., (1) high VIFs are sufficient but not necessary conditions for collinearity; (2) VIFs don't diagnose the number of collinearities; and (3) no one has determined how high a VIF has to be for the collinearity to be damaging.

He then develops and suggests using condition indexes instead, so I was wondering if anyone had programmed them.

Thanks

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
Peter Flom wrote:
> Has anyone programmed condition indexes in R?
> [...]
> He then develops and suggests using condition indexes instead, so I was
> wondering if anyone had programmed them.

I think Juergen Gross has something like that in his new book:

Gross, J. (2003): Linear Regression. Springer (in press - OK, not very helpful here).

You might want to contact him privately (in CC).

Uwe Ligges
I'm under the impression that the condition index is just the ratio of the maximum singular value to the minimum singular value. So just doing eigen() or svd() on the design matrix (which you can get via, e.g., model.matrix) ought to be sufficient. The number of near-zero singular values (and their corresponding singular vectors) tells you the number (and nature) of the collinearities.

HTH,
Andy
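A minimal sketch of Andy's suggestion (assuming `fit` is an existing lm object; the object names here are illustrative, not from his message):

X <- model.matrix(fit)   # design matrix of the fitted model
d <- svd(X)$d            # singular values, largest first
d[1] / d                 # condition indexes; the last (largest) is the condition number
kappa(X, exact = TRUE)   # base R gives the same max/min singular-value ratio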
Dear Peter and Uwe,

I don't have a copy of Belsley's 1991 book here, but I do have Belsley, Kuh, and Welsch, Regression Diagnostics (Wiley, 1980). If my memory is right, the approach is the same: Belsley's collinearity diagnostics are based on a singular-value decomposition of the scaled but uncentred model matrix. A straightforward, if inelegant, rendition is

belsley <- function(model){
    # SVD of the column-scaled (but uncentred) model matrix
    X <- model.matrix(model)
    X <- scale(X, center=FALSE)/sqrt(nrow(X) - 1)
    svd.X <- svd(X)
    result <- list(singular.values = svd.X$d,
                   condition.indices = max(svd.X$d)/svd.X$d)
    # variance-decomposition proportions: rows index condition
    # indices, columns index coefficients
    phi <- sweep(svd.X$v^2, 2, svd.X$d^2, "/")
    Pi <- t(sweep(phi, 1, rowSums(phi), "/"))
    colnames(Pi) <- names(coef(model))
    rownames(Pi) <- 1:nrow(Pi)
    result$pi <- Pi
    class(result) <- "belsley"
    result
    }

print.belsley <- function(x, digits = 3, ...){
    cat("\nSingular values: ", x$singular.values)
    cat("\nCondition indices: ", x$condition.indices)
    cat("\n\nVariance-decomposition proportions\n")
    print(round(x$pi, digits))
    invisible(x)
    }

This gives the singular values, condition indices, and variance-decomposition proportions. (I'm pretty sure that you can get the same thing more elegantly from the qr decomposition, but I don't know how off the top of my head -- someone else on the list doubtless can supply the details.)

For example, for the illustration on p. 161 of BKW,

> X
   V1  V2  V3     V4     V5
1 -74  80  18    -56   -112
2  14 -69  21     52    104
3  66 -72  -5    764   1528
4 -12  66 -30   4096   8192
5   3   8  -7 -13276 -26552
6   4 -12   4   8421  16842

> mod <- lm(y ~ X - 1)  # nb., y was just randomly generated
> belsley(mod)

Singular values:  1.414214 1.361734 1.066707 0.08840437 3.614479e-17
Condition indices:  1 1.038538 1.325775 15.9971 3.912635e+16

Variance-decomposition proportions
    XV1   XV2   XV3 XV4 XV5
1 0.000 0.000 0.000   0   0
2 0.005 0.005 0.000   0   0
3 0.001 0.001 0.047   0   0
4 0.994 0.994 0.953   0   0
5 0.000 0.000 0.000   1   1

which is in good agreement with the values given in the text.

Now some comments:

(1) I've never liked this approach for a model with a constant, where it makes more sense to me to centre the data. I realize that opinions differ here, but it seems to me that failing to centre the data conflates collinearity with numerical instability.

(2) I also disagree with the comment that condition indices are easier to interpret than variance-inflation factors. In either case, since collinearity is a continuous phenomenon, cutoffs for large values are necessarily arbitrary.

(3) If you're interested in figuring out which variables are involved in each collinear relationship, then (for centred and scaled data) you can equivalently (and to me, more intuitively) work with the principal-components analysis of the predictors (a short sketch follows below).

(4) I have doubts about the whole enterprise. Collinearity is one source of imprecision -- others are small sample size, homogeneous predictors, and large error variance. Aren't the coefficient standard errors the bottom line? If these are sufficiently small, why worry?
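A minimal sketch of the principal-components check mentioned in comment (3) -- illustrative only, with `fit` standing for any fitted lm object that includes a constant:

X <- model.matrix(fit)[, -1]                 # drop the intercept column
pc <- prcomp(X, center = TRUE, scale. = TRUE)
pc$sdev                # near-zero standard deviations signal collinear relations
round(pc$rotation, 3)  # the loadings show which variables are involved in each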
I hope that this helps.

 John

-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
Thanks for all the help. Juergen Gross supplied a program which does just what Belsley suggested. Chuck Cleland, John Fox and Andy Liaw all made useful programming suggestions.

John Fox wrote:

<<< (1) I've never liked this approach for a model with a constant, where it makes more sense to me to centre the data. I realize that opinions differ here, but it seems to me that failing to centre the data conflates collinearity with numerical instability. >>>

Opinions do differ. A few years ago I could have given more details (my dissertation was on this topic, but many of the details have faded from memory). I think, though, that Belsley is looking for a measure that deals not only with collinearity but with several other problems, including numerical instability (the subtitle of his later book is Collinearity and Weak Data in Regression). I remember being convinced that centering was generally not a good idea, but there are lots of people who disagree and who know a lot more statistics than I do.

<<< (2) I also disagree with the comment that condition indices are easier to interpret than variance-inflation factors. In either case, since collinearity is a continuous phenomenon, cutoffs for large values are necessarily arbitrary. >>>

While any cutoff is arbitrary (and Belsley advises against using a cutoff rigidly), he does provide some evidence of how regression models with different condition indices are affected by them.

<<< (3) If you're interested in figuring out which variables are involved in each collinear relationship, then (for centred and scaled data) you can equivalently (and to me, more intuitively) work with the principal-components analysis of the predictors. >>>

This would also work.

<<< (4) I have doubts about the whole enterprise. Collinearity is one source of imprecision -- others are small sample size, homogeneous predictors, and large error variance. Aren't the coefficient standard errors the bottom line? If these are sufficiently small, why worry? >>>

I think (correct me if I am wrong) that the s.e.s and the condition indices serve very different purposes. The condition indices are supposed to determine whether small changes in the input data could make big differences in the results. Belsley provides some examples where a tiny change in the data produces completely different results (e.g., different standard errors, different coefficients (even reversing sign), and so on).

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
Dear John

An interesting discussion!

I would be the last to suggest ignoring such diagnostics as Cook's D; as you point out, it diagnoses a problem which condition indices do not: whether a point is influential.

OTOH, condition indices diagnose a problem which Cook's D does not: would shifting the data slightly change the results?

Consider the data given in Belsley (1991) on p. 5:

y   <- c( 3.3979,  1.6094,  3.7131,  1.6767,  0.0419,  3.3768,  1.1661,  0.4701)
x2a <- c(-3.138, -0.297, -4.582,  0.301,  2.729, -4.836,  0.065,  4.102)
x2b <- c(-3.136, -0.296, -4.581,  0.300,  2.730, -4.834,  0.064,  4.103)
x3a <- c( 1.286,  0.250,  1.247,  0.498, -0.280,  0.350,  0.208,  1.069)
x3b <- c( 1.288,  0.251,  1.246,  0.498, -0.281,  0.349,  0.206,  1.069)
x4a <- c( 0.169,  0.044,  0.109,  0.117,  0.035, -0.094,  0.047,  0.375)
x4b <- c( 0.170,  0.043,  0.108,  0.118,  0.036, -0.093,  0.048,  0.376)

where x2a, x3a and x4a are very similar to x2b, x3b and x4b, respectively, and where both sets are generated from the model (with intercept 1.2)

y = 1.2 - 0.4*x2 + 0.6*x3 + 0.9*x4 + e,   e ~ N(0, 0.01)

Then

modela <- lm(y ~ x2a + x3a + x4a)

and

modelb <- lm(y ~ x2b + x3b + x4b)

give radically different results, with neither having any significant parameters other than the intercept. Admittedly, the standard errors for a couple of the parameters are large. But why are they large? I have certainly dealt with models with large standard errors that have nothing to do with collinearity.

Here, the function PI.lm (supplied by Juergen Gross) gives huge condition indices, and indicates that the nature of the problem is that all three of the x variables are highly collinear:

Variance-Decomposition Proportions for Scaled Condition Indexes:

    (Intercept) x2b x3b x4b
1        0.0494   0   0   0
1        0.0009   0   0   0
3        0.8101   0   0   0
464      0.1396   1   1   1

Regards

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
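A minimal sketch of the instability Peter describes, using the vectors above (the comparison lines are illustrative, not from the original message):

modela <- lm(y ~ x2a + x3a + x4a)
modelb <- lm(y ~ x2b + x3b + x4b)
round(cbind(a = coef(modela), b = coef(modelb)), 2)  # coefficients differ radically
kappa(model.matrix(modela), exact = TRUE)            # enormous condition number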
Dear Peter,

I'm sorry that I've taken a while to get back to you -- I was away for a few days.

In the example that you give from Belsley (1991), the predictors are essentially perfectly linearly related; for example

> summary(lm(x2a ~ x3a + x4a))

Call:
lm(formula = x2a ~ x3a + x4a)

Residuals:
         1          2          3          4          5          6          7          8
-0.0195624 -0.0152938  0.0078068  0.0323025 -0.0087845  0.0025448  0.0014472 -0.0004606

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)  0.007943   0.010025    0.792    0.464
x3a         -6.181811   0.016069 -384.716 2.25e-12 ***
x4a         28.540996   0.066907  426.580 1.34e-12 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.01901 on 5 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1
F-statistic: 1.033e+05 on 2 and 5 DF,  p-value: 2.879e-12

In a case like this, the variance-inflation factors will also be very large:

> vif(lm(y ~ x2a + x3a + x4a))
     x2a      x3a      x4a
41333.34 47141.19 57958.62

Any of several methods of discovering the linear relationship among the x's will work -- including the first regression above, a principal-components analysis, and Belsley's approach.

I'm not arguing that discovering the source of large standard errors in a regression is completely uninteresting, although in most circumstances there isn't much that one can do about it short of collecting new data -- but this probably isn't a proper forum to have a detailed discussion about collinearity (my fault for broaching the issue in the first place). Except with respect to centering the data, I suspect that we largely agree about these matters.
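For reference, each of the vif() values above is just 1/(1 - R^2_j), where R^2_j comes from regressing the j-th predictor on the others; a minimal check using the regression shown above:

r2 <- summary(lm(x2a ~ x3a + x4a))$r.squared
1 / (1 - r2)   # approximately 41333, matching vif() for x2a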
Regards,
 John

-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox