"Jens Oehlschlägel"
2004-Jan-21  16:57 UTC
[R] outlier identification: is there a redundancy-invariant substitution for mahalanobis distances?
Dear R-experts,

Searching the help archives I found a recommendation to do multivariate outlier identification by mahalanobis distances based on a robustly estimated covariance matrix and to compare the resulting distances to a chi^2-distribution with p (the number of variables) degrees of freedom. I understand that compared to euclidean distances this has the advantage of being scale-invariant. However, it seems that such mahalanobis distances are not invariant to redundancies: adding a highly collinear variable changes the mahalanobis distances (see code below). Doesn't the comparison to chi^2 also assume that all variables are independent?

Can anyone recommend a procedure to calculate distances and identify multivariate outliers which is invariant to the degree of collinearity?

Thanks for any advice

Jens Oehlschlägel

# Example code

library(MASS)

# generate bivariate normal test data
n <- 500
x <- matrix(rnorm(n*2), ncol=2)
# scale, otherwise euclidean fails
x <- scale(x)
cr <- cov.rob(x, method="mcd")
center <- cr$center
# calculate squared euclidean and mahalanobis
d <- rowSums(t(t(x)-center)^2)
m <- as.vector(mahalanobis(x, center, cr$cov))
# euclidean and mahalanobis basically coincide,
# mahalanobis slightly biased by robust covariance underestimation
eqscplot(x=d, y=m); abline(0,1)

# Now I add a highly redundant column in hope the distances between cases will not change
x2 <- cbind(x, x[,1]+rnorm(n, sd=0.01))
# scale, otherwise euclidean fails
x2 <- scale(x2)
cr2 <- cov.rob(x2, method="mcd")
center2 <- cr2$center
d2 <- rowSums(t(t(x2)-center2)^2)
m2 <- as.vector(mahalanobis(x2, center2, cr2$cov))
# though equally scaled, euclidean and mahalanobis diverge
eqscplot(x=d2, y=m2); abline(0,1)

# mahalanobis distances are obviously not redundancy invariant
eqscplot(x=m, y=m2); abline(0,1)
# especially if rank order of distances is considered
eqscplot(x=rank(m), y=rank(m2)); abline(0,1)
cor(m, m2)
cor(m, m2, method="spearman")

# euclidean distances look better but are also not redundancy invariant
eqscplot(x=d, y=d2); abline(0,1)
eqscplot(x=rank(d), y=rank(d2)); abline(0,1)
cor(d, d2)
cor(d, d2, method="spearman")
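[Editorial sketch] The procedure described at the start of the question (robust mahalanobis distances compared against a chi^2 quantile with p degrees of freedom) might look roughly like the following. The fixed seed, the five planted outliers, and the 97.5% cutoff are assumptions made for illustration only; they do not come from the post.

```r
library(MASS)  # provides cov.rob()

set.seed(1)                          # assumption: fixed seed, for reproducibility only
n <- 500
p <- 2
x <- matrix(rnorm(n * p), ncol = p)
x[1:5, ] <- x[1:5, ] + 6             # assumption: five planted outliers for illustration

# robust location and scatter via the minimum covariance determinant estimator
cr <- cov.rob(x, method = "mcd")

# mahalanobis() returns *squared* distances
m <- mahalanobis(x, cr$center, cr$cov)

# assumption: flag cases above the 97.5% quantile of chi^2 with p degrees of freedom
cutoff  <- qchisq(0.975, df = p)
flagged <- which(m > cutoff)

print(cutoff)
print(flagged)
```

Even on clean normal data a cutoff of this kind is expected to flag a few percent of the cases, so the flagged set is a list of candidates to inspect, not a verdict.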
Prof Brian Ripley
2004-Jan-21  17:35 UTC
[R] outlier identification: is there a redundancy-invariant substitution for mahalanobis distances?
Your extra column is not redundant: it adds an extra column of information, and outliers in that column after removing the effects of the other columns are still multivariate outliers. Effectively you have added one more dimension to the sphered point cloud, and mahalanobis distance is Euclidean distance after sphering.

On Wed, 21 Jan 2004, "Jens Oehlschlägel" wrote:

> Dear R-experts,
>
> Searching the help archives I found a recommendation to do multivariate
> outlier identification by mahalanobis distances based on a robustly estimated
> covariance matrix and compare the resulting distances to a chi^2-distribution
> with p (number of your variables) degrees of freedom. I understand that
> compared to euclidean distances this has the advantage of being scale-invariant.
> However, it seems that such mahalanobis distances are not invariant to
> redundancies: adding a highly collinear variable changes the mahalanobis distances
> (see code below). Isn't also the comparison to chi^2 assuming that all
> variables are independent?

No. It assumes that *after sphering* all variables are independent, which is true by definition for a joint normal distribution.

> Can anyone recommend a procedure to calculate distances and identify
> multivariate outliers which is invariant to the degree of collinearity?

I don't think that makes any sense, given what is usually meant by `multivariate outliers', outliers in any direction in the point cloud.

[...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
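[Editorial sketch] The two points in the reply, that mahalanobis distance is Euclidean distance after sphering and that the added column contributes its outlyingness "after removing the effects of the other columns", can be checked numerically. This sketch swaps in the classical mean and covariance (`colMeans`, `cov`) for `cov.rob` so that the identities hold exactly; that substitution, the Cholesky-based sphering, and the fixed seed are assumptions of the sketch, not part of the thread.

```r
set.seed(1)                                   # assumption: fixed seed, for reproducibility only
n  <- 500
x  <- matrix(rnorm(n * 2), ncol = 2)
x3 <- cbind(x, x[, 1] + rnorm(n, sd = 0.01))  # near-collinear third column, as in the question

# assumption: classical estimates instead of cov.rob(), to keep the identities exact
center <- colMeans(x3)
S      <- cov(x3)

# Sphering: chol() gives upper-triangular R with t(R) %*% R == S,
# so the centered data times solve(R) has identity covariance.
R <- chol(S)
z <- sweep(x3, 2, center) %*% solve(R)

m_direct  <- mahalanobis(x3, center, S)       # squared mahalanobis distance
m_sphered <- rowSums(z^2)                     # squared Euclidean distance after sphering
print(all.equal(m_direct, m_sphered))
print(round(cov(z), 6))                       # identity matrix up to rounding

# The third column adds exactly its standardized squared residual
# after regressing it on the first two columns.
res   <- unname(residuals(lm(x3[, 3] ~ x3[, 1] + x3[, 2])))
m_two <- mahalanobis(x, colMeans(x), cov(x))  # distances from the first two columns only
extra <- res^2 / var(res)
print(all.equal(m_direct, m_two + extra))
```

The last check shows why the distances change: a case that is unremarkable in the first two columns can still have a large residual in the third, and on the sphered scale that residual counts as much as any other direction, however small its raw variance.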