Kevin Wright
2004-May-26 19:31 UTC
[R] Outlier identification according to Hardin & Rocke (1999)
I'm trying to use a paper by Hardin & Rocke, http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf, as a guide for a function to identify outliers in multivariate data. Attached below is a function that is my attempt to reproduce their method, along with a test to see what fraction of the data are identified as outliers.

Using this function I am able to reproduce their results for the asymptotic chi-square method, but not for their new method using an asymptotic F distribution. In particular, the asymptotic F method (and the adjusted F method) seem to give critical distances that are too large, so that no (or very few) points are identified as outliers.

I have tried unsuccessfully to contact the authors of the paper to seek further information and/or numerical examples. I would be most interested if anybody has used the method of Hardin & Rocke, or can take the function I have provided below and modify it to reproduce their results.
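For reference (and in case I have simply misread the paper), the two cutoffs my function computes for a robust squared distance d^2 are, in LaTeX notation,

  d^2 > \chi^2_{p,\,1-\alpha}                                  (asymptotic chi-square)

  d^2 > \frac{p\,m}{c\,(m-p+1)} \, F_{p,\,m-p+1,\,1-\alpha}    (asymptotic F)

where p is the dimension and c and m are estimated as on pages 17-22 of the 1999 paper. This is only my reading of those pages, so corrections to the cutoff formulas themselves would be equally welcome.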
Note: the mvtnorm package is required (for the test block at the end).

Best,

Kevin Wright

outliers.id <- function(x, alpha = 0.05){
  # See: Hardin & Rocke (1999), "The Distribution of Robust Distances"
  # http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf
  # See: Hardin & Rocke (2002), "Outlier Detection in the Multiple Cluster
  # Setting Using the Minimum Covariance Determinant Estimator"
  # http://bioinfo.cipic.ucdavis.edu/publications/print_pub?pub_id=736&category=1

  # Drop factor columns first, keeping only the numeric variables
  factors <- names(x)[sapply(x, is.factor)]
  if(length(factors) > 0) x <- x[setdiff(names(x), factors)]

  # Get the robust location/scale estimates
  require(MASS)
  covResult <- cov.rob(x)

  # Calculate the Mahalanobis distance for each datum
  distance <- mahalanobis(x, covResult$center, covResult$cov)

  n <- nrow(x)
  p <- ncol(x)
  h <- floor((n + p + 1)/2)

  # Asymptotic chi-square method (page 11).
  # Often identifies too many points as outliers.
  critical.chi <- qchisq(1 - alpha, p)
  cat("Chi square critical distance:", critical.chi, "\n")

  # Now the approximate F method.  First estimate c (page 19).
  c <- pchisq(qchisq(1 - h/n, p), p + 2) / (h/n)

  # Now estimate m (page 22)
  a  <- (n - h)/n
  qa <- qchisq(1 - a, p)
  ca <- (1 - a)/pchisq(qa, p + 2)
  c2 <- -pchisq(qa, p + 2)/2
  c3 <- -pchisq(qa, p + 4)/2
  c4 <- 3 * c3
  b1 <- ca * (c3 - c4) / (1 - a)
  b2 <- 0.5 + ca/(1 - a) * (c3 - qa/p * (c2 + (1 - a)/2))
  v1 <- (1 - a) * b1^2 * (a * (ca * qa/p - 1)^2 - 1) -
        2 * c3 * ca^2 * (3 * (b1 - p*b2)^2 + (p + 2) * b2 * (2*b1 - p*b2))
  v2 <- n * (b1 * (b1 - p*b2) * (1 - a))^2 * ca^2
  v  <- v1/v2
  m  <- 2/(ca^2 * v)

  # Critical point from the asymptotic F distribution (page 17)
  critical.F.asy <- p*m * qf(1 - alpha, p, m - p + 1) / (c * (m - p + 1))
  cat("Asymptotic F critical distance:", critical.F.asy, "\n")

  # The small-sample (hundreds of points) adjustment to m.
  # Hardin & Rocke, 2002, page 631
  m <- m * exp(0.725 - 0.00663*p - 0.0780*log(n))

  # Finally, the critical point using the adjusted m
  critical.F.adj <- p*m * qf(1 - alpha, p, m - p + 1) / (c * (m - p + 1))
  cat("Adjusted asymptotic F critical distance:", critical.F.adj, "\n")

  #outliers <- as.integer(distance > critical)
  x$Distances <- distance
  #x$Outliers <- outliers
  attr(x, "critical.chi")   <- critical.chi
  attr(x, "critical.F.asy") <- critical.F.asy
  attr(x, "critical.F.adj") <- critical.F.adj
  return(x)
}

# Try to reproduce the tables in the paper
if(FALSE){
  require(mvtnorm)

  # Simulate clean multivariate normal data
  n <- 500; p <- 5
  dat <- rmvnorm(n, rep(0, p), diag(p))

  # Identify outliers
  dat.out <- outliers.id(as.data.frame(dat), alpha = .05)

  # Now see what percent are identified as outliers by each cutoff
  cat("Chi-square \n")
  print(100 * sum(dat.out$Distances > attr(dat.out, "critical.chi")) / n)
  cat("Approximate F asymptotic \n")
  print(100 * sum(dat.out$Distances > attr(dat.out, "critical.F.asy")) / n)
  cat("Approximate F adjusted \n")
  print(100 * sum(dat.out$Distances > attr(dat.out, "critical.F.adj")) / n)
}
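One thing I have not ruled out (this is only a guess on my part, not something taken from the paper): cov.rob() defaults to the minimum volume ellipsoid estimator, whereas, as I understand it, the distances in Hardin & Rocke are based on the minimum covariance determinant (MCD) estimator. If anyone thinks that could account for the discrepancy, the call I would try in place of cov.rob(x) above is

  # Hypothetical alternative: ask MASS::cov.rob for the MCD explicitly,
  # using the same subset size h that the c and m calculations assume.
  covResult <- cov.rob(x, method = "mcd",
                       quantile.used = floor((nrow(x) + ncol(x) + 1)/2))

but I have no confirmation that this is what the authors did.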