Feiming Chen
2011-Feb-03 20:03 UTC
[R] "hubers" function in R MASS library : problem and solution
Hello: I found the "hubers" function in MASS library is NOT working on the following data:> a <- >c(7.19,7.19,7.19,9.41,6.79,9.41,7.19,9.41,1.64,7.19,7.19,7.19,7.19,1.422,7.19,6.79,7.19,6.79,7.19,7.19,4.44,6.55,6.79,7.19,9.41,9.41,7.19,7.19,7.19,7.19,1.64,1.597,1.64,7.19,1.422,7.19,6.79,9.38,7.19,1.64,7.19,7.19,7.19,7.19,7.19,1.64,7.19,6.79,6.79,1.649,1.64,7.19,1.597,1.64,6.55,7.19,7.19,1.64,7.19,7.19,1.407,1.672,1.672,7.19,6.55,7.19,7.19,9.41,1.407,7.19,7.19,9.41,7.19,9.41,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,9.41,7.19,6.79,7.19,6.79,1.64,1.422,7.19,7.19,1.67,1.64,1.64,1.64,1.64,1.787,7.19,7.19,7.19,6.79,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,1.64,1.64,1.64,1.422,9.41,1.64,7.19,7.19,7.19,7.19,7.19,7.19,7.19,6.79,6.79,7.19,6.79,7.19,7.19,1.407,7.19,4.42,9,1.64,1.64,6.79,1.664,1.664) >> library(MASS) > hubers(a)## NO response! I think it is due to the infinite loop caused by the following line in the code of "hubers" (around Line 30): if ((abs(mu0 - mu1) < tol * s0) && abs(s0 - s1) < tol * s0) break where "s0" evaluates to ZERO initially (due to more than 50% of the number 7.19). I propose to change the "<" sign to "<=": if ((abs(mu0 - mu1) <= tol * s0) && abs(s0 - s1) <= tol * s0) break which will break the infinite loop. However, the new result is:> hubers(a)$mu [1] 7.19 $s [1] 0 which gives 0 standard deviation. Actually the data does vary and it is not true all values other than 7.19 are outliers. Take a look at:> plot(a)I think this is because we allow initial SD to equal to zero instead of a POSITIVE value. See Line 15 of the "hubers" function: if (missing(s)) s0 <- mad(y) I propose setting "s0" to "mad(y)" or a small positive number, whichever is larger. For example: if (missing(s)) s0 <- max(mad(y), tol) where tol=1e-6. With this change, the result is more sensible:> hubers(a)$mu [1] 5.88 $s [1] 2.937 Could anyone take a look at this and decide if the above modifications to the "hubers" function are warranted?Thanks a lot! Sincerely, Feiming Chen Read more >> Options >> [[alternative HTML version deleted]]
Martin Maechler
2011-Feb-04 09:05 UTC
[R] "hubers" function in R MASS library : problem and solution
>>>>> Feiming Chen <feimingchen at yahoo.com> >>>>> on Thu, 3 Feb 2011 12:03:05 -0800 (PST) writes:> Hello: > I found the "hubers" function in MASS library is NOT working on the following > data: >> a <- >> c(7.19,7.19,7.19,9.41,6.79,9.41,7.19,9.41,1.64,7.19,7.19,7.19,7.19,1.422,7.19,6.79,7.19,6.79,7.19,7.19,4.44,6.55,6.79,7.19,9.41,9.41,7.19,7.19,7.19,7.19,1.64,1.597,1.64,7.19,1.422,7.19,6.79,9.38,7.19,1.64,7.19,7.19,7.19,7.19,7.19,1.64,7.19,6.79,6.79,1.649,1.64,7.19,1.597,1.64,6.55,7.19,7.19,1.64,7.19,7.19,1.407,1.672,1.672,7.19,6.55,7.19,7.19,9.41,1.407,7.19,7.19,9.41,7.19,9.41,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,9.41,7.19,6.79,7.19,6.79,1.64,1.422,7.19,7.19,1.67,1.64,1.64,1.64,1.64,1.787,7.19,7.19,7.19,6.79,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,7.19,1.64,1.64,1.64,1.422,9.41,1.64,7.19,7.19,7.19,7.19,7.19,7.19,7.19,6.79,6.79,7.19,6.79,7.19,7.19,1.407,7.19,4.42,9,1.64,1.64,6.79,1.664,1.664) >> >> library(MASS) >> hubers(a) > ## NO response! > I think it is due to the infinite loop caused by the following line in the code > of "hubers" (around Line 30): > if ((abs(mu0 - mu1) < tol * s0) && > abs(s0 - s1) < tol * s0) break > where "s0" evaluates to ZERO initially (due to more than 50% of the number > 7.19). yes. Not only for this reason, the robustbase package has had the 'huberM()' function with some other slight advantages over MASS::huber. > I propose to change the "<" sign to "<=": > if ((abs(mu0 - mu1) <= tol * s0) && > abs(s0 - s1) <= tol * s0) break > which will break the infinite loop. However, the new result is: >> hubers(a) > $mu > [1] 7.19 > $s > [1] 0 > which gives 0 standard deviation. Actually the data does vary and it is not > true all values other than 7.19 are outliers. Sure. Nontheless, the way Peter Huber had defined the "proposal 2" Huber estimator, s = 0, is the correct result. With the robustbase huberM() function, you (can) get> huberM(a, warn0scale =TRUE)$mu [1] 7.19 $s [1] 0 $it [1] 0 Warning message: In huberM(a, warn0scale = TRUE) : scale 's' is zero -- returning initial 'mu' >> plot(a) > I think this is because we allow initial SD to equal to zero instead of a > POSITIVE value. See Line 15 of the "hubers" function: > if (missing(s)) > s0 <- mad(y) > I propose setting "s0" to "mad(y)" or a small positive number, whichever is > larger. For example: > if (missing(s)) > s0 <- max(mad(y), tol) > where tol=1e-6. but 'tol' is completely arbitrary and, the way you propose it makes the resulting estimate no-longer-scale-equivariant. huberM() *has* an s argument for specifying the scale estimate, so you could use it as huberM(a, s = max(mad(a), 1e-6)) if you want. Note that your sample 'a' is constructed in a way that all scale-equivariant 50%-breakpoint robust estimates of scale will return s = 0, as more than half of your observations are identical, and scale equivariance "ensures" that in this limiting case, indeed all other observations are "outliers". This last point is a somewhat interesting topic for "robustniks", and hence I'm CC'ing the dedicated "R + Robustness" mailing list, R-SIG-robust. Martin Maechler, ETH Zurich > With this change, the result is more sensible: >> hubers(a) > $mu > [1] 5.88 > $s > [1] 2.937 > Could anyone take a look at this and decide if the above modifications to the > "hubers" function are warranted?Thanks a lot! > Sincerely, > Feiming Chen > Read more >> Options >> > [[alternative HTML version deleted]] > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.