thr3ads.net - R help - [R] Hoaglin Outlier Method [Apr 2005]

If this information is useful, please help other people find it:
Share via:

khobson@fd9ns01.okladot.state.ok.us

2005-Apr-22 15:23 UTC

[R] Hoaglin Outlier Method

I am a new user of R so please bear with me.  I have reviewed some R books,
FAQs and such but the volume of material is great.  I am in the process of
porting my current SAS and SVS Script code to Lotus Approach, R and
WordPerfect.

My question is, can you help me determine the best R method to implement
the Hoaglin Outlier Method?  It is used in the Appendix A and B of the fo
llowing link. http://trb.org/publications/nchrp/nchrp_w71.pdf

The sample data from Appendix A for determining outliers in R:
T314Data <-
structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57, 2.3,
2.034, 2, 1.97, 1.85,
1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66, 1.63, 1.62,
1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39,
1.36, 1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18,
1.18, 1.18, 1.17, 1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08,
1.07, 1.05, 0.98, 0.97, 0.84, 0.808, 0.69, 0.63, 0.6, 0.5), Y = c(5.28,
3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91, 1.78, 1.63, 1.81,
1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26, 1.79,
1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36,
1.2, 1.23, 0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57,
1.42, 1.08, 1.04, 1.33, 1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06,
1.27, 0.702, 0.77, 0.58, 1, 0.38)), .Names = c("Lab", "X",
"Y"
), class = "data.frame", row.names = c("1", "2",
"3", "4", "5",
"6", "7", "8", "9", "10",
"11", "12", "13", "14", "15",
"16",
"17", "18", "19", "20", "21",
"22", "23", "24", "25", "26",
"27",
"28", "29", "30", "31", "32",
"33", "34", "35", "36", "37",
"38",
"39", "40", "41", "42", "43",
"44", "45", "46", "47", "48",
"49",
"50", "51", "52", "53", "54",
"55", "56", "57", "58", "59",
"60"
))
>From this point on, I could use your advise.  There are several othermethods for determining outliers in R.  I'd rather not re-invent the wheel
or use a brute strength and force method if there is a better way in R.

Our usual method for determining outliers is a student's T test as in ASTM
E 178 or when the standard deviation for a lab is 3 or more.  We normally
have 120 labs to evaluate for outliers similar what is shown in T312Data.
On occasion, I have used the Wilk-Shapiro W statistic in SAS.  A point in
the right direction or an R code example would help greatly.  After I trim
the outliers, I will need to show which labs were eliminated but that
should be fairly trivial.

The reference in Appendix A is:
Hoaglin, D. C., Iglewicz, B., Tukey, J. W., Performance of Some Resistant
Rules for Outlier Labeling, Journal
of the American Statistical Association, Vol. 81, No. 396 (Dec., 1986), pp.
991-999.

The ASTM E 178 reference is:
Shapiro, S. S., and Wilk, M. B., An Analysis of Variance Test for
Non-Normality (Complete Samples), Biometrika, BIOKA, Vol 52,
1965, pp. 591611.

Kenneth Ray Hobson, P.E.
Oklahoma DOT - QA & IAS Manager
200 N.E. 21st Street
Oklahoma City, OK  73105-3204
(405) 522-4985, (405) 522-0552 fax

Liaw, Andy

2005-Apr-22 15:47 UTC

head link

[R] Hoaglin Outlier Method

That looks just like how `outliers' are determined in boxplots.  You can use
the output of boxplot.stats() to compute the limits.

[EDA purists would tell you that those shound be letter values (or `F' for
fourths), not quartiles.]

HTH,
Andy
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> khobson at fd9ns01.okladot.state.ok.us
> Sent: Friday, April 22, 2005 11:23 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Hoaglin Outlier Method
> 
> 
> 
> 
> 
> 
> I am a new user of R so please bear with me.  I have reviewed 
> some R books,
> FAQs and such but the volume of material is great.  I am in 
> the process of
> porting my current SAS and SVS Script code to Lotus Approach, R and
> WordPerfect.
> 
> My question is, can you help me determine the best R method 
> to implement
> the Hoaglin Outlier Method?  It is used in the Appendix A and 
> B of the fo
> llowing link. http://trb.org/publications/nchrp/nchrp_w71.pdf
> 
> The sample data from Appendix A for determining outliers in R:
> T314Data <-
> structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57, 2.3,
> 2.034, 2, 1.97, 1.85,
> 1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66, 1.63, 1.62,
> 1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39,
> 1.36, 1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18,
> 1.18, 1.18, 1.17, 1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08,
> 1.07, 1.05, 0.98, 0.97, 0.84, 0.808, 0.69, 0.63, 0.6, 0.5), Y 
> = c(5.28,
> 3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91, 1.78, 1.63, 1.81,
> 1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26, 1.79,
> 1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36,
> 1.2, 1.23, 0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57,
> 1.42, 1.08, 1.04, 1.33, 1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06,
> 1.27, 0.702, 0.77, 0.58, 1, 0.38)), .Names = c("Lab",
"X", "Y"
> ), class = "data.frame", row.names = c("1",
"2", "3", "4", "5",
> "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15",
"16",
> "17", "18", "19", "20",
"21", "22", "23", "24", "25",
"26", "27",
> "28", "29", "30", "31",
"32", "33", "34", "35", "36",
"37", "38",
> "39", "40", "41", "42",
"43", "44", "45", "46", "47",
"48", "49",
> "50", "51", "52", "53",
"54", "55", "56", "57", "58",
"59", "60"
> ))
> 
> >From this point on, I could use your advise.  There are several other
> methods for determining outliers in R.  I'd rather not 
> re-invent the wheel
> or use a brute strength and force method if there is a better 
> way in R.
> 
> Our usual method for determining outliers is a student's T 
> test as in ASTM
> E 178 or when the standard deviation for a lab is 3 or more.  
> We normally
> have 120 labs to evaluate for outliers similar what is shown 
> in T312Data.
> On occasion, I have used the Wilk-Shapiro W statistic in SAS. 
>  A point in
> the right direction or an R code example would help greatly.  
> After I trim
> the outliers, I will need to show which labs were eliminated but that
> should be fairly trivial.
> 
> The reference in Appendix A is:
> Hoaglin, D. C., Iglewicz, B., Tukey, J. W., "Performance of 
> Some Resistant
> Rules for Outlier Labeling," Journal
> of the American Statistical Association, Vol. 81, No. 396 
> (Dec., 1986), pp.
> 991-999.
> 
> The ASTM E 178 reference is:
> Shapiro, S. S., and Wilk, M. B., "An Analysis of Variance Test for
> Non-Normality (Complete Samples)," Biometrika, BIOKA, Vol 52,
> 1965, pp. 591-611.
> 
> Kenneth Ray Hobson, P.E.
> Oklahoma DOT - QA & IAS Manager
> 200 N.E. 21st Street
> Oklahoma City, OK  73105-3204
> (405) 522-4985, (405) 522-0552 fax
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
> 
>

Kenneth Hobson

2005-May-05 22:57 UTC

head link

[R] Re: Hoaglin Outlier Method

Boxplot.stats seems to be somewhat helpful but not the full answer to my needs
for eliminating outliers.  Any other suggestions?

In the first post I mentioned the Appendix A from 
http://trb.org/publications/nchrp/nchrp_w71.pdf .  They used X and Y varialbes
whereas boxplot.stats is using just one variable.  Can boxplot.stats use two
variables.  X and Y in this case are two samples that are usually from the same
population.

I've posted some example code below.  The first is the same as posted
earlier but a little easier to paste for testing.  The guts of what I tried
follow.  I used two iterations and it found 3 of the 4 outliers determined in
the Appendix A.

# Data from NCHRP Appendix A - http://trb.org/publications/nchrp/nchrp_w71.pdf
T314 <- structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57,
2.3,2.034, 2, 1.97, 1.85,1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66,
1.63, 1.62,1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39, 1.36,
1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18, 1.18, 1.18, 1.17,
1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08, 1.07, 1.05, 0.98, 0.97, 0.84, 0.808,
0.69, 0.63, 0.6, 0.5), Y = c(5.28, 3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91,
1.78, 1.63, 1.81, 1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26,
1.79, 1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36, 1.2, 1.23,
0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57, 1.42, 1.08, 1.04, 1.33,
1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06, 1.27, 0.702, 0.77, 0.58, 1, 0.38)),
.Names = c("Lab", "X", "Y" ), class =
"data.frame",
row.names = as.character(c(1:60)))

# Eliminate outliers in X, sample 1
bs <- boxplot.stats(T314$X, coef=1.5)
bs.out <- bs$out
zX <- subset(T314, !T314$X %in% bs.out)
bs.out; nrow(zX); zX

# Eliminate outliers in X, sample 1, again (Recheck in other words)
bs <- boxplot.stats(zX$X, coef=1.5)
bs.out <- bs$out
zX <- subset(zX, !zX$X %in% bs.out)
bs.out; nrow(zX); zX
	[[alternative HTML version deleted]]

Reasonably Related Threads

Search for more seemingly similar threads

R help - Apr 2005 - Hoaglin Outlier Method

[R] Hoaglin Outlier Method

[R] Hoaglin Outlier Method

[R] Re: Hoaglin Outlier Method

Reasonably Related Threads