I am a new user of R so please bear with me. I have reviewed some R books, FAQs and such but the volume of material is great. I am in the process of porting my current SAS and SVS Script code to Lotus Approach, R and WordPerfect. My question is, can you help me determine the best R method to implement the Hoaglin Outlier Method? It is used in the Appendix A and B of the fo llowing link. http://trb.org/publications/nchrp/nchrp_w71.pdf The sample data from Appendix A for determining outliers in R: T314Data <- structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57, 2.3, 2.034, 2, 1.97, 1.85, 1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66, 1.63, 1.62, 1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39, 1.36, 1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18, 1.18, 1.18, 1.17, 1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08, 1.07, 1.05, 0.98, 0.97, 0.84, 0.808, 0.69, 0.63, 0.6, 0.5), Y = c(5.28, 3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91, 1.78, 1.63, 1.81, 1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26, 1.79, 1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36, 1.2, 1.23, 0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57, 1.42, 1.08, 1.04, 1.33, 1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06, 1.27, 0.702, 0.77, 0.58, 1, 0.38)), .Names = c("Lab", "X", "Y" ), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60" ))>From this point on, I could use your advise. There are several othermethods for determining outliers in R. I'd rather not re-invent the wheel or use a brute strength and force method if there is a better way in R. Our usual method for determining outliers is a student's T test as in ASTM E 178 or when the standard deviation for a lab is 3 or more. We normally have 120 labs to evaluate for outliers similar what is shown in T312Data. On occasion, I have used the Wilk-Shapiro W statistic in SAS. A point in the right direction or an R code example would help greatly. After I trim the outliers, I will need to show which labs were eliminated but that should be fairly trivial. The reference in Appendix A is: Hoaglin, D. C., Iglewicz, B., Tukey, J. W., Performance of Some Resistant Rules for Outlier Labeling, Journal of the American Statistical Association, Vol. 81, No. 396 (Dec., 1986), pp. 991-999. The ASTM E 178 reference is: Shapiro, S. S., and Wilk, M. B., An Analysis of Variance Test for Non-Normality (Complete Samples), Biometrika, BIOKA, Vol 52, 1965, pp. 591611. Kenneth Ray Hobson, P.E. Oklahoma DOT - QA & IAS Manager 200 N.E. 21st Street Oklahoma City, OK 73105-3204 (405) 522-4985, (405) 522-0552 fax
That looks just like how `outliers' are determined in boxplots. You can use the output of boxplot.stats() to compute the limits. [EDA purists would tell you that those shound be letter values (or `F' for fourths), not quartiles.] HTH, Andy> -----Original Message----- > From: r-help-bounces at stat.math.ethz.ch > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of > khobson at fd9ns01.okladot.state.ok.us > Sent: Friday, April 22, 2005 11:23 AM > To: r-help at stat.math.ethz.ch > Subject: [R] Hoaglin Outlier Method > > > > > > > I am a new user of R so please bear with me. I have reviewed > some R books, > FAQs and such but the volume of material is great. I am in > the process of > porting my current SAS and SVS Script code to Lotus Approach, R and > WordPerfect. > > My question is, can you help me determine the best R method > to implement > the Hoaglin Outlier Method? It is used in the Appendix A and > B of the fo > llowing link. http://trb.org/publications/nchrp/nchrp_w71.pdf > > The sample data from Appendix A for determining outliers in R: > T314Data <- > structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57, 2.3, > 2.034, 2, 1.97, 1.85, > 1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66, 1.63, 1.62, > 1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39, > 1.36, 1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18, > 1.18, 1.18, 1.17, 1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08, > 1.07, 1.05, 0.98, 0.97, 0.84, 0.808, 0.69, 0.63, 0.6, 0.5), Y > = c(5.28, > 3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91, 1.78, 1.63, 1.81, > 1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26, 1.79, > 1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36, > 1.2, 1.23, 0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57, > 1.42, 1.08, 1.04, 1.33, 1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06, > 1.27, 0.702, 0.77, 0.58, 1, 0.38)), .Names = c("Lab", "X", "Y" > ), class = "data.frame", row.names = c("1", "2", "3", "4", "5", > "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", > "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", > "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", > "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", > "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60" > )) > > >From this point on, I could use your advise. There are several other > methods for determining outliers in R. I'd rather not > re-invent the wheel > or use a brute strength and force method if there is a better > way in R. > > Our usual method for determining outliers is a student's T > test as in ASTM > E 178 or when the standard deviation for a lab is 3 or more. > We normally > have 120 labs to evaluate for outliers similar what is shown > in T312Data. > On occasion, I have used the Wilk-Shapiro W statistic in SAS. > A point in > the right direction or an R code example would help greatly. > After I trim > the outliers, I will need to show which labs were eliminated but that > should be fairly trivial. > > The reference in Appendix A is: > Hoaglin, D. C., Iglewicz, B., Tukey, J. W., "Performance of > Some Resistant > Rules for Outlier Labeling," Journal > of the American Statistical Association, Vol. 81, No. 396 > (Dec., 1986), pp. > 991-999. > > The ASTM E 178 reference is: > Shapiro, S. S., and Wilk, M. B., "An Analysis of Variance Test for > Non-Normality (Complete Samples)," Biometrika, BIOKA, Vol 52, > 1965, pp. 591-611. > > Kenneth Ray Hobson, P.E. > Oklahoma DOT - QA & IAS Manager > 200 N.E. 21st Street > Oklahoma City, OK 73105-3204 > (405) 522-4985, (405) 522-0552 fax > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > >
Boxplot.stats seems to be somewhat helpful but not the full answer to my needs for eliminating outliers. Any other suggestions? In the first post I mentioned the Appendix A from http://trb.org/publications/nchrp/nchrp_w71.pdf . They used X and Y varialbes whereas boxplot.stats is using just one variable. Can boxplot.stats use two variables. X and Y in this case are two samples that are usually from the same population. I've posted some example code below. The first is the same as posted earlier but a little easier to paste for testing. The guts of what I tried follow. I used two iterations and it found 3 of the 4 outliers determined in the Appendix A. # Data from NCHRP Appendix A - http://trb.org/publications/nchrp/nchrp_w71.pdf T314 <- structure(list(Lab = as.integer(c(1:60)), X = c(4.89, 3.82, 2.57, 2.3,2.034, 2, 1.97, 1.85,1.85, 1.85, 1.84, 1.82, 1.82, 1.77, 1.76, 1.67, 1.66, 1.63, 1.62,1.62, 1.55, 1.54, 1.54, 1.53, 1.53, 1.44, 1.428, 1.42, 1.39, 1.36, 1.35, 1.31, 1.28, 1.24, 1.24, 1.23, 1.22, 1.21, 1.19, 1.18, 1.18, 1.18, 1.17, 1.16, 1.13, 1.13, 1.099, 1.09, 1.09, 1.08, 1.07, 1.05, 0.98, 0.97, 0.84, 0.808, 0.69, 0.63, 0.6, 0.5), Y = c(5.28, 3.82, 2.41, 2.32, 2.211, 1.46, 2.24, 1.91, 1.78, 1.63, 1.81, 1.92, 1.2, 1.67, 1.28, 1.59, 1.45, 2.06, 1.91, 1.19, 1.26, 1.79, 1.39, 1.48, 0.72, 1.29, 1.517, 1.71, 1.12, 1.38, 0.93, 1.36, 1.2, 1.23, 0.71, 1.29, 1.26, 1.48, 1.26, 1.33, 1.21, 1.04, 1.57, 1.42, 1.08, 1.04, 1.33, 1.33, 1.2, 1.05, 1.24, 0.91, 0.99, 1.06, 1.27, 0.702, 0.77, 0.58, 1, 0.38)), .Names = c("Lab", "X", "Y" ), class = "data.frame", row.names = as.character(c(1:60))) # Eliminate outliers in X, sample 1 bs <- boxplot.stats(T314$X, coef=1.5) bs.out <- bs$out zX <- subset(T314, !T314$X %in% bs.out) bs.out; nrow(zX); zX # Eliminate outliers in X, sample 1, again (Recheck in other words) bs <- boxplot.stats(zX$X, coef=1.5) bs.out <- bs$out zX <- subset(zX, !zX$X %in% bs.out) bs.out; nrow(zX); zX [[alternative HTML version deleted]]
Apparently Analagous Threads
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- bwplot stats question
- pros and cons of "robust regression"? (i.e. rlm vs lm)
- Sampling data without having infinite numbers after diong a transformation