AbouEl-Makarim Aboueissa
2023-Apr-20 18:46 UTC
[R] detect and replace outliers by the average
Hi Rui:

here is the dataset

factor   x1    x2
0       700   700
0       700   500
0       470   470
0       710   560
0      5555   520
0       610   720
0       710   670
0       610  9999
1       690   620
1       580   540
1       690   690
1        NA   401
1       450   580
1       700   700
1       400  8888
1      6666   600
1       500   400
1       680   650
2       117    63
2       120    68
2       130    73
2       120    69
2       125    54
2       999    70
2       165    62
2       130   987
2       123    70
2        78
2        98
2         5
2       321    NA

with many thanks
abou
______________________

AbouEl-Makarim Aboueissa, PhD
Professor, Mathematics and Statistics
Graduate Coordinator
Department of Mathematics and Statistics
University of Southern Maine

On Thu, Apr 20, 2023 at 2:44 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:

> At 19:36 on 20/04/2023, AbouEl-Makarim Aboueissa wrote:
> > Dear All:
> >
> > Re: detect and replace outliers by the average
> >
> > The dataset, please see attached, contains a group factoring column
> > "factor" and two columns of data "x1" and "x2" with some NA values. I need
> > some help to detect the outliers and replace them and the NAs with the
> > average within each level (0,1,2) for each variable "x1" and "x2".
> >
> > I tried the below code, but it did not accomplish what I want to do.
> >
> > data<-read.csv("G:/20-Spring_2023/Outliers/data.csv", header=TRUE)
> > data
> > replace_outlier_with_mean <- function(x) {
> >   replace(x, x %in% boxplot.stats(x)$out, mean(x, na.rm=TRUE)) #### , na.rm=TRUE NOT working
> > }
> > data[] <- lapply(data, replace_outlier_with_mean)
> >
> > Thank you all very much for your help in advance.
> >
> > with many thanks
> > abou
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> Hello,
>
> There is no data set attached, see the posting guide on what file
> extensions are allowed as attachments.
>
> As for the question, try to compute mean(x, na.rm = TRUE) first, then
> use this value in the replace instruction. Without data I'm just guessing.
>
> Hope this helps,
>
> Rui Barradas

At 19:46 on 20/04/2023, AbouEl-Makarim Aboueissa wrote:
> Hi Rui:
>
> here is the dataset
> [...]

Hello,

Here is a way. The function uses ave() to apply the replacement separately within each level of the factor.

df1 <- "factor x1 x2
0 700 700
0 700 500
0 470 470
0 710 560
0 5555 520
0 610 720
0 710 670
0 610 9999
1 690 620
1 580 540
1 690 690
1 NA 401
1 450 580
1 700 700
1 400 8888
1 6666 600
1 500 400
1 680 650
2 117 63
2 120 68
2 130 73
2 120 69
2 125 54
2 999 70
2 165 62
2 130 987
2 123 70
2 78 NA
2 98 NA
2 5 NA
2 321 NA"
df1 <- read.table(text = df1, header = TRUE,
                  colClasses = c("factor", "numeric", "numeric"))

# x: numeric vector, f: grouping factor of the same length
replace_outlier_with_mean <- function(x, f) {
  ave(x, f, FUN = \(y) {
    # flag NAs and boxplot outliers within the current group
    i <- is.na(y) | y %in% boxplot.stats(y, do.conf = FALSE)$out
    # replace flagged values with the group mean of the non-missing values
    y[i] <- mean(y, na.rm = TRUE)
    y
  })
}

lapply(df1[-1], replace_outlier_with_mean, f = df1$factor)
#> $x1
#>  [1]  700.0000  700.0000  470.0000  710.0000 1258.1250  610.0000  710.0000
#>  [8]  610.0000  690.0000  580.0000  690.0000 1261.7778  450.0000  700.0000
#> [15]  400.0000 1261.7778  500.0000  680.0000  117.0000  120.0000  130.0000
#> [22]  120.0000  125.0000  194.6923  194.6923  130.0000  123.0000  194.6923
#> [29]   98.0000  194.6923  194.6923
#>
#> $x2
#>  [1]  700.0000  500.0000  470.0000  560.0000  520.0000  720.0000  670.0000
#>  [8] 1767.3750  620.0000  540.0000  690.0000  401.0000  580.0000  700.0000
#> [15] 1406.9000  600.0000  400.0000  650.0000   63.0000   68.0000   73.0000
#> [22]   69.0000   54.0000   70.0000   62.0000  168.4444   70.0000  168.4444
#> [29]  168.4444  168.4444  168.4444

Hope this helps,

Rui Barradas
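
One thing worth noting about the result above: mean(y, na.rm = TRUE) is taken over every non-missing value in the group, so the outliers themselves still contribute to the replacement value (1258.125 for level 0 of x1, for example). If the average of the remaining values is what is wanted, a possible variant is sketched below. This is only an illustration under that assumption: the name replace_outlier_with_clean_mean and the small toy data frame are made up here and do not come from the thread.

```r
# Hypothetical variant of replace_outlier_with_mean(): the replacement
# value is the mean of the values that are neither NA nor flagged as
# outliers by boxplot.stats() within the same group.
replace_outlier_with_clean_mean <- function(x, f) {
  ave(x, f, FUN = function(y) {
    out <- boxplot.stats(y, do.conf = FALSE)$out
    bad <- is.na(y) | y %in% out
    y[bad] <- mean(y[!bad])
    y
  })
}

# Made-up example data, not the data posted in this thread.
toy <- data.frame(
  g  = factor(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b")),
  x1 = c(10, 11, 12, 13, 14, 1000, 1, 2, 3, 4, 5, NA)
)

toy$x1_clean <- replace_outlier_with_clean_mean(toy$x1, toy$g)
toy
```

In this toy example the 1000 in group "a" becomes 12 and the NA in group "b" becomes 3. If every value in a group is NA or flagged, mean() of an empty vector returns NaN, so that case would need separate handling. To write results back into the original data frame rather than just print them, something like df1[-1] <- lapply(df1[-1], replace_outlier_with_mean, f = df1$factor) should work with either function.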