AbouEl-Makarim Aboueissa
2023-Apr-20 18:36 UTC
[R] detect and replace outliers by the average
Dear All:

*Re:* detect and replace outliers by the average

The dataset, please see attached, contains a group factoring column "*factor*" and two columns of data "x1" and "x2" with some NA values. I need some help to detect the outliers and replace them and the NAs with the average within each level (0,1,2) for each variable "x1" and "x2".

I tried the below code, but it did not accomplish what I want to do.

data<-read.csv("G:/20-Spring_2023/Outliers/data.csv", header=TRUE)
data

replace_outlier_with_mean <- function(x) {
  replace(x, x %in% boxplot.stats(x)$out, mean(x, na.rm=TRUE)) #### , na.rm=TRUE NOT working
}

data[] <- lapply(data, replace_outlier_with_mean)

Thank you all very much for your help in advance.

with many thanks
abou
______________________
*AbouEl-Makarim Aboueissa, PhD*
*Professor, Mathematics and Statistics*
*Graduate Coordinator*
*Department of Mathematics and Statistics*
*University of Southern Maine*
Hello,

There is no data set attached; see the posting guide on what file extensions are allowed as attachments.

As for the question, try to compute mean(x, na.rm = TRUE) first, then use this value in the replace instruction. Without data I'm just guessing.

Hope this helps,

Rui Barradas
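The suggestion above might look roughly like the sketch below. No data was attached to the thread, so the vector `x` is made up for illustration. The extra `is.na(x)` condition is an assumption added here so that the NAs are replaced as well, which the original `%in%` test alone would not do.

```r
# Hypothetical data: nothing was attached to the thread, so this vector is made up.
x <- c(2.1, 1.9, 2.4, NA, 2.0, 15.0, 2.2)

replace_outlier_with_mean <- function(x) {
  # Compute the mean first, ignoring NAs.
  # Note that this mean still includes the outlying values themselves.
  m <- mean(x, na.rm = TRUE)
  # Flag boxplot outliers; also flag NAs (an addition, not in the original code).
  bad <- x %in% boxplot.stats(x)$out | is.na(x)
  replace(x, bad, m)
}

y <- replace_outlier_with_mean(x)
print(y)
```

This works on a single vector only; it does not yet compute the average within each level of the grouping column.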
This can be seen as three steps:
(1) identify outliers
(2) replace them with NA (trivial)
(3) impute missing values.

There are packages for imputing missing data. See
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

Here I just want to address the first step. An observation is only an outlier relative to some model. Outliers can indicate
- data that are just wrong (data entry error, failing battery in measurement device, all sorts of stuff). In this case, deletion + imputation makes sense.
- data that are generated by a mixture of two or more processes, not the single process you thought was there. In this case, deletion + imputation is dangerous. The world is trying to tell you something and you are ignoring it.
- the model is wrong. Here again, deletion + imputation is dangerous. You need a better model.

"Detecting outliers in R" as a web query turned up
https://statsandr.com/blog/outliers-detection-in-r/
on the first page of results. There's plenty of material about finding outliers.

But please give very VERY serious consideration to the possibility that some or even all of your outliers are actually GOOD data telling you something you need to know.
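For readers who do decide that deletion + imputation is appropriate, the three steps could be sketched within each group as follows. This is only an illustration under stated assumptions: the data frame is made up (the CSV was never attached), the column names `factor`, `x1` and `x2` follow the original question, the helper name `outliers_to_group_mean` is hypothetical, and the boxplot rule is just one possible model of what counts as an outlier.

```r
# Hypothetical data standing in for the unattached CSV.
data <- data.frame(
  factor = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  x1 = c(2.1, 1.9, 2.4, NA, 2.0, 15.0, 5.2, 4.9, 5.1, 5.0, NA, 5.3),
  x2 = c(10, 11, 9, 10, 40, NA, 20, 21, NA, 19, 20, 22)
)

# Hypothetical helper: apply the three steps separately within each level of g.
outliers_to_group_mean <- function(x, g) {
  ave(x, g, FUN = function(v) {
    # Step 1: identify outliers (boxplot rule, an assumed model).
    out <- v %in% boxplot.stats(v)$out
    # Step 2: replace them with NA.
    v[out] <- NA
    # Step 3: impute every NA with the mean of the remaining values.
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
}

cols <- c("x1", "x2")
cleaned <- data
cleaned[cols] <- lapply(data[cols], outliers_to_group_mean, g = data$factor)
print(cleaned)
```

Unlike replacing with the mean of the full column, the imputed value here is the group mean computed after the flagged outliers have been set aside. Whether that is sensible depends entirely on why the outliers are there, as discussed above.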