Sometimes outliers happen. No matter the sample size, there is always the
possibility that one or more values are correct, though highly improbable.
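To put a rough number on that, here is a small sketch. It assumes the data are normally distributed and that "outlier" means the usual boxplot rule (beyond 1.5 times the interquartile range from the quartiles); both are my assumptions, not something stated in the thread. Under those assumptions about 0.7% of perfectly valid values get flagged, so the expected count of flagged values grows with the sample size.

```r
# Fences of the usual boxplot rule for a standard normal population:
# quartiles minus/plus 1.5 times the interquartile range.
q <- qnorm(c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * diff(q)

# Probability that a perfectly valid observation lies beyond either fence.
p_flagged <- 2 * pnorm(-upper_fence)

# Expected number of flagged-but-valid values for a few sample sizes.
n <- c(100, 1000, 10000)
expected_flagged <- n * p_flagged

print(round(p_flagged, 4))
print(round(expected_flagged, 1))
```

Note that boxplot.stats() uses hinges rather than exact quartiles, so small samples will differ a little from this calculation.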
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Richard
O'Keefe
Sent: Friday, April 21, 2023 7:31 PM
To: AbouEl-Makarim Aboueissa <abouelmakarim1962 at gmail.com>
Cc: R mailing list <r-help at r-project.org>
Subject: Re: [R] detect and replace outliers by the average
This can be seen as three steps:
(1) identify outliers
(2) replace them with NA (trivial)
(3) impute missing values.
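A minimal sketch of the three steps, done within each group, might look like the following. The toy data frame and the helper names outliers_to_na() and impute_mean() are hypothetical, made up for illustration; the attached file is not reproduced here. The boxplot rule and the group mean are used only because the original post asked for them, not because they are the right model. In the posted code, x %in% boxplot.stats(x)$out is FALSE for NA, so the NAs were never replaced, and lapply() over the whole data frame ignores the grouping column.

```r
# Hypothetical toy data standing in for the attached file (not the real data).
dat <- data.frame(
  factor = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  x1     = c(10, 11, 12, 11, 95, NA, 20, 21, 19, 20, NA, -40)
)

# Steps 1 and 2: flag values outside the boxplot fences and set them to NA.
outliers_to_na <- function(x) {
  x[x %in% boxplot.stats(x)$out] <- NA
  x
}

# Step 3: fill every NA with the mean of the remaining values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the function separately within each level of the group.
dat$x1_clean <- ave(dat$x1, dat$factor,
                    FUN = function(x) impute_mean(outliers_to_na(x)))

print(dat)
```

The same ave() call can be repeated for "x2". Mean imputation shrinks the within-group variance, so one of the imputation packages is likely a better choice for real analysis.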
There are packages for imputing missing data.
See
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
Here I just want to address the first step.
An observation is only an outlier relative to some model.
Outliers can indicate
- data that are just wrong (data entry error, failing battery in measurement
device, all sorts of stuff). In this case, deletion + imputation makes
sense.
- data that are generated by a mixture of two or more processes,
not the single process you thought was there. In this case,
deletion + imputation is dangerous. The world is trying to tell
you something and you are ignoring it.
- the model is wrong. Here again, deletion + imputation is
dangerous. You need a better model.
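To illustrate the mixture case, here is a small sketch with made-up numbers (the values are hypothetical, chosen only to make the point). Most readings come from one process near 10 and a few from a second process near 20. The boxplot rule flags the second process as "outliers", and replacing them with the mean of the rest removes every trace of it.

```r
# Hypothetical readings: 21 values from one process near 10,
# plus 3 values from a second process near 20.
x <- c(seq(9, 11, by = 0.1), 19.5, 20, 20.5)

# The boxplot rule flags the three values from the second process.
flagged <- boxplot.stats(x)$out
print(flagged)

# Replacing them with the mean of the rest erases that second process.
x_imputed <- replace(x, x %in% flagged, mean(x[!(x %in% flagged)]))
print(range(x_imputed))
```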
"Detecting outliers in R" as a web query turned up
https://statsandr.com/blog/outliers-detection-in-r/
on the first page of results. There's plenty of material about finding
outliers.
But please give very VERY serious consideration to the possibility that some or
even all of your outliers are actually GOOD data telling you something you need
to know.
On Fri, 21 Apr 2023 at 06:38, AbouEl-Makarim Aboueissa <abouelmakarim1962 at gmail.com> wrote:
> Dear All:
>
> *Re:* detect and replace outliers by the average
>
> The dataset, please see attached, contains a group factoring column
> "*factor*" and two columns of data, "x1" and "x2", with some NA values.
> I need some help to detect the outliers and replace them and the NAs with
> the average within each level (0, 1, 2) for each variable "x1" and "x2".
>
> I tried the below code, but it did not accomplish what I want to do.
>
> data<-read.csv("G:/20-Spring_2023/Outliers/data.csv", header=TRUE)
>
> data
>
> replace_outlier_with_mean <- function(x) {
>
> replace(x, x %in% boxplot.stats(x)$out, mean(x, na.rm=TRUE)) #### , na.rm=TRUE NOT working
>
> }
>
> data[] <- lapply(data, replace_outlier_with_mean)
>
> Thank you all very much for your help in advance.
>
> with many thanks
>
> abou
>
> ______________________
>
> *AbouEl-Makarim Aboueissa, PhD*
>
> *Professor, Mathematics and Statistics* *Graduate Coordinator*
>
> *Department of Mathematics and Statistics* *University of Southern
> Maine*
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.