Replacing outliers with the group median is a sure way to get a biased
variance estimate.
Instead, use a robust dispersion (scale) estimator such as Gini's mean
difference (average absolute difference between any two observations).
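
As a rough sketch only, here is one way to compute Gini's mean difference
per group in base R, using the numbers from the original post. The names
d, gmd and gmd_by_group are made up for illustration and are not base R
functions; the Hmisc package also has a GiniMd function that could be used
instead.

```r
# Data from the original post (population as a factor, conlen3 numeric)
d <- data.frame(
  population = factor(rep(c("YXPy01", "BSPy01", "YLPy01"), times = c(7, 6, 8))),
  conlen3 = c(8.6, 8.1, 7.6, 7.6, 23, 7.6, 7.6,
              7.5, 6.4, 5.4, 15, 6.6, 5.5,
              5.4, 5.4, 5.6, 21, 5.4, 5.4, 5.4, 4.9)
)

# Gini's mean difference: the average of |x_i - x_j| over all pairs i != j.
# gmd is a hypothetical helper name used only for this sketch.
gmd <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # outer() builds the full n x n matrix of differences; the diagonal is 0,
  # so dividing by n * (n - 1) averages over the off-diagonal pairs only.
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Dispersion per population, with every observation left in place
gmd_by_group <- tapply(d$conlen3, d$population, gmd)
print(gmd_by_group)

# For comparison: the ordinary standard deviation per population
sd_by_group <- tapply(d$conlen3, d$population, sd)
print(sd_by_group)
```

Nothing is deleted or replaced here; the unusual values stay in the data
and the estimator is simply less sensitive to them than the SD is.
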
The median is a robust location estimator. There are others. If your
ultimate goal is a comparison you can use a robust nonparametric test.
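
A minimal sketch of both ideas, again using the data from the original
post. The object names d, med_by_group and kw are assumptions made for
this example, and the Kruskal-Wallis test is just one choice of robust
nonparametric test for comparing three groups, not the only one.

```r
# Data from the original post (population as a factor, conlen3 numeric)
d <- data.frame(
  population = factor(rep(c("YXPy01", "BSPy01", "YLPy01"), times = c(7, 6, 8))),
  conlen3 = c(8.6, 8.1, 7.6, 7.6, 23, 7.6, 7.6,
              7.5, 6.4, 5.4, 15, 6.6, 5.5,
              5.4, 5.4, 5.6, 21, 5.4, 5.4, 5.4, 4.9)
)

# Robust location: the median of each population, unusual values included
med_by_group <- tapply(d$conlen3, d$population, median)
print(med_by_group)

# Rank-based comparison of the three populations. Because it works on
# ranks, a single extreme value cannot dominate the result.
kw <- kruskal.test(conlen3 ~ population, data = d)
print(kw)
```

With these data the group medians come out as 6.5 (BSPy01), 5.4 (YLPy01)
and 7.6 (YXPy01), even though each group contains one very large value.
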
You'll find that the word 'outlier' is hard to define, so it's best left
undefined and unused.
Frank
Mao Jianfeng wrote:
> Dear R-helpers,
>
> A very small number of outliers can greatly affect the mean and many other
> statistics of a numeric variable. So we usually must deal with outliers
> properly during data analysis. Here, I want to replace outliers
> with the group median of the variable. But I cannot work out a good way
> to do that efficiently, because I am a newbie to R and programming.
>
> Can anybody share an R script to do that? I think it would also be valuable
> to the many others who do numerical data analysis.
>
> Here is a dummy data frame with a group variable (three levels) and a
> numeric one. I just want to know how to replace outliers by the group median.
>
> population conlen3
> YXPy01 8.6
> YXPy01 8.1
> YXPy01 7.6
> YXPy01 7.6
> YXPy01 23
> YXPy01 7.6
> YXPy01 7.6
> BSPy01 7.5
> BSPy01 6.4
> BSPy01 5.4
> BSPy01 15
> BSPy01 6.6
> BSPy01 5.5
> YLPy01 5.4
> YLPy01 5.4
> YLPy01 5.6
> YLPy01 21
> YLPy01 5.4
> YLPy01 5.4
> YLPy01 5.4
> YLPy01 4.9
>
> Thanks a lot in advance.
>
> Best regards,
> Mao J-F
>
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University