thr3ads.net - R help - [R] Removing Outliers Function [Feb 2011]

kirtau

2011-Feb-09 02:11 UTC

[R] Removing Outliers Function

I am working on a function that will remove outliers for regression analysis.
I am treating a data point as an outlier if its studentized residual is
above 3 or below -3. The code below is what I have so far for the
function:

x = c(1:20)
y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 = data.frame(x,y)

 
rm.outliers = function(dataset,dependent,independent){
    dataset$predicted = predict(lm(dependent~independent))
    dataset$stdres = rstudent(lm(dependent~independent))
    m = 1
    for(i in 1:length(dataset$stdres)){
      dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
dataset$stdres[i] <= -3) {m} else{0}
    }
    j = length(which(dataset$outlier_counter >= 1))
    while(j>=1){
      print(dataset[which(dataset$outlier_counter >= 1),])
      dataset = dataset[which(dataset$outlier_counter == 0),]
      dataset$predicted = predict(lm(dependent~independent))
      dataset$stdres = rstudent(lm(dependent~independent))
        m = m+1
        for(k in 1:length(dataset$stdres)){
          dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
dataset$stdres[k] <= -3) {m} else{0}
        }
      j = length(which(dataset$outlier_counter >= 1))
    }
    return(dataset)
}

The problem that I run into is that I receive this error when I type

rm.outliers(data1,data1$y,data1$x)

"    x  y predicted   stdres outlier_counter
16 16 85  22.98647 24.04862               1
Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714, :
  replacement has 20 rows, data has 19"

Note: the outlier_counter variable is used to state which "round" of the
loop the datapoint was marked as an outlier.

This would be a HUGE help to me and a few buddies who run a lot of different
regression tests.

Thanks, and if the question is still confusing please ask

 

-----
- AK
-- 
View this message in context:
http://r.789695.n4.nabble.com/Removing-Outliers-Function-tp3293395p3293395.html
Sent from the R help mailing list archive at Nabble.com.

David Winsemius

2011-Feb-09 03:05 UTC

[R] Removing Outliers Function

On Feb 8, 2011, at 9:11 PM, kirtau wrote:
>
> I am working on a function that will remove outliers for regression
> analysis. I am treating a data point as an outlier if its studentized
> residual is above 3 or below -3. The code below is what I have so far
> for the function:
>
> x = c(1:20)
> y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
> data1 = data.frame(x,y)
>
>
> rm.outliers = function(dataset,dependent,independent){
>    dataset$predicted = predict(lm(dependent~independent))
>    dataset$stdres = rstudent(lm(dependent~independent))
>    m = 1
>    for(i in 1:length(dataset$stdres)){
>      dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
> dataset$stdres[i] <= -3) {m} else{0}
>    }
>    j = length(which(dataset$outlier_counter >= 1))
>    while(j>=1){
>      print(dataset[which(dataset$outlier_counter >= 1),])
>      dataset = dataset[which(dataset$outlier_counter == 0),]
>      dataset$predicted = predict(lm(dependent~independent))
>      dataset$stdres = rstudent(lm(dependent~independent))
>        m = m+1
>        for(k in 1:length(dataset$stdres)){
>          dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
> dataset$stdres[k] <= -3) {m} else{0}
>        }
>      j = length(which(dataset$outlier_counter >= 1))
>    }
>    return(dataset)
> }
>
> The problem that I run into is that I receive this error when I type
>
> rm.outliers(data1,data1$y,data1$x)
>
> "    x  y predicted   stdres outlier_counter
> 16 16 85  22.98647 24.04862               1
> Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714, :
>  replacement has 20 rows, data has 19"
>
> Note: the outlier_counter variable is used to state which "round" of the
> loop the datapoint was marked as an outlier.
>
> This would be a HUGE help to me and a few buddies who run a lot of
> different regression tests.

The solution is about 3 or 4 lines of code to make the function, but  
removing outliers like this is simply statistical malpractice. Maybe  
it's a good thing that R has a shallow learning curve.
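
For reference, here is a minimal sketch of what a short fix could look like
(not necessarily the one meant above). The error arises because dependent and
independent are the original 20-element vectors, so lm() keeps fitting all 20
points even after a row has been dropped from dataset. Passing a formula and
fitting with data = dataset avoids that. The name rm_outliers, the formula and
cutoff arguments, and the assumption that the data contain no missing values
are illustrative choices, and the statistical objection to automatic deletion
still applies.

```r
# Sketch only: refit on the rows that remain in each round.
# Assumes `dataset` has no missing values in the model variables.
rm_outliers <- function(dataset, formula, cutoff = 3) {
  round <- 0
  repeat {
    fit <- lm(formula, data = dataset)           # uses the current rows only
    res <- rstudent(fit)                         # externally studentized residuals
    flagged <- !is.na(res) & abs(res) >= cutoff  # treat undefined residuals as not flagged
    if (!any(flagged)) break
    round <- round + 1
    cat("Round", round, "flagged:\n")
    print(dataset[flagged, , drop = FALSE])
    dataset <- dataset[!flagged, , drop = FALSE]
  }
  dataset
}

# Data from the original post
x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

cleaned <- rm_outliers(data1, y ~ x)
```

Called this way, the function prints the rows flagged in each round and
returns the rows that are left; row 16 is flagged in the first round, as in
the output quoted above.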

-- 

David Winsemius, MD
West Hartford, CT

kirtau

2011-Feb-09 18:06 UTC

[R] Removing Outliers Function

I have two questions,

1) If the solution is only three or four lines of code, is there any way you
can share those lines without disrespecting me further?

2) Can you explain why you feel that this is "statistical malpractice"?

-----
- AK

kirtau

2011-Feb-09 18:25 UTC

[R] Removing Outliers Function

I have two questions, 

1) If the solution is only three or four lines of code, is there any way you
can share those lines instead of stating that the solution is easy and
providing no code? I prefer not to use an R package but to have a "raw
function".

2) Can you explain why you feel that this is "statistical malpractice"?

-----
- AK

David Winsemius

2011-Feb-09 19:18 UTC

[R] Removing Outliers Function

On Feb 9, 2011, at 1:25 PM, kirtau wrote:
>
> I have two questions,
>
> 1) If the solution is only three or four lines of code, is there any way
> you can share those lines instead of stating that the solution is easy and
> providing no code? I prefer not to use an R package but to have a "raw
> function".
>
> 2) Can you explain why you feel that this is "statistical malpractice"?

You are proposing to systematically distort your data (apparently  
without even examining it)  before conducting an inferential process.  
The old FLA GIGO is operative here. The data arose from some process  
in nature and the outliers are just as important as the inliers. If  
you want methods that are robust to "outliers" you should look at the
Robust Statistics Task View:
http://cran.r-project.org/web/views/Robust.html
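
As one illustration of that route, the sketch below compares ordinary least
squares with a Huber-type M-estimator on the data from the original post. It
assumes the MASS package (distributed with R as a recommended package) is
installed; rlm() is only one of the options covered by the task view, and the
object names are made up for the example.

```r
library(MASS)  # assumption: the recommended package MASS is available

# Data from the original post
x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

ols_fit    <- lm(y ~ x, data = data1)   # ordinary least squares
robust_fit <- rlm(y ~ x, data = data1)  # M-estimation, Huber psi by default

# Compare the fitted coefficients
print(coef(ols_fit))
print(coef(robust_fit))

# Final IWLS weights: points far from the robust fit are downweighted,
# not deleted
print(round(robust_fit$w, 2))
```

No observation is discarded here. The robust fit keeps all 20 points and
gives the extreme ones less influence, so the weights show which points the
method treats with suspicion.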

-- 
David Winsemius, MD
West Hartford, CT

Carl Witthoft

2011-Feb-09 22:31 UTC

[R] Removing Outliers Function

To answer part 2: You should read up on statistical distributions and
when a sample size is (or isn't) large enough to produce reliable
statistical parameters such as mean or variance. I suspect David was
implying that your yardstick, based on the studentized residual, removes
valid samples.

I once wrote a simple bit of code (back when I had to do things in C
rather than R :-( ) that removed data points that were more than
N*sigma off the current fitted data set, where N was 3 or 4. Even that
is sloppy, as it doesn't take the sample size or other fit parameters
into account, but it's a lot easier than your setup.
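
A rough R sketch of that kind of N*sigma clipping, for comparison. It assumes
that sigma means the residual standard error of the current fit and that the
model is refitted after each pass; the original C code is not shown in this
thread, so the function name sigma_clip and these details are guesses.

```r
# Sketch only: drop points whose raw residual exceeds n_sigma * sigma,
# refit, and repeat until nothing more is dropped.
# Assumes `dataset` has no missing values in the model variables.
sigma_clip <- function(dataset, formula, n_sigma = 3) {
  repeat {
    fit  <- lm(formula, data = dataset)
    s    <- summary(fit)$sigma            # residual standard error of this fit
    keep <- abs(residuals(fit)) <= n_sigma * s
    keep[is.na(keep)] <- TRUE             # undefined comparison: keep the point
    if (all(keep)) break
    dataset <- dataset[keep, , drop = FALSE]
  }
  dataset
}

# Data from the original post
x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

clipped <- sigma_clip(data1, y ~ x, n_sigma = 3)
```

The result depends on the choice of N, which is part of why this kind of rule
is described above as sloppy.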


Carl


<quote>
From: kirtau <kirtau_at_live.com>
Date: Wed, 09 Feb 2011 10:06:07 -0800 (PST)

I have two questions,

    1. If the solution is only three or four lines of code, is there
any way you can share those lines without disrespecting me further?
    2. Can you explain why you feel that this is "statistical malpractice"?
</quote>
