thr3ads.net - R help - [R] outliers using Random Forest [Apr 2004]

Edgar Acuna

2004-Apr-18 14:55 UTC

[R] outliers using Random Forest

Hello,
Does anybody know whether the outscale option of randomForest yields the
standardized version of the outlier measure for each case, or are the
results only the raw values? Also, I have noticed that this measure shows
very high variability: if I repeat the experiment I get very different
values for it, and it is hard to flag the outliers. This does not happen
with the two other criteria that I am using, LOF and Bay's Orca; with both
of those approaches I get several cases that can be considered outliers.
I ran my experiments using the Bupa and Diabetes data sets available at
the UCI Machine Learning Repository.

Thanks in advance for any response.

Liaw, Andy

2004-Apr-18 20:24 UTC

[R] outliers using Random Forest

The thing to do is probably:

1. Use a fairly large number of trees (e.g., 1000).
2. Run the algorithm a few times and average the results.

The reason for the instability is twofold:

1. The random forest algorithm itself is based on randomization.  That's why
it's probably a good idea to have 500-1000 trees to get more stable
proximity measures (on which the outlying measures are based).

2. If you are running randomForest in unsupervised mode (i.e., not giving it
the class labels), then the program treats the data as "class 1", creates a
synthetic "class 2", and runs the classification algorithm to get the
proximity measures.  You probably need to run the algorithm a few times so
that the result is based on several simulated data sets, instead of just
one.

HTH,
Andy
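
[Editorial sketch, not part of the original message.] The two suggestions
above (many trees, several runs averaged) could look roughly like the
following. This assumes a recent version of the randomForest package, in
which the outlying measure is obtained by calling outlier() on a forest
fitted with proximity = TRUE rather than through the outscale argument
mentioned in the question. The iris data are only a stand-in for the Bupa and
Diabetes data, and n_runs, n_trees and the seed are arbitrary illustrative
choices.

```r
library(randomForest)

# Placeholder data: the numeric columns of the built-in iris data set stand
# in for the Bupa or Diabetes data mentioned in the thread.
x <- iris[, 1:4]

n_runs  <- 5     # hypothetical number of repetitions
n_trees <- 1000  # large forest, as suggested above

set.seed(42)
scores <- sapply(seq_len(n_runs), function(i) {
  # No response is supplied, so randomForest runs in unsupervised mode and
  # generates a fresh synthetic "class 2" on every call.
  rf <- randomForest(x, ntree = n_trees, proximity = TRUE)
  outlier(rf)
})

# One column per run; averaging across runs damps the run-to-run variation.
avg_score <- rowMeans(scores)
head(sort(avg_score, decreasing = TRUE))
```

Cases whose averaged score stays large across runs are the more credible
outlier candidates; a cutoff still has to be chosen by inspection.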
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html

------------------------------------------------------------------------------
Notice:  This e-mail message, together with any attachments,...{{dropped}}

Liaw, Andy

2004-Apr-19 12:30 UTC

[R] outliers using Random Forest

> From: Edgar Acuna [mailto:edgar at cs.uprm.edu] 
> 
> Dear Andy,
> Thanks for your quick answer. I increased the number of trees and the
> outlyingness measure became more stable. But I still do not know whether I
> am working with the raw measure or with the normalized measure mentioned
> in Breiman's Wald lecture. The normalized measure nout is
>
> nout=(nout-med)/mean(abs(nout-med))
> where med is the median of the raw measure over the class containing the
> case corresponding to nout.

Looking at the Fortran subroutine `locateout' in rfsub.f, yes, they are
normalized.  (That part of the code is not changed from Breiman & Cutler's
original.)

Andy
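
[Editorial sketch, not part of the original message.] For readers who want to
see what the quoted normalization does, here is a minimal base-R version of
the formula exactly as written in the question: the class median is
subtracted, and the result is divided by the mean absolute deviation from
that median. The function name normalize_outlier and the example scores are
made up for illustration; this is not the package's or the Fortran code's
implementation.

```r
# Normalize raw outlier scores within each class, following the formula
# quoted in the thread: (raw - med) / mean(abs(raw - med)).
normalize_outlier <- function(raw, cls) {
  out <- numeric(length(raw))
  for (k in unique(cls)) {
    idx <- cls == k
    med <- median(raw[idx])            # class median of the raw measure
    dev <- mean(abs(raw[idx] - med))   # mean absolute deviation from it
    # If every score in a class is identical, dev is 0 and the result is NaN.
    out[idx] <- (raw[idx] - med) / dev
  }
  out
}

raw <- c(1, 2, 3, 10, 4, 5, 6, 30)   # made-up raw scores
cls <- rep(c("a", "b"), each = 4)    # two hypothetical classes
round(normalize_outlier(raw, cls), 2)
# -0.60 -0.20  0.20  3.00 -0.22 -0.07  0.07  3.63
```

Because the centering and scaling are done per class, the values 10 and 30
end up with comparable normalized scores even though their raw values differ
by a factor of three.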

