Eleni Rapsomaniki
2006-Jul-24 17:59 UTC
[R] RandomForest vs. bayes & svm classification performance
Hi

This is a question regarding classification performance using different methods. So far I've tried NaiveBayes (klaR package), svm (e1071 package) and randomForest (randomForest package). What has puzzled me is that randomForest seems to perform far better (32% classification error) than svm and NaiveBayes, which have similar classification errors (45% and 48% respectively). A similar difference in performance is observed with different combinations of parameters, priors and size of training data.

Because I was expecting to see little difference in the performance of these methods, I am worried that I may have made a mistake in my randomForest call:

    my.rf = randomForest(x=train.df[,-response_index], y=train.df[,response_index],
                         xtest=test.df[,-response_index], ytest=test.df[,response_index],
                         importance=TRUE, proximity=FALSE, keep.forest=FALSE)

(where train.df and test.df are my training and test data.frames and response_index is the column number specifying the class)

My main question is: could there be a legitimate reason why random forest would outperform the other two models (e.g. maybe one method is more reliable with Gaussian data, handles categorical data better, etc.)? Also, is there a way of evaluating the predictive ability of each parameter in the Bayesian model, as can be done for random forests (through the importance table)?

I would appreciate any of your comments and suggestions on these.

Many thanks
Eleni Rapsomaniki
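One quick way to sanity-check a call like this is to confirm that the response column is a factor (randomForest silently switches to regression otherwise) and to score all three classifiers on the identical train/test split. The sketch below assumes train.df, test.df and response_index as defined above; the object names (rf.fit, nb.fit, svm.fit) and the drop-one loop at the end, a crude stand-in for an importance table for the Bayes model, are illustrative assumptions rather than any package's documented workflow.

    library(randomForest)
    library(klaR)
    library(e1071)

    y.train <- train.df[, response_index]
    y.test  <- test.df[, response_index]
    stopifnot(is.factor(y.train))  # if not a factor, randomForest does regression

    ## Fit all three methods on the identical split so the errors are comparable
    rf.fit  <- randomForest(x = train.df[, -response_index], y = y.train,
                            xtest = test.df[, -response_index], ytest = y.test,
                            importance = TRUE)
    nb.fit  <- NaiveBayes(x = train.df[, -response_index], grouping = y.train)
    svm.fit <- svm(x = train.df[, -response_index], y = y.train)

    ## Test-set error for each method
    rf.err  <- mean(rf.fit$test$predicted != y.test)
    nb.err  <- mean(predict(nb.fit, test.df[, -response_index])$class != y.test)
    svm.err <- mean(predict(svm.fit, test.df[, -response_index]) != y.test)
    c(rf = rf.err, nb = nb.err, svm = svm.err)

    ## Variable importance from the forest
    importance(rf.fit)

    ## A crude Bayes analogue of the importance table: refit dropping one
    ## predictor at a time and see how much the test error worsens
    ## (a bigger increase suggests a more useful variable)
    preds <- setdiff(seq_along(train.df), response_index)
    drop1.err <- sapply(preds, function(j) {
      cols <- setdiff(preds, j)
      fit  <- NaiveBayes(x = train.df[, cols, drop = FALSE], grouping = y.train)
      mean(predict(fit, test.df[, cols, drop = FALSE])$class != y.test)
    })
    names(drop1.err) <- names(train.df)[preds]
    sort(drop1.err - nb.err, decreasing = TRUE)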
roger bos
2006-Jul-24 20:14 UTC
[R] RandomForest vs. bayes & svm classification performance
I can't add much to your question, being a complete novice at classification, but I have tried both randomForest and SVM and I get better results from randomForest than SVM (even after tuning). randomForest is also much, much faster. I just thought randomForest was a much better algorithm, although I was wondering in the back of my head if I had made a mistake. I am not sure that giving the call allows anyone to say whether a mistake is being made, as there are many places in the code where something could go wrong. I hear SVM is used for very complicated things like facial recognition, so I wonder why it can't do better on my data set, but I have a limited amount of time for testing. It was interesting to hear your results.

Thanks,
Roger
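Since tuning came up: below is a minimal sketch of the cross-validated grid search that e1071 provides via tune.svm. The grid values are illustrative assumptions rather than recommendations, and train.df, test.df and response_index follow Eleni's post. Note also that svm() scales its inputs by default (scale = TRUE), which matters a great deal for SVM performance.

    library(e1071)

    ## Cross-validated grid search over the RBF kernel's two main knobs
    tuned <- tune.svm(x = train.df[, -response_index],
                      y = train.df[, response_index],
                      gamma = 2^(-4:2), cost = 2^(0:6))
    tuned$best.parameters   # the chosen gamma and cost

    ## Test-set error of the best model found
    mean(predict(tuned$best.model,
                 test.df[, -response_index]) != test.df[, response_index])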
Jameson C. Burt
2006-Jul-27 19:22 UTC
[R] RandomForest vs. bayes & svm classification performance
Remiss of me, I haven't tried these R tools. However, I have tried a dozen Naive Bayes-like programs, often used to filter email, where the serious problem with spam has resulted in many innovations. The most touted of the worldwide Naive Bayes programs seems to be CRM114 (not in R, I expect, since its programming is peculiar), whose 275 pages of documentation are at
    http://crm114.sourceforge.net/CRM114_Revealed_20051207.pdf
However, unless you have several weeks and some flexible programming skills, don't consider it. It took me about 3 months to find that CRM114 worked best, then another month to break through its documentation to control the program from a single Perl program with no external parameter files. CRM114 can form groups of 5 words, taking all combinations of 5 consecutive words in documents. Using 5 words produced better results than any other filters I used, e.g. for filtering/altering car manufacturers' standard form prompts like "Fire? Yes_ No_".

Initially, I expected correct results of 99% or better, like my use of Naive Bayes to filter my email. However, spam email must accomplish some goal (get you to their webpage, or show you their low cost), so Naive Bayes approaches work very well on email. The U.S. Department of Transportation (DOT), defects investigation, contracted with me to try what I'd successfully used for email (others' programs). They were accumulating 50,000 early-warning reports a quarter, yet their engineers had read only 3,000. DOT contracted for a dozen people to slog through the accumulated 300,000 reports, identifying those that might portend the necessity of a recall. But these contractors (probably costing $1 million a year) agreed with the engineers no more than 50% of the time. After 2 months, I was able to correctly identify only 30% of reports. Then I read that Naive Bayes was, after all, "naive": it presumes independence between words. There's an old statistical saying: "Do you prefer to perfectly solve the wrong problem, or wrongly solve the correct problem?"

People using Naive Bayes use many heuristics, as the CRM114 documents mention, including:

a. TOE, "Train On Error", for which you retrain on any document that Naive Bayes classifies incorrectly. Statistically, this is somewhat like having a learned population with more than one copy of the same document.
b. SSTTT, "Single Sided Thick Threshold Training", for which you retrain on a document when it isn't classified correctly with a sufficiently high probability.
c. TUNE, "Train Until No Error", for which you recycle through your known records until you reach perfection, although I was often forced to stop when no improvement resulted after 12 cycles.

All these techniques (sketched in R below) improved correct identification and concentration (the proportion of "flagged" reports that are correctly flagged) to about 67%. Then the engineers (gearheads) did the inexplicable -- they read about 20,000 reports, jumping the correctness of the CRM114 Naive Bayes approach with the above heuristics to about 88%. Suddenly, CRM114's "flagged" reports were fun to read. For example, a report no-one had yet identified described a fellow's car modified with airbags that lifted it to a great height using canisters of air in the back of his pickup. Driving down the road, he noticed a warning light flashing on his air supply. Soon afterwards, the passenger seat caught fire. Even though his pickup was moving down the road, the flashing warning light and flaming passenger seat prompted him to open his driver's door and leap from his moving pickup.
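In R, the TOE/TUNE idea can be sketched roughly as follows, using klaR's NaiveBayes in place of CRM114: refit after duplicating the misclassified training rows (which re-weights the priors and the class-conditional estimates), and stop at perfection or after 12 cycles. This is a sketch under stated assumptions, not CRM114's actual algorithm: train.df, test.df and response_index follow Eleni's post, and the "flagged" class level and the function name toe.nb are made up for illustration.

    library(klaR)

    toe.nb <- function(train.df, response_index, max.cycles = 12) {
      work <- train.df
      fit  <- NULL
      for (cycle in seq_len(max.cycles)) {
        fit   <- NaiveBayes(x = work[, -response_index],
                            grouping = work[, response_index])
        pred  <- predict(fit, train.df[, -response_index])$class
        wrong <- which(pred != train.df[, response_index])
        cat("cycle", cycle, "- training errors:", length(wrong), "\n")
        if (length(wrong) == 0) break            # TUNE: stop at perfection
        ## TOE: "retrain" the errors by duplicating those rows; SSTTT would
        ## also duplicate rows classified correctly but with low posterior
        work <- rbind(work, train.df[wrong, ])
      }
      fit
    }

    fit   <- toe.nb(train.df, response_index)
    pred  <- predict(fit, test.df[, -response_index])$class
    truth <- test.df[, response_index]
    mean(pred != truth)                          # classification error
    mean(truth[pred == "flagged"] == "flagged")  # "concentration" (precision)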
While I worked the Bayesian approach and contractors read reports as two approaches to slog through 300,000 reports, big software/contractor companies hovered over the spending and potential spending. But their approaches were all judged foolish -- expensively foolish.

So, if you really have a problem worth solving well, some time, and some programming skills, you can integrate a Naive Bayes procedure with some heuristic procedures, probably with good correct identification and a high concentration of correctly "flagged" documents among the Bayes-flagged documents.

--
Jameson C. Burt, NJ9L
Fairfax, Virginia, USA
jameson at coost.com
http://www.coost.com
(202) 690-0380 (work)
LTSP.org: magic "mysterious and awe-inspiring even though we know they are real and not supernatural"