Eleni Rapsomaniki
2006-Jul-24 17:59 UTC
[R] RandomForest vs. bayes & svm classification performance
Hi

This is a question regarding classification performance using different methods. So far I've tried NaiveBayes (klaR package), svm (e1071 package) and randomForest (randomForest package). What has puzzled me is that randomForest seems to perform far better (32% classification error) than svm and NaiveBayes, which have similar classification errors (45% and 48% respectively). A similar difference in performance is observed with different combinations of parameters, priors and size of training data.

Because I was expecting to see little difference in the performance of these methods, I am worried that I may have made a mistake in my randomForest call:

    my.rf = randomForest(x=train.df[,-response_index], y=train.df[,response_index],
                         xtest=test.df[,-response_index], ytest=test.df[,response_index],
                         importance=TRUE, proximity=FALSE, keep.forest=FALSE)

(where train.df and test.df are my training and test data.frames and response_index is the column number specifying the class)

My main question is: could there be a legitimate reason why random forest would outperform the other two models (e.g. maybe one method is more reliable with Gaussian data, handles categorical data better, etc.)? Also, is there a way of evaluating the predictive ability of each parameter in the Bayesian model, as can be done for random forests (through the importance table)?

I would appreciate any of your comments and suggestions on these.

Many thanks
Eleni Rapsomaniki
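One quick way to sanity-check a call like this is to confirm that the response column is a factor (randomForest silently switches to regression otherwise) and to score all three classifiers on the identical train/test split. The sketch below assumes train.df, test.df and response_index as defined above; the object names (rf.fit, nb.fit, svm.fit) and the drop-one loop at the end, a crude stand-in for an importance table for the Bayes model, are illustrative assumptions rather than any package's documented workflow.

    library(randomForest)
    library(klaR)
    library(e1071)

    y.train <- train.df[, response_index]
    y.test  <- test.df[, response_index]
    stopifnot(is.factor(y.train))  # if not a factor, randomForest does regression

    ## Fit all three methods on the identical split so the errors are comparable
    rf.fit  <- randomForest(x = train.df[, -response_index], y = y.train,
                            xtest = test.df[, -response_index], ytest = y.test,
                            importance = TRUE)
    nb.fit  <- NaiveBayes(x = train.df[, -response_index], grouping = y.train)
    svm.fit <- svm(x = train.df[, -response_index], y = y.train)

    ## Test-set error for each method
    rf.err  <- mean(rf.fit$test$predicted != y.test)
    nb.err  <- mean(predict(nb.fit, test.df[, -response_index])$class != y.test)
    svm.err <- mean(predict(svm.fit, test.df[, -response_index]) != y.test)
    c(rf = rf.err, nb = nb.err, svm = svm.err)

    ## Variable importance from the forest
    importance(rf.fit)

    ## A crude Bayes analogue of the importance table: refit dropping one
    ## predictor at a time and see how much the test error worsens
    ## (a bigger increase suggests a more useful variable)
    preds <- setdiff(seq_along(train.df), response_index)
    drop1.err <- sapply(preds, function(j) {
      cols <- setdiff(preds, j)
      fit  <- NaiveBayes(x = train.df[, cols, drop = FALSE], grouping = y.train)
      mean(predict(fit, test.df[, cols, drop = FALSE])$class != y.test)
    })
    names(drop1.err) <- names(train.df)[preds]
    sort(drop1.err - nb.err, decreasing = TRUE)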
roger bos
2006-Jul-24 20:14 UTC
[R] RandomForest vs. bayes & svm classification performance
I can't add much to your question, being a complete novice at classification, but I have tried both randomForest and SVM and I get better results from randomForest than SVM (even after tuning). randomForest is also much, much faster. I just thought randomForest was a much better algorithm, although I was wondering in the back of my head if I had made a mistake. I am not sure that giving the call allows anyone to say whether a mistake is being made, as there are many places in the code where something could go wrong. I hear SVM is used for very complicated things like facial recognition, so I wonder why it can't do better on my data set, but I have a limited amount of time for testing. It was interesting to hear your results.

Thanks,
Roger
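Since tuning came up: below is a minimal sketch of the cross-validated grid search that e1071 provides via tune.svm. The grid values are illustrative assumptions rather than recommendations, and train.df, test.df and response_index follow Eleni's post. Note also that svm() scales its inputs by default (scale = TRUE), which matters a great deal for SVM performance.

    library(e1071)

    ## Cross-validated grid search over the RBF kernel's two main knobs
    tuned <- tune.svm(x = train.df[, -response_index],
                      y = train.df[, response_index],
                      gamma = 2^(-4:2), cost = 2^(0:6))
    tuned$best.parameters   # the chosen gamma and cost

    ## Test-set error of the best model found
    mean(predict(tuned$best.model,
                 test.df[, -response_index]) != test.df[, response_index])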
Jameson C. Burt
2006-Jul-27 19:22 UTC
[R] RandomForest vs. bayes & svm classification performance
Remiss of me, I haven't tried these R tools. However, I have tried a dozen Naive Bayes-like programs, often used to filter email, where the serious problem with spam has resulted in many innovations. The most touted of the worldwide Naive Bayes programs seems to be CRM114 (not in R, I expect, since its programming is peculiar), whose 275 pages of documentation are at
    http://crm114.sourceforge.net/CRM114_Revealed_20051207.pdf
However, unless you have several weeks and some flexible programming skills, don't consider it. It took me about 3 months to find that CRM114 worked best, then another month to break through its documentation to control the program from a single Perl program with no external parameter files. CRM114 can form groups of 5 words, taking all combinations of 5 consecutive words in documents. Using 5 words produced better results than any other filters I used, e.g. for filtering/altering car manufacturers' standard form prompts like "Fire? Yes_ No_".

Initially, I expected correct results of 99% or better, like my use of Naive Bayes to filter my email. However, spam email must accomplish some goal (get you to their webpage, or show you their low cost), so Naive Bayes approaches work very well on email. The U.S. Department of Transportation (DOT), defects investigation, contracted with me to try what I'd successfully used for email (others' programs). They were accumulating 50,000 early-warning reports a quarter, yet their engineers had read only 3,000. DOT contracted for a dozen people to slog through the accumulated 300,000 reports, identifying those that might portend the necessity of a recall. But these contractors (probably costing $1 million a year) agreed with the engineers no more than 50% of the time. After 2 months, I was able to correctly identify only 30% of reports. Then I read that Naive Bayes was, after all, "naive": it presumes independence between words. There's an old statistical saying: "Do you prefer to perfectly solve the wrong problem, or wrongly solve the correct problem?"

People using Naive Bayes use many heuristics, as the CRM114 documents mention, including:

a. TOE, "Train On Error", for which you retrain on any document that Naive Bayes classifies incorrectly. Statistically, this is somewhat like having a learned population with more than one copy of the same document.
b. SSTTT, "Single Sided Thick Threshold Training", for which you retrain on a document when it isn't classified correctly with a sufficiently high probability.
c. TUNE, "Train Until No Error", for which you recycle through your known records until you reach perfection, although I was often forced to stop when no improvement resulted after 12 cycles.

All these techniques (sketched in R below) improved correct identification and concentration (the proportion of "flagged" reports that are correctly flagged) to about 67%. Then the engineers (gearheads) did the inexplicable -- they read about 20,000 reports, jumping the correctness of the CRM114 Naive Bayes approach with the above heuristics to about 88%. Suddenly, CRM114's "flagged" reports were fun to read. For example, a report no-one had yet identified described a fellow's car modified with airbags that lifted it to a great height using canisters of air in the back of his pickup. Driving down the road, he noticed a warning light flashing on his air supply. Soon afterwards, the passenger seat caught fire. Even though his pickup was moving down the road, the flashing warning light and flaming passenger seat prompted him to open his driver's door and leap from his moving pickup.
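In R, the TOE/TUNE idea can be sketched roughly as follows, using klaR's NaiveBayes in place of CRM114: refit after duplicating the misclassified training rows (which re-weights the priors and the class-conditional estimates), and stop at perfection or after 12 cycles. This is a sketch under stated assumptions, not CRM114's actual algorithm: train.df, test.df and response_index follow Eleni's post, and the "flagged" class level and the function name toe.nb are made up for illustration.

    library(klaR)

    toe.nb <- function(train.df, response_index, max.cycles = 12) {
      work <- train.df
      fit  <- NULL
      for (cycle in seq_len(max.cycles)) {
        fit   <- NaiveBayes(x = work[, -response_index],
                            grouping = work[, response_index])
        pred  <- predict(fit, train.df[, -response_index])$class
        wrong <- which(pred != train.df[, response_index])
        cat("cycle", cycle, "- training errors:", length(wrong), "\n")
        if (length(wrong) == 0) break            # TUNE: stop at perfection
        ## TOE: "retrain" the errors by duplicating those rows; SSTTT would
        ## also duplicate rows classified correctly but with low posterior
        work <- rbind(work, train.df[wrong, ])
      }
      fit
    }

    fit   <- toe.nb(train.df, response_index)
    pred  <- predict(fit, test.df[, -response_index])$class
    truth <- test.df[, response_index]
    mean(pred != truth)                          # classification error
    mean(truth[pred == "flagged"] == "flagged")  # "concentration" (precision)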
While I worked the Bayesian approach and contractors read reports as two approaches to slog through 300,000 reports, big software/contractor companies hovered over the spending and potential spending. But their approaches were all judged foolish -- expensively foolish.

So, if you really have a problem worth solving well, some time, and some programming skills, you can integrate a Naive Bayes procedure with some heuristic procedures, probably with good correct identification and a high concentration of correctly "flagged" documents among the Bayes-flagged documents.

--
Jameson C. Burt, NJ9L
Fairfax, Virginia, USA
jameson at coost.com
http://www.coost.com
(202) 690-0380 (work)
LTSP.org: magic "mysterious and awe-inspiring even though we know they are real and not supernatural"