Henderson, Robin Michelle
2013-Nov-06 18:07 UTC
[R] R help-classification accuracy of DFA and RF using caret
Hi, I am a graduate student applying published R scripts to compare the classification accuracy of 2 predictive models, one built using discriminant function analysis and one using random forests (webpage link for these scripts is provided below). The purpose of these models is to predict the biotic integrity of streams. Specifically, I am trying to compare the classification accuracy (i.e., prediction of group membership) of both the DFA and RF models using k-fold cross-validation for the following metrics: AUC ROC, percent correctly classified, specificity, sensitivity, and Kappa. I would also like to obtain the F statistic, Wilks' lambda, MSE or RMSE for the random forest models, as the script does not contain code to get this data. I think I need to use the caret package to obtain the classification accuracy, but I keep getting error messages when I apply the train function to my data. As I am relatively new to R and my thesis committee is unable to help as they are also unfamiliar with R, I thought it best to ask for help. Would someone be willing to help me?
Thanks,
Robin

http://www.epa.gov/wed/pages/models/rivpacs/rivpacs.htm

> TrainDataDFAgrps2 <- predcal
> TrainClassesDFAgrps2 <- grp.2;
> DFAgrps2Fit1 <- train(TrainDataDFAgrps2, TrainClassesDFAgrps2,
+     method = "lda",
+     tuneLength = 10,
+     trControl = trainControl(method = "cv"));
Error in train.default(TrainDataDFAgrps2, TrainClassesDFAgrps2, method = "lda", :
  wrong model type for regression
> RFgrps2Fit1 <- train(TrainDataRFgrps2, TrainClassesRFgrps2,
+     method = "rf",
+     tuneLength = 10,
+     trControl = trainControl(method = "cv"));
There were 50 or more warnings (use warnings() to see the first 50)

Clip of predcal (same length as grp.2, but too much data to display all):
> predcal
     Reference_Test HUC12_AREA_HA_log10 ELEV_m M_Slp_sqt Precip_mm Temp_CX10
2370              R                 3.7  588.0       2.2      1751       148
559               R                 4.0  643.1       1.8      1674       141
2062              R                 4.0  643.1       1.8      1674       141
2467              R                 4.0  643.1       1.8      1674       141
1176              R                 3.9  694.3       2.4      1534       131
1840              R                 3.9  694.3       2.4      1534       131
2052              R                 3.9  694.3       2.4      1534       131
1174              R                 4.1  605.0       2.1      1382       138
1841              R                 4.1  605.0       2.1      1382       138
2051              R                 4.1  605.0       2.1      1382       138
1831              R                 4.1  363.9       1.7       937       156

Grps.2:
grp.2
 [1] 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1
[45] 2 2 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1
[89] 1 2 2 2 2 2 1 1 2 2 2 1 2 1 2 2 1 2 1 1 2
David Winsemius
2013-Nov-06 21:57 UTC
[R] R help-classification accuracy of DFA and RF using caret
On Nov 6, 2013, at 10:07 AM, Henderson, Robin Michelle wrote:

> Hi,
>
> I am a graduate student applying published R scripts to compare the classification accuracy of 2 predictive models, one built using discriminant function analysis and one using random forests (webpage link for these scripts is provided below). The purpose of these models is to predict the biotic integrity of streams. Specifically, I am trying to compare the classification accuracy (i.e., prediction of group membership) of both the DFA and RF models using k-fold crossvalidation for the following metrics: AUC ROC, percent correctly classified, specificity, sensitivity, and Kappa.

Sensitivity, "accuracy" (= percent correct), and specificity are only defined once you establish a particular threshold for decision. There is no single "sensitivity" or "specificity" that will accrue to a classification model as a whole. AUC is an effort at presenting such an overall value, but it has deficiencies and is insensitive to statistically significant differences in models.

> I would also like to obtain the F statistic, Wilks lambda, MSE or RMSE for the random forest models as the script does not contain code to get this data.

I doubt very much that is by accident or oversight on the part of the randomForest developers.

> I think I need to use the caret package to obtain the classification accuracy, but I keep getting error messages when I apply the train function to my data. As I am relatively new to R and my thesis committee is unable to help as they are also unfamiliar with R, I thought it best to ask for help.

I think you need to add a statistician to your committee. The difficulties you are facing (of which you appear to be unaware) are not just related to being new to R.

> Would someone be willing to help me?
>
> Thanks,
> Robin
>
> http://www.epa.gov/wed/pages/models/rivpacs/rivpacs.htm
>
>> TrainDataDFAgrps2 <-predcal
>> TrainClassesDFAgrps2 <-grp.2;
>> DFAgrps2Fit1 <- train(TrainDataDFAgrps2, TrainClassesDFAgrps2,
> +     method = "lda",
> +     tuneLength = 10,
> +     trControl = trainControl(method = "cv"));
> Error in train.default(TrainDataDFAgrps2, TrainClassesDFAgrps2, method = "lda", :
>   wrong model type for regression

That error is pointing out that you are choosing a method that expects a particular form of outcome (categorical, i.e. an R factor) and does not accept a numeric outcome. Your grp.2 prints as plain 1s and 2s with no factor levels, so train() (the `train.default` in the message is from the `caret` package) treats the problem as regression, which "lda" cannot do. I think this is further evidence of the need for competent statistical consultation. You would be advised to study further in Venables and Ripley's MASS (v4) or in Hastie, Tibshirani, and Friedman's ESL (v2).

This link, found with a simple Google search, suggests that the author of the cited code is at an academic institution only one state away from you: fw.oregonstate.edu/system/files/Van%20Sickle%20CV%20consult.pdf?. He may be willing to offer assistance.
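To make the point about the outcome type concrete, here is a minimal sketch, not a reworking of the RIVPACS scripts. It assumes the caret, MASS and randomForest packages are installed. The data are synthetic stand-ins for predcal and grp.2, and the names `predictors`, `grp_numeric`, `grp_factor` and `ctrl` are made up for illustration. The warnings from the "rf" call in the original post plausibly have the same cause (a numeric outcome being fitted as a regression), but that is a guess since the warnings themselves were not shown.

```r
# Sketch under stated assumptions: synthetic data, hypothetical object names.
library(caret)  # train(), trainControl(); "lda" needs MASS, "rf" needs randomForest

set.seed(2013)
n <- 120
predictors <- data.frame(
  ELEV_m    = rnorm(n, mean = 600,  sd = 100),
  Precip_mm = rnorm(n, mean = 1500, sd = 300),
  Temp_CX10 = rnorm(n, mean = 140,  sd = 10)
)
# Group codes stored as plain numbers, as grp.2 appears to be in the post
grp_numeric <- ifelse(predictors$ELEV_m + rnorm(n, sd = 80) > 600, 2, 1)

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Numeric outcome: train() assumes regression, which "lda" does not support.
# The error message is captured instead of stopping the script.
err_msg <- tryCatch(
  {
    train(predictors, grp_numeric, method = "lda", trControl = ctrl)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# Factor outcome: train() now treats the problem as classification.
grp_factor <- factor(grp_numeric, levels = c(1, 2), labels = c("grp1", "grp2"))

lda_fit <- train(predictors, grp_factor, method = "lda", trControl = ctrl)
rf_fit  <- train(predictors, grp_factor, method = "rf",  trControl = ctrl,
                 tuneLength = 2)

# Cross-validated accuracy and Kappa for each model
print(lda_fit$results[, c("Accuracy", "Kappa")])
print(rf_fit$results[, c("mtry", "Accuracy", "Kappa")])
```

The only substantive change relative to the posted calls is the conversion with factor(); everything else (the fold count, tuneLength = 2) is an arbitrary choice for the sketch.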
-- 
David.

>> RFgrps2Fit1 <- train(TrainDataRFgrps2, TrainClassesRFgrps2,
> +     method = "rf",
> +     tuneLength = 10,
> +     trControl = trainControl(method = "cv"));
> There were 50 or more warnings (use warnings() to see the first 50)

David Winsemius
Alameda, CA, USA
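The remark in the reply above that sensitivity, specificity and accuracy only exist relative to a decision threshold can be illustrated with a small base-R sketch. Everything here is assumed for illustration: the scores are simulated rather than taken from any fitted model, and `truth`, `score` and `metrics_at` are hypothetical names. The AUC is computed with the rank (Mann-Whitney) formula, which does not depend on any threshold.

```r
# Sketch with simulated scores; no fitted model or real data involved.
set.seed(42)
n <- 200
truth <- factor(rep(c("neg", "pos"), each = n / 2), levels = c("neg", "pos"))
# Hypothetical predicted probabilities of "pos": higher on average for true positives
score <- c(rnorm(n / 2, mean = 0.4, sd = 0.15),
           rnorm(n / 2, mean = 0.6, sd = 0.15))

# Confusion-matrix metrics at one decision threshold
metrics_at <- function(threshold) {
  pred <- factor(ifelse(score >= threshold, "pos", "neg"),
                 levels = c("neg", "pos"))
  tab <- table(predicted = pred, actual = truth)
  tp <- tab["pos", "pos"]; tn <- tab["neg", "neg"]
  fp <- tab["pos", "neg"]; fn <- tab["neg", "pos"]
  po <- (tp + tn) / sum(tab)                            # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance agreement
  c(threshold   = threshold,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = po,
    kappa       = (po - pe) / (1 - pe))
}

# Same scores, three thresholds, three different sets of "model" metrics
res <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
print(round(res, 3))

# Threshold-free summary: AUC from the ranks of the positive-class scores
r     <- rank(score)
n_pos <- sum(truth == "pos")
n_neg <- sum(truth == "neg")
auc   <- (sum(r[truth == "pos"]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Raising the threshold can only lower sensitivity and raise specificity, so any single reported pair reflects the cutoff as much as the model.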