Dear R Helpers,
I need help with a slightly unusual situation in which I am trying to
select some columns from a data frame. I know how to use the subset
statement with column names as in:
x=as.data.frame(matrix(c(1,2,3,
1,2,3,
1,2,2,
1,2,2,
1,1,1),ncol=3,byrow=T))
all.cols<-colnames(x)
to.keep<-all.cols[1:2]
Kept<-subset(x,select=to.keep)
Kept
However, if I want to select some columns based on a selection of the most
important variables from a random forest then I find myself stuck. The
example below demonstrates the problem.
library(randomForest)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars,importance=TRUE)
Importance<-data.frame(mtcars.rf$importance)
Importance
MSEImportance<-head(Importance[order(Importance$X.IncMSE,
decreasing=TRUE),],3)
MSEVars<-row.names(MSEImportance)
MSEVars<-data.frame(MSEVars,stringsAsFactors = FALSE)
colnames(MSEVars)<-"Vars"
NodeImportance<-head(Importance[order(Importance$IncNodePurity,decreasing=TRUE),],
3)
NodeVars<-row.names(NodeImportance)
NodeVars<-data.frame(NodeVars,stringsAsFactors = FALSE)
colnames(NodeVars)<-"Vars"
ImportantVars<-rbind(MSEVars,NodeVars)
ImportantVars<-unique(ImportantVars)
nrow(ImportantVars)
ImportantVars<-as.character(ImportantVars)
ImportantVars
CarsVarsKept<-subset(mtcars,select=ImportantVars)
Error in `[.data.frame`(x, r, vars, drop = drop) :
undefined columns selected
Any help on how to select these columns from the data frame would be most
appreciated.
--John J. Sparks, Ph.D.
Hello,

It works for me if I replace
> ImportantVars <- as.character(ImportantVars)
by
> ImportantVars <- ImportantVars$Vars

Hope this helps,
Pascal

2013/5/17 Sparks, John James <jspark4@uic.edu>
> Dear R Helpers,
>
> I need help with a slightly unusual situation in which I am trying to
> select some columns from a data frame.
> [...]
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
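A minimal sketch of why the suggested replacement helps. It assumes a hand-written stand-in for the one-column data frame built in the question (the names "wt", "disp" and "hp" are picked for illustration, not taken from a fitted forest). Calling as.character() on a data frame converts each column to a single string, so the result is not a vector of column names; extracting the column with $ is.

```r
# Hypothetical stand-in for the one-column data frame from the question;
# the variable names are chosen by hand, not produced by randomForest.
ImportantVars <- data.frame(Vars = c("wt", "disp", "hp"),
                            stringsAsFactors = FALSE)

# as.character() on a data frame yields one string per column,
# not one string per row, so it cannot be used to select columns.
collapsed <- as.character(ImportantVars)
length(collapsed)

# Extracting the column gives the character vector of names.
var_names <- ImportantVars$Vars

# With a plain character vector, the original subset() call works.
CarsVarsKept <- subset(mtcars, select = var_names)
colnames(CarsVarsKept)
```

The same extraction could also be written as ImportantVars[["Vars"]] or ImportantVars[, 1].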
On May 17, 2013, at 08:51, Sparks, John James wrote:

> Dear R Helpers,
>
> I need help with a slightly unusual situation in which I am trying to
> select some columns from a data frame. I know how to use the subset
> statement with column names as in:

Notice that subset() is a convenience function for command line use. The
non-standard evaluation tricks in it tend to become inconveniences if you
try to use subset() in a function (I can say that, I wrote the blasted
thing...). Just use normal subsetting functions instead and everything
behaves much more predictably. If ImportantVars is a vector of column
names, use mtcars[ImportantVars] (or mtcars[,ImportantVars], which also
works for matrices).

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
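The advice above can be sketched as follows, keeping the variable names in plain character vectors throughout instead of wrapping them in data frames. The importance values are made up for illustration, and top_names() is a hypothetical helper, not part of randomForest. The sketch assumes the importance matrix has columns named "%IncMSE" and "IncNodePurity", which is what the X.IncMSE name in the question suggests after data.frame() has altered it.

```r
# Made-up importance values, shaped like a regression forest's importance
# matrix (assumed column names); the numbers are illustrative only.
imp <- matrix(c(9.5, 8.1, 7.2, 6.0, 1.0,
                250, 240, 150, 180, 20),
              ncol = 2,
              dimnames = list(c("wt", "disp", "hp", "cyl", "am"),
                              c("%IncMSE", "IncNodePurity")))

# Hypothetical helper: names of the n rows with the largest values
# in the given column of a matrix.
top_names <- function(m, column, n = 3) {
  head(rownames(m)[order(m[, column], decreasing = TRUE)], n)
}

# Union of the top variables under both measures, as a character vector.
important <- union(top_names(imp, "%IncMSE"),
                   top_names(imp, "IncNodePurity"))

# Standard indexing with a character vector of column names.
kept <- mtcars[important]

# The two-index form also works when the data is a matrix.
kept_m <- as.matrix(mtcars)[, important]
```

Because `[` evaluates its index like any other argument, the same lines should behave identically inside a function, which is the predictability the reply refers to.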