Aaditya Nanduri
2012-Oct-17 05:01 UTC
[R] How to optimize or build a better random forest?
Hello everyone! It's been a while since I last posted a question; I hope everyone has been doing well.

~~~ CONTEXT ~~~

I recently entered a beginner-level competition on Kaggle. The goal of the competition is to build a model that predicts who did or did not survive on the Titanic. I decided to use random forests, as I have been wanting to learn the algorithm and the competition was the perfect impetus. Unfortunately, the model I have built is not very accurate.

~~~ QUESTION ~~~

What can I do to make the model more accurate (i.e., reduce the error for class 1, "survived")? Is there a cost matrix I can supply to the model? Should I improve the code? Learn more statistics (please suggest resources :) )?

~~~ SOME CODE ~~~

# Response variable: Survived
#   0 = Did not survive
#   1 = Did survive

# First few steps:
# 1. Used regsubsets to identify the 5 best variables
# 2. Cleaned the raw data and fit a logistic regression to see the
#    significance of the predictors (and of their levels, if factors)
# 3. Built a new 'train' dataset with a group of variables chosen
#    based on the significances from the logistic regression

# Please feel free to share feature selection/extraction methods,
# as I am clearly lacking in that area as well :(

> head(train)
  survived age sibsp pclass2 pclass3 sexmale
1        0  22     1       0       1       1
2        1  38     1       0       0       0
3        1  26     0       0       1       0
4        1  35     1       0       0       0
5        0  35     0       0       1       1
6        0  27     0       0       1       1

> sapply(train, class)
 survived       age     sibsp   pclass2   pclass3   sexmale
 "factor" "numeric" "integer"  "factor"  "factor"  "factor"

> sapply(split(train, train$survived), function(x) dim(x)[1])
  0   1
549 342

> rf <- randomForest(train[,-1], train[,1], ntree=10000,
+                    classwt=c(549/891, 342/891),
+                    importance=TRUE, do.trace=FALSE)

        OOB estimate of error rate: 17.73%
Confusion matrix:
    0   1 class.error
0 500  49  0.08925319
1 109 233  0.31871345
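On the class-imbalance question: instead of `classwt`, randomForest's documented `strata`/`sampsize` arguments let each tree draw a balanced bootstrap sample, and `cutoff` shifts the voting threshold toward the minority class. The sketch below uses a simulated stand-in for the Titanic frame (549 zeros, 342 ones), not the real competition data, so the numbers are only illustrative:

```r
# Sketch: balanced stratified sampling instead of classwt.
# 'toy' is a simulated stand-in for the poster's train data frame.
library(randomForest)

set.seed(1)
n0 <- 549; n1 <- 342
toy <- data.frame(
  survived = factor(c(rep(0, n0), rep(1, n1))),
  age      = c(rnorm(n0, 30, 12), rnorm(n1, 28, 14)),
  sexmale  = factor(c(rbinom(n0, 1, 0.85), rbinom(n1, 1, 0.32)))
)

rf <- randomForest(
  toy[, -1], toy[, 1],
  ntree    = 500,
  strata   = toy$survived,   # stratify bootstrap sampling on the response
  sampsize = c(342, 342),    # draw equal numbers from each class per tree
  cutoff   = c(0.5, 0.5)     # per-class voting threshold; lower class 1's
)                            # entry to trade 0-errors for fewer 1-errors
rf$confusion
```

Lowering the second `cutoff` entry (e.g. `c(0.6, 0.4)`) makes the forest call "survived" on a smaller vote share, which typically reduces the class-1 error at the cost of more class-0 errors.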
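On feature selection: randomForest itself reports variable importance when fit with `importance=TRUE`, via the package's `importance()` and `varImpPlot()` functions. A minimal sketch on simulated data (the column names only mimic the poster's frame):

```r
# Sketch: built-in variable importance as a feature-selection aid.
# 'toy' is simulated; in the real workflow this would be the train frame.
library(randomForest)

set.seed(1)
toy <- data.frame(
  survived = factor(rbinom(200, 1, 0.4)),
  age      = rnorm(200, 30, 12),
  sibsp    = rpois(200, 1)
)
# Make sexmale informative so it ranks high in the importance table
toy$sexmale <- factor(ifelse(toy$survived == 1,
                             rbinom(nrow(toy), 1, 0.3),
                             rbinom(nrow(toy), 1, 0.8)))

rf <- randomForest(survived ~ ., data = toy, ntree = 300, importance = TRUE)

importance(rf)   # per-predictor MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf)   # dot chart of both importance measures
```

Predictors with low mean decrease in accuracy are candidates to drop, which is an embedded alternative to the regsubsets/logistic-regression screening described in the steps above.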
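One more tuning knob worth trying before adding trees: `mtry`, the number of predictors sampled at each split. The randomForest package ships `tuneRF()`, which searches over `mtry` by OOB error. A sketch on simulated data (again, not the competition data):

```r
# Sketch: search for a good mtry with tuneRF(), using simulated data.
library(randomForest)

set.seed(2)
toy_x <- data.frame(a = rnorm(300), b = rnorm(300),
                    c = rnorm(300), d = rnorm(300))
toy_y <- factor(ifelse(toy_x$a + rnorm(300) > 0, 1, 0))

res <- tuneRF(toy_x, toy_y,
              ntreeTry   = 200,   # trees per candidate mtry
              stepFactor = 2,     # multiply/divide mtry by this each step
              improve    = 0.01,  # keep stepping while OOB error improves
              trace = FALSE, plot = FALSE)
res   # matrix of mtry values and their OOB error estimates
```

With only five predictors the search space is small, but the same call scales to wider feature sets, and the OOB error it reports is the same quantity shown in the confusion-matrix output above.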