Aaditya Nanduri
2012-Oct-17 05:01 UTC
[R] How to optimize or build a better random forest?
Hello Everyone!
It's been a while since I last posted a question! Hope everyone has been
doing well!
~~~ CONTEXT ~~~
I have recently entered a beginner-level competition on Kaggle. The
goal of the competition is to build a model that predicts who did or did
not survive on the Titanic.
I decided to use random forests, as I had been wanting to learn the
algorithm and the competition was the perfect impetus. Unfortunately, the
model I have built is not very accurate.
~~~ QUESTION ~~~
What can I do to make the model more accurate, in particular to reduce
the error for class 1 (survived)? Is there a cost matrix I can pass to
the model? Should I improve the code? Learn more statistics (please
recommend resources :) )?
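From reading ?randomForest, I believe the closest things to a cost matrix
are the classwt, cutoff, and strata/sampsize arguments. Below is a
minimal, untested sketch of how I understand two of them would apply to
my data (the 'train' data frame and the class counts are shown in the
code section further down):

library(randomForest)

## Sketch only: 'train' is the data frame shown below, with column 1
## the factor response 'survived' (549 zeros, 342 ones).

## Option 1: stratified down-sampling, so each tree sees the two
## classes in equal numbers.
rf_bal <- randomForest(train[,-1], train[,1], ntree=2000,
                       strata=train$survived, sampsize=c(342, 342))

## Option 2: lower the vote cutoff for class 1, so fewer votes are
## needed to predict "survived" (the default cutoff is 1/k per class).
rf_cut <- randomForest(train[,-1], train[,1], ntree=2000,
                       cutoff=c(0.6, 0.4))

Does either of these make sense, or is there a proper cost-matrix
interface that I am missing?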
~~~ SOME CODE ~~~
# Response variable: Survived
# 0 = Did not survive
# 1 = Did survive
# First few steps
# 1. Used regsubsets to identify the 5 best variables
# 2. Cleaned the raw data and built a logistic regression to see the
#    significance of the predictors (and their levels if factor)
# 3. Developed a new 'train' dataset with a group of variables based on
#    the significances from the logistic regression
# PLEASE FEEL FREE TO SHARE FEATURE SELECTION/EXTRACTION METHODS AS I AM
# CLEARLY LACKING IN THAT AREA AS WELL :(
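In case it helps, steps 1 and 2 looked roughly like this (a from-memory
sketch, not my exact script; 'raw' stands in for my cleaned data frame
and the column names are the usual Titanic ones):

library(leaps)

## Step 1: exhaustive best-subset search, keeping at most 5 predictors.
subs <- regsubsets(survived ~ ., data=raw, nvmax=5)
summary(subs)$which   # which variables enter the best model of each size

## Step 2: logistic regression on the selected variables, to check the
## significance of each predictor (and of each factor level).
fit <- glm(survived ~ pclass + sex + age + sibsp, data=raw,
           family=binomial)
summary(fit)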
> head(train)
  survived age sibsp pclass2 pclass3 sexmale
1        0  22     1       0       1       1
2        1  38     1       0       0       0
3        1  26     0       0       1       0
4        1  35     1       0       0       0
5        0  35     0       0       1       1
6        0  27     0       0       1       1
> sapply(train, class)
 survived       age     sibsp   pclass2   pclass3   sexmale
 "factor" "numeric" "integer"  "factor"  "factor"  "factor"
> sapply(split(train, train$survived), function(x) dim(x)[1])
  0   1
549 342
> library(randomForest)
> rf <- randomForest(train[,-1], train[,1], ntree=10000,
+        classwt=c(549/891, 342/891), importance=TRUE, do.trace=FALSE)
        OOB estimate of error rate: 17.73%
Confusion matrix:
    0   1 class.error
0 500  49  0.08925319
1 109 233  0.31871345
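One thing I am now second-guessing: I set classwt proportional to the
class frequencies, but from what I have read the heavier weight should
usually go to the rarer class, so mine may be backwards. For the next
round I was planning to check variable importance and tune mtry, roughly
like this (sketch only; 'rf' and 'train' as above):

## Which predictors actually matter? (needs importance=TRUE, as above)
importance(rf)
varImpPlot(rf)

## Tune mtry (variables tried at each split) against the OOB error.
set.seed(1)
tuneRF(train[,-1], train[,1], ntreeTry=500, stepFactor=1.5, improve=0.01)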
