Aaditya Nanduri
2012-Oct-17 05:01 UTC
[R] How to optimize or build a better random forest?
Hello Everyone!
It's been a while since I last posted a question! Hope everyone has been
doing well!
~~~ CONTEXT ~~~
I have recently entered a beginner-level competition on Kaggle. The
goal of the competition is to build a model that predicts who did or did
not survive on the Titanic.
I decided to use random forests, as I had been wanting to learn the
algorithm and the competition was the perfect impetus. Unfortunately, the
model I have built is not very accurate.
~~~ QUESTION ~~~
What can I do to make the model more accurate, in particular to reduce
the error for class 1 (survived)? Is there a cost matrix I can pass to
the model? Should I improve the code? Learn more statistics (please
recommend resources :) )?
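From reading ?randomForest, I believe the closest things to a cost matrix
are the classwt, cutoff, and strata/sampsize arguments. Below is a
minimal, untested sketch of how I understand two of them would apply to
my data (the 'train' data frame and the class counts are shown in the
code section further down):

library(randomForest)

## Sketch only: 'train' is the data frame shown below, with column 1
## the factor response 'survived' (549 zeros, 342 ones).

## Option 1: stratified down-sampling, so each tree sees the two
## classes in equal numbers.
rf_bal <- randomForest(train[,-1], train[,1], ntree=2000,
                       strata=train$survived, sampsize=c(342, 342))

## Option 2: lower the vote cutoff for class 1, so fewer votes are
## needed to predict "survived" (the default cutoff is 1/k per class).
rf_cut <- randomForest(train[,-1], train[,1], ntree=2000,
                       cutoff=c(0.6, 0.4))

Does either of these make sense, or is there a proper cost-matrix
interface that I am missing?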
~~~ SOME CODE ~~~
# Response variable: Survived
# 0 = Did not survive
# 1 = Did survive
# First few steps
# 1. Used regsubsets to identify the 5 best variables
# 2. Cleaned the raw data and built a logistic regression to see the
#    significance of the predictors (and their levels if factor)
# 3. Developed a new 'train' dataset with a group of variables based on
#    the significances from the logistic regression
# PLEASE FEEL FREE TO SHARE FEATURE SELECTION/EXTRACTION METHODS AS I AM
# CLEARLY LACKING IN THAT AREA AS WELL :(
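In case it helps, steps 1 and 2 looked roughly like this (a from-memory
sketch, not my exact script; 'raw' stands in for my cleaned data frame
and the column names are the usual Titanic ones):

library(leaps)

## Step 1: exhaustive best-subset search, keeping at most 5 predictors.
subs <- regsubsets(survived ~ ., data=raw, nvmax=5)
summary(subs)$which   # which variables enter the best model of each size

## Step 2: logistic regression on the selected variables, to check the
## significance of each predictor (and of each factor level).
fit <- glm(survived ~ pclass + sex + age + sibsp, data=raw,
           family=binomial)
summary(fit)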
> head(train)
  survived age sibsp pclass2 pclass3 sexmale
1        0  22     1       0       1       1
2        1  38     1       0       0       0
3        1  26     0       0       1       0
4        1  35     1       0       0       0
5        0  35     0       0       1       1
6        0  27     0       0       1       1
> sapply(train, class)
 survived       age     sibsp   pclass2   pclass3   sexmale
 "factor" "numeric" "integer"  "factor"  "factor"  "factor"
> sapply(split(train, train$survived), function(x) dim(x)[1])
  0   1
549 342
> library(randomForest)
> rf <- randomForest(train[,-1], train[,1], ntree=10000,
+        classwt=c(549/891, 342/891), importance=TRUE, do.trace=FALSE)
        OOB estimate of error rate: 17.73%
Confusion matrix:
    0   1 class.error
0 500  49  0.08925319
1 109 233  0.31871345
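One thing I am now second-guessing: I set classwt proportional to the
class frequencies, but from what I have read the heavier weight should
usually go to the rarer class, so mine may be backwards. For the next
round I was planning to check variable importance and tune mtry, roughly
like this (sketch only; 'rf' and 'train' as above):

## Which predictors actually matter? (needs importance=TRUE, as above)
importance(rf)
varImpPlot(rf)

## Tune mtry (variables tried at each split) against the OOB error.
set.seed(1)
tuneRF(train[,-1], train[,1], ntreeTry=500, stepFactor=1.5, improve=0.01)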
