thr3ads.net - R help - [R] different randomForest performance for same data [Dec 2009]

Häring, Tim (LWF)

2009-Dec-10 10:00 UTC

[R] different randomForest performance for same data

Hello,

I came across a problem when building a randomForest model. Maybe someone can
help me.
I have a training and a test dataset with a discrete response and ten predictors
(numeric and factor variables). The two datasets match in the number of
predictors, the variable names and the variable types (factor, numeric), except
that one predictor has 20 levels in the training dataset but only 19 levels in
the test dataset.
I found that the model performance differs between two runs: training and
testing with the unchanged datasets, and doing the same after assigning the
levels of the training dataset to the test dataset. I only assign the levels and
do not change the data itself, yet the models perform differently.
Why?

Here is my code:

> library(randomForest)
> load("datasets.RData")  # import traindat and testdat
> nlevels(traindat$predictor1)
[1] 20
> nlevels(testdat$predictor1)
[1] 19
> nrow(traindat)
[1] 9838
> nrow(testdat)
[1] 3841
> set.seed(10)
> rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1],
    xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> data.frame(rf_orig$test$err.rate)[100,1]      # Error on test dataset
[1] 0.3082531

# assign the levels of the training dataset to the test dataset for predictor 1
> levels(testdat$predictor1) <- levels(traindat$predictor1)
> nlevels(traindat$predictor1)
[1] 20
> nlevels(testdat$predictor1)
[1] 20
> nrow(traindat)
[1] 9838
> nrow(testdat)
[1] 3841
> set.seed(10)
> rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1],
    xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> data.frame(rf_mod$test$err.rate)[100,1]       # Error on test dataset
[1] 0.4808644  # is different

Cheers,
TIM

Uwe Ligges

2009-Dec-13 19:27 UTC

[R] different randomForest performance for same data

Say testdat has 19 levels called L2, ..., L20 and traindat has 20 levels
called L1, ..., L20.

After your call
  levels(testdat$predictor1) <- levels(traindat$predictor1)
you have renamed L2 -> L1, L3 -> L2, ..., L20 -> L19 and invented a new
level L20 that is unused.
Hence you have scrambled all the levels. Even if your training were
perfect, you would end up with an error rate of 100%, because the
renamed levels in the test data no longer match those in the training
data.
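
The renaming described above can be reproduced on a small scale. This is only a
minimal sketch: the four level names and the objects train_f and test_f are
made-up stand-ins for the twenty real levels and are not taken from the
original data.

```r
# Hypothetical stand-ins: the training factor has levels L1..L4,
# the test factor lacks L1 and so has only L2..L4.
train_f <- factor(c("L1", "L2", "L3", "L4"))
test_f  <- factor(c("L2", "L3", "L4", "L4"))

# What the original post did: `levels<-` relabels by position, so the
# i-th existing level simply receives the i-th new name.
relabelled <- test_f
levels(relabelled) <- levels(train_f)

# Compare the values before and after the assignment.
data.frame(before = as.character(test_f),
           after  = as.character(relabelled))

# Every observation now carries a different label than before.
changed <- as.character(relabelled) != as.character(test_f)
all(changed)
```

In this toy case every value is shifted down by one level, which is the same
pattern as L2 -> L1, ..., L20 -> L19 in the reply.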

Uwe Ligges


> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Liaw, Andy

2009-Dec-15 14:22 UTC

[R] different randomForest performance for same data

You need to be _extremely_ careful when assigning levels of factors.  Look at
this example:

R> x1 = factor(c("a", "b", "c"))
R> x2 = factor(c("a", "c", "c"))
R> x3 = x2
R> levels(x3) <- levels(x1)
R> x3
[1] a b b
Levels: a b c

I'll try to add more XXXXproofing in the code... 
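
For contrast with the example above, one way to give a factor the full level
set without touching its values is to rebuild it from its labels. This is a
sketch of a common base R idiom, not something proposed in the thread; x4 is a
hypothetical name.

```r
x1 <- factor(c("a", "b", "c"))
x2 <- factor(c("a", "c", "c"))

# Check first that every label in x2 is known to x1; otherwise the
# rebuilt factor would contain NA for the unknown labels.
stopifnot(all(levels(x2) %in% levels(x1)))

# Rebuild by label, not by position: values are matched against the
# requested level set instead of being renamed.
x4 <- factor(as.character(x2), levels = levels(x1))
x4
# [1] a c c
# Levels: a b c

# The observed values are unchanged; only the level set grew.
identical(as.character(x4), as.character(x2))
```

Applied to the objects from the original post, the analogous call would
presumably be
testdat$predictor1 <- factor(as.character(testdat$predictor1), levels = levels(traindat$predictor1)),
assuming all 19 test levels also occur among the 20 training levels.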

Andy
