Häring, Tim (LWF)
2009-Dec-07 15:58 UTC
[R] different model performance because of nlevels()???
Hello everybody,

I came across a problem when building a randomForest model. Maybe someone can help me.

I have a training and a test dataset with a discrete response and ten predictors (numeric and factor variables). The two datasets are the same in terms of number of predictors, variable names and variable data types (factor, numeric), except that one predictor has 20 levels in the training dataset but only 19 levels in the test dataset.

I found that the model performance differs between training and testing a model with the unchanged datasets on the one hand, and doing the same after assigning the levels of the training dataset to the test dataset on the other. I only assign the levels and do not change the dataset itself, yet the models perform differently. Why???

Here is my code:

> library(randomForest)
> load("datasets.RData") # import traindat and testdat
> nlevels(traindat$predictor1)
[1] 20
> nlevels(testdat$predictor1)
[1] 19
> nrow(traindat)
[1] 9838
> nrow(testdat)
[1] 3841
> set.seed(10)
> rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1], xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> data.frame(rf_orig$test$err.rate)[100,1] # Error on test-dataset
[1] 0.3082531

# assign the levels of the training dataset to the test dataset for predictor 1
> levels(testdat$predictor1) <- levels(traindat$predictor1)
> nlevels(traindat$predictor1)
[1] 20
> nlevels(testdat$predictor1)
[1] 20
> nrow(traindat)
[1] 9838
> nrow(testdat)
[1] 3841
> set.seed(10)
> rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1], xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> data.frame(rf_mod$test$err.rate)[100,1] # Error on test-dataset
[1] 0.4808644 # is different

Cheers,
TIM
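
One thing that may be worth checking is what the level assignment actually does to the values. The sketch below uses a small toy factor, not the real data; `train_f`, `test_f`, `relabelled` and `extended` are hypothetical names made up for illustration. In base R, `levels(x) <- value` relabels the existing levels by position, so if the level missing from the test data is not the last one, every later level gets a different label. `factor(x, levels = ...)` instead matches by label and leaves the values untouched. Whether this explains the error rates above is an assumption that would need to be verified on `testdat$predictor1`.

```r
# Toy data (hypothetical): the test factor lacks level "b".
train_f <- factor(c("a", "b", "c", "a"))   # levels: "a" "b" "c"
test_f  <- factor(c("a", "c", "c"))        # levels: "a" "c"

# Variant 1: assign the training levels directly.
# The existing levels are renamed by position: "a" -> "a", "c" -> "b".
relabelled <- test_f
levels(relabelled) <- levels(train_f)
as.character(relabelled)   # "a" "b" "b"  (values have changed)

# Variant 2: rebuild the factor with the training level set.
# Values are matched by label, so the data stay the same.
extended <- factor(test_f, levels = levels(train_f))
as.character(extended)     # "a" "c" "c"  (values unchanged)

# Both variants end up with the same level set, so nlevels() alone
# cannot tell them apart.
nlevels(relabelled) == nlevels(extended)
```

A quick check on the real data would be to compare `as.character(testdat$predictor1)` before and after the assignment, or to cross-tabulate the two versions with `table()`.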