Yuliya Matveyeva
2012-Nov-01 11:37 UTC
[R] oblique.tree: the predict function requires the dependent variable to be included in "newdata"
Dear R community,

I have recently discovered the package oblique.tree, and it was a pleasant surprise, since I have written my own classifier based on the same idea of oblique splits (splits by means of hyperplanes). I am now interested in comparing the two classifiers. What I do not understand is why predict.oblique.tree requires the dependent variable to be present in `newdata`. I have set update.tree.predictions to FALSE, and I used the formula interface when fitting the model ( y ~ . ). Is there a way to avoid this behaviour? Or should I simply add a dummy dependent-variable column to my test set in order to use the prediction function? And in the latter case, can I be sure that the dependent variable is never actually used during prediction?

I would be grateful for any tips regarding this problem.

A piece of reproducible code:

# -------------------------------------------------------------------------------------------------------
library(oblique.tree)

N <- 100
nvars <- 3
x <- array(rnorm(n = N * nvars), c(N, nvars))
y <- as.factor(sample(0:1, size = N, replace = TRUE))
m <- data.frame(x, y)
var_names <- colnames(m)
var_x_names <- var_names[-length(var_names)]

n_train <- floor(N / 2)
n_test <- N - n_train
train <- m[1:n_train, ]
test <- m[-(1:n_train), ]

bot <- oblique.tree(formula = y ~ .,
                    data = train,
                    oblique.splits = "on",
                    variable.selection = "none",
                    split.impurity = "gini")

### If the dependent variable is excluded from `newdata`, the code fails with:
# Error in model.frame.default(formula = as.formula(eval(object$call$formula)), :
#   variable lengths differ (found for 'X1')
# In addition: Warning message:
# 'newdata' had 50 rows but variable(s) found have 100 rows
pred <- predict(bot, newdata = train[, var_x_names],
                type = "vector", update.tree.predictions = FALSE)

### An error does not occur if the dependent variable is included
### in `newdata`:
pred <- predict(bot, newdata = train[, var_names],
                type = "vector", update.tree.predictions = FALSE)

### However, the result of the prediction does not seem to depend on
### the values of the dependent variable included in the data:
pred1 <- predict(bot, newdata = test[, var_names],
                 type = "vector", update.tree.predictions = FALSE)
test$y <- as.factor(sample(0:1, size = dim(test)[1], replace = TRUE))
pred2 <- predict(bot, newdata = test[, var_names],
                 type = "vector", update.tree.predictions = FALSE)
abs(mean(pred1[, 1] - pred2[, 1]))
if (abs(mean(pred1[, 1] - pred2[, 1])) > 1e-3) {
  print("Results do differ.")
}

### What is more curious is that the error message changes if I
### write my data.frame to disk and then read it back in:
write.table(m, file = "m.txt", col.names = TRUE, row.names = FALSE, quote = FALSE)
rm(list = ls())
m <- read.table("m.txt", header = TRUE, colClasses = "numeric")
m$y <- as.factor(m$y)
var_names <- colnames(m)
var_x_names <- var_names[-length(var_names)]
N <- dim(m)[1]
n_train <- floor(N / 2)
n_test <- N - n_train
train <- m[1:n_train, ]
test <- m[-(1:n_train), ]
rm(m)

bot <- oblique.tree(formula = y ~ .,
                    data = train,
                    oblique.splits = "on",
                    variable.selection = "none",
                    split.impurity = "gini")

### If the dependent variable is excluded from `newdata`, the code now fails with:
# Error in eval(expr, envir, enclos) : object 'y' not found
pred <- predict(bot, newdata = train[, var_x_names],
                type = "vector", update.tree.predictions = FALSE)
# -------------------------------------------------------------------------------------------------------

--
Sincerely yours,
Yulia Matveyeva,
Department of Statistical Modelling,
Faculty of Mathematics and Mechanics,
St Petersburg State University, Russia
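P.S. In case it clarifies my question, this is the dummy-column workaround I have in mind. It is only a sketch: the predictors X1..X3 stand in for my real test set, the levels of the dummy `y` match my training data, and I am assuming (this is exactly what I want to confirm) that its values are ignored during prediction.

# A test set containing only the predictors, no response:
test_x <- data.frame(X1 = rnorm(5), X2 = rnorm(5), X3 = rnorm(5))

# Attach a dummy response column so that model.frame() can resolve
# `y` inside predict(); the values themselves are assumed unused.
test_x$y <- factor(rep(0, nrow(test_x)), levels = c(0, 1))

# The actual call would then mirror the code above:
# pred <- predict(bot, newdata = test_x, type = "vector",
#                 update.tree.predictions = FALSE)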