Yuliya Matveyeva
2012-Nov-01 11:37 UTC
[R] oblique.tree : the predict function asserts the dependent variable to be included in "newdata"
Dear R community,
I have recently discovered the package oblique.tree, and it was a pleasant
surprise: I have written my own classifier based on the idea of oblique
splits (splits by means of hyperplanes), so I am now interested in
comparing the two classifiers.
What I do not understand is why the function predict.oblique.tree
requires the dependent variable to be present in `newdata`.
I have set update.tree.predictions to FALSE, and I used the formula
interface ( y ~ . ) when creating the model.
Is there a way to avoid this behaviour? Or should I just add a dummy
dependent-variable column to my test set in order to use the prediction
function? And in the latter case, can I be sure that the dependent
variable is never actually used in the prediction procedure?
I would be really grateful for any tips regarding this problem.
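In case it clarifies the question, the dummy-column workaround I have in
mind would look roughly like this (a sketch only, reusing `bot`, `train`,
`test` and `var_x_names` from the code below; it assumes the placeholder
values of `y` are ignored by the prediction procedure, which is exactly
what I am unsure about):

```r
## Hypothetical dummy-column workaround (not verified):
## append a placeholder dependent variable with the training factor levels
## so that the model.frame call inside predict.oblique.tree can find `y`.
newdata <- test[, var_x_names]
newdata$y <- factor(rep(levels(train$y)[1], nrow(newdata)),
                    levels = levels(train$y))
pred <- predict(bot, newdata = newdata,
                type = "vector", update.tree.predictions = FALSE)
```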
A piece of reproducible code :
# -------------------------------------------------------------------------------------------------------
library(oblique.tree)
N <- 100; nvars <- 3;
x <- array(rnorm(n = N*nvars), c(N,nvars))
y <- as.factor(sample(0:1, size = N, replace = TRUE))
m <- data.frame(x,y);
var_names <- colnames(m);
var_x_names <- var_names[-length(var_names)]
n_train <- floor(N/2); n_test <- N - n_train;
train <- m[1:n_train,]; test <- m[-(1:n_train),];
bot <- oblique.tree(formula = y ~., data = train,
oblique.splits = "on", variable.selection = "none",
split.impurity = "gini");
### If the dependent variable is excluded from `newdata`, the code ends up
### with this error:
# Error in model.frame.default(formula = as.formula(eval(object$call$formula)), :
#   variable lengths differ (found for 'X1')
# In addition: Warning message:
# 'newdata' had 50 rows but variable(s) found have 100 rows
pred <- predict(bot, newdata = train[, var_x_names],
                type = "vector", update.tree.predictions = FALSE)
### An error does not occur if the dependent variable is included in
### `newdata`
pred <- predict(bot, newdata = train[, var_names],
                type = "vector", update.tree.predictions = FALSE)
### However, the result of the prediction does not seem to depend on
### the values of the dependent variable included in the data
pred1 <- predict(bot, newdata = test[, var_names],
                 type = "vector", update.tree.predictions = FALSE)
test$y <- as.factor(sample(0:1, size = nrow(test), replace = TRUE))
pred2 <- predict(bot, newdata = test[, var_names],
                 type = "vector", update.tree.predictions = FALSE)
### mean absolute difference between the two sets of predictions
mean(abs(pred1[, 1] - pred2[, 1]))
if (mean(abs(pred1[, 1] - pred2[, 1])) > 1e-3) {
    print("Results do differ.")
}
### What is more curious is that the error message changes if I
### write my data.frame and then read it again.
write.table(m, file = "m.txt", col.names = TRUE, row.names = FALSE, quote = FALSE)
rm(list = ls())
m <- read.table("m.txt", header = TRUE, colClasses = "numeric")
m$y <- as.factor(m$y)
var_names <- colnames(m);
var_x_names <- var_names[-length(var_names)]
N <- nrow(m);
n_train <- floor(N/2); n_test <- N - n_train;
train <- m[1:n_train,]; test <- m[-(1:n_train),]; rm(m);
bot <- oblique.tree(formula = y ~., data = train,
oblique.splits = "on", variable.selection = "none",
split.impurity = "gini");
### If the dependent variable is excluded from `newdata`, the code now ends
### with a different error:
# Error in eval(expr, envir, enclos) : object 'y' not found
pred <- predict(bot, newdata = train[, var_x_names],
                type = "vector", update.tree.predictions = FALSE)
# -------------------------------------------------------------------------------------------------------
--
Sincerely yours,
Yulia Matveyeva,
Department of Statistical Modelling,
Faculty of Mathematics and Mechanics,
St Petersburg State University, Russia
