I'm trying to fit a naive Bayes model and predict on a new data set using
the functions naiveBayes and predict (package e1071).
R version 2.5.1 on a Linux machine
My data set looks like this. "class" is the response and k1 - k3 are the
independent variables. All of them are factors. The response has 52 levels
and k1 - k3 have 2-6 levels each. I have about 9,300 independent variables but
omit the long list here to keep the demonstration simple. There are no missing
values in the observations.
class k1 k2 k3
1 0 0 1
8 0 0 0
# model fitting; I also tried laplace=0, but that didn't help
nbmodel <- naiveBayes(class~., data=train, laplace=1)
# predict
nb.fit <- predict(nbmodel, x.test[,-1])
At first I had no trouble fitting the model, and R returned predictions
for some of my large data sets. For other data sets, though, R fits the model
(no error message, and nbmodel$tables looks OK), but when I call the predict
function it keeps giving me the following message:
# my data set has 1 response variable and 9318 independent variables
Error in FUN(1:9319[[4L]], ...) : subscript out of bounds
# Here's what traceback() returns
10: FUN(1:9319[[4L]], ...)
9: lapply(X, FUN, ...)
8: sapply(1:nattribs, function(v) {
nd <- ndata[v]
if (is.na(nd))
rep(1, length(object$apriori))
else {
prob <- if (isnumeric[v]) {
msd <- object$tables[[v]]
dnorm(nd, msd[, 1], msd[, 2])
}
else object$tables[[v]][, nd]
prob[prob == 0] <- threshold
prob
}
})
7: log(sapply(1:nattribs, function(v) {
nd <- ndata[v]
if (is.na(nd))
rep(1, length(object$apriori))
else {
prob <- if (isnumeric[v]) {
msd <- object$tables[[v]]
dnorm(nd, msd[, 1], msd[, 2])
}
else object$tables[[v]][, nd]
prob[prob == 0] <- threshold
prob
}
}))
6: apply(log(sapply(1:nattribs, function(v) {
nd <- ndata[v]
if (is.na(nd))
rep(1, length(object$apriori))
else {
prob <- if (isnumeric[v]) {
msd <- object$tables[[v]]
dnorm(nd, msd[, 1], msd[, 2])
}
else object$tables[[v]][, nd]
prob[prob == 0] <- threshold
prob
}
})), 1, sum)
5: FUN(1:30[[1L]], ...)
4: lapply(X, FUN, ...)
3: sapply(1:nrow(newdata), function(i) {
ndata <- newdata[i, ]
L <- log(object$apriori) + apply(log(sapply(1:nattribs, function(v) {
nd <- ndata[v]
if (is.na(nd))
rep(1, length(object$apriori))
else {
prob <- if (isnumeric[v]) {
msd <- object$tables[[v]]
dnorm(nd, msd[, 1], msd[, 2])
}
else object$tables[[v]][, nd]
prob[prob == 0] <- threshold
prob
}
})), 1, sum)
if (type == "class")
L
else {
L <- exp(L)
L/sum(L)
}
})
2: predict.naiveBayes(nbmodel, validf[1:30, ])
1: predict(nbmodel, validf[1:30, ])
Does anyone have an idea what went wrong? Thanks in advance.
Stephen Weigand
2007-Aug-29 02:09 UTC
[R] "subscript out of bounds" Error in predict.naivebayes
On 8/22/07, Polly He <biyuhe at gmail.com> wrote:
> I'm trying to fit a naive Bayes model and predict on a new data set using
> the functions naivebayes and predict (package = e1071).
>
> R version 2.5.1 on a Linux machine
>
> My data set looks like this. "class" is the response and k1 - k3 are the
> independent variables. All of them are factors. The response has 52 levels
> and k1 - k3 have 2-6 levels. I have about 9,300 independent variables but
> omit the long list here for simple demonstration. There are no missing
> values in the observations.
>
> class k1 k2 k3
> 1 0 0 1
> 8 0 0 0
>
> # model fitting, I also tried setting laplace=0 but didn't help
> nbmodel <- naiveBayes(class~., data=train, laplace=1)
>
> # predict
> nb.fit <- predict(nbmodel, x.test[,-1])
>
> First I had no trouble fitting the model. R also returned the predictions
> for some of my large data sets. But for some data sets, R can fit the model
> (no error message, nb.model$tables look ok). When I invoked the predict
> function, it kept giving me the following message:
>
> # my data set has 1 response variable and 9318 independent variables
> Error in FUN(1:9319[[4L]], ...) : subscript out of bounds
[...]

In my experience, some predict methods have trouble when newdata does not
have all levels of a factor. This seems to be the case with
predict.naiveBayes:

    example(naiveBayes)
    predict(model, subset(HouseVotes84, V1 == "n"))

gives

    Error in object$tables[[v]] : subscript out of bounds

One workaround is to predict for a "bigger" data set and retain a subset
of the predictions.

Hope this helps,

Stephen

--
Rochester, Minn. USA
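
If mismatched factor levels between the training data and newdata are indeed the cause, another thing to try is to rebuild each factor in newdata with the levels used at fitting time before calling predict. The following is only a sketch in base R under that assumption. The toy data frames `train` and `test` and the helper `align_levels` are made up for illustration and are not part of e1071, and I have not checked this against predict.naiveBayes on the original data.

```r
# Hypothetical toy data standing in for the real training and test sets.
train <- data.frame(
  class = factor(c(1, 8, 1, 8)),
  k1    = factor(c(0, 1, 2, 0)),   # levels "0", "1", "2"
  k2    = factor(c(0, 0, 1, 1))    # levels "0", "1"
)
test <- data.frame(
  k1 = factor(c(0, 0)),            # only level "0" is present
  k2 = factor(c(1, 1))             # only level "1" is present
)

# Illustrative helper (not an e1071 function): for every column shared with
# `reference`, rebuild the factor so it carries the reference levels.
# Values never seen in `reference` become NA.
align_levels <- function(newdata, reference) {
  for (nm in intersect(names(newdata), names(reference))) {
    if (is.factor(reference[[nm]])) {
      newdata[[nm]] <- factor(as.character(newdata[[nm]]),
                              levels = levels(reference[[nm]]))
    }
  }
  newdata
}

test_aligned <- align_levels(test, train)

# The integer code behind k2 == "1" changes once the levels are aligned,
# which matters for any method that indexes a table by that code.
as.integer(test$k2)
as.integer(test_aligned$k2)
```

With the levels aligned, something like predict(nbmodel, test_aligned) would at least see the same integer coding for each factor as the fitted tables do. Whether that removes the error in this particular case is something to verify.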