matthew_wiener@merck.com
2002-Mar-13 17:05 UTC
[Rd] rpart error with 0-frequency factor levels (with partial fix) (PR#1378)
(I'm sending to r-bugs because rpart is one of the recommended packages and is always installed. I'm also sending it directly to Dr. Ripley, as the maintainer.) rpart working as a classifier does not work (produces no splits) when the class indicator has no instances of one of the factor levels, as long as the factor level is not the final level. I have at least a partial fix, which I describe after the example. I don't think the fix is ideal. I also don't know whether some of the other rpart methods might have a similar problem. This may be an atypical situation, but I have come across this when using rpart on a subset of a data frame that happens not to have any instances of one level of a factor. I've gotten identical results with R-1.4.1 on NT 4.0 and Mac OS X (Darwin version, running as an inferior process under emacs). System info for both machines is below. ---------------------------------------------------------------------------- ---------------------------------------------------------------------------- - Here's an example. The following is run immediately following startup in a completely empty directory (no .RData, no .Rhistory).> t1 <- data.frame(V1 = sample(c("A", "B"), 100, replace = T), V2 runif(100), V3 = runif(100)) > > levels(t1$V1)[1] "A" "B"> > t2 <- t1 > t2$V1 <- factor(t2$V1, levels = c("A", "B", "C")) > t3 <- t2 > t3$V1[t3$V1 == "B"] <- "C" > > table(t2$V1)A B C 51 49 0> table(t3$V1)A B C 51 0 49> > library(rpart) > t1.rpart <- rpart(V1 ~ V2 + V3, data = t1, method = "class") > t2.rpart <- rpart(V1 ~ V2 + V3, data = t2, method = "class") > t3.rpart <- rpart(V1 ~ V2 + V3, data = t3, method = "class") > > print(t1.rpart)n= 100 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 100 49 A (0.5100000 0.4900000) 2) V3< 0.319873 33 13 A (0.6060606 0.3939394) 4) V3>=0.0740043 26 8 A (0.6923077 0.3076923) * 5) V3< 0.0740043 7 2 B (0.2857143 0.7142857) * 3) V3>=0.319873 67 31 B (0.4626866 0.5373134) 6) V2>=0.2482707 54 26 A (0.5185185 0.4814815) 12) V3< 0.7424548 31 13 A (0.5806452 0.4193548) 24) V3>=0.3740281 23 8 A (0.6521739 0.3478261) * 25) V3< 0.3740281 8 3 B (0.3750000 0.6250000) * 13) V3>=0.7424548 23 10 B (0.4347826 0.5652174) 26) V3>=0.9291285 8 3 A (0.6250000 0.3750000) * 27) V3< 0.9291285 15 5 B (0.3333333 0.6666667) * 7) V2< 0.2482707 13 3 B (0.2307692 0.7692308) *> > print(t2.rpart)n= 100 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 100 49 A (0.5100000 0.4900000) 2) V3< 0.319873 33 13 A (0.6060606 0.3939394) 4) V3>=0.0740043 26 8 A (0.6923077 0.3076923) * 5) V3< 0.0740043 7 2 B (0.2857143 0.7142857) * 3) V3>=0.319873 67 31 B (0.4626866 0.5373134) 6) V2>=0.2482707 54 26 A (0.5185185 0.4814815) 12) V3< 0.7424548 31 13 A (0.5806452 0.4193548) 24) V3>=0.3740281 23 8 A (0.6521739 0.3478261) * 25) V3< 0.3740281 8 3 B (0.3750000 0.6250000) * 13) V3>=0.7424548 23 10 B (0.4347826 0.5652174) 26) V3>=0.9291285 8 3 A (0.6250000 0.3750000) * 27) V3< 0.9291285 15 5 B (0.3333333 0.6666667) * 7) V2< 0.2482707 13 3 B (0.2307692 0.7692308) *> > print(t3.rpart)n= 100 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 100 NaN A (NaN NaN NaN) *> >If I recode the factor back down to 2 levels it works again:> t4 <- t3 > t4$V1 <- factor(t4$V1, levels = c("A", "C")) > table(t4$V1)A C 51 49> > t4.rpart <- rpart(V1 ~ V2 + V3, data = t4, method = "class") > print(t4.rpart)n= 100 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 100 49 A (0.5100000 0.4900000) 2) V3< 0.319873 33 13 A (0.6060606 0.3939394) 4) V3>=0.0740043 26 8 A (0.6923077 0.3076923) * 5) V3< 0.0740043 7 2 C (0.2857143 0.7142857) * 3) V3>=0.319873 67 31 C (0.4626866 0.5373134) 6) V2>=0.2482707 54 26 A (0.5185185 0.4814815) 12) V3< 0.7424548 31 13 A (0.5806452 0.4193548) 24) V3>=0.3740281 23 8 A (0.6521739 0.3478261) * 25) V3< 0.3740281 8 3 C (0.3750000 0.6250000) * 13) V3>=0.7424548 23 10 C (0.4347826 0.5652174) 26) V3>=0.9291285 8 3 A (0.6250000 0.3750000) * 27) V3< 0.9291285 15 5 C (0.3333333 0.6666667) * 7) V2< 0.2482707 13 3 C (0.2307692 0.7692308) * And, just to make sure I get the same result if it's the first factor level missing (so it's not just a property of the middle level missing):> t5 <- t3 > t5$V1[t5$V1 == "A"] <- "B" > table(t5$V1)A B C 0 51 49> > > t5.rpart <- rpart(V1 ~ V2 + V3, data = t5, method = "class") > > print(t5.rpart)n= 100 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 100 NaN A (NaN NaN NaN) *>---------------------------------------------------------------------------- ----------------------------------- I tried to trace the problem, and took a look at the beginning of rpart.class: "rpart.class" <- function (y, offset, parms, wt) { if (!is.null(offset)) stop("No offset allowed in classification models") fy <- as.factor(y) y <- as.integer(fy) numclass <- max(y[!is.na(y)]) ... } I thought the problem might have to do with the fact that numclass would register 3 for my examples t3 and t5, but 2 for my examples t2 and t4. So I tried substituting numclass <- length(unique(y[!is.na(y)])) to get the number of factors present rather than the largest integer code used, but that didn't help. Then I tried running nnet using this data, and it worked fine on all the examples, with a warning about an empty level. So I lifted some code from nnet to make the beginning of rpart.class look like this: "rpart.class" <- function (y, offset, parms, wt) { if (!is.null(offset)) stop("No offset allowed in classification models") fy <- as.factor(y) ### here comes code from nnet lev <- levels(y) counts <- table(fy) if (any(counts == 0)) { warning(paste("group(s)", paste(lev[counts == 0], collapse = " "), "are empty")) fy <- factor(fy, levels = lev[counts > 0]) } ### done with code from nnet y <- as.integer(fy) numclass <- max(y[!is.na(y)]) ... } When I run this it works, but the factor gets recoded. For many applications that's fine -- It seems to work OK when I use (say) the predicted factor as a factor, but I'm worried about what could happen if I use as.integer(predict(rpart.obj)) eventually and compare it to something with the original encoding. Not good programming practice, but I wouldn't be surprised to find that I did it somewhere. So the ideal would be saving the coding and restoring it somewhere. But it looks to me like that would require saving this somewhere and restoring the coding after rpart (if I restore it in rpart.class, I should just run into the problems I had before). And then I'm not sure whether the core group wants to mess with rpart to this extent. Thanks for the terrific software. I hope this is helpful. Matthew Wiener Applied Computer Science and Mathematics Department Merck Research Labs Rahway, NJ 07065-0900 732-594-5303 --please do not edit the information below-- Version: platform = i386-pc-mingw32 arch = x86 os = Win32 system = x86, Win32 status = major = 1 minor = 4.1 year = 2002 month = 01 day = 30 language = R Windows NT 4.0 (build 1381) Service Pack 6 Search Path: .GlobalEnv, package:rpart, package:ctest, Autoloads, package:base --please do not edit the information below-- Version: platform = powerpc-apple-darwin5.2 arch = powerpc os = darwin5.2 system = powerpc, darwin5.2 status = major = 1 minor = 4.1 year = 2002 month = 01 day = 30 language = R Search Path: .GlobalEnv, package:rpart, package:ctest, Autoloads, package:base -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._