HelponR
2015-Jan-13 16:59 UTC
[R] any r package can handle factor levels not in the test set
Thanks for your reply. But I cannot control the data. I am dealing with real-world stream data. It is very normal for the test data (the data you apply the model to for prediction) to have new values that were not seen in the training data.

If I coded this myself, I would give a random guess or just the intercept in such a situation. But it seems most R packages return an error and exit.

On Mon, Jan 12, 2015 at 6:08 PM, Richard M. Heiberger <rmh at temple.edu> wrote:
> You need to define the levels of the training set to include all
> levels that you might see.
> Something like this
>
> > A <- factor(letters[1:5])
> > B <- factor(letters[c(1,3,5,7,9)])
> > A
> [1] a b c d e
> Levels: a b c d e
> > B
> [1] a c e g i
> Levels: a c e g i
> > training <- factor(A, levels=unique(c(levels(A), levels(B))))
> > training
> [1] a b c d e
> Levels: a b c d e g i
>
> In the future please "provide commented, minimal, self-contained,
> reproducible code."
>
> On Mon, Jan 12, 2015 at 9:00 PM, HelponR <suncertain at gmail.com> wrote:
> > It looks like gbm and glm both have this issue.
> >
> > I wonder if any R package is immune to this?
> >
> > In reality, it is very normal that test data has values unseen in the
> > training data. It looks like I have to give up R?
> >
> > Thanks!
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
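The behaviour described in this message can be reproduced with a few lines. What follows is only a minimal sketch: the data frames `train` and `test`, the column names `f` and `y`, and the level "g" are made up for illustration and do not come from the poster's data. It shows the error raised by predict() on a glm fit and one way to list the unseen levels beforehand, using the `xlevels` component stored in the fitted model.

```r
# Toy data, made up for illustration only.
train <- data.frame(
  f = factor(rep(c("a", "b", "c"), each = 4)),
  y = c(0, 1, 0, 1,  1, 1, 0, 1,  0, 0, 1, 0)
)
test <- data.frame(f = factor(c("a", "g")))  # "g" never occurs in train

fit <- glm(y ~ f, data = train, family = binomial)

# predict() stops with an error because of the unseen level;
# capture the error message instead of aborting.
res <- tryCatch(
  predict(fit, newdata = test, type = "response"),
  error = function(e) conditionMessage(e)
)
print(res)

# Levels present in the new data but unknown to the fitted model.
unseen <- setdiff(levels(test$f), fit$xlevels$f)
print(unseen)
```

Checking `unseen` before calling predict() lets the calling code decide what to do with such rows (skip them, flag them, or apply whatever fallback is considered statistically defensible) instead of having the whole prediction step abort.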
Bert Gunter
2015-Jan-13 17:22 UTC
[R] any r package can handle factor levels not in the test set
Folks:

I believe this discussion would be better moved to a statistical discussion forum, like stats.stackexchange.com, as it appears to be all about statistical issues, not R. I do not understand how you can possibly expect to predict behavior in new categories for which you have no prior information, but perhaps I do not understand, or there are appropriate ways to do this in your subject matter area that discussion on a statistical forum would uncover.

If you find any, you might then come back to R (see CRAN's task views: http://cran.r-project.org/web/views/ or simply search using a search engine) to see whether/how such methodology is implemented in R.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
Clifford Stoll
William Dunlap
2015-Jan-13 18:18 UTC
[R] any r package can handle factor levels not in the test set
I think it would be nice if predict methods returned NA in appropriate spots instead of aborting when a categorical predictor contains levels not found in the training set. It should not be that hard to implement, as the 'xlevels' component of the model is already being used to put factor levels into the order found in the training set.

Until that happens you can do this by hand as in the following example:

> training <- data.frame(Cat=rep(c("One","Two","Three"),3), Dog = 1:9, Response=100+2^(1:9))
> newdata <- data.frame(Cat=c("Two","Apocalypse","Three"), Dog=2)
> model <- lm(data=training, Response ~ Cat + log(Dog))
> predict(model, newdata=newdata)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor Cat has new levels Apocalypse
> predict(model, newdata=newdata[-2,])
        1         3
 85.50099 148.56609
> # Use model$xlevels to replace unknown levels with NA's
> newdata$Cat <- factor(newdata$Cat, levels=model$xlevels$Cat)
> predict(model, newdata=newdata)
        1         2         3
 85.50099        NA 148.56609

(I don't think that predict.lm should be trying anything fancy to give a non-NA value at the Apocalypse. That would be the job for another model fitting function, like rpart.)

Bill Dunlap
TIBCO Software
wdunlap tibco.com
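The by-hand recoding shown in this message can be wrapped in a small helper so that it is applied to every factor predictor recorded in the model. This is only a sketch: `predict_na_unseen` is a hypothetical name, not a function from base R or any package, and it assumes the factor predictors appear in the formula as plain column names, so that the names of `model$xlevels` match columns of `newdata`.

```r
# Hypothetical helper (not part of base R): recode each factor predictor in
# newdata to the levels stored in the fitted model, so that values unseen
# in training become NA before predict() is called.
predict_na_unseen <- function(model, newdata, ...) {
  for (v in intersect(names(model$xlevels), names(newdata))) {
    newdata[[v]] <- factor(newdata[[v]], levels = model$xlevels[[v]])
  }
  predict(model, newdata = newdata, ...)
}

# Same data as in the example above.
training <- data.frame(Cat = rep(c("One", "Two", "Three"), 3),
                       Dog = 1:9,
                       Response = 100 + 2^(1:9))
newdata <- data.frame(Cat = c("Two", "Apocalypse", "Three"), Dog = 2)
model <- lm(Response ~ Cat + log(Dog), data = training)

# Row 2 has the unseen level "Apocalypse" and gets NA instead of an error.
pred <- predict_na_unseen(model, newdata)
print(pred)
```

The helper only turns unseen levels into NA. What to report for those rows (leave them missing, or substitute some fallback value) is a modelling decision that it deliberately leaves to the caller.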