thr3ads.net - R help - [R] GLM: What is a good way for dealing with new factor levels in the test set? [Apr 2015]

thuksu

2015-Apr-29 22:05 UTC

[R] GLM: What is a good way for dealing with new factor levels in the test set?

My test set contains a few factor levels that never occur in my
training set. It's rare, but it happens.

What is a good way for dealing with this?

I don't want to throw away the entire row from the data frame, because there
is some valuable information in there.

Is there some way to say something like "use the weighted average
coefficient level for this factor"?
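
One possible reading of "weighted average coefficient" is sketched below, assuming R's default treatment contrasts: for a row with an unseen level, each dummy column is set to that level's share of the training rows, so the row receives the frequency-weighted average of the fitted level effects (the reference level counts as zero). The data and the names train, test, and fit are made up for illustration, and the averaging happens on the log-odds scale, which is an assumption rather than an established recipe.

```r
# Toy data; every name and number here is hypothetical.
set.seed(1)
train <- data.frame(
  mytypeid.f = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  amount     = runif(200, 0, 100)
)
train$yesno <- rbinom(200, 1, plogis(-1 + 0.02 * train$amount))
test <- data.frame(mytypeid.f = factor(c("a", "d")), amount = c(10, 50))

fit <- glm(yesno ~ mytypeid.f + amount, data = train, family = binomial)

# Levels the fitted model knows about, and which test rows fall outside them.
known  <- fit$xlevels$mytypeid.f
unseen <- !(test$mytypeid.f %in% known)

# Temporarily recode unseen rows to the reference level so that a design
# matrix can be built at all; their dummy columns are overwritten below.
tmp <- test
tmp$mytypeid.f <- factor(
  ifelse(unseen, known[1], as.character(test$mytypeid.f)),
  levels = known
)
X <- model.matrix(delete.response(terms(fit)), data = tmp,
                  contrasts.arg = fit$contrasts)

# Columns of the design matrix that belong to the factor.
term_id <- match("mytypeid.f", attr(terms(fit), "term.labels"))
cols    <- which(attr(X, "assign") == term_id)

# Training share of each non-reference level (the mean of each dummy column).
avg <- colMeans(model.matrix(fit)[, cols, drop = FALSE])

# Unseen rows get the training shares in place of 0/1 dummies.
X[unseen, cols] <- matrix(avg, nrow = sum(unseen), ncol = length(cols),
                          byrow = TRUE)

eta <- drop(X %*% coef(fit))  # linear predictor (log-odds)
p   <- plogis(eta)            # predicted probabilities
p
```

Rows with known levels get the same prediction as predict() would give; only the rows with unseen levels are affected.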



--
View this message in context:
http://r.789695.n4.nabble.com/GLM-What-is-a-good-way-for-dealing-with-new-factor-levels-in-the-test-set-tp4706621.html
Sent from the R help mailing list archive at Nabble.com.

Jim Lemon

2015-Apr-30 06:54 UTC

Hi thuksu,
Would defining the factor in your training set with all the levels
that occur in the test set solve the problem? That is, the training
factor would declare every such level, even the ones for which it has
no observations.

Jim



thuksu

2015-Apr-30 15:02 UTC

Hi, thanks for the reply!

I did try this...

# res is a data frame
levels(res$mytypeid.f) <- c(levels(res$mytypeid.f), "mynewlevel")
logreg <- glm(yesno ~ mytypeid.f + amount, data = res, family = "binomial")
exp(coef(logreg))
# This result shows that the new level is not included in the regression;
# it's probably removed automatically.
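
For anyone who wants to reproduce this, here is a minimal self-contained sketch with made-up data standing in for res (the level names t1, t2, t3 and the random response are assumptions). glm() builds its model frame with drop.unused.levels = TRUE, so a level with no rows never reaches the fit, and predict() then rejects it as a new level.

```r
# Toy stand-in for the poster's `res`; the data are hypothetical.
set.seed(42)
res <- data.frame(
  mytypeid.f = factor(sample(c("t1", "t2", "t3"), 100, replace = TRUE)),
  amount     = runif(100, 1, 100)
)
res$yesno <- rbinom(100, 1, 0.5)

# Declare an extra level that has no rows in the training data.
levels(res$mytypeid.f) <- c(levels(res$mytypeid.f), "mynewlevel")

logreg <- glm(yesno ~ mytypeid.f + amount, data = res, family = "binomial")

# The empty level has no coefficient and is not among the stored levels.
names(coef(logreg))
logreg$xlevels$mytypeid.f

# Predicting for a row that uses the empty level fails with an error;
# tryCatch() captures the message instead of stopping the script.
newrow <- data.frame(
  mytypeid.f = factor("mynewlevel", levels = levels(res$mytypeid.f)),
  amount     = 50
)
err <- tryCatch(
  predict(logreg, newdata = newrow, type = "response"),
  error = function(e) conditionMessage(e)
)
err
```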


I think what I want to do is identify levels in the test set that are not
in the training set, and prune those from the test set.  Then I would be
using the dummy variable by default, which I think is the "average", from
reading this:
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm

Problem is, I'm not sure how to do that...
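
A minimal sketch of the identification step, using made-up train and test data frames (both hypothetical): the levels the fitted model knows are stored in logreg$xlevels, so setdiff() against the test-set levels gives the new ones. Two ways of handling them are shown, dropping the rows or recoding the level to NA so the row is kept and its prediction comes back as NA. One caveat: under R's default treatment contrasts, a row with all dummy columns at zero corresponds to the reference (first) level, not to an average.

```r
# Toy data; all names and values are hypothetical.
set.seed(7)
train <- data.frame(
  mytypeid.f = factor(sample(c("t1", "t2", "t3"), 100, replace = TRUE)),
  amount     = runif(100, 1, 100)
)
train$yesno <- rbinom(100, 1, 0.5)
test <- data.frame(
  mytypeid.f = factor(c("t1", "t4", "t3", "t5")),
  amount     = c(10, 20, 30, 40)
)

logreg <- glm(yesno ~ mytypeid.f + amount, data = train, family = "binomial")

# Levels seen during fitting versus levels present in the test set.
known  <- logreg$xlevels$mytypeid.f
unseen <- setdiff(levels(droplevels(test$mytypeid.f)), known)
unseen

# Option 1: drop the rows that carry an unseen level.
is_new      <- test$mytypeid.f %in% unseen
test_pruned <- droplevels(test[!is_new, ])
p_pruned    <- predict(logreg, newdata = test_pruned, type = "response")

# Option 2: keep every row but recode unseen levels to NA;
# predict() then returns NA for those rows instead of failing.
test_na            <- test
test_na$mytypeid.f <- factor(test$mytypeid.f, levels = known)
p_na               <- predict(logreg, newdata = test_na, type = "response")
p_na
```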



