Paul Johnson
2005-Nov-17 16:47 UTC
[R] Predicting and Plotting "hypothetical" values of factors
Last Friday, I noticed that it is difficult to work with regression models in which there are factors. It is easier to do the old fashioned thing of coding up "dummy" variables with 0-1 values. The predict function's newdata argument is not suited to insertion of hypothetical values for the factor, whereas it has no trouble with numeric variables. For example, if one uses a factor as a predictor in a logistic regression, then it is tough to plot the S-shaped curve that describes the probabilities. Here is some code with comments describing the difficulties that we have found. Are there simpler, less frustrating approaches? ----------------------------- # Paul Johnson <pauljohn at ku.edu> 2005-11-17 # factorFutzing-1.R myfac <- factor(c(1.1, 4.5, 1.1, 1.1, 4.5, 1.1, 4.5, 1.1)) y <- c(0,1,0,1,0,0,1,0) mymod1 <-glm (y~myfac, family=binomial) p1 <- predict(mymod1, type="response") plot(myfac,p1) # Wait, that's not the picture I wanted. I want to see the # proverbial S-shaped curve! # # The contrast variable that R creates is coded 0 and 1. # I'd like to have a sequence from 0 to 1 (seq(0,1,length.out=10)) and # get predicted values for each. That would give the S-shaped curve. # But the use of predict does not work because the factor has 2 # values and the predict function won't take my new data: predict(mymod1, newdata=data.frame(myfac=seq(0,1,length.out=8))) # Error: variable 'myfac' was fitted with class "factor" but class "numeric" was supplied # In addition: Warning message: # variable 'myfac' is not a factor in: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) # Isn't there an easier way than this? c1 <- coef(mymod1) myseq <- seq(0,1, length.out=10) newdf <- data.frame(1, myseq) linearPredictor <- as.matrix(newdf) %*% c1 p2 <- 1/ (1 + exp(-linearPredictor)) # But there is a big disaster if you just try the obvious thing # of plotting with # lines(myseq,p2) # The line does not show up where you hope in the plot. # The plot "appears to" have the x-axis on the scale # 1.1 to 4.5, So in order to see the s-shaped curve, it appears we have # to re-scale. However, this is a big illusion. To see what I mean, # do points(2,.4) # you expected to see the point appear at (2,.4), but in the plot # it appears at (4.5,.4). Huh? # The actual values being plotted are the integer-valued levels that # R uses for factors, the numbers you get from as.numeric(myfac). # So it is only necessary to re-scale the sequence by adding one. myseq2 <- 1 + myseq lines( myseq2, p2, type="l" ) #Its not all that S-shaped, but you have to take what you can get! :) ----------------------------- -- Paul E. Johnson email: pauljohn at ku.edu Dept. of Political Science http://lark.cc.ku.edu/~pauljohn 1541 Lilac Lane, Rm 504 University of Kansas Office: (785) 864-9086 Lawrence, Kansas 66044-3177 FAX: (785) 864-5700
Peter Dalgaard
2005-Nov-17 17:00 UTC
[R] Predicting and Plotting "hypothetical" values of factors
Paul Johnson <pauljohn at ku.edu> writes:> Last Friday, I noticed that it is difficult to work with regression > models in which there are factors. It is easier to do the old fashioned > thing of coding up "dummy" variables with 0-1 values. The predict > function's newdata argument is not suited to insertion of hypothetical > values for the factor, whereas it has no trouble with numeric variables. > For example, if one uses a factor as a predictor in a logistic > regression, then it is tough to plot the S-shaped curve that describes > the probabilities. > > Here is some code with comments describing the difficulties that we have > found. Are there simpler, less frustrating approaches?I think the point is that if you think a factor can take intermediate values in between the groups (as in sex==1.245), then it is really a numeric variable and should be treated as such in the model, whether or not it has only two distinct values in your data set. It is not obvious to me what the "S shaped curve" means in a model which really only specifies two probabilities. -p> ----------------------------- > # Paul Johnson <pauljohn at ku.edu> 2005-11-17 > # factorFutzing-1.R > > > myfac <- factor(c(1.1, 4.5, 1.1, 1.1, 4.5, 1.1, 4.5, 1.1)) > > y <- c(0,1,0,1,0,0,1,0) > > mymod1 <-glm (y~myfac, family=binomial) > > p1 <- predict(mymod1, type="response") > > plot(myfac,p1) > > # Wait, that's not the picture I wanted. I want to see the > # proverbial S-shaped curve! > # > > # The contrast variable that R creates is coded 0 and 1. > # I'd like to have a sequence from 0 to 1 (seq(0,1,length.out=10)) and > # get predicted values for each. That would give the S-shaped curve. > > > # But the use of predict does not work because the factor has 2 > # values and the predict function won't take my new data: > > predict(mymod1, newdata=data.frame(myfac=seq(0,1,length.out=8))) > > # Error: variable 'myfac' was fitted with class "factor" but class > "numeric" was supplied > # In addition: Warning message: > # variable 'myfac' is not a factor in: model.frame.default(Terms, > newdata, na.action = na.action, xlev = object$xlevels) > > # Isn't there an easier way than this? > > c1 <- coef(mymod1) > > myseq <- seq(0,1, length.out=10) > > newdf <- data.frame(1, myseq) > > linearPredictor <- as.matrix(newdf) %*% c1 > > p2 <- 1/ (1 + exp(-linearPredictor)) > # But there is a big disaster if you just try the obvious thing > # of plotting with > # lines(myseq,p2) > # The line does not show up where you hope in the plot. > # The plot "appears to" have the x-axis on the scale > # 1.1 to 4.5, So in order to see the s-shaped curve, it appears we have > # to re-scale. However, this is a big illusion. To see what I mean, > # do > > points(2,.4) > > # you expected to see the point appear at (2,.4), but in the plot > # it appears at (4.5,.4). Huh? > > # The actual values being plotted are the integer-valued levels that > # R uses for factors, the numbers you get from as.numeric(myfac). > # So it is only necessary to re-scale the sequence by adding one. > > myseq2 <- 1 + myseq > > lines( myseq2, p2, type="l" ) > > #Its not all that S-shaped, but you have to take what you can get! :) > ----------------------------- > > -- > Paul E. Johnson email: pauljohn at ku.edu > Dept. of Political Science http://lark.cc.ku.edu/~pauljohn > 1541 Lilac Lane, Rm 504 > University of Kansas Office: (785) 864-9086 > Lawrence, Kansas 66044-3177 FAX: (785) 864-5700 > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >-- O__ ---- Peter Dalgaard ??ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907