Charlotte Whitham
2014-Nov-25 16:21 UTC
[R] Checking the proportional odds assumption holds in an ordinal logistic regression using polr function
Dear list,

I have used the 'polr' function in the MASS package to run an ordinal logistic regression for an ordinal categorical response variable with 15 continuous explanatory variables. I have used the code shown below to check that my model meets the proportional odds assumption, following the advice provided at http://www.ats.ucla.edu/stat/r/dae/ologit.htm, which has been extremely helpful - thank you to the authors! However, I'm a little worried that the output implies the coefficients across the various cutpoints are not merely similar but exactly the same (see the attached graphic).

Here is the code I used (and see attached for the output graphic):

library(MASS)   ## for polr()
library(Hmisc)  ## the summary() formula method with fun= used below is from Hmisc

FGV1b <- data.frame(FG1_val_cat = factor(FGV1b[, "FG1_val_cat"]),
                    scale(FGV1[, c("X", "Y", "Slope", "Ele", "Aspect",
                                   "Prox_to_for_FG", "Prox_to_for_mL",
                                   "Prox_to_nat_border", "Prox_to_village",
                                   "Prox_to_roads", "Prox_to_rivers",
                                   "Prox_to_waterFG", "Prox_to_watermL",
                                   "Prox_to_core", "Prox_to_NR",
                                   "PCA1", "PCA2", "PCA3")]))

b <- polr(FG1_val_cat ~ X + Y + Slope + Ele + Aspect + Prox_to_for_FG +
            Prox_to_for_mL + Prox_to_nat_border + Prox_to_village +
            Prox_to_roads + Prox_to_rivers + Prox_to_waterFG +
            Prox_to_watermL + Prox_to_core + Prox_to_NR,
          data = FGV1b, Hess = TRUE)

#Checking the assumption. The following code estimates the values to be graphed.
#First it shows us the logit transformations of the probabilities of being greater
#than or equal to each value of the target variable.

FGV1b$FG1_val_cat <- as.numeric(FGV1b$FG1_val_cat)

sf <- function(y) {
  c('VC>=1' = qlogis(mean(FGV1b$FG1_val_cat >= 1)),
    'VC>=2' = qlogis(mean(FGV1b$FG1_val_cat >= 2)),
    'VC>=3' = qlogis(mean(FGV1b$FG1_val_cat >= 3)),
    'VC>=4' = qlogis(mean(FGV1b$FG1_val_cat >= 4)),
    'VC>=5' = qlogis(mean(FGV1b$FG1_val_cat >= 5)),
    'VC>=6' = qlogis(mean(FGV1b$FG1_val_cat >= 6)),
    'VC>=7' = qlogis(mean(FGV1b$FG1_val_cat >= 7)),
    'VC>=8' = qlogis(mean(FGV1b$FG1_val_cat >= 8)))
}

(t <- with(FGV1b, summary(as.numeric(FG1_val_cat) ~ X + Y + Slope + Ele + Aspect +
                            Prox_to_for_FG + Prox_to_for_mL + Prox_to_nat_border +
                            Prox_to_village + Prox_to_roads + Prox_to_rivers +
                            Prox_to_waterFG + Prox_to_watermL + Prox_to_core +
                            Prox_to_NR, fun = sf)))

#The table displays the (linear) predicted values we would get if we regressed our
#dependent variable on our predictor variables one at a time, without the parallel
#slopes assumption. So now, we can run a series of binary logistic regressions with
#varying cutpoints on the dependent variable to check the equality of coefficients
#across cutpoints.

par(mfrow = c(1, 1))
plot(t, which = 1:8, pch = 1:8, xlab = 'logit', main = ' ', xlim = range(t[, 7:8]))
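(One thing I notice on re-reading the code: my 'sf' function never actually uses its argument 'y' - every entry is computed from the full FGV1b$FG1_val_cat, so every row of the table, and hence every symbol in the plot, would come out identical by construction. If I have read the UCLA page correctly, their version of the function uses 'y' itself, something like the sketch below - perhaps that is the source of the identical coefficients?)

sf <- function(y) {
  ## use the argument supplied by summary(), not the global column
  c('VC>=1' = qlogis(mean(y >= 1)),
    'VC>=2' = qlogis(mean(y >= 2)),
    'VC>=3' = qlogis(mean(y >= 3)),
    'VC>=4' = qlogis(mean(y >= 4)),
    'VC>=5' = qlogis(mean(y >= 5)),
    'VC>=6' = qlogis(mean(y >= 6)),
    'VC>=7' = qlogis(mean(y >= 7)),
    'VC>=8' = qlogis(mean(y >= 8)))
}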
Apologies that I am no statistics expert and perhaps I am missing something obvious here. However, I have spent a long time trying to figure out whether there is a problem in how I tested the model assumption, and also trying to find other ways to run the same kind of model.

For example, I read in many help mailing lists that others use the vglm function (in the VGAM package) and the lrm function (in the rms package) (for example, see here: http://stats.stackexchange.com/questions/25988/proportional-odds-assumption-in-ordinal-logistic-regression-in-r-with-the-packag). I have tried to run the same models but keep coming up against warnings and errors. For example, when I try to fit the vglm model with the 'parallel=FALSE' argument (which the previous link says is important for testing the proportional odds assumption), I encounter the following error:

Error in lm.fit(X.vlm, y = z.vlm, ...) : NA/NaN/Inf in 'y'
In addition: Warning message:
In Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals = residuals, :
  fitted values close to 0 or 1

And after many searches for help, I can't seem to find a way to fix this problem.

I would like to ask whether there is anyone who might understand, and be able to explain to me, why the graph I produced above looks as it does. If it means that something isn't right, could you please help me find a way to test the proportional odds assumption using just the polr function? If that is not possible, I will resort to the vglm function, but would then need some help to understand why I keep getting the error above.

I hope this is clear. Please do let me know if I should provide more information that would help address this query.

NOTE: As background, there are 1000 datapoints here, which are location points across a study area. I am looking for relationships between the categorical response variable and the 15 explanatory variables, all of which are spatial characteristics (for example, elevation, x-y coordinates, proximity to forest, etc.). The 1000 datapoints were randomly allocated using a GIS, but with a stratified sampling approach: 125 points were randomly chosen within each of the 8 categorical response levels. I hope this information is also helpful.

I am extremely grateful to anyone who could give me some guidance with this.

Thank you very much for your time,

Charlie
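P.S. For anyone trying to reproduce the vglm problem, this is the general shape of what I ran (a sketch from memory, with placeholder names 'dat', 'y', 'x1', 'x2', not my exact call):

library(VGAM)

## Proportional-odds model: one common slope per predictor
fit_po <- vglm(y ~ x1 + x2, family = cumulative(parallel = TRUE), data = dat)

## Non-proportional model: a separate slope at each of the 7 cutpoints.
## With 8 response levels and 15 predictors this estimates 7 x 15 slopes,
## and it is this fit that fails for me with 'NA/NaN/Inf in y' plus the
## warning about fitted values close to 0 or 1.
fit_npo <- vglm(y ~ x1 + x2, family = cumulative(parallel = FALSE), data = dat)

## If both models fit, the proportional odds assumption can be assessed
## with a likelihood-ratio test:
lrtest(fit_npo, fit_po)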
Rune Haubo
2014-Nov-26 14:08 UTC
[R] Checking the proportional odds assumption holds in an ordinal logistic regression using polr function
Dear Charlie,

I admit that I haven't read your email closely, but here is a way to test for non-proportional odds using the ordinal package (warning: self-promotion), illustrated with the wine data set, also from the ordinal package. There is more information in the package vignettes.

> library(ordinal)
> ## Fit model:
> fm <- clm(rating ~ temp + contact, data=wine)
> summary(fm)
formula: rating ~ temp + contact
data:    wine

 link  threshold nobs logLik AIC    niter max.grad cond.H
 logit flexible  72   -86.49 184.98 6(0)  4.64e-15 2.7e+01

Coefficients:
           Estimate Std. Error z value Pr(>|z|)
tempwarm     2.5031     0.5287   4.735 2.19e-06 ***
contactyes   1.5278     0.4766   3.205  0.00135 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.3444     0.5171  -2.600
2|3   1.2508     0.4379   2.857
3|4   3.4669     0.5978   5.800
4|5   5.0064     0.7309   6.850
> ## Model with non-proportional odds for contact:
> fm2 <- clm(rating ~ temp, nominal=~contact, data=wine)
> ## Likelihood ratio test of non-proportional odds:
> anova(fm, fm2)
Likelihood ratio tests of cumulative link models:

    formula:                 nominal: link:  threshold:
fm  rating ~ temp + contact  ~1       logit  flexible
fm2 rating ~ temp            ~contact logit  flexible

    no.par    AIC  logLik LR.stat df Pr(>Chisq)
fm       6 184.98 -86.492
fm2      9 190.42 -86.209  0.5667  3      0.904
> ## Automatic tests of non-proportional odds for all variables:
> nominal_test(fm)
Tests of nominal effects

formula: rating ~ temp + contact
        Df  logLik    AIC    LRT Pr(>Chi)
<none>     -86.492 184.98
temp     3 -84.904 187.81 3.1750   0.3654
contact  3 -86.209 190.42 0.5667   0.9040

Hope this is something you can use.
Cheers,
Rune
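P.S. Applied to the model in your post, the same tests could look roughly like this (an untested sketch using a subset of your variable names; note that clm() needs the response as a factor):

library(ordinal)

## Proportional-odds fit:
fm  <- clm(FG1_val_cat ~ X + Y + Slope + Ele + Aspect, data = FGV1b)

## Relax proportional odds for one suspect predictor, e.g. Slope:
fm2 <- clm(FG1_val_cat ~ X + Y + Ele + Aspect, nominal = ~ Slope, data = FGV1b)

## Likelihood ratio test of non-proportional odds for Slope:
anova(fm, fm2)

## Or screen all predictors in one go:
nominal_test(fm)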
On 25 November 2014 at 17:21, Charlotte Whitham <charlotte.whitham at gmail.com> wrote:
> [quoted message snipped]
Charlotte Whitham
2014-Nov-26 16:55 UTC
[R] Checking the proportional odds assumption holds in an ordinal logistic regression using polr function
Dear Rune,

Thank you for your prompt reply - it looks like the ordinal package could be the answer I was looking for! If you don't mind, I'd also like to ask what to do if the tests show the proportional odds assumption is NOT met. (Unfortunately, I see effects from almost all variables that breach the proportional odds assumption in my dataset.) Would you recommend a multinomial logistic model? Or re-scaling of the data?
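For example, would a partial proportional odds model be sensible - keeping the ordinal structure but moving the violating predictors into the nominal part? A rough sketch of what I mean (untested; 'Slope' and 'Ele' just stand in for whichever of my predictors fail the test):

library(ordinal)

## Keep proportional odds for most terms, relax it for the offenders:
fm_ppo <- clm(FG1_val_cat ~ X + Y + Aspect, nominal = ~ Slope + Ele,
              data = FGV1b)

## Or drop the ordering altogether and fit a multinomial model:
library(nnet)
fm_mn <- multinom(FG1_val_cat ~ X + Y + Slope + Ele + Aspect, data = FGV1b)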
Thank you for your time,

Best wishes,
Charlie

On 26 Nov 2014, at 14:08, Rune Haubo <rune.haubo at gmail.com> wrote:
> [quoted message snipped]