Laura M Marx
2005-Jul-19 23:53 UTC
[R] Regression lines for differently-sized groups on the same plot
Hi there, I've looked through the very helpful advice about adding fitted lines to plots in the r-help archive, and can't find a post where someone has offered a solution for my specific problem. I need to plot logistic regression fits from three differently-sized data subsets on a plot of the entire dataset. A description and code are below: I have an unbalanced dataset consisting of three different species (hem, yb, and sm), with unequal numbers of wood pieces in each species group. I am trying to generate a plot that will show the size of the wood piece on the X axis, the probability of it having tree seedlings growing on it on the Y (a binomial yes or no variable), and three fitted curves showing how the probability of having tree seedlings changes with increasing wood piece size for each species. I have no problem generating fits using GLM, and no problem creating the plot. However, if I try to add a fitted curve based only on the hem data subset to a plot that shows the entire dataset, I get an error message that the lengths of those data sets differ. "Error in xy.coords(x,y) : x and y lengths differ". I could see R's point -- you can't plot a regression line of babies born as a function of stork abundance on a graph of cherries produced (Y) versus rainfall (X), which for all the program knows, I'm trying to do. As a temporary fix, I added NAs to the end of the hem, yb, and sm subsets to make them the same length as the entire dataset. I can now add my fitted curves to the plot, but the lines are not connected. That is, if the hem group only contains wood pieces that are 1, 4, and 10 meters long, the plot has an X axis that ranges from 1 to 10, but line segments for the hem group regression line only appear above 1, 4, and 10. How can I fix this? An ideal solution would not require me to make the hem subset of my data the same length as the full dataset, either (although the summaries of regressions with the NAs (or zeroes) added and taken away are identical). I'd also settle for a work-around that would have R connect the pieces of the curve so that I get a solid line rather than small dots and dashes where actual data exist. Thanks so much for your help! Laura Marx Michigan State University, Dept. of Forestry #Note: hemdata has all the rows that are not hemlock species replaced with #"NA"s. hemhem=glm(hempresence~logarea, family=binomial(logit), data=hemdata) hemyb=glm(hempresence~logarea, family=binomial(logit), data=birchdata) hemsm=glm(hempresence~logarea, family=binomial(logit), data=mapledata) attach(logreg) #logreg is the full dataset plot(logarea, hempresence, xlab = "Surface area of log (m2)", ylab="Probability of hemlock seedling presence", type="n", font.lab=2, cex.lab=1.5, axes=TRUE) lines(logarea,fitted(hemhem), lty=1, lwd=2) lines(logarea,fitted(hemyb), lty="dashed", lwd=2) lines(logarea,fitted(hemsm), lty="dotted", lwd=2)
Sundar Dorai-Raj
2005-Jul-20 02:56 UTC
[R] Regression lines for differently-sized groups on the same plot
Laura M Marx wrote:> Hi there, > I've looked through the very helpful advice about adding fitted lines to > plots in the r-help archive, and can't find a post where someone has offered > a solution for my specific problem. I need to plot logistic regression fits > from three differently-sized data subsets on a plot of the entire dataset. > A description and code are below: > I have an unbalanced dataset consisting of three different species (hem, > yb, and sm), with unequal numbers of wood pieces in each species group. I > am trying to generate a plot that will show the size of the wood piece on > the X axis, the probability of it having tree seedlings growing on it on the > Y (a binomial yes or no variable), and three fitted curves showing how the > probability of having tree seedlings changes with increasing wood piece size > for each species. > I have no problem generating fits using GLM, and no problem creating the > plot. However, if I try to add a fitted curve based only on the hem data > subset to a plot that shows the entire dataset, I get an error message that > the lengths of those data sets differ. "Error in xy.coords(x,y) : x and y > lengths differ". I could see R's point -- you can't plot a regression line > of babies born as a function of stork abundance on a graph of cherries > produced (Y) versus rainfall (X), which for all the program knows, I'm > trying to do. As a temporary fix, I added NAs to the end of the hem, yb, > and sm subsets to make them the same length as the entire dataset. I can > now add my fitted curves to the plot, but the lines are not connected. That > is, if the hem group only contains wood pieces that are 1, 4, and 10 meters > long, the plot has an X axis that ranges from 1 to 10, but line segments for > the hem group regression line only appear above 1, 4, and 10. How can I fix > this? An ideal solution would not require me to make the hem subset of my > data the same length as the full dataset, either (although the summaries of > regressions with the NAs (or zeroes) added and taken away are identical). > I'd also settle for a work-around that would have R connect the pieces of > the curve so that I get a solid line rather than small dots and dashes where > actual data exist. Thanks so much for your help! > Laura Marx > Michigan State University, Dept. of Forestry > > #Note: hemdata has all the rows that are not hemlock species replaced with > #"NA"s. > hemhem=glm(hempresence~logarea, family=binomial(logit), data=hemdata) > hemyb=glm(hempresence~logarea, family=binomial(logit), data=birchdata) > hemsm=glm(hempresence~logarea, family=binomial(logit), data=mapledata) > > attach(logreg) #logreg is the full dataset > plot(logarea, hempresence, xlab = "Surface area of log (m2)", > ylab="Probability of hemlock seedling presence", type="n", font.lab=2, > cex.lab=1.5, axes=TRUE) > lines(logarea,fitted(hemhem), lty=1, lwd=2) > lines(logarea,fitted(hemyb), lty="dashed", lwd=2) > lines(logarea,fitted(hemsm), lty="dotted", lwd=2) >Hi, Laura, Would ?predict.glm be better? plot(logarea, hempresence, xlab = "Surface area of log (m2)", ylab="Probability of hemlock seedling presence", type="n", font.lab=2, cex.lab=1.5, axes=TRUE) lines(logarea, predict(hemhem, logreg, "response"), lty=1, lwd=2) lines(logarea, predict(hemyb, logreg, "response"), lty="dashed", lwd=2) lines(logarea, predict(hemsm, logreg, "response"), lty="dotted", lwd=2) Without seeing more description of your data, this is still a guess. --sundar
Laura M Marx
2005-Jul-20 17:58 UTC
[R] Regression lines for differently-sized groups on the same plot
Sundar Dorai-Raj writes:> Hi, Laura, > > Would ?predict.glm be better? > > plot(logarea, hempresence, > xlab = "Surface area of log (m2)", > ylab="Probability of hemlock seedling presence", > type="n", font.lab=2, cex.lab=1.5, axes=TRUE) > lines(logarea, predict(hemhem, logreg, "response"), lty=1, lwd=2) > lines(logarea, predict(hemyb, logreg, "response"), lty="dashed", lwd=2) > lines(logarea, predict(hemsm, logreg, "response"), lty="dotted", lwd=2) > > Without seeing more description of your data, this is still a guess. > > --sundar >YES! Thank you. That solves all of the problems I was having. Thanks also to Tom Mulholland and Bill Venables for their replies. Laura The final code is now: logreg <- read.table("C:/Documents and Settings/Laura/Desktop/logreg.txt", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE) #This is the full dataset. A hemlock row followed by a birch row might look #like: #Hemybsm logarea hempresence #1 0.054 0 #2 1.370 1 hemhem=glm(hempresence~logarea, family=binomial(logit), subset=Hemybsm<2) #This pulls out only the hemlocks from the full dataset and calculates the #regression. hemyb=glm(hempresence~logarea, family=binomial(logit), subset=Hemybsm==2) #Same code, with only the birches from the full dataset. hemsm=glm(hempresence~logarea, family=binomial(logit), subset=Hemybsm>2) plot(logarea, hempresence, xlab = "Surface area of log (m2)", ylab="Probability of hemlock seedling presence", type="n", font.lab=2, cex.lab=1.5, axes=TRUE) lines(logarea, predict(hemhem, logreg, "response"), lty=1, lwd=2) lines(logarea, predict(hemyb, logreg, "response"), lty="dashed", lwd=2) lines(logarea, predict(hemsm, logreg, "response"), lty="dotted", lwd=2)