Hi,

I'm just learning about Poisson links for the glm function.

One of the data sets I'm playing with has several of the variables as factors (e.g. month, group, etc.)

When I call the glm function with a formula that has a factor variable, R automatically converts the variable to a series of variables with unique names and binary values.

For example, with this pseudo data:

    y    v1   month
    2    1    january
    3    1.4  februrary
    1.5  6.3  february
    1.2  4.5  january
    5.5  4.0  march

I use this call:

    m <- glm(y ~ v1 + month, family="poisson")

R gives me back a model with variables of

    Intercept
    v1
    monthJanuary
    monthFebruary
    monthMarch

I'm concerned that this might be doing some strange things to my model.

Can anyone offer some enlightenment?

Thanks!
On 02-Mar-10 08:02:27, Noah Silverman wrote:
> Hi,
> I'm just learning about Poisson links for the glm function.
>
> One of the data sets I'm playing with has several of the
> variables as factors (e.g. month, group, etc.)
>
> When I call the glm function with a formula that has a factor
> variable, R automatically converts the variable to a series of
> variables with unique names and binary values.
>
> For example, with this pseudo data:
>
> y    v1   month
> 2    1    january
> 3    1.4  februrary
> 1.5  6.3  february
> 1.2  4.5  january
> 5.5  4.0  march
>
> I use this call:
>
> m <- glm(y ~ v1 + month, family="poisson")
>
> R gives me back a model with variables of
> Intercept
> v1
> monthJanuary
> monthFebruary
> monthMarch
>
> I'm concerned that this might be doing some strange things
> to my model.
> Can anyone offer some enlightenment?
> Thanks!

The creation of auxiliary variables is the way to incorporate a factor variable into a model. These are usually called "dummy variables", and are essentially indicator variables. Your data above would correspond to variables I (for Intercept), J (for January), F (for February) and M (for March), in addition to the other variables y and v1, as below:

    y    v1   I  J  F  M    # month
    2    1    1  1  0  0    # january
    3    1.4  1  0  1  0    # februrary
    1.5  6.3  1  0  1  0    # february
    1.2  4.5  1  1  0  0    # january
    5.5  4.0  1  0  0  1    # march

The linear predictor L in the model for y would then be

    L = a*I + b*v1 + c1*J + c2*F + c3*M

evaluated arithmetically; e.g. for row 2 of the data it is

    a + b*1.4 + c2

However, as given, J + F + M = I, so the four variables I, J, F and M are redundant: they carry only three independent values. (This is not so if you exclude the Intercept using a model formula y ~ v1 + month - 1.) R will therefore provide estimates which are computed in terms of some pattern of differences between these four variables, called contrasts. Different patterns of difference give different representations of the same three independent aspects.
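The dummy coding and the redundancy J + F + M = I described above can be inspected with model.matrix(). The following is only a minimal sketch: the data frame name d is made up for illustration, and "februrary" is spelled correctly here (an assumption) so that month has the three levels used in the explanation. Note that R orders the levels alphabetically, so the columns appear as february, january, march rather than J, F, M.

```r
# Toy data from the original post. Assumption: "februrary" is corrected to
# "february" so that the factor has exactly three levels.
d <- data.frame(
  y     = c(2, 3, 1.5, 1.2, 5.5),
  v1    = c(1, 1.4, 6.3, 4.5, 4.0),
  month = factor(c("january", "february", "february", "january", "march"))
)

# Without an intercept, the factor gets one indicator column per level.
X_full <- model.matrix(~ v1 + month - 1, data = d)
print(X_full)

# The indicator columns sum to 1 in every row, which is exactly the
# intercept column: J + F + M = I.
ind <- X_full[, grep("^month", colnames(X_full)), drop = FALSE]
print(rowSums(ind))

# With an intercept, one indicator is dropped under the default contrasts.
X_int <- model.matrix(~ v1 + month, data = d)
print(colnames(X_int))
```

Under R's default settings the last call lists an intercept, v1, and two month columns; the omitted level (february, the first alphabetically) becomes the baseline.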
There are many different kinds of contrasts available. One of these will be chosen as default by R (depending in particular on whether the factor variable is being used as an ordered factor or an unordered factor).

See ?contrasts for an outline of what is there, ?contrast for more detail, and look at the help for particular contrasts such as ?contr.helmert, ?contr.poly, ?contr.sum, ?contr.treatment.

After all that: No, R is not doing strange things to your model!

Ted.

E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Date: 02-Mar-10 Time: 08:47:11
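As a rough illustration of the contrast choices mentioned in this reply, the sketch below prints the default contrast settings and a few contrast matrices for a three-level factor, then switches one factor to sum contrasts. The factor month here is a made-up example (with the spelling corrected), not the poster's actual data.

```r
# Hypothetical three-level unordered factor, for illustration only.
month <- factor(c("january", "february", "february", "january", "march"))

# Default contrast functions: the first entry applies to unordered factors,
# the second to ordered factors.
print(getOption("contrasts"))

# Contrast matrices for a factor with 3 levels under different codings.
print(contr.treatment(3))  # each level compared with the first (baseline) level
print(contr.sum(3))        # sum-to-zero coding
print(contr.helmert(3))    # Helmert coding

# Assigning a different coding to one factor changes the columns of the
# design matrix, and hence the meaning of the coefficients.
contrasts(month) <- contr.sum(3)
X_sum <- model.matrix(~ month)
print(X_sum)
```

Whichever coding is used, an intercept plus two contrast columns still describes the same three level means; only the parameterisation differs.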
On 2/03/2010, at 9:02 PM, Noah Silverman wrote:

> Hi,
>
> I'm just learning about Poisson links for the glm function.
>
> One of the data sets I'm playing with has several of the variables as
> factors (e.g. month, group, etc.)
>
> When I call the glm function with a formula that has a factor variable,
> R automatically converts the variable to a series of variables with
> unique names and binary values.
>
> For example, with this pseudo data:
>
> y    v1   month
> 2    1    january
> 3    1.4  februrary
> 1.5  6.3  february
> 1.2  4.5  january
> 5.5  4.0  march
>
> I use this call:
>
> m <- glm(y ~ v1 + month, family="poisson")
>
> R gives me back a model with variables of
> Intercept
> v1
> monthJanuary
> monthFebruary
> monthMarch

No it didn't!!! You are kidding the troops/being economical with the truth. If you had used the data that you show, it would've ``given you a model with variables'':

    Intercept
    v1
    monthfebrurary
    monthjanuary
    monthmarch

No caps in the month name, and note the misspelling of ``february''. You actually have ***four*** levels for the month factor:

    january
    februrary
    february
    march

If you had spelt ``februrary'' correctly you would have got variables

    Intercept
    v1
    monthjanuary
    monthmarch

The first level, february, would have been omitted under the default contrasts (contr.treatment). You need k-1 dummy variables to specify a factor with k levels.

> I'm concerned that this might be doing some strange things to my model.

No, you are doing strange things.

Notice also that the Poisson distribution is a distribution of ***counts***. Non-negative integers. Whole numbers. Values like 1.5 and 1.2 make no immediate sense in terms of the Poisson distribution. The Poisson likelihood can be evaluated with non-integer responses, but the glm() function will quite rightly worry about non-integer values and give you a warning. (Which you didn't mention.)
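Both points in this part of the reply, the four factor levels and the non-integer warning, can be checked directly. This is a sketch only: the data are typed in exactly as posted (misspelling included), and the names d and msgs are made up for illustration.

```r
# Data exactly as posted, including the misspelled "februrary".
d <- data.frame(
  y     = c(2, 3, 1.5, 1.2, 5.5),
  v1    = c(1, 1.4, 6.3, 4.5, 4.0),
  month = factor(c("january", "februrary", "february", "january", "march"))
)

# Four levels, sorted alphabetically; the first one becomes the baseline.
print(levels(d$month))

# Coefficient names implied by the formula: k - 1 = 3 month columns.
print(colnames(model.matrix(~ v1 + month, data = d)))

# Fit the Poisson model and collect any warnings instead of printing them.
msgs <- character(0)
m <- withCallingHandlers(
  glm(y ~ v1 + month, family = poisson, data = d),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(msgs)
```

With five observations and five coefficients this particular fit is saturated, so it is useful only for looking at the names and the warnings, not for inference.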
If you really have non-integer valued responses you shouldn't be using the Poisson family; the quasi family *might* be appropriate, if you know what you're doing.

> Can anyone offer some enlightenment?

I hope you feel enlightened.

cheers,

Rolf Turner
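One possible reading of the quasi-family suggestion is the quasipoisson family, which keeps the log link and the Poisson-style variance function but does not assume integer counts. That choice is an assumption made for this sketch, not something stated in the reply, and whether it suits a given data set is a separate question. The data frame d is again hypothetical, with the spelling corrected.

```r
# Toy data with the month spelling corrected (assumption), giving three levels.
d <- data.frame(
  y     = c(2, 3, 1.5, 1.2, 5.5),
  v1    = c(1, 1.4, 6.3, 4.5, 4.0),
  month = factor(c("january", "february", "february", "january", "march"))
)

# Quasi-Poisson fit: same mean model as family = poisson, but the
# dispersion is estimated from the data rather than fixed at 1.
m_q <- glm(y ~ v1 + month, family = quasipoisson, data = d)

print(coef(m_q))
print(summary(m_q)$dispersion)
```

With only five rows and four coefficients there is a single residual degree of freedom, so the dispersion estimate here is not meaningful beyond showing where to find it.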