I'm not sure whether this is a stupid question:

I want to create "centered" dummy variables to use in a call to glm(), and I am wondering if there's some slick method in R to do so. That is, rather than have a factor, which results in a glm() fit returning coefficients specifying either absence or presence of the factor, I'd like to fit a glm() without intercept such that the estimated coefficients (and standard errors) represent the "average" value in my data set for that variable.

An example: a data set has Race specified with 4 levels. I can manually specify 4 dummy variables for a no-intercept model, where each variable, rather than having a value of zero or one, has a centered value based on its frequency of occurrence in the data set. Thus if 30% of the records in the data set have Race of Hispanic, I can define a variable HISP that has a value of either -.3 or .7, resulting in my coefficient estimate for HISP representing the effect of an "average" person in the database (and a corresponding valid standard error).

One way to create these "centered dummy variables" from the original factor is:

    "B"=scale(RACE=="B",scale=F),
    "W"=scale(RACE=="W",scale=F),
    "H"=scale(RACE=="H",scale=F),
    "OTHRACE"=scale(RACE=="OTHER",scale=F)

However, I wonder if there is some method in R to avoid having to manually define a large number of these dummy variables for a more complicated dataset.

Thanks in advance,
Peter Holck
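A minimal sketch of one way to build all the centered indicator columns in one step, rather than writing one scale() call per level. The data frame `dat` and its `RACE` values are invented here purely for illustration; model.matrix() expands the factor into 0/1 columns and scale() then subtracts each column's mean.

```r
# Hypothetical data: 'dat' and its RACE values are made up for illustration.
set.seed(1)
dat <- data.frame(
  RACE = factor(sample(c("B", "W", "H", "OTHER"), 200, replace = TRUE,
                       prob = c(0.2, 0.4, 0.3, 0.1)))
)

# One 0/1 indicator column per level (no intercept column).
ind <- model.matrix(~ RACE - 1, data = dat)

# Subtract each column's mean, i.e. the observed proportion of that level.
cen <- scale(ind, center = TRUE, scale = FALSE)
colnames(cen) <- levels(dat$RACE)

round(colMeans(cen), 12)   # each centered column averages zero
qr(cen)$rank               # numerical rank of the centered columns
```

Because the 0/1 indicators sum to one in every row, the centered columns sum to zero in every row, so the rank reported here is one less than the number of levels. A no-intercept fit on all of the centered columns therefore cannot estimate every coefficient separately.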
Prof Brian Ripley
2004-Oct-13 10:48 UTC
[R] "Centered" dummy variables; non zero/one coding
This can be done by setting a contrast function or matrix on a variable. Look in e.g. chapter 6 of MASS (the only comprehensive tutorial on coding factors in R, it seems).

On Tue, 12 Oct 2004, Peter Holck wrote:

> I'm uncertain if this is perhaps a stupid question:
>
> I want to create "centered" dummy variables to use in a call to glm(), and
> wondering if there's some slick method in R to do so. That is, rather than
> have a factor, which results in a glm() fit returning coefficients
> specifying either absence or presence of the factor, I'd like to fit a glm()
> without intercept such that the estimated coefficients (standard errors)
> represent the "average" value in my data set for that variable.

Is that really what you want? An `average' person having linear predictor 0, or more precisely, the linear predictor having average zero over the dataset? What family of glm is this?

> An example: a data set has Race specified with 4 levels. I can manually
> specify 4 dummy variables for a no-intercept model with each variable rather
> than having a value of zero or one, has a centered value based on its
> frequency of occurrence in the data set. Thus if 30% of the records in the
> data set have Race of Hispanic, I can define a variable HISP that has a
> value of either -.3 or .7, resulting in my coefficient estimate for HISP
> representing the effect of an "average" person in the database (and a
> corresponding valid standard error).

Nope. A person can only have one race, so the coefficient estimates can only represent jointly the effect of picking one of the possible races.

I think what you are striving for is that the average of the term `race' be zero over the whole dataset. That's easy -- just compute the average and subtract it via an offset term.
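A sketch of how a contrast matrix might be set so that the race term averages zero over the dataset. This is one possible reading of the suggestion, not necessarily the coding intended in the reply; the data frame `dat`, the response `y`, and the binomial family are all assumptions made for illustration. The idea is to start from treatment contrasts and subtract the frequency-weighted mean of each contrast column.

```r
# Hypothetical data: 'dat', RACE and the binary response 'y' are made up.
set.seed(1)
dat <- data.frame(
  RACE = factor(sample(c("B", "W", "H", "OTHER"), 200, replace = TRUE,
                       prob = c(0.2, 0.4, 0.3, 0.1)))
)
dat$y <- rbinom(200, 1, 0.4)

# Observed proportion of each level, in the order of levels(dat$RACE).
p <- as.vector(prop.table(table(dat$RACE)))

# Start from the usual 0/1 treatment coding (4 levels x 3 columns) ...
C <- contr.treatment(levels(dat$RACE))
# ... and subtract each column's frequency-weighted mean.
C <- sweep(C, 2, colSums(C * p))

# Attach the centered coding to the factor itself.
contrasts(dat$RACE) <- C

X <- model.matrix(~ RACE, data = dat)
round(colMeans(X)[-1], 12)   # the race columns now average zero

# Assumed family: binomial. The intercept is then the linear predictor
# averaged over the dataset.
fit <- glm(y ~ RACE, family = binomial, data = dat)
coef(fit)
```

With this coding the intercept is kept in the model, and the coefficients on the race columns are still differences from the baseline level; only the reference point of the intercept changes.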
Once you have two or more factor predictors you will get aliasing if you do it your way.

> One way to create these "centered dummy variables" from the original factor
> is:
> "B"=scale(RACE=="B",scale=F),
> "W"=scale(RACE=="W",scale=F),
> "H"=scale(RACE=="H",scale=F),
> "OTHRACE"=scale(RACE=="OTHER",scale=F)
>
> However I wonder if there is some method in R to avoid having to manually
> define a large number of these dummy variables for a more complicated
> dataset.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595