Frank van Berkum
2015-Dec-29 13:33 UTC
[R] GLM: include observations with missing explanatory variables
Hi all, My problem is the following. Suppose I have a dataset with observations Y and explanatory variables X1, ..., Xn, and suppose one of these explanatory variables is geographical area (of which there are ten, j=1,...,10). For some observations I know the area, but for others it is unknown and therefore record as NA. I want to estimate a model of the form Y[i] ~ Poisson( lambda[i] ) with log(lambda[i]) = constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j] In words: we estimate a constant for all observations and a factor for each area. If it is unknown what the area is, we only include the constant. When estimating this model using glm(), the records with is.na(area[i]) are 'deleted' from the dataset, and this I do not want. I had hoped that the model as described above could be estimated using the function I() (interpret as), but so far my attempts have not succeeded. Any help on how to approach this is kindly appreciated. Kind regards, Frank van Berkum [[alternative HTML version deleted]]
Rolf Turner
2015-Dec-29 20:54 UTC
[R] GLM: include observations with missing explanatory variables
On 30/12/15 02:33, Frank van Berkum wrote:> Hi all, My problem is the following. Suppose I have a dataset with > observations Y and explanatory variables X1, ..., Xn, and suppose one > of these explanatory variables is geographical area (of which there > are ten, j=1,...,10). For some observations I know the area, but for > others it is unknown and therefore record as NA. I want to estimate a > model of the form Y[i] ~ Poisson( lambda[i] ) with log(lambda[i]) > constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j] In > words: we estimate a constant for all observations and a factor for > each area. If it is unknown what the area is, we only include the > constant. When estimating this model using glm(), the records with > is.na(area[i]) are 'deleted' from the dataset, and this I do not > want. I had hoped that the model as described above could be > estimated using the function I() (interpret as), but so far my > attempts have not succeeded. Any help on how to approach this is > kindly appreciated.As Deep Thought was heard to remark: "Hmmmmm. Tricky." After pondering for a while it seems to me that you want to make NA the reference level of the factor "geographical area". I.e. convert the NA values to a level of that factor and make it the *first* level. E.g.: set.seed(42) GA <- factor(sample(LETTERS[1:10],200,TRUE)) GA[sample(1:200,10)] <- NA tmp <- as.character(GA) tmp[is.na(tmp)] <- "unknown" GA <- factor(tmp,levels=c("unknown",LETTERS[1:10])) y <- rpois(200,20) # Artificial response. fit <- glm(y ~ GA,family=poisson) Note that if you set z <- predict(fit) (this gives predictions on the scale of the linear predictor) then the entries of z corresponding to "unknown" are equal to the intercept coefficient of fit. I can't quite get my head around whether this is *really* accomplishing what you want --- but it's a thought. cheers, Rolf Turner -- Technical Editor ANZJS Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276
David Winsemius
2015-Dec-29 21:03 UTC
[R] GLM: include observations with missing explanatory variables
> On Dec 29, 2015, at 5:33 AM, Frank van Berkum <frankieboytje at hotmail.com> wrote: > > Hi all, > My problem is the following. > Suppose I have a dataset with observations Y and explanatory variables X1, ..., Xn, and suppose one of these explanatory variables is geographical area (of which there are ten, j=1,...,10). For some observations I know the area, but for others it is unknown and therefore record as NA. > I want to estimate a model of the form Y[i] ~ Poisson( lambda[i] ) with log(lambda[i]) = constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j] > In words: we estimate a constant for all observations and a factor for each area. If it is unknown what the area is, we only include the constant. > When estimating this model using glm(), the records with is.na(area[i]) are 'deleted' from the dataset, and this I do not want. I had hoped that the model as described above could be estimated using the function I() (interpret as), but so far my attempts have not succeeded. > Any help on how to approach this is kindly appreciated. > Kind regards, > Frank van Berkum > [[alternative HTML version deleted]]I don't understand why you don't just recode the NA's to "unknown" and redo the model. This code is untested, but I think demonstrates the two steps needed: add a level and then recode the NA values assuming this factor is named `X6`: dat$X6 <- factor(dat$X6, levels=c( levels(dat$X6), "unknown") ) dat$X6[ is.na(dat$X6) ] <- "unknown" (I think this might be is a bit quicker than Rolf's approach.) In R the default contrasts are "treatment" (which is different than the contrasts you describe) and so each area will be referenced to the first area (and the first level of all other factors) in the lexical ordering of area names. This ordering and the contrast type can be changed. The are many postings on rhelp over the years demonstrating how to do this. Study these and the basics of R before reposting and if you do so, then post in plain text and include a small example constructed in R: ?factor ?C ?constrasts ?contr.sum -- David Winsemius Alameda, CA, USA