thr3ads.net - R help - [R] GLM: include observations with missing explanatory variables [Dec 2015]

If this information is useful, please help other people find it:
Share via:

Frank van Berkum

2015-Dec-29 13:33 UTC

[R] GLM: include observations with missing explanatory variables

Hi all,
My problem is the following. 
Suppose I have a dataset with observations Y and explanatory variables X1, ...,
Xn, and suppose one of these explanatory variables is geographical area (of
which there are ten, j=1,...,10).  For some observations I know the area, but
for others it is unknown and therefore record as NA.
I want to estimate a model of the form Y[i] ~ Poisson( lambda[i] ) with
log(lambda[i]) = constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j]
In words: we estimate a constant for all observations and a factor for each
area. If it is unknown what the area is, we only include the constant.
When estimating this model using glm(), the records with is.na(area[i]) are
'deleted' from the dataset, and this I do not want. I had hoped that the
model as described above could be estimated using the function I() (interpret
as), but so far my attempts have not succeeded.
Any help on how to approach this is kindly appreciated.
Kind regards,
Frank van Berkum 		 	   		  
	[[alternative HTML version deleted]]

Rolf Turner

2015-Dec-29 20:54 UTC

head link

[R] GLM: include observations with missing explanatory variables

On 30/12/15 02:33, Frank van Berkum wrote:
> Hi all, My problem is the following. Suppose I have a dataset with
> observations Y and explanatory variables X1, ..., Xn, and suppose one
> of these explanatory variables is geographical area (of which there
> are ten, j=1,...,10).  For some observations I know the area, but for
> others it is unknown and therefore record as NA. I want to estimate a
> model of the form Y[i] ~ Poisson( lambda[i] ) with log(lambda[i]) >
constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j] In
> words: we estimate a constant for all observations and a factor for
> each area. If it is unknown what the area is, we only include the
> constant. When estimating this model using glm(), the records with
> is.na(area[i]) are 'deleted' from the dataset, and this I do not
> want. I had hoped that the model as described above could be
> estimated using the function I() (interpret as), but so far my
> attempts have not succeeded. Any help on how to approach this is
> kindly appreciated.
As Deep Thought was heard to remark: "Hmmmmm.  Tricky."

After pondering for a while it seems to me that you want to make NA the 
reference level of the factor "geographical area".

I.e. convert the NA values to a level of that factor and make it the 
*first* level.

E.g.:

set.seed(42)
GA <- factor(sample(LETTERS[1:10],200,TRUE))
GA[sample(1:200,10)] <- NA
tmp <- as.character(GA)
tmp[is.na(tmp)] <- "unknown"
GA <- factor(tmp,levels=c("unknown",LETTERS[1:10]))

y <- rpois(200,20) # Artificial response.
fit <- glm(y ~ GA,family=poisson)

Note that if you set

     z <- predict(fit)

(this gives predictions on the scale of the linear predictor)
then the entries of z corresponding to "unknown" are equal to the 
intercept coefficient of fit.

I can't quite get my head around whether this is *really* accomplishing 
what you want --- but it's a thought.

cheers,

Rolf Turner

-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

David Winsemius

2015-Dec-29 21:03 UTC

head link

[R] GLM: include observations with missing explanatory variables

> On Dec 29, 2015, at 5:33 AM, Frank van Berkum <frankieboytje at
hotmail.com> wrote:
> 
> Hi all,
> My problem is the following. 
> Suppose I have a dataset with observations Y and explanatory variables X1,
..., Xn, and suppose one of these explanatory variables is geographical area (of
which there are ten, j=1,...,10).  For some observations I know the area, but
for others it is unknown and therefore record as NA.
> I want to estimate a model of the form Y[i] ~ Poisson( lambda[i] ) with
log(lambda[i]) = constant + \sum_j I[!is.na(area[i])] * I[area[i]==j] * beta[j]
> In words: we estimate a constant for all observations and a factor for each
area. If it is unknown what the area is, we only include the constant.
> When estimating this model using glm(), the records with is.na(area[i]) are
'deleted' from the dataset, and this I do not want. I had hoped that the
model as described above could be estimated using the function I() (interpret
as), but so far my attempts have not succeeded.
> Any help on how to approach this is kindly appreciated.
> Kind regards,
> Frank van Berkum 		 	   		  
> 	[[alternative HTML version deleted]]
I don't understand why you don't just recode the NA's to
"unknown" and redo the model. This code is untested, but I think
demonstrates the two steps needed: add a level and then recode the NA values
assuming this factor is named `X6`:

dat$X6 <- factor(dat$X6, levels=c( levels(dat$X6), "unknown") )
dat$X6[ is.na(dat$X6) ] <- "unknown"


(I think this might be is a bit quicker than Rolf's approach.)


In R the default contrasts are "treatment" (which is different than
the contrasts you describe) and so each area will be referenced to the first
area (and the first level of all other factors) in the lexical ordering of area
names. This ordering and the contrast type can be changed. The are many postings
on rhelp over the years demonstrating how to do this.

Study these and the basics of R before reposting and if you do so, then post in
plain text and include a small example constructed in R:

?factor
?C
?constrasts
?contr.sum

-- 

David Winsemius
Alameda, CA, USA

R help - Dec 2015 - GLM: include observations with missing explanatory variables

[R] GLM: include observations with missing explanatory variables

[R] GLM: include observations with missing explanatory variables

[R] GLM: include observations with missing explanatory variables