thr3ads.net - R help - [R] mgcv: GAM with clustered standard errors [Jul 2013]


Kathrine Veie

2013-Jul-11 14:46 UTC

[R] mgcv: GAM with clustered standard errors

Dear Help list,

I am relatively new to the mgcv package, which I am using to model prices of
housing transactions as a function of the characteristics of a home and a
neighborhood. I have several smooth terms to capture price evolution over time,
but also to non-parametrically fit the functional form of some characteristics
such as living area, lot size etc. In my model I have neighborhood fixed effects
(i.e. prices in different neighborhoods can have different means), but I would
also like to allow for within neighborhood correlation in my errors. My question
is: What is the best way to do this?

Sample size is approx. 14,000 obs. 

My model (without clustered residuals) looks something like (although I have
many more regressors, several of which are factor variables):
mod.1 <- gam(Price~ s(date.of.sale) + s(livingspace) + s(lotsize) +
factor(neighborhood), data=data, family=Gamma(link=log))

I was thinking that I could either include random effects at the neighborhood
level (s(neighborhood, bs="re")) or I could use a GAMM with correlated
errors within group:

mod.2 <- gamm(Price~ s(date.of.sale) + s(livingspace) + s(lotsize) +
factor(neighborhood), correlation=corSymm(form~1|neighborhood), data=data,
family=Gamma(link=log))

I tried out mod.1 with the random effects and it did provide larger s.e.'s
as I would expect given positive correlation in the residuals. But it also
seemed that the random effects component was not significant if I understand it
correctly: the edf are very close to zero and the significance is NaN. Perhaps
if this is the way to go, I should first demean the data at the neighborhood
level?

As for the gamm approach, I can't get it to work properly: it does not
recognize my groups (i.e. it defines only one group). I tried to correct for
this by transforming the neighborhood numbers into characters
neighborhood.c <- as.character(neighborhood)
and then used this as the group indicator instead:
corSymm(form~1|neighborhood.c)

But this resulted in an error message: variable lengths differ (found for
'neighborhood.c'). The same happens when I write
"factor(neighborhood)" in the corSymm specification. My panel is not
balanced, i.e. the number of observations within neighborhoods varies.  Is this
a problem? I haven't seen any indication that the panel must be balanced to
use lme, but maybe I've missed it?

Any feedback would be much appreciated incl. suggestions on where I might read
more about how to use mgcv for this type of problem.

Thanks in advance!
Kathrine


Simon Wood

2013-Jul-11 18:50 UTC


I think it's going to be a problem to have different sized groups in 
your second model. ?corSymm says that a general correlation matrix is 
being estimated (i.e. the correlation between each pair of observations 
is being estimated - for this to be meaningful across groups you need 
the jth price in one area to be somehow equivalent to the jth price in 
another area, which it probably isn't) - I can't figure out how this can
be done if the groups are different sizes.
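
To put rough numbers on this, here is a minimal sketch of how many correlations a general (corSymm-style) matrix contains for a group of a given size, next to the single parameter of corCompSymm. The group sizes and the helper name n_corr_unstructured are made up for illustration; they are not taken from the data in question.

```r
# Number of distinct pairwise correlations in an unstructured
# correlation matrix for a group of m observations: m * (m - 1) / 2.
# (n_corr_unstructured is a hypothetical helper, not part of nlme or mgcv.)
n_corr_unstructured <- function(m) m * (m - 1) / 2

# Hypothetical neighbourhood sizes, for illustration only
group_sizes <- c(5, 20, 100)

param_counts <- data.frame(
  group_size  = group_sizes,
  corSymm     = n_corr_unstructured(group_sizes),
  corCompSymm = 1  # compound symmetry has one shared correlation
)
print(param_counts)
```

The count grows quadratically with group size, which is why a general matrix quickly becomes impractical when there are many observations per neighbourhood.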

Even if your group sizes were all the same, I guess you have lots of 
data per neighbourhood, so there will be an awful lot of correlation 
parameters to estimate, and I doubt that it will be successful. Might it 
make more sense to start with something less parameter rich like 
corCompSymm (which would also be ok with different group sizes, I think)?
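
As a small illustration of the point about group sizes, the sketch below builds a corCompSymm structure with nlme on toy data containing two neighbourhoods of unequal size and prints the implied correlation matrix for each. The toy data frame and the starting value 0.3 are assumptions made only for this example.

```r
library(nlme)

# Toy data (hypothetical): two neighbourhoods with 3 and 2 observations
d <- data.frame(neighborhood = factor(c("a", "a", "a", "b", "b")))

# Compound symmetry: one shared within-neighbourhood correlation.
# The value 0.3 is arbitrary and only serves to make the output readable.
cs <- corCompSymm(value = 0.3, form = ~ 1 | neighborhood)
cs <- Initialize(cs, data = d)

# One correlation matrix per neighbourhood, each sized to its own group
cm <- corMatrix(cs)
print(cm)
```

In a gamm call the same structure would be passed as correlation = corCompSymm(form = ~ 1 | neighborhood); whether the resulting fit behaves well on the real data is a separate question that this sketch does not address.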

Finally I would just set data$neighborhood <- factor(data$neighborhood) 
for this. You need this, e.g.  to be sure that s(neighborhood,bs="re")
is really doing what you want (i.e. giving a random coefficient for each 
neighbourhood, rather than a single random coefficient multiplying 
"neighborhood" interpreted as numeric). However, if neighborhood is already
in the model as a factor, then s(neighborhood,bs="re") is adding nothing (you've 
effectively already included neighborhood as a random effect with 
infinite variance in the model, so including it again won't do anything 
interesting).
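
A minimal sketch of the numeric-versus-factor difference, using simulated data rather than the housing data: all variable names, the Gaussian response and the simulation settings below are assumptions chosen only to keep the example small.

```r
library(mgcv)

set.seed(1)
n <- 200

# Simulated data (hypothetical): 10 neighbourhoods coded as integers
d <- data.frame(
  neighborhood = sample(1:10, n, replace = TRUE),
  x            = runif(n)
)
nb_effect <- rnorm(10)  # one simulated intercept shift per neighbourhood
d$y <- sin(2 * pi * d$x) + nb_effect[d$neighborhood] + rnorm(n, sd = 0.3)

# Numeric coding: bs = "re" yields a single random slope on the code itself
m_num <- gam(y ~ s(x) + s(neighborhood, bs = "re"),
             data = d, method = "REML")

# Factor coding: bs = "re" yields one random intercept per neighbourhood
d$neighborhood_f <- factor(d$neighborhood)
m_fac <- gam(y ~ s(x) + s(neighborhood_f, bs = "re"),
             data = d, method = "REML")

# Count the coefficients belonging to the random-effect term in each fit
n_re_num <- sum(grepl("neighborhood", names(coef(m_num)), fixed = TRUE))
n_re_fac <- sum(grepl("neighborhood_f", names(coef(m_fac)), fixed = TRUE))
print(c(numeric_coding = n_re_num, factor_coding = n_re_fac))
```

With the numeric coding the term contributes one coefficient; with the factor coding it contributes one per level, which is the random-intercept model that was intended.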

best,
Simon

