Øystein Sørensen
2018-Oct-01 06:15 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
I have longitudinal data where a response y_{ij} has been measured repeatedly for i=1,...,n individuals, j=1,...,m times for each individual. Let x_{ij} be the age of individual i at her/his jth measurement. I wish to see how the response varies with age, and have reason to believe that a nonlinear fit is needed. I hence wish to model the relationship using an additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} + \epsilon_{ij}, where f() is some smooth function to be estimated, b_{i} \sim N(0, \sigma_{b}^{2}) is the random effect for individual i, and \epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.

Reading the documentation of the mgcv package, I see that such models can be set up by calling mgcv::gamm in two different ways. One way to set up such a model is with the call

b1 <- gamm(y ~ s(x), random = list(id = ~1))

where id is an indicator of the specific individual; here the random effect is specified in a list. The other way is to set up the random effect as a random-effect smooth (bs = "re"):

b2 <- gamm(y ~ s(x) + s(id, bs = "re"))

As far as I can understand, these two setups should create the same underlying model matrices. However, when running them on my data, the p-values for the smooth terms, as well as their estimated degrees of freedom, differ.

Below is a reproducible example on simulated data which shows that these two ways of specifying the model give different p-values and estimated degrees of freedom. On my real data, these differences are sometimes even more pronounced.

My question is: Are these two calls to gamm not supposed to estimate the same model, or have I misunderstood?

Here is a reproducible example:

library(mgcv)
set.seed(100)
n_sim <- 100
df <- data.frame(
  model1_pval = numeric(n_sim),
  model1_edf = numeric(n_sim),
  model2_pval = numeric(n_sim),
  model2_edf = numeric(n_sim)
)

# Number of observations
n <- 500
# Number of groups
ng <- 100

for(i in 1:n_sim){
  # Draw the fixed effect covariate
  x <- rnorm(n)
  # Set up the group
  id <- rep(1:ng, n/ng)
  # Draw the random effect
  z <- rnorm(ng)
  # Draw the response
  y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)

  # Fit the two different models
  b1 <- gamm(y ~ s(x, k = 5), random = list(id = ~1))
  b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

  df[i, "model1_pval"] <- anova(b1$gam)$s.pv
  df[i, "model1_edf"] <- anova(b1$gam)$edf
  df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
  df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
}

# Differences between model 1 and model 2 p-values
mean(df$model1_pval)
#> [1] 6.790546e-21
mean(df$model2_pval)
#> [1] 3.090694e-14
max(abs(df$model1_pval - df$model2_pval))
#> [1] 2.770545e-12

# Differences between model 1 and model 2 estimated degrees of freedom
mean(df$model1_edf)
#> [1] 3.829992
mean(df$model2_edf)
#> [1] 3.731438
max(abs(df$model1_edf - df$model2_edf))
#> [1] 0.6320281

Thanks in advance,
Øystein Sørensen
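P.S. In case it is useful: one way I imagine the two fits could be compared directly, for a single simulated data set, is through the underlying lme objects (sketch only; b1 and b2 are taken from one iteration of the loop above):

# b1 and b2 from a single iteration of the loop above
VarCorr(b1$lme)   # estimated variance components (VarCorr comes from nlme,
                  # which is loaded together with mgcv)
VarCorr(b2$lme)
logLik(b1$lme)    # the two fits should have the same likelihood if they
logLik(b2$lme)    # really are the same underlying mixed model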
Simon Wood
2018-Oct-02 12:26 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
Dear Øystein,

In your code 'id' is set up to be numeric. In your first gamm call it gets coerced to a factor (since nothing else would make sense). In your second gamm call it is simply treated as numeric, since that could make sense. To make the two models equivalent you just need to substitute

id <- as.factor(rep(1:ng, n/ng))

into your loop.

best,
Simon
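For example, a single-draw sketch of the corrected setup (reusing the simulation code from the original post, with only the definition of 'id' changed) might look like this:

library(mgcv)
set.seed(100)
n <- 500   # number of observations
ng <- 100  # number of groups
x <- rnorm(n)
id <- as.factor(rep(1:ng, n/ng))  # 'id' is now a factor, as suggested above
z <- rnorm(ng)
y <- 0.1 * x + 0.2 * x^3 + z[as.integer(id)] + rnorm(n)

# With a factor id, s(id, bs = "re") sets up a per-individual random
# intercept, so the two specifications describe the same model
b1 <- gamm(y ~ s(x, k = 5), random = list(id = ~1))
b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

# The p-value and edf for s(x) should now agree up to numerical tolerance
anova(b1$gam)$s.pv
anova(b2$gam)$s.pv[[1]]
anova(b1$gam)$edf
anova(b2$gam)$edf[[1]]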
Øystein Sørensen
2018-Oct-02 12:48 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
Thanks a lot, Simon. That was a silly mistake on my part. After correcting it, there is still a slight difference between the p-values, but it is so small that I guess it is just numerics.

Best,
Øystein
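For reference, one way to check that the remaining differences are only numerical noise might be to compare the results with an explicit tolerance (assuming df now holds the results from the rerun loop with id as a factor; the tolerance of 1e-6 is an arbitrary choice):

# TRUE (or a tiny reported mean relative difference) indicates the two
# specifications agree up to numerical error
all.equal(df$model1_pval, df$model2_pval, tolerance = 1e-6)
all.equal(df$model1_edf, df$model2_edf, tolerance = 1e-6)

# Largest absolute differences, as in the original comparison
max(abs(df$model1_pval - df$model2_pval))
max(abs(df$model1_edf - df$model2_edf))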