Øystein Sørensen
2018-Oct-01 06:15 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
I have longitudinal data where a response y_{ij} has been measured repeatedly for i=1,...,n individuals, j=1,...,m times for each individual. Let x_{ij} be the age of individual i at her/his jth measurement. I wish to see how the response varies with age, and have reason to believe that a nonlinear fit is needed. I hence wish to model the relationship using an additive mixed model of the form y_{ij} = f(x_{ij}) + b_{i} + \epsilon_{ij}, where f() is some smooth function to be estimated, b_{i} \sim N(0, \sigma_{b}^{2}) is the random effect for individual i, and \epsilon_{ij} \sim N(0, \sigma^{2}) is the residual.

Reading the documentation of the mgcv package, I see that such models can be set up by calling mgcv::gamm in two different ways. One way to set up such a model is with the call

b1 <- gamm(y ~ s(x), random = list(id = ~1))

where id is an indicator of the specific individual; here the random effect is specified in a list. The other way is to set up the random effect as a random-effect smooth (bs = "re"):

b2 <- gamm(y ~ s(x) + s(id, bs = "re"))

As far as I can understand, these two setups should create the same underlying model matrices. However, when running them on my data, the p-values for the smooth terms, as well as their estimated degrees of freedom, differ.

Below is a reproducible example on simulated data which shows that these two ways of specifying the model give different p-values and estimated degrees of freedom. On my real data, these differences are sometimes even more pronounced.

My question is: Are these two calls to gamm not supposed to estimate the same model, or have I misunderstood?

Here is a reproducible example:

library(mgcv)
set.seed(100)
n_sim <- 100
df <- data.frame(
  model1_pval = numeric(n_sim),
  model1_edf = numeric(n_sim),
  model2_pval = numeric(n_sim),
  model2_edf = numeric(n_sim)
)

# Number of observations
n <- 500
# Number of groups
ng <- 100

for(i in 1:n_sim){
  # Draw the fixed effect covariate
  x <- rnorm(n)
  # Set up the group
  id <- rep(1:ng, n/ng)
  # Draw the random effect
  z <- rnorm(ng)
  # Draw the response
  y <- 0.1 * x + 0.2 * x^3 + z[id] + rnorm(n)

  # Fit the two different models
  b1 <- gamm(y ~ s(x, k = 5), random = list(id = ~1))
  b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

  df[i, "model1_pval"] <- anova(b1$gam)$s.pv
  df[i, "model1_edf"] <- anova(b1$gam)$edf
  df[i, "model2_pval"] <- anova(b2$gam)$s.pv[[1]]
  df[i, "model2_edf"] <- anova(b2$gam)$edf[[1]]
}

# Differences between model 1 and model 2 p-values
mean(df$model1_pval)
#> [1] 6.790546e-21
mean(df$model2_pval)
#> [1] 3.090694e-14
max(abs(df$model1_pval - df$model2_pval))
#> [1] 2.770545e-12

# Differences between model 1 and model 2 estimated degrees of freedom
mean(df$model1_edf)
#> [1] 3.829992
mean(df$model2_edf)
#> [1] 3.731438
max(abs(df$model1_edf - df$model2_edf))
#> [1] 0.6320281

Thanks in advance,
Øystein Sørensen
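P.S. In case it is useful: one way I imagine the two fits could be compared directly, for a single simulated data set, is through the underlying lme objects (sketch only; b1 and b2 are taken from one iteration of the loop above):

# b1 and b2 from a single iteration of the loop above
VarCorr(b1$lme)   # estimated variance components (VarCorr comes from nlme,
                  # which is loaded together with mgcv)
VarCorr(b2$lme)
logLik(b1$lme)    # the two fits should have the same likelihood if they
logLik(b2$lme)    # really are the same underlying mixed model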
Simon Wood
2018-Oct-02 12:26 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
Dear Øystein,

In your code 'id' is set up to be numeric. In your first gamm call it gets coerced to a factor (since nothing else would make sense). In your second gamm call it is simply treated as numeric, since that could make sense. To make the two models equivalent you just need to substitute

id <- as.factor(rep(1:ng, n/ng))

into your loop.

best,
Simon
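For example, a single-draw sketch of the corrected setup (reusing the simulation code from the original post, with only the definition of 'id' changed) might look like this:

library(mgcv)
set.seed(100)
n <- 500   # number of observations
ng <- 100  # number of groups
x <- rnorm(n)
id <- as.factor(rep(1:ng, n/ng))  # 'id' is now a factor, as suggested above
z <- rnorm(ng)
y <- 0.1 * x + 0.2 * x^3 + z[as.integer(id)] + rnorm(n)

# With a factor id, s(id, bs = "re") sets up a per-individual random
# intercept, so the two specifications describe the same model
b1 <- gamm(y ~ s(x, k = 5), random = list(id = ~1))
b2 <- gamm(y ~ s(x, k = 5) + s(id, bs = "re"))

# The p-value and edf for s(x) should now agree up to numerical tolerance
anova(b1$gam)$s.pv
anova(b2$gam)$s.pv[[1]]
anova(b1$gam)$edf
anova(b2$gam)$edf[[1]]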
Øystein Sørensen
2018-Oct-02 12:48 UTC
[R] Why do two different calls to mgcv::gamm give different p-values?
Thanks a lot, Simon. That was a silly mistake on my part. After correcting it, there is still a slight difference between the p-values, but it is so small that I guess it is just numerics.

Best,
Øystein
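For reference, one way to check that the remaining differences are only numerical noise might be to compare the results with an explicit tolerance (assuming df now holds the results from the rerun loop with id as a factor; the tolerance of 1e-6 is an arbitrary choice):

# TRUE (or a tiny reported mean relative difference) indicates the two
# specifications agree up to numerical error
all.equal(df$model1_pval, df$model2_pval, tolerance = 1e-6)
all.equal(df$model1_edf, df$model2_edf, tolerance = 1e-6)

# Largest absolute differences, as in the original comparison
max(abs(df$model1_pval - df$model2_pval))
max(abs(df$model1_edf - df$model2_edf))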