Martijn Wieling
2012-May-23 17:09 UTC
[R] gam (mgcv) vs. multiple regression breakpoint analysis: inconsistencies?
Dear useRs, I have a question with respect to fitting a non-linearity using gam (mgcv package, version 1.7-16). In a study I'm currently conducting, I'd like to find out if there is a breakpoint after which the effect of Age of Acquisition (AOA) of the second language changes. I.e. if the slope of AOA before the breakpoint (at a certain AOA) is different from the slope past the breakpoint. For this purpose, I can use the breakpoint analysis proposed by Baayen (2008) using lm or lmer. Using that approach I clearly find a breakpoint which improves the fit of the model significantly over the simpler model (on the basis of model comparison). Given that I can model a non-linearity in a gam (from mgcv) using s(...), I'd prefer to show the presence of a breakpoint (i.e. a non-linearity) by using a smooth. Instead of specifying an explicit breakpoint, I would then include s(AOA). I'd expect if a breakpoint would be found using multiple regression, this would be detected by the smooth as well. When I create a simple gam (model g1, below), I do find a non-linearity consistent with the breakpoint analysis. However, when I include an additional covariate (in model g2, below), the non-linearity disappears (this is inconsistent with the breakpoint analysis including the covariate). However, if I include the non-linearity estimated in g1 in addition to the covariate (model g3, below), this improves the fit over g2 (which did not detect a non-linearity). Please note that the actual analysis (not shown) contains two breakpoints in addition to a random-effect factor. Also breakpoints are highly significant improvements (p = 0.0004) over the models without breakpoints (the example below has less convincing p-values, but exhibits the same pattern). I'd really appreciate it if someone could help shed light on these findings... Perhaps I'm doing something wrong, but the results go against my intuition. With kind regards, Martijn Wieling, University of Groningen ### breakpoint analysis using lm, both with and without covariate # note that AoEO indicates the AOA # LR denotes Length of Residence deviances1 = rep(Inf, 20) deviances2 = rep(Inf, 20) for (breakpoint in 6:20) { spk$ShiftedAoEO = spk$AoEO - breakpoint; spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0); m1 = lm(LDlog ~ ShiftedAoEO:PastBreakPoint,data=spk) m2 = lm(LDlog ~ LR + ShiftedAoEO:PastBreakPoint,data=spk) deviances1[breakpoint] = deviance(m1) deviances2[breakpoint] = deviance(m2) } breakpoint = which(deviances1 == min(deviances1)) # 18 spk$ShiftedAoEO = spk$AoEO - breakpoint; spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0); m1 = lm(LDlog ~ ShiftedAoEO:PastBreakPoint,data=spk) m0 = lm(LDlog ~ AoEO,data=spk) anova(m0,m1) # m1 is better (p = 0.013) breakpoint = which(deviances2 == min(deviances2)) # 19 spk$ShiftedAoEO = spk$AoEO - breakpoint; spk$PastBreakPoint = as.factor(spk$ShiftedAoEO > 0); m2 = lm(LDlog ~ LR + ShiftedAoEO:PastBreakPoint,data=spk) m0 = lm(LDlog ~ LR + AoEO,data=spk) anova(m0,m2) # m2 is better (p = 0.033) summary(m1) # the line is steep before the breakpoint (18), flatter after #ShiftedAoEO:PastBreakPointFALSE 0.014795 0.001662 8.900 < 2e-16 *** #ShiftedAoEO:PastBreakPointTRUE 0.007713 0.001723 4.477 8.65e-06 *** summary(m2) # the line is steep before the breakpoint (19), flatter after #ShiftedAoEO:PastBreakPointFALSE 0.0139418 0.0016260 8.575 < 2e-16 *** #ShiftedAoEO:PastBreakPointTRUE 0.0076672 0.0018330 4.183 3.2e-05 *** ### end of breakpoint analysis ### alternative analysis using gam library(mgcv) # 1.7-16 g1 = gam(LDlog ~ s(AoEO), data=spk, method="ML") plot(g1) # steep before AoEO ~ 18, after that flatter summary(g1) # edf Ref.df F p-value #s(AoEO) 1.7 2.126 73.17 <2e-16 *** g0 = gam(LDlog ~ AoEO, data=spk, method="ML") anova(g0,g1,test="F") # g1 is better (p = 0.024) g2 = gam(LDlog ~ LR + s(AoEO), data=spk, method="ML") # LR improves the fit plot(g2) # linear relationship, no breakpoint summary(g2) # R-sq.(adj) = 0.167 # Estimate Std. Error t value Pr(>|t|) #LR -0.0018097 0.0005808 -3.116 0.0019 ** # # edf Ref.df F p-value #s(AoEO) 1 1 145 <2e-16 *** # use non-linearity from g1 pred = predict(g1, type="terms") # use predicted non-linear effect of g1 spk$AoEO.s = pred[,"s(AoEO)"] g3 = gam(LDlog ~ LR + AoEO.s, data=spk, method="ML") summary(g3) # both LR and AoEO.s significant, R-sq.(adj) = 0.171 # manual comparison of g2 and g3 (df of g3 = df of g2 + 0.7) pf(4.778,0.7,802.3,lower.tail=F) # p = 0.041 => g3 is better than g2 ####