j@de@shod@@ m@iii@g oii googiem@ii@com
2022-Jun-05 19:01 UTC
[R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?
Hello everyone, A few days ago I asked a question about concurvity in a GAM (the anologue of collinearity in a GLM) implemented in mgcv. I think my question was a bit unfocussed, so I am retrying again, but with additional information included about the autocorrelation function. I have also posted about this on Cross Validated. Given all the model output, it might make for easier reading:https://stats.stackexchange.com/questions/577790/high-concurvity-collinearity-between-time-and-temperature-in-gam-predicting-dea As mentioned previously, I have problems with concurvity in my thesis research, and don't have access to a statistician who works with time series, GAMs or R. I'd be very grateful for any (partial) answer, however short. I'll gladly return the favour where I can! For really helpful input I'd be more than happy to offer co-authorship on publication. Deadlines are very close, and I'm heading towards having no results at all if I can't solve this concurvity issue :( I'm using GAMs to try to understand the relationship between deaths and heat-related variables (e.g. temperature and humidity), using daily time series over a 14-year period from a tropical, low-income country. My aim is to understand the relationship between these variables and deaths, rather than pure prediction performance. The GAMs include distributed lag models (set up as 7-column matrices, see code at bottom of post), since deaths may occur over several days following exposure. Simple GAMs with just time, lagged temperature and lagged precipitation (a potential confounder) show very high concurvity between lagged temperature and time, regardless of the many different ways I have tried to decompose time. The autocorrelation functions (ACF) however, shows values close to zero, only just about breaching the 'significance line' in a few instances. It does show patterning though, although the regularity is difficult to define. My questions are: 1) Should I be worried about the high concurvity, or can I ignore it given the mostly non-significant ACF? I've read dozens of heat-mortality modelling studies and none report on concurvity between weather variables and time (though one 2012 paper discussed autocorrelation). 2) If I cannot ignore it, what should I do to resolve it? Would including an autoregressive term be appropriate, and if so, where can I find a coded example of how to do this? I've also come across sequential regression][1]. Would this be more or less appropriate? If appropriate, a pointer to an example would be really appreciated! Some example GAMs are specified as follows: ```r conc38b <- gam(deaths~te(year, month, week, weekday, bs=c("cr","cc","cc","cc")) + heap + te(temp_max, lag, k=c(10, 3)) + te(precip_daily_total, lag, k=c(10, 3)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), weekday = c(0, 6.5))) ``` Concurvity for the above model between (temp_max, lag) and (year, month, week, weekday) is 0.91: ```r $worst para te(year,month,week,weekday) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 1.125625e-29 0.3150073 0.6666348 te(year,month,week,weekday) 1.400648e-29 1.000000e+00 0.9060552 0.6652313 te(temp_max,lag) 3.152795e-01 8.998113e-01 1.0000000 0.5781015 te(precip_daily_total,lag) 6.666348e-01 6.652313e-01 0.5805159 1.0000000 ``` Output from ```gam.check()```: ```r Method: REML Optimizer: outer newton full convergence after 16 iterations. Gradient range [-0.01467332,0.003096643] (score 8915.994 & scale 1). Hessian positive definite, eigenvalue range [5.048053e-05,26.50175]. Model rank = 544 / 544 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value te(year,month,week,weekday) 319.0000 26.6531 0.96 0.06 . te(temp_max,lag) 29.0000 3.3681 NA NA te(precip_daily_total,lag) 27.0000 0.0051 NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 ``` Some output from ```summary(conc38b)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value te(year,month,week,weekday) 26.653127 319 166.803 < 2e-16 *** te(temp_max,lag) 3.368129 27 11.130 0.00145 ** te(precip_daily_total,lag) 0.005104 27 0.002 0.69636 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.839 Deviance explained = 53.3% -REML = 8916 Scale est. = 1 n = 5107 ``` Below are the ACF plots (note limit y-axis = 0.1 for clarity of pattern). They show peaks at 5 and 15, and then there seems to be a recurring pattern at multiples of approx. 30 (suggesting month is not modelled adequately?). Not sure what would cause the spikes at 5 and 15. There is heaping of deaths on the 15th day of each month, to which deaths with unknown date were allocated. This heaping was modelled with categorical variable/ factor ```heap``` with 169 levels (0 for all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent heaping day over the 14-year period): [2]: https://i.stack.imgur.com/FzKyM.png [3]: https://i.stack.imgur.com/fE3aL.png I get an identical looking ACF when I decompose time into (year, month, monthday) as in model conc39 below, although concurvity between (temp_max, lag) and the time term has now dropped somewhat to 0.83: ```r conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + te(temp_max, lag, k=c(10, 4)) + te(precip_daily_total, lag, k=c(10, 4)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5))) ``` ```r Method: REML Optimizer: outer newton full convergence after 14 iterations. Gradient range [-0.001578187,6.155096e-05] (score 8915.763 & scale 1). Hessian positive definite, eigenvalue range [1.894391e-05,26.99215]. Model rank = 323 / 323 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value te(year,month,monthday) 79.0000 25.0437 0.93 <2e-16 *** te(temp_max,lag) 39.0000 4.0875 NA NA te(precip_daily_total,lag) 36.0000 0.0107 NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 ``` Some output from ```summary(conc39)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value te(year,month,monthday) 38.75573 99 187.231 < 2e-16 *** te(temp_max,lag) 4.06437 37 25.927 1.66e-06 *** te(precip_daily_total,lag) 0.01173 36 0.008 0.557 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.839 Deviance explained = 53.8% -REML = 8915 Scale est. = 1 n = 5107 ``` ```r $worst para te(year,month,monthday) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 3.261007e-31 0.3313549 0.6666532 te(year,month,monthday) 3.060763e-31 1.000000e+00 0.8266086 0.5670777 te(temp_max,lag) 3.331014e-01 8.225942e-01 1.0000000 0.5840875 te(precip_daily_total,lag) 6.666532e-01 5.670777e-01 0.5939380 1.0000000 ``` Modelling time as ```te(year, doy)``` with a cyclic spline for doy and various choices for k or as ```s(time)``` with various k does not reduce concurvity either. The default approach in time series studies of heat-mortality is to model time with fixed df, generally between 7-10 df per year of data. I am, however, apprehensive about this approach because a) mortality profiles vary with locality due to sociodemographic and environmental characteristics and b) the choice of df is based on higher income countries (where nearly all these studies have been done) with different mortality profiles and so may not be appropriate for tropical, low-income countries. Although the approach of fixing (high) df does remove more temporal patterns from the ACF (see model and output below), concurvity between time and lagged temperature has now risen to 0.99! Moreover, temperature (which has been a consistent, highly significant predictor in every model of the tens (hundreds?) I have run, has now turned non-significant. I am guessing this is because time is now a very wiggly function that not only models/ removes seasonal variation, but also some of the day-to-day variation that is needed for the temperature smooth : ```r conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + te(temp_max, lag, k=c(10,3)) + te(precip_daily_total, lag, k=c(10,3)), data = dat, family = nb, method = 'REML', select = TRUE) ``` Output from ```gam.check(conc20a, rep = 1000)```: ```r Method: REML Optimizer: outer newton full convergence after 9 iterations. Gradient range [-0.0008983099,9.546022e-05] (score 8750.13 & scale 1). Hessian positive definite, eigenvalue range [0.0001420112,15.40832]. Model rank = 336 / 336 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(time) 111.0000 111.0000 0.98 0.56 te(temp_max,lag) 29.0000 0.6548 NA NA te(precip_daily_total,lag) 27.0000 0.0046 NA NA ``` Output from ```concurvity(conc20a, full=FALSE)$worst```: ```r para s(time) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 2.462064e-19 0.3165236 0.6666348 s(time) 2.462398e-19 1.000000e+00 0.9930674 0.6879284 te(temp_max,lag) 3.170844e-01 9.356384e-01 1.0000000 0.5788711 te(precip_daily_total,lag) 6.666348e-01 6.879284e-01 0.5788381 1.0000000 ``` Some output from ```summary(conc20a)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(time) 1.110e+02 111 419.375 <2e-16 *** te(temp_max,lag) 6.548e-01 27 0.895 0.249 te(precip_daily_total,lag) 4.598e-03 27 0.002 0.868 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.843 Deviance explained = 56.1% -REML = 8750.1 Scale est. = 1 n = 5107 ``` ACF functions: [4]: https://i.stack.imgur.com/7nbXS.png [5]: https://i.stack.imgur.com/pNnZU.png Data can be found on my [GitHub][6] site in the file [data_cross_validated_post2.rds][7]. A csv version is also available. This is my code: ```r library(readr) library(mgcv) df <- read_rds("data_crossvalidated_post2.rds") # Create matrices for lagged weather variables (6 day lags) based on example by Simon Wood # in his 2017 book ("Generalized additive models: an introduction with R", p. 349) and # gamair package documentation (https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54) lagard <- function(x,n.lag=7) { n <- length(x); X <- matrix(NA,n,n.lag) for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] X } dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE), deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month, weekday = df$weekday, week = df$week, monthday = df$monthday, time df$time, heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum = df$precip_hourly_dailysum) dat$temp_max <- lagard(df$temp_max) dat$temp_min <- lagard(df$temp_min) dat$temp_mean <- lagard(df$temp_mean) dat$wbgt_max <- lagard(df$wbgt_max) dat$wbgt_mean <- lagard(df$wbgt_mean) dat$wbgt_min <- lagard(df$wbgt_min) dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max) dat$sh_mean <- lagard(df$sh_mean) dat$solar_mean <- lagard(df$solar_mean) dat$wind2m_mean <- lagard(df$wind2m_mean) dat$sh_max <- lagard(df$sh_max) dat$solar_max <- lagard(df$solar_max) dat$wind2m_max <- lagard(df$wind2m_max) dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean) dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum) dat$precip_hourly <- lagard(df$precip_hourly) dat$precip_daily_total <- lagard( df$precip_daily_total) dat$temp <- lagard(df$temp) dat$sh <- lagard(df$sh) dat$rh <- lagard(df$rh) dat$solar <- lagard(df$solar) dat$wind2m <- lagard(df$wind2m) conc38b <- gam(deaths~te(year, month, week, weekday, bs=c("cr","cc","cc","cc")) + heap + te(temp_max, lag, k=c(10, 3)) + te(precip_daily_total, lag, k=c(10, 3)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), weekday = c(0, 6.5))) conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + te(temp_max, lag, k=c(10, 4)) + te(precip_daily_total, lag, k=c(10, 4)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5))) conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + te(temp_max, lag, k=c(10,3)) + te(precip_daily_total, lag, k=c(10,3)), data = dat, family = nb, method = 'REML', select = TRUE) ``` Thank you if you've read this far!! :-)) [1]: https://scholar.google.co.uk/scholar?output=instlink&q=info:PKdjq7ZwozEJ:scholar.google.com/&hl=en&as_sdt=0,5&scillfp=17865929886710916120&oi=lle [2]: https://i.stack.imgur.com/FzKyM.png [3]: https://i.stack.imgur.com/fE3aL.png [4]: https://i.stack.imgur.com/7nbXS.png [5]: https://i.stack.imgur.com/pNnZU.png [6]: https://github.com/JadeShodan/heat-mortality [7]: https://github.com/JadeShodan/heat-mortality/blob/main/data_cross_validated_post2.rds
Ebert,Timothy Aaron
2022-Jun-06 00:54 UTC
[R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?
You are welcome to ask here. However, you should try contacting the authors of the gam package. Package authors are often extraordinarily helpful. Tim -----Original Message----- From: R-help <r-help-bounces at r-project.org> On Behalf Of jade.shodan--- via R-help Sent: Sunday, June 5, 2022 3:02 PM To: r-help at r-project.org Subject: [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter? [External Email] Hello everyone, A few days ago I asked a question about concurvity in a GAM (the anologue of collinearity in a GLM) implemented in mgcv. I think my question was a bit unfocussed, so I am retrying again, but with additional information included about the autocorrelation function. I have also posted about this on Cross Validated. Given all the model output, it might make for easier reading:https://urldefense.proofpoint.com/v2/url?u=https-3A__stats.stackexchange.com_questions_577790_high-2Dconcurvity-2Dcollinearity-2Dbetween-2Dtime-2Dand-2Dtemperature-2Din-2Dgam-2Dpredicting-2Ddea&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=UnxUj1p0u7yNcAwVR0Na6FmRBibHxhuhscEcpFA2qRQ&e As mentioned previously, I have problems with concurvity in my thesis research, and don't have access to a statistician who works with time series, GAMs or R. I'd be very grateful for any (partial) answer, however short. I'll gladly return the favour where I can! For really helpful input I'd be more than happy to offer co-authorship on publication. Deadlines are very close, and I'm heading towards having no results at all if I can't solve this concurvity issue :( I'm using GAMs to try to understand the relationship between deaths and heat-related variables (e.g. temperature and humidity), using daily time series over a 14-year period from a tropical, low-income country. My aim is to understand the relationship between these variables and deaths, rather than pure prediction performance. The GAMs include distributed lag models (set up as 7-column matrices, see code at bottom of post), since deaths may occur over several days following exposure. Simple GAMs with just time, lagged temperature and lagged precipitation (a potential confounder) show very high concurvity between lagged temperature and time, regardless of the many different ways I have tried to decompose time. The autocorrelation functions (ACF) however, shows values close to zero, only just about breaching the 'significance line' in a few instances. It does show patterning though, although the regularity is difficult to define. My questions are: 1) Should I be worried about the high concurvity, or can I ignore it given the mostly non-significant ACF? I've read dozens of heat-mortality modelling studies and none report on concurvity between weather variables and time (though one 2012 paper discussed autocorrelation). 2) If I cannot ignore it, what should I do to resolve it? Would including an autoregressive term be appropriate, and if so, where can I find a coded example of how to do this? I've also come across sequential regression][1]. Would this be more or less appropriate? If appropriate, a pointer to an example would be really appreciated! Some example GAMs are specified as follows: ```r conc38b <- gam(deaths~te(year, month, week, weekday, bs=c("cr","cc","cc","cc")) + heap + te(temp_max, lag, k=c(10, 3)) + te(precip_daily_total, lag, k=c(10, 3)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), weekday = c(0, 6.5))) ``` Concurvity for the above model between (temp_max, lag) and (year, month, week, weekday) is 0.91: ```r $worst para te(year,month,week,weekday) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 1.125625e-29 0.3150073 0.6666348 te(year,month,week,weekday) 1.400648e-29 1.000000e+00 0.9060552 0.6652313 te(temp_max,lag) 3.152795e-01 8.998113e-01 1.0000000 0.5781015 te(precip_daily_total,lag) 6.666348e-01 6.652313e-01 0.5805159 1.0000000 ``` Output from ```gam.check()```: ```r Method: REML Optimizer: outer newton full convergence after 16 iterations. Gradient range [-0.01467332,0.003096643] (score 8915.994 & scale 1). Hessian positive definite, eigenvalue range [5.048053e-05,26.50175]. Model rank = 544 / 544 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value te(year,month,week,weekday) 319.0000 26.6531 0.96 0.06 . te(temp_max,lag) 29.0000 3.3681 NA NA te(precip_daily_total,lag) 27.0000 0.0051 NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 ``` Some output from ```summary(conc38b)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value te(year,month,week,weekday) 26.653127 319 166.803 < 2e-16 *** te(temp_max,lag) 3.368129 27 11.130 0.00145 ** te(precip_daily_total,lag) 0.005104 27 0.002 0.69636 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.839 Deviance explained = 53.3% -REML = 8916 Scale est. = 1 n = 5107 ``` Below are the ACF plots (note limit y-axis = 0.1 for clarity of pattern). They show peaks at 5 and 15, and then there seems to be a recurring pattern at multiples of approx. 30 (suggesting month is not modelled adequately?). Not sure what would cause the spikes at 5 and 15. There is heaping of deaths on the 15th day of each month, to which deaths with unknown date were allocated. This heaping was modelled with categorical variable/ factor ```heap``` with 169 levels (0 for all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent heaping day over the 14-year period): [2]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_FzKyM.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=rhd6ZkNNDyYd1zntgAjnNzZFkYPFica9xzx9ruBHG9g&e [3]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_fE3aL.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=DUqm7oXz2zc3oaDR6ESbWGKZdinIsZf-ULGgDsyIOfM&e I get an identical looking ACF when I decompose time into (year, month, monthday) as in model conc39 below, although concurvity between (temp_max, lag) and the time term has now dropped somewhat to 0.83: ```r conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + te(temp_max, lag, k=c(10, 4)) + te(precip_daily_total, lag, k=c(10, 4)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5))) ``` ```r Method: REML Optimizer: outer newton full convergence after 14 iterations. Gradient range [-0.001578187,6.155096e-05] (score 8915.763 & scale 1). Hessian positive definite, eigenvalue range [1.894391e-05,26.99215]. Model rank = 323 / 323 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value te(year,month,monthday) 79.0000 25.0437 0.93 <2e-16 *** te(temp_max,lag) 39.0000 4.0875 NA NA te(precip_daily_total,lag) 36.0000 0.0107 NA NA --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 ``` Some output from ```summary(conc39)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value te(year,month,monthday) 38.75573 99 187.231 < 2e-16 *** te(temp_max,lag) 4.06437 37 25.927 1.66e-06 *** te(precip_daily_total,lag) 0.01173 36 0.008 0.557 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.839 Deviance explained = 53.8% -REML = 8915 Scale est. = 1 n = 5107 ``` ```r $worst para te(year,month,monthday) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 3.261007e-31 0.3313549 0.6666532 te(year,month,monthday) 3.060763e-31 1.000000e+00 0.8266086 0.5670777 te(temp_max,lag) 3.331014e-01 8.225942e-01 1.0000000 0.5840875 te(precip_daily_total,lag) 6.666532e-01 5.670777e-01 0.5939380 1.0000000 ``` Modelling time as ```te(year, doy)``` with a cyclic spline for doy and various choices for k or as ```s(time)``` with various k does not reduce concurvity either. The default approach in time series studies of heat-mortality is to model time with fixed df, generally between 7-10 df per year of data. I am, however, apprehensive about this approach because a) mortality profiles vary with locality due to sociodemographic and environmental characteristics and b) the choice of df is based on higher income countries (where nearly all these studies have been done) with different mortality profiles and so may not be appropriate for tropical, low-income countries. Although the approach of fixing (high) df does remove more temporal patterns from the ACF (see model and output below), concurvity between time and lagged temperature has now risen to 0.99! Moreover, temperature (which has been a consistent, highly significant predictor in every model of the tens (hundreds?) I have run, has now turned non-significant. I am guessing this is because time is now a very wiggly function that not only models/ removes seasonal variation, but also some of the day-to-day variation that is needed for the temperature smooth : ```r conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + te(temp_max, lag, k=c(10,3)) + te(precip_daily_total, lag, k=c(10,3)), data = dat, family = nb, method = 'REML', select = TRUE) ``` Output from ```gam.check(conc20a, rep = 1000)```: ```r Method: REML Optimizer: outer newton full convergence after 9 iterations. Gradient range [-0.0008983099,9.546022e-05] (score 8750.13 & scale 1). Hessian positive definite, eigenvalue range [0.0001420112,15.40832]. Model rank = 336 / 336 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(time) 111.0000 111.0000 0.98 0.56 te(temp_max,lag) 29.0000 0.6548 NA NA te(precip_daily_total,lag) 27.0000 0.0046 NA NA ``` Output from ```concurvity(conc20a, full=FALSE)$worst```: ```r para s(time) te(temp_max,lag) te(precip_daily_total,lag) para 1.000000e+00 2.462064e-19 0.3165236 0.6666348 s(time) 2.462398e-19 1.000000e+00 0.9930674 0.6879284 te(temp_max,lag) 3.170844e-01 9.356384e-01 1.0000000 0.5788711 te(precip_daily_total,lag) 6.666348e-01 6.879284e-01 0.5788381 1.0000000 ``` Some output from ```summary(conc20a)```: ```r Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(time) 1.110e+02 111 419.375 <2e-16 *** te(temp_max,lag) 6.548e-01 27 0.895 0.249 te(precip_daily_total,lag) 4.598e-03 27 0.002 0.868 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.843 Deviance explained = 56.1% -REML = 8750.1 Scale est. = 1 n = 5107 ``` ACF functions: [4]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_7nbXS.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=hXhWH-VySi9i27hNDKK184WiLooYmhdni7_7JOLhRcI&e[5]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_pNnZU.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=HV6sMbzNkG-NdJZBjZRAjrvCDl2VtMPfl5rY2Ss8muM&e Data can be found on my [GitHub][6] site in the file [data_cross_validated_post2.rds][7]. A csv version is also available. This is my code: ```r library(readr) library(mgcv) df <- read_rds("data_crossvalidated_post2.rds") # Create matrices for lagged weather variables (6 day lags) based on example by Simon Wood # in his 2017 book ("Generalized additive models: an introduction with R", p. 349) and # gamair package documentation (https://urldefense.proofpoint.com/v2/url?u=https-3A__cran.r-2Dproject.org_web_packages_gamair_gamair.pdf&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=RmLMXgEhg76PLgzNq7CbZkD29EGQili4Pkd7ESctrJ0&e= , p. 54) lagard <- function(x,n.lag=7) { n <- length(x); X <- matrix(NA,n,n.lag) for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] X } dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE), deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month, weekday = df$weekday, week = df$week, monthday = df$monthday, time = df$time, heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum = df$precip_hourly_dailysum) dat$temp_max <- lagard(df$temp_max) dat$temp_min <- lagard(df$temp_min) dat$temp_mean <- lagard(df$temp_mean) dat$wbgt_max <- lagard(df$wbgt_max) dat$wbgt_mean <- lagard(df$wbgt_mean) dat$wbgt_min <- lagard(df$wbgt_min) dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max) dat$sh_mean <- lagard(df$sh_mean) dat$solar_mean <- lagard(df$solar_mean) dat$wind2m_mean <- lagard(df$wind2m_mean) dat$sh_max <- lagard(df$sh_max) dat$solar_max <- lagard(df$solar_max) dat$wind2m_max <- lagard(df$wind2m_max) dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean) dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum) dat$precip_hourly <- lagard(df$precip_hourly) dat$precip_daily_total <- lagard( df$precip_daily_total) dat$temp <- lagard(df$temp) dat$sh <- lagard(df$sh) dat$rh <- lagard(df$rh) dat$solar <- lagard(df$solar) dat$wind2m <- lagard(df$wind2m) conc38b <- gam(deaths~te(year, month, week, weekday, bs=c("cr","cc","cc","cc")) + heap + te(temp_max, lag, k=c(10, 3)) + te(precip_daily_total, lag, k=c(10, 3)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), weekday = c(0, 6.5))) conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + te(temp_max, lag, k=c(10, 4)) + te(precip_daily_total, lag, k=c(10, 4)), data = dat, family = nb, method = 'REML', select = TRUE, knots = list(month = c(0.5, 12.5))) conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + te(temp_max, lag, k=c(10,3)) + te(precip_daily_total, lag, k=c(10,3)), data = dat, family = nb, method = 'REML', select = TRUE) ``` Thank you if you've read this far!! :-)) [1]: https://urldefense.proofpoint.com/v2/url?u=https-3A__scholar.google.co.uk_scholar-3Foutput-3Dinstlink-26q-3Dinfo-3APKdjq7ZwozEJ-3Ascholar.google.com_-26hl-3Den-26as-5Fsdt-3D0-2C5-26scillfp-3D17865929886710916120-26oi-3Dlle&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=05YTFJr01J0QkoraKsVufRrfVvevvADTCCCjxskRbfY&e [2]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_FzKyM.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=rhd6ZkNNDyYd1zntgAjnNzZFkYPFica9xzx9ruBHG9g&e [3]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_fE3aL.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=DUqm7oXz2zc3oaDR6ESbWGKZdinIsZf-ULGgDsyIOfM&e [4]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_7nbXS.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=hXhWH-VySi9i27hNDKK184WiLooYmhdni7_7JOLhRcI&e [5]: https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_pNnZU.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=HV6sMbzNkG-NdJZBjZRAjrvCDl2VtMPfl5rY2Ss8muM&e [6]: https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_JadeShodan_heat-2Dmortality&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=zxKWkmFT-DWpRAx6t0DRbV9ldSbgyIE6V3LdJBm4ULU&e [7]: https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_JadeShodan_heat-2Dmortality_blob_main_data-5Fcross-5Fvalidated-5Fpost2.rds&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=VIm9gkrPHFVXJFZeg6sMnKYGcLD5HR3BKqh4z8iIQjc&e ______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Dhelp&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=QP0xRvlj8tppy7hmTxhzrD62Hd1N4mXYmPKW48XiiRg&ePLEASE do read the posting guide https://urldefense.proofpoint.com/v2/url?u=http-3A__www.R-2Dproject.org_posting-2Dguide.html&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=M7FO2Q09m7Or14GXa16BtzX1E8Xa5AWB-eN9ZTGlcjA&eand provide commented, minimal, self-contained, reproducible code.
Simon Wood
2022-Jun-08 11:04 UTC
[R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?
I would not worry too much about high concurvity between variables like temperature and time. This just reflects the fact that temperature has a strong temporal pattern. I would also not be too worried about the low p-values on the k check. The check only looks for pattern in the residuals when they are ordered with respect to the variables of a smooth. When you have time series data and some smooths involve time then it's hard not to pick up some degree of residual auto-correlation, which you often would not want to model with a higher rank smoother. The NAs? for the distributed lag terms just reflect the fact that there is no obvious way to order the residuals w.r.t. the covariates for such terms, so the simple check for residual pattern is not really possible. One simple approach is to fit using bam(...,discrete=TRUE) which will let you specify an AR1 parameter to mop up some of the residual auto-correlation without resorting to a high rank smooth that then does all the work of the covariates as well. The AR1 parameter can be set by looking at the ACF of the residuals of the model without this. You need to look at the ACF of suitably standardized residuals to check how well this has worked. best, Simon On 05/06/2022 20:01, jade.shodan--- via R-help wrote:> Hello everyone, > > A few days ago I asked a question about concurvity in a GAM (the > anologue of collinearity in a GLM) implemented in mgcv. I think my > question was a bit unfocussed, so I am retrying again, but with > additional information included about the autocorrelation function. I > have also posted about this on Cross Validated. Given all the model > output, it might make for easier > reading:https://stats.stackexchange.com/questions/577790/high-concurvity-collinearity-between-time-and-temperature-in-gam-predicting-dea > > As mentioned previously, I have problems with concurvity in my thesis > research, and don't have access to a statistician who works with time > series, GAMs or R. I'd be very grateful for any (partial) answer, > however short. I'll gladly return the favour where I can! For really > helpful input I'd be more than happy to offer co-authorship on > publication. Deadlines are very close, and I'm heading towards having > no results at all if I can't solve this concurvity issue :( > > I'm using GAMs to try to understand the relationship between deaths > and heat-related variables (e.g. temperature and humidity), using > daily time series over a 14-year period from a tropical, low-income > country. My aim is to understand the relationship between these > variables and deaths, rather than pure prediction performance. > > The GAMs include distributed lag models (set up as 7-column matrices, > see code at bottom of post), since deaths may occur over several days > following exposure. > > Simple GAMs with just time, lagged temperature and lagged > precipitation (a potential confounder) show very high concurvity > between lagged temperature and time, regardless of the many different > ways I have tried to decompose time. The autocorrelation functions > (ACF) however, shows values close to zero, only just about breaching > the 'significance line' in a few instances. It does show patterning > though, although the regularity is difficult to define. > > My questions are: > 1) Should I be worried about the high concurvity, or can I ignore it > given the mostly non-significant ACF? I've read dozens of > heat-mortality modelling studies and none report on concurvity between > weather variables and time (though one 2012 paper discussed > autocorrelation). > > 2) If I cannot ignore it, what should I do to resolve it? Would > including an autoregressive term be appropriate, and if so, where can > I find a coded example of how to do this? I've also come across > sequential regression][1]. Would this be more or less appropriate? If > appropriate, a pointer to an example would be really appreciated! > > Some example GAMs are specified as follows: > ```r > conc38b <- gam(deaths~te(year, month, week, weekday, > bs=c("cr","cc","cc","cc")) + heap + > te(temp_max, lag, k=c(10, 3)) + > te(precip_daily_total, lag, k=c(10, 3)), > data = dat, family = nb, method = 'REML', select = TRUE, > knots = list(month = c(0.5, 12.5), week = c(0.5, > 52.5), weekday = c(0, 6.5))) > ``` > Concurvity for the above model between (temp_max, lag) and (year, > month, week, weekday) is 0.91: > > ```r > $worst > para te(year,month,week,weekday) > te(temp_max,lag) te(precip_daily_total,lag) > para 1.000000e+00 1.125625e-29 > 0.3150073 0.6666348 > te(year,month,week,weekday) 1.400648e-29 1.000000e+00 > 0.9060552 0.6652313 > te(temp_max,lag) 3.152795e-01 8.998113e-01 > 1.0000000 0.5781015 > te(precip_daily_total,lag) 6.666348e-01 6.652313e-01 > 0.5805159 1.0000000 > ``` > > Output from ```gam.check()```: > ```r > Method: REML Optimizer: outer newton > full convergence after 16 iterations. > Gradient range [-0.01467332,0.003096643] > (score 8915.994 & scale 1). > Hessian positive definite, eigenvalue range [5.048053e-05,26.50175]. > Model rank = 544 / 544 > > Basis dimension (k) checking results. Low p-value (k-index<1) may > indicate that k is too low, especially if edf is close to k'. > > k' edf k-index p-value > te(year,month,week,weekday) 319.0000 26.6531 0.96 0.06 . > te(temp_max,lag) 29.0000 3.3681 NA NA > te(precip_daily_total,lag) 27.0000 0.0051 NA NA > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > ``` > > Some output from ```summary(conc38b)```: > ```r > Approximate significance of smooth terms: > edf Ref.df Chi.sq p-value > te(year,month,week,weekday) 26.653127 319 166.803 < 2e-16 *** > te(temp_max,lag) 3.368129 27 11.130 0.00145 ** > te(precip_daily_total,lag) 0.005104 27 0.002 0.69636 > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > R-sq.(adj) = 0.839 Deviance explained = 53.3% > -REML = 8916 Scale est. = 1 n = 5107 > ``` > > > Below are the ACF plots (note limit y-axis = 0.1 for clarity of > pattern). They show peaks at 5 and 15, and then there seems to be a > recurring pattern at multiples of approx. 30 (suggesting month is not > modelled adequately?). Not sure what would cause the spikes at 5 and > 15. There is heaping of deaths on the 15th day of each month, to which > deaths with unknown date were allocated. This heaping was modelled > with categorical variable/ factor ```heap``` with 169 levels (0 for > all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent > heaping day over the 14-year period): > > [2]: https://i.stack.imgur.com/FzKyM.png > [3]: https://i.stack.imgur.com/fE3aL.png > > > I get an identical looking ACF when I decompose time into (year, > month, monthday) as in model conc39 below, although concurvity between > (temp_max, lag) and the time term has now dropped somewhat to 0.83: > > ```r > conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + > te(temp_max, lag, k=c(10, 4)) + > te(precip_daily_total, lag, k=c(10, 4)), > data = dat, family = nb, method = 'REML', select = TRUE, > knots = list(month = c(0.5, 12.5))) > ``` > ```r > > Method: REML Optimizer: outer newton > full convergence after 14 iterations. > Gradient range [-0.001578187,6.155096e-05] > (score 8915.763 & scale 1). > Hessian positive definite, eigenvalue range [1.894391e-05,26.99215]. > Model rank = 323 / 323 > > Basis dimension (k) checking results. Low p-value (k-index<1) may > indicate that k is too low, especially if edf is close to k'. > > k' edf k-index p-value > te(year,month,monthday) 79.0000 25.0437 0.93 <2e-16 *** > te(temp_max,lag) 39.0000 4.0875 NA NA > te(precip_daily_total,lag) 36.0000 0.0107 NA NA > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > ``` > Some output from ```summary(conc39)```: > ```r > Approximate significance of smooth terms: > edf Ref.df Chi.sq p-value > te(year,month,monthday) 38.75573 99 187.231 < 2e-16 *** > te(temp_max,lag) 4.06437 37 25.927 1.66e-06 *** > te(precip_daily_total,lag) 0.01173 36 0.008 0.557 > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > R-sq.(adj) = 0.839 Deviance explained = 53.8% > -REML = 8915 Scale est. = 1 n = 5107 > ``` > > > ```r > $worst > para te(year,month,monthday) > te(temp_max,lag) te(precip_daily_total,lag) > para 1.000000e+00 3.261007e-31 > 0.3313549 0.6666532 > te(year,month,monthday) 3.060763e-31 1.000000e+00 > 0.8266086 0.5670777 > te(temp_max,lag) 3.331014e-01 8.225942e-01 > 1.0000000 0.5840875 > te(precip_daily_total,lag) 6.666532e-01 5.670777e-01 > 0.5939380 1.0000000 > ``` > > Modelling time as ```te(year, doy)``` with a cyclic spline for doy and > various choices for k or as ```s(time)``` with various k does not > reduce concurvity either. > > > The default approach in time series studies of heat-mortality is to > model time with fixed df, generally between 7-10 df per year of data. > I am, however, apprehensive about this approach because a) mortality > profiles vary with locality due to sociodemographic and environmental > characteristics and b) the choice of df is based on higher income > countries (where nearly all these studies have been done) with > different mortality profiles and so may not be appropriate for > tropical, low-income countries. > > Although the approach of fixing (high) df does remove more temporal > patterns from the ACF (see model and output below), concurvity between > time and lagged temperature has now risen to 0.99! Moreover, > temperature (which has been a consistent, highly significant predictor > in every model of the tens (hundreds?) I have run, has now turned > non-significant. I am guessing this is because time is now a very > wiggly function that not only models/ removes seasonal variation, but > also some of the day-to-day variation that is needed for the > temperature smooth : > > ```r > conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + > te(temp_max, lag, k=c(10,3)) + > te(precip_daily_total, lag, k=c(10,3)), > data = dat, family = nb, method = 'REML', select = TRUE) > ``` > Output from ```gam.check(conc20a, rep = 1000)```: > > ```r > Method: REML Optimizer: outer newton > full convergence after 9 iterations. > Gradient range [-0.0008983099,9.546022e-05] > (score 8750.13 & scale 1). > Hessian positive definite, eigenvalue range [0.0001420112,15.40832]. > Model rank = 336 / 336 > > Basis dimension (k) checking results. Low p-value (k-index<1) may > indicate that k is too low, especially if edf is close to k'. > > k' edf k-index p-value > s(time) 111.0000 111.0000 0.98 0.56 > te(temp_max,lag) 29.0000 0.6548 NA NA > te(precip_daily_total,lag) 27.0000 0.0046 NA NA > ``` > Output from ```concurvity(conc20a, full=FALSE)$worst```: > > ```r > para s(time) te(temp_max,lag) > te(precip_daily_total,lag) > para 1.000000e+00 2.462064e-19 0.3165236 > 0.6666348 > s(time) 2.462398e-19 1.000000e+00 0.9930674 > 0.6879284 > te(temp_max,lag) 3.170844e-01 9.356384e-01 1.0000000 > 0.5788711 > te(precip_daily_total,lag) 6.666348e-01 6.879284e-01 0.5788381 > 1.0000000 > > ``` > > Some output from ```summary(conc20a)```: > ```r > Approximate significance of smooth terms: > edf Ref.df Chi.sq p-value > s(time) 1.110e+02 111 419.375 <2e-16 *** > te(temp_max,lag) 6.548e-01 27 0.895 0.249 > te(precip_daily_total,lag) 4.598e-03 27 0.002 0.868 > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > R-sq.(adj) = 0.843 Deviance explained = 56.1% > -REML = 8750.1 Scale est. = 1 n = 5107 > ``` > > ACF functions: > > [4]: https://i.stack.imgur.com/7nbXS.png > [5]: https://i.stack.imgur.com/pNnZU.png > > Data can be found on my [GitHub][6] site in the file > [data_cross_validated_post2.rds][7]. A csv version is also available. > This is my code: > > ```r > library(readr) > library(mgcv) > > df <- read_rds("data_crossvalidated_post2.rds") > > # Create matrices for lagged weather variables (6 day lags) based on > example by Simon Wood > # in his 2017 book ("Generalized additive models: an introduction with > R", p. 349) and > # gamair package documentation > (https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54) > > lagard <- function(x,n.lag=7) { > n <- length(x); X <- matrix(NA,n,n.lag) > for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] > X > } > > dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE), > deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month, > weekday = df$weekday, week = df$week, monthday = df$monthday, time > df$time, heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum > = df$precip_hourly_dailysum) > dat$temp_max <- lagard(df$temp_max) > dat$temp_min <- lagard(df$temp_min) > dat$temp_mean <- lagard(df$temp_mean) > dat$wbgt_max <- lagard(df$wbgt_max) > dat$wbgt_mean <- lagard(df$wbgt_mean) > dat$wbgt_min <- lagard(df$wbgt_min) > dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max) > dat$sh_mean <- lagard(df$sh_mean) > dat$solar_mean <- lagard(df$solar_mean) > dat$wind2m_mean <- lagard(df$wind2m_mean) > dat$sh_max <- lagard(df$sh_max) > dat$solar_max <- lagard(df$solar_max) > dat$wind2m_max <- lagard(df$wind2m_max) > dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean) > dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum) > dat$precip_hourly <- lagard(df$precip_hourly) > dat$precip_daily_total <- lagard( df$precip_daily_total) > dat$temp <- lagard(df$temp) > dat$sh <- lagard(df$sh) > dat$rh <- lagard(df$rh) > dat$solar <- lagard(df$solar) > dat$wind2m <- lagard(df$wind2m) > > > conc38b <- gam(deaths~te(year, month, week, weekday, > bs=c("cr","cc","cc","cc")) + heap + > te(temp_max, lag, k=c(10, 3)) + > te(precip_daily_total, lag, k=c(10, 3)), > data = dat, family = nb, method = 'REML', select = TRUE, > knots = list(month = c(0.5, 12.5), week = c(0.5, > 52.5), weekday = c(0, 6.5))) > > conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap + > te(temp_max, lag, k=c(10, 4)) + > te(precip_daily_total, lag, k=c(10, 4)), > data = dat, family = nb, method = 'REML', select = TRUE, > knots = list(month = c(0.5, 12.5))) > > conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap + > te(temp_max, lag, k=c(10,3)) + > te(precip_daily_total, lag, k=c(10,3)), > data = dat, family = nb, method = 'REML', select = TRUE) > > ``` > Thank you if you've read this far!! :-)) > > [1]: https://scholar.google.co.uk/scholar?output=instlink&q=info:PKdjq7ZwozEJ:scholar.google.com/&hl=en&as_sdt=0,5&scillfp=17865929886710916120&oi=lle > [2]: https://i.stack.imgur.com/FzKyM.png > [3]: https://i.stack.imgur.com/fE3aL.png > [4]: https://i.stack.imgur.com/7nbXS.png > [5]: https://i.stack.imgur.com/pNnZU.png > [6]: https://github.com/JadeShodan/heat-mortality > [7]: https://github.com/JadeShodan/heat-mortality/blob/main/data_cross_validated_post2.rds > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Simon Wood, School of Mathematics, University of Edinburgh, https://www.maths.ed.ac.uk/~swood34/