Sebastien Bihorel
2017-Oct-24 01:57 UTC
[R] Issue of reproducibility with gam and lm.wfit in different versions of R
Dear R users, I recently stumbled upon problems of reproducibility while running GAM analyses in different R and gam package versions. In the example below, a small dataset is created in which the y and x1 variables are 100% correlated. The intents of this example were primarily for regression testing and, secondarily, to evaluate how the gam algorithm behaves under extreme/limit conditions. I ran this little snippet in 5 different environments and got 100% consistent results until I switched to R 3.3.2. * Comparing results from environments 1, 2, and 3 shows that changing the version of the gam package did not change the results under R 3.3.0. * Comparing results from environments 3 and 4 shows that changing the version of R altered the values of the AIC and the output of the step.gam call (changed to a NULL object) * Comparing results from environments 4 and 5 shows that reverting to an older version of the gam package in R 3.3.2 still produced altered AIC values and the NULL output from step.gam call Further investigations into these differences seem to show that the lm.wfit call in the gam.fit function (called from within gam and step.gam) may result in different values in R 3.3.0 and 3.3.2. So my questions are 2-fold: 1- Would you have any information about why lm.wfit would produce different outcomes in R 3.3.0 and 3.3.2? The source code does not appear significantly different. 2- Is it expected that the step.gam function returns a NULL object in R 3.3.2 while a valid model has been identified? Looking at the source of step.gam, the line 157 (if(is.null(form.list)) break) seems to be the reason the function breaks out and returns a NULL value. I thank you in advance for your time. Sebastien Bihorel library(gam) dat <- data.frame( y = c(57,57,98,83,122,69,108,86,80,87,75,76,97,101,121,111,105,84,65,54,61,71,125,60,50,112,102,110,77,45,93,62,120,115,70,113,117,85,46,123,89,95,116,55,110,92,109,100,72,88,105,119,94,45,67,58,60,45,107,73,100,79,47,99,51,53,68,125,90,48,82,85,65,52,70,59,125,49,118,103,91,124,78,81,63,63), x1 = c(52,52,93,78,117,64,103,81,75,82,70,71,92,96,116,106,100,79,60,49,56,66,120,55,45,107,97,105,72,40,88,57,115,110,65,108,112,80,41,118,84,90,111,50,105,87,104,95,67,83,100,114,89,40,62,53,55,40,102,68,95,74,42,94,46,48,63,120,85,43,77,80,60,47,65,54,120,44,113,98,86,119,73,76,58,58), x2 = c(0.0001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0001,0) ) summary(dat) scope <- list( x1=c('1','x1','ns(x1, df=2)'), x2=c('1','x2') ) gam.object <- gam(y~1, data=dat) step.object <- step.gam(gam.object, scope=scope, trace=2) is.null(step.object) # Run this if you want to evaluate one lm.wfit example call made from within gam functions x <- cbind(rep(1,nrow(dat)), dat$x1) names(x) <- c('Intercept', 'x1') z <- dat$y w <- rep(1,nrow(dat)) str(eval(expression(lm.wfit(x, z, w, method = "qr", singular.ok = TRUE)))) ###################################################################### #1 R 3.1.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.09.1 ######################################################################> step.object <- step.gam(gam.object, scope=scope, trace=2)Start: y ~ 1; AIC= 797.0034 Trial: y ~ x1 + 1 ; AIC= -5121.796 Trial: y ~ 1 + x2 ; AIC= 796.915 Step:1 y ~ x1 ; AIC= -5121.796 Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816 Trial: y ~ x1 + x2 ; AIC= -5174.408 Step:2 y ~ x1 + x2 ; AIC= -5174.408 Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829> is.null(step.object)[1] FALSE ###################################################################### #2: R 3.3.0 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.12 ######################################################################> step.object <- step.gam(gam.object, scope=scope, trace=2)Start: y ~ 1; AIC= 797.0034 Trial: y ~ x1 + 1 ; AIC= -5121.796 Trial: y ~ 1 + x2 ; AIC= 796.915 Step:1 y ~ x1 ; AIC= -5121.796 Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816 Trial: y ~ x1 + x2 ; AIC= -5174.408 Step:2 y ~ x1 + x2 ; AIC= -5174.408 Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829> is.null(step.object)[1] FALSE ###################################################################### #3: R 3.3.0 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.14-4 ######################################################################> step.object <- step.gam(gam.object, scope=scope, trace=2)Start: y ~ 1; AIC= 797.0034 Trial: y ~ x1 + 1 ; AIC= -5121.796 Trial: y ~ 1 + x2 ; AIC= 796.915 Step:1 y ~ x1 ; AIC= -5121.796 Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816 Trial: y ~ x1 + x2 ; AIC= -5174.408 Step:2 y ~ x1 + x2 ; AIC= -5174.408 Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829> is.null(step.object)[1] FALSE ###################################################################### #4: R 3.3.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.14-4 ######################################################################> step.object <- step.gam(gam.object, scope=scope, trace=2)Start: y ~ 1; AIC= 797.0034 Trial: y ~ x1 + 1 ; AIC= -5122.886 Trial: y ~ 1 + x2 ; AIC= 796.915 Step:1 y ~ x1 ; AIC= -5122.886 Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5177.531 Trial: y ~ x1 + x2 ; AIC= -5177.531 Step:2 y ~ ns(x1, df = 2) ; AIC= -5177.531 Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5222.703 Step:3 y ~ ns(x1, df = 2) + x2 ; AIC= -5222.703> is.null(step.object)[1] TRUE ###################################################################### #5: R 3.3.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.09.1 ######################################################################> step.object <- step.gam(gam.object, scope=scope, trace=2)Start: y ~ 1; AIC= 797.0034 Trial: y ~ x1 + 1 ; AIC= -5122.886 Trial: y ~ 1 + x2 ; AIC= 796.915 Step:1 y ~ x1 ; AIC= -5122.886 Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5177.531 Trial: y ~ x1 + x2 ; AIC= -5177.531 Step:2 y ~ ns(x1, df = 2) ; AIC= -5177.531 Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5222.703 Step:3 y ~ ns(x1, df = 2) + x2 ; AIC= -5222.703> is.null(step.object)[1] TRUE