Sorkin, John
2025-Jun-30 03:05 UTC
[R] Difference in p-value obtained in an interrupted time-series analysis (containing main effects and and interaction) vs. a simple regression containing only main effects.
The question that follows in NOT an R question, but rather a statistics question. I hope you will forgive my statistics question. I am investigating interrupted time-series analysis. My data has two periods, period 0 and period 1. In period 0 the slope is positive. In period 2 the slope is negative: mydata<- structure(list(period = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), time2 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L), time = 1:32, y = c(0.0190981041979942, -0.648843854645016, 2.62941433593151, 2.97383527060948, 2.68476280937483, 2.51909810419799, 1.85115614535498, 5.12941433593151, 5.47383527060948, 5.18476280937483, 5.01909810419799, 4.35115614535498, 7.62941433593151, 7.97383527060948, 7.68476280937483, 7.51909810419799, 6.35115614535498, 8.12941433593151, 6.97383527060948, 5.18476280937483, 3.51909810419799, 1.35115614535498, 3.12941433593151, 1.97383527060948, 0.184762809374828, -1.48090189580201, -3.64884385464502, -1.87058566406849, -3.02616472939052, -4.81523719062517, -6.48090189580201, -8.64884385464502)), row.names = c(NA, -32L), class = "data.frame") mydata plot(mydata$time,mydata$y) title("Data Used in my analyses") # Regression of y on time, period 0 (16-data points) fit0 <- lm(y~time,data=mydata[1:16,]) summary(fit0) #(16-data points) abline(fit0,col="green") cat("Regression period 0:", "beta=",summary(fit0)$coefficients["time","Estimate"],"p-value=",summary(fit0)$coefficients["time","Pr(>|t|)"],"\n") # Regression of y on time, period 1 (16-data points) fit1 <- lm(y~time,data=mydata[17:32,]) #(16-data points) abline(fit1,col="red") summary(fit1) cat("Regression period 1:", "beta=",summary(fit1)$coefficients["time","Estimate"],"p-value=",summary(fit1)$coefficients["time","Pr(>|t|)"],"\n") # Regression of y on time, period 0 and 1 (32-data points) fit2 <- lm(y~time+period+time*period,data=mydata[1:32,]) summary(fit2) #(32-data points) cat("Regression period 1 and 2 using interaction:", "beta=",summary(fit2)$coefficients["time","Estimate"],"p-value=",summary(fit2)$coefficients["time","Pr(>|t|)"],"\n") Please note that the regression of y on time in period 0 (fit 0) returns time slope=0.52358 p=2.86e-07 Please note that the regression of y on time in period 0 (fit 2) and 1 returns time slope=0.52358 p=1.50e-09 The time slope are the same in the two models, fit0 and fit2, however the p-values are different, 2.86e-07 (fit 0) vs. 1.50e-09 (fit 2) a two-orders of magnitude improvement in the p-value! I am not surprised that the time slopes are the same. I am shocked that the p-values are different. While fit 0 used 16 lines of data, and fit 2 used 32 lines of data, the number of values that were used to compute the time slopes in period 0 by the two models are the same, 16. This being the case, why are the p-values of the time slopes different in fit0 vs. fit2? Thank you, John P.S. This question is NOT homework. I am many years beyond being a student (at least a student in a class), but I am a teacher of a class (and a life-long student). The question comes from a discussion I had with a student in one of my classes. John David Sorkin M.D., Ph.D. Professor of Medicine, University of Maryland School of Medicine; Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;? PI?Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center; Senior Statistician University of Maryland Center for Vascular Research; Division of Gerontology and Paliative Care, 10 North Greene Street GRECC (BT/18/GR) Baltimore, MD 21201-1524 Cell phone 443-418-5382
John Fox
2025-Jun-30 03:43 UTC
[R] Difference in p-value obtained in an interrupted time-series analysis (containing main effects and and interaction) vs. a simple regression containing only main effects.
Dear John, Without investigating your data and models in detail, it's not surprising that the p-values for the two coefficients differ: the models have different residual standard errors (and hence different coefficient standard errors) and different residual degrees of freedom (and hence are based on different t-distributions). I hope this helps, John -- John Fox, Professor Emeritus McMaster University Hamilton, Ontario, Canada web: https://www.john-fox.ca/ -- On 2025-06-29 11:05 p.m., Sorkin, John wrote:> Caution: External email. > > > The question that follows in NOT an R question, but rather a statistics question. I hope you will forgive my statistics question. > > I am investigating interrupted time-series analysis. My data has two periods, period 0 and period 1. In period 0 the slope is positive. In period 2 the slope is negative: > > mydata<- structure(list(period = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), > time2 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, > 13L, 14L, 15L, 16L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, > 11L, 12L, 13L, 14L, 15L, 16L), time = 1:32, > y = c(0.0190981041979942, > -0.648843854645016, 2.62941433593151, 2.97383527060948, 2.68476280937483, > 2.51909810419799, 1.85115614535498, 5.12941433593151, 5.47383527060948, > 5.18476280937483, 5.01909810419799, 4.35115614535498, 7.62941433593151, > 7.97383527060948, 7.68476280937483, 7.51909810419799, 6.35115614535498, > 8.12941433593151, 6.97383527060948, 5.18476280937483, 3.51909810419799, > 1.35115614535498, 3.12941433593151, 1.97383527060948, 0.184762809374828, > -1.48090189580201, -3.64884385464502, -1.87058566406849, > -3.02616472939052, -4.81523719062517, -6.48090189580201, > -8.64884385464502)), row.names = c(NA, -32L), class = "data.frame") > mydata > plot(mydata$time,mydata$y) > title("Data Used in my analyses") > > # Regression of y on time, period 0 (16-data points) > fit0 <- lm(y~time,data=mydata[1:16,]) > summary(fit0) #(16-data points) > abline(fit0,col="green") > cat("Regression period 0:", "beta=",summary(fit0)$coefficients["time","Estimate"],"p-value=",summary(fit0)$coefficients["time","Pr(>|t|)"],"\n") > > # Regression of y on time, period 1 (16-data points) > fit1 <- lm(y~time,data=mydata[17:32,]) #(16-data points) > abline(fit1,col="red") > summary(fit1) > cat("Regression period 1:", "beta=",summary(fit1)$coefficients["time","Estimate"],"p-value=",summary(fit1)$coefficients["time","Pr(>|t|)"],"\n") > > # Regression of y on time, period 0 and 1 (32-data points) > fit2 <- lm(y~time+period+time*period,data=mydata[1:32,]) > summary(fit2) #(32-data points) > cat("Regression period 1 and 2 using interaction:", "beta=",summary(fit2)$coefficients["time","Estimate"],"p-value=",summary(fit2)$coefficients["time","Pr(>|t|)"],"\n") > > > Please note that the regression of y on time in period 0 (fit 0) returns > time slope=0.52358 p=2.86e-07 > > Please note that the regression of y on time in period 0 (fit 2) and 1 returns > time slope=0.52358 p=1.50e-09 > > The time slope are the same in the two models, fit0 and fit2, however the p-values are different, 2.86e-07 (fit 0) vs. 1.50e-09 (fit 2) a two-orders of magnitude improvement in the p-value! > I am not surprised that the time slopes are the same. I am shocked that the p-values are different. While fit 0 used 16 lines of data, and fit 2 used 32 lines of data, the number of values that were used to compute the time slopes in period 0 by the two models are the same, 16. This being the case, why are the p-values of the time slopes different in fit0 vs. fit2? > > Thank you, > John > > P.S. This question is NOT homework. I am many years beyond being a student (at least a student in a class), but I am a teacher of a class (and a life-long student). The question comes from a discussion I had with a student in one of my classes. > > > > > > John David Sorkin M.D., Ph.D. > Professor of Medicine, University of Maryland School of Medicine; > Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center; > PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center; > Senior Statistician University of Maryland Center for Vascular Research; > > Division of Gerontology and Paliative Care, > 10 North Greene Street > GRECC (BT/18/GR) > Baltimore, MD 21201-1524 > Cell phone 443-418-5382 > > > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide https://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.