Kevin E. Thorpe
2009-Dec-01 19:44 UTC
[R] An R vs. SAS Discrepancy: How do I determine which is correct?
I was messing around with some data in R and SAS (the reason is unimportant), fitting a multiple linear regression, and got a curious discrepancy. The data set is too big to post, but if someone wants it, I can send it.

So, here are the (partial) results:

From R:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 61.11434    1.48065  41.275  < 2e-16 ***
sexWomen     2.91108    0.35753   8.142    5e-16 ***
diabp        0.20675    0.01504  13.746  < 2e-16 ***
age         -0.08085    0.02088  -3.871 0.000110 ***

From SAS:

Parameter Estimates

                                           Parameter  Standard
Variable   Label                       DF   Estimate     Error  t Value  Pr > |t|
Intercept  Intercept                    1   58.20326   1.57802    36.88    <.0001
SEX        SEX                          1    2.91108   0.35753     8.14    <.0001
DIABP      Diastolic BP mmHg            1    0.20675   0.01504    13.75    <.0001
AGE        Age (years) at examination   1   -0.08085   0.02088    -3.87    0.0001

The curious thing is that all parameter estimates agree except the intercept. In R I also computed the coefficients directly using (X'X)^(-1) X'y and got the same coefficients as lm() gave me. Also, ols() in Design agrees with lm().

As far as I can tell, the data used in R and SAS are identical. So, whose answer is correct and how do I prove it? Here's my sessionInfo (yes, I know my version of R is oldish).

> sessionInfo()
R version 2.8.0 (2008-10-20)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Design_2.2-0    survival_2.35-4 Hmisc_3.6-0     lattice_0.17-25

loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.8.0

--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin.thorpe at utoronto.ca  Tel: 416.864.5776  Fax: 416.864.3016
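
For readers who want to reproduce the kind of direct check described in this post, here is a minimal sketch in R. The data are simulated; the variable names (`sex`, `diabp`, `age`, `y`), the sample size, and the coefficients used to generate `y` are assumptions for illustration only and are not the poster's data set.

```r
# Simulated stand-in data (hypothetical; not the original data set).
set.seed(1)
n <- 200
d <- data.frame(
  sex   = factor(sample(c("Men", "Women"), n, replace = TRUE)),
  diabp = rnorm(n, mean = 80, sd = 10),
  age   = rnorm(n, mean = 50, sd = 8)
)
d$y <- 60 + 3 * (d$sex == "Women") + 0.2 * d$diabp - 0.08 * d$age + rnorm(n)

fit <- lm(y ~ sex + diabp + age, data = d)

# Design matrix exactly as lm() built it (the factor becomes a 0/1 column).
X <- model.matrix(fit)

# Solve the normal equations (X'X) b = X'y instead of forming the inverse.
b <- solve(crossprod(X), crossprod(X, d$y))

# Compare the hand-computed coefficients with lm(), up to rounding error.
agree <- isTRUE(all.equal(unname(coef(fit)), as.vector(b)))
print(agree)
```

Note that a check like this only confirms lm()'s arithmetic for the design matrix R constructed. It cannot show whether another package was handed the same coding of each predictor, so comparing the columns of `X` against the other package's input is a useful extra step.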
Kevin E. Thorpe
2009-Dec-01 19:59 UTC
[R] An R vs. SAS Discrepancy: How do I determine which is correct?
Thanks to an insightful comment from Jeremy Miles, who politely pointed out my thick-headed moment, I know what happened. The sex variable was coded as 1/2 in the SAS data, but was a factor in the R data and so became a properly coded dummy variable.

Sorry for the obvious question and answer.
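
The explanation in this reply can be illustrated with a small sketch on simulated data. Everything here is hypothetical: the variable names, sample size, and generating coefficients are made up, and the coding 1 = Men, 2 = Women is an assumption inferred from the fact that the SAS coefficient for SEX has the same sign as R's `sexWomen`.

```r
# Simulated data (hypothetical; assumes 1 = Men, 2 = Women).
set.seed(42)
n <- 500
sex_num <- sample(1:2, n, replace = TRUE)                  # numeric 1/2 coding
sex_fac <- factor(sex_num, levels = 1:2,
                  labels = c("Men", "Women"))              # factor coding
diabp <- rnorm(n, mean = 80, sd = 10)
age   <- rnorm(n, mean = 50, sd = 8)
y <- 61 + 2.9 * (sex_num == 2) + 0.2 * diabp - 0.08 * age + rnorm(n, sd = 5)

fit_factor  <- lm(y ~ sex_fac + diabp + age)   # sex enters as a 0/1 dummy
fit_numeric <- lm(y ~ sex_num + diabp + age)   # sex enters as the number 1 or 2

b_f <- unname(coef(fit_factor))
b_n <- unname(coef(fit_numeric))

# Slopes (sex, diabp, age) should agree between the two fits.
slopes_agree <- isTRUE(all.equal(b_f[-1], b_n[-1]))

# Since sex_num = 1 + dummy, the factor-coded intercept should equal
# the numeric-coded intercept plus the sex coefficient.
intercept_shift <- isTRUE(all.equal(b_f[1], b_n[1] + b_n[2]))

print(c(slopes_agree = slopes_agree, intercept_shift = intercept_shift))
```

The same relation holds in the posted output: 58.20326 + 2.91108 = 61.11434, which is R's intercept. Both fits describe the same model; the intercept simply refers to sex = 0 under the 1/2 coding and to the reference level (Men) under the factor coding.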