Francesco Nutini
2010-Dec-03 14:42 UTC
[R] difference between linear model & scatterplot matrix
Dear R-users,

I'm studying a database structured like this (just a small part of my dataset):

Site        Latitude   Longitude   Year  Tot-Prod  Total_Density  dmp
Dendoudi-1  15.441964  -13.540179  2005  3271.16   1007           16993.25
Dendoudi-2  15.397321  -13.611607  2005  1616.84   250            25376.67
...         ...        ...         ...   ...       ...            ...

If I make a scatterplot matrix with the commands shown below, I obtain a matrix (visible in the image) that shows which variables are most correlated with the dmp data (violet color). But if I fit a linear model between the dependent variable (dmp) and many independent variables, I get different information about the significance of the variables. I mean, variables that appear correlated with the dependent variable in the matrix do not appear significant in the summary of the linear model, and vice versa. Have I made a mistake in interpreting the result?

Thank you in advance,
Francesco

# commands for the matrix plot
> library(gclus)  # provides dmat.color(), order.single(), cpairs()
> dta <- senegal5[c(2, 4, 5, 6, 7, 8, 9, 13, 15, 17, 21, 39, 44, 45)]
> dta.r <- abs(cor(dta))
> dta.col <- dmat.color(dta.r)
> dta.o <- order.single(dta.r)
> cpairs(dta, dta.o, panel.colors = dta.col, gap = .5,
+        main = "Variables Ordered and Colored by Correlation")

# commands for the linear model and summary()
> a <- lm(dmp ~ Latitude + Longitude + Year + Tot.Prod + Herbaceous.Prod.kg.ha. +
+           Leaf.Prod + Tree.bio + Total_Density + X1st.SpecieDensity.trunk.ha. +
+           X2nd.SpecieDensity.trunk.ha. + Herb_Specie_Index1 + iNDVI.JASO. +
+           RFE.Cum.JASO., data = senegal5)
> summary(a)

Call:
lm(formula = dmp ~ Latitude + Longitude + Year + Tot.Prod + Herbaceous.Prod.kg.ha. +
    Leaf.Prod + Tree.bio + Total_Density + X1st.SpecieDensity.trunk.ha. +
    X2nd.SpecieDensity.trunk.ha. + Herb_Specie_Index1 + iNDVI.JASO. +
    RFE.Cum.JASO., data = senegal5)

Residuals:
    Min      1Q  Median      3Q     Max
-676.49 -195.77  -33.06  113.34  816.17

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -3.283e+05  4.505e+04  -7.288 4.41e-11 ***
Latitude                     -6.100e+01  1.990e+02  -0.307   0.7598
Longitude                    -3.617e+02  8.639e+01  -4.187 5.60e-05 ***
Year                          1.604e+02  2.300e+01   6.973 2.15e-10 ***
Tot.Prod                     -4.893e+00  1.565e+02  -0.031   0.9751
Herbaceous.Prod.kg.ha.        4.905e+00  1.565e+02   0.031   0.9751
Leaf.Prod                     4.842e+00  1.565e+02   0.031   0.9754
Tree.bio                     -4.241e+01  2.771e+02  -0.153   0.8786
Total_Density                -1.930e+00  8.933e-01  -2.160   0.0329 *
X1st.SpecieDensity.trunk.ha.  1.992e+00  9.246e-01   2.154   0.0333 *
X2nd.SpecieDensity.trunk.ha.  3.416e+00  1.642e+00   2.080   0.0398 *
Herb_Specie_Index1           -1.091e+00  1.844e+00  -0.592   0.5552
iNDVI.JASO.                   8.914e+02  6.076e+01  14.670  < 2e-16 ***
RFE.Cum.JASO.                 2.525e+00  4.529e-01   5.575 1.68e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 295.3 on 114 degrees of freedom
Multiple R-squared: 0.9206,     Adjusted R-squared: 0.9116
F-statistic: 101.7 on 13 and 114 DF,  p-value: < 2.2e-16
Dear R-users,

Why do variables that appear correlated with the dependent variable in a scatterplot turn out not to be significant in the summary of a linear model, and vice versa? For example, the variable "Longitude" (see the example below) is highly significant (***) in the linear model, but if I make a scatterplot of dmp against Longitude, the R-squared is very low. How should I interpret the output of summary()?

Thank you in advance,
Francesco

# command for summary() of the linear model
> summary(model_example)

Call:
lm(formula = dmp ~ Latitude + Longitude + Year + Tot.Prod + RFE.Cum.JASO.,
    data = senegal5)

Residuals:
    Min      1Q  Median      3Q     Max
-676.49 -195.77  -33.06  113.34  816.17

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -3.283e+05  4.505e+04  -7.288 4.41e-11 ***
Latitude      -6.100e+01  1.990e+02  -0.307   0.7598
Longitude     -3.617e+02  8.639e+01  -4.187 5.60e-05 ***
Year           1.604e+02  2.300e+01   6.973 2.15e-10 ***
Tot.Prod      -4.893e+00  1.565e+02  -0.031   0.9751
RFE.Cum.JASO.  2.525e+00  4.529e-01   5.575 1.68e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
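To make the question concrete, here is a minimal simulated sketch (not the senegal5 data; x1, x2, and y are made-up names) in which a predictor shows a weak relationship with the response in a pairwise scatterplot yet is highly significant once a correlated second predictor is in the model. It only illustrates that the t-tests in summary() are conditional on the other predictors in the formula, whereas a pairwise plot ignores them.

```r
# Simulated illustration only; x1, x2 and y are hypothetical variables.
set.seed(1)
n  <- 200
x2 <- rnorm(n)
# x1 is built to be strongly correlated with x2 (population correlation 0.9)
x1 <- 0.9 * x2 + rnorm(n, sd = sqrt(1 - 0.81))
# y depends on both predictors, with effects of opposite sign
y  <- x1 - x2 + rnorm(n, sd = 0.3)

# Pairwise view: regress y on x1 alone (what one panel of a scatterplot matrix shows)
marginal <- summary(lm(y ~ x1))
marginal$r.squared

# Joint view: the x1 coefficient is now tested after adjusting for x2
joint <- summary(lm(y ~ x1 + x2))
joint$coefficients["x1", ]
```

The pairwise R-squared stays small because x2 largely cancels the effect of x1 when it is left out of the model, while the joint fit estimates the x1 effect with x2 held fixed.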
Jonathan Christensen
2010-Dec-03 19:27 UTC
[R] difference between linear model & scatterplot matrix
Francesco,

My guess would be collinearity of the predictors. The linear model gives you the best fit to all of the predictors at once. Unless the predictors are orthogonal (which is certainly not so in a case like this), there is no guarantee that the parameter estimates giving the best overall fit will be similar to the regression coefficients you would get by regressing the response on each predictor individually.

There are various ways to check for collinearity, such as variance inflation factors (VIF). You may want to look into them. It is very dangerous to try to interpret your parameter estimates in the presence of collinearity.

Jonathan
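The variance inflation factors mentioned in the reply can be sketched in base R as follows. For each predictor, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The helper name vif_manual and the simulated columns (herb, leaf, total, rain) are assumptions made for illustration, not part of the original thread or of senegal5. The simulated total is deliberately built as herb + leaf plus a little noise, on the guess that Tot.Prod, Herbaceous.Prod.kg.ha., and Leaf.Prod in the posted model are related in a similar way (their standard errors are nearly identical), which the thread does not confirm.

```r
# Hypothetical helper: VIF for each named predictor, computed by regressing it
# on the remaining predictors. Assumes syntactically valid column names.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data standing in for the real data set (values are invented)
set.seed(42)
n <- 128
sim <- data.frame(
  herb = rnorm(n, mean = 1500, sd = 400),
  leaf = rnorm(n, mean = 300,  sd = 100),
  rain = rnorm(n, mean = 400,  sd = 80)
)
# total is almost exactly herb + leaf, so these three columns are collinear
sim$total <- sim$herb + sim$leaf + rnorm(n, sd = 5)

vifs <- vif_manual(sim, c("herb", "leaf", "total", "rain"))
round(vifs, 1)
```

In this sketch the three nearly dependent columns get very large VIFs while rain stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. On the real data, the same helper could be called with senegal5 and the predictor names from the model formula, or, if the car package is installed, car::vif(a) should give equivalent values for the fitted model a.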