Lucas Sevilla García
2009-Sep-24 12:48 UTC
[R] P-value and R-squared variable selection criteria
Hi R community,

I have a question; let me explain my situation. I have to build a climate model to obtain monthly and annual temperature from 2004 to 2008 for a specific area in Almeria (Spain). To build this climate model I will use multiple regression. My dependent variable will be monthly and annual temperature, and the independent variables will be latitude, longitude and altitude. I will work with climate data from 10 climate stations distributed across my area of interest. I have to fit the model to the data to get the temperature for each month, and I need to use the p-value and adjusted R-squared of the model to obtain the best fit. Here is an example. My climate data look like this:

      V1 V2 V3 V4  V5
    1  1 18  3  6 187
    2  2 21  6  8  68
    3  3 23  9  5  42
    4  4 19  8  2 194
    5  5 17  3  2 225

(V1 - climate station, V2 - temperature, V3 - latitude, V4 - longitude, V5 - altitude)

I fit the model to the data:

    fit <- lm(V2 ~ V3 + V4 + V5, data = clima)
    summary(fit)

And I get:

    Call:
    lm(formula = V2 ~ V3 + V4 + V5, data = clima)

    Residuals:
           1        2        3        4        5
     0.24684 -0.25200  0.17487 -0.05865 -0.11107

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 22.103408   2.526638   8.748   0.0725 .
    V3           0.236477   0.152067   1.555   0.3638
    V4          -0.073973   0.169716  -0.436   0.7383
    V5          -0.024684   0.006951  -3.551   0.1748
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.4133 on 1 degrees of freedom
    Multiple R-squared:  0.9926,    Adjusted R-squared:  0.9706
    F-statistic: 44.95 on 3 and 1 DF,  p-value: 0.1091

The p-value for this model is 0.1091. However, I see that variable V4 has a really high p-value, so if I take it out, my model will have a better p-value. So:

    fit2 <- lm(V2 ~ V4 + V5, data = clima)
    summary(fit2)

    Call:
    lm(formula = V2 ~ V4 + V5, data = clima)

    Residuals:
           1        2        3        4        5
     0.28356 -0.21880  0.05952  0.40918 -0.53346

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 25.764478   1.199212  21.485  0.00216 **
    V4          -0.278286   0.140452  -1.981  0.18606
    V5          -0.034109   0.004451  -7.664  0.01660 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.5403 on 2 degrees of freedom
    Multiple R-squared:  0.9748,    Adjusted R-squared:  0.9497
    F-statistic: 38.74 on 2 and 2 DF,  p-value: 0.02516

My new p-value for the model is lower, and better. So this is what I have to do: import the climate data and build the climate model using those independent variables that give me the best p-value for the model, and I have to do it automatically (in this example I did it manually).

So, my question after this long explanation: is there a package or something else I can download to select independent variables using the p-value and adjusted R-squared as criteria, or do I have to build what I need myself? I guess I could build it myself, but it would take me a while, so I would like to know if there is a package that would help me do it faster.

Well, thanks in advance.

Lucas
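As a minimal sketch of how the two criteria mentioned in the question could be pulled out of a fitted model programmatically, the snippet below rebuilds the five-station example as a data frame and reads the adjusted R-squared and the overall F-test p-value from summary(). The helper name model_stats is hypothetical, introduced only for illustration; the data values are copied from the table in the post.

```r
# Example data from the post (five stations only).
clima <- data.frame(
  V1 = 1:5,                        # climate station
  V2 = c(18, 21, 23, 19, 17),      # temperature
  V3 = c(3, 6, 9, 8, 3),           # latitude
  V4 = c(6, 8, 5, 2, 2),           # longitude
  V5 = c(187, 68, 42, 194, 225)    # altitude
)

# Hypothetical helper: adjusted R-squared and overall F-test p-value of an lm fit.
model_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  c(
    adj_r2  = s$adj.r.squared,
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
  )
}

fit  <- lm(V2 ~ V3 + V4 + V5, data = clima)
fit2 <- lm(V2 ~ V4 + V5, data = clima)

stats_full    <- model_stats(fit)
stats_reduced <- model_stats(fit2)
print(rbind(full = stats_full, reduced = stats_reduced))
```

The overall p-value is not stored directly in the summary object, which is why it is recomputed from the F statistic and its degrees of freedom with pf().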
JLucke at ria.buffalo.edu
2009-Sep-24 13:38 UTC
[R] P-value and R-squared variable selection criteria
Lucas,

This problem is very old, older than keypunches. There are several methods for selecting variables (forward, backward, both directions, all subsets) using a variety of criteria (p-values, R^2, adjusted R^2, Cp, AIC, BIC, and more). Be sure you understand the methods, especially their tendency to overfit. I use the BIC: the function is stepAIC() from the MASS package, with the parameter k = log(sample size).

Joe
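The BIC approach described above could look roughly like the following sketch. It runs on simulated data, not the climate data: the data frame d and the variables x1 to x4 are assumptions made up for illustration, with only x1 and x3 actually related to y.

```r
library(MASS)  # provides stepAIC()

set.seed(1)
n <- 100

# Simulated predictors; only x1 and x3 truly influence y.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- d$x1 + d$x3 + rnorm(n, mean = 0, sd = 2)

full <- lm(y ~ x1 + x2 + x3 + x4, data = d)

# k = log(n) turns the AIC penalty into the BIC penalty.
bic_fit <- stepAIC(full, direction = "both", k = log(n), trace = FALSE)

print(formula(bic_fit))
```

Because log(n) exceeds 2 once n is larger than about 7, the BIC penalty is heavier than the default AIC penalty and tends to keep fewer terms.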
Don't throw out the baby with the bath water just yet. Note that even though your first model is not significant, its R-squared is very high. This is because you fit the whole model, an intercept plus three coefficients, to five observations, which leaves only 1 residual degree of freedom. You need to first import the full data, then run the model, and only then decide which coefficients to include.

Second, you may have data redundancy issues, for example if altitude correlates with longitude or latitude (since you have so few stations from a very restricted region, this seems more likely than it would for larger regions). Check the correlations. If they are high, you may want to think about data reduction strategies (e.g. principal components analysis).

Further, your data are panel data (the cross-section is the 10 stations and the time series is the monthly data from 2004 to 2008). Thus, fitting OLS without recognizing the dependence of the time series within each station is very likely problematic. On top of that, there is certainly correlation across stations, e.g. due to seasonal patterns, that you may want to account for.

That said, if you want to step down a model and exclude weak predictor variables one by one, use "step". It compares models by AIC, which roughly amounts to dropping terms whose absolute t-value is small (about 1.4 or less in large samples):

    # Simulated data: only x1 and x3 affect y
    x1 = rnorm(100)
    x2 = rnorm(100)
    x3 = rnorm(100)
    x4 = rnorm(100)
    e = rnorm(100, 0, 2)
    y = x1 + x3 + e

    # Full model, then stepwise selection starting from it
    reg = lm(y ~ x1 + x2 + x3 + x4)
    summary(reg)
    step(reg)

    # The true model, for comparison
    reg2 = lm(y ~ x1 + x3)
    summary(reg2)

HTH
Daniel

-------------------------
cuncta stricte discussurus
-------------------------
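The correlation check suggested in this reply, together with the kind of automatic selection the original question asks for, might be sketched as follows. This is an illustration under assumptions, not an established recipe: it uses only the five example rows from the question, fits every non-empty subset of the three predictors, and ranks the fits by adjusted R-squared. The names predictors, subsets and results are hypothetical.

```r
# Example data from the question (temperature plus three candidate predictors).
clima <- data.frame(
  V2 = c(18, 21, 23, 19, 17),      # temperature
  V3 = c(3, 6, 9, 8, 3),           # latitude
  V4 = c(6, 8, 5, 2, 2),           # longitude
  V5 = c(187, 68, 42, 194, 225)    # altitude
)
predictors <- c("V3", "V4", "V5")

# Pairwise correlations among the predictors (redundancy check).
print(round(cor(clima[, predictors]), 2))

# Every non-empty subset of the predictors, as a flat list of character vectors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and collect adjusted R-squared and the overall F-test p-value.
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "V2"), data = clima)
  s <- summary(fit)
  f <- s$fstatistic
  data.frame(
    model   = paste(vars, collapse = " + "),
    adj_r2  = s$adj.r.squared,
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
  )
}))

# Best adjusted R-squared first.
results <- results[order(-results$adj_r2), ]
print(results)
```

With only five observations any such ranking is fragile, for the degrees-of-freedom reason given above. An exhaustive search is feasible here only because three predictors give just seven candidate models.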