agent dunham
2011-Mar-22 16:31 UTC
[R] lm ~ v1 + log(v1) + ... improve adj Rsq ¿any sense?
Dear all,

I want to improve my adjusted R-squared. I've checked some established models, and they include the same variable twice, once transformed and once untransformed. Doing so also improves my adjusted R-squared.

But isn't this bad for collinearity? Do I interpret the coefficients as usual?

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.73140    7.22477   0.240  0.81086
v1           -0.33886    0.20321  -1.668  0.09705 .
log(v1)       2.63194    3.74556   0.703  0.48311
v2           -0.01517    0.01089  -1.394  0.16507
log(v3)      -0.45719    0.27656  -1.653  0.09995 .
factor1      -1.81517    0.62155  -2.920  0.00392 **
factor2      -1.87330    0.84375  -2.220  0.02759 *

Analysis of Variance Table

Response: height rise
           Df Sum Sq Mean Sq F value    Pr(>F)
v1          1  51.25  51.246 21.4128 6.842e-06 ***
log(v1)     1  13.62  13.617  5.6897  0.018048 *
v2          1   2.84   2.836  1.1850  0.277713
log(v3)     1   3.02   3.024  1.2638  0.262357
factor1     1  17.62  17.616  7.3608  0.007279 **
factor2     1  11.80  11.797  4.9294  0.027586 *
Residuals 190 454.71   2.393

Thanks,
user at host.com

--
View this message in context: http://r.789695.n4.nabble.com/lm-v1-log-v1-improve-adj-Rsq-any-sense-tp3396935p3396935.html
Sent from the R help mailing list archive at Nabble.com.
Frank Harrell
2011-Mar-22 17:20 UTC
[R] lm ~ v1 + log(v1) + ... improve adj Rsq ¿any sense?
If you care about confidence interval coverage, type I error, or predictive accuracy, trying different models in this way is not the way to go.

Frank

agent dunham wrote:
> Dear all,
>
> I want to improve my adjusted R-squared. I've checked some established models, and
> they include the same variable twice, once transformed and once untransformed.
> Doing so also improves my adjusted R-squared.
>
> But isn't this bad for collinearity? Do I interpret the coefficients as usual?

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
Mike Marchywka
2011-Mar-23 01:55 UTC
[R] lm ~ v1 + log(v1) + ... improve adj Rsq ¿any sense?
----------------------------------------
> Date: Tue, 22 Mar 2011 09:31:01 -0700
> From: crosspide at hotmail.com
> To: r-help at r-project.org
> Subject: [R] lm ~ v1 + log(v1) + ... improve adj Rsq ?any sense?
>
> Dear all,
>
> I want to improve my adjusted R-squared. I've checked some established models, and
> they include the same variable twice, once transformed and once untransformed.
> Doing so also improves my adjusted R-squared.
>
> But isn't this bad for collinearity? Do I interpret the coefficients as usual?

I'm not sure how many replies you got or whether your question was answered, but offhand let me see if I understand your concern. If your data cover only a limited range of v1, where a Taylor expansion of the log can be truncated at the linear term, then sure, it can be hard to tell a linear dependence from a log dependence, or to quantify a mixture of the two. If you try to find a and b to fit y=a*f(x) + b*g(x) so as to minimize some error, you should be able to see the issues on paper. Presumably log is not linear over a larger range, and any error function, like SSE, would have a "reasonably" peaked minimum at some values of the two coefficients, but you could do a sensitivity analysis to check: find the second derivatives of your error function, or just perturb the coefficients a bit. I guess if there is some direction in which the error does not change as a and b vary, then you have the case you are worried about.

I'm not sure what you consider to be "usual", but when I'm doing something like this I usually have some physical interpretation in mind. Most uninformatively, you could interpret these coefficients as those which minimize your error given the data you have :) What you do from there depends on a lot of specifics. To tell whether a given function seems appropriate for the data, it is always good to look at a plot of residuals.
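The sensitivity check described above (looking at the second derivatives of the error function) can be sketched in R. This is only an illustration on a simulated design, not the poster's data: the helper name `flatness`, the ranges, and the sample size are all assumptions. For y = a*f(x) + b*g(x), the Hessian of the SSE with respect to (a, b) is 2 * Z'Z, where Z holds the two terms as columns; a large ratio between its eigenvalues means there is a direction in (a, b) along which the error barely changes.

```r
# Hypothetical helper: how well can a linear and a log term be told apart
# over a given range of x? (Simulated design, not the poster's data.)
flatness <- function(lo, hi, n = 200) {
  x <- seq(lo, hi, length.out = n)
  # Centre and scale both terms so the comparison does not depend on units;
  # this corresponds to a model that also contains an intercept.
  Z <- scale(cbind(linear = x, logterm = log(x)))
  # For y = a*f(x) + b*g(x), the Hessian of the SSE in (a, b) is 2 * Z'Z.
  H <- 2 * crossprod(Z)
  ev <- eigen(H, symmetric = TRUE)$values
  c(correlation = cor(x, log(x)),
    condition   = max(ev) / min(ev))
}

narrow <- flatness(10, 12)   # log(x) is almost linear over this range
wide   <- flatness(1, 100)   # curvature of log(x) is visible over this range
print(rbind(narrow, wide))
```

With standardized columns the eigenvalue ratio reduces to (1 + r) / (1 - r), where r is the correlation between x and log(x), so the narrow range should show a far larger ratio (a much flatter error surface) than the wide one.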
Note that the ability to find a unique set of coefficients that minimizes a given error has nothing to do with independence of the two terms attached to the coefficients; indeed, polynomial fits are a common example (log having a Taylor series just constrains a lot of coefficient relationships LOL). P-values and confidence intervals are another matter with post hoc exploratory work, but I'll let a statistician comment on that, as well as on the meaning of the R output.

Usually the final decision on a putative model improvement comes from your ability to infer something about the underlying system, although you may just want a simple empirical approximation and be more worried about meeting a given error with a limited number of computations, etc.

Apparently you found in a retrospective literature search that everyone else is using the log term. Sometimes you see people ask questions like, "Given that in 10 papers on the subject 4 of them used the log term, and these authors have historically been right 50 percent of the time but the other 6 are right 40 percent of the time, what are the chances that the log term should be included?" I will also avoid commenting on this question, except to say it illustrates a number of ways people do approach these problems, and what you consider to be relevant to your situation.

> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
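As a rough illustration of the points raised in this thread (a unique least-squares solution despite correlated terms, and how much the shared information inflates the variance of each coefficient), here is a minimal sketch in base R. The data are simulated and every name and coefficient value is an assumption chosen for the example; nothing here reproduces the original poster's model.

```r
set.seed(1)                       # simulated data; all names are hypothetical
n  <- 200
v1 <- runif(n, 1, 50)
y  <- 2 - 0.3 * v1 + 2.5 * log(v1) + rnorm(n, sd = 1)

fit_lin  <- lm(y ~ v1)            # linear term only
fit_both <- lm(y ~ v1 + log(v1))  # linear and log term together

# The least-squares solution is unique as long as the design matrix has
# full column rank, even though v1 and log(v1) are strongly correlated.
X <- model.matrix(fit_both)
full_rank <- qr(X)$rank == ncol(X)

# Variance inflation factor for v1, computed by hand: 1 / (1 - R^2) from
# regressing v1 on the other predictor in the model.
vif_v1 <- 1 / (1 - summary(lm(v1 ~ log(v1)))$r.squared)

# F test for adding log(v1) to the linear-only model.
cmp <- anova(fit_lin, fit_both)

print(full_rank)
print(vif_v1)
print(cmp)
```

A residual plot such as `plot(v1, resid(fit_lin))` is the visual counterpart: leftover curvature suggests the linear-only form is missing something. The F test is shown only for a model fixed in advance; as noted earlier in the thread, p-values obtained after trying several candidate models do not keep their nominal meaning.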