I am using glm to fit logit models to cross-sectional data. I am now down to the hard work of making the results intelligible to very average readers. Is there any way to calculate a pseudo analogue to the R^2 of standard linear regression, for use as a purely descriptive statistic of goodness of fit? Most of the readers of my report will be vaguely familiar, and more comfortable, with R^2 than with any other regression diagnostic.

Paul M. Jacobson
Jacobson Consulting Inc.
80 Front Street East, Suite 720
Toronto, ON, M5E 1T4
Voice: +1(416)868-1141
Farm: +1(519)463-6061/6224
Fax: +1(416)868-1131
E-mail: pmj at jciconsult.com
Web: http://www.jciconsult.com/
On Sun, 4 Aug 2002 09:08:46 -0400, "Paul M. Jacobson" <pmj at jciconsult.com> wrote:

> Is there any way to calculate a pseudo analogue to the R^2 of standard
> linear regression, for use as a purely descriptive statistic of goodness
> of fit? [snip]

The Nagelkerke R^2 is commonly used. The lrm function in the Design library computes it for logistic regression. The numerator is 1 - exp(-LR/n), where LR is the likelihood ratio chi-square statistic and n is the total sample size. Divide it by the maximum attainable value, reached if the model is perfect (a simple function of the -2 log likelihood of the intercept-only model), to get Nagelkerke's R^2. The numerator is exactly the ordinary R^2 in OLS, since LR = -n log(1 - R^2) there.

For a more interpretable index, and one that measures pure discrimination ability, the ROC area or "C index", essentially a Mann-Whitney statistic based on the probability of concordance, is recommended. The lrm function also outputs this, or you can get it from the somers2 or rcorr.cens functions in the Hmisc library.

Frank Harrell

--
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
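[To make the arithmetic above concrete, here is a minimal sketch, not from the thread itself: it assumes an illustrative binomial glm fit on ungrouped binary data, with made-up names fit, d, y, x1, x2.]

  fit <- glm(y ~ x1 + x2, family = binomial, data = d)  # illustrative model

  LR <- fit$null.deviance - fit$deviance      # likelihood ratio chi-square statistic
  n  <- length(fit$fitted.values)             # total sample size
  R2.num <- 1 - exp(-LR / n)                  # the numerator (Cox-Snell form)
  R2.max <- 1 - exp(-fit$null.deviance / n)   # maximum attainable value; for ungrouped
                                              # binary data the null deviance is -2 log L0
  R2.nag <- R2.num / R2.max                   # Nagelkerke's R^2

  library(Hmisc)
  somers2(fitted(fit), fit$y)["C"]            # C index (ROC area)

[lrm in the Design library reports both quantities directly; the sketch only shows where they come from.]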
On Aug 04, Paul M. Jacobson wrote:

> Is there any way to calculate a pseudo analogue to the R^2 of standard
> linear regression, for use as a purely descriptive statistic of goodness
> of fit? [snip]

In fact, there are several "R^2-like" measures for logit and probit models (not surprisingly, called "pseudo-R^2"). An overview is in "Pseudo-R^2 Measures for Some Common Limited Dependent Variable Models": http://citeseer.nj.nec.com/veall96pseudor.html

The Aldrich-Nelson measure appears to be the most widely used. You may also want to consider Herron's (1999) "Expected Percent Correctly Predicted" and related measures, described in Political Analysis 8(1): http://web.polmeth.ufl.edu/pa/herron.pdf. Even the traditional PCP/PRE measures tend to be quite informative (perhaps even more useful than a pseudo-R^2).

Chris

--
Chris Lawrence <cnlawren at olemiss.edu> - http://www.lordsutch.com/chris/
Instructor and Ph.D. Candidate, Political Science, Univ. of Mississippi
208 Deupree Hall - 662-915-5765
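[A hedged sketch of two of these measures, reusing the illustrative binomial glm fit from the sketch above; the 0.5 classification cutoff for PCP is the conventional but arbitrary choice.]

  LR <- fit$null.deviance - fit$deviance      # likelihood ratio chi-square
  n  <- length(fit$fitted.values)
  LR / (LR + n)                               # Aldrich-Nelson pseudo-R^2
  mean((fitted(fit) > 0.5) == fit$y)          # PCP: percent correctly predicted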
Dear list,

My data frame has a serious collinearity problem. I want to try incomplete principal component regression, introduced in Regression Analysis by Rudolf J. Freund and William J. Wilson (1998). I wonder if I can find a function in R or S-PLUS to do it.

Thanks a lot!

Huan
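[No ready-made function was named in the thread. As a hedged sketch of the idea, incomplete principal component regression can be assembled from princomp and lm, regressing on only the large-variance components; the predictor matrix X, response y, and the variance threshold below are illustrative assumptions, not from the cited text.]

  pc <- princomp(X, cor = TRUE)               # components of the standardized predictors
  keep <- pc$sdev^2 > 0.10                    # discard near-degenerate components
                                              # (the 0.10 threshold is arbitrary)
  fit <- lm(y ~ pc$scores[, keep])            # regress on the retained component scores
  gamma <- coef(fit)[-1]                      # coefficients on the components
  beta <- pc$loadings[, keep] %*% gamma       # back to the standardized-predictor scale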
Dear Prof. Harrell and R list,

I have done the variable clustering and summary scores. Thanks a lot for your kind help.

But it hasn't solved the collinearity problem in my dataset. After the clustering and transcan, there is still very strong collinearity between the summary scores. The objective of my project is to find the influential variables, and I believe variable reduction is not appropriate while the collinearity persists. I am thinking about principal component regression and variable reduction based on it (Rudolf J. Freund and William J. Wilson (1998), p. 215).

Does anybody have a suggestion on variable reduction under this condition? I will appreciate any kind of information.

Best,
Huan

----- Original Message -----
From: "Frank E Harrell Jr" <fharrell at virginia.edu>
To: "Huan Huang" <huang at stats.ox.ac.uk>
Sent: Sunday, August 04, 2002 7:56 PM
Subject: Re: cluster summary score

> On Sun, 4 Aug 2002 19:48:22 +0100, Huan Huang <huang at stats.ox.ac.uk> wrote:
>
> > > This was just done by
> > >
> > > f <- lrm(y ~ all cluster summary scores)
> > > fastbw(f, suitable stopping criteria)
> >
> > Thank you very much for your kind reply. But I don't know how to get the
> > cluster summary score. I did:
> >
> > t <- transcan(x, transform = T)
> > t$transform
> >
> > I got a new matrix, with the transformed value for each variable. How can
> > I get the cluster summary scores?
>
> You see the little pc1 function I defined in Hmisc? I just do things like
> p1 <- pc1(t$transform) or pc1(t$transform[, c(3, 5, 7)]) to use variables
> 3, 5, 7.
>
> Frank
>
> > > Doing the fast backward stepdown is safer with cluster scores than with
> > > raw variables, especially if you use conservative stopping criteria
> > > (e.g., large alpha). I allowed "highly insignificant" cluster scores to
> > > be dropped, and did not ever look at their component variables again.
> > >
> > > Frank
> > >
> > > > Actually I am doing my thesis project. My explanatory variables have
> > > > serious collinearity. I have used the functions transcan and varclus
> > > > on the variables and found some clusters. I am trying to use the
> > > > method introduced in this section to drop some variables. I want to
> > > > know how you carry out the cluster summary scores.
> > > >
> > > > Huan
>
> [rest of the quoted pseudo-R^2 thread snipped]
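[For concreteness, a hedged sketch of the workflow in the quoted exchange: the cluster column indices and the stopping rule are illustrative assumptions; transcan and pc1 come from the Hmisc library, lrm and fastbw from Design.]

  library(Hmisc)                              # transcan, pc1
  library(Design)                             # lrm, fastbw
  t <- transcan(x, transform = TRUE)          # x: illustrative predictor data frame
  score1 <- pc1(t$transform[, c(1, 2)])       # first PC of one variable cluster
  score2 <- pc1(t$transform[, c(3, 5, 7)])    # first PC of another (columns illustrative)
  f <- lrm(y ~ score1 + score2)               # logistic model on the cluster scores
  fastbw(f, rule = "p", sls = 0.5)            # backward stepdown; a large alpha drops
                                              # only "highly insignificant" scores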
On 08/08/02 13:23, Huan Huang wrote:

> But it hasn't solved the collinearity problem in my dataset. After the
> clustering and transcan, there is still very strong collinearity between
> the summary scores. The objective of my project is to find the
> influential variables. [snip]

I'm not sure exactly what you mean by variable reduction here, but when I and many other psychologists face this kind of problem (reducing a set of variables), we often use factor analysis. A good program is factanal in the mva library. Varimax rotation (the default) usually picks out a sensible set of factors, although of course other rotations may be more informative in a given case. You can sort the loadings if you want (look at the various options for loadings() and print()).

There are no fixed rules for this sort of thing. Sometimes one variable winds up in the wrong place by chance. The strategy I use is to figure out a sensible grouping of variables before I use them to predict anything, so that I am not biased by knowing the results. That way I feel free to move or remove variables that don't make sense. Some people may prefer a more rigid approach, which further reduces the temptation to cheat.

Having found the grouping of variables, you can do three different things (a sketch of the first two appears after this message):

1. Define "scores" by simply adding up the (standardized?) scores of the variables in each group (those with high loadings on the same factor, perhaps).

2. Use the factor scores themselves as variables.

3. Use a single representative variable from each group. This seems to be what you were suggesting, but I'm having trouble thinking of a situation where this would be better than #1 or #2.

Whatever you do, you need to figure out how many groups there are, and prcomp() or princomp() is often helpful here. (And take a look at biplot(), a really nice tool for looking at the first two principal components.) The factanal() program also reports a chi-square fit statistic, so in principle you could use that to decide how many factors there are. However, that method usually gives more factors than are meaningful, especially when you have a large data set.

--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R page: http://finzi.psych.upenn.edu/
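[A minimal sketch of options 1 and 2 above: the predictor matrix X, the binary response y, the choice of three factors, the loading cutoff, and the column indices are all illustrative assumptions.]

  library(mva)                                # factanal, princomp, biplot in R 1.x
  fa <- factanal(X, factors = 3,              # X and the 3 factors are illustrative
                 rotation = "varimax", scores = "regression")
  print(fa$loadings, cutoff = 0.4, sort = TRUE)  # sorted loadings, small ones suppressed

  # Option 1: sum the standardized variables in one group (columns illustrative).
  score1 <- rowSums(scale(X[, c(2, 4)]))

  # Option 2: use the factor scores themselves as predictors.
  f <- glm(y ~ fa$scores, family = binomial)  # y: illustrative binary response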