Hi,

First of all, kudos to the creators of and contributors to R! This is a great package and I am finding it very useful in my research. I would love to contribute any modules and datasets that I develop to the project.

While doing multiple regression I arrived at the following peculiar situation. Out of 8 variables only 4 have p-values (of the t-statistic) below 0.04; the rest have p-values between 0.1 and 1.0. The coefficient of determination (R-squared) is around 0.8 (adjusted ~0.78). The F-statistic is around 30 and its own p-value is ~0. I am also constrained to a dataset of 130 data points.

Being new to statistics, I would really appreciate it if someone could help me understand these values.

1) Do the above test values indicate a statistically sound and significant model?
2) Is a dataset of 130 enough to run linear regression with ~7-10 variables? If not, what is approximately a good size?

Thanks in advance.
-Ankit
Ankit,

(1) Not necessarily. Linear regression has a number of assumptions. I suggest you get a basic statistics textbook and do some reading. A brief summary of the assumptions:

(a) The relation between the outcome and predictor variables lies along a line (or a plane, for a regression with multiple predictor variables) or some surface that can be modeled using a linear function.
(b) The predictor variables are independent of one another.
(c) The residuals from the regression are normally distributed.
(d) The variance of the residuals is constant throughout the range of the independent variables.
(e) The predictor variables are measured without error.

Even if the above assumptions are violated, you can still get a significant F statistic, significance for some or all of your predictor variables, etc. If the assumptions are violated, the meaning of the results you obtain from your regression analysis can be questionable, if not outright incorrect. There are a number of tests that you can perform to make sure your model conforms to (or at least does not wildly violate) the basic assumptions. Some commonly performed checks, like examining the pattern of residuals, can be done in R by simply plotting the fit you obtain, i.e.

    fit1 <- lm(y ~ x + z)
    plot(fit1)  # This produces a number of helpful graphs that will help you evaluate your model.

Fortunately, linear regression is fairly robust to minor violations of several of the assumptions noted above. However, in order to fully evaluate the appropriateness of your model, you will need to read a textbook, speak to people with more experience than you, and play, play, play with data.

(2) The more predictor variables you have, the more observations you need. Although there is no absolute rule, many people like to have a minimum of five to ten observations per independent variable. I like to have at least ten. Given that you have eight independent variables, you would, by my criteria, need at least 80 observations.
You have 130, so you should be OK, assuming that your observations are independent of one another.

Sorry I can't be of more help; statistics cannot be learned in a single e-mail message. The fact that you are asking important questions about what you are doing reflects well on you. I suspect that in a year or so you will be answering, rather than asking, questions posted on the R Listserv mailing list!

John

John Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
Baltimore VA Medical Center GRECC,
University of Maryland School of Medicine Claude D. Pepper OAIC,
University of Maryland Clinical Nutrition Research Unit, and
Baltimore VA Center Stroke of Excellence

University of Maryland School of Medicine
Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524

(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)
jsorkin at grecc.umaryland.edu

>>> "rocker turtle" <rockingturtle at gmail.com> 10/07/07 2:32 PM >>>
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
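For readers who want to try the checks described in the reply above, here is a minimal sketch in R. It is not John's code: the data are simulated, and the names `dat`, `x1` to `x8`, and `y` are hypothetical stand-ins for a dataset with 130 observations and 8 predictors. It only illustrates the kind of diagnostics mentioned (residual plots, a normality check, a look at correlation among predictors) and the observations-per-predictor rule of thumb.

```r
# Simulated stand-in data (hypothetical): 130 observations, 8 predictors.
set.seed(1)
n <- 130
dat <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(dat) <- paste0("x", 1:8)
# By construction, only the first four predictors affect the outcome.
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + dat$x3 + 0.8 * dat$x4 + rnorm(n)

# Fit the full model with all eight predictors.
fit1 <- lm(y ~ ., data = dat)

# Observations per predictor, for comparison with the "at least ten" guideline.
obs_per_predictor <- nobs(fit1) / (length(coef(fit1)) - 1)
print(obs_per_predictor)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit1)
par(op)

# Numerical companions to the plots.
res <- residuals(fit1)
sw <- shapiro.test(res)                    # Shapiro-Wilk test of residual normality
print(sw)
pred_cor <- cor(dat[, paste0("x", 1:8)])   # pairwise correlations among predictors
print(round(pred_cor, 2))
```

With real data, the plots generally deserve more weight than any single test: a curved pattern in the residuals-vs-fitted panel or a funnel shape in the scale-location panel points at assumptions (a) and (d) respectively.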
rocker turtle wrote:
> While doing multiple regression I arrived at the following peculiar
> situation.
> Out of 8 variables only 4 have <0.04 p-values (of t-statistic), rest all
> have p-values between 0.1 and 1.0 and the coeff of Regression is coming
> around ~0.8 (adjusted ~0.78). The F-statistic is
> around 30 and its own p-value is ~0. Also I am constrained with a dataset of
> 130 datapoints.

Nothing particularly peculiar about this...

> Being new to statistics I would really appreciate if someone can help me
> understand these values.
> 1) Does the above test values indicate a statistically sound and significant
> model ?

Significant, yes, in a sense (see below). Soundness is something you cannot really see from the output of a regression analysis, because it contains results which are valid _provided_ the model assumptions hold. To check the assumptions there is a battery of techniques, e.g. residual plots and interaction tests -- there are books about this, which won't really fit into a short email...

Re. significance, it is important to realise that you generally need to compare multiple model fits to assess which variables are important. With one fit, you can say what happens if you drop single variables from the model, so in your case, you have four seven-variable models that do not fit significantly worse than the full model. You can't really say anything about what happens if you remove two or more variables. You can also see what happens if you drop all variables; this is the overall F test, which in your case is highly significant, so at least one variable must be required.
You can be fairly confident that variables with very small p-values cannot be removed, whereas borderline cases may end up with their p-values becoming non-significant when other variables are removed.

> 2) Is a dataset of 130 enough to run linear regression with ~7-10 variables
> ? If not what is approximately a good size.

Wrong question, I think. Some people suggest heuristics like 10-20 observations per variable, but this contains an implicit understanding that you are dealing with "typical problems" in e.g. clinical epidemiology. Designed experiments can contain many more parameters, data with strong correlations require more observations to untangle which variables are important, and even otherwise, you might be looking for effects that are small compared to the residual variation and consequently require more observations.

When you do have the data, I think it is more sound to look at the standard errors of the regression coefficients and discuss whether they are sufficiently small for the kinds of conclusions you want to make.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
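The model comparisons and the standard-error check discussed in this reply can be sketched in R roughly as follows. This is an illustration under stated assumptions, not Peter's code: the data are simulated, and the names `dat`, `x1` to `x8`, `y`, `full`, `reduced`, and `null_fit` are hypothetical. Which variables go into the reduced model is likewise an arbitrary choice made for the example.

```r
# Simulated stand-in data (hypothetical): 130 observations, 8 predictors.
set.seed(2)
n <- 130
dat <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(dat) <- paste0("x", 1:8)
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + dat$x3 + 0.8 * dat$x4 + rnorm(n)

full <- lm(y ~ ., data = dat)

# Drop each variable in turn: one F test per seven-variable model.
# For single numeric terms these p-values match the t-test p-values
# reported by summary(full).
single_drops <- drop1(full, test = "F")
print(single_drops)

# Removing several variables at once needs an explicit second fit.
reduced <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
multi_drop <- anova(reduced, full)
print(multi_drop)

# Overall F test: intercept-only model against the full model.
null_fit <- lm(y ~ 1, data = dat)
overall <- anova(null_fit, full)
print(overall)

# Standard errors and 95% confidence intervals of the coefficients,
# to judge whether they are narrow enough for the intended conclusions.
se_table <- coef(summary(full))[, c("Estimate", "Std. Error")]
print(se_table)
ci <- confint(full, level = 0.95)
print(ci)
```

The `anova(reduced, full)` step is the one a single `summary()` cannot give you: it tests the four dropped variables jointly, which is what matters when several borderline variables are correlated with each other.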