George Markomanolis
2011-Jun-21 07:49 UTC
[R] question about linear regression and leverage
Dear all, I am new to this field and I have a question about linear regression. I have a dataset of around 31000 points and I want to fit a linear regression. The R-squared is 0.9; however, when I check the diagnostic plots I can see that around 250 points have large leverage values. As far as I know, points with large leverage strongly influence the fit. If I remove these points to check their influence, the R-squared for the remaining points is 0.71. So I removed less than 1% of my data and the fit is not as good. Could you please give me any advice about this? Is it right to keep these 250 points in my dataset or not? Could I do something else? The data were measured in an experiment, so even these 250 points are real values. Thanks a lot, George
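The comparison described above (flag high-leverage points, refit without them, compare R-squared) can be sketched on synthetic data. Everything here is an assumption made for illustration: a single predictor, made-up data, and the common rule-of-thumb cutoff of 2p/n for "large" leverage; the real dataset and model are not shown in the thread. In R the leverages would come from `hatvalues()` on the fitted `lm` object; the sketch computes them by hand in Python. It also reports the residual standard error, because R-squared depends on how widely the predictor ranges: dropping the points at the extreme of the range can lower R-squared even when the scatter around the line is unchanged.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: 1000 points with x in [0, 1] and 10 points far out in [9, 10].
# The same straight line and the same noise level generate every point.
x = [random.uniform(0, 1) for _ in range(1000)]
x += [random.uniform(9, 10) for _ in range(10)]
y = [2.0 * xi + random.gauss(0, 0.5) for xi in x]


def leverages(x):
    """Hat values for one predictor: h_i = 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


def fit_summary(x, y):
    """Least-squares line; returns (R-squared, residual standard error)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / (n - 2))


h = leverages(x)
cutoff = 2 * 2 / len(x)  # rule of thumb: 2 * (number of coefficients) / n
keep = [hi <= cutoff for hi in h]

# Fit once on all points and once without the flagged points.
r2_all, rse_all = fit_summary(x, y)
x_trim = [xi for xi, k in zip(x, keep) if k]
y_trim = [yi for yi, k in zip(y, keep) if k]
r2_trim, rse_trim = fit_summary(x_trim, y_trim)

print("points flagged:", len(x) - len(x_trim))
print("all points: R2 = %.3f, residual SE = %.3f" % (r2_all, rse_all))
print("trimmed:    R2 = %.3f, residual SE = %.3f" % (r2_trim, rse_trim))
```

If the residual standard error barely moves while R-squared falls, the drop reflects the narrower range of the predictor and not a worse fit. Whether that is what happened with the real data cannot be told from the thread.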
On Jun 21, 2011, at 3:49 AM, George Markomanolis wrote:
> [...]
> Is it right to let these 250 points in my dataset or not?
> Could I do something else? The data are measured through an experiment
> so even these 250 points are real values.

You could be looking at the descriptive statistics on the points. Perhaps they are at one end of a variable range, or you perhaps have some other feature that is scientifically interesting. So far you have only been examining one set of simple linear hypotheses and have not (presumably) been looking at any non-linear possibilities or the potential that interactions are affecting the outcome. The prior science of your (so far undescribed) domain should be carefully considered, but in your message we see no evidence of such.

--
David Winsemius, MD
West Hartford, CT
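The suggestion to look at descriptive statistics for the high-leverage points can be sketched as follows. The predictor values are hypothetical, as is the 2p/n leverage cutoff and the single-predictor setup; the idea is only to summarise the flagged points and the remaining points side by side and see whether the flagged ones sit at one end of the range.

```python
import statistics

# Hypothetical predictor: 990 values spread evenly over [10, 20]
# plus 10 values near the start of the range, in [0, 0.9].
x = [10 + 10 * i / 989 for i in range(990)] + [0.1 * i for i in range(10)]

# Leverage for a one-predictor regression: h_i = 1/n + (x_i - mean)^2 / Sxx
n = len(x)
xbar = statistics.fmean(x)
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

cutoff = 2 * 2 / n  # rule of thumb: 2 * (number of coefficients) / n
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
rest = [xi for xi, hi in zip(x, h) if hi <= cutoff]


def describe(values):
    """Small descriptive summary of a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


print("flagged:", describe(flagged))
print("rest:   ", describe(rest))
```

When the maximum of one group lies below the minimum of the other, as in this made-up example, the flagged points form a separate region of the predictor, which is the situation where a non-linear relationship is worth checking.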
George Markomanolis
2011-Jun-21 11:49 UTC
[R] question about linear regression and leverage
Dear David,

Thanks for your answer. Yes, now that you mention it, these points are at the beginning of a variable's range. The residual plot appears to show non-constant variance, which is resolved by a transformation. I also checked for interactions by using the : symbol between two variables, and the change in the result was not important. I work in computer science, but I wanted to do an analysis from scratch because some previous results that I have seen are not good for such cases. Moreover, the data are of course not the same.

Thanks,
George

On 06/21/2011 01:08 PM, David Winsemius wrote:
> [...]
> Perhaps they are at one end of a variable range, or you perhaps have
> some other feature that is scientifically interesting.
> [...]
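The message above reports non-constant residual variance that a transformation removed, without saying which transformation. A minimal sketch of that kind of check, assuming a log transform and hypothetical data with multiplicative noise (both are assumptions, not details from the thread), compares the residual spread in the upper and lower halves of the predictor before and after transforming the response:

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical data with multiplicative noise: the scatter of y grows with its level.
x = [random.uniform(0, 4) for _ in range(2000)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0, 0.3)) for xi in x]


def residuals(x, y):
    """Residuals from a least-squares straight line of y on x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def spread_ratio(x, res):
    """Residual SD in the upper half of x divided by the SD in the lower half."""
    cut = statistics.median(x)
    low = [r for xi, r in zip(x, res) if xi <= cut]
    high = [r for xi, r in zip(x, res) if xi > cut]
    return statistics.pstdev(high) / statistics.pstdev(low)


ratio_raw = spread_ratio(x, residuals(x, y))
ratio_log = spread_ratio(x, residuals(x, [math.log(yi) for yi in y]))

print("spread ratio, raw response: %.2f" % ratio_raw)
print("spread ratio, log response: %.2f" % ratio_log)
```

A ratio near 1 suggests roughly constant variance across the range; a ratio well above 1 on the raw scale that falls toward 1 after the transformation is the pattern described in the message. A transformation of the response also changes which points look unusual, so the leverage and R-squared comparison is worth repeating on the transformed model.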
You really really need to consult with a local statistician for help. You are making a valiant effort, but it is clear that you have insufficient background and experience. Get help from an expert if you can. It is no dishonor, you will learn a lot, and you will avoid incorrect conclusions.

Cheers,
Bert

On Tue, Jun 21, 2011 at 4:49 AM, George Markomanolis <george at markomanolis.com> wrote:
> [...]
> I am working on computer science field but I wanted to do an analysis from
> scratch because some previous results that I have seen are not good for
> such cases. Moreover the data are not the same of course.

--
"Men by nature long to get on to the ultimate truths, and will often be impatient with elementary studies or fight shy of them. If it were possible to reach the ultimate truths without the elementary studies usually prefixed to them, these would not be preparatory studies but superfluous diversions."
-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics