All -

My question is a bit involved, so bear with me. I have some data that looks like:

Lake  LL           LW
81    2.176091259  1.342422681
81    2.176091259  1.414973348
81    2.176091259  1.447158031
81    2.181843588  1.414973348
81    2.181843588  1.447158031
81    2.184691431  1.462397998
81    2.187520721  1.447158031
81    2.187520721  1.477121255
81    2.187520721  1.505149978
... [truncated]

I'm trying to:

1) fit a simple lm(LW~LL)
2) calculate the dffits for those data points
3) remove those data points whose dffits exceed 2*sqrt(p/n) (where p=the number of parameters and n=number of data points; p=3 in a linear model, correct? Intercept, slope, and error term?)
4) rerun the model MINUS those data points
5) compare the two lm()

Now, each of these steps I can do separately, but only by outputting the dffits to a .csv, then removing the large dffits by hand, reading the .csv back into R, rerunning the lm(), and comparing the first lm() to the second lm(). I would imagine that there is a better (easier, I hope!) way of doing all of this. Any ideas? My programming knowledge of R is rather limited but getting better all the time thanks to this board and the R-help archive.

Thanks,

SR

Steven H. Ranney
Graduate Research Assistant (Ph.D)
USGS Montana Cooperative Fishery Research Unit
Montana State University
P.O. Box 173460
Bozeman, MT 59717-3460
phone: (406) 994-6643
fax: (406) 994-7479
Ranney, Steven <steven.ranney <at> montana.edu> writes:

> 1) fit a simple lm(LW~LL)
> 2) calculate the dffits for those data points
> 3) remove those data points whose dffits exceed 2*sqrt(p/n) (where p=the number of
> parameters and n=number of data points; p=3 in a linear model, correct?
> Intercept, slope, and error term?)
> 4) rerun the model MINUS those data points
> 5) compare the two lm()
>
> Now, each of these steps I can do separately, but only by outputting the
> dffits to a .csv then removing the large dffits by hand, reading the .csv
> back into R, rerunning the lm(), and comparing the first lm() to the second
> lm(). I would imagine that there is a better (easier, I hope!) way of doing
> all of this. Any ideas?
>

You could do the following:

# --------------------
x = rnorm(100)
y = rnorm(100)
y[40] = y[40] + 30                   # generate an outlier
df = data.frame(x = x, y = y)
lmfit1 = lm(y ~ x, data = df)        # fit all data
thresh = 3                           # choose any data-dependent threshold
nice = abs(dffits(lmfit1)) < thresh  # nice[40] should be the only FALSE
df2 = df[nice, ]                     # keep only the well-behaved rows
lmfit2 = lm(y ~ x, data = df2)       # refit without the flagged points
summary(lmfit1)
summary(lmfit2)
# --------------------

However, this is a bit Denver-Style Home-Brewery. Instead of using this ad-hoc method, you are probably better off using one of the robust methods, for example those in the MASS package.

Dieter
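
For reference, the five steps in the question can be sketched in one short script using the 2*sqrt(p/n) cutoff rather than a fixed threshold. This is only a sketch under some assumptions: the data are simulated stand-ins (the real lake data are truncated above), the names dat, fit1, fit2, fit3, cutoff and keep are made up for illustration, and p is taken as the number of regression coefficients (intercept and slope, so p = 2), which is the usual convention for this cutoff; the error variance is not counted. The last lines show a robust fit with rlm() from MASS as one possible alternative to deleting points.

```r
# Simulated stand-in for the lake data; these LL/LW values are invented.
set.seed(1)
n  <- 100
LL <- rnorm(n, mean = 2.3, sd = 0.1)
LW <- -3.5 + 2.2 * LL + rnorm(n, sd = 0.03)
LW[40] <- LW[40] + 1                       # plant one gross outlier
dat <- data.frame(LL = LL, LW = LW)

# 1) fit the model to all data
fit1 <- lm(LW ~ LL, data = dat)

# 2) DFFITS for every observation
d <- dffits(fit1)

# 3) flag points beyond 2*sqrt(p/n); p = number of coefficients (assumed 2 here)
p      <- length(coef(fit1))
cutoff <- 2 * sqrt(p / nrow(dat))
keep   <- unname(abs(d) <= cutoff)

# 4) refit without the flagged points
fit2 <- lm(LW ~ LL, data = dat[keep, ])

# 5) compare the two fits side by side
print(cbind(all_data = coef(fit1), trimmed = coef(fit2)))
print(which(!keep))                        # row numbers that were dropped

# Alternative: a robust fit that downweights outliers instead of removing them
library(MASS)
fit3 <- rlm(LW ~ LL, data = dat)
print(coef(fit3))
```

With real data, replace the simulated dat with the data frame read from file; nothing needs to go through a .csv round trip, since the logical vector keep does the row selection directly.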