Robert A. LaBudde
2007-May-27 20:55 UTC
[R] How to reference or sort rownames in a data frame
As I was working through elementary examples, I was using dataset
"plasma" of package "HSAUR".

In performing a logistic regression of the data, and making the
diagnostic plots (R-2.5.0)

  data(plasma, package='HSAUR')
  plasma_1 <- glm(ESR ~ fibrinogen * globulin, data=plasma, family=binomial())
  layout(matrix(1:4, nrow=2))
  plot(plasma_1)

I find that data points corresponding to rownames 17 and 23 are
outliers and high leverage.

I would then like to perform a fit without these two rows.

In principle this should be easy, using an update() with subset=-c(17,23).

The problem is that the rownames in this dataset are not ordered,
and, in fact, the relevant rows are 30 and 31, not 17 and 23.

This brings up the following (elementary?) questions:

1. How do you reference rows in "subset=" for which you know the
rownames, but not the row numbers?

2. How do you discover the rows corresponding to particular
rownames? (Using plasma[rownames(plasma)==17,] shows the data, but
NOT the row number!) (Probably the same answer as in Q. 1 above.)

3. How do you sort (order) the rows of an existing data frame so that
the rownames are in order?

I don't seem to know the magic words to find the answers to these
questions in the help systems.

Obviously this can be done by writing new, brute force, functions
scanning the subscripts, but there must be an (obvious?) direct way
of doing this more elegantly.

Thanks for any pointers.

================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
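Editorial note: the mismatch described above can be reproduced on a small scale. The sketch below uses a made-up data frame (`df` and its values are hypothetical, not the plasma data) whose row names are out of order. It illustrates that a negative number in a row index drops by position, whereas comparing against rownames() drops by label.

```r
# Hypothetical toy data frame (not the plasma data): the row names are
# labels stored as character strings, deliberately out of order.
df <- data.frame(x = c(10, 20, 30, 40),
                 row.names = c("2", "4", "1", "3"))

# A negative number drops by POSITION: this removes the first row,
# whose label happens to be "2".
dropped_by_position <- df[-1, , drop = FALSE]
rownames(dropped_by_position)   # "4" "1" "3": label "1" is still there

# Comparing against the row names drops by LABEL instead.
dropped_by_name <- df[rownames(df) != "1", , drop = FALSE]
rownames(dropped_by_name)       # "2" "4" "3"
```

Row names are always character strings, even when they look like numbers, which is why the positional and label-based forms can disagree once the rows are not in label order.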
Gabor Grothendieck
2007-May-28 02:29 UTC
[R] How to reference or sort rownames in a data frame
On 5/27/07, Robert A. LaBudde <ral at lcfltd.com> wrote:
> As I was working through elementary examples, I was using dataset
> "plasma" of package "HSAUR".
>
> In performing a logistic regression of the data, and making the
> diagnostic plots (R-2.5.0)
>
> data(plasma, package='HSAUR')
> plasma_1 <- glm(ESR ~ fibrinogen * globulin, data=plasma, family=binomial())
> layout(matrix(1:4, nrow=2))
> plot(plasma_1)
>
> I find that data points corresponding to rownames 17 and 23 are
> outliers and high leverage.
>
> I would then like to perform a fit without these two rows.
>
> In principle this should be easy, using an update() with subset=-c(17,23).
>
> The problem is that the rownames in this dataset are not ordered,
> and, in fact, the relevant rows are 30 and 31, not 17 and 23.
>
> This brings up the following (elementary?) questions:
>
> 1. How do you reference rows in "subset=" for which you know the
> rownames, but not the row numbers?

Use a logical vector:

  rownames(plasma) %in% c(17, 23)

> 2. How do you discover the rows corresponding to particular
> rownames? (Using plasma[rownames(plasma)==17,] shows the data, but
> NOT the row number!) (Probably the same answer as in Q. 1 above.)

  which(rownames(plasma) %in% c(17, 23))  # 30, 31

> 3. How do you sort (order) the rows of an existing data frame so that
> the rownames are in order?

  plasma[order(as.numeric(rownames(plasma))), ]
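Editorial note: the three answers above can be tried together without installing HSAUR. The sketch below is only an illustration under stated assumptions: `toy`, its values, and the use of lm() in place of the logistic glm() are all made up for the example. Because the goal in the original question is to drop rows 17 and 23, the logical vector is negated before being passed to subset=.

```r
# Hypothetical toy data standing in for plasma; row names are out of order.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.2),
  row.names = c("5", "17", "2", "23", "11", "8")
)
fit <- lm(y ~ x, data = toy)   # lm() used here only to keep the sketch small

# Q1: a logical vector built from the row names; negate it to DROP the rows.
keep <- !(rownames(toy) %in% c("17", "23"))
fit2 <- update(fit, subset = keep)
nobs(fit2)                     # 4 observations remain

# Q2: positions of the rows that carry those names.
pos <- which(rownames(toy) %in% c("17", "23"))
pos                            # 2 4

# Q3: reorder the rows by the numeric value of their names.
sorted <- toy[order(as.numeric(rownames(toy))), ]
rownames(sorted)               # "2" "5" "8" "11" "17" "23"
```

The same pattern should carry over to the model in the thread, for example update(plasma_1, subset = !(rownames(plasma) %in% c("17", "23"))). Note that sorting with as.numeric() only makes sense when every row name is a number written as text.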