sms13+@pitt.edu
2005-May-24 18:27 UTC
[R] obtaining first and last record for rows with same identifier
I have a dataframe that contains fields such as patid, labdate, labvalue. The same patid may show up in multiple rows because of lab measurements on multiple days. Is there a simple way to obtain just the first and last record for each patient, or do I need to write some code that performs that? Thanks, Steven
Sean Davis
2005-May-24 18:37 UTC
[R] obtaining first and last record for rows with same identifier
If you have your data.frame ordered by the patid, you can use the function rle in combination with cumsum. As a vector example:

> a <- rep(c('a','b','c'),10)
> a
 [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a"
[20] "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
> b <- a[order(a)]
> b
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "b" "b"
[20] "b" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c"
> l <- rle(b)$length
> cbind(l,cumsum(l),cumsum(l)-l+1)
      l
[1,] 10 10  1
[2,] 10 20 11
[3,] 10 30 21

# use the line below to get the length of the block of the dataframe, the start, and then end indices
> cbind(l,cumsum(l)-l+1,cumsum(l))
      l
[1,] 10  1 10
[2,] 10 11 20
[3,] 10 21 30

Sean
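The vector example above carries over to a data frame with little change. The following is only a sketch: the sample data and the names labs, len, first, last, idx and first_last are made up for illustration, and the columns patid, labdate and labvalue are taken from the original question. It spells out the full component name lengths, and it assumes that "first" and "last" mean earliest and latest labdate.

```r
# Hypothetical lab data; column names follow the original question.
labs <- data.frame(
  patid    = c(3, 1, 2, 1, 3, 1),
  labdate  = as.Date(c("2005-01-10", "2005-01-05", "2005-02-01",
                       "2005-03-15", "2005-01-02", "2005-02-20")),
  labvalue = c(5.1, 4.2, 6.3, 4.8, 5.5, 4.5)
)

# rle() only detects runs of adjacent equal values, so sort by patient first,
# and by date within patient so that row order matches time order.
labs <- labs[order(labs$patid, labs$labdate), ]

len   <- rle(labs$patid)$lengths  # number of rows per patient
last  <- cumsum(len)              # row index of each patient's last record
first <- last - len + 1           # row index of each patient's first record

# A patient with a single record has first == last; unique() keeps that
# row from being returned twice.
idx <- unique(sort(c(first, last)))
first_last <- labs[idx, ]
print(first_last)
```

Because the indices are computed once for the whole frame, no per-patient function call is needed, but the result is only meaningful if the frame is sorted before rle() is applied.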
Francisco J. Zagmutt
2005-May-25 02:15 UTC
[R] obtaining first and last record for rows with same identifier
If you want to obtain a data frame you can use the functions head and tail like:

dat=data.frame(id=rep(1:5,3),num=rnorm(15), num2=rnorm(15)) #Creates data frame with id
last=do.call("rbind",by(dat,dat$id,tail,1)) #Selects the last observation for each id
first=do.call("rbind",by(dat,dat$id,head,1)) #Selects the first observation for each id
newdat=rbind(first,last) #Joins data
newdat=newdat[order(newdat$id),] #sorts data by id

Notice that rownames will give you the original row location of the observations selected

I hope this helps

Francisco

>From: Berton Gunter <gunter.berton at gene.com>
>To: "'Sean Davis'" <sdavis2 at mail.nih.gov>, <sms13+ at pitt.edu>
>CC: "'rhelp'" <r-help at stat.math.ethz.ch>
>Subject: RE: [R] obtaining first and last record for rows with same identifier
>Date: Tue, 24 May 2005 12:17:58 -0700
>
>I think by() is simpler:
>
>by(yourframe,factor(yourframe$patid),function(x)x[c(1,nrow(x)),])
>
>-- Bert Gunter
>Genentech Non-Clinical Statistics
>South San Francisco, CA
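For comparison, here is a sketch of the by() suggestion applied to data shaped like the original question. The sample values and the names labs, pieces and first_last are invented for illustration. Two small changes from the one-liner are assumptions on my part rather than part of the suggestion: the frame is sorted by labdate within patid first, so that row 1 and row nrow(x) are the earliest and latest measurements, and unique() is added so that a patient with a single record is not returned twice.

```r
# Hypothetical lab data; column names follow the original question.
labs <- data.frame(
  patid    = c(3, 1, 2, 1, 3, 1),
  labdate  = as.Date(c("2005-01-10", "2005-01-05", "2005-02-01",
                       "2005-03-15", "2005-01-02", "2005-02-20")),
  labvalue = c(5.1, 4.2, 6.3, 4.8, 5.5, 4.5)
)

# Sort so that, within each patient, rows run from earliest to latest date.
labs <- labs[order(labs$patid, labs$labdate), ]

# For each patient keep row 1 and row nrow(x); unique() collapses the two
# indices into one when the patient has only a single record.
pieces <- by(labs, labs$patid, function(x) x[unique(c(1, nrow(x))), ])

# by() returns one small data frame per patient; stack them into one frame.
first_last <- do.call("rbind", pieces)
print(first_last)
```

This does the selection in a single pass per patient instead of separate head and tail passes followed by a join and a re-sort.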
Frank E Harrell Jr
2005-May-25 12:20 UTC
[R] obtaining first and last record for rows with same identifier
Francisco J. Zagmutt wrote:
> If you want to obtain a data frame you can use the functions head and
> tail like:
> . . .

You might also look at section 4.3 of http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University