sms13+@pitt.edu
2005-May-24 18:27 UTC
[R] obtaining first and last record for rows with same identifier
I have a data frame that contains fields such as patid, labdate, and labvalue. The same patid may show up in multiple rows because of lab measurements on multiple days. Is there a simple way to obtain just the first and last record for each patient, or do I need to write some code to do that? Thanks, Steven
Sean Davis
2005-May-24 18:37 UTC
[R] obtaining first and last record for rows with same identifier
If you have your data.frame ordered by the patid, you can use the
function rle in combination with cumsum. As a vector example:
> a <- rep(c('a','b','c'),10)
> a
 [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c" "a"
[20] "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
> b <- a[order(a)]
> b
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "b" "b"
[20] "b" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c"
> l <- rle(b)$length
> cbind(l,cumsum(l),cumsum(l)-l+1)
      l
[1,] 10 10  1
[2,] 10 20 11
[3,] 10 30 21
# use the line below to get the length of the block of the dataframe, the start, and then end indices
> cbind(l,cumsum(l)-l+1,cumsum(l))
      l
[1,] 10  1 10
[2,] 10 11 20
[3,] 10 21 30
Sean
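A minimal sketch of how the start and end indices from rle and cumsum might be used to pull rows out of a data frame. The data frame `labs` and its values are invented for illustration; only the column names patid, labdate and labvalue come from the original question. It assumes that "first" and "last" mean earliest and latest labdate.

```r
# Hypothetical data; only the column names come from the question.
labs <- data.frame(
  patid    = c(2, 1, 1, 2, 3, 1),
  labdate  = as.Date(c("2005-01-10", "2005-01-03", "2005-02-01",
                       "2005-03-15", "2005-01-20", "2005-01-15")),
  labvalue = c(5.1, 4.2, 4.8, 5.6, 3.9, 4.5)
)

# Sort by patient, then by date, so each patient forms one contiguous block.
labs <- labs[order(labs$patid, labs$labdate), ]

# Run lengths of the sorted identifier give the size of each block.
l     <- rle(as.character(labs$patid))$lengths
end   <- cumsum(l)      # index of the last row of each block
start <- end - l + 1    # index of the first row of each block

first_rows <- labs[start, ]
last_rows  <- labs[end, ]
```

A patient with a single record has the same start and end index, so that row appears in both `first_rows` and `last_rows`.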
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
Francisco J. Zagmutt
2005-May-25 02:15 UTC
[R] obtaining first and last record for rows with same identifier
If you want to obtain a data frame you can use the functions head and tail
like:
dat=data.frame(id=rep(1:5,3),num=rnorm(15), num2=rnorm(15)) #Creates data frame with id
last=do.call("rbind",by(dat,dat$id,tail,1)) #Selects the last observation for each id
first=do.call("rbind",by(dat,dat$id,head,1)) #Selects the first observation for each id
newdat=rbind(first,last) #Joins data
newdat=newdat[order(newdat$id),] #sorts data by id
Notice that rownames will give you the original row location of the
observations selected
I hope this helps
Francisco
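A sketch of the head/tail approach applied to columns named as in the original question. The data frame `labs` and its values are hypothetical. Because head and tail select rows by position, the sketch sorts by date first; it also assumes that a patient with only one record should appear once in the result.

```r
# Hypothetical data; only the column names come from the question.
labs <- data.frame(
  patid    = c(2, 1, 1, 2, 3, 1),
  labdate  = as.Date(c("2005-01-10", "2005-01-03", "2005-02-01",
                       "2005-03-15", "2005-01-20", "2005-01-15")),
  labvalue = c(5.1, 4.2, 4.8, 5.6, 3.9, 4.5)
)

# head() and tail() pick rows by position, so order by date within patient.
labs <- labs[order(labs$patid, labs$labdate), ]

first <- do.call("rbind", by(labs, labs$patid, head, 1))
last  <- do.call("rbind", by(labs, labs$patid, tail, 1))

# A patient with a single record is in both `first` and `last`;
# unique() drops the repeated row.
both <- unique(rbind(first, last))
both <- both[order(both$patid, both$labdate), ]
```

Note that unique() compares whole rows, so it would also collapse two genuinely identical records for the same patient.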
>From: Berton Gunter <gunter.berton at gene.com>
>To: "'Sean Davis'" <sdavis2 at mail.nih.gov>, <sms13+ at pitt.edu>
>CC: "'rhelp'" <r-help at stat.math.ethz.ch>
>Subject: RE: [R] obtaining first and last record for rows with same identifier
>Date: Tue, 24 May 2005 12:17:58 -0700
>
>I think by() is simpler:
>
> by(yourframe,factor(yourframe$patid),function(x)x[c(1,nrow(x)),])
>
>-- Bert Gunter
>Genentech Non-Clinical Statistics
>South San Francisco, CA
>
>"The business of the statistician is to catalyze the scientific learning
>process." - George E. P. Box
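The by() call quoted above returns one piece per patient rather than a single data frame. A minimal sketch of one way to combine the pieces, using hypothetical data (only the column names come from the original question). The unique() around the row indices is an addition here, not part of the quoted one-liner; it keeps a single-record patient from being returned twice.

```r
# Hypothetical data; only the column names come from the question.
labs <- data.frame(
  patid    = c(2, 1, 1, 2, 3, 1),
  labdate  = as.Date(c("2005-01-10", "2005-01-03", "2005-02-01",
                       "2005-03-15", "2005-01-20", "2005-01-15")),
  labvalue = c(5.1, 4.2, 4.8, 5.6, 3.9, 4.5)
)

# Order by date within patient so rows 1 and nrow(x) are earliest and latest.
labs <- labs[order(labs$patid, labs$labdate), ]

# One piece per patient: the first and last row of that patient's block.
pieces <- by(labs, factor(labs$patid),
             function(x) x[unique(c(1, nrow(x))), ])

# Stack the per-patient pieces into one data frame.
first_last <- do.call("rbind", pieces)
```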
Frank E Harrell Jr
2005-May-25 12:20 UTC
[R] obtaining first and last record for rows with same identifier
Francisco J. Zagmutt wrote:
> If you want to obtain a data frame you can use the functions head and
> tail like:
>
> dat=data.frame(id=rep(1:5,3),num=rnorm(15), num2=rnorm(15)) #Creates data frame with id
> last=do.call("rbind",by(dat,dat$id,tail,1)) #Selects the last observation for each id
> first=do.call("rbind",by(dat,dat$id,head,1)) #Selects the first observation for each id
> newdat=rbind(first,last) #Joins data
> newdat=newdat[order(newdat$id),] #sorts data by id
>
> Notice that rownames will give you the original row location of the
> observations selected
>
> I hope this helps
>
> Francisco
>. . .

You might also look at section 4.3 of
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University