Kathleen Rollet
2011-Aug-24 21:15 UTC
[R] Creating new variable with maximum visit date by group_id
Dear R users, I am encoutering the following problem: I have a dataset with a 'unique_id' and different 'visit_date' (formatted as.Date, "%d/%m/%Y") per unique_id. I would like to create a new variable with the most recent date of visit per unique_id as shown below. unique_id visit_date last_visit_date 1 01/06/2010 01/06/2011 1 01/01/2011 01/06/2011 1 01/06/2011 01/06/2011 2 01/01/2009 01/07/2011 2 01/06/2009 01/07/2011 2 01/06/2010 01/07/2011 2 01/01/2011 01/07/2011 2 01/07/2011 01/07/2011 3 01/01/2008 01/01/2008 4 01/01/2009 01/01/2010 4 01/01/2010 01/01/2010 I know the coding to easily do this in Stata, SAS, and Excel but I cannot find how to do it in R. I try multiple function such as tapply( ), ave( ), ddply ( ), and transform ( ) after looking into previous postings. The codes are running but only NA values are generated or I get error messages that the replacement has less row than the data has (there are about 1000 unique_id and over 4000 rows in my dataset presently). I would greatly appreciate if someone could help me. Thank you! Kathleen R. Epidemiologist Montreal, QC, Canada [[alternative HTML version deleted]]
David Winsemius
2011-Aug-24 22:25 UTC
[R] Creating new variable with maximum visit date by group_id
On Aug 24, 2011, at 5:15 PM, Kathleen Rollet wrote:> Dear R users, > > I am encoutering the following problem: I have a dataset with a > 'unique_id' and different 'visit_date' (formatted as.Date, "%d/%m/ > %Y") per unique_id. I would like to create a new variable with the > most recent date of visit per unique_id as shown below.That should not result in what is below unless you have changes something in options() forcing a different data output format. (Is that even possible?)> > unique_id visit_date last_visit_date > 1 01/06/2010 01/06/2011 > 1 01/01/2011 01/06/2011 > 1 01/06/2011 01/06/2011 > 2 01/01/2009 01/07/2011 > 2 01/06/2009 01/07/2011 > 2 01/06/2010 01/07/2011 > 2 01/01/2011 01/07/2011 > 2 01/07/2011 01/07/2011 > 3 01/01/2008 01/01/2008 > 4 01/01/2009 01/01/2010 > 4 01/01/2010 01/01/2010 >Read it in as dfrm named "dat" with: colClasses=c("numeric", "character", "character") Then: dat$visit_date <-as.Date(dat$visit_date, format="%d/%m/%Y", origin="1970-01-01") dat$last_visit_date <-as.Date(dat$last_visit_date, format="%d/%m/%Y", origin="1970-01-01")> I know the coding to easily do this in Stata, SAS, and Excel but I > cannot find how to do it in R. I try multiple function such as > tapply( ), ave( ), ddply ( ), and transform ( ) after looking into > previous postings. The codes are running but only NA values are > generated or I get error messages that the replacement has less row > than the data has (there are about 1000 unique_id and over 4000 rows > in my dataset presently).The 'ave' function should be able to do it. It returns a vector as long as the dataframe has rows. You are asked to post your failures as well as reproducible code which is best produced with dput(). (This apples doubly so when you choose non-standard formats for Date objects.) Please read: ?dput ?ave Worked example: dat$most_recent<- format(ave(dat$visit_date, dat$unique_id, FUN=max), format="%d/%m/%Y") dat NOTE: that last column is not an R date but rather a character vector.> I would greatly appreciate if someone could help me. > > Thank you! > > Kathleen R. > Epidemiologist > Montreal, QC, Canada > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD West Hartford, CT
Jean V Adams
2011-Aug-24 22:29 UTC
[R] Creating new variable with maximum visit date by group_id
Try this: require(zoo) lvd <- tapply(df$visit_date, df$unique_id, max) index <- tapply(df$visit_date, df$unique_id) df$last_visit_date <- as.Date(lvd[index]) Jean Kathleen Rollet wrote on 08/24/2011 04:15:45 PM:> > Dear R users, > > I am encoutering the following problem: I have a dataset with a > 'unique_id' and different 'visit_date' (formatted as.Date, "%d/%m/% > Y") per unique_id. I would like to create a new variable with the > most recent date of visit per unique_id as shown below. > > unique_id visit_date last_visit_date > 1 01/06/2010 01/06/2011 > 1 01/01/2011 01/06/2011 > 1 01/06/2011 01/06/2011 > 2 01/01/2009 01/07/2011 > 2 01/06/2009 01/07/2011 > 2 01/06/2010 01/07/2011 > 2 01/01/2011 01/07/2011 > 2 01/07/2011 01/07/2011 > 3 01/01/2008 01/01/2008 > 4 01/01/2009 01/01/2010 > 4 01/01/2010 01/01/2010 > > I know the coding to easily do this in Stata, SAS, and Excel but I > cannot find how to do it in R. I try multiple function such as > tapply( ), ave( ), ddply ( ), and transform ( ) after looking into > previous postings. The codes are running but only NA values are > generated or I get error messages that the replacement has less row > than the data has (there are about 1000 unique_id and over 4000 rows > in my dataset presently). > I would greatly appreciate if someone could help me. > > Thank you! > > Kathleen R. > Epidemiologist > Montreal, QC, Canada > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.[[alternative HTML version deleted]]
Dennis Murphy
2011-Aug-25 05:21 UTC
[R] Creating new variable with maximum visit date by group_id
Hi: Since you tried several functions (reasonably so IMO), here is how they would work in this problem, in addition to the solutions already supplied. Some data massaging before starting, taking your data as input, saved into an object named visits: visits <- structure(list(unique_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 4L, 4L), visit_date = structure(c(14615, 14975, 14980, 14245, 14250, 14615, 14975, 14981, 13879, 14245, 14610), class = "Date")), .Names = c("unique_id", "visit_date"), row.names = c(NA, -11L), class = "data.frame") # plyr package: ddply(visits, .(unique_id), transform, last_visit_date = max(visit_date)) # Faster version, using the more recent function mutate(): mutate(visits, .(unique_id), last_visit_date = max(visit_date)) # data.table: library(data.table) # Create the data table from a data frame, using unique_id as a key: visDT <- data.table(visits, key = 'unique_id') # list() is used to output multiple variables: visDT[, list(visit_date, last_visit_date = max(visit_date)), by = 'unique_id'] # doBy package: library(doBy) transformBy(~ unique_id, data = visits, last_visit_date = max(visit_date)) # base package, using transform(): transform(visits, last_visit_date = ave(visit_date, unique_id, FUN = max)) # ...which you could have gotten from the original data as follows, including # the conversion of visit_date to a Date variable assuming it was read in as character # rather than factor, as below: visits0 <- structure(list(unique_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 4L, 4L), visit_date = c("01/06/2010", "01/01/2011", "01/06/2011", "01/01/2009", "01/06/2009", "01/06/2010", "01/01/2011", "01/07/2011", "01/01/2008", "01/01/2009", "01/01/2010")), .Names = c("unique_id", "visit_date"), row.names = c(NA, -11L), class = "data.frame") str(visits0) within(visits0, { visit_date <- as.Date(visit_date, format = '%m/%d/%Y') last_visit_date <- ave(visit_date, unique_id, FUN = max) } ) # All of the above produce unique_id visit_date last_visit_date 1 1 2010-01-06 2011-01-06 2 1 2011-01-01 2011-01-06 3 1 2011-01-06 2011-01-06 4 2 2009-01-01 2011-01-07 5 2 2009-01-06 2011-01-07 6 2 2010-01-06 2011-01-07 7 2 2011-01-01 2011-01-07 8 2 2011-01-07 2011-01-07 9 3 2008-01-01 2008-01-01 10 4 2009-01-01 2010-01-01 11 4 2010-01-01 2010-01-01 With basic summaries such as this, there are a wealth of options available. mutate() is faster than ddply() for transform operations, data.table can be very fast, especially in large data sets, and the within() statement above shows how to perform the entire task (including conversion to dates) in one fell swoop. Notice that by using within(), one can convert visit_date to a Date object and then use it as input to the next function. You should be able to do this with the mysql package, too, but my SQL programming skills are pretty limited so I'll pass on that one. I made a weak effort but....no. HTH, Dennis On Wed, Aug 24, 2011 at 2:15 PM, Kathleen Rollet <kathleen955 at hotmail.com> wrote:> > > > > > Dear R users, > > I am encoutering the following problem: I have a dataset with a 'unique_id' and different 'visit_date' (formatted as.Date, "%d/%m/%Y") per unique_id. I would like to create a new variable with the most recent date of visit per unique_id as shown below. > > unique_id visit_date last_visit_date > 1 ?01/06/2010 ?01/06/2011 > 1 ?01/01/2011 ?01/06/2011 > 1 ?01/06/2011 ?01/06/2011 > 2 ?01/01/2009 ?01/07/2011 > 2 ?01/06/2009 ?01/07/2011 > 2 ?01/06/2010 ?01/07/2011 > 2 ?01/01/2011 ?01/07/2011 > 2 ?01/07/2011 ?01/07/2011 > 3 ?01/01/2008 ?01/01/2008 > 4 ?01/01/2009 ?01/01/2010 > 4 ?01/01/2010 ?01/01/2010 > > I know the coding to easily do this in Stata, SAS, and Excel but I cannot find how to do it in R. I try multiple function such as tapply( ), ave( ), ddply ( ), and transform ( ) after looking into previous postings. The codes are running but only NA values are generated or I get error messages that the replacement has less row than the data has (there are about 1000 unique_id and over 4000 rows in my dataset presently). > I would greatly appreciate if someone could help me. > > Thank you! > > Kathleen R. > Epidemiologist > Montreal, QC, Canada > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >