Muhuri, Pradip (SAMHSA/CBHSQ)
2014-Nov-09 04:40 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
Hello, The example data frame in the reproducible code below has 5 columns (1 column for id and 4 columns for dates), and there are 7 observations. I would like to insert the most recent date from those 4 date columns into a new column (oiddate) using the mutate() function in the dplyr package. I am getting correct results (NA in the new column) if a given row has all NA's in the four columns. However, the issue is that the date value inserted into the new column (oidflag) is incorrect for 5 of the remaining 6 rows (with a non-NA value in at least 1 of the four columns). I would appreciate receiving your help toward resolving the issue. Please see the R console and the R script (reproducible example)below. Thanks in advance. Pradip ###### from the console ######## print (data2) id mrjdate cocdate inhdate haldate oidflag 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2011-11-04 2 2 <NA> <NA> <NA> <NA> <NA> 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-11-04 4 4 2007-10-10 <NA> <NA> <NA> 2011-11-04 5 5 2006-09-01 2005-08-10 <NA> <NA> 2011-11-04 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-11-04 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 ################## Reproducible code and data ##################################### library(dplyr) library(lubridate) library(zoo) # data object - description of the temp <- "id mrjdate cocdate inhdate haldate 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2 NA NA NA NA 3 2009-10-24 NA 2011-10-13 NA 4 2007-10-10 NA NA NA 5 2006-09-01 2005-08-10 NA NA 6 2007-09-04 2011-10-05 NA NA 7 2005-10-25 NA NA 2011-11-04" # read the data object data1 <- read.table(textConnection(temp), colClasses=c("character", "Date", "Date", "Date", "Date"), header=TRUE, as.is=TRUE ) # create a new column data2 <- mutate(data1, oidflag= ifelse(is.na(mrjdate) & is.na(cocdate) & is.na(inhdate) & is.na(haldate), NA, max(mrjdate, cocdate, inhdate, haldate,na.rm=TRUE ) ) ) # convert to date data2$oidflag = as.Date(data2$oidflag, origin="1970-01-01") # print records print (data2) Pradip K. Muhuri, PhD SAMHSA/CBHSQ 1 Choke Cherry Road, Room 2-1071 Rockville, MD 20857 Tel: 240-276-1070 Fax: 240-276-1260 [[alternative HTML version deleted]]
Mark Sharp
2014-Nov-09 08:00 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
Pradip, mutate() works on the entire column as a vector so that you find the maximum of the entire data set. I am almost certain there is some nice way to handle this, but the sapply() function is a standard approach. max() does not want a dataframe thus the use of unlist(). Using your definition of data1: data3 <- data1 data3$oidflag <- as.Date(sapply(seq_along(data3$id), function(row) { if (all(is.na(unlist(data1[row, -1])))) { max_d <- NA } else { max_d <- max(unlist(data1[row, -1]), na.rm = TRUE) } max_d}), origin = "1970-01-01") data3 id mrjdate cocdate inhdate haldate oidflag 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2008-07-18 2 2 <NA> <NA> <NA> <NA> <NA> 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-10-13 4 4 2007-10-10 <NA> <NA> <NA> 2007-10-10 5 5 2006-09-01 2005-08-10 <NA> <NA> 2006-09-01 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-10-05 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 R. Mark Sharp, Ph.D. Director of Primate Records Database Southwest National Primate Research Center Texas Biomedical Research Institute P.O. Box 760549 San Antonio, TX 78245-0549 Telephone: (210)258-9476 e-mail: msharp at TxBiomed.org NOTICE: This E-Mail (including attachments) is confidential and may be legally privileged. It is covered by the Electronic Communications Privacy Act, 18 U.S.C.2510-2521. If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution or copying of this communication is strictly prohibited. Please reply to the sender that you have received this message in error, then delete it.
Daniel Nordlund
2014-Nov-09 10:32 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
On 11/8/2014 8:40 PM, Muhuri, Pradip (SAMHSA/CBHSQ) wrote:> Hello, > > > > The example data frame in the reproducible code below has 5 columns (1 column for id and 4 columns for dates), and there are 7 observations. I would like to insert the most recent date from those 4 date columns into a new column (oiddate) using the mutate() function in the dplyr package. I am getting correct results (NA in the new column) if a given row has all NA's in the four columns. However, the issue is that the date value inserted into the new column (oidflag) is incorrect for 5 of the remaining 6 rows (with a non-NA value in at least 1 of the four columns). > > > > I would appreciate receiving your help toward resolving the issue. Please see the R console and the R script (reproducible example)below. > > > > Thanks in advance. > > > > Pradip > > > > > > ###### from the console ######## > > print (data2) > > id mrjdate cocdate inhdate haldate oidflag > > 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2011-11-04 > > 2 2 <NA> <NA> <NA> <NA> <NA> > > 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-11-04 > > 4 4 2007-10-10 <NA> <NA> <NA> 2011-11-04 > > 5 5 2006-09-01 2005-08-10 <NA> <NA> 2011-11-04 > > 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-11-04 > > 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 > > > > > > ################## Reproducible code and data ##################################### > > > > library(dplyr) > > library(lubridate) > > library(zoo) > > # data object - description of the > > > > temp <- "id mrjdate cocdate inhdate haldate > > 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 > > 2 NA NA NA NA > > 3 2009-10-24 NA 2011-10-13 NA > > 4 2007-10-10 NA NA NA > > 5 2006-09-01 2005-08-10 NA NA > > 6 2007-09-04 2011-10-05 NA NA > > 7 2005-10-25 NA NA 2011-11-04" > > > > # read the data object > > > > data1 <- read.table(textConnection(temp), > > colClasses=c("character", "Date", "Date", "Date", "Date"), > > header=TRUE, as.is=TRUE > > ) > > # create a new column > > > > data2 <- mutate(data1, > > oidflag= ifelse(is.na(mrjdate) & is.na(cocdate) & is.na(inhdate) & is.na(haldate), NA, > > max(mrjdate, cocdate, inhdate, haldate,na.rm=TRUE ) > > ) > > ) > > > > # convert to date > > data2$oidflag = as.Date(data2$oidflag, origin="1970-01-01") > > > > # print records > > > > print (data2) > > > > > > Pradip K. Muhuri, PhD > > SAMHSA/CBHSQ > > 1 Choke Cherry Road, Room 2-1071 > > Rockville, MD 20857 > > Tel: 240-276-1070 > > Fax: 240-276-1260 > > > > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >I am not familiar with the mutate() function from dplyr, but you can get your wanted results as follows: data2 <- within(data1, oidflag <- apply(data1[,-1], 1, max, na.rm=TRUE)) Hope this is helpful, Dan Daniel Nordlund Bothell, WA USA
Muhuri, Pradip (SAMHSA/CBHSQ)
2014-Nov-09 11:11 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
Hi Mark, Your code has also given me the results I expected. Thank you so much for your help. Regards, Pradip Pradip K. Muhuri, PhD SAMHSA/CBHSQ 1 Choke Cherry Road, Room 2-1071 Rockville, MD 20857 Tel: 240-276-1070 Fax: 240-276-1260 -----Original Message----- From: Mark Sharp [mailto:msharp at TxBiomed.org] Sent: Sunday, November 09, 2014 3:01 AM To: Muhuri, Pradip (SAMHSA/CBHSQ) Cc: r-help at r-project.org Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb) Pradip, mutate() works on the entire column as a vector so that you find the maximum of the entire data set. I am almost certain there is some nice way to handle this, but the sapply() function is a standard approach. max() does not want a dataframe thus the use of unlist(). Using your definition of data1: data3 <- data1 data3$oidflag <- as.Date(sapply(seq_along(data3$id), function(row) { if (all(is.na(unlist(data1[row, -1])))) { max_d <- NA } else { max_d <- max(unlist(data1[row, -1]), na.rm = TRUE) } max_d}), origin = "1970-01-01") data3 id mrjdate cocdate inhdate haldate oidflag 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2008-07-18 2 2 <NA> <NA> <NA> <NA> <NA> 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-10-13 4 4 2007-10-10 <NA> <NA> <NA> 2007-10-10 5 5 2006-09-01 2005-08-10 <NA> <NA> 2006-09-01 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-10-05 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 R. Mark Sharp, Ph.D. Director of Primate Records Database Southwest National Primate Research Center Texas Biomedical Research Institute P.O. Box 760549 San Antonio, TX 78245-0549 Telephone: (210)258-9476 e-mail: msharp at TxBiomed.org NOTICE: This E-Mail (including attachments) is confidential and may be legally privileged. It is covered by the Electronic Communications Privacy Act, 18 U.S.C.2510-2521. If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution or copying of this communication is strictly prohibited. Please reply to the sender that you have received this message in error, then delete it.
arun
2014-Nov-09 12:00 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
You could try library(dplyr) data1 %>% rowwise() %>% mutate(oldflag=as.Date(max(mrjdate,cocdate, inhdate, haldate, na.rm=TRUE), origin='1970-01-01')) Source: local data frame [7 x 6] Groups: <by row> id mrjdate cocdate inhdate haldate oldflag 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2008-07-18 2 2 <NA> <NA> <NA> <NA> <NA> 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-10-13 4 4 2007-10-10 <NA> <NA> <NA> 2007-10-10 5 5 2006-09-01 2005-08-10 <NA> <NA> 2006-09-01 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-10-05 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 A.K. On Saturday, November 8, 2014 11:42 PM, "Muhuri, Pradip (SAMHSA/CBHSQ)" <Pradip.Muhuri at samhsa.hhs.gov> wrote: Hello, The example data frame in the reproducible code below has 5 columns (1 column for id and 4 columns for dates), and there are 7 observations. I would like to insert the most recent date from those 4 date columns into a new column (oiddate) using the mutate() function in the dplyr package. I am getting correct results (NA in the new column) if a given row has all NA's in the four columns. However, the issue is that the date value inserted into the new column (oidflag) is incorrect for 5 of the remaining 6 rows (with a non-NA value in at least 1 of the four columns). I would appreciate receiving your help toward resolving the issue. Please see the R console and the R script (reproducible example)below. Thanks in advance. Pradip ###### from the console ######## print (data2) id mrjdate cocdate inhdate haldate oidflag 1 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2011-11-04 2 2 <NA> <NA> <NA> <NA> <NA> 3 3 2009-10-24 <NA> 2011-10-13 <NA> 2011-11-04 4 4 2007-10-10 <NA> <NA> <NA> 2011-11-04 5 5 2006-09-01 2005-08-10 <NA> <NA> 2011-11-04 6 6 2007-09-04 2011-10-05 <NA> <NA> 2011-11-04 7 7 2005-10-25 <NA> <NA> 2011-11-04 2011-11-04 ################## Reproducible code and data ##################################### library(dplyr) library(lubridate) library(zoo) # data object - description of the temp <- "id mrjdate cocdate inhdate haldate 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2 NA NA NA NA 3 2009-10-24 NA 2011-10-13 NA 4 2007-10-10 NA NA NA 5 2006-09-01 2005-08-10 NA NA 6 2007-09-04 2011-10-05 NA NA 7 2005-10-25 NA NA 2011-11-04" # read the data object data1 <- read.table(textConnection(temp), colClasses=c("character", "Date", "Date", "Date", "Date"), header=TRUE, as.is=TRUE ) # create a new column data2 <- mutate(data1, oidflag= ifelse(is.na(mrjdate) & is.na(cocdate) & is.na(inhdate) & is.na(haldate), NA, max(mrjdate, cocdate, inhdate, haldate,na.rm=TRUE ) ) ) # convert to date data2$oidflag = as.Date(data2$oidflag, origin="1970-01-01") # print records print (data2) Pradip K. Muhuri, PhD SAMHSA/CBHSQ 1 Choke Cherry Road, Room 2-1071 Rockville, MD 20857 Tel: 240-276-1070 Fax: 240-276-1260 [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.