Muhuri, Pradip (SAMHSA/CBHSQ)
2014-Dec-04 03:43 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
Hello Chel and David,

Thank you very much for providing new insights into this issue. Here is one more question: why does mutate() give incorrect results here?

# The following gives INCORRECT results - mutate()d object
na.date.cases = ifelse(!is.na(oiddate),1,0)

# The following gives CORRECT results
new2$na.date.cases = ifelse(!is.na(new2$oiddate),1,0)

############### reproducible example - slightly revised/modified ###############
library(dplyr)
# data object - description

temp <- "id mrjdate cocdate inhdate haldate
1 2004-11-04 2008-07-18 2005-07-07 2007-11-07
2 NA NA NA NA
3 2009-10-24 NA 2011-10-13 NA
4 2007-10-10 NA NA NA
5 2006-09-01 2005-08-10 NA NA
6 2007-09-04 2011-10-05 NA NA
7 2005-10-25 NA NA 2011-11-04"

# read the data object

example.data <- read.table(textConnection(temp),
                    colClasses=c("character", "Date", "Date", "Date", "Date"),
                    header=TRUE, as.is=TRUE
                    )

# create a new column - dplyr solution (Acknowledgement: Arun)

new1 <- example.data %>%
     rowwise() %>%
     mutate(oiddate=as.Date(max(mrjdate,cocdate, inhdate, haldate, na.rm=TRUE), origin='1970-01-01'),
            na.date.cases = ifelse(!is.na(oiddate),1,0)
            )

# create a new column - Base R solution (Acknowledgement: Mark Sharp)

new2 <- example.data
new2$oiddate <- as.Date(sapply(seq_along(new2$id), function(row) {
  if (all(is.na(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')])))) {
    max_d <- NA
  } else {
    max_d <- max(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')]), na.rm = TRUE)
  }
  max_d}),
  origin = "1970-01-01")

new2$na.date.cases = ifelse(!is.na(new2$oiddate),1,0)

identical(new1, new2)

table(new1$oiddate)
table(new2$oiddate)

# print records

print (new1); print(new2)

Pradip K. Muhuri, PhD
SAMHSA/CBHSQ
1 Choke Cherry Road, Room 2-1071
Rockville, MD 20857
Tel: 240-276-1070
Fax: 240-276-1260

-----Original Message-----
From: Chel Hee Lee [mailto:chl948 at mail.usask.ca]
Sent: Wednesday, December 03, 2014 8:48 PM
To: Muhuri, Pradip (SAMHSA/CBHSQ); r-help at r-project.org
Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)

The output in the object 'new1' is apparently the same as the output in the object 'new2'. Are you trying to compare the entries of the two outputs 'new1' and 'new2'? If so, the function 'all()' would be useful:

> all(new1 == new2, na.rm=TRUE)
[1] TRUE

If you are interested in the comparison of two objects in terms of class, then the function 'identical()' is useful:

> attributes(new1)
$names
[1] "id"      "mrjdate" "cocdate" "inhdate" "haldate" "oldflag"

$class
[1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3 4 5 6 7

> attributes(new2)
$names
[1] "id"      "mrjdate" "cocdate" "inhdate" "haldate" "oiddate"

$row.names
[1] 1 2 3 4 5 6 7

$class
[1] "data.frame"

I hope this helps.

Chel Hee Lee

On 12/03/2014 04:10 PM, Muhuri, Pradip (SAMHSA/CBHSQ) wrote:
> Hello,
>
> Two alternative approaches - mutate() vs. sapply() - were used to get the desired results (i.e., creating a new column of the most recent date from 4 dates) with help from Arun and Mark on this forum. I now find that the two data objects (created using two different approaches) are not identical although results are exactly the same.
>
> identical(new1, new2)
> [1] FALSE
>
> Please see the reproducible example below.
>
> I don't understand why the code returns FALSE here. Any hints/comments will be appreciated.
>
> Thanks,
>
> Pradip
>
> ############### reproducible example ###############
> library(dplyr)
> # data object - description
>
> temp <- "id mrjdate cocdate inhdate haldate
> 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07
> 2 NA NA NA NA
> 3 2009-10-24 NA 2011-10-13 NA
> 4 2007-10-10 NA NA NA
> 5 2006-09-01 2005-08-10 NA NA
> 6 2007-09-04 2011-10-05 NA NA
> 7 2005-10-25 NA NA 2011-11-04"
>
> # read the data object
>
> example.data <- read.table(textConnection(temp),
>                     colClasses=c("character", "Date", "Date", "Date", "Date"),
>                     header=TRUE, as.is=TRUE
>                     )
>
> # create a new column - dplyr solution (Acknowledgement: Arun)
>
> new1 <- example.data %>%
>      rowwise() %>%
>      mutate(oldflag=as.Date(max(mrjdate,cocdate, inhdate, haldate, na.rm=TRUE), origin='1970-01-01'))
>
> # create a new column - Base R solution (Acknowledgement: Mark Sharp)
>
> new2 <- example.data
> new2$oiddate <- as.Date(sapply(seq_along(new2$id), function(row) {
>   if (all(is.na(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')])))) {
>     max_d <- NA
>   } else {
>     max_d <- max(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')]), na.rm = TRUE)
>   }
>   max_d}),
>   origin = "1970-01-01")
>
> identical(new1, new2)
>
> # print records
>
> print (new1); print(new2)
>
> Pradip K. Muhuri
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Muhuri, Pradip (SAMHSA/CBHSQ)
> Sent: Sunday, November 09, 2014 6:11 AM
> To: 'Mark Sharp'
> Cc: r-help at r-project.org
> Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
>
> Hi Mark,
>
> Your code has also given me the results I expected. Thank you so much for your help.
>
> Regards,
>
> Pradip
>
> -----Original Message-----
> From: Mark Sharp [mailto:msharp at TxBiomed.org]
> Sent: Sunday, November 09, 2014 3:01 AM
> To: Muhuri, Pradip (SAMHSA/CBHSQ)
> Cc: r-help at r-project.org
> Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
>
> Pradip,
>
> mutate() works on the entire column as a vector so that you find the maximum of the entire data set.
>
> I am almost certain there is some nice way to handle this, but the sapply() function is a standard approach.
>
> max() does not want a dataframe thus the use of unlist().
>
> Using your definition of data1:
>
> data3 <- data1
> data3$oidflag <- as.Date(sapply(seq_along(data3$id), function(row) {
>   if (all(is.na(unlist(data1[row, -1])))) {
>     max_d <- NA
>   } else {
>     max_d <- max(unlist(data1[row, -1]), na.rm = TRUE)
>   }
>   max_d}),
>   origin = "1970-01-01")
>
> data3
>   id    mrjdate    cocdate    inhdate    haldate    oidflag
> 1  1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2008-07-18
> 2  2       <NA>       <NA>       <NA>       <NA>       <NA>
> 3  3 2009-10-24       <NA> 2011-10-13       <NA> 2011-10-13
> 4  4 2007-10-10       <NA>       <NA>       <NA> 2007-10-10
> 5  5 2006-09-01 2005-08-10       <NA>       <NA> 2006-09-01
> 6  6 2007-09-04 2011-10-05       <NA>       <NA> 2011-10-05
> 7  7 2005-10-25       <NA>       <NA> 2011-11-04 2011-11-04
>
> R. Mark Sharp, Ph.D.
> Director of Primate Records Database
> Southwest National Primate Research Center
> Texas Biomedical Research Institute
> P.O. Box 760549
> San Antonio, TX 78245-0549
> Telephone: (210)258-9476
> e-mail: msharp at TxBiomed.org
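The attributes() output quoted above suggests why identical() returns FALSE even though the values match: identical() compares attributes as well as values, and in that output both the class vectors and the name of the sixth column (oldflag vs. oiddate) differ. The following is a minimal base R sketch of the class part only. The toy data frames a and b are illustrative and not from the thread, and the extra classes are set by hand to mimic what the dplyr pipeline produced, so dplyr itself is not needed to run it.

```r
# Two data frames holding the same values (illustrative toy data).
a <- data.frame(id = c("1", "2"),
                d  = as.Date(c("2004-11-04", NA)),
                stringsAsFactors = FALSE)
b <- a

# Mimic the class vector reported by attributes(new1) in the thread.
# This is an assumption standing in for the real rowwise() %>% mutate() result.
class(b) <- c("rowwise_df", "tbl_df", "tbl", "data.frame")

same_with_classes <- identical(a, b)    # classes differ, so this is FALSE

# Reset the class to a plain data.frame before comparing.
b_plain <- b
class(b_plain) <- "data.frame"
same_after_reset <- identical(a, b_plain)

same_with_classes
same_after_reset
```

With real dplyr output, converting with as.data.frame() before calling identical() is a common way to get the same effect, though the column names would also have to match.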
David Winsemius
2014-Dec-04 18:14 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
On Dec 3, 2014, at 7:43 PM, Muhuri, Pradip (SAMHSA/CBHSQ) wrote:

> Hello Chel and David,
>
> Thank you very much for providing new insights into this issue. Here is one more question. Why does the mutate () give incorrect results here?
>
> # The following gives INCORRECT results - mutated()ed object
> na.date.cases = ifelse(!is.na(oiddate),1,0)
>
> # The following gives CORRECT results
> new2$na.date.cases = ifelse(!is.na(new2$oiddate),1,0)
>
> # create a new column -dplyr solution (Acknowledgement: Arun)
>
> new1 <- example.data %>%
>      rowwise() %>%
>      mutate(oiddate=as.Date(max(mrjdate,cocdate, inhdate, haldate, na.rm=TRUE), origin='1970-01-01'),
>             na.date.cases = ifelse(!is.na(oiddate),1,0)
>             )

It would have been polite to include the warning printed to the console after this line of code. It seems to me that this highlights the fact that you used different logic in the two methods and got, therefore, different answers.

-- 
David.

David Winsemius
Alameda, CA, USA
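One likely reading of David's remark, offered as an assumption rather than a confirmed diagnosis of what dplyr did in 2014: max(..., na.rm=TRUE) over a row with no non-missing values returns -Inf and issues a warning, and -Inf is not NA, so a flag built with !is.na() comes out as 1 for that row. The sapply() version checks for an all-NA row first and returns NA explicitly. The base R sketch below uses plain doubles in place of Dates; row_all_na and row_max are illustrative names, not from the thread.

```r
# Day counts standing in for one row's four dates; all missing, as in row 2.
row_all_na <- c(NA_real_, NA_real_, NA_real_, NA_real_)

# Capture the warning that max() raises when nothing is left after na.rm.
warn_msg <- tryCatch(max(row_all_na, na.rm = TRUE),
                     warning = function(w) conditionMessage(w))

# Unguarded logic: the value itself is -Inf, which is.na() does not flag.
unguarded <- suppressWarnings(max(row_all_na, na.rm = TRUE))
flag_unguarded <- ifelse(!is.na(unguarded), 1, 0)

# Guarded logic, following the structure of the sapply() solution.
row_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}
guarded <- row_max(row_all_na)
flag_guarded <- ifelse(!is.na(guarded), 1, 0)

# Converting the guarded result keeps it missing.
guarded_date <- as.Date(guarded, origin = "1970-01-01")

c(unguarded = flag_unguarded, guarded = flag_guarded)
```

If this reading is right, the two approaches disagree only on rows where every date is missing, which matches the single all-NA row in the example data.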
Jeff Newmiller
2014-Dec-04 18:20 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
There is something weird going on with mutate's interaction with the scalar Date objects. It seems to be passing them to max as constants of mode double.

Regardless, use of rowwise should be very rare, and you are definitely abusing it. Learn to work with vectors of values rather than one value at a time.

new3 <- example.data %>%
  mutate( oiddate = pmax( mrjdate, cocdate, inhdate, haldate, na.rm=TRUE )
        , na.date.cases = as.numeric( !is.na( oiddate ) )
        )

You might find it more useful to not convert the result of is.na to numeric... logical indexing can use that more efficiently than testing which rows have na.date.cases==1.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On December 3, 2014 7:43:37 PM PST, "Muhuri, Pradip (SAMHSA/CBHSQ)" <Pradip.Muhuri at samhsa.hhs.gov> wrote:
>Hello Chel and David,
>
>Thank you very much for providing new insights into this issue. Here
>is one more question. Why does the mutate () give incorrect results
>here?
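Jeff's vectorized suggestion can also be tried without dplyr. The sketch below is a base R variant under stated assumptions: it rebuilds the thread's example data with read.table(text = ...), applies pmax() across the four date columns, and keeps the missingness flag as a logical vector for indexing. The names oiddate and has_date are chosen here for illustration; only pmax(), is.na() and the data come from the thread.

```r
# Example data copied from the thread.
temp <- "id mrjdate cocdate inhdate haldate
1 2004-11-04 2008-07-18 2005-07-07 2007-11-07
2 NA NA NA NA
3 2009-10-24 NA 2011-10-13 NA
4 2007-10-10 NA NA NA
5 2006-09-01 2005-08-10 NA NA
6 2007-09-04 2011-10-05 NA NA
7 2005-10-25 NA NA 2011-11-04"

example.data <- read.table(text = temp, header = TRUE,
                           colClasses = c("character", rep("Date", 4)))

# Element-wise (row-by-row) maximum across the four columns, in one call.
# Rows where every date is missing stay NA.
oiddate <- with(example.data,
                pmax(mrjdate, cocdate, inhdate, haldate, na.rm = TRUE))

# Keep the flag logical so it can index rows directly.
has_date <- !is.na(oiddate)

ids_with_date <- example.data$id[has_date]   # rows that have at least one date
n_with_date   <- sum(has_date)               # TRUE counts as 1

data.frame(id = example.data$id, oiddate = oiddate, has_date = has_date)
```

Keeping has_date logical means sum(), mean() and subsetting all work on it directly, which is the point of Jeff's closing remark about not converting to numeric.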
Muhuri, Pradip (SAMHSA/CBHSQ)
2014-Dec-04 19:32 UTC
[R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
Hello Jeff, Your code has given me desired results, and your advice is well taken. I agree with you regarding the use of logical indexing for testing conditions. Thank you so much for your time and advice. Pradip Pradip K. Muhuri SAMHSA/CBHSQ 1 Choke Cherry Road, Room 2-1071 Rockville, MD 20857 Tel: 240-276-1070 Fax: 240-276-1260 -----Original Message----- From: Jeff Newmiller [mailto:jdnewmil at dcn.davis.CA.us] Sent: Thursday, December 04, 2014 1:20 PM To: Muhuri, Pradip (SAMHSA/CBHSQ); r-help at r-project.org Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb) There is something weird going on with mutate's interaction with the scalar Date objects. It seems to be passing them to max as constants of mode double. Regardless, use of rowwise should be very rare, and you are definitely abusing it. Learn to work with vectors of values rather than one value at a time. new3 <- example.data %>% mutate( oiddate = pmax( mrjdate, cocdate, inhdate, haldate, na.rm=TRUE) , na.date.cases= as.numeric( !is.na( oiddate ) ) ) You might find it more useful to not convert the result of is.na to numeric... logical indexing can use that more efficiently than testing which rows have na.date.cases==1. --------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity. On December 3, 2014 7:43:37 PM PST, "Muhuri, Pradip (SAMHSA/CBHSQ)" <Pradip.Muhuri at samhsa.hhs.gov> wrote:>Hello Chel and David, > >Thank you very much for providing new insights into this issue. Here >is one more question. Why does the mutate () give incorrect results >here? 
> ># The following gives INCORRECT results - mutated()ed object >na.date.cases = ifelse(!is.na(oiddate),1,0) > ># The following gives CORRECT results >new2$na.date.cases = ifelse(!is.na(new2$oiddate),1,0) > >############################### reproducible example - slightly >revised/modified ############### >library(dplyr) ># data object - description > >temp <- "id mrjdate cocdate inhdate haldate >1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 >2 NA NA NA NA >3 2009-10-24 NA 2011-10-13 NA >4 2007-10-10 NA NA NA >5 2006-09-01 2005-08-10 NA NA >6 2007-09-04 2011-10-05 NA NA >7 2005-10-25 NA NA 2011-11-04" > ># read the data object > >example.data <- read.table(textConnection(temp), > colClasses=c("character", "Date", "Date", "Date", "Date"), > header=TRUE, as.is=TRUE > ) > > ># create a new column -dplyr solution (Acknowledgement: Arun) > >new1 <- example.data %>% > rowwise() %>% >mutate(oiddate=as.Date(max(mrjdate,cocdate, inhdate, haldate, >na.rm=TRUE), origin='1970-01-01'), > na.date.cases = ifelse(!is.na(oiddate),1,0) > ) > ># create a new column - Base R solution (Acknowlegement: Mark Sharp) > >new2 <- example.data >new2$oiddate <- as.Date(sapply(seq_along(new2$id), function(row) { if >(all(is.na(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', >'haldate')])))) { > max_d <- NA > } else { >max_d <- max(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', >'haldate')]), na.rm = TRUE) > } > max_d}), > origin = "1970-01-01") > >new2$na.date.cases = ifelse(!is.na(new2$oiddate),1,0) > > >identical(new1, new2) > >table(new1$oiddate) >table(new2$oiddate) > ># print records > >print (new1); print(new2) > >Pradip K. 
Muhuri, PhD >SAMHSA/CBHSQ >1 Choke Cherry Road, Room 2-1071 >Rockville, MD 20857 >Tel: 240-276-1070 >Fax: 240-276-1260 > >-----Original Message----- >From: Chel Hee Lee [mailto:chl948 at mail.usask.ca] >Sent: Wednesday, December 03, 2014 8:48 PM >To: Muhuri, Pradip (SAMHSA/CBHSQ); r-help at r-project.org >Subject: Re: [R] Getting the most recent dates in a new column from >dates in four columns using the dplyr package (mutate verb) > >The output in the object 'new1' are apparently same the output in the >object 'new2'. Are you trying to compare the entries of two outputs >'new1' and 'new2'? If so, the function 'all()' would be useful: > > > all(new1 == new2, na.rm=TRUE) >[1] TRUE > >If you are interested in the comparison of two objects in terms of >class, then the function 'identical()' is useful: > > > attributes(new1) >$names >[1] "id" "mrjdate" "cocdate" "inhdate" "haldate" "oldflag" > >$class >[1] "rowwise_df" "tbl_df" "tbl" "data.frame" > >$row.names >[1] 1 2 3 4 5 6 7 > > > attributes(new2) >$names >[1] "id" "mrjdate" "cocdate" "inhdate" "haldate" "oiddate" > >$row.names >[1] 1 2 3 4 5 6 7 > >$class >[1] "data.frame" > >I hope this helps. > >Chel Hee Lee > >On 12/03/2014 04:10 PM, Muhuri, Pradip (SAMHSA/CBHSQ) wrote: >> Hello, >> >> Two alternative approaches - mutate() vs. sapply() - were used to get >the desired results (i.e., creating a new column of the most recent >date from 4 dates ) with help from Arun and Mark on this forum. I now >find that the two data objects (created using two different approaches) >are not identical although results are exactly the same. >> >> identical(new1, new2) >> [1] FALSE >> >> Please see the reproducible example below. >> >> I don't understand why the code returns FALSE here. Any >hints/comments will be appreciated. 
>>
>> Thanks,
>>
>> Pradip
>>
>> ############################################# reproducible example ########################################
>> library(dplyr)
>> # data object - description
>>
>> temp <- "id mrjdate cocdate inhdate haldate
>> 1 2004-11-04 2008-07-18 2005-07-07 2007-11-07
>> 2 NA NA NA NA
>> 3 2009-10-24 NA 2011-10-13 NA
>> 4 2007-10-10 NA NA NA
>> 5 2006-09-01 2005-08-10 NA NA
>> 6 2007-09-04 2011-10-05 NA NA
>> 7 2005-10-25 NA NA 2011-11-04"
>>
>> # read the data object
>>
>> example.data <- read.table(textConnection(temp),
>>                     colClasses=c("character", "Date", "Date", "Date", "Date"),
>>                     header=TRUE, as.is=TRUE
>>                     )
>>
>>
>> # create a new column - dplyr solution (Acknowledgement: Arun)
>>
>> new1 <- example.data %>%
>>   rowwise() %>%
>>   mutate(oldflag=as.Date(max(mrjdate,cocdate, inhdate, haldate, na.rm=TRUE), origin='1970-01-01'))
>>
>> # create a new column - Base R solution (Acknowledgement: Mark Sharp)
>>
>> new2 <- example.data
>> new2$oiddate <- as.Date(sapply(seq_along(new2$id), function(row) {
>>   if (all(is.na(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')])))) {
>>     max_d <- NA
>>   } else {
>>     max_d <- max(unlist(example.data[row, c('mrjdate','cocdate', 'inhdate', 'haldate')]), na.rm = TRUE)
>>   }
>>   max_d}),
>>   origin = "1970-01-01")
>>
>> identical(new1, new2)
>>
>> # print records
>>
>> print (new1); print(new2)
>>
>> Pradip K. Muhuri
>> SAMHSA/CBHSQ
>> 1 Choke Cherry Road, Room 2-1071
>> Rockville, MD 20857
>> Tel: 240-276-1070
>> Fax: 240-276-1260
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Muhuri, Pradip (SAMHSA/CBHSQ)
>> Sent: Sunday, November 09, 2014 6:11 AM
>> To: 'Mark Sharp'
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
>>
>> Hi Mark,
>>
>> Your code has also given me the results I expected.
Thank you so much for your help.
>>
>> Regards,
>>
>> Pradip
>>
>> Pradip K. Muhuri, PhD
>> SAMHSA/CBHSQ
>> 1 Choke Cherry Road, Room 2-1071
>> Rockville, MD 20857
>> Tel: 240-276-1070
>> Fax: 240-276-1260
>>
>>
>> -----Original Message-----
>> From: Mark Sharp [mailto:msharp at TxBiomed.org]
>> Sent: Sunday, November 09, 2014 3:01 AM
>> To: Muhuri, Pradip (SAMHSA/CBHSQ)
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Getting the most recent dates in a new column from dates in four columns using the dplyr package (mutate verb)
>>
>> Pradip,
>>
>> mutate() works on each column as a whole vector, so max() returns the maximum of the entire data set rather than one value per row.
>>
>> I am almost certain there is some nice way to handle this, but the sapply() function is a standard approach.
>>
>> max() does not want a data frame, hence the use of unlist().
>>
>> Using your definition of data1:
>>
>> data3 <- data1
>> data3$oidflag <- as.Date(sapply(seq_along(data3$id), function(row) {
>>   if (all(is.na(unlist(data1[row, -1])))) {
>>     max_d <- NA
>>   } else {
>>     max_d <- max(unlist(data1[row, -1]), na.rm = TRUE)
>>   }
>>   max_d}),
>>   origin = "1970-01-01")
>>
>> data3
>>   id    mrjdate    cocdate    inhdate    haldate    oidflag
>> 1  1 2004-11-04 2008-07-18 2005-07-07 2007-11-07 2008-07-18
>> 2  2       <NA>       <NA>       <NA>       <NA>       <NA>
>> 3  3 2009-10-24       <NA> 2011-10-13       <NA> 2011-10-13
>> 4  4 2007-10-10       <NA>       <NA>       <NA> 2007-10-10
>> 5  5 2006-09-01 2005-08-10       <NA>       <NA> 2006-09-01
>> 6  6 2007-09-04 2011-10-05       <NA>       <NA> 2011-10-05
>> 7  7 2005-10-25       <NA>       <NA> 2011-11-04 2011-11-04
>>
>>
>>
>> R. Mark Sharp, Ph.D.
>> Director of Primate Records Database
>> Southwest National Primate Research Center
>> Texas Biomedical Research Institute
>> P.O. Box 760549
>> San Antonio, TX 78245-0549
>> Telephone: (210)258-9476
>> e-mail: msharp at TxBiomed.org
>>
>>
>> NOTICE: This E-Mail (including attachments) is confidential and may be legally privileged. It is covered by the Electronic Communications Privacy Act, 18 U.S.C.2510-2521.
>> If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution or copying of this communication is strictly prohibited. Please reply to the sender that you have received this message in error, then delete it.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
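
A possible explanation for the differing na.date.cases flag asked about at the top of this thread, offered as a sketch rather than a confirmed diagnosis: in base R, max() with na.rm=TRUE on input that is entirely NA returns -Inf (with a warning), and a Date holding -Inf typically prints as NA even though is.na() returns FALSE for it. That would make the flag 1 for id 2 in the rowwise mutate() version, while the sapply() version assigns NA explicitly. The helper name row_max_date below is hypothetical, and pmax() is an alternative that nobody in the thread used.

```r
# All four dates missing, as for id 2 in the example data.
all_na <- as.Date(c(NA, NA, NA, NA))

# max() drops the NAs, is left with nothing, and returns -Inf (with a warning).
m <- suppressWarnings(max(all_na, na.rm = TRUE))
is.na(m)      # FALSE: the underlying value is -Inf, not NA
is.finite(m)  # FALSE

# Hypothetical helper: return a genuine NA Date when every input is missing.
row_max_date <- function(...) {
  d <- c(...)
  if (all(is.na(d))) as.Date(NA) else max(d, na.rm = TRUE)
}
safe <- row_max_date(all_na)

# Alternative not used in the thread: pmax() compares element by element,
# so it works row by row without rowwise() or sapply().
a <- as.Date(c("2004-11-04", NA, "2009-10-24"))
b <- as.Date(c("2008-07-18", NA, NA))
newest <- pmax(a, b, na.rm = TRUE)
flag <- ifelse(!is.na(newest), 1, 0)
```

Inside the rowwise mutate() call, the helper would stand in for the bare max() call, assuming the four date columns reach it as Date values; checking is.finite(oiddate) instead of !is.na(oiddate) is another way to build the flag.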