thr3ads.net - R help - [R] get latest dates for different people in a dataset [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Göran Broström

2015-Jan-25 09:01 UTC

[R] get latest dates for different people in a dataset

On 2015-01-24 01:14, William Dunlap wrote:> Here is one way.  Sort the data.frame, first by Name then break ties with
> CheckInDate.
> Then choose the rows that are the last in a run of identical Name values.
I do it by sorting by the reverse order of CheckinDate (last date first) 
within Name, then

 > dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]

I guess it is faster, but who knows?

G?ran
>
>> txt <- "Name    CheckInDate      Temp
> + John      1/3/2014              97
> + Mary     1/3/2014              98.1
> + Sam       1/4/2014              97.5
> + John      1/4/2014              99"
>> d <- read.table(header=TRUE,
>
colClasses=c("character","character","numeric"),
text=txt)
>> d$CheckInDate <- as.Date(d$CheckInDate, as.Date,
format="%d/%m/%Y")
>> isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>> dSorted <- d[order(d$Name, d$CheckInDate), ]
>> dLatestVisit <- dSorted[isEndOfRun(dSorted$Name), ]
>> dLatestVisit
>    Name CheckInDate Temp
> 4 John  2014-04-01 99.0
> 2 Mary  2014-03-01 98.1
> 3  Sam  2014-04-01 97.5
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <RTan at panagora.com>
wrote:
>
>> Hi,
>>
>> Can someone help for a R question?
>>
>> I have a data set like:
>>
>> Name    CheckInDate      Temp
>> John      1/3/2014              97
>> Mary     1/3/2014              98.1
>> Sam       1/4/2014              97.5
>> John      1/4/2014              99
>>
>> I'd like to return a dataset that for each Name, get the row that
is the
>> latest CheckInDate for that person.  For the example above it would be
>>
>> Name    CheckInDate      Temp
>> John      1/4/2014              99
>> Mary     1/3/2014              98.1
>> Sam       1/4/2014              97.5
>>
>>
>> Thank you for your help!
>>
>> Richard
>>
>>
>>          [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

William Dunlap

2015-Jan-25 19:27 UTC

head link

[R] get latest dates for different people in a dataset

>> dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]
>
>I guess it is faster, but who knows?
You can find out by making a function that generates datasets of
various sizes and timing the suggested algorithms.  E.g.,
makeData <-
function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){
    nrow <- trunc(nPatients * aveVisitsPerPatient)
    patientNames <- paste0("P",seq_len(nPatients))
    possibleDates <- as.Date(16001:17000,
origin=as.Date("1970-01-01"))
    possibleTemps <- seq(97, 103, by=0.1)
    data <- data.frame(Name=sample(patientNames, replace=TRUE, size=nrow),
               CheckInDate=sample(possibleDates, replace=TRUE, size=nrow),
               Temp=sample(possibleTemps, replace=TRUE, size=nrow))
    if (uniqueNameDate) {
        data <- data[!duplicated(data[, c("Name",
"CheckInDate")]), ]
    }
    data
}
funs <- list(
    f1 = function(data) {
        do.call(rbind, lapply(split(data, data$Name), function(x)
x[order(x$CheckInDate),][nrow(x),]))
    }, f2 = function (d)
    {
        isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
        dSorted <- d[order(d$Name, d$CheckInDate), ]
        dSorted[isEndOfRun(dSorted$Name), ]
    }, f3 = function (d)
    {
        # is the following how you did reverse sort on date (& fwd on name)?
        #  Too bad that order's decreasing arg is not vectorized
        dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
        dSorted[!duplicated(dSorted$Name), ]
    }, f4 = function(dta)
    {
        dta %>% group_by(Name)  %>% filter(CheckInDate==max(CheckInDate))
    })

D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000 visits
library(dplyr)
Z <- lapply(funs, function(fun){
    time <- system.time( result <- fun(D) ) ; list(time=time,
result=result) })

sapply(Z, function(x)x$time)
#               f1   f2   f3   f4
#user.self  461.25 0.47 0.36 3.01
#sys.self     1.20 0.00 0.00 0.01
#elapsed    472.33 0.47 0.39 3.03
#user.child     NA   NA   NA   NA
#sys.child      NA   NA   NA   NA

# duplicated is a bit better than diff, dplyr rather slower, rbind much
slower.

equivResults <- function(a, b) {
   # results have different classes and different orders, so only check
size and contents
    identical(dim(a),dim(b)) &&
all(a[order(a$Name),]==b[order(b$Name),])
}
sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result))
#  f2   f3   f4
#TRUE TRUE TRUE

Note that the various functions give different results if any patient comes
in twice on the same day.  f4 includes both visits in the ouput, the other
include either the first or last (as ordered in the original file).

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sun, Jan 25, 2015 at 1:01 AM, G?ran Brostr?m <goran.brostrom at umu.se>
wrote:
> On 2015-01-24 01:14, William Dunlap wrote:
>
>> Here is one way.  Sort the data.frame, first by Name then break ties
with
>> CheckInDate.
>> Then choose the rows that are the last in a run of identical Name
values.
>>
>
> I do it by sorting by the reverse order of CheckinDate (last date first)
> within Name, then
>
> > dLatestVisit <- dSorted[!duplicated(dSorted$Name), ]
>
> I guess it is faster, but who knows?
>
> G?ran
>
>
>>  txt <- "Name    CheckInDate      Temp
>>>
>> + John      1/3/2014              97
>> + Mary     1/3/2014              98.1
>> + Sam       1/4/2014              97.5
>> + John      1/4/2014              99"
>>
>>> d <- read.table(header=TRUE,
>>>
>>
colClasses=c("character","character","numeric"),
text=txt)
>>
>>> d$CheckInDate <- as.Date(d$CheckInDate, as.Date,
format="%d/%m/%Y")
>>> isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>>> dSorted <- d[order(d$Name, d$CheckInDate), ]
>>> dLatestVisit <- dSorted[isEndOfRun(dSorted$Name), ]
>>> dLatestVisit
>>>
>>    Name CheckInDate Temp
>> 4 John  2014-04-01 99.0
>> 2 Mary  2014-03-01 98.1
>> 3  Sam  2014-04-01 97.5
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>>
>> On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <RTan at
panagora.com> wrote:
>>
>>  Hi,
>>>
>>> Can someone help for a R question?
>>>
>>> I have a data set like:
>>>
>>> Name    CheckInDate      Temp
>>> John      1/3/2014              97
>>> Mary     1/3/2014              98.1
>>> Sam       1/4/2014              97.5
>>> John      1/4/2014              99
>>>
>>> I'd like to return a dataset that for each Name, get the row
that is the
>>> latest CheckInDate for that person.  For the example above it would
be
>>>
>>> Name    CheckInDate      Temp
>>> John      1/4/2014              99
>>> Mary     1/3/2014              98.1
>>> Sam       1/4/2014              97.5
>>>
>>>
>>> Thank you for your help!
>>>
>>> Richard
>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/
>> posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Göran Broström

2015-Jan-25 20:38 UTC

head link

[R] get latest dates for different people in a dataset

See inline;

On 2015-01-25 20:27, William Dunlap wrote:>  >> dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
>  >
>  >I guess it is faster, but who knows?
>
> You can find out by making a function that generates datasets of
> various sizes and timing the suggested algorithms.  E.g.,
> makeData <-
> function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){
>      nrow <- trunc(nPatients * aveVisitsPerPatient)
>      patientNames <- paste0("P",seq_len(nPatients))
>      possibleDates <- as.Date(16001:17000,
origin=as.Date("1970-01-01"))
>      possibleTemps <- seq(97, 103, by=0.1)
>      data <- data.frame(Name=sample(patientNames, replace=TRUE,
size=nrow),
>                 CheckInDate=sample(possibleDates, replace=TRUE, size=nrow),
>                 Temp=sample(possibleTemps, replace=TRUE, size=nrow))
>      if (uniqueNameDate) {
>          data <- data[!duplicated(data[, c("Name",
"CheckInDate")]), ]
>      }
>      data
> }
> funs <- list(
>      f1 = function(data) {
>          do.call(rbind, lapply(split(data, data$Name), function(x)
> x[order(x$CheckInDate),][nrow(x),]))
>      }, f2 = function (d)
>      {
>          isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>          dSorted <- d[order(d$Name, d$CheckInDate), ]
>          dSorted[isEndOfRun(dSorted$Name), ]
>      }, f3 = function (d)
>      {
>          # is the following how you did reverse sort on date (& fwd on
> name)?
Yes; in fact I do this all the time in my applications (survival 
analysis), where I have several records for each individual.

G?ran
>          #  Too bad that order's decreasing arg is not vectorized
>          dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
>          dSorted[!duplicated(dSorted$Name), ]
>      }, f4 = function(dta)
>      {
>          dta %>% group_by(Name)  %>%
filter(CheckInDate==max(CheckInDate))
>      })
>
> D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000
visits
> library(dplyr)
> Z <- lapply(funs, function(fun){
>      time <- system.time( result <- fun(D) ) ; list(time=time,
> result=result) })
>
> sapply(Z, function(x)x$time)
> #               f1   f2   f3   f4
> #user.self  461.25 0.47 0.36 3.01
> #sys.self     1.20 0.00 0.00 0.01
> #elapsed    472.33 0.47 0.39 3.03
> #user.child     NA   NA   NA   NA
> #sys.child      NA   NA   NA   NA
>
> # duplicated is a bit better than diff, dplyr rather slower, rbind much
> slower.
>
> equivResults <- function(a, b) {
>     # results have different classes and different orders, so only check
> size and contents
>      identical(dim(a),dim(b)) &&
all(a[order(a$Name),]==b[order(b$Name),])
> }
> sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result))
> #  f2   f3   f4
> #TRUE TRUE TRUE
>
> Note that the various functions give different results if any patient comes
> in twice on the same day.  f4 includes both visits in the ouput, the other
> include either the first or last (as ordered in the original file).
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com <http://tibco.com>
>
> On Sun, Jan 25, 2015 at 1:01 AM, G?ran Brostr?m <goran.brostrom at
umu.se
> <mailto:goran.brostrom at umu.se>> wrote:
>
>     On 2015-01-24 01:14, William Dunlap wrote:
>
>         Here is one way.  Sort the data.frame, first by Name then break
>         ties with
>         CheckInDate.
>         Then choose the rows that are the last in a run of identical
>         Name values.
>
>
>     I do it by sorting by the reverse order of CheckinDate (last date
>     first) within Name, then
>
>      > dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
>
>     I guess it is faster, but who knows?
>
>     G?ran
>
>
>             txt <- "Name    CheckInDate      Temp
>
>         + John      1/3/2014              97
>         + Mary     1/3/2014              98.1
>         + Sam       1/4/2014              97.5
>         + John      1/4/2014              99"
>
>             d <- read.table(header=TRUE,
>
>        
colClasses=c("character","__character","numeric"),
text=txt)
>
>             d$CheckInDate <- as.Date(d$CheckInDate, as.Date,
>             format="%d/%m/%Y")
>             isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>             dSorted <- d[order(d$Name, d$CheckInDate), ]
>             dLatestVisit <- dSorted[isEndOfRun(dSorted$__Name), ]
>             dLatestVisit
>
>             Name CheckInDate Temp
>         4 John 2014-04-01 99 <tel:2014-04-01%2099>.0
>         2 Mary 2014-03-01 98 <tel:2014-03-01%2098>.1
>         3  Sam 2014-04-01 97 <tel:2014-04-01%2097>.5
>
>
>         Bill Dunlap
>         TIBCO Software
>         wdunlap tibco.com <http://tibco.com>
>
>
>         On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <RTan at
panagora.com
>         <mailto:RTan at panagora.com>> wrote:
>
>             Hi,
>
>             Can someone help for a R question?
>
>             I have a data set like:
>
>             Name    CheckInDate      Temp
>             John      1/3/2014              97
>             Mary     1/3/2014              98.1
>             Sam       1/4/2014              97.5
>             John      1/4/2014              99
>
>             I'd like to return a dataset that for each Name, get the
row
>             that is the
>             latest CheckInDate for that person.  For the example above
>             it would be
>
>             Name    CheckInDate      Temp
>             John      1/4/2014              99
>             Mary     1/3/2014              98.1
>             Sam       1/4/2014              97.5
>
>
>             Thank you for your help!
>
>             Richard
>
>
>                       [[alternative HTML version deleted]]
>
>             ________________________________________________
>             R-help at r-project.org <mailto:R-help at r-project.org>
mailing
>             list -- To UNSUBSCRIBE and more, see
>             https://stat.ethz.ch/mailman/__listinfo/r-help
>             <https://stat.ethz.ch/mailman/listinfo/r-help>
>             PLEASE do read the posting guide
>             http://www.R-project.org/__posting-guide.html
>             <http://www.R-project.org/posting-guide.html>
>             and provide commented, minimal, self-contained, reproducible
>             code.
>
>
>                  [[alternative HTML version deleted]]
>
>         ________________________________________________
>         R-help at r-project.org <mailto:R-help at r-project.org>
mailing list
>         -- To UNSUBSCRIBE and more, see
>         https://stat.ethz.ch/mailman/__listinfo/r-help
>         <https://stat.ethz.ch/mailman/listinfo/r-help>
>         PLEASE do read the posting guide
>         http://www.R-project.org/__posting-guide.html
>         <http://www.R-project.org/posting-guide.html>
>         and provide commented, minimal, self-contained, reproducible code.
>
>
>     ________________________________________________
>     R-help at r-project.org <mailto:R-help at r-project.org> mailing
list --
>     To UNSUBSCRIBE and more, see
>     https://stat.ethz.ch/mailman/__listinfo/r-help
>     <https://stat.ethz.ch/mailman/listinfo/r-help>
>     PLEASE do read the posting guide
>     http://www.R-project.org/__posting-guide.html
>     <http://www.R-project.org/posting-guide.html>
>     and provide commented, minimal, self-contained, reproducible code.
>
>

R help - Jan 2015 - get latest dates for different people in a dataset

[R] get latest dates for different people in a dataset

[R] get latest dates for different people in a dataset

[R] get latest dates for different people in a dataset