Andrew Ziem
2009-Feb-27  19:02 UTC
[R] Optimize for loop / find last record for each person
I want to find the last record for each person_id in a data frame
(from a SQL database) ordered by date.  Is there a better way than
this for loop?
for (i in 2:length(history[,1])) {
    if (history[i, "person_id"] == history[i - 1,
"person_id"])
      history[i, "order"] = history[i - 1, "order"] + 1 #
same person
    else
      history[i, "order"] = 1 # new person
}
# ignore all records except the last for each con_id
history2 <- subset(history, order == 1)
Andrew
Jorge Ivan Velez
2009-Feb-27  19:27 UTC
[R] Optimize for loop / find last record for each person
Dear Andrew, Here is one way: # Some data set.seed(1) mydata<-data.frame(person_id=rep(1:10,10),x=rnorm(100)) mydata # last register for person_id with(mydata,tapply(x,person_id,function(x) tail(x,1))) # test for person_id=1 mydata[mydata$person_id==1,] # see the last number in column 2 HTH, Jorge On Fri, Feb 27, 2009 at 2:02 PM, Andrew Ziem <ahz001@gmail.com> wrote:> I want to find the last record for each person_id in a data frame > (from a SQL database) ordered by date. Is there a better way than > this for loop? > > for (i in 2:length(history[,1])) { > if (history[i, "person_id"] == history[i - 1, "person_id"]) > history[i, "order"] = history[i - 1, "order"] + 1 # same person > else > history[i, "order"] = 1 # new person > } > > # ignore all records except the last for each con_id > history2 <- subset(history, order == 1) > > > Andrew > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
David Winsemius
2009-Feb-27  20:10 UTC
[R] Optimize for loop / find last record for each person
"...from an SQL database."    How? Structure of the result?
You say "ordered by date" but then you don't reference any date  
variable? And your code creates an "order" column, but that would not
appear necessary for the stated purpose and you don't output the last  
"order" within a "person_id".
See if the tapply  strategy below gives you alternate approaches  
(which should work for the typical default numbering of data.frame  
rows):
DF <-structure(list(Month = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 3L, 3L), .Label = c("Aug", "July",
"Sept"), class =
"factor"),
     Week = 27:39, Estpassage = c(665L, 2232L, 9241L, 28464L,
     41049L, 82216L, 230411L, 358541L, 747839L, 459682L, 609567L,
     979475L, 837189L), MedFL = c(34L, 35L, 35L, 35L, 35L, 35L,
     35L, 35L, 35L, 36L, 36L, 36L, 36L)), .Names = c("Month",
"Week", "Estpassage", "MedFL"), class =
"data.frame", row.names = c(NA,
-13L))
 > DF
    Month Week Estpassage MedFL
1   July   27        665    34
2   July   28       2232    35
3   July   29       9241    35
4   July   30      28464    35
5    Aug   31      41049    35
6    Aug   32      82216    35
7    Aug   33     230411    35
8    Aug   34     358541    35
9   Sept   35     747839    35
10  Sept   36     459682    36
11  Sept   37     609567    36
12  Sept   38     979475    36
13  Sept   39     837189    36
tapply(as.numeric(rownames(DF)), DF$Month, max)
# substitute history for DF,  and history$person_id  for DF$Month
  Aug July Sept
    8    4   13
You need the as.numeric around the rownames to prevent the alpha  
interpretation of "maximum" from being used. The by() and aggregate()
functions are convenience functions using tapply, but I have more  
success with tapply itself.
I rather wonder what use this is to be put to, and whether there might  
be more efficient overall approach. It is not that typical to need the  
location of the last member of a group within a dataframe.
-- 
David
On Feb 27, 2009, at 2:02 PM, Andrew Ziem wrote:
> I want to find the last record for each person_id in a data frame
> (from a SQL database) ordered by date.  Is there a better way than
> this for loop?
>
> for (i in 2:length(history[,1])) {
>    if (history[i, "person_id"] == history[i - 1,
"person_id"])
>      history[i, "order"] = history[i - 1, "order"] + 1
# same person
>    else
>      history[i, "order"] = 1 # new person
> }
>
> # ignore all records except the last for each con_id
> history2 <- subset(history, order == 1)
>
>
> Andrew
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Jorge Ivan Velez
2009-Feb-27  20:25 UTC
[R] Optimize for loop / find last record for each person
Dear Andrew:
Here is another way assuming you have an order column in your history data
as well as a person_id. Again, your variable
of interest is x:
# Some data
set.seed(1)
history<-data.frame(
         person_id=rep(1:10,each=10),
         record=rep(sample(10),10),
         x=rnorm(100)
        )
history
# Splitting the data by person_id
spl<-with(history,split(mydata,person_id)) # splitting by person_id
spl
# Selecting the last record of x by person_d and reporting
# the person_id and the x value  --  you can change this to your convenience
:)
do.call(rbind,lapply(spl,function(x) x[which.max(x[,2]),c(1,3)]))
HTH,
Jorge
On Fri, Feb 27, 2009 at 2:02 PM, Andrew Ziem <ahz001@gmail.com> wrote:
> I want to find the last record for each person_id in a data frame
> (from a SQL database) ordered by date.  Is there a better way than
> this for loop?
>
> for (i in 2:length(history[,1])) {
>    if (history[i, "person_id"] == history[i - 1,
"person_id"])
>      history[i, "order"] = history[i - 1, "order"] + 1
# same person
>    else
>      history[i, "order"] = 1 # new person
> }
>
> # ignore all records except the last for each con_id
> history2 <- subset(history, order == 1)
>
>
> Andrew
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]
William Dunlap
2009-Feb-27  21:10 UTC
[R] Optimize for loop / find last record for each person
Andrew, it makes it easier to help if you supply a typical input and expected output along with your code. I tried your code with the following input:> historyperson_id date 1 Mary 1 2 Mary 2 3 Sue 3 4 Alex 4 5 Joe 5 6 Alex 6 7 Alex 7 8 Sue 8 9 Sue 9 10 Joe 10 made with the function f<-function(n){ cached.rs <- .Random.seed on.exit(.Random.seed<<-cached.rs) set.seed(1) data.frame(person_id=sample(c("Joe","Mary","Sue","Alex"), size=n,replace=TRUE), date=seq_len(n)) } and it failed because there was no column called 'order'. The following function, f2, does what I think you are saying you want. It sorts the data by person_id, breaking ties with date, and then selects the rows where the person_id entry does not match the person_id entry in the next row. It then sorts the result by date. (I don't know if the last sort it needed in your application.) It should be pretty quick for long datasets with lots of distinct person_id values. f2 <-function (history) { # assume history has, at least, columns called "person_id" and "date" # Return rows containing the last entry (by date) for each person. last <- function(x) c(x[-1]!=x[-length(x)], TRUE) history <- history[with(history, order(person_id,date)),,drop=FALSE] history <- history[last(history[,"person_id"]),,drop=FALSE] history[order(history$date),,drop=FALSE] }> f2(history)person_id date 2 Mary 2 7 Alex 7 9 Sue 9 10 Joe 10 Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com --------------------------------------------------------------- [R] Optimize for loop / find last record for each person Andrew Ziem ahz001 at gmail.com Fri Feb 27 20:02:31 CET 2009 I want to find the last record for each person_id in a data frame (from a SQL database) ordered by date. Is there a better way than this for loop? for (i in 2:length(history[,1])) { if (history[i, "person_id"] == history[i - 1, "person_id"]) history[i, "order"] = history[i - 1, "order"] + 1 # same person else history[i, "order"] = 1 # new person } # ignore all records except the last for each con_id history2 <- subset(history, order == 1) Andrew