Andrew Ziem
2009-Feb-27 19:02 UTC
[R] Optimize for loop / find last record for each person
I want to find the last record for each person_id in a data frame (from a SQL database) ordered by date. Is there a better way than this for loop? for (i in 2:length(history[,1])) { if (history[i, "person_id"] == history[i - 1, "person_id"]) history[i, "order"] = history[i - 1, "order"] + 1 # same person else history[i, "order"] = 1 # new person } # ignore all records except the last for each con_id history2 <- subset(history, order == 1) Andrew
Jorge Ivan Velez
2009-Feb-27 19:27 UTC
[R] Optimize for loop / find last record for each person
Dear Andrew, Here is one way: # Some data set.seed(1) mydata<-data.frame(person_id=rep(1:10,10),x=rnorm(100)) mydata # last register for person_id with(mydata,tapply(x,person_id,function(x) tail(x,1))) # test for person_id=1 mydata[mydata$person_id==1,] # see the last number in column 2 HTH, Jorge On Fri, Feb 27, 2009 at 2:02 PM, Andrew Ziem <ahz001@gmail.com> wrote:> I want to find the last record for each person_id in a data frame > (from a SQL database) ordered by date. Is there a better way than > this for loop? > > for (i in 2:length(history[,1])) { > if (history[i, "person_id"] == history[i - 1, "person_id"]) > history[i, "order"] = history[i - 1, "order"] + 1 # same person > else > history[i, "order"] = 1 # new person > } > > # ignore all records except the last for each con_id > history2 <- subset(history, order == 1) > > > Andrew > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
David Winsemius
2009-Feb-27 20:10 UTC
[R] Optimize for loop / find last record for each person
"...from an SQL database." How? Structure of the result? You say "ordered by date" but then you don't reference any date variable? And your code creates an "order" column, but that would not appear necessary for the stated purpose and you don't output the last "order" within a "person_id". See if the tapply strategy below gives you alternate approaches (which should work for the typical default numbering of data.frame rows): DF <-structure(list(Month = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("Aug", "July", "Sept"), class = "factor"), Week = 27:39, Estpassage = c(665L, 2232L, 9241L, 28464L, 41049L, 82216L, 230411L, 358541L, 747839L, 459682L, 609567L, 979475L, 837189L), MedFL = c(34L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 36L, 36L, 36L, 36L)), .Names = c("Month", "Week", "Estpassage", "MedFL"), class = "data.frame", row.names = c(NA, -13L)) > DF Month Week Estpassage MedFL 1 July 27 665 34 2 July 28 2232 35 3 July 29 9241 35 4 July 30 28464 35 5 Aug 31 41049 35 6 Aug 32 82216 35 7 Aug 33 230411 35 8 Aug 34 358541 35 9 Sept 35 747839 35 10 Sept 36 459682 36 11 Sept 37 609567 36 12 Sept 38 979475 36 13 Sept 39 837189 36 tapply(as.numeric(rownames(DF)), DF$Month, max) # substitute history for DF, and history$person_id for DF$Month Aug July Sept 8 4 13 You need the as.numeric around the rownames to prevent the alpha interpretation of "maximum" from being used. The by() and aggregate() functions are convenience functions using tapply, but I have more success with tapply itself. I rather wonder what use this is to be put to, and whether there might be more efficient overall approach. It is not that typical to need the location of the last member of a group within a dataframe. -- David On Feb 27, 2009, at 2:02 PM, Andrew Ziem wrote:> I want to find the last record for each person_id in a data frame > (from a SQL database) ordered by date. Is there a better way than > this for loop? > > for (i in 2:length(history[,1])) { > if (history[i, "person_id"] == history[i - 1, "person_id"]) > history[i, "order"] = history[i - 1, "order"] + 1 # same person > else > history[i, "order"] = 1 # new person > } > > # ignore all records except the last for each con_id > history2 <- subset(history, order == 1) > > > Andrew > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Jorge Ivan Velez
2009-Feb-27 20:25 UTC
[R] Optimize for loop / find last record for each person
Dear Andrew: Here is another way assuming you have an order column in your history data as well as a person_id. Again, your variable of interest is x: # Some data set.seed(1) history<-data.frame( person_id=rep(1:10,each=10), record=rep(sample(10),10), x=rnorm(100) ) history # Splitting the data by person_id spl<-with(history,split(mydata,person_id)) # splitting by person_id spl # Selecting the last record of x by person_d and reporting # the person_id and the x value -- you can change this to your convenience :) do.call(rbind,lapply(spl,function(x) x[which.max(x[,2]),c(1,3)])) HTH, Jorge On Fri, Feb 27, 2009 at 2:02 PM, Andrew Ziem <ahz001@gmail.com> wrote:> I want to find the last record for each person_id in a data frame > (from a SQL database) ordered by date. Is there a better way than > this for loop? > > for (i in 2:length(history[,1])) { > if (history[i, "person_id"] == history[i - 1, "person_id"]) > history[i, "order"] = history[i - 1, "order"] + 1 # same person > else > history[i, "order"] = 1 # new person > } > > # ignore all records except the last for each con_id > history2 <- subset(history, order == 1) > > > Andrew > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
William Dunlap
2009-Feb-27 21:10 UTC
[R] Optimize for loop / find last record for each person
Andrew, it makes it easier to help if you supply a typical input and expected output along with your code. I tried your code with the following input:> historyperson_id date 1 Mary 1 2 Mary 2 3 Sue 3 4 Alex 4 5 Joe 5 6 Alex 6 7 Alex 7 8 Sue 8 9 Sue 9 10 Joe 10 made with the function f<-function(n){ cached.rs <- .Random.seed on.exit(.Random.seed<<-cached.rs) set.seed(1) data.frame(person_id=sample(c("Joe","Mary","Sue","Alex"), size=n,replace=TRUE), date=seq_len(n)) } and it failed because there was no column called 'order'. The following function, f2, does what I think you are saying you want. It sorts the data by person_id, breaking ties with date, and then selects the rows where the person_id entry does not match the person_id entry in the next row. It then sorts the result by date. (I don't know if the last sort it needed in your application.) It should be pretty quick for long datasets with lots of distinct person_id values. f2 <-function (history) { # assume history has, at least, columns called "person_id" and "date" # Return rows containing the last entry (by date) for each person. last <- function(x) c(x[-1]!=x[-length(x)], TRUE) history <- history[with(history, order(person_id,date)),,drop=FALSE] history <- history[last(history[,"person_id"]),,drop=FALSE] history[order(history$date),,drop=FALSE] }> f2(history)person_id date 2 Mary 2 7 Alex 7 9 Sue 9 10 Joe 10 Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com --------------------------------------------------------------- [R] Optimize for loop / find last record for each person Andrew Ziem ahz001 at gmail.com Fri Feb 27 20:02:31 CET 2009 I want to find the last record for each person_id in a data frame (from a SQL database) ordered by date. Is there a better way than this for loop? for (i in 2:length(history[,1])) { if (history[i, "person_id"] == history[i - 1, "person_id"]) history[i, "order"] = history[i - 1, "order"] + 1 # same person else history[i, "order"] = 1 # new person } # ignore all records except the last for each con_id history2 <- subset(history, order == 1) Andrew